September 09, 2011

Simplifying our JPEG2000 conversion workflow

Over the summer, we have been working to streamline our JPEG 2000 conversion workflow. With the help of software developers from Genisys - one of the Trust’s strategic IT development and support partners - we have put the LuraWave command line interface to use in automating batch conversion.

Up to now we have been using the native GUI that comes with the LuraWave software, manually entering parameters and initiating the conversion process for each batch of images. This was useful for us as we settled into a large-scale digitisation workflow incorporating RAW - TIFF - JP2 conversion, cleared our backlog and established our compression testing methodology (as described in previous posts on this blog). With no relevant in-house programming expertise, the GUI was essential during these early stages.

Now that we have a firm idea of how we want to use LuraWave, where it fits into the overall workflow, and what kind of throughput we need on a day-to-day basis, it is time to set up an automated solution.

The Wellcome Trust operates in an (almost) entirely Windows environment, so we commissioned the Genisys software engineers to code a .NET wrapper script running as an executable.  The wrapper script invokes LuraWave’s command line conversion to allow us to convert images with no manual intervention. An XML configuration file that contains the following information is used to control how the wrapper script invokes LuraWave:
  • "Inbox" directory (files ready for conversion)
  • Temporary directory (files copied before conversion)
  • "Outbox" directory (converted files)
  • LuraWave command line
  • Error directory
  • List of any files to exclude from conversion
LuraWave retains the original folder structure, so the "Inbox" and "Outbox" are the top-level directories, with the original folder hierarchy maintained throughout the conversion process.
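To make the list above concrete, here is a minimal sketch of what such a configuration file could look like and how a wrapper might read it. The element names, paths, and the `converter.exe {input} {output}` command template are all hypothetical: the post does not show the real file, and the real LuraWave command line is not reproduced here. The sketch is in Python rather than .NET purely for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical configuration; every element name and value is illustrative.
EXAMPLE_CONFIG = """\
<conversion>
  <inbox>D:/jp2/inbox</inbox>
  <temp>D:/jp2/temp</temp>
  <outbox>D:/jp2/outbox</outbox>
  <errors>D:/jp2/errors</errors>
  <commandLine>converter.exe {input} {output}</commandLine>
  <exclude>
    <file>Thumbs.db</file>
    <file>desktop.ini</file>
  </exclude>
</conversion>
"""


@dataclass
class ConversionConfig:
    inbox: str
    temp: str
    outbox: str
    errors: str
    command_line: str
    exclude: list = field(default_factory=list)


def load_config(xml_text):
    """Parse the XML text and fail early if a required setting is missing."""
    root = ET.fromstring(xml_text)

    def required(tag):
        node = root.find(tag)
        if node is None or not (node.text or "").strip():
            raise ValueError(f"missing <{tag}> in configuration")
        return node.text.strip()

    return ConversionConfig(
        inbox=required("inbox"),
        temp=required("temp"),
        outbox=required("outbox"),
        errors=required("errors"),
        command_line=required("commandLine"),
        exclude=[f.text.strip() for f in root.findall("exclude/file") if f.text],
    )
```

Keeping the command line in the configuration rather than in the code means a second conversion profile can be added by writing a second configuration file, without touching the wrapper.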

Polling of the specified input folder is handled with Windows Scheduler, which can be run on a PC or on a server (we run it on a virtual server). Every 5 minutes Windows Scheduler prompts the script to check for TIFFs in the "Inbox". LuraWave is then invoked, converting the TIFFs to JP2s that are copied out to the "Outbox". We’ve got some really good error handling in place, so if one rogue file can’t be converted the rest of the files still get converted. This is essential when converting big volumes: we don’t want the first file failing and halting an overnight run of thousands of files.
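The error-handling behaviour described above can be sketched roughly as follows. This is not the Genisys code: `convert_batch` and its `convert` argument are hypothetical names, and `convert` stands in for whatever actually invokes LuraWave. What happens to successfully converted TIFFs in the "Inbox" is not described in this post, so the sketch simply leaves them where they are.

```python
import shutil
from pathlib import Path


def convert_batch(inbox, outbox, errors, convert, exclude=()):
    """Convert every TIFF under inbox, mirroring its folder structure in outbox.

    A file that fails is moved to the errors directory and the run continues.
    Returns (converted, failed) as lists of paths relative to inbox.
    """
    inbox, outbox, errors = Path(inbox), Path(outbox), Path(errors)
    converted, failed = [], []

    # sorted() takes a snapshot, so moving files mid-loop is safe.
    for tiff in sorted(inbox.rglob("*.tif*")):
        if not tiff.is_file() or tiff.name in exclude:
            continue
        relative = tiff.relative_to(inbox)
        target = (outbox / relative).with_suffix(".jp2")
        target.parent.mkdir(parents=True, exist_ok=True)
        try:
            convert(tiff, target)  # hypothetical hook that calls the converter
            converted.append(relative)
        except Exception as exc:
            # One rogue file must not halt the batch: set it aside and go on.
            target.unlink(missing_ok=True)  # discard any partial output
            quarantine = errors / relative
            quarantine.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(tiff), str(quarantine))
            failed.append((relative, str(exc)))
    return converted, failed
```

Catching the failure per file, rather than around the whole loop, is what keeps an overnight run going after a single bad TIFF.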

Windows Scheduler does not process in parallel, so folders are queued for conversion. With speeds of around 30 GB (at least 1,200 TIFFs) per hour, this is quick enough for our needs.

This implementation means that a single LuraWave license can be used for any number of input streams, and with the facility to "call" multiple definitions, it can also convert images to multiple JPEG 2000 profiles (we currently have a lossless profile and a lossy profile).

With thanks to Alastair Reid, Wellcome Trust IT Account Manager, for providing this information and reviewing this post.

June 21, 2011

Thoughts on the 2011 JP2 Summit

I attended the JP2 Summit in Washington D.C. in May (initiated and organised by Robert Buckley and Steve Puglia and hosted by the Library of Congress) representing both the Wellcome Library and the JP2K-UK Working Group. I found this event an interesting counterpart to the JPEG2000 Seminar we held here at the Wellcome Trust last year.

There were around 90 people at the Summit, most from the D.C. area and eastern seaboard cultural institutions such as the LoC; National Archives; Smithsonian libraries and archives; a range of university libraries including Yale, Harvard, U. of Virginia, UConn; NARA; and many others. The level of experience in digital imaging and preservation was generally quite high, while the understanding of JPEG2000 ranged from very little to highly informed. It was nearly a mirror of the audience at the Wellcome Trust event, though perhaps with fewer privately funded organisations represented (there were some, including Google).

The day began with a tutorial by Robert Buckley, and although I had heard much of this in previous presentations, or through reading up on JP2, I always find it hard to keep the details fresh in my mind. So it was useful to get a refresher, and it set the stage well for people who had little knowledge of the technical issues and background to the format.

After the tutorial, there was a series of presentations, all of which are listed on the JPEG2000 page of the FADGI website. I won't go into the details here (and you can read more on Steve Puglia's blog post), but we heard about a range of practical issues around use of JP2 for newspaper digitisation, digital video, special collections and Google books; technical developments around implementing JP2 as part of a workflow including quality assurance and issues of long-term preservation; and the results of a survey of use and attitudes toward JP2 in libraries and archives.

In the library and archive community JP2 is being adopted mainly for mass digitisation, with storage costs being the primary driver - there is no denying that. What was clear here - as with the presentations given last year - was that while JP2 is not yet the most practical solution in terms of usability, it is becoming more and more widely accepted for its flexibility and robustness as well as for its space-saving intelligent compression. With increasing knowledge of the format, practitioners are now coming to see JP2 in the context of these other important features, and investigating - even demanding - ways to use them more easily.

Of course, not everyone is 100% convinced that JP2 can meet the needs of digital archiving, or digital image delivery. Many concerns seem to have been appeased by the presentations and tutorial - simply by finding out how many people are using the format, and how much value they get from it. There are still barriers to people taking up JP2 more enthusiastically - mainly around the lack of adoption by digital cameras and browsers, loss of information in lossy compression, risk that there still isn't a wide enough take-up in the community to maintain the currency of the format in the longer term, and the small range of tools for implementing the format that simply can't meet their needs.

The second day of the Summit finished off with a small-group discussion session around JP2 implementation. For me, the most interesting part of this discussion was around community building.

While we may never see digital cameras natively producing JP2s, for example, some barriers can be broken down by simply sharing. Information on and results of testing, tools and ways to use them, workflow advice, and preservation technologies are all important and can easily be shared. Use of JP2 doesn't always boil down to technical reassessment, however. It also means revisiting certain aspects of digital preservation strategy, such as defining significant properties/data, predicting migration scenarios and what they really entail, and determining what the use of the digital content really is. It also means recognising emotional responses to preservation risks and the fact that these decisions have a long-term effect, shaping the legacy of entire collections. The leap to JP2 is best done in collaboration, and moral support should not be discounted!

June 15, 2011

The JP2K-UK wiki has moved

The wiki created as part of the JP2K-UK working group has been moved to a dedicated space on the Open Planets wiki. The content has now been transferred and is in the process of being updated and added to. We welcome contributions - all you have to do is log into the OPF wiki.

May 27, 2011

ICC profiles and LuraWave

Johan van der Knijff's long-awaited D-Lib paper, JPEG 2000 for long term preservation: JP2 as a preservation format, has now come out. In this paper he mentions the various ways LuraWave has handled colour profile information, and I thought it was a good time to elaborate on the developments we have commissioned from Luratech regarding this issue.

As Johan mentions in the paper, when we started using LuraWave and carrying out JHOVE testing to determine whether the files were compliant with the standard, we found that where an ICC display profile was included in the TIFF (and this was virtually standard across our image set) LuraWave automatically encoded the file as JPX in a JP2 wrapper. This ensured compliance with the standard, but we were not happy with using JPX. So we asked Luratech to modify LuraWave to include an additional command that allowed us to tell the application to ignore the ICC profile completely. This meant that we got a 100% JP2 file, but the colour profile information was then stripped out.
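For readers who want a quick look at whether a file declares itself as JP2 or JPX, the sketch below reads the brand and compatibility list from the File Type box that follows the signature box. It is only a rough illustration based on the box layout as I understand it from the JPEG 2000 file format specification, not a substitute for JHOVE or a full validator, and `read_file_type` is a name made up for this example.

```python
import struct

# 12-byte JPEG 2000 signature box that opens every JP2-family file.
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"


def read_file_type(data):
    """Return (brand, compatibility_list) from the File Type box."""
    if data[:12] != JP2_SIGNATURE:
        raise ValueError("not a JP2-family file: signature box missing")
    length, box_type = struct.unpack(">I4s", data[12:20])
    if box_type != b"ftyp":
        raise ValueError("File Type box does not follow the signature box")
    if length < 16 or 12 + length > len(data):
        raise ValueError("File Type box has an unexpected length")
    payload = data[20:12 + length]
    brand = payload[:4].decode("ascii")
    # payload[4:8] holds the minor version; the rest is the compatibility list.
    compatible = [payload[i:i + 4].decode("ascii") for i in range(8, len(payload), 4)]
    return brand, compatible
```

A brand of `jp2 ` (with a trailing space) indicates a plain JP2 file, whereas a file of the kind described above would be expected to carry the `jpx ` brand, possibly with `jp2 ` in its compatibility list.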

We wanted to include a colour profile in our digital image files. This prevents ambiguity when decoding the images in an image editor or image viewer. We were left with only one option - convert everything to sRGB and allow LuraWave to include the numerical value of sRGB in the file, which is allowed by the standard. Adobe RGB 1998, as Johan explains in detail in his article, is allowed only as an input profile, and our images did not include an input profile (and we didn't know how we could go about adding an input profile to our images).

We knew that it wouldn't matter to us, to the user, or to the decoding programme, how the profile was labelled - as long as it was there. It mattered only to the standard. So we asked Luratech to modify LuraWave yet again in order to read the display profile in our TIFFs and embed it into the JP2 file as an input profile. It is not an input profile. But we were limited by the standard, and this was our best option within those limitations to ensure we could include colour information without having to limit ourselves to sRGB - and without having to add in a workflow step to convert all our legacy images to sRGB.

This is the version of LuraWave that we currently use (2.1.22.10 - which includes other enhancements around improving performance, as reported in an earlier blog post). However - since Johan has succeeded in raising awareness of the deficient colour space provision in the standard, leading to agreement in the JPEG Committee to change the standard to accommodate real use scenarios such as our own, we can envisage requesting further changes to the LuraWave command tool once this is finalised.

April 28, 2011

Guest post: Color in JP2

Rob Buckley, colour imaging expert and author of JPEG 2000 as a Preservation and Access Format for the Wellcome Library, writes about the implementation of colour space metadata in the JP2 format and planned changes to the specification to better accommodate this information.

When I talk about JPEG 2000, I point out that most if not all still image applications that use JPEG 2000, especially in the cultural heritage community, can be satisfied with the JP2 file format. JP2 is the basic file format defined in Part 1 of the JPEG 2000 standard, along with the core decoder. Part 2 of the standard defines extended versions of both the file format and decoder, offering features aimed at specialized or advanced applications.

One point of confusion about the use of JP2 has had to do with its support for color spaces. When we were developing JP2 in the late 1990s (JPEG 2000 was intended to come out in 2000), the application that most influenced the design was digital photography: JP2 was expected to be the next digital camera format. So support for sRGB was built in, along with support for the YCC and grayscale versions of sRGB. Other RGB color spaces used for image capture would be supported by using ICC input profiles, leaving aside display and output profiles. However, not all ICC input profiles were allowed: support was restricted to the ones needed for grayscale and RGB image data. Not supported and considered too complex for applications without a full color management engine was the input profile type that used a full multi-dimensional lookup-table. So users had the choice of specifying color in a JP2 file by name as sRGB (or sYCC or sGray) or via a simple ICC input profile.

After the release of the JPEG 2000 standard, two things happened. First, digital cameras kept exporting the JPEG Baseline format; when they added a new export format, it was Raw and not JP2. The drive was toward more creative control rather than better compression when what they had was good enough.

The second thing was that most people ended up using ICC display profiles for RGB spaces rather than input profiles. A small thing, you’d think, especially when the only difference between the display profiles they used and the input profiles supported by JP2 was the profile class value in the profile’s header: except for that, the data content of the two profile types is identical for RGB color spaces. As a result, I could take a JP2 file containing an RGB display profile (which technically makes the JP2 file illegal), change the profile class from display to input (by changing four bytes in the profile header and leaving everything else the same), and produce a legal JP2 file. It turns out that most readers ignore this value anyway and read the file fine either way. Using the extended file format was no help because it only extended color support to all types of input profiles, plus some other named and vendor-specified color spaces.

This confusion needed to be addressed as more and more institutions are using JP2 as a long-term preservation format, where predictability and clarity are prized. The solution is straightforward: amend the JP2 file format specification, aligning it with current practice so that it supports ICC display profiles as well as the set of input profiles it supports now.

And this is what is happening. Richard Clark and I led an activity that culminated in the JPEG 2000 committee approving a new activity to amend JP2 when it met this past February in Tokyo. This means that JP2 will support a wide range of RGB color spaces, which was the original intent, via both ICC input and display profiles. Since the JP2 spec was first issued, the ICC spec has undergone a major revision from V2 to V4 and been issued as an ISO standard. While this revision hardly affects the profiles used for RGB color spaces, it will also be addressed as part of the amendment. (The amendment will also address the ambiguity in the JP2 definition of resolution that Johan van der Knijff has brought up on this blog.)

The final outcome of all this will be a JP2 file format standard that aligns with current practice; supports RGB spaces such as Adobe RGB 1998, ProPhoto RGB and eci RGB v2; and provides a smooth migration path from TIFF masters as JP2 increasingly becomes used as an image preservation format.

January 28, 2011

TIFF to JPEG 2000 backlog, losslessness, and a perplexing speed issue

In October 2010 we initiated our "TIFF to JPEG 2000 backlog project", an endeavor to convert all the legacy images that make up our current image archive (Wellcome Images), as well as around 120,000 images that had been created during the Archives digitisation project. Over 450,000 images comprise the backlog, saved in a multitude of folders, on different servers on our Pillar SAN storage system. Converting the Wellcome Images TIFFs to lossless JPEG 2000 will save us around 12 TB of storage space alone.

Why lossless, you ask? We have indeed expounded on the merits of lossy compression for large image sets created as a result of digitisation projects. But there is a significant difference with regards to the backlog project. While digitisation projects are usually carried out on collections of material that have fairly similar physical formats (modern printed books, paper documents, Arabic manuscripts, etc.), lending themselves to a generalised approach to compression determined via testing, this backlog project has no overall commonality (other than that they are all TIFFs of one flavour or another). Wellcome Images is populated one image at a time, or by small sets of images, including born-digital photography, and represents a cross-section of hundreds of different content types. There was no feasible way to group these images into sets that could be assessed for compression tolerance. The decision was made, therefore, to convert the entire Wellcome Images backlog to lossless JP2 files, thus removing any doubt whether the compression levels were appropriate.

During the initial stages of this project, we tested our installation of the LuraWave conversion tool (v.2.1.21.10) with high volumes of images stored on our network storage (as all the archived TIFFs are). What we found surprised us - instead of the 20 min or so we expected for a batch of around 600 25 MB images, it was taking all night (around 6 hours). Was it a bandwidth issue? With the support of our IT team we carried out tests over the 1Gb network area. It was still unacceptably slow, showing that bandwidth was not the issue. We moved the same batch of images onto the local hard drive of the machine that LuraWave was installed on, and confirmed that, yes, LuraWave can convert those images in around 20 min when they are colocated.

We turned to our suppliers, LuraTech, who quickly ferreted out the problem. LuraWave was programmed to convert images in parallel, to speed up the process, but it also buffers images in parallel. This buffering process, when carried out across our 100Mb network cable, slowed down considerably due to the parallel running. LuraTech modified the programme to cache each image onto the local disk first, individually, before then buffering and converting in parallel as usual. This brought the overall time down by 80%. The version we are currently using is 2.1.22.10.
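The general shape of that fix, as I understand it from the description above, is to pull files across the network one at a time and only then process them in parallel from local disk. The sketch below illustrates the idea and is not LuraTech's implementation: `stage_then_convert` and `convert` are hypothetical names, and for simplicity it stages the whole batch before converting, which the real tool may well not do.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def stage_then_convert(network_files, convert, workers=4):
    """Copy files to a local cache sequentially, then convert them in parallel.

    convert is a hypothetical callable taking the local path of one image.
    Returns a dict mapping each original path to the result of convert.
    """
    with tempfile.TemporaryDirectory() as cache_dir:
        cache = Path(cache_dir)
        staged = []

        # Step 1: one network read at a time, so transfers do not compete.
        for index, source in enumerate(network_files):
            source = Path(source)
            local = cache / f"{index:06d}_{source.name}"  # avoid name clashes
            shutil.copy2(source, local)
            staged.append((source, local))

        # Step 2: parallel work now only touches the local disk.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {source: pool.submit(convert, local) for source, local in staged}
            return {source: future.result() for source, future in futures.items()}
```

The point of the split is that the slow, shared resource (the network link) is accessed by a single reader, while the parallelism is kept for the stage that benefits from it.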

In practice our approach has been tailored to suit individual sets of images within our backlog. A balance has to be struck between ease of use and the practicalities of applying multiple processing stages to files over a 100Mb network. Some image sets are copied locally to external hard drives, taking advantage of the speed gains this gives, whereas others that are more straightforward can be processed directly over the network using the much improved processing speeds. The combined efficiencies made converting our entire backlog feasible within the timeframe we had to spend on it.

We are now about a third of the way through the conversion backlog, and on track to become virtually TIFF-free by May 2011. What I haven't mentioned are the colour profile embedding issues that cropped up, the legacy colour space problems, and the work LuraTech did in addressing these issues - the topic of a future blog post.

December 20, 2010

Guest post: LoC response to discussion on long-term preservation of JPEG 2000

Carl Fleischhauer, Program Officer at NDIIPP, Library of Congress, responds to recent posts from Johan van der Knijff and the Wellcome Library regarding long-term preservation of JPEG 2000. Both posts mentioned the need to rate the JPEG 2000 format for long-term sustainability using criteria drawn up by the Library of Congress and the National Archives, UK (we have helpfully created an openly available/editable Google doc to make this a collaborative effort).

Thanks for provocative blogs

Thanks to Johan van der Knijff and Dave Thompson for the helpful blog postings here that frame some important questions about the sustainability of the JPEG 2000 format. Caroline Arms and I were flattered to see that our list of format-assessment factors was cited, along with the criteria developed at the UK National Archives. We certainly agree that many of these factors have a theoretical turn and that judgments about sustainability must be leavened by actual experience.

We also call attention to the importance of what we call Quality and Functionality factors (hereafter Q&F factors). It is possible that some formats will "score" high enough on these factors as to outweigh perceived shortcomings on the Sustainability Factor front.

As I drafted this response, I benefited from comments from Caroline and Michael Stelmach, the Library of Congress staffer who chairs the Federal Agencies Still Image Digitization Guidelines Working Group.

Colorspace (as it relates to the LoC's Q&F factor Color Maintenance)

We agree that the JPEG 2000 specification would be improved by the ability to use and declare a wider array of color spaces and/or ICC profile categories. We join you in endorsing Rob Buckley's valuable work on a JP2 extension to accomplish that outcome.

When Michael and I were chatting about this topic, he said that he had been doing some informal evaluations of the spectra represented in printed matter at the Library of Congress. This is an informal investigation (so far) and his comment was off the cuff, but he said he had been surprised to see that the colors he had identified in a wide array of original items could indeed be represented within the sRGB color gamut, one of the enumerated color spaces in part 1 of the JPEG 2000 standard.

Michael added that he knew that some practitioners favor scRGB - not included in the JPEG 2000 enumerated list - either because of scRGB's increased gamut and/or perhaps because it allows for linear-to-intensity representations of brightness rather than only gamma-corrected representations. The extended gamut - compared to sRGB - will be especially valuable when reproducing items like works of fine art. And we agree with Johan van der Knijff's statement that there will be times when we will wish to go beyond input-class ICC profiles and embrace 'working' color spaces. All the more reason to support Rob Buckley's effort.

Adoption (the LoC Sustainability criteria include adoption as a factor)

This is an area in which we all have mixed feelings: there is adoption of JPEG 2000 in some application areas but we wish there were more. Caroline pointed to one positive indicator: many practitioners who preserve and present high-pixel-count images like scanned maps have embraced JPEG 2000, in part because of its support for efficient panning and zooming. The online presentation of maps at the Library of Congress is one good example (for a given map you see an 'old' JPEG in the browser, generated from JPEG 2000 data under the covers).

Caroline adds that the geospatial community uses JPEG 2000 as a standard (publicly documented, non-proprietary) alternative to the proprietary MrSID. Both formats continue to be used. LizardTech tools now support both equally. Meanwhile, GeoTIFF is used a lot too. Caroline notes that LizardTech re-introduced a free stand-alone viewer for JPEG2000/MrSID images last year in response to customer demand. And a new service for solar physics from NASA, Helioviewer, is based on JPEG2000. NASA includes a justification for using the format on their website.

For my part, I can report encountering some JPEG 2000 uptake in moving image circles, ranging from its use in the digital cinema's 'package' specification (see a slightly out of date summary) to its inclusion in Front Porch Digital's SAMMA device, used to reformat videotapes in a number of archives, including the Library of Congress.

Meanwhile, Michael recalled seeing papers that explored the use of JPEG 2000 compression in medical imaging (where JPEG 2000 is an option in the DICOM standard), with findings that indicated that diagnoses were just as successful in JPEG 2000 compressed images as they were when radiologists consulted uncompressed images. An online search using a set of terms like "JPEG2000, medical imaging, radiology" will turn up a number of relevant articles on this topic, including Juan Paz et al, 2009, "Impact of JPEG 2000 compression on lesion detection in MR imaging," in Medical Physics, which provides evidence to this effect.

On the other hand - negative indicators, I guess - we have the example of non-adoption by professional still photographers. On the creation-and-archiving side, their fondness for retaining sensor data motivates them to retain raw files or to wrap that raw data in DNG. I was curious about the delivery side, and looked at the useful dpBestFlow website and book, finding that the author-photographer Richard Anderson reports that he and his professional brethren deliver the following to their customers: RGB or CMYK files (I assume in TIFF or one of the pre-press PDF wrappers), "camera JPEGs" (old style), "camera TIFFs," or DNGs or raw files. There is no question that the lack of uptake of JPEG 2000 by professional photographers hampers the broader adoption of JPEG 2000.

Software tools (their existence is part of the Sustainability Factor of Adoption; their misbehavior is, um, misbehavior)

It was very instructive to see Johan van der Knijff's report on his experiments with LuraTech, Kakadu, PhotoShop, and ImageMagick. If he is correct, these packages do misbehave a bit and we should all encourage the manufacturers to fix what is broken. There is of course a dynamic between the application developers and adoption by their customers. If there is not greater uptake in realms like professional photography, will the software developers like Adobe take the time to fix things or even continue to support the JPEG 2000 side of their products?

Caroline, Michael, and I pondered Johan van der Knijff's suggestion that "the best way to ensure sustainability of JPEG 2000 and the JP2 format would be to invest in a truly open JP2 software library." We found ourselves of two minds about this. On the one hand, such a thing would be very helpful but, on the other, building such a package is definitely a non-trivial exercise. What level of functionality would be desired? The more we want, the more difficult to build. Johan van der Knijff's comments about JasPer remind us that some open source packages never receive enough labor to produce a product that rivals commercial software in terms of reliability, robustness, and functional richness. Would we be happy with a play-only application, to let us read the files we created years earlier with commercial packages that, by that future time, are defunct? In effect such an application would be the front end of a format-migration tool, restoring the raster data so that it can be re-encoded into our new preferred format. As we thought about this, we wondered if people would come forward to continue to update the software for new programming languages and operating systems, to keep them in operation to ensure that they are still working.

As a sidebar, Johan van der Knijff summarizes David Rosenthal's argument that "preserving the specifications of a file format doesn’t contribute anything to practical digital preservation" and "the availability of working open-source rendering software is much more important." We would like to assert that you gotta have 'em both: it would be no good to have the software and not the spec to back it up.

Error resilience

Preamble to this point: In drafting this, I puzzled over the fit of error resilience to our Sustainability and Quality/Functionality factors. In our description of JPEG 2000 core coding we mention error resilience in the Q&F slot Beyond Normal. But this might not be the best place for it. Caroline points out that error resilience applies beyond images and she notes that it may conflict with transparency (one of our Sustainability Factors). We find ourselves wishing for a bit of discussion of this sub-topic. Should error resilience be added as a Sustainability Factor, or expressed within one of the existing factors? Meanwhile, how important is transparency as a factor?

Here's the point in the case of JPEG 2000: Johan van der Knijff's blog does not comment on the error resilience elements in the JPEG 2000 specification. These are summarized in annex J, section 7, of the specification (pages 167-68 in the 2004 version), where the need for error resilience is associated with the "delivery of image data over different types of communication channels." We have heard varying opinions about the potential impact of these elements on long term preservation but tend to feel, "it can't be bad."

Here are a few of the elements, as outlined in annex J.7:
  • The entropy coding of the quantized coefficients is done within code-blocks. Since the encoding and decoding of the code-blocks are independent, bit errors in the bit stream of a code-block will be contained within that code-block.
  • Termination of the arithmetic coder is allowed after every coding pass. Also, the contexts may be reset after each coding pass. This allows the arithmetic coder to continue to decode coding passes after errors.
  • The optional arithmetic coding bypass style puts raw bits into the bit stream without arithmetic coding. This prevents the types of error propagation to which variable length coding is susceptible.
  • Short packets are achieved by moving the packet headers to the PPM (Packed Packet headers, Main header marker) or PPT (Packed packet header, Tile-part header marker) segments. If there are errors, the packet headers in the PPM or PPT marker segments can still be associated with the correct packet by using the sequence number in the SOP (Start of Packet marker).
  • A segmentation symbol is a special symbol. The correct decoding of this symbol confirms the correctness of the decoding of this bit-plane which allows error detection.
  • A packet with a resynchronization marker SOP allows spatial partitioning and resynchronization. This is placed in front of every packet in a tile with a sequence number starting at zero. It is incremented with each packet.
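As a rough illustration of the last element, the sketch below shows how a sequence number carried in each SOP segment lets a reader notice that a packet has gone missing. It is a toy under stated assumptions: the marker value 0xFF91 and the segment layout (marker, a length field of 4, then a 16-bit sequence number) are as I recall them from Part 1 and should be checked against the specification, the function names are invented, and a real decoder would find these segments while parsing packets within a tile rather than by scanning bytes.

```python
import struct

SOP_MARKER = 0xFF91  # Start of Packet marker (assumed value, see lead-in)


def sop_segment(sequence_number):
    """Build a 6-byte SOP segment: marker, length field (4), sequence number."""
    return struct.pack(">HHH", SOP_MARKER, 4, sequence_number % 65536)


def find_sequence_gaps(stream):
    """Return (expected, found) pairs wherever the SOP sequence jumps."""
    gaps = []
    expected = 0
    marker = struct.pack(">H", SOP_MARKER)
    position = stream.find(marker)
    while position != -1 and position + 6 <= len(stream):
        _, length, number = struct.unpack(">HHH", stream[position:position + 6])
        if length == 4:  # ignore byte patterns that are not SOP segments
            if number != expected:
                gaps.append((expected, number))
            # Resynchronize on the number actually seen and carry on.
            expected = (number + 1) % 65536
        position = stream.find(marker, position + 2)
    return gaps
```

For example, if five packets numbered 0 to 4 are written and packet 2 is lost, the function reports one gap, `(2, 3)`, and continues with packet 4 as normal.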

Conclusion

Thanks to the Wellcome Library for helping all of us focus on this important topic. We look forward to a continuing conversation.