OPF News & Overview


This page gives an overview of the content across all the different wiki sites hosted by the OPF.

Open Planets Foundation News

Open Planets Foundation
(The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium.)
In defence of migration

There is a trend in digital preservation circles to question the need for migration.  The argument varies a little from proponent to proponent but in essence, it states that software exists (and will continue to exist) that will read (and perform requisite functions, e.g., render) old formats.  Hence, proponents conclude, there is no need for migration.  I had thought it was a view held by a minority but at a recent workshop it became apparent that it has been accepted by many.


However, I've never thought this is a very strong argument.  I've always seen a piece of software that can deal with not only new formats but also old formats as really just a piece of software that can deal with new formats with a migration tool seamlessly bolted onto the front of it.  In essence, it is like saying I don't need a migration tool and a separate rendering tool because I have a combined migration and rendering tool.  Clearly that's OK, but it does not mean you're not performing a migration.

 

As I see it, whenever a piece of software is used to interpret a non-native format it will need to perform some form of transformation from the information model inherent in the format to the information model used in the software.  It can then perform a number of subsequent operations, e.g., render to the screen or maybe even save to a native format of that software.  (If the latter happens this would, of course, be a migration.) 

 

Clearly the way software behaves is infinitely variable but it seems to me that it is fair to say that there will normally be a greater risk of information loss in the first operation (the transformation between information models) than in subsequent operations that are likely to utilise the information model inherent in the software (be it rendering or saving in the native format).  Hence, if we are concerned with whether or not we are seeing a faithful representation of the original it is the transformation step that should be verified. 

 

This is where using a separate migration tool comes into its own (at least in principle).  The point is that it allows an independent check to be made of the quality of the transformation to take place (by comparing the significant properties of the files before and after).  Subsequent use of the migrated file (e.g., by a rendering tool) is assumed to be lossless (or at least less lossy) since you can choose the migrated format so that it is the native format of the tool you intend to use subsequently (meaning when the file is read no transformation of information model is required). 
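To make this concrete, a toy sketch of such an independent check for a simple image migration might look like the following (not from the original post; it assumes the Pillow library and hypothetical file names, and compares only a deliberately minimal set of properties):

from PIL import Image

def significant_properties(path):
    # A deliberately minimal property set: dimensions and colour mode.
    with Image.open(path) as im:
        return {"width": im.width, "height": im.height, "mode": im.mode}

before = significant_properties("original.tif")   # hypothetical input
after = significant_properties("migrated.png")    # hypothetical migration output

mismatches = {k: (before[k], after[k]) for k in before if before[k] != after[k]}
print("properties match" if not mismatches else "mismatches: %s" % mismatches)

A real check would of course cover far more properties than this, but the shape of the comparison stays the same.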

However, I would concede that there are some pragmatic things to consider...

 

First of all, migration either has a cost (if it requires the migrated file to be stored) or is slow (if it is done on demand).  Hence, there are probably cases where simply using a combined migration and rendering tool is a more convenient solution and might be good enough.

 

Secondly, is migration validation worth the effort?  Certainly it is worth simply testing, say, a rendering tool with some example files before deciding to use it, and most of the time this should be sufficient to determine that the tool works without detailed validation.  However, we have seen cases where we detected uncommon issues in common migration libraries, so migration validation does catch issues that would go unnoticed if the same libraries were used in a combined migration and rendering tool. 

 

Thirdly, is migration validation comprehensive enough?  The answer to this depends on the formats but for some (even common) formats it is clear that better, more comprehensive tools would do a better job.  Of course the hope is that this will continually improve over time. 

 

So, to conclude, I do see migration as a valid technique (and in fact a technique that almost everyone uses, even if they don't realise it).  I see that one of the aims of the digital preservation community should be to provide an intellectually sound view of what constitutes a high-quality migration (e.g. through a comprehensive view of significant properties across a wide range of object types).  It might be that real-life tools provide some pragmatic approximation to this idealistic vision (potentially using short cuts such as a combined migration and rendering tool), but we should at least understand and be able to express what these short cuts are.

 

I hope this post helps to generate some useful debate.

 

Rob

Six ways to decode a lossy JP2

Some time ago Will Palmer, Peter May and Peter Cliff of the British Library published a really interesting paper that investigated three different JPEG 2000 codecs, and their effects on image quality in response to lossy compression. Most remarkably, their analysis revealed differences not only in the way these codecs encode (compress) an image, but also in the decoding phase. In other words: reading the same lossy JP2 produced different results depending on which implementation was used to decode it.

A limitation of the paper's methodology is that it obscures the individual effects of the encoding and decoding components, since both are essentially lumped in the analysis. Thus, it's not clear how much of the observed degradation in image quality is caused by the compression, and how much by the decoding. This made me wonder how similar the decode results of different codecs really are.

An experiment

To find out, I ran a simple experiment:

  1. Encode a TIFF image to JP2.
  2. Decode the JP2 back to TIFF using different decoders.
  3. Compare the decode results using some similarity measure.

Codecs used

I used the following codecs:

  • OpenJPEG (opj_decompress)
  • Kakadu (kdu_expand)
  • IrfanView (LuraTech JPEG 2000 plugin)
  • ImageMagick
  • GraphicsMagick

Note that GraphicsMagick still uses the JasPer library for JPEG 2000. ImageMagick now uses OpenJPEG (older versions used JasPer). IrfanView's JPEG 2000 plugin is made by LuraTech.

Creating the JP2

First I compressed my source TIFF (a grayscale newspaper page) to a lossy JP2 with a compression ratio of about 4:1. For this example I used OpenJPEG, with the following command line:

opj_compress -i krant.tif -o krant_oj_4.jp2 -r 4 -I -p RPCL -n 7 -c [256,256],[256,256],[256,256],[256,256],[256,256],[256,256],[256,256] -b 64,64

Decoding the JP2

Next I decoded this image back to TIFF using the aforementioned codecs. I used the following command lines:

Codec            Command line
opj20            opj_decompress -i krant_oj_4.jp2 -o krant_oj_4_oj.tif
kakadu           kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu.tif
kakadu-precise   kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu_precise.tif -precise
irfan            Used GUI
im               convert krant_oj_4.jp2 krant_oj_4_im.tif
gm               gm convert krant_oj_4.jp2 krant_oj_4_gm.tif

This resulted in 6 images. Note that I ran Kakadu twice: once using the default settings, and also with the -precise switch, which "forces the use of 32-bit representations".

Overall image quality

As a first analysis step I computed the overall peak signal to noise ratio (PSNR) for each decoded image, relative to the source TIFF:

Decoder          PSNR
opj20            48.08
kakadu           48.01
kakadu-precise   48.08
irfan            48.08
im               48.08
gm               48.07

So relative to the source image these results are only marginally different.
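For illustration only (the original analysis used its own tooling), a PSNR computation along these lines could be sketched with numpy and Pillow, assuming 8-bit images of identical size:

import numpy as np
from PIL import Image

def psnr(path_a, path_b, max_value=255.0):
    a = np.asarray(Image.open(path_a), dtype=np.float64)
    b = np.asarray(Image.open(path_b), dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 20 * np.log10(max_value / np.sqrt(mse))

# Decoded image against the source TIFF (file names as used above).
print(round(psnr("krant.tif", "krant_oj_4_oj.tif"), 2))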

Similarity of decoded images

But let's have a closer look at how similar the different decoded images are. I did this by computing PSNR values of all possible decoder pairs. This produced the following matrix:

Decoder          opj20   kakadu   kakadu-precise   irfan   im      gm
opj20            -       57.52    78.53            79.17   96.35   64.43
kakadu           57.52   -        57.51            57.52   57.52   57.23
kakadu-precise   78.53   57.51    -                79.00   78.53   64.52
irfan            79.17   57.52    79.00            -       79.18   64.44
im               96.35   57.52    78.53            79.18   -       64.43
gm               64.43   57.23    64.52            64.44   64.43   -

Note that, unlike the table in the previous section, these PSNR values are only a measure of the similarity between the different decoder results. They don't directly say anything about quality (since we're not comparing against the source image). Interestingly, the PSNR values in the matrix show two clear groups:

  • Group A: all combinations of OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode, all with a PSNR of > 78 dB.
  • Group B: all remaining decoder combinations, with a PSNR of < 65 dB.

What this means is that OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode all decode the image in a similar way, whereas Kakadu (default mode) and GraphicsMagick behave differently. Another way of looking at this is to count the pixels that have different values for each combination. This yields up to 2 % different pixels for all combinations in group A, and about 12 % in group B. Finally, we can look at the peak absolute error value (PAE) of each combination, which is the maximum value difference for any pixel in the image. This figure was 1 pixel level (0.4 % of the full range) in both groups.
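As a rough sketch (again assuming 8-bit images of identical size), the differing-pixel percentage and the PAE for one decoder pair could be computed like this:

import numpy as np
from PIL import Image

a = np.asarray(Image.open("krant_oj_4_oj.tif"), dtype=np.int16)
b = np.asarray(Image.open("krant_oj_4_kdu.tif"), dtype=np.int16)

diff = np.abs(a - b)                       # int16 avoids unsigned wrap-around
pct_different = 100.0 * np.count_nonzero(diff) / diff.size
pae = int(diff.max())                      # peak absolute error

print("%.1f %% of pixels differ, PAE = %d" % (pct_different, pae))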

I also repeated the above procedure for a small RGB image. In this case I used Kakadu as the encoder. The decoding results of that experiment showed the same overall pattern, although the differences between groups A and B were even more pronounced, with PAE values in group B reaching up to 3 pixel values (1.2 % of full range) for some decoder combinations.

What does this say about decoding quality?

It would be tempting to conclude from this that the codecs that make up group A provide better quality decoding than the others (GraphicsMagick, Kakadu in default mode). If this were true, one would expect that the overall PSNR values relative to the source TIFF (see previous table) would be higher for those codecs. But the values in the table are only marginally different. Also, in the test on the small RGB image, running Kakadu in precise mode lowered the overall PSNR value (although by a tiny amount). Such small effects could be due to chance, and for a conclusive answer one would need to repeat the experiment for a large number of images, and test the PSNR differences for statistical significance (as was done in the BL analysis).

I'm still somewhat surprised that even in group A the decoding results aren't identical, but I suspect this has something to do with small rounding errors that arise during the decode process (maybe someone with a better understanding of the mathematical intricacies of JPEG 2000 decoding can comment on this). Overall, these results suggest that the errors that are introduced by the decode step are very small when compared against the encode errors.

Conclusions

OpenJPEG, (recent versions of) ImageMagick, IrfanView and Kakadu in precise mode all produce similar results when decoding lossily compressed JP2s, whereas Kakadu in default mode and GraphicsMagick (which uses the JasPer library) behave differently. These differences are very small when compared to the errors that are introduced by the encoding step, but for critical decode applications (e.g. migrating a lossy JP2 to some other format) they may still be significant. As both ImageMagick and GraphicsMagick are often used for calculating image (quality) statistics, the observed differences also affect the outcome of such analyses: calculating the PSNR of a JP2 with ImageMagick and with GraphicsMagick results in two different outcomes!

For losslessly compressed JP2s, the decode results of all tested codecs are 100% identical [1].

This tentative analysis does not support any conclusions on which decoders are 'better'. That would need additional tests with more images. I don't have time for that myself, but I'd be happy to see others have a go at this!

Link

William Palmer, Peter May and Peter Cliff: An Analysis of Contemporary JPEG2000 Codecs for Image Format Migration (Proceedings, iPres 2013)


  1. Identical in terms of pixel values; for this analysis I didn't look at things such as embedded ICC profiles, which not all encoders/decoders handle well

 

Tool highlight: SCAPE Online Demos

Now that we are entering the final days of the SCAPE project, we would like to highlight some SCAPE Quality Assurance tools that have an online demonstrator.

 

See http://scape.demos.opf-labs.org/ for the following tools:

 

Pagelyzer: Compares web pages

Monitor your web content.

 

Jpylyzer: Validates images

JP2K validator and properties extractor.

 

Xcorr-sound: Compares audio sounds

Improve your digital audio recordings.

 

Flint: Validates different files and formats

Validate PDF/EPUB files against an institutional policy

 

Matchbox: Compares documents (following soon)

Duplicate image detection tool.

 

For more info on these and other tools and the SCAPE project, see http://www.scape-project.eu/tools.

Interview with a SCAPEr - Ed Fay

Ed Fay

Who are you?

My name is Ed Fay, I’m the Executive Director of the Open Planets Foundation.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?

OPF has been involved in technical and take-up work all the way through the project, but right now we’re focused on sustainability – what happens to all the great results that have been produced after the end of the project.

Why is your organisation involved in SCAPE?

OPF has been responsible for leading the sustainability work and will provide a long-term home for the outputs, preserving the software and providing an ongoing collaboration of project partners and others on best practices and other learning. OPF members include many institutions who have not been part of SCAPE but who have an interest in continuing to develop the products, and through the work that has been done - for example on software maturity and training materials - OPF can help to lower barriers to adoption by these institutions and others.

What are the biggest challenges in SCAPE as you see it?

The biggest challenge in sustainability is identifying a collaboration model that can persist outside of project funding. As cultural heritage budgets are squeezed around the world and institutions adapt to a rapidly changing digital environment the community needs to make best use of the massive investment in R&D that has been made, by bodies such as the EC in projects such as SCAPE. OPF is a sustainable membership organisation which is helping to answer these challenges for its members and provide effective and efficient routes to implementing the necessary changes to working practices and infrastructure. In 20 years we won’t be asking how to sustain work such as this – it will be business as usual for memory institutions everywhere – but right now the digital future is far from evenly distributed.

But from the SCAPE perspective we have a robust plan which encompasses many different routes to adoption, which is of course the ultimate route to sustainability – production use of the outputs by the community for which they were intended. The fact that many outputs are already in active use – as open-source tools and embedded into commercial systems – shows that SCAPE has produced not only great research but mature products which are ready to be put to work in real-world situations.

What do you think will be the most valuable outcome of SCAPE?

This is very difficult for me to answer! Right now OPF has the privileged perspective of transferring everything that has matured during the project into our stewardship - from initial research, through development, and now into mature products which are ready for the community. So my expectation is that there are lots of valuable outputs which are not only relevant in the context of SCAPE but also as independent components. One particular product has already been shortlisted for the Digital Preservation Awards 2014, which is being co-sponsored by OPF this year, while others have won awards at DL2014. These might be the most visible in receiving accolades, but there are many other tools and services which provide the opportunity to enhance digital preservation practice within a broad range of institutions. I think the fact that SCAPE is truly cross-domain is very exciting – working with scientific data, cultural heritage, web harvesting – it shows that digital preservation is truly maturing as a discipline.

If there could be one thing to come out of this, it would be an understanding of how to continue the outstanding collaboration that SCAPE has enabled, to sustain cost-effective digital preservation solutions that can be adopted by institutions of all sizes and diversity.

Contact information

ed@openplanetsfoundation.org

twitter.com/digitalfay

SCAPE Project Ends on the 30th of September

It is difficult to write that headline. After nearly four years of hard work, worry, setbacks, triumphs, weariness, and exultation, the SCAPE project is finally coming to an end.

I am convinced that I will look back at this period as one of the highlights of my career. I hope that many of my SCAPE colleagues will feel the same way.

I believe SCAPE was an outstanding example of a successful European project, characterised by

  • an impressive level of trouble-free international cooperation;
  • sustained effort and dedication from all project partners;
  • high quality deliverables and excellent review ratings;
  • a large number of amazing results, including more software tools than we can demonstrate in one day!

I also believe SCAPE has made and will continue to make a significant impact on the community and practice of digital preservation. We have achieved this impact through

I would like to thank all the people who contributed to the SCAPE project, who are far too numerous to name here. In particular I would like to thank our General Assembly members, our Executive Board/Sub-project leads, the Work Package leads, and the SCAPE Office, all of whom have contributed to the joy and success of SCAPE.

Finally, I would like to thank the OPF for ensuring that the SCAPE legacy will continue to live and even grow long after the project itself is just a fond memory.

It's been a pleasure folks. Well done!

Weirder than old: The CP/M File System and Legacy Disk Extracts for New Zealand’s Department of Conservation

We've been doing legacy disk extracts at Archives New Zealand for a number of years, with much of the effort that enables this work being done by my colleague Mick Crouch and former Archives New Zealand colleague Euan Cochrane. Earlier this year we received some disks from New Zealand's Department of Conservation (DoC), which we successfully imaged, extracting what was needed by the department. While it was a pretty straightforward exercise, there was enough about it that was interesting to warrant this blog post documenting another facet of the digital preservation work we're doing, especially in the spirit of providing another war story that others in the community can refer to. We conclude with a few thoughts about where we still relied on a little luck, which we'll have to keep in mind moving forward.

We received 32 180 KB 5.25-inch disks from DoC: Maxell MD1-D, single-sided, double-density, containing what we expected to be survey data from circa 1984/1985.

Our goal with these disks, as with any that we are finding outside of a managed records system, is to transfer the data to a more stable medium, as disk images, and then extract the objects on the imaged file system to enable further appraisal. From there a decision will be made about how much more effort should be put into preserving the content and making suitable access copies of whatever we have found – a triage.

For agencies with 3.5-inch floppy disks, we normally help to develop a workflow within that organisation that enables them to manage this work for themselves using the more ubiquitous 3.5-inch USB disk drives. With 5.25-inch disks it is more difficult to find suitable floppy disk drive controllers, so we try our best at Archives to do this work on behalf of colleagues, using equipment we've set up around the KryoFlux universal USB floppy disk controller. The device enables write-blocked reading and imaging of legacy disk formats at a forensic level using modern PC equipment.

We create disk images of the floppies using the KryoFlux and continue to use those images as a master copy for further triage. A rough outline of the process we tend to follow, plus some of its rationale, is documented by Euan Cochrane in his Open Planets Foundation blog: Bulk disk imaging and disk-format identification with KryoFlux.

Through a small amount of trial and error we discovered that the image format with which we were capable of reading the most sectors without error was MFM (Modified Frequency Modulation), with the following settings:

Image Type:     MFM Sector Image
Start Track:    At least 0
End Track:      At most 83
Side Mode:      Side 0
Sector Size:    256 Bytes
Sector Count:   Any
Track Distance: 40 Tracks
Target RPM:     By Image type
Flippy Mode:    Off

We didn’t experiment to see if these settings could be further optimised as we found a good result. The non-default settings in the case of these disks were side mode zero, sector size 256 bytes, track distance at 40, and flippy mode was turned off.

Taken away from volatile and unstable media, we have binary objects that we can now attach fixity to, and treat using more common digital preservation workflows. We managed to read 30 out of 32 disks.

Exploding the Disk Images

With the disk images in hand we found ourselves facing our biggest challenge. The images, although clearly well formed (i.e. not corrupt), would not mount with Virtual Floppy Disk in Windows, or in Linux.

Successful imaging alone doesn’t guarantee ease of mounting. We still needed to understand the underlying file system.

The images that we've seen before have been FAT12 and mount with ease in MS-DOS or Linux. These disks did not share the same identifying signatures at the beginning of the bitstream. We needed a little help identifying them, and fortunately, through forensic investigation and a little experience demonstrated by a colleague, it became quite clear the disks were CP/M formatted; the following ASCII text appears as-is in the bitstream:

*************************
*     MIC-501  V1.6     *
*   62K CP/M  VERS 2.2  *
*************************

COPYRIGHT  1983, MULTITECH BIOS VERS 1.6

CP/M (Control Program for Microcomputers) is a 1970s and early 1980s operating system for early Intel microcomputers. The makers of the operating system were approached by IBM about licensing CP/M for its Personal Computer product, but talks failed and IBM went with MS-DOS from Microsoft; the rest is ancient history…

With the knowledge that we were looking at a CP/M file system we were able to source a mechanism to mount the disks in Windows. Cpmtools is a privately maintained suite of utilities for interacting with CP/M file systems. It was developed for working with CP/M in emulated environments, but it works equally well with floppy disks and disk images. The tool is available on Windows and POSIX-compliant systems.

Commands for the different utilities look like the following.

Creating a directory listing:

C:> cpmls -f bw12 disk-images\disk-one.img

This will list the user number (a CP/M concept) and the file objects belonging to that user.

E.g.:

0:
   File1.txt
   File2.txt

Extracting objects based on user number:

C:> cpmcp -f bw12 -p -t disk-images\disk-one.img 0:* output-dir

This will extract all objects collected logically under user 0: and put them into an output directory.

Finding the right commands was a little tricky at first, but once the correct set of arguments was found, it was straightforward to keep repeating them for each of the disks.
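A minimal sketch of how that repetition could be scripted (hypothetical paths, and assuming the cpmtools binaries are on the PATH) might look like this:

import subprocess
from pathlib import Path

IMAGE_DIR = Path("disk-images")    # hypothetical location of the .img files
OUTPUT_ROOT = Path("extracted")

for image in sorted(IMAGE_DIR.glob("*.img")):
    out_dir = OUTPUT_ROOT / image.stem
    out_dir.mkdir(parents=True, exist_ok=True)

    # Log the directory listing (user numbers and file names) for this disk.
    subprocess.run(["cpmls", "-f", "bw12", str(image)], check=True)

    # Extract everything under user 0 into a per-disk output directory.
    subprocess.run(
        ["cpmcp", "-f", "bw12", "-p", "-t", str(image), "0:*", str(out_dir)],
        check=True)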

One of the less intuitive values supplied to the command line was the ‘bw12’ disk definition. This refers to a definition file, detailing the layout of the disk. The definition that worked best for our disks was the following:

# Bondwell 12 and 14 disk images in IMD raw binary format

diskdef bw12
  seclen 256
  tracks 40
  sectrk 18
  blocksize 2048
  maxdir 64
  skew 1
  boottrk 2
  os 2.2
end

The majority of the disks extracted well. A small on-image modification we made was the conversion of filenames containing forward slashes. The forward slashes did not play well with Windows, so I took the decision to change the slashes to hashes (a small hex edit) to ensure the objects were safely extracted into the output directory.

WordStar and other bits ‘n’ pieces

Content on the disks was primarily WordStar – CP/M's flavour of word processor. Despite MS-DOS versions of WordStar, the program eventually lost market share in the 1980s to WordPerfect, almost in parallel with the demise of CP/M. It took a little searching to source a converter to turn the WordStar content into something more useful, but we did find something fairly quickly. The biggest issue with viewing WordStar content as-is in a standard text editor is the format's use of the high-order bits within individual bytes to define word boundaries, as well as to make other denotations.

Example text, read verbatim might look like:

thå  southerî coasô = the southern coast

At first, I was sure this was a sign of bit-flipping on less stable media. Again, the experience colleagues had with older formats was useful here, and a consultation with Google soon helped me to understand what we were seeing.
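The principle itself is simple enough to sketch (an illustration only; for the real conversion we used a dedicated tool, described below): clearing the high-order bit of each byte recovers plain ASCII.

def strip_high_bits(data: bytes) -> bytes:
    # WordStar sets bit 7 on the last character of many words; clear it.
    return bytes(b & 0x7F for b in data)

sample = bytes([0x74, 0x68, 0xE5])              # "thå": high bit set on the "e"
print(strip_high_bits(sample).decode("ascii"))  # -> "the"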

Looking for various readers or migration tools led me to a number of dead websites, but the Internet Archive came to the rescue and allowed us to see them: WordStar to other format solutions.

The tool we ended up using was the HABit WordStar Converter, with more information on Softpedia.com. It does bulk conversion of WordStar to plain text or HTML. We didn't have to worry too much about how faithful the representation would be; as this was just a triage, we were more interested in the intellectual value of the content, or data. Rudimentary preservation of layout would be enough. We were very happy with plain-text output, with the option of HTML output too.

Unfortunately, when we approached Henry Bartlett, the developer of the tool, about a small bug in the bulk conversion (the tool neutralises file format extensions on output of the text file, causing naming collisions), we were informed by his wife that he had sadly passed away. I hope it was some reassurance to her to know that, at the very least, his work is still of great use to a good number of people doing format research, and to those who will eventually consume the objects we're working on.

Conversion was still a little more manual than we would like if we had larger numbers of files, but everything ran smoothly. Each of the deliverables was collected and taken back to the parent department on a USB stick, along with the original 5.25-inch disks.

We await further news from DoC about what they’re planning on doing with the extracts next.

Conclusions

The research to complete this work took a couple of weeks overall. With more dedicated time it might have taken a week.

Since completion and delivery to the Department of Conservation, we've run through the same process on another batch of disks. This took a fraction of the time – possibly an afternoon. The process can be refined with each further iteration.

The next step is to understand the value in what was extracted. This might mean using the extract to source printed copies of the content and concluding that we can dispose of these disks and their content. An even better result might be discovering that there are no other copies of the material and that these digital objects can become records in their own right, with potential for long-term retention. At the very least those conversations can now begin. In the latter instance, we'll need to understand which of the various deliverables – the disk images, the extracted objects, or the migrated objects – will be considered the record.

Demonstrable value acts like a weight on the scales of digital preservation, where we try to balance effort with value; especially in this instance, where the purpose of the digital material is as yet unknown. This case study is born of a gap in the recordkeeping process that sees the parent department attempting to understand the information in its possession in lieu of other recordkeeping metadata.

Aside from the value in what was extracted, there is still a benefit to us as an archive, and as a team in working with old technology, and equipment. Knowledge gained here will likely prove useful somewhere else down the line. 

Identifying the file system could have been a little easier, and so we’d echo the call from Euan in the aforementioned blog post to have identification mechanisms for image formats in DROID-like tools.

Forensic analysis of the disk images and comparing that data to that extracted by CP/M Tools showed a certain amount of data remanence, that is, data that only exists forensically on the disk. It was extremely tempting to do more work with this, but we settled for notifying our contact at DoC, and thus far, we haven’t been called on to extract it.

We required a number of tools to perform this work. How we maintain the knowledge of this work, and how we maintain the tools used, are two important questions. I don't have an answer for the latter, while this blog serves in some way as documentation of the former.

What we received from DoC was old, but it wasn't a problem that it was old. The right tools enabled this work to be done fairly easily – and that goes for any organisation willing to put modest tools such as the KryoFlux, and other legacy equipment, in the hands of their analysts and researchers. The disks were in good shape too. The curveball in this instance was that some of the pieces of the puzzle we were interacting with were weirder than expected: a slightly different file system, and a word-processing format that encoded data in an unexpected way, making 1:1 extraction and use a little more difficult. We got around it though. And indeed, as it stands, this wasn't a preservation exercise; it was a low-cost and pragmatic exercise to support appraisal, continuity, and potential future preservation. The files have been delivered to DoC in their various forms: disk images, extracted objects, and migrated objects. We'll await a further nod from them to understand where we go next.

How trustworthy is the SCAPE Preservation Environment?

Over the last three and a half years, the SCAPE project worked in several directions in order to propose new solutions for digital preservation, as well as to improve existing ones. One of the results of this work is the SCAPE preservation environment (SPE). It is a loosely coupled system which enables extending existing digital repository systems (e.g. RODA) with several components that cover collection profiling (C3PO), preservation monitoring (SCOUT) and preservation planning (Plato). These components address key functionalities defined in the Open Archival Information System (OAIS) functional model.

Establishing trustworthiness of digital repositories is a major concern of the digital preservation community, as it makes the threats and risks within a digital repository understandable. Several approaches to addressing trust in digital repositories have been developed over recent years. The most notable is Trustworthy Repositories Audit and Certification (TRAC), which was later promoted to an ISO standard by the International Organization for Standardization (ISO 16363, released in 2012). The standard comprises three pillars – organizational infrastructure, digital object management, and infrastructure and security risk management – and for each of these it provides a set of requirements and the evidence expected for compliance.

A recently published white paper reports on the work done to validate the SCAPE Preservation Environment against ISO 16363 – a framework for Audit and Certification of Trustworthy Digital Repositories. The work aims to demonstrate that a preservation ecosystem composed of building blocks such as the ones developed in SCAPE is able to comply with most of the system-related requirements of ISO 16363.

Of a total of 108 metrics included in the assessment, the SPE fully supports 69. 31 metrics were considered to be “out of scope”, as they refer to organisational issues that cannot be solved by technology alone, nor can they be analysed outside the framework of a breathing organisation; this leaves 2 metrics considered “partially supported” and 6 metrics considered “not supported”. This gives an overall compliance level of roughly 90% (69 of the 77 in-scope metrics) if the organisation-oriented metrics are not taken into account.

This work also enabled us to identify the main weak points of the SCAPE Preservation Environment that should be addressed in the near future. In summary the gaps found were:

  • The ability to manage and maintain contracts or deposit agreements through the repository user interfaces;
  • Support for tracking intellectual property rights;
  • Improve technical documentation, especially on the conversion of Submission Information Packages (SIP) into Archival Information Packages (AIP);
  • The ability to aid the repository manager to perform better risk management.

Our goal is to ensure that the SCAPE Preservation Environment fully supports the system-related metrics of the ISO 16363. In order to close the gaps encountered, additional features have been added to the roadmap of the SPE.

To get your hands on the full report, please go to http://www.scape-project.eu/wp-content/uploads/2014/09/SCAPE_MS63_KEEPS-V1.0.pdf

 

Digital Preservation Sustainability on the EU Policy Level - a workshop report

On Monday 8 September 2014 APARSEN and SCAPE together hosted a workshop, called ‘Digital Preservation Sustainability on the EU Policy Level’. The workshop was held in connection with the conference Digital Libraries 2014 in London.

The room for the workshop was ‘The Great Hall’ at City University London – a lovely, old, large room with a stage at one end and lots of space for the 12 stalls featuring the invited projects and  the 85 attendees.

The first half of the workshop was dedicated to a panel session. The three panellists each had 10-15 minutes to present their views on both the achievements and future of digital preservation, followed by a discussion moderated by Hildelies Balk from the Royal Library of the Netherlands, with real time visualisations made by Elco van Staveren.

‘As a community we have failed’

With these words David Giaretta, Director of APARSEN (see presentation and visualisation), pinpointed the fact that there will be no EU funding for digital preservation research in the future and that the EU expects to see some results from the €100 million already distributed. The EU sees data as the new gold, and we should start mining it! A big difference between gold and data, though, is that gold does not perish, whereas data does.

The important thing to do is to create some results – ‘a rising tide floats all boats’ – if we can at least show something that can be used, that will help to fund the rest of the preservation work.

Let’s climb the wall!

David Giaretta was followed by Ross King, Project Coordinator of SCAPE (see presentation and visualisation), who started his presentation with a comparison between the two EU projects Planets and SCAPE - the latter being a follow-up project from the first. Many issues already addressed in Planets were further explored and developed in SCAPE, the biggest difference being scalability – how to handle large volumes, scalability in planning processes, more automation etc. – which was the focal point of SCAPE.

To Ross King there were three lessons learned from working with Planets and SCAPE:

  • there is still a wall between Production on one side and Research & Development on the other,
  • the time issue – although libraries, archives etc. work with long-term horizons, most businesses have a planning horizon of five years or less,
  • format migration may not be as important as we thought it was.

Who will pay?

Ed Fay, Director of the Open Planets Foundation (see presentation and visualisation), opened with the message that by working in digital preservation we have a great responsibility to help define the future of information management. With no future EU-funded projects, community collaboration at all levels is needed more than ever. Shared services and infrastructure are essential.

The Open Planets Foundation was founded after the Planets project to help sustain the results of that project. Together with SCAPE and other projects, OPF is now trying to mature tools so they can be widely adopted and sustained (see the SCAPE Final Sustainability Plan).

There are a lot of initiatives and momentum, from DPC, NDIIPP or JISC to OPF or APA – but what will the future look like? How do we ensure that initiatives are aligned up to the policy level?

Sustainability is about working out who pays – and when…

If digital preservation were delivering business objectives we wouldn't be here to talk about sustainability – it would just be embedded in how organisations work. We are not there yet!

A diverse landscape with many facets

The panellists' presentations were followed by questions from the audience, mostly concerned with the approach to risk. During the discussion it was noted that although the three presenters see the digital landscape from different viewpoints, they all agree on its importance. People do need to preserve, and to get digital value from doing so. The DP initiatives and organisations are the shop window; their members have lots of skills that the market could benefit from.

The audience was asked whether they found it important to have a DP community – apparently nobody disagreed! And it seemed that almost everyone was a member of OPF, APARSEN or other similar initiatives.

There are not many H2020 digital preservation bids. In earlier days everybody had several proposals running in these rounds, but this is not catastrophic – good research has been done and now we want the products to be consolidated. We would like to reach a point where digital preservation is an infrastructure service as obvious as your email. But we are not there yet!

Appraisal and ingest are still not solved – we need to choose the data to be preserved, especially when talking about petabytes!

The discussion was wrapped up by walking through the visualisation made by Elco van Staveren.

An overall comment was that even though there is no money directed towards digital preservation, there is still lots of money for problems that can be solved by digital preservation. It is important that the digital preservation community thinks of itself NOT as the problem but as part of the solution. And although the visualisation is mostly about sustainability, risks still play an important part: if you cannot explain the risk of doing nothing, you cannot persuade anyone to pay!

Clinic with experts

After the panel and one minute project elevator pitches there was a clinic session at which all the different projects could present themselves and their results at different stalls. A special clinic table was in turn manned by experts from different areas of digital preservation.

The projects involved in the clinic were:

This was the time to meet a lot of different people from the Digital Preservation field, to catch up and build new relations.  For a photo impression of the workshop see: http://bit.ly/1u7Lmnq.

And the winner is....

 

Which message do YOU want to send to the EU for the future of Digital Preservation projects?

 

At the close of the workshop the winning tweet and two runners-up were announced – three very different messages to the EU altogether. One runner-up tweet urged the EU to allow for a small sustainability budget for at least 5 years after a project formally ends. The other runner-up included the question 'Will this tweet be preserved?', which – very appropriately – has by now already been deleted and is thus seemingly lost forever.

But we are proud to announce:

 

The winner! : The words of Galadriel "Much that once was is lost, for none now live who remember it" must not come true

 

More about the workshop in the official SCAPE/APARSEN workshop blogs – soon to be published!


Our digital legacy: shortlist announced for the Digital Preservation Awards 2014
Created in 2004 to raise awareness about digital preservation, the Digital Preservation Awards are the most prominent celebration of achievement for those people and organisations that have made significant and innovative contributions to ensuring our digital memory is accessible tomorrow.
 
‘In its early years, the Digital Preservation Award was a niche category in the Conservation Awards’, explained Laura Mitchell, chair of the DPC. ‘But year on year the judges have been impressed by the increasing quality, range and number of nominations.’ 
 
‘I’m delighted to report that, once again, we have had a record number of applications which demonstrate an incredible depth of insight and subtlety in approach to the thorny question of how to make our digital memory accessible tomorrow. ’
 
The judges have shortlisted thirteen projects in four categories:
 
The OPF Award for Research and Innovation which recognises excellence in practical research and innovation activities.
  • Jpylyzer by the KB (Royal Library of the Netherlands) and partners
  • The SPRUCE Project by The University of Leeds and partners
  • bwFLA Functional Long Term Archiving and Access by the University of Freiburg and partners
 
The NCDD Award for Teaching and Communications, recognising excellence in outreach, training and advocacy. 
  • Practical Digital Preservation: a how to guide for organizations of any size by Adrian Brown
  • Skilling the Information Professional by Aberystwyth University
  • Introduction to Digital Curation: An open online UCLeXtend Course by University College London
 
The DPC Award for the Most Distinguished Student Work in Digital Preservation, encouraging and recognising student work in digital preservation. 
  • Voices from a Disused Quarry by Kerry Evans, Ann McDonald and Sarah Vaughan, University of Aberystwyth
  • Game Preservation in the UK by Alasdair Bachell, University of Glasgow
  • Emulation v Format Conversion by Victoria Sloyan, University College London

 

The DPC Award for Safeguarding the Digital Legacy, which celebrates the practical application of preservation tools to protect at-risk digital objects. 

  • Conservation and Re-enactment of Digital Art Ready-Made, by the University of Freiburg and Partners
  • Carcanet Press Email Archive, University of Manchester
  • Inspiring Ireland, Digital Repository of Ireland and Partners
  • The Cloud and the Cow, Archives and Records Council of Wales
‘The competition this year has been terrific’, said Louise Lawson of Tate, chair of the judges. ‘Very many strong applications, which would have won the competition outright in previous years, have not even made the shortlist this time around.’
 
The Digital Preservation Awards have been celebrating excellence for 10 years now and are supported by some leading organisations in the field, including the NCDD and the Open Planets Foundation. The ceremony will be hosted by the Wellcome Trust, whose newly refurbished London premises will add to the glamour of the awards evening on Monday 17th November.
 
The finalists will attract significant publicity and a deserved career boost, both at organisation and individual level. Those who walk away with a Digital Preservation Award on the night can be proud to claim to be amongst the best projects and practitioners within a rapidly growing and international field.
 
‘Our next step is to open the shortlist to public scrutiny’, explained William Kilbride of the DPC. ‘We will be giving instructions shortly on how members of the DPC can vote for their favourite candidates. 
 
‘We have decided not to shortlist for the ‘The DPC Award for the Most Outstanding Digital Preservation Initiative in Industry’. Although the field was strong the judges didn’t feel it was competitive enough. We will be making a separate announcement about that in due course.
 
Notes:
For more about the Digital Preservation Awards see: http://www.dpconline.org/advocacy/awards
For more about the Digital Preservation Coalition see: http://www.dpconline.org/
For press interviews contact William Kilbride on (william_at_dpconline.org)

 

My first Hackathon - Hacking on PDF Files

Preserving PDF - identify, validate, repair

22 participants from 8 countries – the UK, Germany, Denmark, the Netherlands, Switzerland, France, Sweden and the Czech Republic – not to forget the umpteen thousand defective or otherwise interesting PDF files brought to the event.

Not only is this my first blog entry on the OPF website, it is also about my first hackathon. I guess it was Michelle's idea in the first place to organise a hackathon with the Open Planets Foundation on the PDF topic and to have the event in our library in Hamburg. I am located in Kiel, but as we are renewing our parquet floor in Kiel at the moment, the room situation in Hamburg is much better (furthermore, it's Hamburg that has the big airport).

The preparation for the event was pretty intense for me. Not only did the organisation in Hamburg (food, rooms, water, coffee, dinner event) have to be done; much more intense was the preparation for the hacking itself.

I am a library and information scientist, not a programmer. Sometimes I would rather be a programmer, considering my daily best-of problems, but you should dress for the body you have, not for the body you'd like to have.

Having learned the little I know about writing code within the last 8 months, and most of it just since this July, I am still brand new to it. As there is always a so-called "summer break" (which means that everybody else is on holiday and I actually have time to work on difficult stuff), I had some very intense Skype calls with Carl from the OPF, who enabled me to put all my work-in-progress PDF tools on GitHub. I learned about Maven and Travis and had not quite recovered when the hackathon actually started this Monday and we all had to install a virtual Ubuntu machine to be able to try out some best-of tools like DROID, Tika and Fido and run them over our own PDF files.

We had Olaf Drümmer from the PDF Association as our keynote speaker for both days. On the first day, he gave us insights into PDF and PDF/A – and when I say insights, I really mean that – talking about the building blocks of a PDF, the basic object types and the encoding possibilities. This was much better than trying to understand the 756-page PDF 1.7 specification by myself, alone in the office, with sentences like "a single object of type null, denoted by the keyword null, and having a type and value that are unequal to those of any other object".

We learned about the many different kinds of page content, the page being the most important structural unit of a PDF file, and about the fact that a PDF page can have any size you can think of, although Acrobat 7.0 officially only supports page dimensions up to 381 km. On the second day, we learned about PDF(/A) validation and what would theoretically be needed to build the perfect validator. Talking about the PDF and PDF/A specifications and all the specifications quoted and referenced by these, I am under the impression that it would take some months to read them all – and so much is clear, somebody would have to read and understand them all. The complexity of the PDF format, the flexibility of the viewers and the plethora of users and users' needs will always ensure a heterogeneous PDF reality with all the strangeness and brokenness possible. As far as I remember, his guess is that about 10 years of manpower would be needed to build a perfect validator, if it could be done at all. Struck by these perfectly comprehensible estimates, it is probably not surprising that some of the participants had more questions at the end of the two days than they had at the beginning.

As PDF viewers tend to conceal problems and to display problematic PDF files in a decent way regardless, they are usually no big help in terms of PDF validation or ensuring long-term availability – quite the contrary.

Some errors can have a big impact on the long-term availability of PDF files, especially content that is only referred to rather than embedded within the file, and that might simply be lost over time. On the other hand, the "invalid page tree node" that JHOVE, for example, likes to put its finger on is not an error, but just a hint that the page tree is not balanced and a page cannot be found in the most efficient way. Even if all the pages were just saved as an array and you had to iterate through the whole array to get to a certain page, this would only slow down loading; it would not prevent anybody from accessing the page they want to read, especially if the affected PDF document only has a couple of dozen pages.

During the afternoon of the first day, we collected the specific problems everybody has and formed working groups, each engaging with a different problem. One working group (around Olaf) started to seize on JHOVE error messages, trying to figure out which ones really carry a risk – and what they actually mean in the first place. Some of the error messages definitely describe real, existing errors where a rule or specification is violated, but which will practically never cause any problems displaying the file. Is this really an error then? Or just bureaucracy? Should a good validator even report this as an error – which formally would be the right thing to do – or not disturb the user unnecessarily?

Another group wanted to create a small Java tool with CSV output that looks into a PDF file and reports which software created it and which validation errors it contains, starting with PDFBox, as this was easy to implement in Java. We got as far as getting the tool working, but as we brought especially broken PDF files to the event, it is not yet able to cope with all of them; we still have to make it error-proof.
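Purely as an illustration of the idea (not the Java/PDFBox code we actually wrote), a comparable sketch in Python using the pypdf library might look like this, with a hypothetical folder of sample files:

import csv
from pathlib import Path
from pypdf import PdfReader

with open("pdf_report.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "producer", "creator", "error"])
    for path in Path("sample-pdfs").glob("*.pdf"):   # hypothetical folder
        producer = creator = error = ""
        try:
            meta = PdfReader(str(path)).metadata
            if meta:
                producer = meta.producer or ""
                creator = meta.creator or ""
        except Exception as exc:    # the broken files are the interesting ones
            error = repr(exc)
        writer.writerow([path.name, producer, creator, error])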

By the way, it is really nice to be surrounded by people who obviously live in the same nerdy world as I do. When I told them I could not wait to see our new tool's output and was eager to analyse the findings, the answer was just "And neither can I". Usually, I just get frowns and "I do not get why you are interested in something so boring" faces.

A third working group went to another room and tested the existing tools against the PDF samples people had brought, in the virtual Ubuntu environment.

There were more ideas; some of them seemed too difficult or too ambitious to solve in such a short time, but some of us are determined to have a follow-up event soon.

For example, Olaf stated that sometimes text extraction from a PDF file does not work, and the participant sitting next to me suggested we could start checking the output against dictionaries to see whether it still makes sense. "But there are so many languages," I told him, thinking about my library's content. "Well, start with one," he answered, following the idea that a big problem can often be split into several small ones.

Another participant would like to know more about the quality and compression of the JPEGs embedded within his PDF files, but some others doubted this information could still be retrieved.

When the event was over on Tuesday around 5 pm, we were all tired but happy, with clear ideas or interesting new problems in our heads.

And just because I was already asked this today, as I might still look slightly tired: we did sleep during the night. We did not hack all the way through it or sleep on mattresses in our library. Some of us had quite a few pitchers of beer during the evening, but I am quite sure everybody made it to his or her hotel room.

Twitter Hashtag #OPDFPDF

User-Driven Digital Preservation

We recently posted an article on the UK Web Archive blog that may be of interest here, User-Driven Digital Preservation, where we summarise our work with the SCAPE Project on a little prototype application that explores how we might integrate user feedback and preservation actions into our usual discovery and access processes. The idea is that we need to gather better information about which resources are difficult for users to use, and which formats they would prefer, so that we can use this data to drive our preservation work.

The prototype also provides a convenient way to run Apache Tika and DROID on any URL, and exposes the contents of its internal 'format registry' as a set of web pages that you can browse through (e.g. here's what it knows about text/plain). It only supports a few preservation actions right now, but it does illustrate what might be possible if we can find a way to build a more comprehensive and sustainable system.

When (not) to migrate a PDF to PDF/A

It is well-known that PDF documents can contain features that are preservation risks (e.g. see here and here). Migration of existing PDFs to PDF/A is sometimes advocated as a strategy for mitigating these risks. However, the benefits of this approach are often questionable, and the migration process can also be quite risky in itself. As I often get questions on this subject, I thought it might be worthwhile to do a short write-up on this.

PDF/A is a profile

First, it's important to stress that each of the PDF/A standards (A-1, A-2 and A-3) is really just a profile within the PDF format. More specifically, PDF/A-1 offers a subset of PDF 1.4, whereas PDF/A-2 and PDF/A-3 are based on the ISO 32000 version of PDF 1.7. What these profiles have in common is that they prohibit some features (e.g. multimedia, encryption, interactive content) that are allowed in 'regular' PDF. Also, they narrow down the way other features are implemented, for example by requiring that all fonts are embedded in the document. This can be illustrated with the simple Venn diagram below, which shows the feature sets of the aforementioned PDF flavours:

PDF Venn diagram

Here we see how PDF/A-1 is a subset of PDF 1.4, which in turn is a subset of PDF 1.7. PDF/A-2 and PDF/A-3 (aggregated here as one entity for the sake of readability) are subsets of PDF 1.7, and include all the features of PDF/A-1.

Keeping this in mind, it's easy to see that migrating an arbitrary PDF to PDF/A can result in problems.

Loss, alteration during migration

Suppose, as an example, that we have a PDF that contains a movie. This is prohibited in PDF/A, so migrating to PDF/A will simply result in the loss of the multimedia content. Another example is fonts: all fonts in a PDF/A document must be embedded. But what happens if the source PDF uses non-embedded fonts that are not available on the machine on which the migration is run? Will the migration tool exit with a warning, or will it silently use some alternative, perhaps similar font? And how do you check for this?
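One rough way to check for this up front, assuming poppler's pdffonts utility is available, is to look for fonts whose 'emb' column is 'no' before attempting any migration (an illustrative sketch only, with a hypothetical file name; it slices on the header line because font names and types may contain spaces):

import subprocess

def non_embedded_fonts(pdf_path):
    out = subprocess.run(["pdffonts", pdf_path], capture_output=True,
                         text=True, check=True).stdout.splitlines()
    header, rows = out[0], out[2:]        # out[1] is the dashed separator line
    emb_col = header.index("emb")         # columns are fixed-width aligned
    return [row for row in rows if row[emb_col:emb_col + 3].strip() == "no"]

fonts = non_embedded_fonts("example.pdf")   # hypothetical file name
if fonts:
    print("Non-embedded fonts found; a PDF/A migration may substitute them:")
    print("\n".join(fonts))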

Complexity and effect of errors

Also, migrations like these typically involve a complete re-processing of the PDF's internal structure. The format's complexity implies that there's a lot of potential for things to go wrong in this process. This is particularly true if the source PDF contains subtle errors, in which case the risk of losing information is very real (even though the original document may be perfectly readable in a viewer). Since we don't really have any tools for detecting such errors (i.e. a sufficiently reliable PDF validator), these cases can be difficult to deal with. Some further considerations can be found here (the context there is slightly different, but the risks are similar).

Digitised vs born-digital

The origin of the source PDFs may be another thing to take into account. If PDFs were originally created as part of a digitisation project (e.g. scanned books), the PDF is usually little more than a wrapper around a bunch of images, perhaps augmented by an OCR layer. Migrating such PDFs to PDF/A is pretty straightforward, since the source files are unlikely to contain any features that are not allowed in PDF/A. At the same time, this also means that the benefits of migrating such files to PDF/A are pretty limited, since the source PDFs weren't problematic to begin with!

The potential benefits of PDF/A may be more obvious for a lot of born-digital content; however, for the reasons listed in the previous section, the migration is more complex, and there's just a lot more that can go wrong (see also here for some additional considerations).

Conclusions

Although migrating PDF documents to PDF/A may look superficially attractive, it is actually quite risky in practice, and it may easily result in unintentional data loss. Moreover, the risks increase with the number of preservation-unfriendly features, meaning that the migration is most likely to be successful for source PDFs that weren't problematic to begin with, which defeats the very purpose of migrating to PDF/A. For specific cases, migration to PDF/A may still be a sensible approach, but the expected benefits should be weighed carefully against the risks. In the absence of stable, generally accepted tools for assessing the quality of PDFs (both source and destination!), it would also seem prudent to always keep the originals.

Meet SCAPE, APARSEN and many more….

SCAPE and APARSEN have joined forces and are hosting a free workshop, ‘Digital Preservation Sustainability on the EU Policy Level’, in connection with the upcoming DL2014 conference in London.

The first part of the workshop will be a panel session at which David Giaretta (APARSEN), Ross King (SCAPE), and Ed Fay (OPF) will be discussing digital preservation.

After this a range of digital preservation projects will be presented at different stalls. This part will begin with an elevator pitch session at which each project will have exactly one minute to present their project.

Everybody is invited to visit all stalls and learn more about the different projects, their results and thoughts on sustainability. At the same time there will be a special ‘clinic’ stall at which different experts will be ready to answer any questions you have on their specific topic – for instance PREMIS metadata or audit processes.

The workshop takes place at City University London, 8 September 2014, 1pm to 5pm.

Looking forward to meeting you!

 

Read more about the workshop

Register for the workshop (please note: registration for this workshop should not be done via the DL registration page)

Read more about DL2014

Oh, did I forget? We also have a small competition going on… Read more.

 

When is a PDF not a PDF? Format identification in focus.

In this post I'll be taking a look at format identification of PDF files and highlighting a difference in opinion between format identification tools. Some of the details are a little dry but I'll restrict myself to a single issue and be as light on technical details as possible. I hope I'll show that once the technical details are clear it really boils down to policy and requirements for PDF processing.

Assumptions

I'm considering format identification in its simplest role as first contact with a file that little, if anything, is known about. In these circumstances the aim is to identify the format as quickly and accurately  as possible then pass the file to format specific tools for deeper analysis.

I'll also restrict the approach to magic number identification rather than trusting the file extension; more on this a little later.

Software and data

I performed the tests using the selected GovDocs corpus (that's a large download BTW) that I mentioned in my last post. I chose four format identification tools to test:

  • the fine free file utility (also known simply as file),
  • DROID,
  • FIDO, and
  • Apache Tika.

I used versions that were as up to date as possible but will spare you the details until I publish the results in full.

So is this a PDF?

There was plenty of disagreement between the results from the different tools; I'll be showing these in more detail at our upcoming PDF Event. For now I'll focus on a single issue: there is a set of files that FIDO and DROID don't identify as PDFs but that file and Tika do. I've attached one example to this post. Google Chrome won't open it but my Ubuntu-based document viewer does. It's a three-page PDF about rumen microbiology, and this was obviously the intention of the creator. I've not systematically tested multiple readers yet, but LibreOffice won't open it while Ubuntu's print preview will. Feel free to try the reader of your choice and comment.

What's happening here?

It appears we have a malformed PDF, and that is indeed the case. The issue is caused by a difference in the way the tools go about identifying PDFs in the first place. This is where it gets a little dull, but bear with me. All of these tools use "magic" or "signature" based identification, meaning they look for (hopefully) unique strings of characters at specific positions in the file to work out the format. Here's the Tika 1.5 signature for PDF:

<match value="%PDF-" type="string" offset="0"/>

What this says is: look for the string %PDF- (the value) at the start of the file (offset="0") and, if it's there, identify the file as a PDF. The attached file indeed starts:

%PDF-1.2

meaning it's a PDF version 1.2. Now we can have a look at the DROID signature (signature file version 77) for PDF 1.2:

<InternalSignature ID="125" Specificity="Specific">
    <ByteSequence Reference="BOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="0" SubSeqMinOffset="0">
            <Sequence>255044462D312E32</Sequence>
            <DefaultShift>9</DefaultShift>
            <Shift Byte="25">8</Shift>
            <Shift Byte="2D">4</Shift>
            <Shift Byte="2E">2</Shift>
            <Shift Byte="31">3</Shift>
            <Shift Byte="32">1</Shift>
            <Shift Byte="44">6</Shift>
            <Shift Byte="46">5</Shift>
            <Shift Byte="50">7</Shift>
        </SubSequence>
    </ByteSequence>
    <ByteSequence Reference="EOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="1024" SubSeqMinOffset="0">
            <Sequence>2525454F46</Sequence>
            <DefaultShift>-6</DefaultShift>
            <Shift Byte="25">-1</Shift>
            <Shift Byte="45">-3</Shift>
            <Shift Byte="46">-5</Shift>
            <Shift Byte="4F">-4</Shift>
        </SubSequence>
    </ByteSequence>
</InternalSignature>
This is a little more complex than Tika's signature, but what it says is that a matching file should start with the string %PDF-1.2, which our sample does. This is in the first <ByteSequence Reference="BOFoffset"> section, a beginning of file offset. Crucially, this signature adds another condition: that the file contains the string %%EOF within 1024 bytes of the end of the file. There are two things that are different here.
 
The start condition change, i.e. Tika's "%PDF-" vs. DROID's "%PDF-1.2", is there to support DROID's capability to identify versions of formats. Tika simply detects that a file looks like a PDF and returns the application/pdf MIME type, so it has a single signature for the job. DROID can distinguish between versions and so has 29 different signatures for PDF. It's also NOT the cause of the problem. The disagreement between the results is caused by DROID's requirement for a valid end-of-file marker, %%EOF. A hex search of our PDF confirms that it doesn't contain an %%EOF marker.
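To make the difference concrete, here's a minimal sketch in Python of the two matching strategies. This is not the tools' actual code, and the file name is just a placeholder for the attached example.

# Tika-style: header check only. DROID-style: header check plus the %%EOF
# marker somewhere in the last 1024 bytes of the file.

def tika_style_match(path):
    """Identify as PDF if the file starts with '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

def droid_style_match(path):
    """Identify as PDF 1.2 only if the header matches AND '%%EOF'
    appears within the last 1024 bytes of the file."""
    with open(path, "rb") as f:
        header_ok = f.read(8) == b"%PDF-1.2"
        f.seek(0, 2)                     # jump to end of file
        f.seek(max(0, f.tell() - 1024))  # back up at most 1024 bytes
        trailer_ok = b"%%EOF" in f.read()
    return header_ok and trailer_ok

if __name__ == "__main__":
    sample = "rumen-microbiology.pdf"   # placeholder name for the attached file
    print("Tika-style:", tika_style_match(sample))    # True
    print("DROID-style:", droid_style_match(sample))  # False: no %%EOF marker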

So who's right?

An interesting question. The PDF 1.3 Reference states:

The last line of the file contains only the end-of-file marker, %%EOF. (See implementation note 15 in Appendix H.)

The referenced implementation note reads:

3.4.4, “File Trailer”
15. Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

So DROID's signature is indeed to the letter of the law plus amendments. It's really a matter of context when using the tools. Does DROID's signature introduce an element of format validation to the identification process? In a way yes, but understanding what's happening and making an informed decision is what really matters.

What's next?

I'll be putting some more detailed results onto GitHub along with a VM demonstrator. I'll tweet and add a short post when this is finished; it may have to wait until next week.

Attachment: It looks like a PDF to me.... (44.06 KB)
Win an e-book reader!

On September 8 the SCAPE/APARSEN workshop Digital Preservation Sustainability on the EU Level will be held at City University London in connection with the DL2014 conference.

The main objective of the workshop is to provide an overview of solutions to challenges within Digital Preservation Sustainability developed by current and past Digital Preservation research projects. The event brings together various EU projects/initiatives to present their solutions and approaches, and to find synergies between them.

In connection with the workshop Digital Preservation Sustainability on the EU Level, SCAPE and APARSEN are launching a competition:

 

Which message do YOU want to send to the EU for the future of Digital Preservation projects?

 

You can join the competition on Twitter. Only tweets including the hashtag #DP2EU will be entered in the competition. You are allowed to include a link to a text OR one picture with your message. Messages which contain more than 300 characters in total are excluded from the competition, though.

The competition will close September 8th at 16:30 UK time. The workshop panel will then choose one of the tweets as a winner. The winner will receive an e-book reader as a prize.

 

There are only a few places left for the workshop. Registration for the workshop is FREE and must be completed by filling out the form at http://bit.ly/DPSustainability. Please don't register for this workshop on the DL2014 registration page, since this workshop is free of charge!

 

Coming to "Preserving PDF - identify, validate, repair" in Hamburg?

The OPF is holding a PDF event in Hamburg on 1st-2nd September 2014 where we'll be taking an in-depth look at the PDF format, its sub-flavours like PDF/A, and open source tools that can help. This is a quick list of things you can do to prepare for the event if you're attending and looking to get the most out of it.

Pre-reading

The Wikipedia entry on PDF provides a readable overview of the format's history with some technical details. Adobe provides a brief PDF 101 post that avoids technical detail.

Johan van der Knijff's OPF blog has a few interesting posts on PDF preservation risks.

This MacTech article is still a reasonable introduction to PDF for developers. Finally, if you really want a detailed look you could try the Adobe specification page, but it's heavyweight reading.

Tools

Below are brief details of the main open source tools we'll be working with. It's not essential that you download and install these tools. They all require Java and none of them have user-friendly install procedures. We'll be looking at ways to improve that at the event. We'll also be providing a pre-configured virtual environment to allow you to experiment in a friendly, throwaway setting. See the Software section a little further down.

JHOVE

JHOVE is an open source tool that performs format specific identification, characterisation and validation of digital objects. JHOVE can identify and validate PDF files against the PDF specification while extracting technical and descriptive metadata. JHOVE recognises PDFs that state that they conform to the PDF/A profile, but it can't then validate that a PDF conforms to the PDF/A specification.

Apache Tika

The Apache Foundation's Tika project is an application/toolkit that can be used to identify and parse many file formats, and to extract metadata and content from them.

Apache PDFBox

Written in Java, Apache PDFBox is an open source library for working with PDF documents. It's primarily aimed at developers but has some basic command line apps. PDFBox also contains a module that verifies PDF/A-1 documents, which has its own command line utility.

These libraries are of particular interest to Java developers, who can incorporate them into their own programs; Apache Tika uses the PDFBox libraries for PDF parsing.
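If you do want to experiment before the event, below is a rough sketch of driving the command-line versions of the tools from Python. The jar locations and names are placeholders, so check each tool's documentation for the exact invocation on your system.

# Rough sketch of calling the command-line versions of the tools from Python.
# The jar/launcher names below are placeholders; adjust them to wherever you
# downloaded the tools and check each tool's docs for the exact options.
import subprocess

SAMPLE = "example.pdf"

# Apache Tika: detect the MIME type of a file
subprocess.run(["java", "-jar", "tika-app.jar", "--detect", SAMPLE], check=True)

# JHOVE: identify and validate using the PDF module (via the jhove launcher script)
subprocess.run(["jhove", "-m", "PDF-hul", SAMPLE], check=True)

# Apache PDFBox: extract the text content to a file
subprocess.run(["java", "-jar", "pdfbox-app.jar", "ExtractText", SAMPLE, "example.txt"],
               check=True)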

Test Data

These test data sets were chosen because they're freely available. Again it's not necessary to download them before attending but they're good starting points for testing some of the tools or your code:

PDFs from GovDocs selected dataset

The original GovDocs corpus is a test set of nearly 1 million files and is nearly half a terabyte in size. The corpus was reduced in size by David Tarrant, who removed similar items, as described in this post. The remaining data set is still large at around 17GB and can be downloaded here.

Isartor PDF/A test suite

The Isartor test suite is published by the PDF Association's PDF/A Competence Center. In their own words:

This test suite comprises a set of files which can be used to check the conformance of software regarding the PDF/A-1 standard. More precisely, the Isartor test suite can be used to “validate the validators”: It deliberately violates the requirements of PDF/A-1 in a systematic way in order to check whether PDF/A-1 validation software actually finds the violations.

More information about the suite can be found on the PDF Association's website along with a download link.

PDFs from OPF format corpus

The OPF has a GitHub repository where members can upload files that represent preservation risks / problems. This has a couple of sub-collections of PDFs: these show problem PDFs from the GovDocs corpus, and this is a collection of PDFs with features that are "undesirable" in an archive setting.

Software

If you'd like the chance to get hands-on with the software tools at the event and try some interactive demonstrations / exercises, we'll be providing lightweight virtualised demonstration environments using VirtualBox and Vagrant. It's not essential that you install the software to take part, but it does offer the best way to try things for yourself, particularly if you're not a techie. These are available for Windows, Mac, and Linux and should run on most people's laptops; download links are shown below.

Vagrant downloads page:

Oracle VirtualBox downloads page:

Be sure to install the VirtualBox extensions as well; it's the same download for all platforms.

What next?

I'll be writing another post for Monday 18th August that will take a look at using some of the tools and test data together with a brief analysis of the results. This will be accompanied by a demonstration virtual environment that you can use to repeat the tests and experiment yourself.

EaaS: Image and Object Archive — Requirements, Implementation and Example Use-Cases
bwFLA's Emulation-as-a-Service makes emulation widely available for non-experts and could establish emulation as a valuable tool in digital preservation workflows. Providing these emulation services for access to preserved and archived digital objects poses further challenges to data management. Digital artifacts are usually stored and maintained in dedicated repositories, and object owners want to, or are required to, stay in control over their intellectual property. This article discusses the problem of managing virtual images, i.e. virtual hard disks bootable by an emulator, and derivatives thereof, but the proposed solution can be applied to any digital artifact.

Requirements

Once a digital object is stored in an archive and an appropriate computing environment has been created for access, this environment should be immutable and should not be modified except explicitly through an administrative interface. This guarantees that a memory institution's digital assets are unaltered by the EaaS service and remain available in the future. Immutability, however, is not easy to handle for most emulated environments. Just booting the operating system may change an environment in unpredictable ways. When the emulated software writes parts of this data and reads it again, however, it probably expects the read data to reflect the modifications. Furthermore, users who want to interact with the environment should be able to change or customize it. Therefore, data connectors have to provide write access for the emulation service even though they cannot write the data back to the archive.
 
The distributed nature of the EaaS approach requires efficient network transport of data to allow for immediate data access and usability. However, digital objects stored in archives can be quite large. In the case of a hard disk image, the installed operating system together with installed software can easily grow to several GB in size. Even with today's network bandwidths, copying these digital objects in full to the EaaS service may take minutes and affect the user experience.
 
While the archived amount of data is usually large, the data that is actually accessed frequently can be very small. In a typical emulator scenario, read access to virtual hard disk images is block-aligned and only very few blocks are actually read by the emulated system. Transferring only these blocks instead of the whole disk image file is typically more efficient, especially for larger files.
 
Therefore, the network transport protocol has to support random seeks and sparse reads without the need to actually copy the whole data file. While direct filesystem access provides these features if a digital object is locally available to the EaaS service, such access is not available in the general case of separate emulation and archive servers that are connected via the internet.

Implementation

The Network Block Device (NBD) protocol provides a simple client/server architecture that allows direct access to single digital objects as well as random access to the data stream within these objects. Furthermore, it can be completely implemented in userspace and does not require a complex software infrastructure to be deployed to the archives. 
 
In order to access digital objects, the emulation environment needs to reference them. Individual objects are identified in the NBD server by using unique export names. While the NBD URL schema directly identifies the digital object and the archive where it can be found, the data references are bound to the actual network location. In a long-term preservation scenario, where emulation environments, once curated, should outlast the single computer system that acts as the NBD server, this approach has obvious drawbacks. Furthermore, the cloud structure of EaaS allows any component that participates in the preservation effort to be interchanged, thus allowing for load balancing and fail-safety. This advantage of distributed systems is offset by static, hostname-bound references.

Handle It!

To detach the references from the object's network location, the Handle System is used as a persistent object identifier throughout our reference implementation. The Handle System provides a complete technological framework to deal with these identifiers (or "Handles" (HDL) in the Handle System) and constitutes a federated infrastructure that allows the resolution of individual Handles using decentralized Handle Services. Each institution that wants to participate in the Handle System is assigned a prefix and can host a Handle Service. Handles are then resolved by a central resolver by forwarding requests to these services according to the Handle's prefix. As the Handle System, as a sole technological provider, does not impose any strict requirements on the data associated with Handles, this system was used as the PI technology.
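As a rough illustration (not part of the bwFLA code, and using a made-up Handle), resolving a Handle to its registered values can be done through the public hdl.handle.net REST proxy:

# Minimal sketch of resolving a Handle via the public hdl.handle.net REST
# proxy. The Handle below is a made-up placeholder, and the exact set of
# value types stored for an image archive is an implementation detail.
import json
import urllib.request

def resolve_handle(handle):
    """Return the list of typed values registered for a Handle."""
    url = "https://hdl.handle.net/api/handles/" + handle
    with urllib.request.urlopen(url) as response:
        record = json.load(response)
    return record.get("values", [])

if __name__ == "__main__":
    for value in resolve_handle("12345/example-disk-image"):   # placeholder Handle
        print(value["type"], "->", value["data"])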

Persistent User Sessions and Derivatives

As digital objects (in this case the virtual disk image) are not to be modified directly in the archive by the EaaS service, a mechanism to store modifications locally  while reading unchanged data from the archive has to be implemented. Such a transparent write mechanism can be achieved using a copy-on-write access strategy. While NBD allows for arbitrary parts of the data to be read upon request, not requiring any data to be provided locally, data that is written through the data connector is tracked and stored in a local data structure. If a read operation requests a part of data that is already in this data structure, the previously changed version of the data should be returned to the emulation component. Similarly, parts of data that are not in this data structure were never modified and must be read from the original archive server. Over time, a running user session has its own local version of the data, but only those parts of data that were written are actually copied.
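The following is an illustrative sketch in Python of such a block-level copy-on-write overlay; it is a simplification of the idea, not the bwFLA implementation.

# Illustrative sketch of a block-level copy-on-write overlay: writes go to a
# local dictionary keyed by block number, while reads fall back to the
# read-only archive copy for any block that was never written.

BLOCK_SIZE = 4096

class CopyOnWriteOverlay:
    def __init__(self, archive_path):
        self._archive = open(archive_path, "rb")   # original object, read-only
        self._local = {}                           # block number -> modified bytes

    def read_block(self, n):
        if n in self._local:                       # block was modified locally
            return self._local[n]
        self._archive.seek(n * BLOCK_SIZE)         # untouched: read from the archive
        return self._archive.read(BLOCK_SIZE)

    def write_block(self, n, data):
        self._local[n] = data                      # never written back to the archive

    def changed_blocks(self):
        """Only these blocks need to be stored to create a derivative."""
        return dict(self._local)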
 
We used the qcow2 container format from the QEMU project to keep track of local changes to the digital object. Besides supporting copy-on-write, it features open documentation as well as a widely used and tested reference implementation with a comprehensive API, the QEMU Block Driver. The qcow2 format allows all changed data blocks, and the respective metadata for tracking these changes, to be stored in a single file. To define where the original blocks (before copy-on-write) can be found, a backing file definition is used. The Block Driver API provides a continuous view of this qcow2 container, transparently choosing either the backing file or the copy-on-write data structures as the source.
 
This mechanism allows modifications of data to be stored separately and independently from the original digital object during an EaaS user session, keeping every digital object in the original state in which it was preserved. Once the session has finished, these changes can be retrieved from the emulation component and used to create a new, derived data object. As any Block Driver format is allowed in the backing file of a qcow2 container, the backing file can itself be a qcow2 container. This allows "chaining" of a series of modifications as copy-on-write files that only contain the actually modified data. This greatly facilitates efficient storage of derived environments, as a single qcow2 container can directly be used in a binding without having to combine the original data and the modifications into a consolidated stream of data. However, this makes such bindings rely not only on the availability of the qcow2 container with the modifications, but also on the original data the qcow2 container refers to. Consolidation is still possible, though, and directly supported by the tools that QEMU provides to handle qcow2 files.
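As a rough illustration of how such a chain can be built and later consolidated with QEMU's standard tooling (the file names are placeholders, and the exact options vary slightly between qemu-img versions):

# Rough illustration of building and consolidating a qcow2 derivative chain
# with QEMU's own tooling. File names are placeholders.
import subprocess

def run(*args):
    subprocess.run(args, check=True)

# Create a derivative that stores only the blocks changed during a session;
# unchanged blocks are still read from the (unmodified) base image.
run("qemu-img", "create", "-f", "qcow2",
    "-b", "base-environment.qcow2", "-F", "qcow2",
    "session-derivative.qcow2")

# Derivatives can be chained: a further session layers on top of the previous one.
run("qemu-img", "create", "-f", "qcow2",
    "-b", "session-derivative.qcow2", "-F", "qcow2",
    "second-derivative.qcow2")

# Consolidation is still possible, e.g. by flattening the whole chain into a
# single standalone image.
run("qemu-img", "convert", "-O", "qcow2",
    "second-derivative.qcow2", "consolidated.qcow2")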
 
Once the data modifications and the changed emulation environment are retrieved after a session, both can be stored again in an archive to make this derivative environment available. Only those chunks of data that were actually changed by the user have to be retrieved. These, however, reference and remain dependent on the original, unmodified digital object. The derivative can then be accessed like any other archived environment. Since all derivative environments contain (stable) references to their backing files, modifications can be stored in a different image archive, as long as the backing file is available. Therefore, each object owner is in charge of providing storage for individualized system environments but is also able to protect its modifications without losing the benefits of shared base images.

Examples and Use-Cases

To provide a better understanding of the image archive implementation, the following three use-cases demonstrate how the current implementation works. Firstly, a so-called derivative is created: a tailored system environment suitable for rendering a specific object. In a second scenario, a container object (CD-ROM) is injected into the environment, which is then modified for object access, i.e. installation of a viewer application and adding the object to the autostart folder. Finally, an existing hard disk image (e.g. an image copy of a real machine) is ingested into the system. This last case requires, besides the technical configuration of the hardware environment, private files to be removed before public access.

Derivatives: Tailored Runtime Environments

Typically, an EaaS provider offers a set of so-called base images. These images contain a basic OS installation which has been configured to run on a certain emulated platform. Depending on the user's requirements, additional software and/or configuration may be needed, e.g. the installation of certain software frameworks, text processing or image manipulation software. This can be done by uploading or making available a software installation package. On our current demo instance this is done either by uploading individual files or a CD ISO image. Once the software is installed, the modified environment can be saved and made accessible for object rendering or similar purposes.
 

Object Specific Customization

In the case of complex CD-ROM objects with rich multimedia content from the 90s and early 2000s, e.g. encyclopedias and teaching software, a custom viewer application typically has to be installed to be able to render their content. For these objects, an already prepared environment (installed software, autostart of the application) would be useful and would surely improve the user experience during access, as "implicit" knowledge about using an outdated environment is no longer required to make use of the object. Since the number of archived media is large, duplicating, for instance, a Microsoft Windows environment for every one of them would add a few GBs of data to each object. Usually, neither the object's information content nor the current or expected user demand justify these extra costs. Using derivatives of base images, however, only a few MBs are required for each customized environment, since only the changed parts of the virtual image have to be stored for each object. In the case of the aforementioned collection of multimedia CD-ROMs, the derivative size varies between 348 KB and 54 MB.

 

Authentic Archiving and Restricted Access to Existing Computers

Sometimes it makes sense to preserve a complete user system, like the personal computer of Vilèm Flusser in the Vilèm Flusser Archive. Such complete system environments usually can be preserved by creating a hard disk image of the existing computer and using this image as the virtual hard disk for EaaS. Such hard disk images can, however, contain personal data of the computer's owner. While EaaS aims at providing interactive access to complete software environments, it is impossible to restrict this "interactiveness", e.g. to forbid access to a certain directory directly from the user interface. Instead, our approach to this problem is to create a derivative work with all the personal data stripped from the system. This allows users with sufficient access permissions (e.g. family or close friends) to access the original system including personal data, while the general public only sees a computer with all the personal data removed.

Conclusion

With our distributed architecture and an efficient network transport protocol, we are able to provide Emulation as a Service quite efficiently while at the same time allowing owners of digital objects to remain in complete control over their intellectual property. Using copy-on-write technology it is possible to create a multitude of different configurations and flavors of the same system with only minimal storage requirements. Derivatives and their respective "parent" system can be handled completely independently of each other, and withdrawing access permissions for a parent will automatically invalidate all existing derivatives. This allows for a very efficient and flexible handling of curation processes that involve the installation of (licensed) software, personal information and user customizations.

Open Planets members can test the aforementioned features using the bwFLA demo instance. Get the password here: http://wiki.opf-labs.org/display/PT/bwFLA+test+demo+instance

A VM4C3PO

We have just set up a Vagrant environment for C3PO. It starts a headless vm where the C3PO-related functionalities (MongoDB, Play, a downloadable command-line jar) are manageable from the host's browser. Further, the vm itself has all relevant processes configured at start-up independently from Vagrant, so it can, once created, be downloaded and used as a stand-alone C3PO vm. We think this could be a scenario applicable to other SCAPE projects as well. The following is a summary of the ideas we've had and the experience we've gained.

The Result

The Vagrantfile and a directory containing all vagrant-relevant files live directly in the root directory of the C3PO repository. So after installing Vagrant and cloning the repository, a simple 'vagrant up' should do all the work: downloading the base box, installing the necessary software and booting the new vm.

After a few minutes one should have a running vm that is accessible from the host's browser at localhost:8000. This opens a central welcome page that contains information about the vm-specific aspects and links to the Play framework's URL (localhost:9000) and the MongoDB admin interface (localhost:28017). It also provides a download link for the command-line jar, which has to be used in order to import data. This can be used from outside the vm, as the MongoDB port is mapped as well. So I can import and analyse data with C3PO without having to work through the setup challenges myself, and, believe me, that way can be long and stony.

The created image is self-contained in the sense that, if I put it on a server, anyone who has VirtualBox installed can download it and use it, without having to rely on Vagrant working on their machine.

General Setup

The provisioning script has a number of tasks:

  • it downloads all required dependencies for building the C3PO environment
  • it installs a fresh C3PO (from /vagrant, which is the shared folder connection between the git repository and the vm) and assembles the command-line app
  • it installs and runs a Mongodb server
  • it installs and runs the Playframework
  • it creates a port-forwarded static welcome page with links to all the functionalities above
  • it adds all of the above to the native Ubuntu startup (using /etc/rc.local, if necessary), so that an image of the vm can theoretically be run independently of the vagrant environment

These are all trivial steps, but it can make a difference not having to manually implement all of them.

Getting rid of proxy issues

In case you're behind one of those very common NTLM company proxies, you'll really like that the only thing you have to provide is a config script with some details about your proxy. If the setup script detects this file, it will download the necessary software and configure Maven to use it. Doing it this way was actually the first time I got Maven running smoothly on a Linux VM behind our proxy.

Ideas for possible next steps

There is loads left to do, here are a few ideas:

  • provide interesting initial test-data that ships with the box, so that people can play around with C3PO without having to install/import anything at all.
  • why not have a vm for more SCAPE projects? We could quickly create a repository for something like a SCAPE base vm configuration that is usable as a base for other vms. The central welcome page could be pre-configured (SCAPE branded), as well as all the proxy- and development-environment-related stuff mentioned above.
  • I'm not sure about the sustainability of shell provisioning scripts as the complexity of the bootstrap process increases. Grouping the shell commands in functions is certainly an improvement, though it might be worth checking out other, more dynamic provisioners. One I find particularly interesting is Ansible.
  • currently there's no way of testing that the vm works with the current development trunk; a test environment that runs the vm and tests for all the relevant connection bits would be handy

 

CSV Validator version 1.0 release

Following on from my previous brief post announcing the beta release of the CSV Validator, http://www.openplanetsfoundation.org/blogs/2014-03-21-csv-validator-beta-releases, today we've made the formal version 1.0 release of the CSV Validator and the associated CSV Schema Language. I've described this in more detail on The National Archives' blog, http://blog.nationalarchives.gov.uk/blog/csv-validator-new-digital-preservation-tool/
