OPF News & Overview


This page gives an overview of the content across all the different wiki sites hosted by the OPF.


Open Planets Foundation News

Open Planets Foundation
(The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium.)
Running archived Android apps on a PC: first impressions

Earlier this week I had a discussion with some colleagues about the archiving of mobile phone and tablet apps (iPhone/Android), and, equally important, ways to provide long-term access. The immediate incentive for this was an announcement by a Dutch publisher, who recently published a children's book that is accompanied by its own app. Also, there are already several examples of Ebooks that are published exclusively as mobile apps. So, even though we're not receiving any apps in our collections yet, we'll have to address this at some point, and it's useful to have an initial idea of the challenges that may lie ahead.

The scope of this blog is not to provide any in-depth coverage of the long-term preservation of mobile apps. Instead, I was just curious about two specific aspects:

  • Is it possible to run a phone app on a regular PC, and, if yes, how?
  • How can you use this to run an archived copy of an app?

I spent a few afternoons running some preliminary tests. Since I'm pretty sure other institutions must be looking into this as well, I thought I might as well share the results, as well as some useful resources I came across along the way. For now I limited myself to the Android platform (iOS presents additional challenges because of its restrictive license).

The Android app format

First of all it is helpful to know a bit more about the Android app format. Basically, an app (.apk) file is just a ZIP archive with a specific file and directory structure. It is based on Java's Jar format. For more information see this entry on Archive Team's file format wiki:

http://fileformats.archiveteam.org/wiki/APK

Here you can also find out how to download local copies of Android app files (e.g. to a PC), which is something that is not possible directly from the Google Play store.
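
Because an APK is really just a ZIP archive, you can take a quick look inside one with ordinary archive tools. A minimal sketch (the file name is hypothetical):

# List the contents of an archived APK; expect entries such as
# AndroidManifest.xml, classes.dex, resources.arsc and the res/ directory
unzip -l someapp.apk

# Extract everything to a working directory for closer inspection
unzip someapp.apk -d someapp_extracted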

Running Android on a PC

If you want to run Android on a regular (Linux or Windows) PC, several options exist. This article gives a good general overview. The "best" solution according to its author is a third-party app player. However, that player only works under Windows, and it is proprietary and non-free: to use it, you either pay a monthly fee or put up with "sponsored apps". Google's Android SDK also includes an emulator, which is mainly targeted at app developers. I didn't look into that for now, mainly because of its alleged poor performance. Instead, I went for a third option.

Android on VirtualBox

The Android-x86 Project has created a port of the Android Open Source Project that runs on x86-based architectures. This opens up the possibility of running Android on an ordinary PC, either as the main operating system or in a virtual machine. So, I decided to take the latter route and installed Android on a virtual machine using VirtualBox. This is relatively straightforward, and several excellent step-by-step descriptions of how to do this exist, for instance:

The latest ISO images from Android-x86 can be found here (for this test I used version 4.4):

http://sourceforge.net/projects/android-x86/files/
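
If you prefer the command line over the VirtualBox GUI, the virtual machine can also be created with VBoxManage. The sketch below is only an illustration: the VM name, memory size, disk size and ISO file name are examples, and the settings you need may differ for your Android-x86 version.

# Create and register a new VM for Android-x86
VBoxManage createvm --name android-x86 --ostype Linux_64 --register
VBoxManage modifyvm android-x86 --memory 2048

# Create a virtual disk and attach it, together with the downloaded ISO
VBoxManage createhd --filename android-x86.vdi --size 8192
VBoxManage storagectl android-x86 --name IDE --add ide
VBoxManage storageattach android-x86 --storagectl IDE --port 0 --device 0 --type hdd --medium android-x86.vdi
VBoxManage storageattach android-x86 --storagectl IDE --port 1 --device 0 --type dvddrive --medium android-x86-4.4.iso

# Boot the VM and run the Android-x86 installer from the ISO
VBoxManage startvm android-x86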

Running and waking up Android

Setting up the virtual machine was easy enough, and Android appeared to work well straight away. One thing that can be confusing for first-time users is the way VirtualBox deals with mouse input: once you click your mouse in the screen area that is occupied by the virtual machine, VirtualBox shows a dialog asking whether it should "capture" the mouse. Once you click Capture, the mouse can only be used inside the virtual machine (and not for any other applications that are running on the host machine). You can uncapture the mouse at any time by pressing the right-hand Ctrl key.

Another thing that initially puzzled me is that Android enters sleep mode after several minutes of inactivity, resulting in a black screen. Once in sleep mode, it is not very obvious how to wake it up again (even rebooting the VM didn't do the trick). After some searching I found that the solution is to press the Menu key on the keyboard (located next to the right-hand Ctrl key on most keyboards), which instantly brings the machine back to life.

Moving an app to the virtual machine

To install an app in Android, you would normally go to the Google Play store. In an archival setting it is more likely that you already have an archived copy stored somewhere, so what we need here is the ability to install from a local APK file. This is also known as "side loading", and this article gives general instructions on how to do this with a physical device. Since we're running Android in a virtual machine here, things are a bit different, and ideally we should be able to share a folder between the host machine and the (virtual) guest device. In theory this is all possible in VirtualBox, but as it turns out it doesn't work, because Android-x86 doesn't support the VirtualBox Guest Additions. As a workaround, I ended up uploading the APK to Dropbox, and then opened Dropbox in Android's web browser to download the file.
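
As an aside, another route I didn't try here is Android's adb tool (part of the SDK platform tools): if the virtual machine's debugging bridge is reachable over the network, an archived APK can be pushed and installed directly from the host. A rough, untested sketch (the IP address is just an example):

# Connect to the Android-x86 VM over the network
adb connect 192.168.56.101

# Install the archived APK on the connected device
adb install someapp.apk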

Installing the app

The downloaded APK is now located in the Download folder, which is accessible using Android's file browser. After clicking on it, the following security warning popped up:

This is because, by default, apps from unknown sources (i.e. other than the Google Play store) are blocked by Android. The solution here is to click on Settings, which opens up the security settings dialog:

Here, check the Unknown sources option 1. Then go back to the Downloads folder and click on the APK file again. It will now install the app.

Final thoughts

In this blog I provided some basic information about Android's APK format, how to run Android in VirtualBox, and how to install an archived app. I tested this myself with a handful of apps. One thing I noticed was that some apps didn't quite work as expected on my virtual Android machine, but as I didn't have access to a 'real' (physical) device it's impossible to tell whether this was due to the virtualisation or just a shortcoming of those apps. This would obviously need more work. Nevertheless, considering that I only spent a few odd afternoons on this, this approach looks quite promising.


  1. Note that this will also enable you to install apps that are possibly harmful; use with care! 

Maturity levels & Preservation Policies

Recently, at the iPRES 2014 conference in Melbourne, I gave a presentation on the SCAPE Preservation Policies. Not only did I explain the SCAPE Preservation Policy Model, but I also summarised my findings after analysing 40 real-life preservation policies. You can read the detailed information in my article (to be published soon). Basically, I think that organisations quite often overstretch themselves by formulating preservation policies that are not in line with their maturity. I therefore propose to extend the SCAPE Catalogue of Preservation Policy Elements with information indicating at which maturity level each policy element is relevant. The five levels are based on the Maturity Model of C. Dollar and L. Ashley.

The SCAPE project has now finished, which is why I could use your input. The current wiki of the Open Preservation Foundation will be open to OPF account holders, and you will be able to help by adding these maturity levels to the preservation policy elements. This way the result will reflect a collaborative view, rather than just my own opinion.

Currently the OPF website is undergoing some changes, but when this is finished, I'll remind you!

Open Planets Foundation is becoming the Open Preservation Foundation
We are pleased to announce that we are changing our company name to the Open Preservation Foundation. The name change reflects the foundation's core purpose and vision in the field of digital preservation while retaining its widely-known acronym, OPF. 
 
During 2014 there have been several new additions to the foundation. Ed Fay was appointed as the new Executive Director in February 2014 and the National Archives of Estonia and the Poznan Supercomputing and Networking Centre joined as new members.
 
'We feel it is the right time to change the name', explained Dr. Ross King, Chair of the OPF. 'It aligns with the new 2015-2018 strategy, which will be published in November, and makes it clear what the organisation is about now and its future intent. 
 
'The history behind the old name refers to the Planets Project, an EU-funded digital preservation project which closed in 2010. The Open Planets Foundation was established to sustain the results from the project'.
 
'The name change is also an important part of the launch of our new website', said Ed Fay, Executive Director of the OPF. 'We want to make our mission, and information about our technology and best practice more easily accessible to our members and the community. At the same time we will open a survey to establish trends in tools and approaches across the digital preservation landscape'.
 
The change of name will come into full effect by mid-November, when the new brand and website will be unveiled.
 
 
Notes
The OPF is an international membership organisation sustaining technology and knowledge for the long-term management of digital cultural heritage, providing its members with reliable solutions to the challenges of digital preservation.

 

Scape Demonstration: Migration of audio using xcorrSound

As part of the SCAPE project, we did a large-scale experiment and evaluation of audio migration, using the xcorrSound tool waveform-compare for content comparison in the quality assurance step.

I presented the results at the demonstration day at the State and University Library; see the SCAPE Demo Day at Statsbiblioteket blog post by Jette G. Junge.

And now I present the screencast of this demonstration:

Screencast: SCAPE demonstration of audio migration using xcorrSound in QA

The brief summary is:

  • New tool: using xcorrSound waveform-compare, we can automate audio file content comparison for quality assurance (a sketch of a single migrate-and-compare step is shown below)
  • Scalability: using Hadoop, we can migrate our 20 TB radio broadcast mp3 collection to the wav file format in a month (on the current SB Hadoop cluster set-up) rather than in years :)
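
For a single file, the migrate-and-compare step referred to above boils down to something like the sketch below. The file names are hypothetical, the waveform-compare invocation is quoted from memory and may differ between xcorrSound releases, and the real experiment of course ran these steps as Hadoop jobs rather than one file at a time.

# Migrate one mp3 broadcast file to wav (file names are hypothetical)
ffmpeg -i broadcast.mp3 broadcast.wav

# Decode the original mp3 with an independent decoder for comparison
mpg123 -w broadcast_reference.wav broadcast.mp3

# Quality assurance: cross-correlate the two waveforms with xcorrSound
waveform-compare broadcast_reference.wav broadcast.wav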

And just a few notes:

  • the large-scale experiment did not include property extraction and comparison, but we are confident (based on an earlier experiment) that we can do this effectively using FFprobe
  • the large-scale experiment also did not include file format validation. We made an early decision not to use JHOVE2, based on performance. The open question is whether we are satisfied with the "pseudo-validation" that both the FFprobe property extraction and the xcorrSound waveform-compare cross-correlation algorithm were able to read the file...

Oh, and the slides are also on Slideshare: Migration of audio files using Hadoop.

 

Siegfried - a PRONOM-based, file format identification tool

Ok. I know what you're thinking. Do we really need another PRONOM-based, file format identification tool?

A year or so ago I might have said "no" myself. In DROID and FIDO, we are already blessed with two brilliant tools. In my workplace, we're very happy users of DROID. We trust it as the reference implementation of PRONOM, it is fast, and it has a rich GUI with useful filtering and reporting options. I know that FIDO has many satisfied users too: it is also fast, great for use at the command line, and, as a Python program, easy to integrate with digital preservation workflows (such as Archivematica). The reason I wrote Siegfried wasn't to displace either of these tools; it was simply to scratch an itch: when I read the blog posts announcing FIDO a few years ago, I was intrigued by the different matching strategies used (FIDO's regular expressions and DROID's Boyer-Moore-Horspool string searching) and wondered what other approaches might be possible. I started Siegfried simply as a hobby project to explore whether a multiple-string search algorithm, Aho-Corasick, could perform well at matching signatures.

Having dived down the file format identification rabbit hole, my feeling now is that the more PRONOM-based, file format identification tools we have, the better. Multiple implementations of PRONOM make PRONOM itself stronger. For one thing, having different algorithms implement the same signatures is a great way of validating those signatures. Siegfried is tested using Ross Spencer's skeleton suite (a fantastic resource that made developing Siegfried much, much easier). During development of Siegfried, Ross and I were in touch about a number of issues thrown up during that testing, and these issues led to a small number of updates to PRONOM. I imagine the same thing happened for FIDO. Secondly, although many institutions use PRONOM, we all have different needs, and different tools suit different use cases. For example, for a really large set of records, with performance the key consideration, your best bet would probably be Nanite (a Hadoop implementation of DROID). For smaller projects, you might favour DROID for its GUI or FIDO for its Archivematica integration. I hope that Siegfried might find a niche too, and it has a few interesting features that I think commend it.

Simple command-line interface

I've tried to design Siegfried to be the least intimidating command-line tool possible. You run it with:

sf FILE

sf DIR

There are only two other commands, `-version` and `-update` (to update your signatures). There aren't any options: directory recursion is automatic, there are no default limits on search buffers, and output is YAML only. Why YAML? It is a structured format, so you can do interesting things with it, and it has a clean syntax that doesn't look horrible in a terminal.

YAML Output

Good performance, without buffer limits

I'm one of those DROID users that always sets the buffer size to -1, just in case I miss any matches. The trade-off is that this can make matching a bit slower. I understand the use of buffer limits (options to limit the bytes scanned in a file) in DROID and FIDO - the great majority of signatures are found close to the beginning or end of the file, and IO has a big impact on performance - but you need to be careful with them. Buffer limits can confuse users ("I can see a PRONOM signature for PDF/A, why isn't it matching?"). The use of buffer limits also needs to be documented if you want to accurately record how puids were assigned. This is because you are effectively changing the PRONOM signatures by overriding any variable offsets. In other words, you can't just say, "matched 'fmt/111' with DROID signatures v 77", but now need to say, "matched 'fmt/111' with DROID signatures v 77 and with a maximum BOF offset of 32000 and EOF offset of 16000".

Siegfried is designed so that it doesn't need buffer limits for good performance. Instead, Siegfried searches as much, or as little, of a file as it needs to in order to satisfy itself that it has obtained the best possible match. Because Siegfried matches signatures concurrently, it can apply PRONOM's priority rules during the matching process, rather than at the end. The downside of this approach is that while average performance is good, there is variability: Siegfried slows down for files (like PDFs) where it can't be sure what the best match is until much, or all, of the file has been read.

Detailed basis information

As well as telling you what it matched, Siegfried will also report why it matched. Where byte signatures are defined, this "basis" information includes the offset and length of byte matches. While many digital archivists won't need this level of justification, this information can be useful. It can be a great debugging tool if you are creating new signatures and want to test how they are matching. It might also be useful for going back and re-testing files after PRONOM signature updates: if signatures change and you have an enormous quantity of files that need to have their puids re-validated, then you could use this offset information to test just the relevant parts of files. Finally, by aggregating this information over time, it may also be possible to use it to refine PRONOM signatures: for example, are all PDF/As matching within a set distance from the EOF? Could that variable offset be changed to a fixed one?

Where can I get my hands on it?

You can download Siegfried here. You can also try Siegfried, without downloading it, by dragging files onto the picture of Wagner's Siegfried on that page. The source is hosted on Github if you prefer to compile it yourself (you just need Go installed). Please report any bugs or feature requests there. It is still in beta (v 0.5.0) and probably won't get a version one release until early next year. I wouldn't recommend using it as your only form of file format identification until then (unless you are brave!). But please try it and send feedback.

Finally, I'd like to say thanks very much to the TNA for PRONOM and DROID and to Ross Spencer for his skeleton suite(s).

In defence of migration

There is a trend in digital preservation circles to question the need for migration.  The argument varies a little from proponent to proponent but in essence, it states that software exists (and will continue to exist) that will read (and perform requisite functions, e.g., render) old formats.  Hence, proponents conclude, there is no need for migration.  I had thought it was a view held by a minority but at a recent workshop it became apparent that it has been accepted by many.

However, I’ve never thought this is a very strong argument. I’ve always seen a piece of software that can deal with not only new formats but also old formats as really just a piece of software that can deal with new formats, with a migration tool seamlessly bolted onto the front of it. In essence, it is like saying I don’t need a migration tool and a separate rendering tool because I have a combined migration and rendering tool. Clearly that’s OK, but it does not mean you’re not performing a migration.

 

As I see it, whenever a piece of software is used to interpret a non-native format it will need to perform some form of transformation from the information model inherent in the format to the information model used in the software.  It can then perform a number of subsequent operations, e.g., render to the screen or maybe even save to a native format of that software.  (If the latter happens this would, of course, be a migration.) 

 

Clearly the way software behaves is infinitely variable but it seems to me that it is fair to say that there will normally be a greater risk of information loss in the first operation (the transformation between information models) than in subsequent operations that are likely to utilise the information model inherent in the software (be it rendering or saving in the native format).  Hence, if we are concerned with whether or not we are seeing a faithful representation of the original it is the transformation step that should be verified. 

 

This is where using a separate migration tool comes into its own (at least in principle). The point is that it allows an independent check of the quality of the transformation (by comparing the significant properties of the files before and after). Subsequent use of the migrated file (e.g., by a rendering tool) is assumed to be lossless (or at least less lossy), since you can choose the migrated format to be the native format of the tool you intend to use subsequently (meaning that when the file is read, no transformation of the information model is required).

However, I would concede that there are some pragmatic things to consider...

 

First of all, migration either has a cost (if it requires the migrated file to be stored) or is slow (if it is done on demand).  Hence, there are probably cases where simply using a combined migration and rendering tool is a more convenient solution and might be good enough.

 

Secondly, is migration validation worth the effort?  Certainly it is worth simply testing, say, a rendering tool with some example files before deciding to use it, and most of the time this should be sufficient to determine that the tool works without detailed validation.  However, we have had cases where we detected uncommon issues in common migration libraries, so migration validation does catch issues that would go unnoticed if the same libraries were used in a combined migration and rendering tool.

 

Thirdly, is migration validation comprehensive enough?  The answer to this depends on the formats, but for some (even common) formats it is clear that more comprehensive tools would do a better job.  Of course the hope is that these will continually improve over time.

 

So, to conclude, I do see migration as a valid technique (and in fact a technique that almost everyone uses even if they don’t realise it).  I believe one of the aims of the digital preservation community should be to provide an intellectually sound view of what constitutes a high-quality migration (e.g., through a comprehensive view of significant properties across a wide range of object types).  It might be that real-life tools provide some pragmatic approximation to this idealistic vision (potentially using short cuts like a combined migration and rendering tool), but we should at least understand and be able to express what these short cuts are.

 

I hope this post helps to generate some useful debate.

 

Rob

Six ways to decode a lossy JP2

Some time ago Will Palmer, Peter May and Peter Cliff of the British Library published a really interesting paper that investigated three different JPEG 2000 codecs, and their effects on image quality in response to lossy compression. Most remarkably, their analysis revealed differences not only in the way these codecs encode (compress) an image, but also in the decoding phase. In other words: reading the same lossy JP2 produced different results depending on which implementation was used to decode it.

A limitation of the paper's methodology is that it obscures the individual effects of the encoding and decoding components, since both are essentially lumped in the analysis. Thus, it's not clear how much of the observed degradation in image quality is caused by the compression, and how much by the decoding. This made me wonder how similar the decode results of different codecs really are.

An experiment

To find out, I ran a simple experiment:

  1. Encode a TIFF image to JP2.
  2. Decode the JP2 back to TIFF using different decoders.
  3. Compare the decode results using some similarity measure.

Codecs used

I used the following codecs:

Note that GraphicsMagick still uses the JasPer library for JPEG 2000. ImageMagick now uses OpenJPEG (older versions used JasPer). IrfanView's JPEG 2000 plugin is made by LuraTech.

Creating the JP2

First I compressed my source TIFF (a grayscale newspaper page) to a lossy JP2 with a compression ratio of about 4:1. For this example I used OpenJPEG, with the following command line:

opj_compress -i krant.tif -o krant_oj_4.jp2 -r 4 -I -p RPCL -n 7 -c [256,256],[256,256],[256,256],[256,256],[256,256],[256,256],[256,256] -b 64,64

Decoding the JP2

Next I decoded this image back to TIFF using the aforementioned codecs. I used the following command lines:

Codec            Command line
opj20            opj_decompress -i krant_oj_4.jp2 -o krant_oj_4_oj.tif
kakadu           kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu.tif
kakadu-precise   kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu_precise.tif -precise
irfan            Used GUI
im               convert krant_oj_4.jp2 krant_oj_4_im.tif
gm               gm convert krant_oj_4.jp2 krant_oj_4_gm.tif

This resulted in 6 images. Note that I ran Kakadu twice: once using the default settings, and also with the -precise switch, which "forces the use of 32-bit representations".

Overall image quality

As a first analysis step I computed the overall peak signal to noise ratio (PSNR) for each decoded image, relative to the source TIFF:

Decoder          PSNR
opj20            48.08
kakadu           48.01
kakadu-precise   48.08
irfan            48.08
im               48.08
gm               48.07

So relative to the source image these results are only marginally different.
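
For reference, overall PSNR figures of this kind can be computed with, for example, ImageMagick's compare tool (a sketch only; I'm not claiming this is the exact tooling behind the table above):

# Peak signal-to-noise ratio (in dB) of a decoded image relative to the source TIFF;
# "null:" discards the difference image, the metric is printed to stderr
compare -metric PSNR krant.tif krant_oj_4_oj.tif null: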

Similarity of decoded images

But let's have a closer look at how similar the different decoded images are. I did this by computing PSNR values of all possible decoder pairs. This produced the following matrix:

Decoder          opj20   kakadu   kakadu-precise   irfan   im      gm
opj20            -       57.52    78.53            79.17   96.35   64.43
kakadu           57.52   -        57.51            57.52   57.52   57.23
kakadu-precise   78.53   57.51    -                79.00   78.53   64.52
irfan            79.17   57.52    79.00            -       79.18   64.44
im               96.35   57.52    78.53            79.18   -       64.43
gm               64.43   57.23    64.52            64.44   64.43   -

Note that, unlike the table in the previous section, these PSNR values are only a measure of the similarity between the different decoder results. They don't directly say anything about quality (since we're not comparing against the source image). Interestingly, the PSNR values in the matrix show two clear groups:

  • Group A: all combinations of OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode, all with a PSNR of > 78 dB.
  • Group B: all remaining decoder combinations, with a PSNR of < 65 dB.

What this means is that OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode all decode the image in a similar way, whereas Kakadu (default mode) and GraphicsMagick behave differently. Another way of looking at this is to count the pixels that have different values for each combination. This yields up to 2 % different pixels for all combinations in group A, and about 12 % in group B. Finally, we can look at the peak absolute error value (PAE) of each combination, which is the maximum value difference for any pixel in the image. This figure was 1 pixel level (0.4 % of the full range) in both groups.
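
The pixel counts and PAE figures can be obtained in a similar way, again for example with ImageMagick (illustrative only):

# Number of pixels that differ between two decoded images
compare -metric AE krant_oj_4_oj.tif krant_oj_4_kdu.tif null:

# Peak absolute error: the largest value difference for any single pixel
compare -metric PAE krant_oj_4_oj.tif krant_oj_4_kdu.tif null: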

I also repeated the above procedure for a small RGB image. In this case I used Kakadu as the encoder. The decoding results of that experiment showed the same overall pattern, although the differences between groups A and B were even more pronounced, with PAE values in group B reaching up to 3 pixel values (1.2 % of full range) for some decoder combinations.

What does this say about decoding quality?

It would be tempting to conclude from this that the codecs that make up group A provide better quality decoding than the others (GraphicsMagick, Kakadu in default mode). If this were true, one would expect that the overall PSNR values relative to the source TIFF (see previous table) would be higher for those codecs. But the values in the table are only marginally different. Also, in the test on the small RGB image, running Kakadu in precise mode lowered the overall PSNR value (although by a tiny amount). Such small effects could be due to chance, and for a conclusive answer one would need to repeat the experiment for a large number of images, and test the PSNR differences for statistical significance (as was done in the BL analysis).

I'm still somewhat surprised that even in group A the decoding results aren't identical, but I suspect this has something to do with small rounding errors that arise during the decode process (maybe someone with a better understanding of the mathematical intricacies of JPEG 2000 decoding can comment on this). Overall, these results suggest that the errors that are introduced by the decode step are very small when compared against the encode errors.

Conclusions

OpenJPEG, (recent versions of) ImageMagick, IrfanView and Kakadu in precise mode all produce similar results when decoding lossily compressed JP2s, whereas Kakadu in default mode and GraphicsMagick (which uses the JasPer library) behave differently. These differences are very small when compared to the errors that are introduced by the encoding step, but for critical decode applications (migrate lossy JP2 to something else) they may still be significant. As both ImageMagick and GraphicsMagick are often used for calculating image (quality) statistics, the observed differences also affect the outcome of such analyses: calculating PSNR for a JP2 with ImageMagick and GraphicsMagick results in two different outcomes!

For losslessly compressed JP2s, the decode results for all tested codecs are 100% identical1.

This tentative analysis does not support any conclusions on which decoders are 'better'. That would need additional tests with more images. I don't have time for that myself, but I'd be happy to see others have a go at this!

Link

William Palmer, Peter May and Peter Cliff: An Analysis of Contemporary JPEG2000 Codecs for Image Format Migration (Proceedings, iPres 2013)


  1. Identical in terms of pixel values; for this analysis I didn't look at things such as embedded ICC profiles, which not all encoders/decoders handle well

 

Tool highlight: SCAPE Online Demos

Now that we are entering the final days of the SCAPE project, we would like to highlight some SCAPE Quality Assurance tools that have an online demonstrator.

 

See http://scape.demos.opf-labs.org/ for the following tools:

 

Pagelyzer: Compares web pages

Monitor your web content.

 

Jpylyzer: Validates images

JP2K validator and properties extractor.

 

Xcorr-sound: Compares audio sounds

Improve your digital audio recordings.

 

Flint: Validates different files and formats

Validate PDF/EPUB files against an institutional policy.

 

Matchbox: Compares documents (following soon)

Duplicate image detection tool.

 

For more info on these and other tools and the SCAPE project, see http://www.scape-project.eu/tools.

Interview with a SCAPEr - Ed Fay

Ed Fay

Who are you?

My name is Ed Fay, I’m the Executive Director of the Open Planets Foundation.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?

OPF has been involved in technical and take-up work all the way through the project, but right now we’re focused on sustainability – what happens to all the great results that have been produced after the end of the project.

Why is your organisation involved in SCAPE?

OPF has been responsible for leading the sustainability work and will provide a long-term home for the outputs, preserving the software and providing an ongoing collaboration of project partners and others on best practices and other learning. OPF members include many institutions who have not been part of SCAPE but who have an interest in continuing to develop the products, and through the work that has been done - for example on software maturity and training materials - OPF can help to lower barriers to adoption by these institutions and others.

What are the biggest challenges in SCAPE as you see it?

The biggest challenge in sustainability is identifying a collaboration model that can persist outside of project funding. As cultural heritage budgets are squeezed around the world and institutions adapt to a rapidly changing digital environment, the community needs to make the best use of the massive investment in R&D that has been made by bodies such as the EC in projects such as SCAPE. OPF is a sustainable membership organisation which is helping to answer these challenges for its members and provide effective and efficient routes to implementing the necessary changes to working practices and infrastructure. In 20 years we won’t be asking how to sustain work such as this – it will be business as usual for memory institutions everywhere – but right now the digital future is far from evenly distributed.

But from the SCAPE perspective we have a robust plan which encompasses many different routes to adoption, which is of course the ultimate route to sustainability – production use of the outputs by the community for which they were intended. The fact that many outputs are already in active use – as open-source tools and embedded into commercial systems – shows that SCAPE has produced not only great research but mature products which are ready to be put to work in real-world situations.

What do you think will be the most valuable outcome of SCAPE?

This is very difficult for me to answer! Right now OPF has the privileged perspective of transferring everything that has matured during the project into our stewardship - from initial research, through development, and now into mature products which are ready for the community. So my expectation is that there are lots of valuable outputs which are relevant not only in the context of SCAPE but also as independent components. One particular product has already been shortlisted for the Digital Preservation Awards 2014, which is being co-sponsored by OPF this year, while others have won awards at DL2014. These might be the most visible in receiving accolades, but there are many other tools and services which provide the opportunity to enhance digital preservation practice within a broad range of institutions. I think the fact that SCAPE is truly cross-domain is very exciting – working with scientific data, cultural heritage, web harvesting – it shows that digital preservation is truly maturing as a discipline.

If there could be one thing to come out of this, it would be an understanding of how to continue the outstanding collaboration that SCAPE has enabled, in order to sustain cost-effective digital preservation solutions that can be adopted by institutions of all sizes and kinds.

Contact information

ed@openplanetsfoundation.org

twitter.com/digitalfay

SCAPE Project Ends on the 30th of September

It is difficult to write that headline. After nearly four years of hard work, worry, setbacks, triumphs, weariness, and exultation, the SCAPE project is finally coming to an end.

I am convinced that I will look back at this period as one of the highlights of my career. I hope that many of my SCAPE colleagues will feel the same way.

I believe SCAPE was an outstanding example of a successful European project, characterised by

  • an impressive level of trouble-free international cooperation;
  • sustained effort and dedication from all project partners;
  • high quality deliverables and excellent review ratings;
  • a large number of amazing results, including more software tools than we can demonstrate in one day!

I also believe SCAPE has made and will continue to make a significant impact on the community and practice of digital preservation. We have achieved this impact through

I would like to thank all the people who contributed to the SCAPE project, who are far too numerous to name here. In particular I would like to thank our General Assembly members, our Executive Board/Sub-project leads, the Work Package leads, and the SCAPE Office, all of whom have contributed to the joy and success of SCAPE.

Finally, I would like to thank the OPF for ensuring that the SCAPE legacy will continue to live and even grow long after the project itself is just a fond memory.

It's been a pleasure folks. Well done!

Weirder than old: The CP/M File System and Legacy Disk Extracts for New Zealand’s Department of Conservation

We’ve been doing legacy disk extracts at Archives New Zealand for a number of years, with much of the effort that enables this work done by colleague Mick Crouch and former Archives New Zealand colleague Euan Cochrane. Earlier this year we received some disks from New Zealand’s Department of Conservation (DoC), which we successfully imaged, extracting what was needed by the department. While it was a pretty straightforward exercise, there was enough about it that was cool enough to make this blog an opportunity to document another facet of the digital preservation work we’re doing, especially in the spirit of providing another war story that others in the community can refer to. We conclude with a few thoughts about where we still relied on a little luck, something we’ll have to keep in mind moving forward.

We received 32 180 KB 5.25-inch disks from DoC: Maxell MD1-D, single-sided, double-density, containing what we expected to be survey data from circa 1984/1985.

Our goal with these disks, as with any that we are finding outside of a managed records system, is to transfer the data to a more stable medium, as disk images, and then extract the objects on the imaged file system to enable further appraisal. From there a decision will be made about how much more effort should be put into preserving the content and making suitable access copies of whatever we have found – a triage.

For agencies with 3.5-inch floppy disks, we normally help to develop a workflow within that organisation that enables them to manage this work for themselves using the more ubiquitous 3.5-inch USB disk drives. With 5.25-inch disks it is more difficult to find suitable floppy disk drive controllers, so we try our best at Archives to do this work on behalf of colleagues, using equipment we’ve set up around the KryoFlux Universal USB floppy disk controller. The device enables the write-blocked reading and imaging of legacy disk formats at a forensic level, using modern PC equipment.

We create disk images of the floppies using the KryoFlux and continue to use those images as a master copy for further triage. A rough outline of the process we tend to follow, plus some of its rationale, is documented by Euan Cochrane in his Open Planets Foundation blog: Bulk disk imaging and disk-format identification with KryoFlux.

Through a small amount of trial and error we discovered that the image format with which we were capable of reading the most sectors without error was MFM (Modified Frequency Modulation) sector images, with the following settings:

Image Type:     MFM Sector Image
Start Track:    At least 0
End Track:      At most 83
Side Mode:      Side 0
Sector Size:    256 Bytes
Sector Count:   Any
Track Distance: 40 Tracks
Target RPM:     By Image type
Flippy Mode:    Off

We didn’t experiment to see if these settings could be further optimised as we found a good result. The non-default settings in the case of these disks were side mode zero, sector size 256 bytes, track distance at 40, and flippy mode was turned off.

With the data taken off volatile and unstable media, we have binary objects that we can attach fixity to and treat using more common digital preservation workflows. We managed to read 30 out of the 32 disks.

Exploding the Disk Images

With the disk images in hand we found ourselves facing our biggest challenge. The images, although clearly well-formed (i.e. not corrupt), would not mount with Virtual Floppy Disk, nor in Linux.

Successful imaging alone doesn’t guarantee ease of mounting. We still needed to understand the underlying file system.

The images that we’ve seen before have been FAT12 and mount with ease in MS-DOS or Linux. These disks did not share the same identifying signatures at the beginning of the bitstream. We needed a little help identifying them, and fortunately, through forensic investigation and a little experience on the part of a colleague, it was quite clear the disks were CP/M formatted, with the following ASCII text appearing as-is in the bitstream:

 

*************************
*     MIC-501  V1.6     *
*   62K CP/M  VERS 2.2  *
*************************

COPYRIGHT  1983, MULTITECH BIOS VERS 1.6

 

CP/M (Control Program for Microcomputers) is a late-1970s/early-1980s operating system for early Intel microcomputers. The makers of the operating system were approached by IBM about licensing CP/M for their Personal Computer product, but talks failed and IBM went with MS-DOS from Microsoft; the rest is ancient history…

With the knowledge that we were looking at a CP/M file system we were able to source a mechanism to mount the disks in Windows. Cpmtools is a privately maintained suite of utilities for interacting with CP/M file systems. It was developed for working with CP/M in emulated environments, but works equally well with floppy disks and disk images. The tool is available on Windows and on POSIX-compliant systems.

Commands for the different utilities look like the following.

Creating a directory listing:

C:> cpmls -f bw12 disk-images\disk-one.img

This will list the user number (a CP/M concept) and the file objects belonging to that user.

E.g.:

0:
   File1.txt
   File2.txt

Extracting objects based on user number:

C:> cpmcp -f bw12 -p -t disk-images\disk-one.img 0:* output-dir

This will extract all objects collected logically under user 0: and put them into an output directory.

Finding the right commands was a little tricky at first, but once the correct set of arguments was found, it was straightforward to keep repeating them for each of the disks.
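
Since cpmtools also runs on POSIX systems, the extraction can easily be scripted once the arguments are known. A rough sketch (paths and output layout are hypothetical):

#!/bin/sh
# Extract user area 0: from every disk image into its own output directory
for img in disk-images/*.img; do
    name=$(basename "$img" .img)
    mkdir -p "output/$name"
    cpmls -f bw12 "$img" > "output/$name/listing.txt"
    cpmcp -f bw12 -p -t "$img" "0:*" "output/$name"
done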

One of the less intuitive values supplied to the command line was the ‘bw12’ disk definition. This refers to a definition file, detailing the layout of the disk. The definition that worked best for our disks was the following:

# Bondwell 12 and 14 disk images in IMD raw binary format

diskdef bw12
  seclen 256
  tracks 40
  sectrk 18
  blocksize 2048
  maxdir 64
  skew 1
  boottrk 2
  os 2.2
end

The majority of the disks extracted well. A small on-image modification we made was the conversion of filenames containing forward slashes. The forward slashes did not play well with Windows, so I took the decision to change the slashes to hashes (a small edit in hex) to ensure the objects were safely extracted into the output directory.

WordStar and other bits ‘n’ pieces

Content on the disks was primarily WordStar – CP/M’s flavour of word processor. Despite MS-DOS versions of WordStar, the program eventually lost market share to WordPerfect in the 1980s, almost in parallel with the demise of CP/M itself. It took a little searching to source a converter to turn the WordStar content into something more useful, but we did find something fairly quickly. The biggest issue with viewing WordStar content as-is in a standard text editor is the format’s use of the high-order bits of individual bytes to mark word boundaries, as well as to make other denotations.

Example text, read verbatim might look like:

thå  southerî coasô = the southern coast

At first, I was sure this was a sign of bit-flipping on less stable media. Again, the experience colleagues had with older formats was useful here, and a consultation with Google soon helped me to understand what we were seeing.
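
As a crude illustration of what is going on, the high-order bit can be stripped from every byte with standard tools, which makes the text readable but throws away WordStar's formatting (file names are hypothetical; the converter mentioned below does a far better job):

# Map every byte in the range octal 200-377 onto octal 000-177,
# i.e. clear the high-order bit of each byte
tr '\200-\377' '\000-\177' < letter.ws > letter.txt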

Looking for various readers or migration tools led me to a number of dead websites, but the Internet Archive came to the rescue and allowed us to see them: WordStar to other format solutions.

The tool we ended up using was the HABit WordStar converter, with more information on Softpedia.com. It does bulk conversion of WordStar to plain text or HTML. We didn’t have to worry too much about how faithful the representation would be; as this was just a triage, we were more interested in the intellectual value of the content, or data. Rudimentary preservation of layout would be enough. We were very happy with plain-text output, with the option of HTML output too.

Unfortunately, when we approached Henry Bartlett, the developer of the tool, about a small bug in the bulk conversion (the tool neutralises file format extensions on output of the text file, causing naming collisions), we were informed by his wife that he had sadly passed away. I hope it is some reassurance to her to know that at the very least his work is still of great use to a good number of people doing format research, and to those who will eventually consume the objects that we’re working on.

Conversion was still a little more manual than we’d like it to be if we had larger numbers of files, but everything ran smoothly. Each of the deliverables was collected and taken back to the parent department on a USB stick, along with the original 5.25-inch disks.

We await further news from DoC about what they’re planning on doing with the extracts next.

Conclusions

The research to complete this work took a couple of weeks overall. With more dedicated time it might have taken a week.

Since completion and delivery to the Department of Conservation, we’ve run through the same process on another set of disks. This took a fraction of the time – possibly an afternoon. The process can be refined with each further iteration.

The next step is to understand the value in what was extracted. This might mean using the extract to source printed copies of the content and concluding that we can dispose of these disks and their content. An even better result might be discovering that there are no other copies of the material and that these digital objects can become records in their own right, with potential for long-term retention. At the very least those conversations can now begin. In the latter instance, we’ll need to understand which of the various deliverables, i.e. the disk images, the extracted objects, and the migrated objects, will be considered the record.

Demonstrable value acts like a weight on the scales of digital preservation, where we try to balance effort with value; especially in this instance, where the purpose of the digital material is as yet unknown. This case study is born of an air-gap in the recordkeeping process that sees the parent department attempting to understand the information in its possession in the absence of other recordkeeping metadata.

Aside from the value in what was extracted, there is still a benefit to us as an archive, and as a team in working with old technology, and equipment. Knowledge gained here will likely prove useful somewhere else down the line. 

Identifying the file system could have been a little easier, and so we’d echo the call from Euan in the aforementioned blog post to have identification mechanisms for image formats in DROID-like tools.

Forensic analysis of the disk images and comparing that data to that extracted by CP/M Tools showed a certain amount of data remanence, that is, data that only exists forensically on the disk. It was extremely tempting to do more work with this, but we settled for notifying our contact at DoC, and thus far, we haven’t been called on to extract it.

We required a number of tools to perform this work. How we maintain the knowledge of this work, and how we maintain the tools used, are two important questions. I don’t have an answer for the latter, while this blog serves in some way as documentation of the former.

What we received from DoC was old, but it wasn’t a problem that it was old. The right tools enabled this work to be done fairly easily – and that goes for any organisation willing to put modest tools such as the KryoFlux, and other legacy equipment, into the hands of their analysts and researchers. The disks were in good shape too. The curveball in this instance was that some of the pieces of the puzzle that we were interacting with were weirder than expected: a slightly different file system, and a word-processing format that encoded data in an unexpected way, making a 1:1 extract and use a little more difficult. We got around it though. And indeed, as it stands, this wasn’t a preservation exercise; it was a low-cost and pragmatic exercise to support appraisal, continuity, and potential future preservation. The files have been delivered to DoC in their various forms: disk images, extracted objects, and migrated objects. We’ll await a further nod from them to understand where we go next.

How trustworthy is the SCAPE Preservation Environment?

Over the last three and a half years, the SCAPE project worked in several directions in order to propose new solutions for digital preservation, as well as improving existing ones. One of the results of this work is the SCAPE preservation environment (SPE). It is a loosely coupled system, which enables extending existing digital repository systems (e.g. RODA) with several components that cover collection profiling (i.e. C3PO), preservation monitoring (i.e. SCOUT) and preservation planning (i.e. Plato). Those components address key functionalities defined in the Open Archival Information System (OAIS) functional model.

Establishing trustworthiness of digital repositories is a major concern of the digital preservation community, as it makes the threats and risks within a digital repository understandable. Several approaches to addressing trust in digital repositories have been developed over recent years. The most notable is Trustworthy Repositories Audit and Certification (TRAC), which was later promoted to an ISO standard by the International Organization for Standardization (ISO 16363, released in 2012). The standard comprises three pillars: organizational infrastructure, digital object management, and infrastructure and security management; for each of these it provides a set of requirements and the evidence expected for compliance.

A recently published whitepaper reports on the work done to validate the SCAPE Preservation Environment against ISO 16363 – a framework for Audit and Certification of Trustworthy Digital Repositories. The work aims to demonstrate that a preservation ecosystem composed of building blocks such as the ones developed in SCAPE is able to comply with most of the system-related requirements of ISO 16363.

From a total of 108 metrics included in the assessment, the SPE fully supports 69 of them. 31 metrics were considered to be “out of scope”, as they refer to organisational issues that cannot be solved by technology alone, nor can they be analysed outside the framework of a living, breathing organisation. This leaves 2 metrics considered “partially supported” and 6 metrics considered “not supported”. That gives an overall compliance level of roughly 90% (69 of the 77 metrics that remain once the organisation-oriented metrics are set aside).

This work also enabled us to identify the main weak points of the SCAPE Preservation Environment that should be addressed in the near future. In summary the gaps found were:

  • The ability to manage and maintain contracts or deposit agreements through the repository user interfaces;
  • Support for tracking intellectual property rights;
  • Improve technical documentation, especially on the conversion of Submission Information Packages (SIP) into Archival Information Packages (AIP);
  • The ability to aid the repository manager to perform better risk management.

Our goal is to ensure that the SCAPE Preservation Environment fully supports the system-related metrics of the ISO 16363. In order to close the gaps encountered, additional features have been added to the roadmap of the SPE.

To get your hands on the full report, please go to http://www.scape-project.eu/wp-content/uploads/2014/09/SCAPE_MS63_KEEPS-V1.0.pdf

 

Digital Preservation Sustainability on the EU Policy Level - a workshop report

On Monday 8 September 2014 APARSEN and SCAPE together hosted a workshop, called ‘Digital Preservation Sustainability on the EU Policy Level’. The workshop was held in connection with the conference Digital Libraries 2014 in London.

The room for the workshop was ‘The Great Hall’ at City University London – a lovely, old, large room with a stage at one end and lots of space for the 12 stalls featuring the invited projects and the 85 attendees.

The first half of the workshop was dedicated to a panel session. The three panellists each had 10-15 minutes to present their views on both the achievements and future of digital preservation, followed by a discussion moderated by Hildelies Balk from the Royal Library of the Netherlands, with real time visualisations made by Elco van Staveren.

‘As a community we have failed’

With these words David Giaretta, Director of APARSEN (see presentation and visualisation), pinpointed the fact that there will be no EU funding for digital preservation research in the future and that the EU expects to see results from the €100 million already distributed. The EU sees data as the new gold, and we should start mining it! A big difference between gold and data, though, is that gold does not perish, whereas data does.

The important thing to do is to create some results – ‘A rising tide floats all boats’ – if we can at least show something that can be used, that will help fund the rest of the preservation work.

Let’s climb the wall!

David Giaretta was followed by Ross King, Project Coordinator of SCAPE (see presentation and visualisation), who started his presentation with a comparison between the two EU projects Planets and SCAPE - the latter being a follow-up of the former. Many issues already addressed in Planets were further explored and developed in SCAPE, the biggest difference being scalability – how to handle large volumes, scalability in planning processes, more automation etc. – which was the focal point of SCAPE.

To Ross King there were three lessons learned from working with Planets and SCAPE:

  • there is still a wall between Production on one side and Research & Development on the other, 
  • the time issue – although libraries, archives etc. work with long-term horizons, most businesses have a planning horizon of five years or less,
  • format migration may not be as important as we thought it was.

Who will pay?

Ed Fay, Director of the Open Planets Foundation (see presentation and visualisation), opened with the message that by working in digital preservation we have a great responsibility to help define the future of information management. With no future EU-funded projects, community collaboration at all levels is needed more than ever. Shared services and infrastructure are essential.

The Open Planets Foundation was founded after the Planets project to help sustain the results of that project. Together with SCAPE and other projects, OPF is now trying to mature tools so they can be widely adopted and sustained (see the SCAPE Final Sustainability Plan).

There are a lot of initiatives and a lot of momentum, from DPC, NDIIPP or JISC to OPF or APA - but what will the future look like? How do we ensure that initiatives are aligned up to the policy level?

Sustainability is about working out who pays – and when…

If digital preservation were delivering business objectives we wouldn’t be here talking about sustainability - it would just be embedded in how organisations work. We are not there yet!

A diverse landscape with many facets

The panellists’ presentations were followed by questions from the audience, mostly concerned with the approach to risk. During the discussion it was noted that although the three presenters see the digital landscape from different viewpoints, they all agree on its importance. People do need to preserve, and to get digital value from doing so. The DP initiatives and organisations are the shopping window; their members have lots of skills that the market could benefit from.

The audience were asked if they find it important to have a DP community - apparently nobody disagreed! And it seemed that almost everyone was a member of OPF, APARSEN or other similar initiatives.

There are not many H2020 digital preservation bids. In earlier days everybody had several proposals running in these rounds, but this is not catastrophic – good research has been done and now we want the products to be consolidated. We would like to reach a point where digital preservation is an infrastructure service as obvious as your email. But we are not there yet!

Appraisal and ingest are still not solved - we need to choose the data to be preserved, especially when talking about petabytes!

The discussion was wrapped up by going through the visualisation made by Elco van Staveren.

An overall comment was that even though there is no money directed towards digital preservation, there is still lots of money for problems that can be solved by digital preservation. It is important that the digital preservation community thinks of itself NOT as the problem but as part of the solution. And although the visualisation is mostly about sustainability, risks still play an important part. If you cannot explain the risk of doing nothing, you cannot persuade anyone to pay!

Clinic with experts

After the panel and one minute project elevator pitches there was a clinic session at which all the different projects could present themselves and their results at different stalls. A special clinic table was in turn manned by experts from different areas of digital preservation.

The projects involved in the clinic were:

This was the time to meet a lot of different people from the digital preservation field, to catch up and to build new relationships. For a photo impression of the workshop see: http://bit.ly/1u7Lmnq.

And the winner is....

 

Which message do YOU want to send to the EU for the future of Digital Preservation projects?

 

At the close of the workshop the winning tweet and the two runner-up tweets were announced – three very different messages to the EU altogether. One runner-up tweet urged the EU to allow for a small sustainability budget for at least 5 years after a project formally ends. The other runner-up tweet included the question 'Will this tweet be preserved?', which – very appropriately – has by now already been deleted and is thus seemingly lost forever.

But we are proud to announce:

 

The winner!: The words of Galadriel, "Much that once was is lost, for none now live who remember it", must not come true.

 

More about the workshop in the official SCAPE/APARSEN workshop blogs – soon to be published!

Our digital legacy: shortlist announced for the Digital Preservation Awards 2014
Created in 2004 to raise awareness about digital preservation, the Digital Preservation Awards are the most prominent celebration of achievement for those people and organisations that have made significant and innovative contributions to ensuring our digital memory is accessible tomorrow.
 
‘In its early years, the Digital Preservation Award was a niche category in the Conservation Awards’, explained Laura Mitchell, chair of the DPC. ‘But year on year the judges have been impressed by the increasing quality, range and number of nominations.’ 
 
‘I’m delighted to report that, once again, we have had a record number of applications which demonstrate an incredible depth of insight and subtlety in approach to the thorny question of how to make our digital memory accessible tomorrow. ’
 
The judges have shortlisted thirteen projects in four categories:
 
The OPF Award for Research and Innovation, which recognises excellence in practical research and innovation activities.
  • Jpylyzer by the KB (Royal Library of the Netherlands) and partners
  • The SPRUCE Project by The University of Leeds and partners
  • bwFLA Functional Long Term Archiving and Access by the University of Freiburg and partners
 
The NCDD Award for Teaching and Communications, recognising excellence in outreach, training and advocacy. 
  • Practical Digital Preservation: a how to guide for organizations of any size by Adrian Brown
  • Skilling the Information Professional by Aberystwyth University
  • Introduction to Digital Curation: An open online UCLeXtend Course by University College London
 
The DPC Award for the Most Distinguished Student Work in Digital Preservation, encouraging and recognising student work in digital preservation. 
  • Voices from a Disused Quarry by Kerry Evans, Ann McDonald and Sarah Vaughan, University of Aberystwyth
  • Game Preservation in the UK by Alasdair Bachell, University of Glasgow
  • Emulation v Format Conversion by Victoria Sloyan, University College London

 

The DPC Award for Safeguarding the Digital Legacy, which celebrates the practical application of preservation tools to protect at-risk digital objects. 

  • Conservation and Re-enactment of Digital Art Ready-Made, by the University of Freiburg and Partners
  • Carcanet Press Email Archive, University of Manchester
  • Inspiring Ireland, Digital Repository of Ireland and Partners
  • The Cloud and the Cow, Archives and Records Council of Wales

‘The competition this year has been terrific’, said Louise Lawson of Tate, chair of the judges. ‘Very many strong applications, which would have won the competition outright in previous years, have not even made the shortlist this time around.’
 
The Digital Preservation Awards have been celebrating excellence for 10 years now and are supported by some of the leading organisations in the field, including the NCDD and the Open Planets Foundation. The ceremony will be hosted by the Wellcome Trust, whose newly refurbished London premises will add to the glamour of the awards on Monday 17th November.
 
The finalists will attract significant publicity and a deserved career boost, at both organisational and individual level. Those who walk away with a Digital Preservation Award on the night can be proud to count themselves amongst the best projects and practitioners in a rapidly growing, international field.
 
‘Our next step is to open the shortlist to public scrutiny’, explained William Kilbride of the DPC. ‘We will be giving instructions shortly on how members of the DPC can vote for their favourite candidates. 
 
‘We have decided not to shortlist for the DPC Award for the Most Outstanding Digital Preservation Initiative in Industry. Although the field was strong, the judges didn’t feel it was competitive enough. We will be making a separate announcement about that in due course.’
 
Notes:
For more about the Digital Preservation Awards see: http://www.dpconline.org/advocacy/awards
For more about the Digital Preservation Coalition see: http://www.dpconline.org/
For press interviews contact William Kilbride on (william_at_dpconline.org)

 

My first Hackathon - Hacking on PDF Files

Preserving PDF - identify, validate, repair

22 participants from 8 countries – the UK, Germany, Denmark, the Netherlands, Switzerland, France, Sweden and the Czech Republic – not to forget the umpteen thousand defective or otherwise interesting PDF files brought to the event.

Not only is this my first blog entry on the OPF website, it is also about my first Hackathon. I guess it was Michelle's idea in the first place to organise a Hackathon on the topic of PDF with the Open Planets Foundation and to have the event in our library in Hamburg. I am located in Kiel, but as we are renewing our parquet floor in Kiel at the moment, the room situation in Hamburg is much better (besides, it's Hamburg that has the big airport).

The preparation for the event was pretty intense for me. Not only did the organisation in Hamburg (food, rooms, water, coffee, dinner event) have to be done; much more intense was the preparation for the hacking itself.

I am a library and information scientist, not a programmer. Sometimes I would rather be a programmer, considering my daily best-of problems, but you should dress for the body you have, not for the body you'd like to have.

Having learned the little I know about writing code within the last 8 months, and most of it just since this July, I am still brand new to it. As there is always a so-called "summer break" (which means that everybody else is on holiday and I actually have time to work on difficult stuff), I had some very intense Skype calls with Carl from the OPF, who enabled me to put all my work-in-progress PDF tools on GitHub. I learned about Maven and Travis and had not quite recovered when the Hackathon actually started this Monday and we all had to install a virtual Ubuntu machine to be able to try out best-of tools like DROID, Tika and FIDO and run them over our own PDF files.

We had Olaf Drümmer from the PDF Association as our keynote speaker for both days. On the first day he gave us insights into PDF and PDF/A, and when I say insights, I really mean that: he talked about the building blocks of a PDF, the basic object types and the encoding possibilities. This was much better than trying to understand the 756-page PDF 1.7 specification by myself, alone in the office, with sentences like "a single object of type null, denoted by the keyword null, and having a type and value that are unequal to those of any other object".

We learned about the many different kinds of page content, the page being the most important structural unit of a PDF file, and about the fact that a PDF page can have any size you can think of, although Acrobat 7.0 officially only supports page dimensions up to 381 km. On the second day we learned about PDF(/A) validation and what would theoretically be needed to build the perfect validator. Looking at the PDF and PDF/A specifications, and at all the specifications quoted and referenced by these, I am under the impression that it would take some months to read them all – and, so much is clear, somebody would have to read and understand them all. The complexity of the PDF format, the flexibility of the viewers and the plethora of users and user needs will always ensure a heterogeneous PDF reality, with all the strangeness and brokenness possible. As far as I remember, his guess is that about 10 years of manpower would be needed to build a perfect validator, if it could be done at all. Given this perfectly comprehensible assessment, it is probably not surprising that some of the participants had more questions at the end of the two days than they had at the beginning.

As PDF viewers tend to conceal problems and to display problematic PDF files reasonably well, they are usually no big help in terms of PDF validation or ensuring long-term availability – quite the contrary.

Some errors can have a big impact on the long-term availability of PDF files, especially content that is only referenced and not embedded within the file, which might simply be lost over time. On the other hand, the "invalid page tree node" that JHOVE, for example, likes to put its finger on is not an error, but just a hint that the page tree is not balanced and a page cannot be found in the most efficient way. Even if all the pages were simply stored in an array, and you had to iterate through the whole array to get to a certain page, this would only slow down loading; it would not prevent anybody from accessing the page they want to read, especially if the affected PDF document only has a couple of dozen pages.

During the afternoon of the first day we collected the specific problems everybody has and formed working groups, each engaging with a different problem. One working group (around Olaf) started to sift through JHOVE error messages, trying to figure out which ones really carry a risk and what they actually mean in the first place. Some of the error messages definitely describe real errors – a rule or specification is violated – but will practically never cause any problems displaying the file. Is this really an error then, or just bureaucracy? Should a good validator even report this as an error, which formally would be the right thing to do, or should it not disturb the user unnecessarily?

Another group wanted to create a small Java tool with CSV output that looks into a PDF file and reports which software created it and which validation errors it contains, starting with PDFBox, as this was easy to implement in Java. We got as far as getting the tool working, but as we brought especially broken PDF files to the event, it is not yet able to cope with all of them; we still have to make it error-proof.

By the way, it is really nice to be surrounded by people who obviously live in the same nerdy world as I do. When I told them I could not wait to see our new tool's output and was eager to analyse the findings, the answer was just "Neither can I". Usually I just get frowning foreheads and "I do not get why you are interested in something so boring" faces.

A third working group went to another room and tested the already existing tools against the PDF samples people had brought, in the virtual Ubuntu environment.

There were more ideas; some of them seemed too difficult, or simply impossible, to solve in such a short time, but some of us are determined to have a follow-up event soon.

For example, Olaf stated that sometimes text extraction from a PDF file does not work, and the participant sitting next to me suggested we could start checking the output against dictionaries to see if it still makes sense. "But there are so many languages," I told him, thinking about my library's content. "Well, start with one," he answered, following the idea that a big problem can often be split into several small ones.

Another participant would like to know more about the quality and compression of the JPEGs embedded within his PDF files, but others doubted this information could still be retrieved.

When the event was over on Tuesday around 5 pm, we were all tired but happy, with clear ideas or interesting new problems in our heads.

And just because I was asked this today, as I might still look slightly tired: we did sleep during the night. We did not hack all the way through it or sleep on mattresses in our library. Some of us had quite a few pitchers of beer during the evening, but I am quite sure everybody made it to his or her hotel room.

Twitter Hashtag #OPDFPDF

User-Driven Digital Preservation

We recently posted an article on the UK Web Archive blog that may be of interest here, User-Driven Digital Preservation, where we summarise our work with the SCAPE Project on a little prototype application that explores how we might integrate user feedback and preservation actions into our usual discovery and access processes. The idea is that we need to gather better information about which resources are difficult for users to use, and which formats they would prefer, so that we can use this data to drive our preservation work.

The prototype also provides a convenient way to run Apache Tika and DROID on any URL, and exposes the contents of its internal 'format registry' as a set of web pages that you can browse through (e.g. here's what it knows about text/plain). It only supports a few preservation actions right now, but it does illustrate what might be possible if we can find a way to build a more comprehensive and sustainable system.
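
To give a flavour of the Tika part of this, here is a minimal sketch (not the prototype's actual code) of detecting the format of a resource at a given URL. It assumes the third-party tika-python bindings and a local Java runtime, and the example URL is purely hypothetical.

# Minimal sketch, not the prototype: fetch a resource and ask Apache Tika
# (via the tika-python bindings, an assumption) what format it thinks it is.
import tempfile
import urllib.request

from tika import detector  # requires the 'tika' package and a Java runtime


def detect_format(url):
    """Download the resource at 'url' and return the MIME type Tika detects."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(data)
        tmp.flush()
        return detector.from_file(tmp.name)


# Hypothetical example URL; replace with any resource you want to check.
print(detect_format("http://example.org/some-resource"))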

When (not) to migrate a PDF to PDF/A

It is well-known that PDF documents can contain features that are preservation risks (e.g. see here and here). Migration of existing PDFs to PDF/A is sometimes advocated as a strategy for mitigating these risks. However, the benefits of this approach are often questionable, and the migration process can also be quite risky in itself. As I often get questions on this subject, I thought it might be worthwhile to do a short write-up on this.

PDF/A is a profile

First, it's important to stress that each of the PDF/A standards (A-1, A-2 and A-3) is really just a profile within the PDF format. More specifically, PDF/A-1 offers a subset of PDF 1.4, whereas PDF/A-2 and PDF/A-3 are based on the ISO 32000 version of PDF 1.7. What these profiles have in common is that they prohibit some features (e.g. multimedia, encryption, interactive content) that are allowed in 'regular' PDF. Also, they narrow down the way other features are implemented, for example by requiring that all fonts are embedded in the document. This can be illustrated with the simple Venn diagram below, which shows the feature sets of the aforementioned PDF flavours:

PDF Venn diagram

Here we see how PDF/A-1 is a subset of PDF 1.4, which in turn is a subset of PDF 1.7. PDF/A-2 and PDF/A-3 (aggregated here as one entity for the sake of readability) are subsets of PDF 1.7, and include all the features of PDF/A-1.

Keeping this in mind, it's easy to see that migrating an arbitrary PDF to PDF/A can result in problems.

Loss, alteration during migration

Suppose, as an example, that we have a PDF that contains a movie. This is prohibited in PDF/A, so migrating to PDF/A will simply result in the loss of the multimedia content. Another example is fonts: all fonts in a PDF/A document must be embedded. But what happens if the source PDF uses non-embedded fonts that are not available on the machine on which the migration is run? Will the migration tool exit with a warning, or will it silently substitute some alternative, perhaps similar, font? And how do you check for this?
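
How you check for this depends on your toolchain, but as a rough illustration, the sketch below walks the page font dictionaries of the source PDF and flags fonts whose FontDescriptor carries no embedded font programme (FontFile, FontFile2 or FontFile3). The choice of the pikepdf library and the file name are my own assumptions, not anything prescribed here, and the sketch deliberately ignores some of the more exotic font constructs the PDF specification allows.

# Rough pre-migration check (a sketch, assuming a recent pikepdf):
# report fonts that have no embedded font programme.
import pikepdf


def font_is_embedded(font):
    # A font is embedded if its FontDescriptor has a FontFile* entry.
    if str(font.get("/Subtype", "")) == "/Type0":
        # Composite font: the glyph data lives in the descendant font.
        descendants = font.get("/DescendantFonts")
        if descendants is not None and len(descendants) > 0:
            font = descendants[0]
    descriptor = font.get("/FontDescriptor")
    if descriptor is None:
        return False  # e.g. one of the standard 14 fonts, never embedded
    return any(key in descriptor
               for key in ("/FontFile", "/FontFile2", "/FontFile3"))


with pikepdf.open("source.pdf") as pdf:  # hypothetical input file
    for page_no, page in enumerate(pdf.pages, start=1):
        resources = page.obj.get("/Resources")
        fonts = resources.get("/Font") if resources is not None else None
        if fonts is None:
            continue
        for name, font in fonts.items():
            if not font_is_embedded(font):
                print(f"page {page_no}: font {name} is not embedded")

Running a similar inventory over the migrated PDF/A would at least reveal whether the migration tool substituted or dropped fonts along the way.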

Complexity and effect of errors

Also, migrations like these typically involve a complete re-processing of the PDF's internal structure. The format's complexity implies that there's a lot of potential for things to go wrong in this process. This is particularly true if the source PDF contains subtle errors, in which case the risk of losing information is very real (even though the original document may be perfectly readable in a viewer). Since we don't really have any tools for detecting such errors (i.e. a sufficiently reliable PDF validator), these cases can be difficult to deal with. Some further considerations can be found here (the context there is slightly different, but the risks are similar).

Digitised vs born-digital

The origin of the source PDFs may be another thing to take into account. If PDFs were originally created as part of a digitisation project (e.g. scanned books), the PDF is usually little more than a wrapper around a bunch of images, perhaps augmented by an OCR layer. Migrating such PDFs to PDF/A is pretty straightforward, since the source files are unlikely to contain any features that are not allowed in PDF/A. At the same time, this also means that the benefits of migrating such files to PDF/A are pretty limited, since the source PDFs weren't problematic to begin with!

The potential benefits of PDF/A may be more obvious for a lot of born-digital content; however, for the reasons listed in the previous section, the migration is more complex, and there's just a lot more that can go wrong (see also here for some additional considerations).

Conclusions

Although migrating PDF documents to PDF/A may look superficially attractive, it is actually quite risky in practice, and it may easily result in unintentional data loss. Moreover, the risks increase with the number of preservation-unfriendly features, meaning that the migration is most likely to succeed for source PDFs that weren't problematic to begin with – which defeats the very purpose of migrating to PDF/A. For specific cases, migration to PDF/A may still be a sensible approach, but the expected benefits should be weighed carefully against the risks. In the absence of stable, generally accepted tools for assessing the quality of PDFs (both source and destination!), it would also seem prudent to always keep the originals.

Meet SCAPE, APARSEN and many more….

SCAPE and APARSEN have joined forces and are hosting a free workshop, ‘Digital Preservation Sustainability on the EU Policy Level’, in connection with the upcoming DL2014 conference in London.

The first part of the workshop will be a panel session at which David Giaretta (APARSEN), Ross King (SCAPE), and Ed Fay (OPF) will be discussing digital preservation.

After this a range of digital preservation projects will be presented at different stalls. This part will begin with an elevator pitch session at which each project will have exactly one minute to present their project.

Everybody is invited to visit all stalls and learn more about the different projects, their results and thoughts on sustainability. At the same time there will be a special ‘clinic’ stall at which different experts will be ready to answer any questions you have on their specific topic – for instance PREMIS metadata or audit processes.

The workshop takes place at City University London, 8 September 2014, 1pm to 5pm.

Looking forward to meeting you!

 

Read more about the workshop

Register for the workshop (please note: registration for this workshop should not be done via the DL2014 registration page)

Read more about DL2014

Oh, did I forget? We also have a small competition going on… Read more.

 

When is a PDF not a PDF? Format identification in focus.

In this post I'll be taking a look at format identification of PDF files and highlighting a difference in opinion between format identification tools. Some of the details are a little dry but I'll restrict myself to a single issue and be as light on technical details as possible. I hope I'll show that once the technical details are clear it really boils down to policy and requirements for PDF processing.

Assumptions

I'm considering format identification in its simplest role: first contact with a file about which little, if anything, is known. In these circumstances the aim is to identify the format as quickly and accurately as possible, then pass the file to format-specific tools for deeper analysis.

I'll also restrict the approach to magic number identification rather than trusting the file extension; more on this a little later.

Software and data

I performed the tests using the selected govdocs corpora (that's a large download BTW) that I mentioned in my last post. I chose four format identification tools to test:

  • the fine free file utility (also known simply as file),
  • DROID,
  • FIDO, and
  • Apache Tika.

I used versions that were as up to date as possible, but will spare you the details until I publish the results in full.

So is this a PDF?

There was plenty of disagreement between the results from the different tools; I'll be showing these in more detail at our upcoming PDF event. For now I'll focus on a single issue: there is a set of files that FIDO and DROID don't identify as PDFs but that file and Tika do. I've attached one example to this post; Google Chrome won't open it but my Ubuntu-based document viewer does. It's a three-page PDF about rumen microbiology, and this was obviously the intention of the creator. I've not systematically tested multiple readers yet, but LibreOffice won't open it while Ubuntu's print preview will. Feel free to try the reader of your choice and comment.

What's happening here?

It appears we have a malformed PDF, and that is indeed the case. The issue is caused by a difference in the way the tools go about identifying PDFs in the first place. This is where it gets a little dull, but bear with me. All of these tools use "magic" or signature-based identification, meaning that they look for (hopefully) unique strings of characters at specific positions in the file to work out the format. Here's the Tika 1.5 signature for PDF:

<match value="%PDF-" type="string" offset="0"/>

What this says is: look for the string %PDF- (the value) at the start of the file (offset="0") and, if it's there, identify the file as a PDF. The attached file indeed starts with:

%PDF-1.2

meaning it's a PDF, version 1.2. Now we can have a look at the corresponding DROID signature (signature file version 77) for PDF 1.2:

<InternalSignature ID="125" Specificity="Specific">
    <ByteSequence Reference="BOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="0" SubSeqMinOffset="0">
            <Sequence>255044462D312E32</Sequence>
            <DefaultShift>9</DefaultShift>
            <Shift Byte="25">8</Shift>
            <Shift Byte="2D">4</Shift>
            <Shift Byte="2E">2</Shift>
            <Shift Byte="31">3</Shift>
            <Shift Byte="32">1</Shift>
            <Shift Byte="44">6</Shift>
            <Shift Byte="46">5</Shift>
            <Shift Byte="50">7</Shift>
        </SubSequence>
    </ByteSequence>
    <ByteSequence Reference="EOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="1024" SubSeqMinOffset="0">
            <Sequence>2525454F46</Sequence>
            <DefaultShift>-6</DefaultShift>
            <Shift Byte="25">-1</Shift>
            <Shift Byte="45">-3</Shift>
            <Shift Byte="46">-5</Shift>
            <Shift Byte="4F">-4</Shift>
        </SubSequence>
    </ByteSequence>
</InternalSignature>

This is a little more complex than Tika's signature, but what it says is that a matching file should start with the string %PDF-1.2, which our sample does. This is the first <ByteSequence Reference="BOFoffset"> section, a beginning-of-file offset. Crucially, this signature adds another condition: that the file contains the string %%EOF within 1024 bytes of the end of the file. There are two things that are different here.
 
The change in start condition, i.e. Tika's "%PDF-" vs. DROID's "%PDF-1.2", supports DROID's capability to identify versions of formats. Tika simply detects that a file looks like a PDF and returns the application/pdf MIME type, so it has a single signature for the job. DROID can distinguish between versions and so has 29 different signatures for PDF. It's also NOT the cause of the problem. The disagreement between the results is caused by DROID's requirement for a valid end-of-file marker, %%EOF. A hex search of our PDF confirms that it doesn't contain an %%EOF marker.
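
To make the difference concrete, here is a small illustrative sketch of the two matching strategies (this is not the actual code of either tool): a header-only check in the spirit of Tika's signature, and a check that additionally requires %%EOF within the last 1024 bytes, in the spirit of DROID's.

# Illustrative only: the two matching strategies, not Tika's or DROID's code.
def is_pdf_header_only(path):
    """Tika-style check: '%PDF-' at offset 0 is enough."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def is_pdf_header_and_eof(path):
    """DROID-style check: '%PDF-' at offset 0 plus '%%EOF' somewhere in
    the last 1024 bytes of the file."""
    with open(path, "rb") as f:
        if f.read(5) != b"%PDF-":
            return False
        f.seek(0, 2)                      # jump to the end of the file
        size = f.tell()
        f.seek(max(0, size - 1024))
        return b"%%EOF" in f.read()

Against the attached sample, the first check succeeds while the second fails, which is exactly the disagreement observed between the tools.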

So who's right?

An interesting question. The PDF 1.3 Reference states:

The last line of the file contains only the end-of-file marker, %%EOF. (See implementation note 15 in Appendix H.)

The referenced implementation note reads:

3.4.4, “File Trailer”
15. Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

So DROID's signature is indeed to the letter of the law plus amendments. It's really a matter of context when using the tools. Does DROID's signature introduce an element of format validation to the identification process? In a way yes, but understanding what's happening and making an informed decision is what really matters.

What's next?

I'll be putting some more detailed results onto GitHub along with a VM demonstrator. I'll tweet and add a short post when this is finished; it may have to wait until next week.
