OPF News & Overview


This page gives an overview of the content across all the different wiki sites hosted by the OPF.


Open Planets Foundation News

Open Planets Foundation
(The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium.)
Sound Challenge: And the Easter Egg goes to ...

Before Easter we planned to do a correctness benchmark for Audio Migration QA, specifically targeting the new tool xcorrSound waveform-compare, see http://openplanetsfoundation.org/blogs/2012-07-09-xcorrsound-waveform-compare-new-audio-quality-assurance-tool. The migration tool used was FFmpeg (version 0.10). The tools used in the QA were FFprobe, JHove2 (version 2.0.0) and xcorrSound waveform-compare (version 0.1.0). The tool used for the workflow was Taverna, and the workflow is available from myExperiment.

Audio QA Workflow

The challenge in a correctness baseline test set for audio migration quality assurance is that audio migration errors are rare. We thus wanted to create a "simulated annotated data set", where each entry consists of a "test file" with a possible migration error, a "control file" without any error that we can use for comparison, an "annotation" telling us about the test file, and a "similar" attribute = true or false.
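As a rough illustration of such an entry (not the format actually used in the repository), the four parts could be modelled as below. The class name, field names and example files are assumptions made for this sketch; the scoring function simply counts how often a tool's verdict agrees with the annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedEntry:
    """One entry of a simulated annotated data set (hypothetical layout)."""
    test_file: str      # file with a possible (simulated) migration error
    control_file: str   # error-free file used for comparison
    annotation: str     # free-text description of the test file
    similar: bool       # expected verdict: should the pair count as similar?


def correctness(entries, verdicts):
    """Return the fraction of entries whose tool verdict matches the annotation.

    `verdicts` maps a test file name to the boolean result reported by the tool.
    """
    if not entries:
        raise ValueError("no entries to score")
    hits = sum(1 for e in entries if verdicts[e.test_file] == e.similar)
    return hits / len(entries)


# Hypothetical entries and verdicts, for illustration only.
entries = [
    AnnotatedEntry("noise.wav", "original.wav", "short burst of noise", False),
    AnnotatedEntry("padded.wav", "original.wav", "silence appended at the end", True),
]
verdicts = {"noise.wav": False, "padded.wav": False}
print(correctness(entries, verdicts))  # 0.5
```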

In connection with large scale experiments in November 2012 (http://wiki.opf-labs.org/display/SP/EVAL-LSDR6-1), we did succeed in finding a migration error using waveform-compare. It turned out that this was caused by a bug in the old version of FFmpeg (0.6-36_git20101002), which we had used. This bug had been fixed in the 0.10 version. The error was a short bit of noise. We of course made a test file testing for this type of error.

We also experienced that different conversion tools added different (not audible) short pieces of silence to the end of the files. The waveform-compare tool reported these as 'not similar', but we decided that these files are similar, and the tool was updated accordingly. We also created test files with short bursts of silence in different places to test this.

We then had 5 different test files based on one original, a snippet of "Danmarks Erhvervsradio" (Danish Business Radio) from a show called "Børs & Valuta med Amagerbanken" from 8th of January 2002 from 8:45 till 9 am. This file is available from https://github.com/statsbiblioteket/xcorrsound-test-files/raw/master/DER259955_ffmpeg.wav.

The Challenge

We thought our test data set was meager and decided that we needed a more diverse annotated dataset for the correctness benchmarking. This was accomplished by issuing an internal challenge at the Digital Preservation Technologies Department at SB. The challenge was, given a correct wav file, to introduce a simulated migration error, that our workflow would miss. The given original file was a snippet of a Danish Radio P3 broadcast from October 31st 1995 approximately 8:13 till 8:15 am with the end of a song, then a short bit of talking, and then the beginning of the new song. You can download the file from here https://github.com/statsbiblioteket/xcorrsound-test-files/raw/master/original.mp3 and listen to it. The reward was a chocolate Easter egg to anyone who succeeded.

This resulted in 23 new and very different test files. The full simulated annotated dataset used for the correctness benchmarking thus consists of 28 test files and 2 comparison files along with annotations and is available from Github https://github.com/statsbiblioteket/xcorrsound-test-files.

Experiments and Results

The experiences from the correctness benchmarking showed that the classification of files into similar and not similar is certainly debatable. We decided to reward everyone with a small Easter egg for participating, and we then chose a few remarkable contributions and awarded them bigger Easter eggs :)

The first big Easter egg went to the challenge-pmd.wav test file, which has a hidden jpg image in the least significant bits in the wave file. The difference in challenge-pmd.wav and challenge.wav is not audible to the human ear, and is only discovered by the waveform-compare tool if the match threshold is set to at least 0.9999994 (the default is 0.98). We think these files are similar! This means that our tool does not always 'catch' hidden images. For the fun story of how to hide an image in a sound file, see http://theseus.dk/per/blog/2013/04/07/hiding-stuff/.


The second big Easter egg went to challenge-TE-2.wav test file, which was made setting Audacity Compressor 10:1 and amplify -5, which is similar to radio broadcast quality loss. The difference between challenge-TE-2.wav and challenge.wav is audible, but only discoverable with threshold>=0.99. The question is whether to accept these as similar. They certainly are similar, but this test file represents a loss of quality, and if we accept this loss of quality in a migration once, what happens if this file is migrated 50 times? The annotation is similar=true, and this is also our test result with default threshold=0.98, but perhaps the annotation should be false and the default threshold should be 0.99?
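The role of the threshold in the two cases above can be sketched as follows. This assumes the tool reports a single match value that is compared against the threshold; the match values below are invented to mirror the behaviour described and are not measured results.

```python
def is_similar(match_value, threshold=0.98):
    """Treat two files as similar when the match value reaches the threshold."""
    return match_value >= threshold


# Hypothetical match values, chosen only to reproduce the described behaviour.
match_values = {
    "challenge-pmd.wav": 0.9999993,   # only flagged at threshold >= 0.9999994
    "challenge-TE-2.wav": 0.985,      # only flagged at threshold >= 0.99
}

for threshold in (0.98, 0.99, 0.9999994):
    flagged = [name for name, value in match_values.items()
               if not is_similar(value, threshold)]
    print(threshold, flagged)
```

With the default of 0.98 nothing is flagged; raising the threshold to 0.99 flags only the compressed file, and only the very strict setting flags the file with the hidden image.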

And then there is challenge-KFC-3.wav, where one channel is shifted a little less than 0.1 second, the file then cut to original length and both file and stream header updated to correct length. The difference here is certainly audible, and the test file sounds awful. The waveform-compare tool, however, only compares one channel (default channel 0) and outputs success with offset 0. The correctness benchmark result is thus similar=true, which is wrong. If waveform-compare is set to compare channel 1, it again outputs success, but this time with an offset of 3959 samples (about 82 milliseconds, as the sample rate is 48 kHz). This suggests that the tool should be run on both (all) channels, and the offsets compared. This may be introduced in future versions of the workflows. Unfortunately this entry was late, so no Easter egg was awarded :(
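The suggested per-channel check can be sketched like this. It is not part of the workflow; the function names and the zero tolerance are assumptions, and the offsets are the two values reported above.

```python
def samples_to_ms(offset_samples, sample_rate_hz):
    """Convert an offset in samples to milliseconds."""
    return 1000.0 * offset_samples / sample_rate_hz


def channels_agree(offsets_by_channel, tolerance_samples=0):
    """Return True when all channels report (nearly) the same offset."""
    values = list(offsets_by_channel.values())
    if not values:
        raise ValueError("no channel offsets given")
    return max(values) - min(values) <= tolerance_samples


# Offsets reported for challenge-KFC-3.wav: channel 0 and channel 1.
offsets = {0: 0, 1: 3959}
print(round(samples_to_ms(3959, 48000), 1))  # 82.5
print(channels_agree(offsets))               # False
```

A pair whose channels disagree on the offset would then be reported as not similar, even though each channel on its own passes.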

A Few Notes

Some settings are also relevant for the checks of 'trivial' properties. For instance, we allow some slack for the duration. This was introduced because we had earlier seen tools insert a short bit of silence at the end of the file during a migration. The default slack used in the benchmark is 10 milliseconds, but this may still be too little slack depending on the migration tool.
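A duration check with slack could look like the sketch below. Durations are given in whole milliseconds to avoid rounding surprises; the function name and the example durations are made up for illustration.

```python
def durations_match(original_ms, migrated_ms, slack_ms=10):
    """Accept two durations as equal if they differ by at most `slack_ms`."""
    return abs(original_ms - migrated_ms) <= slack_ms


print(durations_match(120000, 120008))  # True: 8 ms of padding is tolerated
print(durations_match(120000, 120025))  # False: 25 ms exceeds the default slack
print(durations_match(120000, 120025, slack_ms=50))  # True with more slack
```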

The full test results are attached :)

Adventures in setting up access controls in a Fedora Commons repository

We have been evaluating the use of the latest Fedora Commons, version 3.6.2, as a test repository.  Having followed the straightforward installation process we were left with a repository with one preconfigured user – fedoraAdmin. 

There are two APIs – API-A for access and API-M for management. For our test instance API-A was configured on installation to require a log in, but it can be configured to require no log in. It appeared that whilst the REST API for API-A was restricted, the SOAP API for API-A was not; this was corrected by using the example policy described below. Investigations of how to configure multiple users are also detailed.

Access controls/users

Fedora Commons security makes use of XACML, a verbose access control policy language written in XML.  It is possible to define a user type/profile in XACML and create users with that profile in fedora-users.xml.  A policy writing guide and vocabulary are available.

XACML will soon be/has been superseded by the Fedora Security Layer (FeSL), which uses “an alternative XACML policy enforcement engine”. FeSL authentication is the default authentication mechanism; however, FeSL authorisation is currently marked experimental.

The Fedora XACML Policy Writing Guide states “It should be noted that to help users who do not wish to learn native XACML, a Policy Authoring Client is currently under development that will provide an easy graphical user interface for creating XACML policies for Fedora.”  However, when searching for the “Policy Authoring Client” I found an unanswered question sent to the Fedora Commons development mailing list querying its status, in April 2007, so I am unsure of where to find it.

An example policy that restricts API-A to users with “administrator” or “professor” roles can be downloaded from deny-apia-if-not-tomcat-role.xml.  A change of the type from “professor” to “apiauser”, and adding a user in fedora-users.xml with that fedoraRole value will allow access to API-A for that user.
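For illustration, adding such a user can be scripted. The sketch below assumes that fedora-users.xml has a users/user/attribute/value layout with the role stored under an attribute named fedoraRole; check this against your own file before relying on it. The user name and password are placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed layout of fedora-users.xml; verify against your installation.
SAMPLE = """<users>
  <user name="fedoraAdmin" password="changeme">
    <attribute name="fedoraRole">
      <value>administrator</value>
    </attribute>
  </user>
</users>"""


def add_user(users_root, name, password, role):
    """Append a user element carrying a single fedoraRole value."""
    user = ET.SubElement(users_root, "user", {"name": name, "password": password})
    attribute = ET.SubElement(user, "attribute", {"name": "fedoraRole"})
    ET.SubElement(attribute, "value").text = role
    return user


root = ET.fromstring(SAMPLE)
add_user(root, "apiaReader", "changeme", "apiauser")  # placeholder credentials
print(ET.tostring(root, encoding="unicode"))
```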

There may well be other aspects of the repository that need securing, too.

Other configuration issues

We chose to configure Fedora Commons to be reverse proxied through an Apache web server. This leads to some issues about configuring the “Base URL” for the repository. Only the IP address/hostname can be set; the externally facing http transport type and port cannot be configured in the Fedora Commons settings. This required a small kludge where Tomcat was configured to run on port 80.

If there are any corrections to the above, questions, or suggestions of other ways to configure the policies (e.g. a GUI) please do comment!

Apache Tika File Mime Type Identification and the Importance of Metadata

 


An evaluation was recently carried out to determine how well Apache Tika was able to identify the mime types of a corpus of test files, described in the ‘Data Set’ section. The purpose of the evaluation was to determine:

1.      if the performance* of Tika has changed between versions 1.0 and the current version, 1.3 and,

2.      how the provision of metadata, in the form of the file name, affects the performance of Tika.

In order to address the first point, the evaluation was carried out four times, once for each of the four available versions of Tika 1.0, 1.1, 1.2 and 1.3.

The second point was addressed by running the evaluation twice for each version of Tika; the first test passed only a file input stream to the Tika ‘detect’ method, the second test passed both the file input stream and the file name.

In total eight tests were carried out; the results are shown in the Results section below.

* For the purposes of this evaluation the performance of Tika is measured by the number of file mime types identified correctly when compared against a ground truth, described in the ‘Data Set’ section.
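The shape of such an evaluation can be sketched in a few lines. This is not the Hadoop job used here and it does not call Tika: the two detector functions are simplified stand-ins (a magic-byte check and a small extension table), and the files and ground truth are invented.

```python
def evaluate(detect, files, ground_truth):
    """Count how many detected mime types agree with the ground truth."""
    if not files:
        raise ValueError("no files to evaluate")
    correct = sum(1 for name, data in files.items()
                  if detect(name, data) == ground_truth[name])
    processed = len(files)
    return {
        "processed": processed,
        "correct": correct,
        "incorrect": processed - correct,
        "percent_correct": round(100.0 * correct / processed, 3),
    }


def detect_from_stream(name, data):
    """Stand-in for content-only detection: ignores the file name."""
    return "application/pdf" if data.startswith(b"%PDF-") else "text/plain"


EXTENSION_HINTS = {".pdf": "application/pdf", ".csv": "text/csv"}  # hypothetical


def detect_with_name(name, data):
    """Stand-in for detection that also consults the file name."""
    for extension, mime in EXTENSION_HINTS.items():
        if name.lower().endswith(extension):
            return mime
    return detect_from_stream(name, data)


files = {"report.pdf": b"%PDF-1.4 ...", "table.csv": b"a,b\n1,2\n", "notes.txt": b"hello"}
truth = {"report.pdf": "application/pdf", "table.csv": "text/csv", "notes.txt": "text/plain"}
print(evaluate(detect_from_stream, files, truth))
print(evaluate(detect_with_name, files, truth))
```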

Data Set

The set of test files consists of a Govdocs corpus of almost 1 million files, freely available from http://digitalcorpora.org/corpora/files. The ground truth for these files has been provided by Forensic Innovations, Inc. available from http://digitalcorpora.org/corp/files/govdocs1/groundtruth-fitools.zip.

Platform

The evaluation was run as a Cloudera Hadoop map/reduce process on an HP ProLiant DL 385p Gen8 host with 32 CPUs, 224 GB of RAM and a clock rate of 2.295 GHz, using ESXi to run 32 virtual machines. The Hadoop configuration is a Cloudera (cdh4.2.0) Hadoop 30-node cluster consisting of a manager node, a master node and 28 slave nodes, located at the British Library in the UK. Each node runs on its own virtual machine with 1 core, 500 GB of storage and 6 GB of RAM.

Results

In total the evaluator process was run eight times on the Govdocs corpus.

The table below shows the number of files processed by Apache Tika, the number correctly identified, the number that were incorrectly identified and the percentage identified correctly. The ‘Filename Used?’ column indicates whether the Tika detect method was passed only the file input stream (‘N’), or passed both the file input stream and file name (‘Y’).

 

Test  Tika Version  Files Processed  Files Identified Correctly  Files Identified Incorrectly  Files Correctly Identified (%)  Filename Used?
1     1.0           973693           757326                      216367                        77.779                          N
2     1.1           973693           757240                      216453                        77.770                          N
3     1.2           973693           758549                      215144                        77.904                          N
4     1.3           973693           758557                      215136                        77.905                          N
5     1.0           973693           945555                      28138                         97.110                          Y
6     1.1           973693           945516                      28177                         97.106                          Y
7     1.2           973693           938138                      35555                         96.348                          Y
8     1.3           973693           938148                      35545                         96.349                          Y

Table 1 – File mime types identified correctly/incorrectly by Apache Tika

Observations

The results in Table 1 show that, when used with a file input stream only, the performance of Tika improves slightly between versions 1.0 and 1.3. However, when Tika is used with both a file name and a file input stream the performance degrades between versions 1.0 and 1.3.
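The size of these changes can be read off Table 1 directly. The short calculation below uses only the counts for versions 1.0 and 1.3 from the table; the variable and function names are ours.

```python
PROCESSED = 973693

# (Tika version, filename used?) -> files identified correctly, from Table 1.
CORRECT = {
    ("1.0", False): 757326,
    ("1.3", False): 758557,
    ("1.0", True): 945555,
    ("1.3", True): 938148,
}


def percent_correct(correct, processed=PROCESSED):
    """Percentage of processed files that were identified correctly."""
    return 100.0 * correct / processed


for filename_used in (False, True):
    before = CORRECT[("1.0", filename_used)]
    after = CORRECT[("1.3", filename_used)]
    print("filename used:", filename_used,
          "| change in correct files:", after - before,
          "| 1.0: %.3f%%" % percent_correct(before),
          "| 1.3: %.3f%%" % percent_correct(after))
```

Without the file name, version 1.3 identifies 1231 more files correctly than 1.0; with the file name it identifies 7407 fewer.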

Further investigation shows that the files that were identified correctly in Tika version 1.0 but identified incorrectly in version 1.3 were of the following types:

 

Tika v1.3 Mime Type                     Number of Files
application/msword                      1
application/octet-stream                61
application/rss+xml                     4
application/x-tika-msworks-spreadsheet  2
application/zip                         2
message/x-emlx                          10
text/plain                              6
text/x-log                              8107
text/x-matlab                           2
text/x-perl                             2
text/x-python                           4
text/x-sql                              295

Table 2 – Number of files identified correctly in version 1.0 but incorrectly in version 1.3

 

 

Further investigation of the files identified by Tika 1.3 as ‘text/x-log’ shows that these are text files with a file extension of ‘.log’. These files were identified by Tika versions 1.0 and 1.1 as having a mime type of ‘text/plain’, which matches the ground truth mime type. Similarly, Tika versions 1.2 and 1.3, when used with just an input stream, also identified these files as ‘text/plain’, again matching the ground truth.

However, when Tika versions 1.2 and 1.3 were provided with the filename, they identified .log files as having a mime type of ‘text/x-log’. As the ‘text/plain’ group of files encompasses a large and diverse set of file types, including logs, source code, properties/config files, data files etc., this could be considered an improvement as it provides greater differentiation between the different file types.

Possible Future Work

The results of the tests show that Apache Tika relies heavily on the filename when carrying out file identification. In the future this work could be extended to investigate how easily Tika can be fooled into identifying a file wrongly after being provided with an incorrect or misleading file extension as part of the filename.

C3PO is ready for you

As you may or may not know, C3PO is a content profiling tool for preservation analysis.

It reads in characterisation meta data and gives you the possibility to aggregate it and/or to visualise it.
 
The first versions of C3PO generated quite a lot of interest within the digital preservation community, which was a clear sign to me that it has the potential to become a valuable asset to the standard digital preservation tool belt.
These first versions were more prototypical in nature: the problem was explored and the integration interfaces with other tools were defined.
 
Thanks to the SPRUCE Project and the award I won, I had the chance to spend a month of work on the codebase and to improve many issues. The goal was to create a stable version of the codebase with clearly defined interfaces and guides, as well as default implementations as examples. This should lower the entry barrier for third party developers, so that the community can take the code base and develop/extend it in any desired direction.
 
Today, I am releasing version 0.4 of the core and command line of C3PO. The codebase is completely documented and waiting to be starred and forked. If you just want to download it and use it, you can use the Bintray download link here.
As of today C3PO is released under the Apache 2.0 license.
 
Although the only change you can directly see is the new logo, this new version offers significant improvements in the core of the framework. I would like to give you a short overview of some of them and why they are important.
 
The most significant change is the abstraction of the persistence layer. C3PO uses a Mongo database as its default persistence backend. Many developers in this community have expressed concerns about the dependency on MongoDB and the tangled code. Well, version 0.4 completely abstracts the persistence layer. If you want to use a different backend, you have to implement a single class and plug it in - so if you are an HBase expert, please consider contributing :).
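To give a feel for what "implement a single class and plug it in" means, here is a minimal sketch of the idea in Python. It is not the C3PO interface (C3PO is a Java project, and the real class and method names will differ); every name below is hypothetical.

```python
from abc import ABC, abstractmethod
from collections import Counter


class PersistenceLayer(ABC):
    """Hypothetical backend interface; the real C3PO interface may differ."""

    @abstractmethod
    def insert(self, element):
        """Store one element (a dict of characterisation properties)."""

    @abstractmethod
    def find(self, predicate):
        """Return all stored elements for which predicate(element) is true."""


class InMemoryPersistence(PersistenceLayer):
    """Toy backend that keeps everything in a list."""

    def __init__(self):
        self._elements = []

    def insert(self, element):
        self._elements.append(dict(element))

    def find(self, predicate):
        return [e for e in self._elements if predicate(e)]


def count_by_format(store):
    """Aggregate over any backend, using only the abstract interface."""
    return dict(Counter(e["mimetype"] for e in store.find(lambda e: True)))


store = InMemoryPersistence()
for mimetype in ("application/pdf", "text/plain", "application/pdf"):
    store.insert({"mimetype": mimetype})
print(count_by_format(store))
```

The aggregation code never touches the backend directly, so swapping in a different store only requires another subclass.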
 
The second improvement is the new filtering capabilities. With these enhancements, users will be able to create more flexible filters, which should help them discover even more interesting aspects of their data. Once the Web application is updated to make use of these changes, UI responsiveness should improve significantly thanks to filter and result caching, alongside a number of bug fixes.
 
The third major improvement is that C3PO now allows consolidation of meta data coming from different sources. This means that if you have characterisation data coming from e.g. FITS and TIKA for the same digital objects (with the same identifier), C3PO will automatically consolidate the data. This will help to reduce the sparsity that we currently see in many different data sets.
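The idea of consolidation can be sketched as a merge keyed on the object identifier. This is only an illustration of the concept, not C3PO's implementation; the record layout and the property names are assumptions.

```python
def consolidate(records):
    """Merge property records from several tools by object identifier.

    Each value is kept per source, so conflicting values stay visible.
    """
    merged = {}
    for record in records:
        target = merged.setdefault(record["identifier"], {})
        for key, value in record["properties"].items():
            target.setdefault(key, {})[record["source"]] = value
    return merged


# Hypothetical characterisation records for the same object.
records = [
    {"identifier": "obj-1", "source": "FITS",
     "properties": {"mimetype": "application/pdf", "size": 1024}},
    {"identifier": "obj-1", "source": "Tika",
     "properties": {"mimetype": "application/pdf", "page_count": 3}},
]
print(consolidate(records))
```

After the merge, obj-1 has three properties instead of two from either tool alone, which is the reduction in sparsity described above.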
 
The new release also includes numerous other improvements and bug fixes in the core.
 
What does the future look like?
Well, I will continue to maintain the repository and will make sure the web application catches up with all these new changes of the core framework within the next months.
 
My former colleagues from the University of Technology in Vienna and partners from the SCAPE project (Thank you guys, you all rock) will continue to develop and maintain the codebase in order to overcome the next scalability boundaries.
 
A Roadmap for the foreseeable future can be found here. It will be updated in the coming weeks.
 
What can you do?
If you are a user and you think C3PO is or can be valuable to you or your institution, please try it out, give feedback (it is very important!), report issues and contribute to the ROADMAP.
 
If you are a developer, please star the repository on github and give the tool a try.
If you want to contribute, it is easy. You can start by reading the dev guide here. You can report issues here. Take a look at the open issues or at the Roadmap and pick up something you find useful.
For example, writing a new meta data adaptor is the easiest and requires implementing one method. If you have more time and knowledge about HBase, consider providing an HBase Persistence Layer. 
 
I hope that this will make C3PO more useful and that the community will not hesitate to take the tool, use it, tear it apart and shape it according to the current needs. If you have any questions or feedback, please drop me a line at petar@creativepragmatics.com
 
Last but not least, I want to thank the SPRUCE project and especially Paul Wheatley and Carl Wilson for the opportunity and for their help and support!

P.S. On the 31st of May, there will be a webinar on C3PO hosted by the OPF - if you are interested, please check it out. At the end, I will join and try to answer your questions.

We don’t do migration for the future; we do it for the present: Emulation and an ever so slightly unsatisfying success story

 

“#Migration: No one does it for the future; they do it (need to do it) for the now.” - https://twitter.com/beet_keeper/status/327968228276060160

 

Recently I was asked by a colleague to look at some files he’d been sent by Hutt City Council in New Zealand; an unknown format from a 1995 vintage IBM operating system – a format as yet unidentified by popular format identification tools.

As with most of these attempts to identify a format, we ran the files through DROID, ExifTool and the Unix file command. With none of them identifying the files, the search really began with a Google search of the file’s magic bytes:

2B 41 2B 56 2B 43 2B       +A+V+C+

A single result at the time provided little to go on; it confirmed someone had once asked the same question on a computer graphics forum. A few clues in the bitstream e.g. a potential font size and title, ‘Roman Bold 26’, and a few more Google searches meant that we could say these files were potentially proprietary to an IBM system as opposed to a file with a more open specification. Confirmation with the content provider gave us the original environment as OS/2.
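For anyone wanting to spot further examples of these files, a quick check for the byte sequence could look like the sketch below. It assumes the sequence sits at the very start of the file, which we have only observed in our own samples; it is not a substitute for a proper signature.

```python
# The seven bytes quoted above, which decode to the ASCII text "+A+V+C+".
MAGIC = bytes.fromhex("2B 41 2B 56 2B 43 2B")


def has_avc_magic(data):
    """Return True if the byte string starts with the observed sequence."""
    return data.startswith(MAGIC)


def file_has_avc_magic(path):
    """Read just enough of a file to test for the observed sequence."""
    with open(path, "rb") as handle:
        return has_avc_magic(handle.read(len(MAGIC)))


print(MAGIC)                                    # b'+A+V+C+'
print(has_avc_magic(b"+A+V+C+ rest of file"))   # True
print(has_avc_magic(b"GIF89a"))                 # False
```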

But that was it. We were staring at an obsolete format; definition: “A format, which, within our limited resourced world view at the time, we could no longer use.” 

Our final port of call was to see if we could put the format back into its original environment to observe it in its natural state.

From here, the process became much simpler. As it turned out, an OS/2 installation running on VirtualBox knew what to do with these files. It was able to render them natively in an application for handling IBM AVC (Audio Visual Connection) content. Even better than that, the context menu for these images gave us the option ‘Convert To’ with the following options available:

  • BMP (OS/2 Bitmap)
  • DIB (RIFF DIB Image)
  • GIF (GIF Image Compressed)
  • JPG (Baseline JPG)
  • PCX (PCX Image Compressed)
  • TGA (Truevision TARGA)
  • TIF (Tag Image File Format)
  • VID (IBM MMotion Still Video Image)

Variants existed under BMP, TGA and TIFF, for example OS/2 1.3 and 2.0 BMP and Motorola or Intel, Compressed or Uncompressed TIFF.

The context menu option also allowed for the bulk conversion of these images, so a single click gave us uncompressed TIFF images suitable for export.

Simple is of course a relative term, and although we had the images we wanted, there was a problem retrieving them from the emulated environment. Unable to successfully set up a shared drive to enable our Host OS to interact with VirtualBox, and unable to attach any form of writeable media, we were stuck.

The virtual machine was connected to the Internet but Netscape was unable to interact with modern websites particularly well. We were also unable to use FTP successfully, at least within the self-imposed timeframe we were working to.

Our final option was email. SMTP saved the day. We zipped the images using the still available Info-Zip tool and emailed them from a Gmail account back to itself using the OS-provided Netscape Messenger email client. This let us retrieve the images and immediately made them usable in a modern environment.

And that was it, job done!

But there is still more to this story.

Time Travel

It’s 1996. I boot up my OS/2 Warp 4.2 box. It’s being packed away today, ready for the new Pentium machines running Windows being rolled out by our IT department. Windows… *sigh* but my IT department wax lyrical about the improvements in performance and security. It’s just work, I’ve got a fishing trip at the weekend so I’ve other things to keep my mind off the IBM vs. Microsoft debate. Wait! I’d better make sure I’ve got all my files. Ah, those IM files I was looking at last year. Neat images; could come in handy again. Windows doesn’t support the format though. Hmm, right-click, convert. 300 files; IM to TIF - that’s going to take a few floppy disks! - Should be able to access them in a few applications though. Good!

What we did by grabbing hold of an OS/2 installation and VirtualBox was not create a solution we want to take into the future. It was us stepping back into 1996 for one time only. To create a version of a file we could take into 1997, and beyond, on a different platform. It is 1996 again and we’ve now got 300 TIFF files. As things move forward in 2013 we might start thinking about converting them to a new standard, PNG maybe to capitalize on space savings provided by lossless compression and also to make use of them on the web. Being an open standard (like TIF) might help to avoid a similar situation to our IM files in future as well. Whatever mechanism is best. It should be lossless and should give us the greatest potential for use moving forward.

Outside of the time travel context, with our images converted and the original provider of the materials happy with the work, we’re left with a success story, but an incomplete solution… an unsatisfying one.

An unsatisfying solution

At the end of this process we’re still left with a file format we don’t fully understand. I can’t migrate this format in a modern environment using modern tools. I can’t render it; I can’t really identify it with complete certainty. I can’t help matters and create a signature for it without really knowing more about where it came from and what its specification looks like. I do have enough examples from a single system to take apart some of the header and look for consistencies but is this precise enough for what we’re attempting to achieve in Digital Preservation? Maybe, for an experimental DROID signature file.

As for the completed migration, with no validation tools available I can’t look at the internals of this format and guarantee I know what was lost between the conversions from IM to TIFF – I do know I’ve lost something though – what were those references to font? They’re no longer in the TIF output, and what other plain text did I spot in the bitstream that might mean something? A part of the bitstream annotated as ‘TEXT’, another ‘HEAD’- fields pertaining to the DB/2 conversion described by the provider?

In short:

  • We can’t validate the success of the conversion beyond the rendered image
  • We haven’t isolated a specification for this format
  • We haven’t an ability to express a signature in current production identification systems
  • We cannot render IM files in a modern environment
  • The mechanism of transfer from the emulated environment to our Host OS was certainly not a preferred route

As many a school report might say – could do better. The end result of this process is that we have some images that can now be reused by the original content provider. We can also say with a little confidence that we know what format these images were originally: IBM AVC Still Video Image. I’ll leave it up to the comments section of this blog to suggest ways forward from here. The main message for me, however, is that for this to be considered a satisfactory result for digital preservation, one or more of these issues would need to have been solved as part of the process – a file format signature would be something, some idea about what the header says would be good, and a deeper analysis of the format structure even better. What would be really nice is an understanding of whether it might be possible to create a migration tool for this format in future, with some idea about what the original specification for the format suggests about the feasibility of doing that.

Other Solutions

Before I conclude, we did consider two other options which with further investigation might help us in the short term.

  • eComStation is a modern operating system, based on OS/2. In an emulated environment this might give us better methods of extracting the files, for example USB support, better access to file upload websites, and even the opportunity to set up a shared drive between it and the Host OS. We did try to convert the images using eComStation and found that it worked, and even provided PNG as an export format – what had been lost in translation, however, was the bulk processing capability – this left us wondering whether we’d need to create a MS-DOS based Batch Script to do this routine, or even use REXX – IBM’s own interpreted programming language native to the environment. 
  • Exporting the OS/2 executable for the native image viewer or even converter into Windows may have worked providing they had originally been written to be compatible with Windows and not just OS/2. Highly unlikely but we did have success running an Aldus PhotoStyler executable found in the user directories sitting alongside the original image files.

Migration for the Now

This was an interesting use case. It was nice to have the time to look at a problem outside of the context of the government records we’re expected to look after at Archives New Zealand. There was no expectation of this result, just some files to play around with and see what we could do.

There were a number of lessons alluded to above – goals that we should strive for in digital preservation.

For me, despite this solution relying wholly on emulation, what I really learned was the value of migration. Stepping back into 1996 allowed me to migrate my files to a format I could still use in 2013. I believe the same of file formats now. Any file formats that I have a doubt about, whether because they are proprietary, because of some measure (objective or otherwise) of over-complexity, or simply because they are not widely adopted, I should be thinking about migrating. It might be the difference between a future Digital Preservation Analyst having to emulate my XP environment and find an obscure way to transfer files from it, and the alternative of simply being able to render them natively within their own modern OS.

Developing in the Open

This is my first, long overdue blog post since starting my new role as Software Configuration Manager for OPF at the start of the year. Truth be told, between the SCAPE end of year and review, a week's holiday, and working out what to do, it doesn't feel like four months since I started. I'm the OPF's first full time technical team member and will be dividing my time between:

  • improving the quality of software developed under the OPF's GitHub Organisation's banner, and other open source tools used by members.
  • offering guidance and assistance to developers working on OPF software with best practices, and use of online tools.

  • helping to improve user documentation of software so that it’s easier to find and use.

  • showing members how to help shape future development of tools they use, by helping to convert requirements into developer tasks and automated tests.

  • engaging with the developers of open source digital preservation projects in order to share ideas, and software.

  • providing technical expertise and meeting members at OPF Hackathons and other events.

  • contributing to external projects the OPF is involved in, e.g. SCAPE & SPRUCE.

The OPF GitHub page currently lists 50 public projects. To put that in perspective I could afford a week of effort a year per project, if I did nothing else and took no holidays. In reality it would be no more than 2 days a year per project. The projects are in varying states of activity, are written in different programming languages e.g. Java, Ruby, Python, PHP, and some aren't software projects at all. Between other tasks I've started to update the OPF's current development guidelines, and added some guidance on the OPF’s GitHub policy. This includes standard practices that should be adopted by all OPF GitHub projects. The main concerns for new projects are:

  • create a descriptive (preferably GitHub markdown) README file.

  • clearly state the license terms of the project in a LICENSE file.

  • create a small YAML file listing some basic project metadata.

Adding this information makes it easy for somebody to find out what the project does, whether they have permission to use it, and whom to contact if they have problems.

I’ve also written a little code that uses the GitHub API to create a web page that gives an overview of the OPF’s GitHub projects, providing warnings where projects don’t follow the OPF’s GitHub policy.  The generated page can be found here and is currently updated once a day.
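The checks themselves are simple. The sketch below is not the code behind the generated page and makes no GitHub API calls: it takes a list of file names from a project's root and returns warnings. The metadata file name `.opf.yml` is a placeholder chosen for this example, as are the project names.

```python
def policy_warnings(file_names, metadata_file=".opf.yml"):
    """Return a list of warnings for a project, given its root file names.

    `metadata_file` is a placeholder name for the project metadata YAML file.
    """
    lowered = {name.lower() for name in file_names}
    warnings = []
    if not any(name.startswith("readme") for name in lowered):
        warnings.append("missing README")
    if not any(name.startswith("license") for name in lowered):
        warnings.append("missing LICENSE")
    if metadata_file.lower() not in lowered:
        warnings.append("missing project metadata file " + metadata_file)
    return warnings


# Hypothetical projects and their root directory listings.
projects = {
    "tool-a": ["README.md", "LICENSE", ".opf.yml", "pom.xml"],
    "tool-b": ["readme.txt", "src"],
}
for name, files in sorted(projects.items()):
    print(name, policy_warnings(files))
```

In a real run, the file listings would come from the GitHub API rather than from a hard-coded dictionary.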

I’m now working on guidelines for using Travis-CI, the online continuous integration service, and hosting binary packages on BinTray. As I complete new sections I’ll also create a blog post giving a few more details. A recent OPF webinar tries to give the full picture; the slides are available on the Wiki.

I’ll wrap up this post by saying that I’m happy to take further suggestions and answer questions; just drop me an email or IM me. I’m happy to provide direct assistance to members who require it. It’s also nice to meet members in person, so I’ll be attending all OPF events where possible, starting with the Hackathon in Copenhagen next week. Oh, and I promise to blog a little more often...

Getting FITS into shape

The Harvard Library developed FITS, the File Information Tool Set, as part of the ingest processing of its Digital Repository Service (DRS). This was mostly Spencer McEwen's work. It's a "Swiss army knife," running a number of different tools to identify formats and provide metadata information about files. It was put up on Google Code as open source, and a number of other institutions have started using it.

Harvard hasn't had the time to update it to a more broadly useful project, but thanks to a SPRUCE Award, I've been spending April making various updates and fixes to it, with the results currently available on Github. That repository is a temporary way station for it; these changes will be merged into an institutionally maintained repository, though just where hasn't been determined yet.

The first task I undertook was adding Apache Tika as a new tool. The work on this started at the OPF Hackathon in Leeds. The advantage of Tika is that not only does it already cover a lot of formats, but it's actively maintained, so we can expect support for more formats in future releases. FITS is a Java application, and Tika is a reasonably well-documented Java library, so getting it to work wasn't very hard. The main complication was that Tika's output vocabulary is sprawling and undocumented, so there's no good way to tell what properties it might report in previously untested cases. This makes it more difficult to translate Tika terms into standard FITS output.

Several of the tools FITS uses were out of date. JHOVE hadn't been brought up to its latest version because attempts to do so produced less metadata than version 1.5 did. This turned out to be because JHOVE had updated to the current MIX 2.0 schema, and FITS was still trying to interpret it as MIX 0.2. Once the problem was found, the fix was obvious.
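A mismatch like this can be caught early by looking at the XML namespace of the MIX element before interpreting it. The snippet below is a minimal sketch in Python (FITS itself is Java, and this is not its code): `mix_version` is a hypothetical helper, and the namespace URIs in the table are the ones understood to be published for MIX 1.0 and 2.0, so they should be treated as assumptions and checked against the schemas.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the MIX schema versions.
KNOWN_MIX_NAMESPACES = {
    "http://www.loc.gov/mix/v10": "1.0",
    "http://www.loc.gov/mix/v20": "2.0",
}


def mix_version(xml_text):
    """Return the MIX version of the first <mix> element, 'unknown', or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # ElementTree writes namespaced tags as "{namespace}localname".
        if element.tag.startswith("{"):
            namespace, local_name = element.tag[1:].split("}", 1)
            if local_name == "mix":
                return KNOWN_MIX_NAMESPACES.get(namespace, "unknown")
    return None  # no namespaced <mix> element found


if __name__ == "__main__":
    sample = '<wrapper><mix:mix xmlns:mix="http://www.loc.gov/mix/v20"/></wrapper>'
    print(mix_version(sample))
```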

DROID was a more difficult case. FITS was using DROID 3, and DROID 6 was vastly changed, to the point that FITS got numerous compilation errors after dropping in DROID 6. DROID has no public API documentation, making things difficult. Matt Palmer, who has worked on DROID development, provided vital help in figuring out how to call the current version.

Some issues in efficiency turned up. DROID uses an XML signature file to identify files. It's big, and parsing it took over 13 seconds on my computer. If FITS is run on a large directory, the time cost is spread out over a lot of files, but this is a problem if it's run on one file or a small directory. Hopefully there will be optimizations, perhaps a persistent serialized cache, in future versions of DROID.
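The idea of a persistent serialized cache can be sketched as follows. This is an illustration in Python rather than anything DROID or FITS actually does: `parse_signatures` is a stand-in for the slow parsing step, and both function names and the cache file are assumptions for the example.

```python
import os
import pickle
import xml.etree.ElementTree as ET


def parse_signatures(xml_path):
    """Stand-in for the slow step: parse the XML into (tag, attributes) pairs."""
    root = ET.parse(xml_path).getroot()
    return [(element.tag, dict(element.attrib)) for element in root.iter()]


def load_signatures(xml_path, cache_path):
    """Return the parsed data, reusing the cache if it is at least as new as the XML."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(xml_path)):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    # Cache is missing or stale: parse once and store the result for next time.
    data = parse_signatures(xml_path)
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle)
    return data
```

The first call pays the full parsing cost and writes the cache; later calls only deserialize it, which is what would help the single-file case. Comparing modification times is a simple staleness test; a checksum of the signature file would be more robust.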

The National Library of New Zealand's metadata tool was more problematic. An attempt to bring it up from version 3.4GA to 3.5GA ran into problems similar to the ones with DROID, with classes having been changed. Apparently this tool isn't being actively maintained, and I wasn't able to get the information needed to do the update. It's staying at 3.4GA in FITS.

Another task was improving the metadata vocabulary for video. FITS output isn't much more than a flat set of properties, so it wasn't possible to adopt any other schema full-blown, but ideas were used from a number of sources, including MediaInfo, Archivematica, and PBCore. Exiftool is currently the best of the tools for reporting video properties, so the output was shaped by what it can produce. Hopefully other tools, such as Tika, will produce more information on video files in future versions.

Documentation is an important part of any open source project, but one that often gets low priority. I did some work on the Javadoc and added documentation in the wiki pages of the Github repository. In particular, there are instructions on how to add a new tool to FITS.

Hopefully this work will make FITS a more useful tool, both for Harvard and for its other users.

Hack to preserve: increasing your organisational competence

While the digital preservation challenge is caused by technology, it is not solved by technology. Many research projects started out with the ambition to devise a technology solution (migration, emulation, encapsulation, etc.) and many memory institutions thought it would suffice to apply the R&D results: the methods and associated tools. However, it has become clear that such all-encompassing solutions do not exist. In addition, many tools and approaches have not survived the R&D stage. So, while R&D remains important to conduct research in specific, well-defined problem areas, it is not the main driving force behind digital preservation.

Although OPF originates from a research project and continues to foster R&D, its philosophy of digital preservation concentrates less on technology as a solution and more on growing digital competence as a long-term approach to digital preservation. In previous blogs I gave some background on this philosophy, which aims to

1)    foster learning by doing as a means to develop skills and expertise in an area where best practices and standards have not yet matured and where research plays an important supportive role;

2)     cultivate a community of experts and skilled people who embrace the values of active learning and professional sharing, values which assume a certain degree of organisational readiness on the part of memory institutions.

In this blog I will explain how the OPF hackathons are supporting these aims and why preservation managers should send their staff to OPF-hackathons.

 

What are OPF hackathons?

Our hackathons are 3-day events organised around a specific digital preservation topic or challenge. They bring together curators (those who understand the content and value of their collections) and software engineers (those who understand the underlying digital nature of these collections). In OPF-speak, we bring together the “practitioners” and the “developers”, which is a practical way to distinguish between 2 different roles: 1) the role of the practitioners, who collect digital materials and can come with real examples and real, day-to-day problems they encounter when managing these materials; 2) the role of the developers, which is equivalent to that of the “conservators” in the analogue domain: they examine the digital materials (the files and the bit streams underlying the digital objects); suggest methods for storing, displaying, treating and processing them; research new techniques; etc.

In bringing these 2 roles together we are creating fruitful synergies, which not only result in practical solutions but more importantly, in cultivating a community of experts who share and develop professional practices together. The concept is simple: practitioners bring troubled data and developers “hack” with existing tools and develop practical approaches. Usually the problems and solutions are very much hands-on. They are neither about state of the art R&D nor about building future frameworks or digital sustainability platforms. They are not about risk assessment or risk management. We talk about the day-to-day operations and the use of tools such as Apache Tika and DROID, in real practice. We talk about integration of tools in workflows and compare practices. In this way we are building a shared practice, based on learning by doing.

 

Why is it important for memory institutions to send their people to OPF hackathons?

Institutions with a mission to preserve society’s digital heritage need to develop competence and confidence in digital preservation. It is OPF’s conviction that the best way to do so is by investing in staff development. OPF hackathons are a better substitute for (and cheaper than) training programmes. They help your staff to develop the knowledge, skills and abilities needed to perform their daily tasks. Through participation they can rely on peer support from the OPF community and, in turn, derive job satisfaction from contributing to the community.

 

Next OPF Hackathons

A Practical Approach to Disk Images and Digital Forensics, 15-17 May, Copenhagen

Tackling Real-World Collection Challenges with Digital Forensics Tools and Methods, 3-5 June, Chapel Hill

Adventures in Debian packaging

About a year ago, work started on packaging SCAPE tools. Jpylyzer was the first SCAPE tool that was turned into a Debian package. Some time later, the OPF set up a couple of machine images at Amazon Web Services, which can be used to create packages repeatedly using a virtual machine. Even though I've used the Amazon service a couple of times myself, I really know next to nothing about Debian packages, and it's safe to say that the underlying build process has been more or less a complete mystery to me.

To get a better understanding of the process for building Debian packages, I had a try at packaging jpylyzer on my local machine (which runs on Linux Mint 14). Some time ago Dave Tarrant and Rui Castro wrote a nice step-by-step guide on building Debian packages on the OPF Wiki, so I tried to follow the instructions there. While working on this, I made some notes, mainly to remind myself of what I was doing. Then I realised that some of this might be useful to others as well, so I decided to turn it into a blog post.

Objectives

The objectives of this exercise were:

  • to get more familiar with the packaging process myself;
  • to provide some input on how useful the guide on the OPF Wiki is from the perspective of someone who is largely ignorant of the packaging procedure;
  • to identify any problems in jpylyzer's packaging procedure.

I did two experiments: first, I did a very limited test where I tried to create a template directory structure using debhelper, which would be the first step when starting from scratch. Since for jpylyzer all the files in the debian directory already exist, I then moved on to building jpylyzer using the existing files.

Test 1: creating the directory structure from scratch

For this, I first installed all the required packages listed in the Pre-Requisites section of the guide using:

sudo apt-get install build-essential dh-make devscripts debhelper lintian

Subsequently I followed the instructions in the Getting Started section. For this I simply created an empty directory:

mkdir debtest_1.0.0

And then:

cd debtest_1.0.0

Then I ran dh_make:

dh_make

This resulted in an error message, telling me that the package name and its version number should be separated by a dash ('-') instead of an underscore ('_'), or, alternatively, that the -p flag should be used. So I changed the directory name:

mv debtest_1.0.0 debtest-1.0.0

Re-running dh_make, it now accepted the directory name, but it complained about a missing tarball (which I purposefully didn't make in this test). However, as dh_make offered the suggestion to use the --createorig option (which creates a tarball) I tried this:

dh_make --createorig

This resulted in the creation of a debian directory with file templates, and an (empty) tarball debtest_1.0.0.orig.tar.gz which was created in the parent (debtest) directory.

So, apart from the dash/underscore mix-up this is all pretty straightforward.
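The naming rule that tripped up the first attempt can be checked before running dh_make. The following is only a sketch: the regular expression is an approximation chosen for this example, not the pattern dh_make itself applies, and `split_source_dir` is a hypothetical helper.

```python
import re

# Approximate rule (an assumption): a lowercase package name, a dash,
# then a version string that starts with a digit.
NAME_VERSION = re.compile(
    r"^(?P<name>[a-z0-9][a-z0-9+.-]*)-(?P<version>[0-9][A-Za-z0-9.+~:]*)$"
)


def split_source_dir(dirname):
    """Return (name, version) if dirname looks like <name>-<version>, else None."""
    match = NAME_VERSION.match(dirname)
    if match is None:
        return None
    return match.group("name"), match.group("version")


if __name__ == "__main__":
    for candidate in ("debtest_1.0.0", "debtest-1.0.0"):
        print(candidate, split_source_dir(candidate))
```

With this pattern the underscore variant is rejected and the dash variant is split into the package name and version.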

Test 2: building jpylyzer

In this second test I tried to build jpylyzer using the already existing files in the debian folder of jpylyzer's Git repository. First I cloned the repository to my local machine:

git clone https://github.com/openplanets/jpylyzer.git

Then I went into the jpylyzer directory:

cd jpylyzer

From there I tried to build jpylyzer directly, using the command given in the guide's Building your package section:

dpkg-buildpackage -tc

Missing changelog

The above command resulted in an error message about a missing changelog file in the debian folder. The changelog section in the OPF guide does mention an OPF-hosted GitHub 2 Changelog service, which is supposed to be callable from the rules file. But I don't see any reference to it in jpylyzer's rules file, so I don't really know how this is supposed to work! To keep going I simply grabbed the default changelog that was created by debhelper in an earlier experiment. After this I ran the command again.

Unknown commands in makefile

This time, dpkg-buildpackage exited with the following errors:

pymakespec --onefile jpylyzer.py
make[1]: pymakespec: Command not found
make[1]: *** [build] Error 127
make[1]: Leaving directory `/home/johan/debtest/jpylyzer'
make: *** [build] Error 2
dpkg-buildpackage: error: debian/rules build gave error exit status

These errors arise from the following lines in jpylyzer's makefile:

build:
    pymakespec --onefile jpylyzer.py
    pyinstaller jpylyzer.spec
    @echo "Built in dist/jpylyzer"

The pymakespec and pyinstaller commands above are most likely shell scripts that launch the Makespec.py and pyinstaller.py scripts that are both part of PyInstaller (these are used for building an executable from the source code). However, neither the shell scripts nor any references to them are included in jpylyzer's repository (my best guess is that they exist only on a specific machine instance - perhaps the Amazon virtual machines?), so the makefile simply won't work.

I was able to fix this by changing the references to the shell scripts to this (using PyInstaller 1.5):

python /home/johan/pyinstall1.5/Makespec.py --onefile jpylyzer.py
python /home/johan/pyinstall1.5/pyinstaller.py jpylyzer.spec

For PyInstaller 2 these two lines should be substituted by:

python /home/johan/pyinstall/pyinstaller.py --onefile jpylyzer.py   

Note here that PyInstaller has no default installation location, and the file paths will vary from machine to machine!
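Because the location and the interface both vary, one way to keep the build step portable is to generate the commands from the PyInstaller directory and major version. The sketch below simply mirrors the command lines shown above; `pyinstaller_commands` is a hypothetical helper, and the install directory has to be supplied by whoever runs the build.

```python
import posixpath


def pyinstaller_commands(install_dir, major_version, script="jpylyzer.py"):
    """Return the command lines (as argument lists) needed to build `script`."""
    python = "python"
    if major_version == 1:
        # PyInstaller 1.5 style: create the spec file first, then build from it.
        spec = posixpath.splitext(script)[0] + ".spec"
        return [
            [python, posixpath.join(install_dir, "Makespec.py"), "--onefile", script],
            [python, posixpath.join(install_dir, "pyinstaller.py"), spec],
        ]
    # PyInstaller 2 style: a single call does both steps.
    return [[python, posixpath.join(install_dir, "pyinstaller.py"), "--onefile", script]]


if __name__ == "__main__":
    for command in pyinstaller_commands("/home/johan/pyinstall1.5", 1):
        print(" ".join(command))
```

The returned lists can be passed to `subprocess.check_call` one at a time, or printed and pasted into the makefile.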

After making these changes I was able to run dpkg-buildpackage without any problems:

dpkg-buildpackage -tc

Result: the following files were created in the repo's parent directory:

  • jpylyzer_1.9.0_amd64.changes
  • jpylyzer_1.9.0_amd64.deb
  • jpylyzer_1.9.0.dsc
  • jpylyzer_1.9.0.tar.gz

Tarball schmarball

One thing that confused me at first: the Getting Started section in the OPF guide mentions the need for building a native package before starting the Debian packaging:

If you have got here and you don't have any already packaged code (a tar ball with makefile etc) then you will need to build a native package.

So, I initially thought I would need to create a tarball of my repo first. As it turns out this is not the case: the tarball is created automatically once you run dpkg-buildpackage. So this is one thing less to worry about!

Verifying the package with lintian

As a final step I used lintian to verify my package:

lintian jpylyzer_1.9.0_amd64.deb

This resulted in the following output (using PyInstaller 1.5):

E: jpylyzer: unstripped-binary-or-object usr/bin/jpylyzer
W: jpylyzer: hardening-no-fortify-functions usr/bin/jpylyzer
W: jpylyzer: wrong-bug-number-in-closes l3:#nnnn
E: jpylyzer: debian-changelog-file-contains-invalid-email-address johan@unknown
E: jpylyzer: helper-templates-in-copyright

With PyInstaller 2 I got this additional warning:

W: jpylyzer: hardening-no-relro usr/bin/jpylyzer

I still need to give these errors and warnings an in-depth look. At least one error is related to the bogus changelog file I used. Some others (e.g. unstripped-binary-or-object) appear to be related to the build process of the binaries.
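As a starting point for that closer look, the lintian output can be grouped by severity. The snippet below is a minimal sketch that assumes every line has the "severity: package: tag [details]" shape seen above; `group_lintian` is a hypothetical helper, not part of lintian.

```python
from collections import defaultdict


def group_lintian(lines):
    """Group lintian tags by severity letter (for example 'E' or 'W')."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.strip().split(": ", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed shape
        severity, _package, rest = parts
        tag = rest.split(" ", 1)[0]  # the tag is the first word after the package
        groups[severity].append(tag)
    return dict(groups)


if __name__ == "__main__":
    output = [
        "E: jpylyzer: unstripped-binary-or-object usr/bin/jpylyzer",
        "W: jpylyzer: hardening-no-fortify-functions usr/bin/jpylyzer",
        "W: jpylyzer: wrong-bug-number-in-closes l3:#nnnn",
        "E: jpylyzer: debian-changelog-file-contains-invalid-email-address johan@unknown",
        "E: jpylyzer: helper-templates-in-copyright",
    ]
    for severity, tags in group_lintian(output).items():
        print(severity, tags)
```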

Conclusions

Using the Building Your Debian Package guide on the OPF Wiki I was able to create a rudimentary skeleton structure for Debian packaging. I was also able to build a Debian package for jpylyzer. The exercise revealed some problems with the Debian setup for jpylyzer. The most important ones are:

  • It's unclear how jpylyzer's changelog file is supposed to be generated. Perhaps there's a dependency on some external service (the OPF Github 2 Changelog service?), but I cannot find any documentation on how to make this work!
  • The makefile calls PyInstaller in a non-standard and undocumented way. This is easy to fix locally if you are familiar with PyInstaller, but not so otherwise. Also, the interfaces of versions 1.5 and 2 of PyInstaller are different, and depending on which version you are running this may require additional changes to the makefile.
  • Even though I was able to build a Debian package for jpylyzer, it still ended up with some lintian errors.

I also came across a few minor errors in the OPF guide. I left a short comment on this here (scroll to bottom). Overall, I found the guide really helpful, and it provides an accessible and relatively painless introduction to the packaging process.

Reference

Building Your Debian Package (OPF Wiki)

Post scriptum

Proof again that it's always a bad idea to come up with a clever title for a blog post without Googling it first: after writing this post I found out that the Mid Hudson Valley Linux and Open Source Users Group will be organising a meeting called Adventures in Debian Packaging later this year in Poughkeepsie, NY. Completely unrelated to this blog, of course, but it's only fair to give it a mention. Well, there you go.

Re-tailoring FITS

File Information Tool Set (FITS) is the Harvard Library's "Swiss army knife" for file characterization. Created originally for use with the library's Digital Repository Service (DRS), it's been made available as open source, and several other institutions have made use of it. The OPF online hackathon last November included some work on it, and recently the Google Code repository (https://code.google.com/p/fits/), which is the official home of Harvard's FITS, was cloned to a Github repository (https://github.com/harvard-lts/fits) as a possible step toward more community participation. There was more work on FITS at the March hackathon in Leeds, including initial work on integrating Apache Tika.

I've started work under a SPRUCE grant to continue improvements on FITS and have forked it to another Github repository (https://github.com/gmcgath/fits-mcgath/) for the duration of this work. (The older "openfits" repository which I created in November should now be considered deprecated; the new one is a fresh fork.) Part of this project is to get community input on what will improve FITS and, if time allows, to work it in. Among other things, I'm looking for input into what FITS video metadata should look like. There's already been some discussion of this on my own blog (http://fileformats.wordpress.com/2013/04/01/mfits/). Feel free to try out the changes as they're committed to the repository and to comment on any aspect of the project.

I'm a former software developer for the Harvard Library and currently have some sort of status as an inactive temp employee, but all remarks here are my own and not those of any part of Harvard University.

Software Archiving for EaaS

The typical digital artefact or complex object does not function (render, execute, ...) without a certain software environment. Emulation-as-a-Service (EaaS) provides original environments running in platform emulators. Depending on the (complex) object to be handled, several software components are required to reproduce an original environment. Often, these components are proprietary and require a software license. The software itself and the licenses need to be preserved to enable the reproduction of the original environments. There are a couple of issues linked to software licenses. These issues can change over time and will definitely influence EaaS, as licenses (and software "patents") expire or local and remote license servers become unavailable. Another interesting point, massively disputed by some software vendors, is the development of a second-hand software market.

Software Archive of Standard Components

Software components required to reproduce original environments for certain (complex) digital objects can be classified in several ways. There is standard software such as operating systems and off-the-shelf applications sold in (significant) numbers to customers. There might exist different releases and various localized versions (the user interaction part translated to different languages as is the case for Microsoft Windows or Adobe products) but otherwise the copies were exactly the same. Such software should be described uniquely and kept in a software archive of standard components.

There are several ideas on software identification and description already discussed in this blog (e.g. by Andrew Jackson). DOIs would definitely be helpful to tag software, just as ISBNs describe books and other media. These tags would be useful for tool registries like TOTEM, too. Optimally, such software archives are managed by the relevant (national) memory institutions. As the archive's content is comparably small and well described by the tags, the workload can easily be shared (federated) among several institutions. Different ways could be envisioned to stock these archives. Legal deposit, as is well established for books and other media, is one option. Or, software components could be collected on-demand upon object ingest. This option is discussed and demonstrated e.g. by the bwFLA project. It provides necessary interfaces to a software archive, so that all required software components can be collected and described. This is done via observed installation processes, which record all the user interaction required to install a certain component. Such additional information is to be stored alongside the standard metadata such as license keys. The successful rendering of the object can be directly validated by the user to verify the complete capture of all relevant components.

Unfortunately, general, coordinated software archiving is still a partially unresolved issue. There are several activities going on at the National Archives of New Zealand and the National Library of Australia. These activities are very valuable to the whole community, as some software producers do not archive their products for very long. Additionally, some companies leave the market and not all assets are maintained. There exist initiatives like vetusware.com which try to tackle this problem but operate in a legally problematic domain. They might disappear because of a take-down request or simply because they run out of funding. Other sources are specialized archives like browsers.evolt.org for web browsers. The drive-by software archiving as run by the Internet Archive might not capture all relevant software, as many components were not freely and openly available for download. Especially for older and less popular platforms it becomes more difficult to get hold of obsolete software. Nevertheless, storing and maintaining software components is a prerequisite of the deal. For this reason, memory institutions should have special rights to archive software.

Licensing

Every actually running instance of an original environment requires a certain set of licenses depending on the installed or used software. If e.g. a set of presentation slides with embedded audio, video and spreadsheets needs to be rendered, the licenses for the operating system and the presentation software are required. Additionally, audio and video codecs as well as an appropriate spreadsheet renderer need to be obtained and installed to make the presentation of the object complete. For EaaS a license management component is required to match the number of available licenses to the requested original environments to run. The sources of the licenses could be different and could depend on the user (and institution) requiring access to a certain digital object in its original environment. In a federated EaaS environment run by different institutions, the sharing and handling of licenses becomes an interesting topic, especially if national borders are crossed (e.g. because software vendors try to maintain separate markets with different pricing).
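The matching step of such a license management component can be sketched in a few lines. This is an illustration under simple assumptions, not a description of any existing EaaS implementation: licenses are counted per product name, every running environment consumes one license for each product it uses, and the names `can_start` and `start_environment` are hypothetical.

```python
from collections import Counter


def can_start(required, available, in_use):
    """Check whether enough free licenses exist for every required product."""
    needed = Counter(required)
    return all(
        available.get(product, 0) - in_use.get(product, 0) >= count
        for product, count in needed.items()
    )


def start_environment(required, available, in_use):
    """Reserve licenses for one environment; return False if any are missing."""
    if not can_start(required, available, in_use):
        return False
    for product in required:
        in_use[product] = in_use.get(product, 0) + 1
    return True


if __name__ == "__main__":
    available = {"operating system": 1, "presentation software": 2}
    in_use = {}
    print(start_environment(["operating system", "presentation software"], available, in_use))
    print(start_environment(["operating system"], available, in_use))
```

In the example the second request is refused because the only operating system license is already in use. A federated setup would additionally need to record which institution owns each license and where it may be used.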

Within the realm of (national) libraries and archives the licenses of the legal deposit might suffice. For a more open and general service other ways of licensing are required. Either the software producers offer a specific type of license for that purpose, or specifically acquired licenses (e.g. from the pre-owned license market) are used. Another option is that licenses are obtained (from the original user/producer of the object) when ingesting the particular object. This might be the case for finished (scientific) projects or end-of-life office environments in companies or government organizations. At the moment, licenses are often just thrown away like used IT equipment. For the future a more elaborate digital lifecycle management should be put in place. When a project is planned and started, the licensing of all required components should be secured for the complete intended lifecycle of a particular object.

Custom Made Software Components

A (federated) software archive of standard components does not make sense for all software components. In many domains custom made software and user programming play a significant role. This could be scripts or applications written by scientists to run their analysis on gathered data, run specific computations or extend existing standard software packages. Other examples are software tools written for governmental offices or companies to produce certain forms or implement and configure business processes. Such software is to be taken care of and stored alongside the preserved object. The same applies for complex setups of standard components with lots of very specific configurations. In these cases it could make sense to preserve the system as a whole (see blog post on that topic for full system preservation).

Pre-Produced and On-Demand Original Environments

EaaS makes it possible to centralize services and share the efforts. This could be especially useful to re-use pre-produced original environments of standard components. Depending on the type of user (someone rendering the object within the premises of the memory institution, a commercial entity, or a private person), different ways of (re)producing original environments could be chosen:

  • Complete environments together with the required metadata to run it in the chosen virtual machine or emulator. This would be the method to deploy for imaged complete systems.
  • Reproduce the complete environment from standard components using the license information delivered by the user together with the object to render. This may take a while as the setup procedure needs to be completed. The bwFLA project started to implement workflows to gather all the required metadata and user interaction to automatically reproduce such steps.
  • Re-use existing environments from a "cache" (pre-produced environments). This should be possible for in-house use or as an external service if the required type and number of licenses is available. Here a couple of legal concerns might prove problematic as many licenses may not explicitly allow software lending.
  • Partially re-use pre-configured environments if licenses are less problematic and just add the problematic/proprietary component.

Several ways were described to automatically re-produce certain environments e.g. for Windows operating systems (link) or as researched within the bwFLA context. Nevertheless, these procedures take time to complete and extend the time span till an artefact or original environment can be presented to the user.

From the new OPF Chairman

As many of you already know, I have taken over the role of Chairman of the Board of the Open Planets Foundation from Adam Farquhar as of February 1, 2012.

Clearly, Adam has already presided over an enormous achievement, first in conceiving and establishing the Open Planets Foundation, and second in bringing the OPF to the point where it is a stable, viable organisation that is both self-sustaining and debt-free. On behalf of the Board, I thank him and applaud his efforts.

Nevertheless, many new challenges lie ahead. First and foremost, we are hoping to achieve a new level of impact and financial sustainability through our new membership model. The new model consists of three tiers of paid memberships, based on the size of the member organisation. The model also introduces affiliate members, whose contributions are made through in-kind effort. We hope that the tiered model will open the door to a much wider pool of members, which will in turn increase our visibility, impact, and community network. The challenge is to reach out to these organisations and convince them to join the OPF.

In addition, we hope that the affiliate model will help build our portfolio of software assets and increase our USP. The challenge is that this will require additional sustainable technical effort to manage the in-kind contributions effectively.

Finally, I hope that we can shape the organisation so that it can position itself and support its members in the context of Horizon 2020 and the new European funded project landscape.

I must admit that I find these challenges daunting. But I am confident that, with the cooperation of the Board, our Managing Director Bram van der Werf, the OPF support staff, and our membership, we can meet these challenges, secure the future of the organisation, and meet the requirements of our members.

I am looking forward to working with you all!

Ross King

SPRUCE Hackathon Leeds: extending C3PO to support Apache Tika

The SPRUCE Unified Characterisation Hackathon in Leeds brought together a group of developers to discuss the digital preservation community's approach to characterisation and to consolidate and improve existing toolsets.

Developed by Petar Petrov as part of the SCAPE project, C3PO is a tool for profiling digital collections based on FITS characterisation metadata. I had recently experimented with using FITS and C3PO to carry out a digital collections audit that formed part of a SPRUCE-funded project at Bishopsgate Institute.

At the hackathon, Petar, Per Møldrup-Dalum and I worked to extend C3PO to support the creation of collection profiles using metadata extracted with Apache Tika. Tika extracts a range of metadata, along with text content, from different media types. Like FITS, it therefore offers more precise characterisation and profiling of digital collections. A key advantage of Tika is its performance: this is particularly important to practitioners dealing with large datasets, such as the 300TB web archive that Per has been working with at the State and University Library in Denmark.

To add support for Tika to C3PO, Per wrote a parser for Tika's metadata output; Petar then implemented an adapter to enable C3PO to understand this output. We generated a test dataset of Tika output files and were able to use C3PO to ingest and analyze this metadata. In addition to adding support for Tika, we were also able to get C3PO running in Apache Tomcat. Petar hopes to release a Tika-enabled version of C3PO shortly.

Some challenges still remain. Parsing Tika's text metadata output proved awkward, and it would be more convenient to parse XML output if this can be provided without also extracting a document's text content (which is more expensive). Petar aims to modify C3PO so that it is able to ingest metadata from both FITS and Tika for the same dataset. This poses the problem of how to reconcile the two sets of metadata in the absence of unique identifiers for the files they describe. More problematic still is how to map the wide variety of metadata properties extracted by Tika, which are not well documented, onto those provided by FITS.
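To make the reconciliation problem concrete, here is a minimal sketch in Python (C3PO itself is written in Java, and this is not its code). It assumes that both tools were run over the same directory tree, so a normalised relative path can stand in for the missing unique identifier. The one-entry mapping table and the names `TIKA_TO_FITS`, `normalise` and `merge_records` are illustrative assumptions; properties without a known mapping are kept aside rather than dropped.

```python
import posixpath

# Illustrative mapping only; the real vocabularies are far larger.
TIKA_TO_FITS = {"Content-Type": "mimetype"}


def normalise(path):
    """Turn a file path into a comparable key (assumes a shared directory tree)."""
    return posixpath.normpath(path.replace("\\", "/"))


def merge_records(fits_records, tika_records):
    """Combine per-file property dicts from both tools, keyed by normalised path."""
    merged = {}
    for path, props in fits_records.items():
        merged[normalise(path)] = {"fits": dict(props), "tika": {}, "unmapped": {}}
    for path, props in tika_records.items():
        entry = merged.setdefault(
            normalise(path), {"fits": {}, "tika": {}, "unmapped": {}}
        )
        for key, value in props.items():
            if key in TIKA_TO_FITS:
                entry["tika"][TIKA_TO_FITS[key]] = value
            else:
                entry["unmapped"][key] = value  # no known FITS equivalent yet
    return merged
```

Path matching breaks down as soon as files are moved or renamed between runs; a checksum recorded by both tools would be a more reliable key where one is available.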

Despite these challenges, the work at the hackathon has extended the capabilities of C3PO and opened up exciting possibilities for future work. C3PO with Tika support is another useful option for collection owners looking to build up a detailed profile of their collections to assist with preservation planning.

Looking back at the second project year of SCAPE

The second project year of the integrated research project SCAPE has just finished. It is time to look back at the most interesting developments and significant results of this period.

Last December, SCAPE successfully organised its first public training event: Keeping Control – Scalable Preservation Environments for Identification and Characterisation (supported by the European Capital of Culture in Guimarães, Portugal). The event was well attended, and the participants gave useful and very positive feedback to the trainers and organisers. The training material from the event is available on the OPF wiki (http://wiki.opf-labs.org/display/SP/Resources+-+SCAPE+Training+event+-+G...).

Communication and cooperation between the technical teams increased during this project year, in order to integrate developments and results in SCAPE. The 1st Platform Release was an important milestone: the SCAPE components included in this release are listed on the OPF wiki (http://wiki.opf-labs.org/display/SP/First+Platform+Release). The current version is a first step towards the scalable infrastructure which the project is developing. In 2013 the Platform team will test and enhance the current version of the platform. A short paper and a presentation explaining the architectural overview of the preservation platform are available on our project website (http://www.scape-project.eu/publication/an-architectural-overview-of-the...).

A number of SCAPE components have been developed and adapted in order to address scalability, automation, and quality assurance of stored data. Two examples of SCAPE quality assurance tools that saw their first public releases in 2012 are Pagelyzer (comparison of web pages) and xcorrSound (comparison of sound files).

All SCAPE tools have been evaluated against the SCAPE scenarios and their current status assessed. A report on this gap analysis is available on the project website (http://www.scape-project.eu/news/d10-2-gap-analysis-on-action-services-t...). A first evaluation of SCAPE workflows has taken place as well. The workflows can be accessed on myExperiment (http://www.myexperiment.org/workflows?query=scape), and the evaluation report is available on the project website (http://www.scape-project.eu/news/d18-1-first-evaluation-report-draft).

Significant progress has been made towards extending the functionality of the planning component Plato, and adapting it to meet the objectives of SCAPE. A number of new features have been added to Plato in order to integrate information from myExperiment (e.g. about migration services), automated content profiling, and policy-awareness. A public instance of the newest version of Plato can be accessed at www.ifs.tuwien.ac.at/dp/plato. A first version of our machine-readable Policy Model can be accessed on GitHub (https://github.com/openplanets/policies). Finally, a first version of the automated watch component SCOUT has been released as well (https://github.com/openplanets/scout).

During the coming project year, the different SCAPE teams will work towards releasing the final version of the SCAPE preservation platform, and improving the SCAPE components in order to meet requirements for scalability. Dissemination, training and sustainability activities will continue as well. Planning for Year 3 started at a very successful All-Staff Meeting, held in February 2013 in Paris. The agenda included a large number of sub-project and work package meetings, as well as general cross-project sessions. During a workshop the project members identified the most successful and productive project results so far. The workshop results were very useful for gaining an overview of the current status of developments and for refocusing current work on the main project goals. The SCAPE team is looking forward to another very productive project year!

(pictures by Sava Cajetinac)

Challenges of Dumping/Imaging old IDE Disks

Several preservation workflows (such as full system preservation through imaging) and processing steps in digital forensics depend on reliable hardware-software stacks to produce identical copies of system disks. Because the x86 platform in particular evolves very quickly, both hardware and software change rapidly. Even where the standard suggests compatibility, a number of pitfalls can prevent an authentic copy of the original components from being written to an image file.

Despite the formerly widespread use of parallel IDE disks in x86 computers, these devices suffer from a number of compatibility issues. These issues might cause full system preservation (as described, for example, in Full Disk Preservation of a MS-SQL database) to fail partially or completely. While the physical interface, such as the 40-pin connector, did not change, the logical layer did. In the early days, limitations of the system BIOS prevented older machines from properly "seeing" larger disks. For full system preservation the situation is usually reversed: the IDE port implementation may be significantly newer than the disk to be dumped.

Experimental Setup

Parallel IDE was gradually phased out and replaced by SATA in the mid-2000s. This kind of connector is therefore not found in new x86 machines, which illustrates yet another problem of hardware obsolescence. Older machines that still had this kind of connection were found. The experiments used several different IDE implementations (from the stock of hardware available):

  • Intel 865 chipset with a BIOS from 2005 (compact PC with Pentium processor)
  • Intel 875 chipset with a BIOS from 2004 (device no longer available because the hardware failed)
  • Two port onboard controller (Silicon Image parallel port adaptor, same mainboard with Intel 875)
  • NVidia nForce chipset of 2005 (AMD CPU system)
  • Lindy cable multi (physical) IDE port to USB 2.0 adaptor
  • Davicontrol PCI dual-port IDE adaptor (external PCI card, Silicon Image PCI0680 Ultra ATA 133/166)

The IDE disks primarily considered were identical 240 MByte Quantum disks from an experiment to preserve a digital research environment featuring a DB2 database as a core component. The disks, taken from five client machines, were numbered 1 ... 5. (The server was equipped with a SCSI disk, which caused no trouble; the identity of its image was easily proven.) Other disks tested were 13, 30 and 40 GByte disks dating from around the turn of the century. For the system imaging procedure a fairly recent 3.2 Linux kernel (Ubuntu 12.04.2 LTS) was used.

Fortunately we got a set of identical (regarding production date and size) IDE disks to test. The partition layout was roughly the same with some variance. The installed operating system was OS/2 version 2.1 on OS/2 HPFS, a filesystem which is directly mountable in Linux.

First Step - Disk Recognition

A prerequisite for reading a disk is that the controller recognizes the device properly. This worked without any problems for the various gigabyte-sized disks, but produced some "interesting" results for the 20-year-old disks:

  • The USB adaptor "saw" the disk, but was unable to produce a proper capacity reading (it guessed 2 TByte)
  • Newer IDE adaptors, such as the onboard controller, did not recognize the disk at all
  • Several disks were recognized by the Intel 865, 875 controllers, but two failed (#2, #4) in some earlier experiment
  • The failed disk 4 was properly recognized on the nForce and Davicontrol ports

Nevertheless, the old disks were not properly recognized in every boot-up cycle. They needed a certain time to spin up before they could answer the BIOS/operating system requests properly. Sometimes they "hung", which is indicated by a permanently lit on-disk activity LED. To see whether the disks were recognized by the operating system (Linux), check the kernel messages, which show which disks are visible to the system. After that, the fdisk command, run with administrator privileges, should give a proper listing of the contained partitions. The recognition was somewhat tricky:

  • Disk 2 spins up but does not get detected on any of the controllers
  • Disk 3 is only detected on the Davicontrol
  • Most of the disks were not detected on every bootup. Pausing the machine and rebooting usually helps to get them detected eventually.
  • If a disk is not properly recognized during bootup, unloading and reloading the IDE controller kernel module triggers the recognition. This usually helps.
  • The detection rate on the Davicontrol adaptor differed between machines. The BIOS and the succession of devices on the bus (different order of initialization) seem to influence the process (significantly)

Second Step - Disk Reading

The tool of choice for producing identical copies of block devices on Linux/Unix systems is dd. In its standard configuration it reads the block device in 512-byte blocks and writes them to a file (if asked to). After proper recognition the disk was accessible through the whole-disk device, e.g. /dev/sda, and through one device per partition, e.g. /dev/sda1...5 (numbering depending on the partitions detected).

Every disk apart from disk 3 was read from beginning to end with dd if=/dev/sda of=image-file. This procedure copies everything, including the master boot record and the partition table. dd finished without any errors in every run on every machine, and the machine log did not show any errors either. It was therefore concluded that the process had run flawlessly. Unfortunately, the simple partition check on the image file, fdisk -l image-file, did not necessarily produce the proper partition listing. This was at first assumed to be a deficiency of the tool. When the resulting system image was booted in an emulator or virtual machine, however, severe problems occurred when mounting the contained filesystems (especially the system partition). Further investigation revealed that different dd runs (with different block size settings tested) produced different image files (compared using the diff utility and md5sum). The system reported no errors at all. Fortunately, by chance, fdisk runs on the original medium also produced differing listings, including incomplete or corrupted partition tables, looking like:

Disk /dev/sda: 245 MB, 245426688 bytes
13 heads, 51 sectors/track, 723 cylinders, total 479349 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sda1 204867 476033 135583+ 5 Extended
/dev/sda2 * 51 204866 102408 7 HPFS/NTFS/exFAT
/dev/sdb3 * 476034 478685 1326 a OS/2 Bootmanager
/dev/sdb5 ? 4143585081 3991997998 2071690107 f6 Unknown

Partition table entries are not in disk order
omitting empty partition (6)

A correct reading should have produced:

Disk /dev/sda: 245 MB, 245426688 bytes
13 heads, 51 sectors/track, 723 cylinders, total 479349 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sda1 204867 476033 135583+ 5 Extended
/dev/sda2 * 51 204866 102408 7 HPFS/NTFS/exFAT
/dev/sdb3 * 476034 478685 1326 a OS/2 Bootmanager
/dev/sdb5 * 204918 426308 110695+ 7 HPFS/NTFS/exFAT
/dev/sdb6 * 426360 476033 24837 7 HPFS/NTFS/exFAT

Partition table entries are not in disk order

Other readings of exactly the same disk showed different patterns in the /dev/sdb5 entry line.
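A partition listing can also be sanity-checked programmatically. The sketch below decodes the four primary entries of a classic MBR (16-byte entries starting at byte 446, boot signature 0x55AA at byte 510) and flags entries that extend beyond the end of the disk. It is only an illustration under stated assumptions: the function names are hypothetical, and logical partitions such as the corrupted entry in the listing above live in extended boot records, which this sketch does not follow.

```python
import struct

PARTITION_TABLE_OFFSET = 446  # start of the four primary entries in the MBR
ENTRY_SIZE = 16


def read_primary_partitions(mbr):
    """Decode the non-empty primary partition entries from a 512-byte MBR."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("missing MBR boot signature")
    partitions = []
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * ENTRY_SIZE
        # boot flag, CHS start, type, CHS end, LBA start, sector count
        boot, _chs_start, ptype, _chs_end, lba_start, sectors = struct.unpack(
            "<B3sB3sII", mbr[start:start + ENTRY_SIZE]
        )
        if ptype == 0:  # type 0 marks an unused entry
            continue
        partitions.append({
            "index": index + 1,
            "bootable": boot == 0x80,
            "type": ptype,
            "start": lba_start,
            "sectors": sectors,
        })
    return partitions


def implausible_entries(partitions, total_sectors):
    """Return the entries that would end beyond the last sector of the disk."""
    return [p for p in partitions if p["start"] + p["sectors"] > total_sectors]
```

For an image file, the first 512 bytes can be passed in directly, for example `read_primary_partitions(open("image-file", "rb").read(512))`, with the total sector count taken from the file size divided by 512.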

The differing readings can be cross-checked with hexdump: different readings already differ within the first couple of kilobytes. Unfortunately, failed readings are not easy to distinguish from proper ones. A proper partition table listing from the image file does not necessarily mean complete success. Without the proper tools and insight into the reading results, proof is difficult to produce, as any dd run that completes without an error looks correct. Experience from numerous other tests had never hinted at such problems: the dumped images behaved as expected (they booted in the emulator without any block-device-specific errors). As many of those tests used non-intrusive dumping, hardware-specific errors could at least be ruled out (because the machine worked properly with the original operating system). The only source of errors in these cases could be the mini Linux used for dumping.
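To locate where two dumps of the same disk start to diverge, a small helper can report the first differing byte offset, which can then be inspected with hexdump. This is a minimal sketch; the function name and chunk size are arbitrary choices, not part of the original experiment.

```python
def first_difference(path_a, path_b, chunk_size=64 * 1024):
    """Return the byte offset of the first difference, or None if the files are identical."""
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                for i, (byte_a, byte_b) in enumerate(zip(chunk_a, chunk_b)):
                    if byte_a != byte_b:
                        return offset + i
                # All shared bytes match, so one file simply ends earlier.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:  # both files ended without a difference
                return None
            offset += len(chunk_a)
```

An offset below 512 would point at the master boot record itself, while a later offset would point into a partition or an extended boot record.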

The only disk with real reading errors (because of hardware defects, not just producing wrong output) was disk 3:

[ 457.107249] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 457.107275] ata3.00: failed command: READ SECTOR(S)
[ 457.107293] ata3.00: cmd 20/00:03:2d:08:00/00:00:00:00:00/a3 tag 0 pio 1536 in
[ 457.107295] res 51/40:02:2d:08:00/00:00:00:00:00/a3 Emask 0x9 (media error)
[ 457.107317] ata3.00: status: { DRDY ERR }
[ 457.107328] ata3.00: error: { UNC }
[ 457.152340] ata3.00: configured for PIO2
[ 457.152362] sd 2:0:0:0: [sda] Unhandled sense code
[ 457.152366] sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 457.152372] sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
[ 457.152378] Descriptor sense data with sense descriptors (in hex):
[ 457.152382] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 457.152396] 00 00 15 7d
[ 457.152402] sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
[ 457.152410] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 00 00 15 7d 00 00 03 00
[ 457.152424] end_request: I/O error, dev sda, sector 5501
[ 457.152445] Buffer I/O error on device sda, logical block 5501
[ 457.152460] Buffer I/O error on device sda, logical block 5502
[ 457.152472] Buffer I/O error on device sda, logical block 5503

The error listing clearly reports a problem reading sector 5501 and the sectors following it. Such error messages usually coincide with an error from dd or with problems reading the filesystem in a partition.
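When many disks are dumped, it can help to collect the failing sectors from the kernel log automatically. The sketch below scans log text for lines of the form shown in the listing above; the regular expression is an assumption based on that listing, and other kernel versions may word the message differently.

```python
import re

# Matches e.g. "end_request: I/O error, dev sda, sector 5501"
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def failed_sectors(log_text):
    """Map each device name to the list of sectors reported with I/O errors."""
    failures = {}
    for device, sector in IO_ERROR.findall(log_text):
        failures.setdefault(device, []).append(int(sector))
    return failures


log = """[ 457.152410] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 00 00 15 7d 00 00 03 00
[ 457.152424] end_request: I/O error, dev sda, sector 5501
[ 457.152445] Buffer I/O error on device sda, logical block 5501
"""
print(failed_sectors(log))
```

The "Buffer I/O error" lines are deliberately not matched, since they report logical blocks rather than sectors.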

To rule out a faulty version of the Linux kernel used, the experiments were repeated on a kernel three years older, with similar results. To check for implementation flaws in the IDE driver, even older versions were booted, but then the hardware was not fully recognized and a disk dump was impossible.

Image Verification

After the disks repeatedly produced different results when read with the i865 chipset, other devices were tried for dumping. The AMD/nForce system, from about the same era as the i865 system, behaved much the same but was able to read disk 4, which was inaccessible on the i865 (the reason is not entirely clear: we may not have tried hard enough to register the disk with the machine by restarting, delaying bootup or reloading the required kernel modules). Over a couple of runs, fdisk produced different results for the same disk (which indicated an early error at the very beginning of the disk), and so did the dd runs. The images produced were again different from those produced on the i865 system.

The Davicontrol adaptor was a recent addition to the hardware pool and was used next. So far it has never failed to produce a proper partition table reading from the different disks. The test was repeated regularly. The images produced in different runs were exactly the same, as tested with diff and md5sum. The partition listings of the resulting image files looked as expected (identical to the reading from the original source). The same was true for mounting the partitions (none of the filesystem errors encountered with the other images occurred). This setup therefore appears to produce the expected results and to have none of the hardware issues of the setups used before. Nevertheless, there might still be hidden errors.

Conclusions

dd does exactly the job it is expected to do: imaging block devices block by block to image files. If the input is corrupted for some reason (but the operating system does not detect and report an error), dd (or similarly dd_rescue) has no means of detecting the problem. The system imaging therefore has to be verified thoroughly, using different tools and methods to ensure correctness. Rerunning the dump and comparing the results is one of the available options. The errors encountered highlight once again the importance of a hardware archive offering various options to try on the device to be imaged. Adaptors and system buses fall out of use, and the resulting device obsolescence might prevent the proper preservation of a system. Non-intrusive disk dumping, which requires a suitable dumping OS paired with the proper hardware, at least circumvents the problems detected in this case study.
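The rerun-and-compare check can be sketched as follows: hash every dump of the same disk and see whether all runs agree. MD5 is used here only to mirror the md5sum comparison mentioned in the text, and the function names are hypothetical. Agreement shows that the reads were repeatable, not that they were correct, so it complements rather than replaces the other checks.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 digest of a file without loading it into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_checksum(paths):
    """Group image files by digest; more than one group means the dumps differ."""
    groups = {}
    for path in paths:
        groups.setdefault(md5_of_file(path), []).append(path)
    return groups


def dumps_agree(paths):
    """Return True if all given image files have the same digest."""
    return len(group_by_checksum(paths)) == 1
```

Grouping by digest, rather than returning a single yes/no answer, makes it easy to see which runs produced the same image when only some of them differ.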

Interview with a SCAPEr - Catherine Jones

Who are you?

I am Catherine Jones. I am an Information Systems Project Manager in the Scientific Computing Department at the Science and Technology Facilities Council (STFC) in the UK.

STFC is one of the UK's seven Research Councils and provides both funding for research in Astronomy, Particle and Nuclear Physics and Space Science, and large-scale scientific facilities (http://www.stfc.ac.uk/home.aspx).

Your role in SCAPE?

I am the STFC project manager responsible for planning and coordinating all STFC activities on SCAPE. STFC contributes in four main areas in SCAPE: we lead the work in the Research Data Testbed, we build tools for preserving research data, we are part of the policy representation discussions and we lead work in developing guidelines for best practices in preservation of scientific data in the context of SCAPE work.

At the moment I am working on Policy Representation. This activity has two aims. The first is to identify, and provide guidance on, the topics that need to be considered and addressed within preservation policy for a whole organization or a particular content/collection, to assist those people who need to write policy in this area. The second is to produce a set of machine-understandable, and potentially actionable, statements which can be used by the PLATO planning tool and the SCOUT watch tool. I am working on describing a process to enable this translation of natural language policy into machine-understandable statements, together with a sample set. This is a challenge, as humans don’t need every implicit fact made explicit as computers do!

Why is your organisation involved in SCAPE?

STFC is publicly funded and hence the data produced and managed here should be preserved for the long term. This data forms part of the Record of Science alongside scholarly articles. By participating in the SCAPE project with partners across other sectors we can share experience and practice.

What are the biggest challenges in SCAPE as you see it?

A challenge that the Research Data Testbed is starting to consider and work on is the preservation of the context for research data. For many types of research data it is not enough to preserve the object to enable use/reuse in the future; additional pieces of information also need to be preserved, or linked to in a permanent way. This becomes a challenge when such links are created without taking into consideration how they may be preserved over the longer term.

What do you think will be the most valuable outcome of SCAPE?

I think that the work done on the watch tool SCOUT, which will enable certain conditions to be monitored in the wider environment, is a welcome addition to the tools and infrastructure available for those concerned with digital preservation. To be able to use SCOUT effectively, an organization will need to have considered and decided on its preservation objectives and underpinning policy, thus helping to ensure that digital objects are kept and hopefully functionally preserved for future use.

Contact information:

Catherine.jones@stfc.ac.uk
Skype handle: cm_j0nes
http://www.stfc.ac.uk/home.aspx

bwFLA Demo - Emulation as a Service (EaaS) and Digital Art Curation

Finally, a first semi-public demo instance is available to the OPF community. The current version features

- an overview of basic emulation services; different emulator + OS platforms are available for testing. The next bwFLA release will feature sophisticated user management, i.e. users can start with a base image, clone this image as a dedicated user machine and further develop it into a dedicated rendering environment for certain digital artifacts;

- bwFLA / EaaS as a digital curation tool for digital art, using the example of Transmediale (http://www.transmediale.de/) CD-ROM art.

Access to the demo is password protected. Password and a quick overview of the demo features can be found in the members-area of the wiki: http://wiki.opf-labs.org/display/PT/bwFLA+test+demo+instance

If you are interested in the bwFLA workflows, our current use-cases and as yet unreleased bwFLA features, join the webinar held by our dear colleague Annette on Tuesday, March 26 2012 at 12:00 BST / 13 CET.

Please register at: http://opf-bwfla-webinar.eventbrite.com/.

Digital Forensics and Emulation for Preservationista

Many of the tools and practices developed for the digital forensics field can be integrated into digital preservation techniques. This is particularly true regarding:

  • processes for securing, analysing, and appraising material prior to ingest into a repository or digital archive.
  • donated personal digital archives, where physical hardware and media are acquired, rather than digital content.

Digital forensic best practices deal with:

  • Acquisition: activities to secure and preserve the state of physical and digital evidence. These include disk imaging, metadata creation, and producing authentic copies for examination. These techniques can also be used to "secure and preserve the state" of physical media and digital content.
  • Examination: a rigorous, systematic examination of data to locate information of interest to the investigation. Methods here include duplicate detection, identifying system files from Operating Systems, programs, etc., detecting encryption, detecting personal data, and time line analysis. These techniques can complement existing characterisation methods in the digital preservation field.
  • Analysis: an often manual analysis of extracted data, evaluating it for relevance to the investigation. A digital preservation practitioner's activities might include assessing the relevance of digital material to the collection, finding or removing personal information, or creating a content profile.

These forensic tools and methods combined with established digital preservation tools and techniques can provide a pre-ingest workflow that:

  • secures data at the point of acquisition.
  • allows tools to be run on imaged copies protecting the source media and data.
  • provides the metadata required to make informed decisions regarding the content.

What those decisions are in practice depends on variable factors including the type of access to be provided, permissions granted by the rights holder, and institutional policy.

One scenario might be that a subset of content is extracted from the image and ingested into a repository. Access to this material could be provided through an emulated environment, the choice of environment and rendering software informed through metadata gathered during the pre-ingest process.

When requirements and permissions allow, however, a potentially exciting emulation opportunity may present itself. It is possible to virtualise some of the disk images created from the original media. The image must contain a supported, working operating system, and the process isn't certain to be successful. When this approach works, access to the material can be provided through an emulated version of the original machine, which may have belonged to an author or researcher. This article in The Atlantic, and this one from The New York Times describe a nice example of the use of these or similar techniques to preserve the personal digital archive of Salman Rushdie.

The OPF is hosting two hackathons focussing on these themes:

PDF Eh? - Another Hackathon Tale

“Characterization” can mean many things (I’m particularly fond, especially in this context, of the OED’s “creation of a fictitious character or fictitious characters”). Back in October Paul Wheatley suggested that digital preservation practitioners needed “better characterisation” and defined this as enabling them to determine the condition, content and value of digital records prior to ingest (computer-aided appraisal if you will). To this end SPRUCE organised a Unified Characterisation Hackathon with the intent of “unifying our community’s approach to characterisation by coordinating existing toolsets and improving their capabilities”. With FITS, DROID and JHOVE2 all well represented this promised to be a good event, and good it was!

After the usual chaos of any open agenda event, we quickly settled into groups covering four issues: Integrating Tika and C3PO, Integrating Tika and FITS, comparing and contrasting Tika’s and DROID’s approaches to file format signature definition, and identifying preservation risks in PDFs. I worked on this last one in the company of some great people who may well have their own takes on how it went, but here are my thoughts.

We began with a discussion about possible risks for PDFs. The answer wasn’t simple. For example, an encrypted PDF may be at risk, but Andy Jackson pointed out some encrypted PDFs have empty passwords and argued that this was a safe kind of encryption (assuming people remember to try empty passwords when opening them). Tangled into this discussion was the question of validity. It was suggested that the PDF/A specification could be used as a checklist for PDF preservation risks. However, a simple conformance check - valid or invalid - was not enough. The specification might disallow things that we, as the keepers of that content, have decided are risks we are willing to take. Some issues may be too expensive to solve, require an even riskier migration, or have other external factors (e.g. commercial agreements) that determine their importance. In short, we needed a tool that would not just respond in Boolean, but rather empower the user to make an informed choice about their content.

Johan van der Knijff steered the group towards Apache’s PDFBox and its Preflight component – a tool that tests PDFs for conformance to the PDF/A-1b specification (and by implication the PDF specification). It seemed a good starting point and Will Palmer quickly set to work. Very soon we had Preflight built, running and outputting XML instead of its usual unstructured text (see some example outputs and the changes made). Each divergence from the spec was interpreted as a possible preservation risk to be reported to the user. Will has since contacted the PDFBox developers to get this patch included in the Preflight release.

Armed with an XML statement showing which parts of the PDF/A specification a given document failed on, we now needed a way to allow the user to say which of these were of interest and only present those in the final report. Sheila Morrissey came up with an elegant solution. She created a policy file that enables practitioners to define which of the errors Preflight handled they were interested in – either flag as a warning, ignore completely or fail the entire validation process. She then defined an XSLT stylesheet used to filter Preflight’s output to create a report showing only those risks.
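The idea of filtering validation errors through a policy can be illustrated with a short sketch. Note that this is not the XSLT-based solution described above: it uses Python instead, and the XML element names, error codes and policy format are all hypothetical stand-ins rather than the actual output of the patched Preflight or the real policy file.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy: error code -> "fail", "warn" or "ignore".
POLICY = {"1.0": "fail", "2.4.3": "warn", "7.1": "ignore"}

# Hypothetical validation output; the real XML structure may differ.
SAMPLE_XML = """
<preflight>
  <errors>
    <error><code>1.0</code><details>Syntax error</details></error>
    <error><code>2.4.3</code><details>Invalid colour space</details></error>
    <error><code>7.1</code><details>Metadata problem</details></error>
    <error><code>9.9</code><details>Not covered by the policy</details></error>
  </errors>
</preflight>
"""


def apply_policy(validation_xml, policy, default="warn"):
    """Sort reported errors into actions; errors marked 'ignore' are dropped."""
    report = {"fail": [], "warn": []}
    for error in ET.fromstring(validation_xml).iter("error"):
        code = error.findtext("code", default="").strip()
        details = error.findtext("details", default="").strip()
        action = policy.get(code, default)
        if action != "ignore":
            report.setdefault(action, []).append((code, details))
    return report


report = apply_policy(SAMPLE_XML, POLICY)
print(report["fail"])
print(report["warn"])
```

Treating errors that the policy does not mention as warnings is one possible default; a stricter setup could pass `default="fail"` instead.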

To show how this could all fit together we then created a GUI - PDF Eh? (a name I'm particularly proud of!) - that enables a user to run Preflight over a directory of files, applies the policy via the XSL transform and reports the results. Lynn Marwood created a parser for the rules file to enable the end-user to define their policy in the UI and re-run validation tests. Currently these rules are displayed but turning them on or off does not change the output. The GUI runs the validation too, so it is slow to respond when called on large directories. While a basic proof-of-concept, this code provides a useful framework and shows that with a few tweaks of an existing tool and some neat XSLT we can identify preservation risks and incorporate policies very succinctly.

Code for the GUI including the XSL is on GitHub.

We also felt this approach could be used in other contexts - a SCAPE component or a plug-in for (insert favourite preservation/repository solution). To prove this point Maurice integrated the XML-enabled Preflight with FIDO and used it to show how a validation check can be used to augment file identification, particularly in the case where the magic is intact but the file is otherwise broken (streams of zeros after the end-of-file marker for instance). While this extra step may add time and complexity to file ID via magic numbers, Maurice demonstrated here that further characterisation can help provide a definitive answer for awkward edge cases.
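The edge case mentioned here, intact magic followed by a broken file, can be illustrated with a much simpler check than a full Preflight validation. The sketch below is a simplified stand-in, not the FIDO integration itself: it only looks for the `%PDF-` header at the start of the data and for bytes trailing the last `%%EOF` marker, and the function name is hypothetical.

```python
def check_pdf_bytes(data):
    """Report whether the PDF magic is present and what follows the last %%EOF marker."""
    eof_position = data.rfind(b"%%EOF")
    if eof_position == -1:
        trailing = b""
    else:
        # Line endings directly after the marker are normal and ignored.
        trailing = data[eof_position + len(b"%%EOF"):].lstrip(b"\r\n")
    return {
        "magic_ok": data.startswith(b"%PDF-"),
        "eof_marker_found": eof_position != -1,
        "trailing_bytes": len(trailing),
        "trailing_all_zero": bool(trailing) and trailing.count(0) == len(trailing),
    }


# A file whose magic is intact but which is padded with zeros after the end.
padded = b"%PDF-1.4\n1 0 obj\nendobj\n%%EOF\n" + b"\x00" * 100
print(check_pdf_bytes(padded))
```

A file like `padded` would be identified as PDF by magic alone, while the extra fields hint that something went wrong when it was written or copied.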

As an aside, using XML output gave rise to the question what should we use for the tags? I’m not sure we ever came to an answer. Maurice de Rooij argued for XCDL, I wondered about aligning with jpylyzer’s XML output. Standardising could make creation of further processing tools (comparisons for quality assurance for example) easier.

It was a good couple of days. We talked to other people, learnt new things and shared approaches and if that isn’t the first steps to a unified approach I don’t know what is! My only regret was getting so caught up I didn’t spend more time in the other groups, digging into Tika or FITS or DROID or C3PO. But there will be other days.

 

New characterisation developments from the SPRUCE hackathon

A day after running our Characterisation Hackathon (and helping out with a lively DPC event on PDF/A-3) and I'm still feeling exhausted. This was a developer only event and not as taxing on my facilitation skills as our usual mashups, but it's still been an action packed few days. All this moaning is of course somewhat irrelevant as these events are all about the participants and it was certainly those guys who did the hard work.

 

Andy Jackson shows his visualisations of tool sensitivity to bit flips throughout a file

Facing the challenge of taking on digital preservation characterisation and making it better, we began with some scene setting lightning talks from our hackers. Andy Jackson challenged us to take on some familiar problems which I paraphrased at the time on twitter as "Too many characterisation tools, too complex, don't meet users needs." As usual Andy wasn't pulling any punches. Despite loads of great development work on characterisation tools, we still have much to do. A key aim for the event was to get some of the key developers working together more effectively, and taking on some of the problems Andy hinted at.
 
Our starting point was a scratch space that was chock full of great ideas to pursue. It was collaboratively created by our event participants over previous weeks. We boiled this wealth of information down to 6 somewhat crude themes and then voted on them. For the rest of the hackathon we worked in small groups to take on these challenges. Periods of development time were interspersed with discussion, reporting back, demos and periodic eating of some fabulous home baked cakes.
 
By 1600 on the second and final day of the hackathon we had some great results in four key areas. Individual blogging from our attendees will provide a lot more detail on the work undertaken, but for now, here's a summary from me:
 
Just solving the PDF problem
 
Despite there being a range of tools that tell us useful stuff about PDF files, there isn't a simple, focused tool for identifying clearly understood preservation risks in PDFs, which presents the results in a form suitable for a layman/woman to understand and act on. This was the challenge the first group took on. With the presence of some great devs, as well as PDF gurus such as Johan van der Knijff and Sheila Morrissey, I had high hopes. Those guys did not let us down. Starting with Apache Preflight and its ability to validate a PDF against the PDF/A standard, this group worked primarily on processing Preflight's output to meet the use case outlined above. The resulting tool comes with a sensible default configuration that alerts the user to concrete risks, but this can easily be tweaked by more advanced users.
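
To illustrate the idea of a default configuration that more advanced users can tweak, here is a minimal sketch. It is not the tool the group built: the finding names and the policy format are hypothetical, and no real Preflight error codes or API calls are used.

```python
# Hypothetical policy: which validator findings count as a preservation risk.
# The finding names are invented for illustration, not Preflight codes.
DEFAULT_POLICY = {
    "font-not-embedded": "warn",
    "encrypted": "warn",
    "missing-xmp-metadata": "ignore",
}


def assess(findings, policy=DEFAULT_POLICY):
    """Return the distinct findings the policy flags as risks, sorted."""
    return sorted({f for f in findings if policy.get(f, "ignore") == "warn"})


# Findings as a validator might report them for one file (made-up data).
findings = ["font-not-embedded", "missing-xmp-metadata", "font-not-embedded"]

# Default configuration: only the concrete risk is reported.
risks = assess(findings)

# A more advanced user tightens the policy without touching the code.
strict_policy = dict(DEFAULT_POLICY, **{"missing-xmp-metadata": "warn"})
strict_risks = assess(findings, strict_policy)

print(risks)
print(strict_risks)
```

Keeping the policy as plain data is what would let a repository manager adjust the alerts without needing a developer.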
 
The potential of the results is considerable. As an ePrints plugin for example (and perhaps also used earlier in the deposit lifecycle), this could revolutionise preservation in our institutional repositories.
 
Consolidating file format identification
 
The second hackathon theme was file format identification, and more specifically, the signature magic that ID tools match with bytes from the headers (and sometimes footers) in target files. DROID, Tika and File all have their own magic, stored in different formats. "Team File Format" looked at mapping DROID and Tika formats together and seeing how we could get value out of amalgamating all this disparate knowledge. The results were fascinating but require some further exploration as the picture is a complex one. There are significant numbers of formats that have magic in DROID but not in Tika, magic in Tika but not in DROID, and magic in both (while also noting that DROID magic is more specific to the version level than Tika magic). All of these groups offer potential for follow up, conversion or exploitation. Even where there is magic in both, it's of course not always the same.
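
The three-way split described above is essentially a set comparison. The sketch below assumes both signature collections have already been mapped onto a common key (a MIME type here), which in practice is the hard part. The sample entries are toy data for illustration and are not taken from either tool's signature files.

```python
# Toy data: common format key -> magic bytes. Not real DROID or Tika content.
droid_magic = {
    "application/pdf": b"%PDF-",
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/gif": b"GIF89a",
}
tika_magic = {
    "application/pdf": b"%PDF-",
    "image/gif": b"GIF8",
    "application/zip": b"PK\x03\x04",
}


def compare_coverage(a, b):
    """Split the formats of two magic collections into coverage groups."""
    keys_a, keys_b = set(a), set(b)
    both = keys_a & keys_b
    return {
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
        "same_magic": sorted(k for k in both if a[k] == b[k]),
        "different_magic": sorted(k for k in both if a[k] != b[k]),
    }


report = compare_coverage(droid_magic, tika_magic)
for group, formats in report.items():
    print(group, formats)
```

The "different_magic" group is the one that needs a human eye: the two signatures may be equally valid, or one may be more specific than the other.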
 
This work goes a long way to addressing the "too many tools" argument as levelled at file format ID. More needs to be done, but David Clipsham (he works full time on file format magic, so that makes him a digital preservation magician, right?) now has a great resource for compiling more DROID signatures and quality checking existing signatures. Additional tools for creating and testing new signatures have also progressed, and there will be more about this in a blog post from Peter May.
 
An interesting discussion for me emerged when David was telling us about the precision possible (or not always possible) when the constant magic bytes in a format are quite short. If they are too short, identification can produce false positives, matching files that are actually in other formats. Creating good signatures requires some art, not just science. Format really is a fuzzy thing: a fundamental digital preservation concept that's not always easy to get your head round.
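
A small sketch makes the short-signature problem concrete. It is a deliberately naive prefix matcher, not how DROID or Tika actually work, and the signature table and sample files are invented for illustration. The two-byte "BM" prefix is the start of a BMP file, and the eight-byte sequence is the PNG signature.

```python
# Naive identifier: a file "is" a format if it starts with that format's magic.
SIGNATURES = {
    "bmp": b"BM",                 # only two constant bytes
    "png": b"\x89PNG\r\n\x1a\n",  # eight constant bytes
}


def identify(data, signatures=SIGNATURES):
    """Return the names of all formats whose magic prefixes the data."""
    return [name for name, magic in signatures.items() if data.startswith(magic)]


def chance_match(magic):
    """Probability that uniformly random bytes happen to start with magic."""
    return 256.0 ** -len(magic)


# A plain text file that merely happens to begin with the letters "BM".
text_file = b"BMW service record, March 2013\n"
png_file = SIGNATURES["png"] + b"\x00\x00\x00\rIHDR"

hits_text = identify(text_file)  # the short signature fires on plain text
hits_png = identify(png_file)

print(hits_text, hits_png)
print(chance_match(SIGNATURES["bmp"]), chance_match(SIGNATURES["png"]))
```

Real files are not random bytes, so the practical false positive rate for short magic can be far worse than the random estimate suggests. Plain text that starts with common letter pairs is the obvious example.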
 
 
Petar Petrov telling us about C3PO
 
Adding Apache Tika to FITS and C3PO
 
Apache Tika has become very popular with digital preservationists at our recent mashup events, and as well as file format identification it offers extraction of a host of properties of potential use in long term preservation. Incorporating it into some of the best meta tools would again help in meeting the complexity and "too many tools" arguments Andy mentioned at the start. Our two remaining groups therefore decided to add support for Tika to the analysis and visualisation tool C3PO and to the combined characterisation tool FITS. Of course we had the authors of FITS, C3PO and JHOVE on hand to spearhead this work, with support from additional expert hackers, so the results for a day and a half of dev were impressive. Petar Petrov (of C3PO fame) was able to demo a C3PO analysis of Tika output. A new release is expected shortly. The FITS group had a considerably harder challenge, but the lion's share of the work has been completed and Spencer McEwen (FITS developer from Harvard) demo'd FITS in action, capturing properties for a small number of formats. The wealth of properties Tika extracts, combined with FITS' role in making sense of those properties and merging them with output from other wrapped tools, made this a big challenge. Spencer is hoping to do a quick release of another minor development (addition of execution timings for each tool in the FITS reporting) and then a more complete release with full Tika support will come later.
 
Much discussion was also had about how these developments and tools could progress in a more community driven way. OPF aims to support and coordinate tool development and much useful work was identified for the next few months, relating in particular to FITS, C3PO, JHOVE and JHOVE2.
 
Although some of our participants had met previously and some have had frequent exchanges on twitter, they had never all met up in the same room and then coded together. Clearly some strong bonds have been forged, so I'm hoping that the seeds have been sown for lots more collaboration. A couple of SPRUCE Awards should help the sustainability of the event results, although several participants could already see where results would be useful for future activities they have on the horizon. An encouraging sign.
 
At the end of our events we ask the participants to fill out a quick anonymous feedback form. In our last question, we asked if there was appetite for other events like this, perhaps on an annual basis. Everyone said yes. Some used the words "definitely", "absolutely" and "pretty please". The latter was in capitals. Several used the word "yes" 3 or more times. A third of the respondents suggested repeating on an annual basis might not be frequent enough. Wow! Clearly there is a lot of appetite for making this happen (and the results speak for themselves), although of course, shrinking travel budgets are going to make this harder. One suggestion at the end of the event was to tag some hack events onto popular conferences in our community. Any takers for iPRES or PASIG?