| Storage and access
|Collection storage / preservation of masters|| BnF is currently unifying all its storage solutions into a preservation system, SPAR, designed to incrementally ingest the library's various assets. So far, we have ingested the masters of all books and still images digitized since May 2010.
We have also developed the system to ingest previously digitized books and images, as well as Web archives, digitized audiovisual material, PDFs of large-size posters and third-party data. These collections are being prepared for ingest over the next year.
The infrastructure: we use disks for temporary storage during the submission process, and tapes for long-term preservation. We have two geographically distinct sites, and may implement a targeted exchange of our most valuable data with another institution, thus creating a third storage site. See also: http://www.bnf.fr/en/professionals/preservation_spar/s.preservation_SPAR_infrastructure.html
The software: our repository is based on open-source software and specific development by our subcontractor, Atos, following BnF requirements. The whole system is designed around the OAIS model; for instance, the different modules of the repository software generally follow the functional entities described in the standard. (See also: http://www.bnf.fr/en/professionals/preservation_spar/s.preservation_spar_realization.html)
|Collection access|| Access copies are stored separately. This is a holdover from the first digitization workflows, and it was maintained when the preservation system was launched, since the processes work well and are constantly being improved. Integrating complex access services into the preservation system too soon would have resulted in potential instability; for now SPAR only delivers the master copies (DIP = AIP).
This means that the master copies of our digital documents usually go through two different processes: one to preserve them, and one to make them available to end users in a more appropriate end format.
For example, web archives use the Wayback Machine (http://archive-access.sourceforge.net/projects/wayback/); the audiovisual collections are handled by a specific system providing appropriate rendering tools for the transferred content; books and images are displayed in the digital library, Gallica (http://gallica.bnf.fr/), which relies on JPEG2000 technology to display the images, while also providing OCR in the ALTO format and tables of contents in XHTML.
|Access protocol||What technical protocol is used to access files? Local file systems? Windows shares? (SMB/CIFS)|
|Workflow overview|| In order to deal with the diversity of assets at BnF, SPAR implements different tracks for each general type of content (material digitized for preservation purposes, born digital content acquired through legal deposit, library archives, donations, etc.). Tracks are subdivided into channels according to the technical characteristics of the content managed in each channel.
Thus, when each collection has been put through its individual acquisition process (digitization and quality control, Web harvesting, media transfer of audiovisual documents), and while the files continue their journey into the appropriate access workflow, the content can be deposited into the SPAR delivery zone for the channel it belongs to. This does not have to be simultaneous, although our ultimate goal is to integrate the preservation workflow with the acquisition process, as we already do with the output of our mass digitization campaigns.
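The track/channel routing described above can be sketched as a simple lookup. This is a hypothetical illustration: the track and channel names, the `route_package` function, and the keying on (content type, technical profile) are all assumptions for the sake of the example, not SPAR's actual identifiers or API.

```python
# Hypothetical sketch of SPAR's track/channel routing: a track covers a general
# type of content, and channels subdivide it by technical characteristics.
# All names below are illustrative, not SPAR's real identifiers.

CHANNELS = {
    # (track, channel) keyed by (content_type, technical_profile)
    ("digitized", "image/jp2"): ("preservation_digitization", "still_images"),
    ("digitized", "audio/wav"): ("preservation_digitization", "audiovisual"),
    ("web_harvest", "application/x-arc"): ("automated_legal_deposit", "web_archives"),
}

def route_package(content_type, technical_profile):
    """Return the (track, channel) whose delivery zone should receive the content."""
    try:
        return CHANNELS[(content_type, technical_profile)]
    except KeyError:
        raise ValueError(
            f"no channel configured for {content_type}/{technical_profile}")

# Example: an ARC file produced by a crawl goes to the Web archives channel.
track, channel = route_package("web_harvest", "application/x-arc")
```

Keeping the routing table as data rather than code mirrors the system's data-driven design: adding a channel is a configuration change, not a code change.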
Focusing on the workflow within the preservation system:
- the content is first processed by a pre-ingest module. There are as many pre-ingest modules as there are channels in SPAR. Their goal is to take "raw" data out of the acquisition and production workflows and turn them into a proper SIP which can be managed by the rest of the system in a generic way.
- Ingest: this module builds an AIP out of the SIP by identifying and characterizing (if required/possible) the files, writing the preservation metadata (technical information, description of the actions of the system) into the METS manifest of the package, etc.
- Storage: the packages are written onto the hardware through the Storage Abstraction Services.
- Data management: relevant information on each package is stored in this module, as a METS manifest and as RDF in triple stores. These triples are searchable through a SPARQL endpoint.
The module also contains reference information about the system, the processes, formats, software, etc. This information is ingested as information packages, in order to make the system self-documenting over the long term.
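The ingest step above can be illustrated with a toy manifest builder. This is only a sketch: real METS wraps preservation events in `digiprovMD`/`mdWrap` sections and uses `xlink:href` for file locations, and SPAR's actual data model is richer; the element layout and the `build_manifest` helper here are simplifications for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "info:lc/xmlns/premis-v2"

def build_manifest(file_name, data):
    """Toy sketch of an AIP manifest: one file entry with its checksum, plus a
    PREMIS-style event recording the ingest action. The structure is
    illustrative; real METS nests these elements more deeply."""
    mets = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {
        "ID": file_name,
        "CHECKSUMTYPE": "MD5",
        "CHECKSUM": hashlib.md5(data).hexdigest(),
    })
    ET.SubElement(file_el, f"{{{METS_NS}}}FLocat", {"href": file_name})
    # Audit trail: record the action performed by the system on the package.
    amd = ET.SubElement(mets, f"{{{METS_NS}}}amdSec")
    event = ET.SubElement(amd, f"{{{PREMIS_NS}}}event")
    ET.SubElement(event, f"{{{PREMIS_NS}}}eventType").text = "ingest"
    return mets

manifest = build_manifest("page_0001.jp2", b"...image bytes...")
```

Carrying the checksum and the event history inside the package itself is what lets the data management module later expose them as RDF triples, and lets the audit trail travel with the AIP.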
About the Web archives workflow in particular (a paper by Sébastien Peyrard and Clément Oury on this topic will be presented at iPRES 2011):
The BnF holds Web archives collections dating back to the 90s. They were first donated by the Internet Archive, but have been harvested in-house for the past few years, using the Internet Archive tool Heritrix (http://crawler.archive.org/) for crawling, and NetarchiveSuite (http://netarchive.dk/suite/) to manage the crawling process. These tools generate a lot of metadata about the conditions and results of the crawls. The resulting archives and metadata are encapsulated in ARC container files (http://www.archive.org/web/researcher/ArcFileFormat.php) and are then indexed for access.
The digital preservation team set out in 2010 to develop the means to ingest those Web archives into SPAR. We determined that five different channels were needed to handle the technical diversity of the archives within the Automated Legal Deposit track. Each of these channels has a specific pre-ingest module to create SIPs out of the ARC files: a METS manifest is added to each package, containing metadata compliant with the SPAR data model. The ARC files containing metadata about the crawls become SIPs themselves, and are linked to the SIPs of the Web content harvested during the crawl.
During the ingest process, JHOVE2 is run on the ARC files to identify and validate the ARC container file itself, but also to identify the embedded files retrieved from the crawled web sites, using the Unix FILE utility. The output goes into an XML file using the containerMD schema, which we developed for this purpose. The XML file is then attached to the METS manifest and both are encapsulated in an AIP for each ARC file, ARC metadata files included.
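The identification-and-aggregation step can be sketched in miniature. In SPAR the real work is done by JHOVE2 and the FILE utility; the signature table, the `sniff` function and the containerMD element names below are approximations for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Minimal magic-number sniffing standing in for the Unix FILE utility
# (illustrative: FILE's real database covers thousands of signatures).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF": "application/pdf",
    b"<html": "text/html",
}

def sniff(payload):
    """Guess the format of one harvested record's payload."""
    for magic, mime in SIGNATURES.items():
        if payload.startswith(magic):
            return mime
    return "application/octet-stream"

def summarize_records(records):
    """Aggregate per-format counts for the records of one container file,
    in the spirit of containerMD (element names approximated)."""
    counts = Counter(sniff(payload) for payload in records)
    root = ET.Element("containerMD")
    entries = ET.SubElement(root, "entries")
    for mime, n in sorted(counts.items()):
        ET.SubElement(entries, "entriesInformation",
                      {"format": mime, "number": str(n)})
    return root

# Example: three harvested payloads from one (simulated) ARC file.
summary = summarize_records([b"%PDF-1.4 ...", b"<html><body>", b"<html>..."])
```

Summarizing per container rather than per record is what keeps the metadata manageable: a single ARC file can hold tens of thousands of harvested resources.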
|Tools used in workflow|| Among the open source components are:
- JHOVE, MediaInfo, JHOVE2, MagicMimeType: identification and validation tools
- Virtuoso: core of the data management module
- iRODS: core of the storage abstraction module
- SAM/QFS: handling of tape libraries
|Workflow technologies||Custom coding and integration parts are J2EE based, with REST API for the communication between the modules.|
|Workflow challenges|| Strengths:
- The system is based on data (see http://www.ifs.tuwien.ac.at/dp/ipres2010/papers/fauduet-13.pdf). This means, for instance, that every action performed on the data is registered in the audit trail, including the piece of software that performed it. This way we can track any malfunctions and apply the needed corrections: non-technical people know what goes on inside the system.
- From a hardware perspective, the use of an abstraction layer for storage (iRODS) allows us to make changes to the infrastructure without impacting the software or even stopping the system.
Weaknesses:
- The modularity of the system introduces some complexity that makes training the administrators difficult.
This is also true of the abstraction layer and the handling of tapes: acquiring the knowledge and training people are probably the most complex tasks.
- From a technical point of view, the system is I/O bound (reading, writing and checksumming files). So there is a very high demand on the whole storage infrastructure, and any failure, design error or misconfiguration is very penalizing. We are working on improving our supervision tools.
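The checksumming cost mentioned above comes from having to read every byte of every package. A minimal sketch of a fixity check, assuming a streaming read so memory use stays flat while the I/O remains the dominant cost (the function name and chunk size are illustrative, not SPAR's implementation):

```python
import hashlib
import os
import tempfile

def fixity_check(path, expected_sha256, chunk_size=1 << 20):
    """Verify a stored file against its recorded checksum, reading in 1 MiB
    chunks: memory stays constant, but every byte must still come off storage,
    which is why fixity audits stress the whole infrastructure."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest() == expected_sha256

# Demo: write a small file and verify its fixity.
data = b"preservation master bytes"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)
ok = fixity_check(path, hashlib.sha256(data).hexdigest())
os.remove(path)
```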
|Automation|| We try to avoid manual steps as a rule, given the mass of data involved.
|Location of Solution in workflow|| A solution dealing with the discrepancies in identifying web content might be run as a secondary control on sample information packages. Then, if necessary, metadata from the AIPs could be extracted from the preservation system, modified and reingested to create a more accurate release of the AIPs.
|Workflow change process|| The process requires human resources, of which there are not enough, and might therefore be rather lengthy.
Changing the workflow requires involvement from the digital preservation team and the collection owners, who are spread across different departments at the BnF, which might make the process slightly slow. Implementing the defined changes will then require development resources, which are a rare commodity at the library, and adding the changes to the system takes time from production engineers, who are usually overworked and take time fitting new operations into their schedules.
|Workflow execution actor|| BnF's production team in the IT department
|Workflow administration actor||BnF's production team in the IT department|
|Workflow executor rights|| The production team has essentially all the rights needed to tinker with the system, under the supervision of the project manager and of the preservation team representing the users.
|Collection owner/curator|| The collection owners vary from content to content. In the case of Web archives, the curators work in the Legal Deposit Department, in the Digital Legal Deposit Services.
|Workflow champions||Is there a workflow champion, who is it?|
|Dataset/Issue/Solution links|| Dataset: French Web Archives
Issue: Identifying web content