London School of Economics

Institutional context
Institution type (e.g. Library, Archive)
Academic/research library  
Where are collection masters stored (media, number of copies, backup, preservation system)?
Now: local SAN, backed up, some content also on DVD

Soon: 4 copies, managed by our preservation repository infrastructure:
  • local SAN
  • mirrored, offsite SAN (daily sync)
  • incremental backup to disc (daily)
  • mirror of incremental backup to tape (daily - including media refresh cycle managed by backup tool)
Where is access to the collection provided from?
Now: low-res images through our catalogue OPAC

Later this year: through the OPAC and the LSE Digital Library front-end delivery application (in development)
What technical protocol is used to access files? Local file systems? Windows shares? (SMB/CIFS)
Now: low-res images on a public share, linked through the OPAC and available over http

Later this year: some sort of shares from the SAN to the repository server (subnet, trusted connection) which will serve over a RESTful/http API to the web application server (firewalled, trusted connection) which provides public access over http (untrusted connection)

Unless the files are part of a closed archive, in which case we are considering an IP-restricted access service to our physical reading room or manual retrieval by an archivist from a single workstation
Describe your existing content workflow (in words, or with a diagram)
Our digital library is designed to support different workflows and collection types within the same infrastructure. Fedora operates at the core, mediating all digital object CRUD operations to and from storage. These operations provide the basis for ingest, curation, preservation and access, which are implemented by a variety of tools operating, to varying extents, within our Hydra application framework.

Born-digital archives are captured through a variety of means - currently from legacy media (using hardware write-blockers and disk imaging tools), via secure FTP, or on external HDDs. Future developments to the capture stage are likely to involve forensic analysis of disk images and web harvesting. From there, files are transferred onto our processing workstation and into our workflow tool, Archivematica, which manages quarantine, virus checking, format characterisation, metadata extraction, normalisation to a content-type default format and calls to the Handle server to assign persistent IDs. Archivematica outputs a BagIt file containing all original and surrogate files, technical metadata and log files from all processes.
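
On receipt, the checksums in a bag's manifest can be re-verified before any further processing. A minimal sketch, assuming a standard BagIt layout with an MD5 manifest and payload under `data/`; the example bag built at the bottom is hypothetical:

```ruby
require "digest"
require "fileutils"
require "tmpdir"

# Verify a BagIt bag: every entry in manifest-md5.txt must exist
# under the bag root and match its recorded MD5 checksum.
def bag_valid?(bag_root)
  manifest = File.join(bag_root, "manifest-md5.txt")
  return false unless File.exist?(manifest)
  File.readlines(manifest).all? do |line|
    checksum, path = line.strip.split(/\s+/, 2)
    file = File.join(bag_root, path)
    File.exist?(file) && Digest::MD5.file(file).hexdigest == checksum
  end
end

# Build a tiny example bag and check it.
Dir.mktmpdir do |bag|
  FileUtils.mkdir_p(File.join(bag, "data"))
  File.write(File.join(bag, "data", "letter.txt"), "born-digital content")
  md5 = Digest::MD5.file(File.join(bag, "data", "letter.txt")).hexdigest
  File.write(File.join(bag, "manifest-md5.txt"), "#{md5}  data/letter.txt\n")
  puts bag_valid?(bag)  # prints "true"
end
```

A full check would also validate `bagit.txt` and any tag manifests, but a payload fixity pass like this catches transfer corruption early.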

Digitised collections are generally scanned by external suppliers and delivered in formats according to fairly uniform specifications. Some content is scanned in house to the same specifications. Files are currently transferred to our SAN where we run FITS to extract technical metadata and call the Handle server to assign persistent IDs.
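
Once FITS has run, its XML report can be mined for the identified format before the object moves on. A small sketch parsing an abbreviated, hypothetical FITS report (namespace declarations omitted for simplicity):

```ruby
require "rexml/document"

# Abbreviated, hypothetical FITS output for one scanned image.
SAMPLE_FITS = <<~XML
  <fits>
    <identification>
      <identity format="TIFF EXIF" mimetype="image/tiff"/>
    </identification>
  </fits>
XML

# Pull the identified format and MIME type out of a FITS XML report.
def fits_identity(xml)
  doc = REXML::Document.new(xml)
  identity = REXML::XPath.first(doc, "//identity")
  { format: identity.attributes["format"],
    mimetype: identity.attributes["mimetype"] }
end

puts fits_identity(SAMPLE_FITS)[:format]  # prints "TIFF EXIF"
```

In practice this is where a QA rule could reject files whose identified format does not match the delivery specification.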

Once any content type reaches this stage it is picked up by our ingester (developed locally), which validates the content, writes it into long-term storage via our repository API, and updates the catalogue, via its API, with the persistent IDs of the objects being written to storage.
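
A minimal sketch of the kind of validation gate the ingester could apply before a write into preservation storage; the field names and the required-metadata list are hypothetical:

```ruby
require "digest"

# Hypothetical pre-ingest gate: an object may be written to the
# repository only if its fixity matches and required metadata is present.
REQUIRED_FIELDS = %w[identifier title rights].freeze

def ready_for_ingest?(object)
  fixity_ok   = Digest::SHA256.hexdigest(object[:content]) == object[:sha256]
  metadata_ok = REQUIRED_FIELDS.all? { |f| object[:metadata].key?(f) }
  fixity_ok && metadata_ok
end

obj = {
  content:  "scanned page",
  sha256:   Digest::SHA256.hexdigest("scanned page"),
  metadata: { "identifier" => "hdl:123/456", "title" => "Page 1", "rights" => "open" }
}
puts ready_for_ingest?(obj)  # prints "true"
```

Failing either check would halt the write and trigger the feedback loop described under workflow changes below.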

At this stage we distinguish digitised, open-access content, which is open by default, from born-digital archives, which are closed by default. In stages of the process still under development, open content is indexed on ingest into a public index that will make it available for discovery through our future digital library front-end; closed content is indexed into a separate index for our future administrative interface. Our repository handles all requests for content and only serves content marked as open.
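
The routing and serving rules above can be sketched as follows; the index names and the `access` field are hypothetical stand-ins for whatever our Solr cores and object model end up using:

```ruby
# Hypothetical routing of newly ingested objects: open content goes to
# the public index, everything else to the administrative index.
def target_index(object)
  object[:access] == "open" ? "public-core" : "admin-core"
end

# The repository serves only content marked as open; closed content
# yields nothing, whatever the request.
def serve(object)
  object[:access] == "open" ? object[:content] : nil
end

puts target_index(access: "open")    # prints "public-core"
puts target_index(access: "closed")  # prints "admin-core"
```

Keeping the open/closed decision in one place like this means the same rule governs both discovery (which index) and delivery (serve or refuse).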
What tools are part of the existing workflow?

Born-digital ingest:
  • Archivematica, which bundles other open-source tools

Digitised ingest:
  • FITS

All collections:
  • Fedora Commons repository (digital object manager)
  • Hydra application framework (Ruby gems for interacting with Fedora content and Solr indexes)
  • Solr (indexing)
  • Blacklight (views on Solr indexes for search and browse)
  • Handle server (persistent IDs: mint and resolve)
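
For illustration, resolving a persistent ID through a Handle server's JSON interface amounts to pulling the URL value out of the handle record; the payload below is a hypothetical, abbreviated example of that record shape:

```ruby
require "json"

# Hypothetical, abbreviated handle record as returned in JSON form.
SAMPLE_RECORD = <<~JSON
  {
    "responseCode": 1,
    "handle": "20.500.12345/abc",
    "values": [
      { "index": 1, "type": "URL",
        "data": { "format": "string",
                  "value": "https://digital.library.lse.ac.uk/objects/abc" } }
    ]
  }
JSON

# Extract the URL a handle resolves to, or nil if no URL value exists.
def resolved_url(response_json)
  record = JSON.parse(response_json)
  url = record["values"].find { |v| v["type"] == "URL" }
  url && url["data"]["value"]
end

puts resolved_url(SAMPLE_RECORD)  # prints "https://digital.library.lse.ac.uk/objects/abc"
```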
What technologies underlie the existing workflow?
  • Java
  • Ruby/Rails
  • LDAP
  • Network infrastructure, including various subnets and hardware firewalls
What challenges are present in the existing workflow? (technology, organisational, staffing)
  • Automating quality checking - particularly around metadata extraction and completeness - ensuring trust in the chain of handling, provenance, authenticity
  • Error reporting - separating workflow errors from technical errors and alerting the appropriate person
  • Scaling from testing to production (technology/automation, staffing/training)
  • Extending to additional content types - eg bolting a web harvester or forensic tool into the capture stage of the workflow
Does the workflow include manual steps?
Yes, by design in the case of born-digital collections to allow for appraisal by the archivists

Uncertain in the case of digitised collections: although the workflow is still being designed, there should be fewer manual interactions.
Where in this content workflow would the prototype solution be deployed?
Born-digital migration tool: as a module in the ingester, ensuring quality before a write into preservation storage

Image consistency check: between FITS and the ingester or, depending on how it is implemented, before FITS
What is the process for changing or enhancing the workflow? What obstacles to change are present?
As our workflow is not yet in production there are very few obstacles to making enhancements. These prototype tools will most likely initiate a manual interaction when a problem is detected, resulting in a feedback loop to some earlier point in the workflow. It should be possible to implement the tool without any change to the data model either side of the interaction.
Who executes the existing workflow?
Born-digital: the digital archivist; eventually: any archivist should be able to operate the workflow

Digitised: most likely a technical developer, due to the high levels of automation we intend to employ and the fact that most imaging is currently outsourced, which limits our workflow to QA and ingest only
Who administers the existing workflow?
To be determined. Most likely the digital archivist in the case of the born-digital workflow tool. There will be a global view on all digital content, most likely overseen by the Digitisation Manager with support from technical members of the team.
What system rights do the workflow executors have? Can they install software? Can they use the web?
Born-digital ingest workflow: a single machine on a firewalled subnet allowing only incoming http connections on a specified port, for serving outputs from the workflow to the ingester. Outgoing connections for anti-virus updates are still being considered.
Who is the collection owner or curator? (section/department/team)
Born-digital: Archives and Rare Books (most likely, although not necessarily in the case of all possible future born-digital material)

Digitised: the digitisation working group (a cross-library group covering all collection areas - in individual cases responsibility will come down to the relevant collection)
Is there a workflow champion, who is it?
Project- or collection-specific but ultimately the Digitisation Manager
