Title
Diversity of office document formats in digital objects archive
Detailed description Web content references document instances in many different file formats. Many of these formats might not be renderable in a web archive viewer in the future. This applies especially to older versions of text document formats.
Scalability Challenge

Issue champion To be defined
Other interested parties
SB: <comment_missing>
KB: For the KB normalizing web content is not in our current preservation strategy.
ONB: Would be interesting, but low priority
Possible Solution approaches
  • ALL
    • Normalize formats by migrating the document instances into an agreed standard format. For example, an institution could decide to migrate all document formats (plain text, DjVu, PS, ODF, DOC, DOCX, RTF, etc.) to PDF. In this context, quality assurance could play a major role. On-the-fly (on-demand) migration of document formats is also an option, as it avoids changing the original web archive content.
  • EXL
    • This strategy might be problematic for web archives, as on-the-fly migration would require that the linking pages be updated as well, which in turn would require significant QA effort.
    • ANJ: Linking documents do not need to be updated. On-the-fly (OTF) migration should function at the protocol level, redirecting to the new resource and optionally supporting content negotiation.
  • KEEPS
    • Watch can contribute to the solution with the following triggers:
      • Monitor new format versions for text documents
      • Monitor text document format use trends
    • For format migration the following tools are available:
      • ImageMagick
      • OpenOffice
      • GIMP
      • Inkscape
      • PDFBox
      • b2xtranslator
      • GraphicsMagick
      • AbiWord
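The on-the-fly approach discussed above (redirect at the protocol level, optional content negotiation, original content left untouched) could be sketched roughly as follows. This is a minimal illustration, not an agreed design: the `MIGRATABLE` set, the function names, and the choice of PDF as the only target are assumptions made for the example. The conversion step assumes a LibreOffice-compatible `soffice` binary on the PATH; the OpenOffice command line may differ.

```python
import shutil
import subprocess
from pathlib import Path

PDF = "application/pdf"

# Assumption: source MIME types this sketch treats as migratable to PDF.
MIGRATABLE = {
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.oasis.opendocument.text",
    "application/rtf",
    "text/plain",
}


def parse_accept(header):
    """Parse an HTTP Accept header into (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0  # default quality when no q parameter is given
        for f in fields[1:]:
            if f.startswith("q="):
                try:
                    q = float(f[2:])
                except ValueError:
                    q = 0.0
        ranges.append((fields[0].lower(), q))
    return ranges


def quality(media_type, ranges):
    """Return the q-value for media_type; the most specific range wins."""
    main = media_type.split("/")[0]
    for candidate in (media_type, main + "/*", "*/*"):
        for media_range, q in ranges:
            if media_range == candidate:
                return q
    return 0.0


def choose_response(stored_type, accept_header):
    """Serve the archived original unless the client prefers PDF."""
    if stored_type not in MIGRATABLE:
        return ("original", stored_type)
    ranges = parse_accept(accept_header)
    if quality(PDF, ranges) > quality(stored_type, ranges):
        return ("redirect", PDF)
    return ("original", stored_type)


def conversion_command(source, out_dir):
    """Build the (assumed) headless conversion command line."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def migrate(source, out_dir):
    """Convert one document to PDF; the archived original is not modified."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    subprocess.run(conversion_command(source, out_dir), check=True)
    return Path(out_dir) / (Path(source).stem + ".pdf")
```

With this logic, a viewer that sends `Accept: application/pdf, */*;q=0.1` for an archived DOC file would be redirected to the migrated rendition, while a client that sends `*/*` would still receive the original. The linking pages stay unchanged in both cases.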
Context
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets
Solutions  

Evaluation

Objectives Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to provide in order to evaluate it for this specific issue? Which measures are important?
If possible specify very specific measures and your goal - e.g.
 * process 50 documents per second
 * handle 80 GB files without crashing
 * identify 99.5% of the content correctly
Manual assessment Apart from the automated measures that you would like to get, do you foresee any necessary manual assessment to evaluate the solution of this issue?
If possible specify measures and your goal - e.g.
 * Solution installable with basic linux system administration skills
 * User interface understandable by non developer curators
Actual evaluations Links to actual evaluations of this Issue/Scenario
Labels: webarchive, characterisation, identification, qa, watch, planning, issue, obsolescence