Diversity of office document formats in digital objects archive
Detailed description: Web content references documents in many different file formats. Many of these formats may no longer be renderable in a web archive viewer in the future. This applies especially to older versions of text document formats.
KB: Normalizing web content is not part of our current preservation strategy.
ONB: Would be interesting, but low priority.
    • Normalize formats by migrating the document instances into an agreed standard format. For example, an institution could decide to migrate all document formats (plain text, DjVu, PS, ODF, DOC, DOCX, RTF, etc.) to PDF. Quality assurance of the migration results could play a major role in this context. On-the-fly (on-demand) migration of document formats is also an option, since it avoids changing the original web archive content.
    • This strategy might be problematic for web archives, as on-the-fly migration would require that the linking pages be updated as well, which in turn would require significant QA effort.
    • ANJ: Linking documents do not need to be updated. On-the-fly (OTF) migration should function at the protocol level, redirecting requests to the new resource and optionally supporting content negotiation.
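
The protocol-level approach in the ANJ comment can be illustrated with a minimal Python sketch. It is not an existing web archive component. The set of migratable media types, the `/migrated/pdf` path prefix, and the function names are all hypothetical. The sketch assumes the archive serves the original document only when the client explicitly asks for its media type in the `Accept` header, and otherwise redirects to a PDF rendition, so linking pages stay untouched.

```python
from typing import Dict, Optional, Tuple

# Hypothetical set of media types the institution has decided to normalize.
MIGRATABLE = {
    "application/msword",
    "application/rtf",
    "application/postscript",
    "application/vnd.oasis.opendocument.text",
    "image/vnd.djvu",
    "text/plain",
}
TARGET = "application/pdf"
MIGRATED_PREFIX = "/migrated/pdf"  # hypothetical location of PDF renditions


def parse_accept(header: str) -> Dict[str, float]:
    """Map each media range in an Accept header to its q-value (default 1.0)."""
    prefs: Dict[str, float] = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs[fields[0].lower()] = q
    return prefs


def quality(prefs: Dict[str, float], media_type: str) -> float:
    """Most specific match wins: exact type, then type/*, then */*."""
    main = media_type.split("/")[0]
    for key in (media_type, main + "/*", "*/*"):
        if key in prefs:
            return prefs[key]
    return 0.0


def negotiate(original_type: str, accept: str,
              archived_path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, redirect location or None) for an archived document."""
    if original_type not in MIGRATABLE:
        return 200, None  # nothing to migrate, serve as archived
    prefs = parse_accept(accept or "*/*")
    pdf_q = quality(prefs, TARGET)
    # Serve the untouched original only when it is requested explicitly
    # and is preferred at least as much as the PDF rendition.
    explicit_q = prefs.get(original_type, 0.0)
    if explicit_q > 0 and explicit_q >= pdf_q:
        return 200, None
    if pdf_q > 0:
        # A real response should also send "Vary: Accept".
        return 303, MIGRATED_PREFIX + archived_path
    return 406, None
```

For example, a browser sending `Accept: text/html,*/*;q=0.8` for an archived `.doc` file would be redirected to the PDF rendition, while a preservation tool sending `Accept: application/msword` would receive the original bytes. The archived content itself is never modified. Only the response to the request changes.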
      • Monitor new format versions for text documents
      • Monitor text document format use trends
      • ImageMagick
      • OpenOffice
      • GIMP
      • Inkscape
      • PDFBox
      • b2xtranslator
      • GraphicsMagick
      • AbiWord
