

State and University Library Denmark - Web Archive Data
Description 220 TB of web archive content in ARC format
Licensing This is a closed archive, accessible only to Danish researchers
Owner SB
Dataset Location Currently not available
Collection expert Bjarne Andersen (SB)
Issues brainstorm
  • Within the next 2 years we will need to migrate the content from ARC to WARC. A crucial step in this migration is automatic QA to ensure that each migrated container has exactly the same content as the original. This is very important since we don't have the budget to keep the original ARC files. Several institutions have already done this (e.g. BL), so tools most likely exist.
  • A general characterisation of web content is also needed to even begin talking about preservation of this kind of material.
List of Issues IS12 ARC to WARC migration, IS25 Web Content Characterisation


IS12 ARC to WARC migration
Detailed description Migration from ARC to WARC is desirable as the WARC format is better suited for the future of web archiving.
Scalability Challenge
ARC and WARC are both container formats. At present, SB has around 200 TB of web content data that needs to be migrated.
Issue champion Per Møldrup-Dalum (SB)
Other interested parties
During the IIPC Preservation Working Group (PWG) meeting of 6-10-2011 this topic was discussed, and the group was updated on SCAPE activities by Barbara Sierman and Sven Schlarb. The BNF is preparing its ARC to WARC migration and is creating a mapping between the two formats. If this issue is taken up, it would be worth contacting the IIPC Preservation Working Group via Clement Oury (BNF) [email protected], who chairs the PWG. The BNF has also created a JHOVE2 module for the ARC tool, and IIPC has been asked to fund the development of a JHOVE2 module for the WARC tool. This combination might be interesting for the scenario. (Update by Barbara Sierman)
Possible Solution approaches KEEPS:
  • For format conversion the following tools are available:
    • warc-tools (this tool was not selected in D10.1 because a license was missing, but that will probably change in D10.2)
  • Since we will never be able to afford to keep the old ARC files, we need to be very sure that the resulting WARCs correspond 100% to the original ARCs.
    Thus: we need a QA tool that checks, record by record, that the content is the same.
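
A minimal sketch of such a record-by-record check, in Python. It is not an existing tool and makes several assumptions that the scenario does not state: records have already been read from the ARC and WARC containers by some reader library and reduced to (target URI, payload bytes) pairs; the migration preserves record order; and records that exist only in WARC (e.g. warcinfo or metadata records) have been filtered out beforehand. The names Record, Mismatch, digest and compare_records are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator, NamedTuple, Tuple

# Hypothetical record shape: (target URI, payload bytes).
# Extracting these pairs from real ARC/WARC files is out of scope here.
Record = Tuple[str, bytes]


class Mismatch(NamedTuple):
    index: int   # position of the record in the container
    reason: str  # human-readable description of the difference


def digest(payload: bytes) -> str:
    """Return a SHA-1 hex digest of the payload."""
    return hashlib.sha1(payload).hexdigest()


def compare_records(arc_records: Iterable[Record],
                    warc_records: Iterable[Record]) -> Iterator[Mismatch]:
    """Walk both record streams in parallel and yield every difference.

    Assumes both streams list the same records in the same order.
    """
    arc_iter = iter(arc_records)
    warc_iter = iter(warc_records)
    index = 0
    while True:
        arc = next(arc_iter, None)
        warc = next(warc_iter, None)
        if arc is None and warc is None:
            return  # both streams exhausted
        if arc is None or warc is None:
            # One container holds more records than the other.
            side = "WARC" if arc is None else "ARC"
            yield Mismatch(index, f"extra record in {side}")
        elif arc[0] != warc[0]:
            yield Mismatch(index, f"URI differs: {arc[0]!r} vs {warc[0]!r}")
        elif digest(arc[1]) != digest(warc[1]):
            yield Mismatch(index, "payload digest differs")
        index += 1
```

An ARC file would be considered safe to delete only if compare_records yields nothing for it. Streaming through iterators rather than loading whole containers keeps memory use flat, which matters at the scale described above.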
Context TBD
Lessons Learned  
Training Needs
Datasets State and University Library Denmark - Web Archive Data  (not currently available to the project). ONB dataset


Objectives Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation
Success criteria Describe the success criteria for solving this issue: what are you able to do? What does the world look like?
Automatic measures What automated measures would you like the solution to provide in order to evaluate it for this specific issue? Which measures are important?
If possible specify very specific measures and your goal - e.g.
 * process 50 documents per second
 * handle 80 GB files without crashing
 * identify 99.5% of the content correctly
Manual assessment Apart from the automated measures you would like to get, do you foresee any manual assessment being necessary to evaluate the solution of this issue?
If possible specify measures and your goal - e.g.
 * Solution installable with basic linux system administration skills
 * User interface understandable by non-developer curators
Actual evaluations Links to actual evaluations of this Issue/Scenario


  1. Oct 23, 2012

    The sentence "A general characterisation of web content is also needed to even begin talking about preservation of this kind of material." could be removed. The scenario WCT 3 is related to the issue of characterization.