Title
IS12 ARC to WARC migration
Detailed description Migration from ARC to WARC is desirable because the WARC format is better suited to the future of web archiving.
Scalability Challenge
ARC and WARC are both container formats. At present, SB has around 200 TB of web content data that needs to be migrated.
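To give a feel for the scale, the following is a rough back-of-the-envelope sketch of how long a 200 TB migration might take. The throughput per node and the node counts are purely hypothetical placeholders, not measurements from SB or any other partner, and the function name `migration_days` is introduced here for illustration only.

```python
def migration_days(total_tb: float, mb_per_second_per_node: float, nodes: int) -> float:
    """Estimate wall-clock days needed to migrate total_tb terabytes.

    Assumes decimal units (1 TB = 1,000,000 MB) and perfectly linear
    scaling across nodes, which a real cluster will not fully achieve.
    """
    total_mb = total_tb * 1_000_000
    seconds = total_mb / (mb_per_second_per_node * nodes)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Hypothetical figures: 50 MB/s sustained per node.
    for nodes in (1, 10):
        days = migration_days(200, 50, nodes)
        print(f"{nodes} node(s): about {days:.1f} days")
```

Under these assumed figures a single node would need roughly 46 days and ten nodes roughly 4.6 days, before counting any time spent on quality assurance of the output.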
Issue champion Per Møldrup-Dalum (SB)
Other interested parties
During the IIPC Preservation Working Group meeting of 6-10-2011 this topic was discussed, and the group was updated on SCAPE activities by Barbara Sierman and Sven Schlarb. The BNF is preparing the ARC-WARC migration and is creating a mapping between the two formats. If this issue is taken up, it would be worth contacting the IIPC Preservation Working Group via Clement Oury (BNF) clement.oury@bnf.fr, who is the chair of the PWG. The BNF also created a JHOVE2 module for the ARC tool, and IIPC has been asked to fund the development of a JHOVE2 module for the WARC tool. This combination might be interesting for the scenario (update by Barbara Sierman).
Possible Solution approaches KEEPS:
  • For format conversion, the following tools are available:
    • warc-tools (this tool was not selected in D10.1 because a license was missing, but that will probably change in D10.2)
SB:
  • Since we will never be able to afford to keep the old ARC files, we need to be very sure that the resulting WARCs correspond 100% to the original ARCs.
    Thus, we need a QA tool that checks, record by record, that the content is the same.
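The record-by-record check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the records are assumed to have already been read from the ARC and WARC files by some parser and reduced to `(target URI, payload bytes)` pairs in the same order, and any records that exist only on the WARC side (for example, container-level metadata) are assumed to have been filtered out beforehand. The names `Record` and `compare_records` are hypothetical and do not come from any existing tool.

```python
import hashlib
from itertools import zip_longest
from typing import Iterable, List, Tuple

# Hypothetical record shape: (target URI, payload bytes).
Record = Tuple[str, bytes]


def digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a record payload."""
    return hashlib.sha256(payload).hexdigest()


def compare_records(arc_records: Iterable[Record],
                    warc_records: Iterable[Record]) -> List[str]:
    """Compare two record streams pairwise and describe every mismatch.

    An empty result means every ARC record has a WARC counterpart with
    the same URI and a byte-identical payload.
    """
    problems: List[str] = []
    pairs = zip_longest(arc_records, warc_records)
    for index, (arc, warc) in enumerate(pairs):
        if arc is None or warc is None:
            side = "ARC" if arc is None else "WARC"
            problems.append(f"record {index}: missing on {side} side")
            continue
        arc_uri, arc_payload = arc
        warc_uri, warc_payload = warc
        if arc_uri != warc_uri:
            problems.append(f"record {index}: URI differs ({arc_uri} vs {warc_uri})")
        elif digest(arc_payload) != digest(warc_payload):
            problems.append(f"record {index}: payload differs for {arc_uri}")
    return problems
```

Streaming the two inputs with `zip_longest` avoids holding a whole archive file in memory, which matters at this data volume. A real QA tool would probably also need to compare the header fields covered by the ARC-to-WARC mapping, not just the payload.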
Context TBD
Lessons Learned  
Training Needs
Datasets State and University Library Denmark - Web Archive Data  (not currently available to the project). ONB dataset
Solutions  

Evaluation

Objectives Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, precision, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to give to evaluate the solution for this specific issue? Which measures are important?
If possible specify very specific measures and your goal - e.g.
 * process 50 documents per second
 * handle 80 GB files without crashing
 * identify 99.5% of the content correctly
Manual assessment Apart from the automated measures that you would like to get, do you foresee any necessary manual assessment to evaluate the solution of this issue?
If possible specify measures and your goal - e.g.
 * Solution installable with basic Linux system administration skills
 * User interface understandable by non-developer curators
Actual evaluations Links to actual evaluations of this issue/scenario
Labels: webarchive, qa, issue, planning, watch, obsolescence