
| *Title* \\ | _IS12 ARC to WARC migration_ \\ |
| *Detailed description* | _Migration from ARC to WARC is desirable because the WARC format is better suited for the future of web archiving._ |
| *Scalability Challenge* \\ | _ARC and WARC are both container formats. At present, SB has around 200 TB of web content data that needs to be migrated._ |
| *[Issue champion|SP:Responsibilities of the roles described on these pages]* | _[Per Møldrup-Dalum|] (SB)_ |
| *Other interested parties* \\ | _During the IIPC Preservation Working Group (PWG) meeting of 6-10-2011 this topic was discussed, and Barbara Sierman and Sven Schlarb updated the group on SCAPE activities. The BNF is preparing the ARC-to-WARC migration and is creating a mapping between the two formats. If this issue is taken up, it will be worth contacting the IIPC Preservation Working Group via Clement Oury (BNF),_ [[email protected]|mailto:[email protected]] _who chairs the PWG. The BNF also created a JHOVE2 module for the ARC tool, and IIPC is asked to fund the development of a JHOVE2 module for the WARC tool. This combination might be interesting for the scenario (update by Barbara Sierman)._ |
| *Possible Solution approaches* | KEEPS: \\
* For format conversion the following tools are available:
** warc-tools (this tool was not selected in D10.1 because a license was missing, but that will probably change in D10.2) \\
SB: \\
* Since we will never be able to afford to keep the old ARC files, we need to be very sure that the resulting WARCs correspond 100% to the original ARCs \\
Thus: we need a QA tool that checks, record by record, that the content is the same |
| *Context* | _TBD_ |
| *Lessons Learned* | |
| *Training Needs* | \\ |
| *Datasets* | _[State and University Library Denmark - Web Archive Data|State and University Library Denmark - Web Archive Data]_ (not currently available to the project). ONB dataset \\ |
| *Solutions* | |
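
The record-by-record QA check that SB asks for above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: the {{Record}} type and the {{compare_records}} function are hypothetical names, and the sketch assumes that some ARC and WARC reader (not shown) has already produced the content records of both files in the same order, with container-specific records such as the ARC file description and the WARC warcinfo record filtered out.

```python
import hashlib
from itertools import zip_longest
from typing import Iterable, Iterator, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical minimal view of one archived record."""
    uri: str
    payload: bytes


def digest(payload: bytes) -> str:
    """Return a hex SHA-1 digest of the payload bytes."""
    return hashlib.sha1(payload).hexdigest()


def compare_records(
    arc_records: Iterable[Record], warc_records: Iterable[Record]
) -> Iterator[Tuple[int, str]]:
    """Yield (index, reason) for every position where the two streams differ."""
    pairs = zip_longest(arc_records, warc_records)
    for index, (arc_rec, warc_rec) in enumerate(pairs):
        if arc_rec is None:
            yield index, "extra record in WARC"
        elif warc_rec is None:
            yield index, "record missing from WARC"
        elif arc_rec.uri != warc_rec.uri:
            yield index, f"URI mismatch: {arc_rec.uri!r} != {warc_rec.uri!r}"
        elif digest(arc_rec.payload) != digest(warc_rec.payload):
            yield index, f"payload mismatch for {arc_rec.uri!r}"


# Made-up example data: the second record was altered during "migration".
arc = [
    Record("http://example.org/", b"<html>a</html>"),
    Record("http://example.org/b", b"b"),
]
warc = [
    Record("http://example.org/", b"<html>a</html>"),
    Record("http://example.org/b", b"B"),
]
problems = list(compare_records(arc, warc))
```

An empty result means every record pair matched. Comparing digests rather than raw bytes keeps the sketch close to how a distributed job might exchange per-record fingerprints, but a byte-wise comparison would serve equally well for a single-machine check.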

h1. Evaluation

| *Objectives* | _Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, precision, automation_ |
| *Success criteria* | _Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?_ |
| *Automatic measures* | _What automated measures would you like the solution to give to evaluate the solution for this specific issue? which measures are important?_ \\
_If possible specify very specific measures and your goal - e.g._ \\
_ \* process 50 documents per second_ \\
_ \* handle 80 GB files without crashing_ \\
_ \* identify 99.5% of the content correctly_ \\ |
| *Manual assessment* | _Apart from the automated measures you would like to get, do you foresee any necessary manual assessment to evaluate the solution of this issue?_ \\
_If possible specify measures and your goal - e.g._ \\
_ \* Solution installable with basic Linux system administration skills_ \\
_ \* User interface understandable by non-developer curators_ \\ |
| *Actual evaluations* | Links to actual evaluations of this issue/scenario |