|Title||State and University Library Denmark - Web Archive Data|
|Description||220 TB of web archive content in ARC format|
|Licensing||This is a closed archive, only accessible by Danish researchers.|
|Dataset Location||Currently not available|
|Collection expert||Bjarne Andersen (SB)|
|List of Issues||IS12 ARC to WARC migration, IS25 Web Content Characterisation|
|Title||IS12 ARC to WARC migration|
|Detailed description||Migration from ARC to WARC is desirable, as the WARC format is better suited to the future of web archiving. A minimal conversion sketch is given below the table.|
|Scalability Challenge||ARC and WARC are both container formats. At present, SB has around 200 TB of web content data that needs to be migrated.|
|Issue champion||Per Møldrup-Dalum (SB)|
|Other interested parties||During the IIPC Preservation Working Group meeting of 6-10-2011 this topic was discussed, and the group was updated on SCAPE activities by Barbara Sierman and Sven Schlarb. The BNF is preparing the ARC-to-WARC migration and is creating a mapping between the two formats. If this issue is taken up, it will be worthwhile to contact the IIPC Preservation Working Group via Clement Oury (BNF), who chairs the group. The BNF has also created a JHOVE2 module for the ARC format, and the IIPC has been asked to fund the development of a JHOVE2 module for the WARC format. This combination might be interesting for the scenario. (Update by Barbara Sierman)|
|Possible Solution approaches||KEEPS:|
|Datasets||State and University Library Denmark - Web Archive Data (not currently available to the project); ONB dataset|
|Objectives||Which SCAPE objectives do this issue and a future solution relate to? E.g. scalability, robustness, reliability, coverage, precision, automation.|
|Success criteria||Describe the success criteria for solving this issue: what are you able to do, and what does the world look like?|
|Automatic measures||What automated measures would you like the solution to provide in order to evaluate it for this specific issue, and which measures are important? If possible, specify concrete measures and targets, e.g. process 50 documents per second; handle 80 GB files without crashing; identify 99.5% of the content correctly.|
|Manual assessment||Apart from the automated measures, do you foresee any necessary manual assessment to evaluate the solution of this issue? If possible, specify measures and targets, e.g. solution installable with basic Linux system administration skills; user interface understandable by non-developer curators.|
|Actual evaluations||Links to actual evaluations of this Issue/Scenario|
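As referenced in the Detailed description row above, the following is a minimal sketch of how a record-level ARC-to-WARC conversion could be performed with the open-source warcio Python library. The file names are placeholders, and this is not the project's chosen implementation; the actual SCAPE migration workflow may rely on other tools (e.g. JWAT or Hadoop-based processing).

{code:python}
# Minimal ARC-to-WARC conversion sketch using the open-source warcio library
# (https://github.com/webrecorder/warcio). Illustrative only: file names are
# placeholders and the migration tooling actually chosen by the project may differ.
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter


def arc_to_warc(arc_path, warc_path):
    """Rewrite the records of one (gzipped) ARC file as a gzipped WARC file."""
    with open(arc_path, 'rb') as arc_in, open(warc_path, 'wb') as warc_out:
        writer = WARCWriter(warc_out, gzip=True)
        # arc2warc=True exposes each ARC record as an equivalent WARC record,
        # so it can be written out directly by the WARC writer.
        for record in ArchiveIterator(arc_in, arc2warc=True):
            writer.write_record(record)


if __name__ == '__main__':
    # Placeholder file names for illustration.
    arc_to_warc('example.arc.gz', 'example.warc.gz')
{code}

For bulk use, the same conversion is also exposed through warcio's recompress command-line tool, which rewrites ARC input as standards-compliant gzipped WARC.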