Collection:
Title |
State and University Library Denmark - Web Archive Data |
Description | 220 TB of web archive content in ARC format |
Licensing | This is a closed archive, accessible only to Danish researchers |
Owner | SB |
Dataset Location | Currently not available |
Collection expert | Bjarne Andersen (SB) |
Issues brainstorm |
|
List of Issues | IS12 ARC to WARC migration, IS25 Web Content Characterisation |
Issue:
Title |
IS12 ARC to WARC migration |
Detailed description | Migration from ARC to WARC is desirable because the WARC format is better suited to the future of web archiving. |
Scalability Challenge |
ARC and WARC are both container formats. At present, SB has around 200 TB of web content data that needs to be migrated. |
Issue champion | Per Møldrup-Dalum |
Other interested parties |
During the IIPC Preservation Working Group meeting of 6-10-2011 this topic was discussed, and the group was updated on SCAPE activities by Barbara Sierman and Sven Schlarb. The BNF is preparing the ARC-WARC migration and is creating a mapping between the two formats. If this issue is taken up, it would be worthwhile to contact the IIPC Preservation Working Group via Clement Oury (BNF) [email protected] |
Possible Solution approaches | KEEPS:
|
Context | TBD |
Lessons Learned | |
Training Needs | |
Datasets | State and University Library Denmark - Web Archive Data (not currently available to the project). ONB dataset |
Solutions |
Evaluation
Objectives | Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation |
Success criteria | Describe the success criteria for solving this issue - what are you able to do? - what does the world look like? |
Automatic measures | What automated measures would you like the solution to give to evaluate the solution for this specific issue? Which measures are important? If possible, specify very specific measures and your goal - e.g. * process 50 documents per second * handle 80 GB files without crashing * identify 99.5% of the content correctly |
Manual assessment | Apart from the automated measures that you would like to get, do you foresee any necessary manual assessment to evaluate the solution of this issue? If possible, specify measures and your goal - e.g. * Solution installable with basic Linux system administration skills * User interface understandable by non-developer curators |
Actual evaluations | Links to actual evaluations of this Issue/Scenario |
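To make the scalability challenge and the kind of automatic measure asked for above more concrete, the following is a rough back-of-envelope sketch of how long a migration of around 200 TB might take. The per-worker throughput (50 MB/s) and the worker counts are purely illustrative assumptions, not measurements from SB or SCAPE, and the function name is hypothetical.

```python
def migration_days(total_tb: float, mb_per_sec_per_worker: float, workers: int) -> float:
    """Estimate the days needed to read and rewrite total_tb of data.

    Assumes ideal linear scaling across workers and decimal units
    (1 TB = 1,000,000 MB). Real runs will be slower due to I/O
    contention, verification passes, and job scheduling overhead.
    """
    if total_tb < 0 or mb_per_sec_per_worker <= 0 or workers <= 0:
        raise ValueError("size must be non-negative; throughput and workers must be positive")
    total_mb = total_tb * 1_000_000
    seconds = total_mb / (mb_per_sec_per_worker * workers)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Hypothetical scenario: 200 TB at an assumed 50 MB/s per worker.
    for workers in (1, 10, 50):
        days = migration_days(200, 50, workers)
        print(f"{workers:>3} workers: {days:.1f} days")
```

Under these assumptions a single worker would need roughly a month and a half, which suggests why a parallel approach is of interest for this issue. Measured throughput from an actual evaluation should replace the assumed figure before drawing any conclusions.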
Solutions:
1 Comment
Oct 23, 2012
Miguel Ferreira
The sentence "A general characterisation of web content is also needed to even begin talking about preservation of this kind of material." could be removed. The scenario WCT 3 is related to the issue of characterization.