Collection:
Title |
Internet Memory Web collections |
Description | The data consists of web content crawled, stored and hosted by the Internet Memory Foundation in (W)ARC format (approx. 300 TB). Using this content, Internet Memory (IM) can also draw on its QA team to provide annotated data, such as pairs of annotated snapshots for quality assurance scenarios. 1000 annotated pairs of web pages (similar/dissimilar) were produced as part of PC.WP3: Quality Assurance Components. |
Licensing | Web collections crawled on behalf of partner institutions require those institutions' agreement before they can be used by SCAPE partners |
Owner | Internet Memory |
Dataset Location | Provided upon request |
Collection expert | Leïla Medjkoune |
Issues brainstorm | A bulleted list of possible preservation or business driven Issues. This is useful for describing ideas that might be turned into detailed Issues at a later date |
List of Issues | A list of links to detailed Issue pages relevant to this Dataset |
Issue:
Solutions | |
Evaluation
Objectives | Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation |
Success criteria | Describe the success criteria for solving this issue - what are you able to do? - what does the world look like? |
Automatic measures | What automated measures would you like the solution to provide so that it can be evaluated for this specific issue? Which measures are important? If possible, specify very specific measures and your goal - e.g. * process 50 documents per second * handle 80 GB files without crashing * identify 99.5% of the content correctly |
Manual assessment | Apart from the automated measures you would like to get, do you foresee any necessary manual assessment to evaluate the solution of this issue? If possible, specify measures and your goal - e.g. * Solution installable with basic Linux system administration skills * User interface understandable by non-developer curators |
Actual evaluations | Links to actual evaluations of this Issue/Scenario |
Solutions:
Labels:
2 Comments
Oct 11, 2012
Miguel Ferreira
I agree with Bjarne's comment. This is not migration, it's "refreshing" the medium. What kind of "quality-test" is planned here?
Dec 13, 2012
Leïla Medjkoune
We would like to delete this scenario, as our plans to move content directly to HBASE/HDFS were modified.
We decided to store (W)ARCs directly in HDFS rather than unpacking them to store resources directly in HBASE/HDFS.