IS38 (W)ARC to HBASE migration
Detailed description Planned migration of (W)ARC content to a new infrastructure based on HBase
Scalability Challenge
Around 200 TB of Web data needs to be migrated, and continuity of services needs to be maintained.
Issue champion Leïla Medjkoune (IM)
Other interested parties
Comment from Bjarne (SB): Isn't this "just" about unpacking content from (W)ARC and putting it into HBase? I see no real need for structural and visual comparison. Aren't all objects going to be 100% the same as the original?
Possible Solution approaches UPMC Structural and visual comparison
Context IM is migrating its web content, currently stored in (W)ARC files, to a new infrastructure based on HBase.
The archive contains around 200 TB of data and is growing rapidly. Most of the crawled content will need to be migrated sometime this year.
Once the new infrastructure is ready, the services IM provides to cultural institutions will have to rely on it. The Foundation currently provides a high-quality archive and related services, such as redirection to the archive for content missing from the live web, and resolution of access issues through its access tool.

Given the investment in terms of manual quality assurance, crawl preparation and development, any loss of quality after the content is migrated to this new infrastructure is unacceptable.

We are therefore planning to build a “quality test” migration, using tools and methodologies developed by UPMC to detect and repair migration defects, as described in the WP11 work description.
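As an illustration only, a byte-level check could complement the structural and visual comparison by confirming that each migrated object is identical to its original. The sketch below is not the UPMC tooling: the function names are hypothetical, and plain Python dictionaries stand in for the (W)ARC records and the HBase rows, both assumed to be keyed by the same record identifier.

```python
import hashlib
from typing import Dict, List


def payload_digest(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a record payload."""
    return hashlib.sha256(payload).hexdigest()


def find_migration_defects(source: Dict[str, bytes],
                           migrated: Dict[str, bytes]) -> List[str]:
    """List the keys of source records that are missing from,
    or differ in, the migrated store (hypothetical helper)."""
    defects = []
    for key, payload in source.items():
        copy = migrated.get(key)
        # A record is defective if it was not migrated at all
        # or if its bytes no longer match the original.
        if copy is None or payload_digest(copy) != payload_digest(payload):
            defects.append(key)
    return sorted(defects)


# Stand-in data: "b" was altered during migration, "c" was lost.
source = {"a": b"<html>1</html>", "b": b"<html>2</html>", "c": b"x"}
migrated = {"a": b"<html>1</html>", "b": b"<html>2!</html>"}
print(find_migration_defects(source, migrated))  # ['b', 'c']
```

A check of this kind only shows that the stored bytes are unchanged. It says nothing about whether the content is still rendered and served correctly from the new infrastructure, which is where structural and visual comparison would apply.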
Lessons Learned
Training Needs
Datasets IM Web Archive


Objectives Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, preciseness, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to provide so that it can be evaluated for this specific issue? Which measures are important?
If possible specify very specific measures and your goal - e.g.
 * process 50 documents per second
 * handle 80 GB files without crashing
 * identify 99.5% of the content correctly
Manual assessment Apart from the automated measures you would like to get, do you foresee any manual assessment being necessary to evaluate the solution to this issue?
If possible specify measures and your goal - e.g.
 * Solution installable with basic linux system administration skills
 * User interface understandable by non developer curators
Actual evaluations Links to actual evaluations of this Issue/Scenario
  1. Dec 13, 2012

    We delete this issue as explained in the related scenario page.