
| *{-}Title{-}* \\ | _{-}IS38{-}_ -(W)ARC to HBase migration- |
| *{-}Detailed description{-}* | _{-}Planned migration of (W)ARC content to a new infrastructure based on HBase{-}_ \\ |
| *{-}Scalability Challenge{-}* \\ | _{-}Around 200 TB of Web data needs to be migrated, and continuity of services needs to be maintained.{-}_ \\ |
| *-[Issue champion|SP:Responsibilities of the roles described on these pages]-* | _-[Leïla Medjkoune|https://portal.ait.ac.at/sites/Scape/Management/_layouts/userdisp.aspx?ID=69&Source=https%3A%2F%2Fportal.ait.ac.at%2Fsites%2FScape%2FManagement%2F_layouts%2Fpeople.aspx%3FMembershipGroupId%3D5]-_ _-(IM)-_ |
| *{-}Other interested parties{-}* \\ | -Comment from Bjarne (SB): Isn't this "just" about unpacking content from (W)ARC and putting it into HBase? I see no real need for structural and visual comparison. All objects are going to be 100% the same as the original?- |
| *{-}Possible Solution approaches{-}* | _{-}UPMC Structural and visual comparison{-}_ \\ |
| *{-}Context{-}* | -IM is migrating its web content, currently stored in (W)ARC files, to a new infrastructure based on HBase.- \\
-The archive contains around 200 TB of data and is growing rapidly. Most of the crawled content will need to be migrated sometime this year.- \\
-Once the new infrastructure is ready, the services IM provides to cultural institutions will have to rely on it. The Foundation currently provides a high-quality archive and related services, such as redirecting requests for content missing from the live web to the archive and resolving access issues through its access tool.- \\ \\
-Given the investment in manual quality assurance, crawl preparation and development, any loss of quality after the content is migrated to the new infrastructure is unacceptable.- \\ \\
-We therefore plan to build a “quality test” migration, using tools and methodologies developed by UPMC to detect and repair migration defects, as described in the WP11 work description.- |
| *{-}Lessons Learned{-}* | \\ |
| *{-}Training Needs{-}* | \\ |
| *{-}Datasets{-}* | -[IM Web Archive |SP:Internet Memory Web Archive]-\\ |
| *Solutions* | \\ |

h1. Evaluation

| *Objectives* | _Which SCAPE objectives do this issue and a future solution relate to? e.g. scalability, robustness, reliability, coverage, precision, automation_ |
| *Success criteria* | _Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?_ |
| *Automatic measures* | _What automated measures would you like the solution to provide so that it can be evaluated for this specific issue? Which measures are important?_ \\
_If possible specify very specific measures and your goal - e.g._ \\
_ \* process 50 documents per second_ \\
_ \* handle 80 GB files without crashing_ \\
_ \* identify 99.5% of the content correctly_ \\ |
| *Manual assessment* | _Apart from the automated measures you would like to get, do you foresee any manual assessment needed to evaluate the solution to this issue?_ \\
_If possible specify measures and your goal - e.g._ \\
_ \* Solution installable with basic Linux system administration skills_ \\
_ \* User interface understandable by non-developer curators_ \\ |
| *Actual evaluations* | Links to actual evaluations of this Issue/Scenario |