Sven Schlarb <[email protected]>
As the owner of a number of legacy ARC files and a Web Archive currently harvesting WARCs, I need a digital preservation system that can migrate ARCs to WARCs in a timely fashion and ensure the completeness of the migration, so that I can manage Web Archive content more effectively by having only a single format to deal with. This also means I do not have to maintain two playback mechanisms for the long term.
- I need a tool that can migrate ARC to WARC files.
- I need a tool that can verify that the content of the ARC is the same as the content of the WARC.
- Experiment-ID: /onb/arc2warc/jwat
Large Scale ARC to WARC Migration using JWAT with QA using PhantomJS snapshots * (RS)
Data: ONB Web Archive
Issues: Snapshots are currently compared using checksums, which causes problems in some cases, e.g. animated GIFs.
HDFS file access using Wayback Machine: Wayback would run on each node and read HDFS-held content. Currently this isn't working, but it should!
- Experiment-ID: /im/arc2warc/pagealizer
Large Scale ARC to WARC Migration using Pagealyzer * (LM)
Data: IMF Web Archive
Issues: Selenium stability.
- Experiment-ID: /sb/arc2warc/jwat
Large Scale ARC to WARC Migration using JWAT with QA using PhantomJS snapshots * (NBR)
Data: SB Web Archive
Issues: The intention is to repeat the ONB experiment on SB content.
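The checksum issue noted for the snapshot comparison (e.g. animated GIFs, where two renderings of the same page can differ in a few pixels) can be illustrated with a small sketch. This is not the method used in the experiments above; it assumes the two snapshots are already available as raw pixel buffers of equal size, and the function names and the 1% tolerance are hypothetical choices made for illustration only.

```python
import hashlib


def checksum_equal(a: bytes, b: bytes) -> bool:
    """Strict comparison: any single differing byte makes this False."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def differing_fraction(a: bytes, b: bytes) -> float:
    """Fraction of byte positions that differ between two raw buffers.

    Buffers of different length are treated as completely different.
    """
    if len(a) != len(b):
        return 1.0
    if not a:
        return 0.0
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing / len(a)


def snapshots_match(a: bytes, b: bytes, tolerance: float = 0.01) -> bool:
    """Tolerant comparison: accept snapshots that differ in at most
    `tolerance` (hypothetical default: 1%) of their bytes."""
    return differing_fraction(a, b) <= tolerance
```

With a tolerant comparison, a snapshot pair that differs only in a small region (such as one frame of an animation) would not be reported as a migration failure, whereas a checksum comparison would flag it. A suitable tolerance would have to be determined on real data.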
QA of the migration could compare snapshots of each of the sites, or it could compare all the files contained in each archive. There may be other aspects of ARCs and WARCs (header information, logs, etc.) that need checking too. For example, has the log file format changed between the two? Is the WARC structurally sound?
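The file-by-file approach can be sketched as a digest comparison. The sketch below is a minimal illustration, not the QA used in the experiments: it assumes that an ARC/WARC reader (for example JWAT) has already extracted each record as a `(target URI, payload bytes)` pair, and the names `Record`, `digest_index` and `compare_migration` are hypothetical. WARC-only records such as `warcinfo` are assumed to be filtered out beforehand.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# Assumption: a record is reduced to (target URI, payload bytes).
Record = Tuple[str, bytes]


def digest_index(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Map each URI to the SHA-1 digests of all payloads stored for it."""
    index: Dict[str, List[str]] = {}
    for uri, payload in records:
        index.setdefault(uri, []).append(hashlib.sha1(payload).hexdigest())
    return index


def compare_migration(
    arc_records: Iterable[Record], warc_records: Iterable[Record]
) -> Dict[str, List[str]]:
    """Report URIs that are missing from, extra in, or changed in the WARC."""
    arc = digest_index(arc_records)
    warc = digest_index(warc_records)
    missing = sorted(u for u in arc if u not in warc)
    extra = sorted(u for u in warc if u not in arc)
    # A URI may be harvested several times, so compare digest multisets.
    changed = sorted(
        u for u in arc if u in warc and sorted(arc[u]) != sorted(warc[u])
    )
    return {"missing": missing, "extra": extra, "changed": changed}
```

A migration would pass this check when all three lists are empty. Header information and structural validity of the WARC are not covered by this comparison and would need separate checks.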