
h2. ID

arc2warc


h2. Status
{tip:title=Active}

h2. Contact

Sven Schlarb <[email protected]>


h2. User Story

As the owner of a number of legacy ARC files and a Web Archive that currently harvests WARCs, I need a digital preservation system that can migrate ARCs to WARCs in a timely fashion and ensure the completeness of the migration, so that I can manage Web Archive content more effectively by having only a single format to deal with. This also means I do not have to maintain two playback mechanisms for the long term.



h2. User Requirements/Components

# I need a tool that can migrate ARC to WARC files.
# I need a tool that can verify that the content of the ARC is the same as the content of the WARC.
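
One way to approach the second requirement is to compare payload digests per URL. The following is a minimal sketch only, not the method used in the experiments: it assumes some reader (for example one built on JWAT) has already extracted {{(URI, payload bytes)}} pairs for the content records of each container. The names {{digest_index}} and {{compare_payloads}} and the record shape are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (target URI, payload bytes).
Record = Tuple[str, bytes]


def digest_index(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Map each URI to the sorted SHA-1 digests of all payloads seen for it."""
    index: Dict[str, List[str]] = {}
    for uri, payload in records:
        index.setdefault(uri, []).append(hashlib.sha1(payload).hexdigest())
    for digests in index.values():
        digests.sort()  # order of capture should not affect the comparison
    return index


def compare_payloads(arc_records: Iterable[Record],
                     warc_records: Iterable[Record]) -> Dict[str, List[str]]:
    """Report URIs that are missing, extra, or changed after migration."""
    arc = digest_index(arc_records)
    warc = digest_index(warc_records)
    shared = set(arc) & set(warc)
    return {
        "missing_in_warc": sorted(set(arc) - set(warc)),
        "extra_in_warc": sorted(set(warc) - set(arc)),
        "changed": sorted(uri for uri in shared if arc[uri] != warc[uri]),
    }
```

A WARC normally carries records with no ARC counterpart (for example {{warcinfo}} records), so the sketch assumes only the migrated content records are passed in. An empty report indicates that every payload survived byte for byte; it says nothing about header metadata.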


h2. Experiments

_Create experiments as child pages and they should appear automatically here_
{pageTree:[email protected]}

* Experiment-ID: /onb/arc2warc/jwat
*Large Scale ARC to WARC Migration using JWAT with QA using PhantomJS snapshots* (RS)
Data: ONB Web Archive
Workflow: Yes.
Issues: Snapshots are currently compared using checksums, which causes some problems, e.g. with animated GIFs.
HDFS file access using Wayback Machine: Wayback runs on each node and uses HDFS-held content. Currently this isn't working, but it should\!

* Experiment-ID: /im/arc2warc/pagealizer
*Large Scale ARC to WARC Migration using Pagealyzer* (LM)
Data: IMF Web Archive
Workflow: Yes.
Issues: Selenium stability.

* Experiment-ID: /sb/arc2warc/jwat
*Large Scale ARC to WARC Migration using JWAT with QA using PhantomJS snapshots* (NBR)
Data: SB Web Archive
Workflow: Yes.
Issues: The intention is to repeat the ONB experiment on SB content.





h2. Developer Notes

QA of the migration could compare rendered snapshots of each site, or it could compare every file in the ARC against its counterpart in the WARC. Other aspects of ARCs and WARCs (header information, logs, etc.) may need checking too. For example, has the log file format changed between the two? Is the WARC structurally sound?
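
The structural soundness question can be illustrated with a small record-level check. This is a rough sketch under stated assumptions, not a replacement for a real validator: it assumes the WARC has already been decompressed into a single byte string, and it checks only the version line, the four mandatory named fields, and the two CRLFs that must follow each content block. The function name {{check_warc}} is hypothetical.

```python
from typing import List

# Named fields that every WARC record must carry.
REQUIRED = ("WARC-Type", "WARC-Record-ID", "WARC-Date", "Content-Length")


def check_warc(data: bytes) -> List[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems: List[str] = []
    pos = 0
    n = 0
    while pos < len(data):
        n += 1
        # The header block ends at the first empty line.
        head_end = data.find(b"\r\n\r\n", pos)
        if head_end == -1:
            problems.append(f"record {n}: header block not terminated")
            break
        lines = data[pos:head_end].decode("utf-8", "replace").split("\r\n")
        if not lines[0].startswith("WARC/"):
            problems.append(f"record {n}: bad version line {lines[0]!r}")
            break
        headers = {}
        for line in lines[1:]:
            name, sep, value = line.partition(":")
            if sep:
                headers[name.strip().lower()] = value.strip()
        missing = [h for h in REQUIRED if h.lower() not in headers]
        if missing:
            problems.append(f"record {n}: missing {', '.join(missing)}")
        length = headers.get("content-length", "")
        if not length.isdigit():
            if "content-length" in headers:
                problems.append(f"record {n}: invalid Content-Length")
            break  # cannot locate the next record without a length
        body_end = head_end + 4 + int(length)
        # Each content block must be followed by exactly two CRLFs.
        if data[body_end:body_end + 4] != b"\r\n\r\n":
            problems.append(f"record {n}: block not followed by two CRLFs")
            break
        pos = body_end + 4
    return problems
```

The check stops at the first record whose boundaries cannot be trusted, because every later offset depends on it. Payload-level questions (HTTP headers, digests, log formats) are out of scope for this sketch.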

Using JWAT

h2. Related Documents

_Scenarios, case studies, etc. that provide background to this story._