
ID

arc2warc

Status

Active

Contact

Sven Schlarb <sven.schlarb@onb.ac.at>

User Story

As the owner of a number of legacy ARC files and a Web Archive currently harvesting WARCs, I need a digital preservation system that can migrate ARCs to WARCs in a timely fashion and verify that the migration is complete, so that I can manage Web Archive content more effectively by having only a single format to deal with. This also means I do not have to maintain two playback mechanisms for the long term.

User Requirements/Components

  1. I need a tool that can migrate ARC to WARC files.
  2. I need a tool that can verify that the content of the ARC is the same as the content of the WARC.

Experiments

Create experiments as child pages and they should appear automatically here

  • Experiment-ID: /onb/arc2warc/jwat
    Large Scale ARC to WARC Migration using JWAT with QA using PhantomJS snapshots * (RS)
    Data: ONB Web Archive
    Workflow: Yes.
    Issues: Snapshots are currently compared using checksums, which causes some problems, e.g. with animated GIFs.
    HDFS file access using Wayback Machine: Wayback runs on each node and uses HDFS-held content. Currently this isn't working, but it should!
  • Experiment-ID: /im/arc2warc/pagealizer
    Large Scale ARC to WARC Migration using Pagealyzer * (LM)
    Data: IMF Web Archive
    Workflow: Yes.
    Issues: Selenium stability.
  • Experiment-ID: /sb/arc2warc/jwat
    Large Scale ARC to WARC Migration using JWAT with QA using PhantomJS snapshots * (NBR)
    Data: SB Web Archive
    Workflow: Yes.
    Issues: The intention is to redo the ONB experiment on SB content.
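
The checksum issue noted for the ONB experiment can be illustrated with a small sketch. It is not the comparison used in the experiments; it only shows why a checksum is brittle when a single frame of an animated GIF differs between two snapshots, and what a tolerance-based alternative might look like. It assumes both snapshots have already been decoded into raw 8-bit grayscale buffers of identical dimensions (the decoding step is not shown), and the function names and thresholds are hypothetical.

```python
import hashlib


def checksum_equal(a: bytes, b: bytes) -> bool:
    """Strict comparison: any single differing byte makes this False."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def differing_fraction(a: bytes, b: bytes, per_pixel_tolerance: int = 0) -> float:
    """Fraction of pixels whose values differ by more than the tolerance.

    Both buffers are assumed to hold one byte per pixel (grayscale).
    """
    if len(a) != len(b):
        raise ValueError("snapshots must have the same dimensions")
    if not a:
        return 0.0
    differing = sum(1 for x, y in zip(a, b) if abs(x - y) > per_pixel_tolerance)
    return differing / len(a)


def snapshots_match(a: bytes, b: bytes,
                    max_fraction: float = 0.01,
                    per_pixel_tolerance: int = 8) -> bool:
    """Tolerant comparison: accept a small share of differing pixels.

    The default thresholds are illustrative assumptions, not tuned values.
    """
    return differing_fraction(a, b, per_pixel_tolerance) <= max_fraction
```

With two buffers that differ in a single pixel, `checksum_equal` reports a mismatch while `snapshots_match` still accepts the pair. Suitable thresholds would have to be determined on real snapshot data.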

Developer Notes

QA of the migration could compare snapshots of each of the sites; it could also compare all the files contained in each archive. There may be other aspects of ARCs and WARCs (header information, logs, etc.) that will need checking too. For example, has the log file format changed between the two? Is the WARC structurally sound?
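
The file-by-file approach could be sketched roughly as follows. This is a minimal illustration, not the QA used in the experiments: it assumes an ARC/WARC reader (for example JWAT) has already produced `(url, payload)` pairs for the content records on each side, with WARC-only records such as `warcinfo` filtered out beforehand. The function names are hypothetical.

```python
import hashlib
from collections import Counter


def payload_digests(records):
    """Count (url, SHA-1 of payload) pairs for an iterable of (url, bytes)."""
    return Counter(
        (url, hashlib.sha1(payload).hexdigest()) for url, payload in records
    )


def compare_archives(arc_records, warc_records):
    """Report records that did not survive the migration unchanged.

    A Counter is used rather than a set so that a URL harvested several
    times with identical content is still checked for the right count.
    """
    arc = payload_digests(arc_records)
    warc = payload_digests(warc_records)
    return {
        # present in the ARC but absent (or less frequent) in the WARC
        "missing_in_warc": sorted((arc - warc).elements()),
        # present in the WARC but absent (or less frequent) in the ARC
        "unexpected_in_warc": sorted((warc - arc).elements()),
    }


if __name__ == "__main__":
    # Made-up records purely for illustration.
    arc = [("http://example.org/", b"<html>a</html>"),
           ("http://example.org/logo.gif", b"GIF89a")]
    warc = [("http://example.org/", b"<html>a</html>"),
            ("http://example.org/logo.gif", b"GIF89a-changed")]
    print(compare_archives(arc, warc))
```

An empty result on both keys indicates that every payload was carried over byte for byte. It does not cover the header, log, and structural checks mentioned above, which would need separate validation.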

Using JWAT

Related Documents

Scenarios, case studies, etc. that provide background to this story.
