Compared with the current version by Leïla Medjkoune on Oct 01, 2014 13:14.

** The approach suggested to tackle these potential quality issues is to create a reference image for each crawled site. A reference image is either a snapshot that has undergone the usual (still labor-intensive) manual quality assurance process or a screenshot of the live site taken in a number of different browsers while crawling. Each new crawl can then be compared against this reference image by automated means. Various metrics can serve as indicators, such as the change in the size of the crawl or the number of changed pages, but also more advanced methods such as automated visual comparison of the percentage and location of changes in the rendered pages.
** With these metrics indicating how much the new crawl has changed compared to the reference crawl, various actions can be undertaken: if the changes are very minor, the quality of this crawl is good and a lower crawl frequency can be set; if the changes are relatively small, the quality of this crawl is good; if the changes are relatively big, manual inspection is required. When this crawl is approved, it becomes the new reference crawl. We could also envisage that such automated comparison could trigger automated actions that would improve crawl completeness and quality, and allow monitoring of the Web archives from a qualitative and long-term preservation perspective. Overall, this approach is likely to improve the effectiveness, efficiency and scalability of quality assurance for specific crawls and for Web archives in general. It is strongly related to PW-12: Automated Watch as outlined in [WCT7|http://wiki.opf-labs.org/display/SP/WCT7+Format+obsolescence+detection].
** Another approach would consist of taking screenshots of the live web pages while the crawl is running and comparing them to screenshots of the same web pages as rendered through a Wayback-like tool. This would help pinpoint any major capture or access issue and would limit manual QA intervention.
* EXL
** We agree this is a difficult problem. Even if we were able to take a screenshot of the crawled website and perform a reliable image comparison of the two captures, it would be difficult to differentiate between changes that are errors (incomplete captures, etc.) and genuine changes in content. |
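
The comparison and triage steps discussed above could be sketched roughly as follows. This is only a minimal illustration, not an agreed implementation: screenshots are modelled as plain nested lists of grayscale values, the function names `compare_screenshots` and `triage` are hypothetical, and the thresholds (1% and 20%) are placeholder assumptions that would need tuning per collection.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a rendered screenshot: rows of 0-255 grayscale values.
Image = List[List[int]]
BBox = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def compare_screenshots(
    reference: Image, candidate: Image, tolerance: int = 0
) -> Tuple[float, Optional[BBox]]:
    """Return the fraction of changed pixels and the bounding box of the changes.

    A pixel counts as changed when it differs from the reference by more
    than `tolerance`. The bounding box is None when nothing changed.
    """
    if not reference or not reference[0]:
        raise ValueError("screenshots must not be empty")
    if len(reference) != len(candidate) or any(
        len(ref_row) != len(new_row)
        for ref_row, new_row in zip(reference, candidate)
    ):
        raise ValueError("screenshots must have identical dimensions")

    total = 0
    xs: List[int] = []
    ys: List[int] = []
    for y, (ref_row, new_row) in enumerate(zip(reference, candidate)):
        for x, (ref_px, new_px) in enumerate(zip(ref_row, new_row)):
            total += 1
            if abs(ref_px - new_px) > tolerance:
                xs.append(x)
                ys.append(y)

    bbox = (min(xs), min(ys), max(xs), max(ys)) if xs else None
    return len(xs) / total, bbox


def triage(change_fraction: float, minor: float = 0.01, major: float = 0.20) -> str:
    """Map a change fraction to one of the three actions described above.

    The `minor` and `major` thresholds are assumed values, not from the source.
    """
    if change_fraction < minor:
        return "good, lower crawl frequency"
    if change_fraction < major:
        return "good"
    return "manual inspection required"


if __name__ == "__main__":
    reference = [[0, 0, 0, 0], [0, 0, 0, 0]]
    candidate = [[0, 0, 0, 0], [0, 255, 255, 0]]
    fraction, bbox = compare_screenshots(reference, candidate)
    print(fraction, bbox, triage(fraction))
```

Note that such a sketch only measures how much and where two renderings differ. As EXL points out, it cannot by itself tell a capture error from a genuine content change, which is why the largest differences are routed to manual inspection rather than rejected automatically.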