
| *Title* | Validate and report filetypes per file |
| *Detailed description* | _To report file type discrepancies on a per-file basis, we:_ \\
_ \- open each file from the archive_ \\
_ \- analyse the filetype using droid_ \\
\\
_ \- compare the filetype with the one recorded in the web archive and report any mismatch (we may need to normalize variant spellings of the same MIME type, e.g. image/jpg = image/jpeg)_ \\
_ \- export report in csv_ \\
_Investigate whether this tool can be offered as a module/add-on for the_ [jhove2-bnf|https://bitbucket.org/lbihanic/jhove2-bnf] _fork._ \\ |
| *Solution Champion* | Lucien van wouw <[email protected]> \\ |
| *Corresponding Issue(s)* | [REQ:Identifying web content]\\ |
| *Tool/code link* | [heritrix-imp|https://github.com/openplanets/AQuA/tree/master/heritrix-imp] |
| *[Tool Registry Link|http://wiki.opf-labs.org/display/TR/Home]* | [Heritrix|http://crawler.archive.org/] and [Droid|http://sourceforge.net/projects/droid/]\\ |
| *Evaluation* | _Relatively time-consuming, as the entire archive needs to be unpacked before each file can be identified. But the end result clearly shows, in a simple CSV, the differences between the archive's format description and the external tool (Droid in this case). The solution could be seen as a stand-alone validation tool, but with regard to the problem description it ought to be seen as a proof of concept. That is, similar functionality ought to be ported into the jhove2-bnf module._ |
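
The compare-and-report steps in the detailed description could look roughly like the sketch below. It is not the heritrix-imp code: it assumes each file has already been unpacked and identified, so the input is a list of (file name, MIME type recorded in the archive, MIME type reported by Droid) tuples. The alias table and the names {{MIME_ALIASES}}, {{normalize_mime}}, {{find_mismatches}} and {{write_report}} are hypothetical and only illustrate the normalization idea.

```python
import csv
import io

# Hypothetical alias table: variant spellings mapped to one canonical MIME type.
MIME_ALIASES = {
    "image/jpg": "image/jpeg",
    "image/pjpeg": "image/jpeg",
}


def normalize_mime(value):
    """Lower-case, drop parameters such as '; charset=utf-8', then apply aliases."""
    base = value.split(";", 1)[0].strip().lower()
    return MIME_ALIASES.get(base, base)


def find_mismatches(records):
    """Yield (name, declared, identified) rows whose normalized types differ."""
    for name, declared, identified in records:
        if normalize_mime(declared) != normalize_mime(identified):
            yield name, declared, identified


def write_report(records, out):
    """Write the mismatching rows as CSV to 'out' and return how many were written."""
    writer = csv.writer(out)
    writer.writerow(["file", "archive_mimetype", "identified_mimetype"])
    count = 0
    for row in find_mismatches(records):
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    sample = [
        ("a.jpg", "image/jpg", "image/jpeg"),
        ("b.html", "text/html; charset=UTF-8", "text/html"),
        ("c.gif", "image/gif", "image/png"),
    ]
    buffer = io.StringIO()
    write_report(sample, buffer)
    print(buffer.getvalue(), end="")
```

Normalizing both sides before comparing keeps spelling variants such as image/jpg out of the report, so only genuine disagreements between the archive and the identification tool end up in the CSV.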