digitalcorpora.org


The digitalcorpora.org site holds a number of interesting sets of test files, including:

Govdocs1 — (nearly) 1 million freely-redistributable files

This is a large corpus of documents drawn from a web crawl of US government sites. It appears to be extremely varied and complex.

Also of interest: a company called Forensic Innovations, Inc. has analysed the corpus using its FITools product and published a 'ground truth' file format analysis (see here for a summary). It would be very interesting to test this claim by running other identification tools over the same files and comparing their results against the published analysis.
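
As a rough sketch of how such a comparison might be done, the Python below tallies agreement between two sets of identification results. It assumes both the published analysis and the output of the tool under test have already been loaded into dictionaries mapping file name to a format label, and that the labels have been normalised to a shared vocabulary. The function name, the file names, and the labels are all hypothetical, and nothing here reflects the actual layout of the FITools results.

```python
from collections import Counter


def compare_identifications(ground_truth, tool_results):
    """Compare two {file name: format label} mappings.

    Returns the number of files on which both sources agree, a Counter of
    (expected, actual) label pairs for the disagreements, and a list of
    files the tool under test produced no result for.
    """
    agree = 0
    disagreements = Counter()
    missing = []
    for name, expected in ground_truth.items():
        actual = tool_results.get(name)
        if actual is None:
            missing.append(name)
        elif actual == expected:
            agree += 1
        else:
            disagreements[(expected, actual)] += 1
    return agree, disagreements, missing


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    truth = {"a.pdf": "PDF", "b.doc": "MS Word", "c.txt": "Plain text"}
    tool = {"a.pdf": "PDF", "b.doc": "OLE2 container"}

    agree, disagreements, missing = compare_identifications(truth, tool)
    print("agreed on", agree, "of", len(truth), "files")
    for (expected, actual), count in disagreements.most_common():
        print(expected, "->", actual, ":", count)
    print("no result for:", missing)
```

Grouping the disagreements by label pair, instead of listing them file by file, should make systematic differences easier to spot, for example one tool reporting a container format where the other reports the specific document type.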
