The digitalcorpora.org site holds a number of interesting sets of test files, including:
Govdocs1 — (nearly) 1 million freely-redistributable files
This is a large corpus of documents drawn from a web crawl of US government sites. It appears to be extremely varied and complex.
Also of interest is that a company called Forensic Innovations, Inc. has analysed the corpus using its FITools product and published a 'ground truth' file format analysis (see here for a summary). It would be very interesting to test this 'ground truth' claim by comparing its results against the output of other identification tools.
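As a starting point for such a comparison, here is a minimal sketch in Python. It assumes both the published analysis and the tool under test can be exported as CSV with one row per file. The column names (`file`, `format`), the sample file names, and the labels are all hypothetical, and the actual layout of the published ground truth data has not been checked here.

```python
import csv
import io
from collections import Counter


def load_labels(handle, key_column, label_column):
    """Read a CSV of per-file results into {file name: lower-cased format label}."""
    reader = csv.DictReader(handle)
    return {row[key_column]: row[label_column].strip().lower() for row in reader}


def compare_labels(ground_truth, tool_results):
    """Count agreements, disagreements, and files the tool did not report."""
    summary = Counter()
    disagreements = []
    for name, expected in ground_truth.items():
        actual = tool_results.get(name)
        if actual is None:
            summary["missing"] += 1
        elif actual == expected:
            summary["agree"] += 1
        else:
            summary["disagree"] += 1
            disagreements.append((name, expected, actual))
    return summary, disagreements


# Tiny made-up sample standing in for the real exports (hypothetical data).
truth_csv = io.StringIO("file,format\n000001.doc,doc\n000002.pdf,pdf\n000003.txt,txt\n")
tool_csv = io.StringIO("file,format\n000001.doc,DOC\n000002.pdf,ps\n")

truth = load_labels(truth_csv, "file", "format")
tool = load_labels(tool_csv, "file", "format")
summary, disagreements = compare_labels(truth, tool)

print(dict(summary))
print(disagreements)
```

In practice, different identification tools are unlikely to share a label vocabulary, so a mapping step to a common set of format names would probably be needed before the labels can be compared directly. The list of disagreements is likely to be the most useful output, since each entry is a file worth inspecting by hand.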