The [digitalcorpora.org|http://digitalcorpora.org/] site holds a number of interesting sets of test files, including the following.
h2. Govdocs1 --- (nearly) 1 million freely-redistributable files
This is [a large corpus of documents|http://digitalcorpora.org/corpora/files] drawn from a web crawl of US government sites. It appears to be extremely varied and complex.
Also of interest is that a company called Forensic Innovations, Inc. has analysed the corpus using its FITools product and published a 'ground truth' file format analysis ([see here for a summary|http://digitalcorpora.org/corpora/files/govdocs1-simple-statistical-report]). It would be very interesting to test this 'ground truth' claim by comparing it against the results of other identification tools.
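As a rough illustration of what such a comparison might look like, the sketch below takes two mappings from file name to format label, one standing in for the published analysis and one for another tool's output, and counts where they agree and disagree. The function name, the dictionary layout, and the sample labels are all assumptions made for this example. The actual layout of the published report and of any tool's output would need to be parsed and normalised to a shared label scheme first, which is not shown here.

```python
from collections import Counter


def compare_identifications(ground_truth, tool_output):
    """Compare two {file name: format label} mappings.

    Both arguments are hypothetical, pre-parsed results; labels are
    assumed to be already normalised to the same naming scheme.
    """
    # Only files present in both result sets can be compared directly.
    shared = sorted(set(ground_truth) & set(tool_output))

    agree = 0
    disagreements = Counter()  # (expected, actual) -> number of files
    for name in shared:
        expected, actual = ground_truth[name], tool_output[name]
        if expected == actual:
            agree += 1
        else:
            disagreements[(expected, actual)] += 1

    return {
        "compared": len(shared),
        "agree": agree,
        # Files the second tool produced no result for.
        "missing": sorted(set(ground_truth) - set(tool_output)),
        "disagreements": disagreements,
    }


if __name__ == "__main__":
    # Made-up sample data, not taken from the corpus or the report.
    truth = {"000001.pdf": "pdf", "000002.txt": "text", "000003.jpg": "jpeg"}
    other = {"000001.pdf": "pdf", "000002.txt": "html"}
    summary = compare_identifications(truth, other)
    print(summary["agree"], "of", summary["compared"], "files agree")
    for (expected, actual), count in summary["disagreements"].most_common():
        print(expected, "vs", actual, ":", count)
```

Tallying disagreements by (expected, actual) pair, rather than only reporting an overall agreement rate, should make it easier to see whether the tools differ systematically on particular formats.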