The [|] site holds a number of interesting sets of test files, including:

h2. Govdocs1 --- (nearly) 1 million freely-redistributable files
This is [a large corpus of documents|] drawn from a web crawl of US government sites. It appears to be extremely varied and complex.

Also of interest is that a company called Forensic Innovations, Inc. has analysed the corpus using its FITools product and published a 'ground truth' file format analysis ([see here for a summary|]). It would be very interesting to test this 'ground truth' claim by comparing the published results against the output of other identification tools.
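
As a rough illustration of what such a comparison might look like, here is a minimal Python sketch. It assumes each tool's results have already been reduced to a simple mapping from file name to a format label. The function name, file names and labels are all hypothetical, not taken from the published analysis or from any real tool's output. In practice, labels from different tools would probably need normalising to a common scheme before they could be compared this way.

```python
from collections import Counter


def compare_identifications(ground_truth, tool_results):
    """Compare two {file name: format label} mappings.

    Returns a Counter of 'agree' / 'disagree' / 'missing' totals and a
    list of (file name, expected label, actual label) disagreements.
    """
    counts = Counter()
    disagreements = []
    for name, expected in ground_truth.items():
        actual = tool_results.get(name)
        if actual is None:
            # The second tool produced no result for this file.
            counts["missing"] += 1
        elif actual == expected:
            counts["agree"] += 1
        else:
            counts["disagree"] += 1
            disagreements.append((name, expected, actual))
    return counts, disagreements


# Made-up sample data, for illustration only.
ground_truth = {
    "000001.doc": "application/msword",
    "000002.pdf": "application/pdf",
    "000003.txt": "text/plain",
}
tool_results = {
    "000001.doc": "application/msword",
    "000002.pdf": "application/postscript",
}

counts, disagreements = compare_identifications(ground_truth, tool_results)
print(dict(counts))
for name, expected, actual in disagreements:
    print(f"{name}: expected {expected}, got {actual}")
```

The list of disagreements is probably the most useful output, since each entry is a file where either the 'ground truth' or the other tool is wrong and is worth inspecting by hand.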