Identifying the content of MS Office documents

One line summary We have OLE2 Office documents, which may contain embedded documents, and we want to identify which version of Office created each one.
Detailed description The older binary Office document formats (OLE) are effectively file systems, so format identification alone gives only superficial information about the object. We can tell that a file is an OLE 2.0 Compound Document, but we also need to know which kind of document it is and which application created it. An OLE container can also hold sub-objects, so we want to identify those too.
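
To illustrate how little the container signature reveals, here is a minimal Python sketch that reads only the fixed header fields of a compound file. The function name `read_ole2_header` is hypothetical, and the byte offsets are taken from the published Compound File Binary layout. The result describes the container only; nothing in it says whether the file is Word, Excel or PowerPoint.

```python
import struct

# Eight-byte signature that opens every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def read_ole2_header(data: bytes):
    """Return basic container facts, or None if the data is not OLE2."""
    if len(data) < 512 or data[:8] != OLE2_MAGIC:
        return None
    # Offsets 24..31: minor version, major version, byte order, sector shift.
    minor, major, _byte_order, sector_shift = struct.unpack_from("<4H", data, 24)
    return {
        "major_version": major,
        "minor_version": minor,
        "sector_size": 1 << sector_shift,
    }
```

The version fields here are versions of the container format, not of the Office application that wrote the file, which is why a deeper look inside the container is needed.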

Issue champion Mette van Essen
Possible approaches Use Apache POI to deconstruct the object.
Use doc2x etc. to transform the older format documents to the new OOXML formats and examine those.
Use the commercial library to analyse the object.
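
As a rough illustration of what "deconstructing the object" involves, the sketch below lists the directory entries of a compound file using only the Python standard library. It is not Apache POI and is not the AQuA solution; the function names `list_entries` and `guess_application` are hypothetical. It assumes a small file whose FAT fits in the 109 DIFAT slots of the header, and it ignores the mini stream and the directory tree structure. The stream names in `KNOWN_STREAMS` are the well-known main streams of each application.

```python
import struct

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
END_OF_CHAIN = 0xFFFFFFFE
FREE_SECTOR = 0xFFFFFFFF

# Well-known main stream names and the application that writes them.
KNOWN_STREAMS = {
    "WordDocument": "Microsoft Word",
    "Workbook": "Microsoft Excel",
    "Book": "Microsoft Excel (5.0/95)",
    "PowerPoint Document": "Microsoft PowerPoint",
}


def list_entries(data: bytes):
    """Return (name, type) per used directory entry (1=storage, 2=stream, 5=root)."""
    if data[:8] != OLE2_MAGIC:
        raise ValueError("not an OLE2 compound file")
    sector_size = 1 << struct.unpack_from("<H", data, 30)[0]
    first_dir = struct.unpack_from("<I", data, 48)[0]

    def sector(n):
        # Sector 0 starts right after the header, which occupies one sector.
        start = (n + 1) * sector_size
        return data[start:start + sector_size]

    # Assumption: the whole FAT is reachable from the header's 109 DIFAT slots.
    fat = []
    for fat_sector in struct.unpack_from("<109I", data, 76):
        if fat_sector != FREE_SECTOR:
            fat.extend(struct.unpack(f"<{sector_size // 4}I", sector(fat_sector)))

    entries, current, seen = [], first_dir, set()
    while current != END_OF_CHAIN and current not in seen:
        seen.add(current)  # guard against looping chains in damaged files
        block = sector(current)
        for off in range(0, len(block) - 127, 128):
            name_len, obj_type = struct.unpack_from("<HB", block, off + 64)
            if obj_type == 0:  # unused slot
                continue
            # name_len counts bytes including the two-byte terminator.
            raw = block[off:off + max(name_len - 2, 0)]
            entries.append((raw.decode("utf-16-le", "replace"), obj_type))
        current = fat[current] if current < len(fat) else END_OF_CHAIN
    return entries


def guess_application(entries):
    """Return the application for the first recognised stream name, else None."""
    for name, obj_type in entries:
        if obj_type == 2 and name in KNOWN_STREAMS:
            return KNOWN_STREAMS[name]
    return None
```

This flat listing reports the kind of document but not the Office version, and it does not show which stream belongs to which sub-object. Embedded objects sit in nested storages (Word, for example, keeps them under a storage named `ObjectPool`), so separating them means following the child and sibling pointers in each directory entry, which is the sort of work a library such as Apache POI already handles.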
AQuA Solutions Apache POI Office Document Analyser
Collections MS Word 97-2003 Documents (NANETH)
