One line summary A utility based on Apache POI that analyses MS Office documents.
Detailed description Uses POI to walk through the OLE file structures and look for embedded objects and their properties.
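
To illustrate the kind of structure being walked: every OLE2 compound file starts with a fixed header, and a library such as POI has to read it before it can reach any embedded objects. The sketch below is not the office-analyser code and does not call POI. It is a minimal Java example, using only the standard library, that checks the OLE2 signature and reads two header fields. The class and method names are hypothetical, and the offsets assume the published compound file header layout (signature at byte 0, major version at byte 26, sector shift at byte 30).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of office-analyser or POI. */
final class OleHeaderSketch {

    // Signature that opens every OLE2 compound file (header offset 0).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True if the buffer starts with the OLE2 signature. */
    static boolean isOle2(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    /** Reads an unsigned little-endian 16-bit field from a validated header. */
    private static int le16(byte[] header, int offset) {
        if (header.length < 32 || !isOle2(header)) {
            throw new IllegalArgumentException("not an OLE2 header");
        }
        return ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getShort(offset) & 0xFFFF;
    }

    /** Major format version stored at offset 26 (3 or 4 in practice). */
    static int majorVersion(byte[] header) {
        return le16(header, 26);
    }

    /** Sector size in bytes; the shift at offset 30 is typically 9 (512) or 12 (4096). */
    static int sectorSize(byte[] header) {
        return 1 << le16(header, 30);
    }

    /** Reads up to the first 32 bytes of a file. */
    static byte[] readHeader(Path file) throws IOException {
        byte[] header = new byte[32];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break; // file is shorter than a full header
                }
                filled += n;
            }
        }
        return Arrays.copyOf(header, filled);
    }

    public static void main(String[] args) throws IOException {
        byte[] header = readHeader(Paths.get(args[0]));
        if (header.length < 32 || !isOle2(header)) {
            System.out.println("not an OLE2 compound file");
            return;
        }
        System.out.println("major version: " + majorVersion(header));
        System.out.println("sector size:   " + sectorSize(header));
    }
}
```

An unencrypted .docx is a ZIP container and fails the signature check, so a test like this is one quick way to separate binary Office files from OOXML ones before handing them to POI.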

Solution champion Andrew Jackson
Git link https://github.com/openplanets/AQuA/tree/master/office-analyser
  • Needs to be scripted before it can be run over a collection of files
  • Useful learning about binary Office format identifiers, applicable to format registry development and population; it could also support DROID signature development and JHOVE module development
  • Helped to generate follow-up questions on ComponentObjectStream that can be directed to Microsoft colleagues on the SCAPE Project
  • Extracted useful information about the technical characteristics of this content
  • Good potential for taking this forward
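
The first point above, running the analysis over a whole collection, could be approached along these lines. This is only a sketch under stated assumptions: the class name and the analyser function are hypothetical placeholders, since the entry point of office-analyser is not described here, and the example analyser simply reports file size. It walks a directory tree with the standard library and records one result per file, so that a single unreadable document does not stop the run.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical batch driver for illustration; not part of office-analyser. */
final class BatchRunSketch {

    /**
     * Applies the given analyser to every regular file under root.
     * The analyser is a placeholder for whatever per-file analysis is used.
     */
    static Map<Path, String> analyseTree(Path root, Function<Path, String> analyser)
            throws IOException {
        Map<Path, String> results = new LinkedHashMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            // Sort so that repeated runs report files in a stable order.
            List<Path> files = paths.filter(Files::isRegularFile)
                                    .sorted()
                                    .collect(Collectors.toList());
            for (Path file : files) {
                try {
                    results.put(file, analyser.apply(file));
                } catch (RuntimeException e) {
                    // Record the failure and carry on with the next file.
                    results.put(file, "ERROR: " + e.getMessage());
                }
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in analyser: reports the file size only.
        Function<Path, String> sizeOnly = p -> p.toFile().length() + " bytes";
        analyseTree(Paths.get(args[0]), sizeOnly)
            .forEach((file, result) -> System.out.println(file + "\t" + result));
    }
}
```

Keeping the per-file analysis behind a function argument means the same driver can be reused when the analysis itself changes.
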
* This is a good start for further investigation of the issues that can arise not only with MS Word 97-2003 documents but also with other Office documents of the same period.
* If the solution were supported by DROID and JHOVE, it would be easy to implement in our organisation and our preservation workflows.
* If we know more about the Office files, and about the application and platform on which they were created, it is easier to decide on a preservation strategy. Knowing which embedded or linked objects are in the document is important for the migration of the object.

It is now important to do further testing with more documents.
Tool (link) http://poi.apache.org/
Issue Identifying the content of MS Office documents
