Apache POI Office Document Analyser

One line summary A utility based on Apache POI that analyses MS Office documents.
Detailed description Uses POI to walk through the OLE file structures and look for embedded objects and their properties.
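
The kind of walk described above can be illustrated without POI. The following is a minimal sketch in plain Python, not the analyser's actual code. It assumes a small, well-formed OLE2 file whose FAT sectors are all listed in the header, and it only lists directory entries rather than interpreting them. The function name `list_ole_entries` is hypothetical.

```python
import struct

# Signature found in the first 8 bytes of every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
TYPE_NAMES = {1: "storage", 2: "stream", 5: "root"}


def list_ole_entries(data):
    """Return (name, type, size) for each directory entry in an OLE2 file.

    Simplifications (assumptions of this sketch): only the FAT sectors
    listed in the header are used, and only the low 32 bits of each
    stream size are read.
    """
    if data[:8] != OLE2_MAGIC:
        raise ValueError("not an OLE2 compound file")

    sector_size = 1 << struct.unpack_from("<H", data, 30)[0]
    num_fat_sectors = struct.unpack_from("<I", data, 44)[0]
    first_dir_sector = struct.unpack_from("<I", data, 48)[0]

    def sector(n):
        # Sector 0 starts directly after the header sector.
        start = (n + 1) * sector_size
        return data[start:start + sector_size]

    # The header holds up to 109 FAT sector locations, starting at offset 76.
    fat_locations = struct.unpack_from("<%dI" % min(num_fat_sectors, 109), data, 76)
    fat = []
    for loc in fat_locations:
        fat.extend(struct.unpack("<%dI" % (sector_size // 4), sector(loc)))

    entries = []
    seen = set()
    current = first_dir_sector
    # Follow the directory's sector chain; end-of-chain markers are larger
    # than any valid index, so the bounds check also ends the loop.
    while current < len(fat) and current not in seen:
        seen.add(current)
        block = sector(current)
        for off in range(0, sector_size, 128):  # 128-byte directory entries
            name_len, obj_type = struct.unpack_from("<HB", block, off + 64)
            if obj_type not in TYPE_NAMES:
                continue  # unused slot
            # name_len counts bytes and includes the 2-byte terminator.
            name = block[off:off + max(name_len - 2, 0)].decode("utf-16-le")
            size = struct.unpack_from("<I", block, off + 120)[0]
            entries.append((name, TYPE_NAMES[obj_type], size))
        current = fat[current]
    return entries


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as fh:
        for name, kind, size in list_ole_entries(fh.read()):
            print(f"{kind:8} {size:>10} {name!r}")
```

For a Word 97-2003 file this would typically list streams such as `WordDocument` and `\x01CompObj`, with embedded objects usually appearing as storages beneath `ObjectPool`. POI exposes the same structure through its POIFS classes, along with the property sets that this sketch does not attempt to read.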




Solution champion Andrew Jackson
Git link https://github.com/openplanets/AQuA/tree/master/office-analyser
Group Evaluation Notes
  • Needs to be scripted so that it can be run over a collection of files
  • Useful learning about binary Office format identifiers, applicable to format registry development and population. It could also support DROID signature development and JHOVE module development
  • Helped to generate follow-up questions on ComponentObjectStream that can be directed to Microsoft colleagues on the SCAPE Project
  • Extracted useful information about the technical characteristics of this content
  • Good potential for taking this forward
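
The first note, on running over a collection of files, could be approached along these lines. This is a minimal sketch under stated assumptions, not part of the analyser: it only checks the 8-byte OLE2 signature at the start of each file, which tells you the container type but not whether the file is Word, Excel or something else. The function name `find_ole2_files` is hypothetical.

```python
import os

# Signature found in the first 8 bytes of every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole2_files(root):
    """Return a sorted list of paths under `root` that start with the OLE2 signature."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    if fh.read(8) == OLE2_MAGIC:
                        found.append(path)
            except OSError:
                # Unreadable files are skipped rather than stopping the scan.
                continue
    return sorted(found)


if __name__ == "__main__":
    import sys
    for path in find_ole2_files(sys.argv[1]):
        print(path)
```

The resulting list could then be fed to the analyser one file at a time, so that only OLE2 containers are examined in detail.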
Detailed Evaluation
How well does the solution meet your issue?
* This is a good start for further investigation of the issues that can arise not only with MS Word 97-2003 but also with other Office documents of the same period.
Do you think you can implement the solution in your organisation?
* If the solution were supported by DROID and JHOVE, it would be easy to implement in our organisation and our preservation workflows.
Summarise the benefits that the solution could provide to your organisation.
* If we know more about the Office files, and about the application and platform on which they were created, it is easier to decide on a preservation strategy. Knowing which embedded or linked objects are in the document is important for the migration of the object.

It is now important to do further testing with more documents.
Tool (link) http://poi.apache.org/
Issue Identifying the content of MS Office documents
Labels: apache, poi, ms-office, office, ms, ole, word, excel, access, microsoft, container, characterisation, embedded_objects, aqua, solution