Extracting embedded objects from Office OpenXML documents

Title Extracting embedded objects from Office OpenXML documents
Detailed description Overview: docXtractor is a Python script that uses zipfile and lxml hooks to extract media from OOXML files (specifically docx in the current
alpha implementation). docXtractor parses the internal XML structure and relative link contents and produces a CSV file that maps
the docx-internal image identifiers and descriptions to the extracted objects (images, spreadsheets, etc).
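
The extraction step described above can be sketched as follows. This is a minimal illustration and not docXtractor's actual code: it assumes embedded parts live under word/media/ and word/embeddings/ inside the docx package, and the function name extract_embedded and its output layout are hypothetical.

```python
import os
import zipfile

# Assumption: embedded images and objects sit under these package folders.
EMBEDDED_PREFIXES = ("word/media/", "word/embeddings/")


def extract_embedded(docx_path, out_dir):
    """Copy embedded parts out of a .docx; return {package member: output path}."""
    os.makedirs(out_dir, exist_ok=True)
    extracted = {}
    with zipfile.ZipFile(docx_path) as package:
        for name in package.namelist():
            # Skip everything outside the embedded-part folders, and directory entries.
            if not name.startswith(EMBEDDED_PREFIXES) or name.endswith("/"):
                continue
            # Use only the base name so a crafted member path cannot escape out_dir.
            target = os.path.join(out_dir, os.path.basename(name))
            with package.open(name) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted[name] = target
    return extracted
```

Because only the archive members under those two folders are read, the rest of the document never has to be decompressed. Note that this sketch flattens names, so two parts with the same base name in different folders would overwrite each other.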

In progress:
- Cross-platform compat/testing (currently Linux)
- Code refactoring
Solution Champion Kam Woods ([email protected])
Corresponding Issue(s) Extracting embedded objects from docx files
Tool/code link https://github.com/kamwoods/docXtractor
Tool Registry Link http://lxml.de/
Evaluation Image extraction: Extracts images as requested.
Metadata extraction: Performs a full traversal of the document.xml and document.xml.rels files to build a mapping between extracted objects and relative identifiers.
Time performance: ~2-3GB/minute
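
The mapping between relationship identifiers and extracted objects might be built along these lines. This is a sketch under stated assumptions, not the tool's implementation: it swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra dependencies, it only handles pictures referenced through a:blip inside wp:inline or wp:anchor, and the names build_mapping, write_csv, and the CSV column names are hypothetical.

```python
import csv
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "rel": "http://schemas.openxmlformats.org/package/2006/relationships",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
}
R_EMBED = "{%s}embed" % NS["r"]
CONTAINERS = {"{%s}inline" % NS["wp"], "{%s}anchor" % NS["wp"]}
FIELDS = ["rel_id", "name", "description", "target"]  # hypothetical column names


def build_mapping(docx_path):
    """Return one row per picture, joining document.xml with document.xml.rels."""
    with zipfile.ZipFile(docx_path) as package:
        rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))
        doc = ET.fromstring(package.read("word/document.xml"))
    # Relationship Id -> Target (a path relative to the word/ folder).
    targets = {r.get("Id"): r.get("Target") for r in rels.findall("rel:Relationship", NS)}
    rows = []
    for drawing in doc.iter():  # document order
        if drawing.tag not in CONTAINERS:
            continue
        blip = drawing.find(".//a:blip", NS)
        if blip is None:
            continue
        props = drawing.find("wp:docPr", NS)
        rel_id = blip.get(R_EMBED)
        rows.append({
            "rel_id": rel_id,
            "name": props.get("name", "") if props is not None else "",
            "description": props.get("descr", "") if props is not None else "",
            "target": targets.get(rel_id, ""),
        })
    return rows


def write_csv(rows, csv_path):
    """Write the mapping rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the mapping in a separate CSV, as the tool does, leaves the source document untouched; the target column is what ties a renamed output file back to its identifier in the document.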

Evaluation notes from final presentation at hackathon:
Dev: Solution is command line, with a partially completed GUI!
Dev: Is quite fast, as it does not parse the complete document.
Dev: Extracted files have been renamed (Image1, Image2...), so a mapping is required.
Dev: Mapping is not embedded in the actual document, but stored in another file. Not sure why this is the case!
Dev: Some useful contextual metadata is also there, including path and original format.
Dev: Intention to write a full parser.
CO: Need to do a little more file system testing, but otherwise can take this away and use it!