|Title||Extracting embedded objects from Office OpenXML documents|
|Detailed description|| Overview: docXtractor is a Python script that uses zipfile and lxml to extract media from OOXML files (only docx in the current
alpha implementation). It parses the internal XML structure and the relative link targets, and produces a CSV file that maps
the docx-internal image identifiers and descriptions to the extracted objects (images, spreadsheets, etc.).
- Cross-platform compat/testing (currently Linux)
- Code refactoring
|Solution Champion|| Kam Woods ([email protected])
|Corresponding Issue(s)|| Extracting embedded objects from docx files
|Tool Registry Link||http://lxml.de/|
|Evaluation|| Image extraction: Extracts images as requested.
Metadata extraction: Performs full traversal of document.xml and document.xml.rels files to
build mapping between extracted objects and relative identifiers.
Time performance: ~2-3GB/minute
Evaluation notes from final presentation at hackathon:
Dev: Solution is command line, with a partially completed GUI.
Dev: Is quite fast, because it does not parse the complete document.
Dev: Extracted files have been renamed (Image1, Image2...), so a mapping is required.
Dev: Mapping is not embedded in the actual document, but in another file. Not sure why this is the case!
Dev: Some useful contextual metadata is also there, including path and original format.
Dev: Intention to write a full parser.
CO: Need to do a little more file system testing, but otherwise can take this away and use it!
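The traversal described above (reading document.xml and document.xml.rels to map relationship identifiers to extracted objects) might be sketched roughly as follows. This is not the docXtractor source: it is a minimal illustration that swaps in the standard-library xml.etree.ElementTree for lxml so that it runs without extra dependencies. The function names, the CSV column names, and the restriction to parts under word/media/ and word/embeddings/ are all assumptions made for the example.

```python
import csv
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by OOXML packages (Clark notation for ElementTree).
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"
WP = "{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"


def read_descriptions(document_xml):
    """Map relationship id -> description text for drawings in document.xml."""
    descriptions = {}
    root = ET.fromstring(document_xml)
    for tag in (WP + "inline", WP + "anchor"):
        for drawing in root.iter(tag):
            doc_pr = drawing.find(WP + "docPr")
            blip = drawing.find(".//" + A + "blip")
            if doc_pr is not None and blip is not None:
                rel_id = blip.get(R + "embed")
                if rel_id:
                    descriptions[rel_id] = doc_pr.get("descr", "")
    return descriptions


def map_embedded_objects(docx, csv_out):
    """Write a CSV mapping to csv_out and return {part path: bytes}.

    Hypothetical helper: `docx` is a path or file-like object, `csv_out`
    is a text stream. Only internal targets under word/media/ and
    word/embeddings/ are treated as embedded objects (an assumption).
    """
    objects = {}
    with zipfile.ZipFile(docx) as package:
        names = set(package.namelist())
        descriptions = read_descriptions(package.read("word/document.xml"))
        rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))
        writer = csv.writer(csv_out)
        writer.writerow(["rel_id", "rel_type", "part_path", "description"])
        for rel in rels.iter(REL + "Relationship"):
            if rel.get("TargetMode") == "External":
                continue  # e.g. hyperlinks pointing outside the package
            target = rel.get("Target", "")
            if target.startswith("/"):
                part = target.lstrip("/")  # absolute within the package
            else:
                # Relative targets are resolved against the word/ directory.
                part = posixpath.normpath(posixpath.join("word", target))
            if not part.startswith(("word/media/", "word/embeddings/")):
                continue
            if part not in names:
                continue
            rel_id = rel.get("Id")
            rel_type = rel.get("Type", "").rsplit("/", 1)[-1]
            writer.writerow([rel_id, rel_type, part, descriptions.get(rel_id, "")])
            objects[part] = package.read(part)
    return objects
```

A caller could pass an open text file (opened with newline="") as csv_out and write each returned byte string to disk under whatever naming scheme is preferred. Because the CSV row records the original part path next to the relationship id, the mapping stays recoverable even if the extracted files are renamed.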