| *Title* | Extracting embedded objects from Office OpenXML documents |
| *Detailed description* | Overview: docXtractor is a Python script that uses zipfile and lxml to extract media from OOXML files (only docx in the current \\
alpha implementation). It parses the internal XML structure and the relationship (.rels) entries, then produces a CSV file that maps \\
the docx-internal image identifiers and descriptions to the extracted objects (images, spreadsheets, etc.). \\
\\
In progress: \\
\- GUI \\
\- Cross-platform compat/testing (currently Linux) \\
\- Code refactoring \\ |
| *Solution Champion* | Kam Woods ([email protected]) \\ |
| *Corresponding Issue(s)* | [REQ:Extracting embedded objects from docx files]\\ |
| *Tool/code link* | [https://github.com/kamwoods/docXtractor] |
| *[Tool Registry Link|http://wiki.opf-labs.org/display/TR/Home]* | [http://lxml.de/] |
| *Evaluation* | Image extraction: Extracts images as requested. \\
Metadata extraction: Performs a full traversal of document.xml and document.xml.rels to \\
build a mapping between extracted objects and relative identifiers. \\
Time performance: \~2-3GB/minute \\
\\
Evaluation notes from final presentation at hackathon:\\
Dev: Solution is command-line, with a partially completed GUI\!\\
Dev: It is quite fast because it does not parse the complete document.\\
Dev: Extracted files have been renamed (Image1, Image2...), so a mapping is required.\\
Dev: The mapping is not embedded in the document itself but stored in a separate file. Not sure why this is the case\!\\
Dev: Some useful contextual metadata is also there, including path and original format.\\
Dev: The intention is to write a full parser.\\
CO: Need to do a little more file system testing, but otherwise can take this away and use it\! \\ |
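
The approach described above (read the docx as a zip archive, walk document.xml.rels, copy out the referenced parts, and write the mapping to CSV) can be sketched roughly as follows. This is a minimal illustration, not the docXtractor code: the function name {{extract_embedded}}, the CSV column names, and the restriction to {{word/media/}} and {{word/embeddings/}} are assumptions made for the example, and the standard-library ElementTree parser is swapped in for lxml so the sketch has no external dependencies. It does not look up image descriptions in document.xml.

```python
import csv
import os
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Namespace of package relationship parts (.rels) in OOXML archives.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
RELS_PART = "word/_rels/document.xml.rels"
# Assumption for this sketch: only these folders hold objects of interest.
OBJECT_DIRS = ("word/media/", "word/embeddings/")


def extract_embedded(docx_file, out_dir, csv_path):
    """Hypothetical helper: copy embedded parts out of a docx and write
    a CSV mapping relationship ids to the extracted files."""
    os.makedirs(out_dir, exist_ok=True)
    rows = []
    with zipfile.ZipFile(docx_file) as zf:
        names = set(zf.namelist())
        rels = ET.fromstring(zf.read(RELS_PART))
        for rel in rels.iter("{%s}Relationship" % REL_NS):
            # External targets (e.g. hyperlinks) are not stored in the archive.
            if rel.get("TargetMode") == "External":
                continue
            # Targets are relative to the word/ folder of the package.
            part = posixpath.normpath(
                posixpath.join("word", rel.get("Target", ""))
            ).lstrip("/")
            if part not in names or not part.startswith(OBJECT_DIRS):
                continue
            # Use only the base name so archive paths cannot escape out_dir.
            dest = os.path.join(out_dir, posixpath.basename(part))
            with open(dest, "wb") as fh:
                fh.write(zf.read(part))
            rows.append((rel.get("Id"), part, dest))
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["relationship_id", "archive_path", "extracted_path"])
        writer.writerows(rows)
    return rows
```

A call such as {{extract_embedded("report.docx", "extracted", "mapping.csv")}} (file names hypothetical) would then leave the embedded objects in {{extracted/}} and the id-to-file mapping in {{mapping.csv}}. Because only the relationships part is parsed, the body text of the document is never read, which is consistent with the speed noted in the evaluation.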