Title | Extracting embedded objects from docx files |
Detailed description | We preserve MS Word documents as docx files. We are reasonably confident that the XML structure preserves the report text and structure well. We are less confident about other objects within the files and wish to preserve them separately, or at least assess whether we should. Most report files contain images (stored as EMF within the docx file), and some have embedded spreadsheets (as xlsx) and potentially other embedded objects. We need a tool that looks into the zip folder structure of the docx (specifically the /media/ folder) and extracts the content to a separate location where it can be dealt with accordingly. |
Issue champion | |
Other interested parties | Any other parties who are also interested in applying Issue Solutions to their Datasets |
Possible Solution approaches | A Python script is being written to solve this problem |
Context | |
Lessons Learned | Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice |
Datasets | ADS Grey Literature Library |
Solutions | Extracting embedded objects from Office OpenXML documents |
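The extraction step described in the issue could look roughly like the sketch below. It is not the script mentioned under "Possible Solution approaches", only an illustration of the idea using Python's standard `zipfile` module. The function name `extract_embedded` and the output naming scheme are hypothetical. The folder list is also an assumption: Word normally stores images under `word/media/` and embedded objects such as spreadsheets under `word/embeddings/`, so both are included here even though the issue text only names the media folder.

```python
import zipfile
from pathlib import Path

# Assumed package folders: the usual locations Word writes to, not something
# the issue text specifies beyond the media folder.
DEFAULT_FOLDERS = ("word/media/", "word/embeddings/")


def extract_embedded(docx_path, out_dir, folders=DEFAULT_FOLDERS):
    """Copy every file under the given package folders into out_dir.

    Hypothetical helper for illustration. Returns the list of written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as package:
        for info in package.infolist():
            # Skip directory entries and anything outside the chosen folders.
            if info.is_dir() or not info.filename.startswith(folders):
                continue
            # Flatten the internal path into a single file name so that a
            # crafted entry name cannot write outside out_dir.
            safe_name = info.filename.replace("/", "_").replace("\\", "_")
            target = out_dir / safe_name
            target.write_bytes(package.read(info))
            written.append(target)
    return written


if __name__ == "__main__":
    import sys

    # Usage: python extract_embedded.py report.docx extracted/
    for path in extract_embedded(sys.argv[1], sys.argv[2]):
        print(path)
```

Flattening the internal path (for example `word/media/image1.emf` becomes `word_media_image1.emf`) keeps a record of where each object came from while avoiding name clashes between the two folders. A production tool would probably also want to record checksums and the source document for each extracted file.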