Extracting embedded objects from docx files

Title
Extracting embedded objects from docx files
Detailed description We preserve MS Word documents as docx files. We are reasonably confident that the XML preserves the report text and structure well. We are less confident about the other objects within the files and wish to preserve them separately, or at least assess whether we should. Most report files contain images (stored as emf within the docx file), some have embedded spreadsheets (as xlsx), and there may be other embedded objects. We need a tool that looks into the zip folder structure of the docx (specifically the /media/ folder) and extracts the content to a separate location where it can be dealt with accordingly.
Issue champion Jenny Mitcham
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets
Possible Solution approaches A Python script is being written to solve this problem.
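A minimal sketch of what such a script might look like, using only the Python standard library. It is not the script referred to above. It assumes the usual Office Open XML package layout, in which images sit under word/media/ and embedded files such as spreadsheets sit under word/embeddings/; the function name extract_embedded and the output layout are hypothetical.

```python
import zipfile
from pathlib import Path, PurePosixPath

# Assumption: images usually live under word/media/ and embedded files
# (e.g. xlsx spreadsheets) under word/embeddings/ inside the docx package.
EMBEDDED_PREFIXES = ("word/media/", "word/embeddings/")


def extract_embedded(docx_path, out_dir):
    """Copy embedded media and objects out of a docx; return the written paths."""
    out_dir = Path(out_dir)
    written = []
    with zipfile.ZipFile(docx_path) as package:
        for info in package.infolist():
            if info.is_dir() or not info.filename.startswith(EMBEDDED_PREFIXES):
                continue
            member = PurePosixPath(info.filename)
            # Keep only the immediate folder name ("media" or "embeddings")
            # and the file name, so nothing is written outside out_dir.
            target = out_dir / member.parent.name / member.name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(package.read(info))
            written.append(target)
    return written
```

Calling extract_embedded("report.docx", "report_embedded") (both names are placeholders) would write any images to report_embedded/media/ and any embedded files to report_embedded/embeddings/, leaving the original docx untouched.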
Context
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets ADS Grey Literature Library
Solutions Extracting embedded objects from Office OpenXML documents
Labels: york_hackathon, issue, embedded_objects