Label: embedded_objects

Content with label embedded_objects in Practical Preservation Issues
Related Labels: solution, email, york_hackathon, data_capture, issue, harvesting

Page: Extracting embedded objects from docx files
Title: Extracting embedded objects from docx files
Detailed description: We preserve MS Word documents as docx files. We are reasonably confident that the XML structure preserves the report text and structure well. We are not so confident about ...
Other labels: york_hackathon, issue
Page: Extracting embedded objects from Office OpenXML documents
Title: Extracting embedded objects from Office OpenXML documents
Detailed description: Overview: docXtractor is a Python script using zipfile and lxml hooks to extract media from OOXML files (specifically docx in the current alpha implementation). docXtractor parses ...
Other labels: solution
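The zipfile-based approach mentioned in the excerpt above can be illustrated with a minimal sketch. This is not docXtractor's code: the function names `list_embedded_parts` and `extract_embedded_parts` are hypothetical, and the sketch assumes that a docx package keeps its embedded payloads under the conventional `word/media/` and `word/embeddings/` prefixes. It uses only the standard library and skips the lxml parsing step entirely.

```python
import zipfile
from pathlib import Path, PurePosixPath

# Assumption: embedded images and OLE objects sit under these conventional
# part-name prefixes. A package is free to use other locations.
EMBEDDED_PREFIXES = ("word/media/", "word/embeddings/")


def list_embedded_parts(docx_path):
    """Return the names of package parts that look like embedded objects."""
    with zipfile.ZipFile(docx_path) as package:
        return [
            name
            for name in package.namelist()
            if name.startswith(EMBEDDED_PREFIXES) and not name.endswith("/")
        ]


def extract_embedded_parts(docx_path, out_dir):
    """Copy each embedded part into out_dir and return the written paths.

    File names are flattened to their last path component, so parts that
    share a base name would overwrite each other.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as package:
        for name in list_embedded_parts(docx_path):
            target = out_dir / PurePosixPath(name).name
            target.write_bytes(package.read(name))
            written.append(target)
    return written
```

A call such as `extract_embedded_parts("report.docx", "extracted")` (both names are placeholders) would write the matching parts into `extracted/`. A more careful tool would probably resolve the relationship parts inside the package instead of relying on path prefixes, which may be where the lxml parsing mentioned in the excerpt comes in.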
Page: Web based email "harvesting"
Title: Web based email "harvesting"
Detailed description: The setting is the collection of private archives, more specifically web-based email. It should be possible to automatically harvest emails from web-based email accounts. The system should scale as the number ...
Other labels: york_hackathon, email, issue, harvesting, data_capture