Parsing PST and OST email files for text mining and searching
The issue is that many text extraction tools cannot read PST/OST (MS Outlook) files, because those tools generally require MBOX format.
We want to have a solution that parses email messages so that they can be further used by other text extraction and search tools.
At a minimum, the parsing must extract:
- Subject Line
- Message Body
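As a minimal sketch of the output side of this task, the Python below takes subject and body pairs that some PST/OST parser has already extracted and writes them to an MBOX file using only the standard library. The function name `write_mbox` and the `(subject, body)` record shape are assumptions made for illustration, and the PST/OST parsing step itself is deliberately left out.

```python
import mailbox
from email.message import EmailMessage


def write_mbox(records, mbox_path):
    """Append (subject, body) pairs to an MBOX file; return how many were written.

    `records` is a hypothetical iterable of (subject, body) string pairs
    produced by whatever PST/OST parser is chosen; it is not a real library type.
    """
    box = mailbox.mbox(mbox_path)  # creates the file if it does not exist
    count = 0
    try:
        for subject, body in records:
            msg = EmailMessage()
            # Header values may not contain line breaks, so collapse whitespace.
            msg["Subject"] = " ".join((subject or "").split())
            msg.set_content(body or "")
            box.add(msg)
            count += 1
        box.flush()
    finally:
        box.close()
    return count
```

Calling `write_mbox([("Meeting notes", "See attached agenda.")], "export.mbox")` would append one message to `export.mbox`, which MBOX-aware extraction and search tools could then read.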
Possible Solution approaches
- The solution does not need to be cross-platform: we anticipate that it will be run periodically as a task, and its users will have access to multiple operating systems if necessary.
- We do not want to use commercial solutions to convert files to MBOX.
Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
- OST archive with attachments - MIT IASC
- Parsing PST OST file using TIKA