Title
Parsing PST and OST email files for textual mining and searching
Detailed description
PST and OST (Microsoft Outlook) files cannot be read by many text extraction tools, which generally expect email in MBOX format.
We want a solution that parses the email messages in these files so that other text extraction and search tools can use them.
At minimum, the parser must extract:
- Sender
- Recipient
- Date
- Subject Line
- Message Body
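As an illustration of how these five fields could be handed on to MBOX-based tools, here is a minimal Python sketch using only the standard library. It assumes some PST/OST reader (not shown, and not specified on this page) has already produced the fields. The names `ParsedMessage`, `to_email`, and `append_to_mbox` are hypothetical and chosen for this example only.

```python
import mailbox
from dataclasses import dataclass
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime


@dataclass
class ParsedMessage:
    """Hypothetical record holding the five minimum fields listed above."""
    sender: str
    recipient: str
    date: datetime
    subject: str
    body: str


def to_email(parsed: ParsedMessage) -> EmailMessage:
    """Build a plain-text email message from the extracted fields."""
    msg = EmailMessage()
    msg["From"] = parsed.sender
    msg["To"] = parsed.recipient
    msg["Date"] = format_datetime(parsed.date)
    msg["Subject"] = parsed.subject
    msg.set_content(parsed.body)
    return msg


def append_to_mbox(path: str, messages) -> int:
    """Append messages to an MBOX file (created if missing); return the count.

    No file locking is done here: this sketch assumes a single writer.
    """
    box = mailbox.mbox(path)
    count = 0
    try:
        for parsed in messages:
            box.add(to_email(parsed))
            count += 1
    finally:
        box.close()  # close() flushes pending writes to disk
    return count
```

A caller would pass an iterable of `ParsedMessage` records, for example `append_to_mbox("export.mbox", records)`, and then point the downstream text extraction or search tool at the resulting file. Attachments and multiple recipients are left out of this sketch.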
Issue Champions
Bill LeFurgy
Possible Solution approaches
Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list. Further detail can go in a dedicated Solution page.
- The solution does not need to be cross-platform: we expect it to be run periodically, and its users will have access to multiple operating systems if necessary.
Context
We do not want to use commercial solutions to convert the files to MBOX.
Lessons Learned
Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice
Datasets
OST archive with attachments - MIT IASC
Solutions
Parsing PST OST file using TIKA
1 Comment
Jun 04, 2013
Maurice de Rooij
Please read my comment on: http://wiki.opf-labs.org/pages/viewpage.action?pageId=25887031