
_Parsing PST and OST email files for textual mining and searching_

*Detailed description*

PST/OST \[MS Outlook\] files cannot be used by many text extraction tools, because those tools generally require MBOX format.

We want a solution that parses the email messages in these files so that other text extraction and search tools can use them.

At minimum, the parsing must extract:
* Sender
* Recipient
* Date
* Subject Line
* Message Body
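
As a minimal sketch of what extracting these five fields could look like once messages are available in MBOX form, the following uses only Python's standard `mailbox` and `email` modules. It does not read PST/OST files directly. The function names `extract_fields` and `body_text` and the record layout are illustrative assumptions, not part of any existing solution.

```python
import mailbox
from email.message import Message
from email.utils import parsedate_to_datetime


def body_text(msg: Message) -> str:
    """Return the first text/plain part of a message, or "" if none is found."""
    for part in msg.walk():
        if part.is_multipart() or part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8 (an assumption).
            return payload.decode("utf-8", errors="replace")
    return ""


def extract_fields(mbox_path: str) -> list[dict]:
    """Collect sender, recipient, date, subject and body for every message."""
    records = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            raw_date = msg["Date"]
            try:
                # Normalise the Date header to ISO 8601 where it parses.
                date = parsedate_to_datetime(raw_date).isoformat() if raw_date else None
            except (TypeError, ValueError):
                date = str(raw_date)  # keep the raw header if it is malformed
            records.append({
                "sender": str(msg["From"] or ""),
                "recipient": str(msg["To"] or ""),
                "date": date,
                "subject": str(msg["Subject"] or ""),
                "body": body_text(msg),
            })
    finally:
        box.close()
    return records
```

Each record is a plain dictionary, so the output can be passed to an indexer or written out as JSON or CSV for other text mining tools. Attachments and HTML-only bodies are ignored in this sketch.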

*Issue Champions*



*Possible Solution approaches*
_Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list. Further detail can go in a dedicated Solution page._
* The solution does not need to be cross-platform: we anticipate that it will be run periodically as a task, and its users will have access to multiple operating systems if necessary.
* We do not want to use commercial solutions to convert files to MBOX.

*Lessons Learned*
_Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice_

* [OST archive with attachments - MIT IASC|OST archive with attachments]
* [KB:Parsing PST OST file using TIKA]
* [KB:Converting PST & OST files to MBOX format]