
*Title*
_Parsing PST and OST email files for textual mining and searching_


*Detailed description*

PST and OST (MS Outlook) files cannot be read by many text extraction tools, which generally expect email in MBOX format.

We want a solution that parses the email messages in these files so that other text extraction and search tools can process them.

At a minimum, the parsing must extract:
* Sender
* Recipient
* Date
* Subject Line
* Message Body
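As an illustration of the target output, the following is a minimal sketch of pulling these five fields out of messages once they are available in MBOX format. It assumes the PST/OST file has already been converted to MBOX by some other step, and it uses only the Python standard library (`mailbox` and `email`). The function names `extract_fields` and `extract_from_mbox` and the dictionary keys are hypothetical, chosen for this example only.

```python
import mailbox
from email import policy
from email.parser import BytesParser


def extract_fields(raw_bytes):
    """Return the five minimum fields from one email message given as bytes."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    # Prefer the plain-text body; fall back to HTML if no plain part exists.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "sender": str(msg["From"] or ""),
        "recipient": str(msg["To"] or ""),
        "date": str(msg["Date"] or ""),  # raw header value, not normalised
        "subject": str(msg["Subject"] or ""),
        "body": body_part.get_content().strip() if body_part else "",
    }


def extract_from_mbox(path):
    """Yield one field dictionary per message in the MBOX file at `path`."""
    # Assumption: `path` points to an existing MBOX file produced by an
    # earlier PST/OST conversion step.
    box = mailbox.mbox(path, create=False)
    try:
        for key in box.iterkeys():
            yield extract_fields(box.get_bytes(key))
    finally:
        box.close()
```

Each dictionary yielded by `extract_from_mbox` could then be passed to an indexing or text mining tool. Attachments and messages with unusual character sets are not handled in this sketch.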

*Issue Champions*

[~wlef]


[~smithkr]



*Possible Solution approaches*
_Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list. Further detail can go in a dedicated Solution page._
* The solution does not need to be cross-platform: we expect it to be run periodically as a task, and its users will have access to multiple operating systems if necessary.

*Context*
We do not want to use commercial solutions to convert the files to MBOX.

*Lessons Learned*
_Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice_

*Datasets*
* [OST archive with attachments - MIT IASC|OST archive with attachments]

*Solutions*
[KB:Parsing PST OST file using TIKA]

[KB:Converting PST & OST files to MBOX format]