*Title*
_Parsing PST and OST email files for textual mining and searching_
*Detailed description*
Many text extraction tools cannot read PST/OST \[MS Outlook\] files because they generally require MBOX format.
We want a solution that parses the email messages in these files so that other text extraction and search tools can use them.
At minimum, the parsing must capture:
* Sender
* Recipient
* Date
* Subject Line
* Message Body
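The minimum fields listed above map directly onto standard email headers plus a body, so a target MBOX record can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the PST/OST extraction step is not shown, and the function names (`build_message`, `write_mbox`) and the dictionary keys are hypothetical, chosen here for the example.

```python
import mailbox
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime


def build_message(sender, recipient, date, subject, body):
    """Build an email message from the five minimum fields.

    The argument names are hypothetical; they stand in for whatever a
    PST/OST parser would return for each message.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Date"] = format_datetime(date)  # RFC 5322 date string
    msg["Subject"] = subject
    msg.set_content(body)  # plain-text message body
    return msg


def write_mbox(path, records):
    """Append each record (a dict holding the five fields) to an MBOX file."""
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for rec in records:
            box.add(build_message(**rec))
    finally:
        box.close()  # close() flushes pending messages to disk
```

For example, `write_mbox("out.mbox", [{"sender": "a@example.org", "recipient": "b@example.org", "date": datetime(2020, 1, 2, tzinfo=timezone.utc), "subject": "Hello", "body": "Body text"}])` would produce a one-message MBOX file that downstream text extraction tools could read.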
*Issue Champions*
[~wlef]
[~smithkr]
*Possible Solution approaches*
_Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list. Further detail can go in a dedicated Solution page._
* The solution does not need to be cross-platform: we anticipate that it will be run periodically as a task, and its users will have access to multiple operating systems if necessary.
*Context*
We do not want to use commercial solutions to convert files to MBOX.
*Lessons Learned*
_Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice_
*Datasets*
* [OST archive with attachments - MIT IASC|OST archive with attachments]
*Solutions*
* [KB:Parsing PST OST file using TIKA]
* [KB:Converting PST & OST files to MBOX format]