Parsing PST and OST email files for textual mining and searching

Detailed description

PST and OST (MS Outlook) files cannot be read by many text extraction tools, which generally require email in MBOX format.

We want a solution that parses the email messages in these files so that other text extraction and search tools can use them.

At minimum, the parsing must capture:

  • Sender
  • Recipient
  • Date
  • Subject Line
  • Message Body
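As a rough illustration of pulling these five fields from a message, here is a minimal sketch using only the Python standard library. It assumes the PST/OST file has already been converted to MBOX by some other tool; it does not read PST or OST directly. The function names `extract_fields` and `parse_mbox` are hypothetical and chosen only for this example.

```python
import email
import mailbox
from email import policy


def extract_fields(msg):
    """Return the five minimum fields from an email.message.EmailMessage."""
    # Prefer a plain-text body; fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "sender": str(msg.get("From", "")),
        "recipient": str(msg.get("To", "")),
        "date": str(msg.get("Date", "")),
        "subject": str(msg.get("Subject", "")),
        "body": body,
    }


def parse_mbox(path):
    """Yield one field dictionary per message in an existing MBOX file."""
    box = mailbox.mbox(path, create=False)
    try:
        for raw in box:
            # Re-parse with the modern policy so get_body() is available.
            msg = email.message_from_bytes(raw.as_bytes(), policy=policy.default)
            yield extract_fields(msg)
    finally:
        box.close()
```

Each dictionary can then be passed to an indexing or text mining tool. Only the "To" header is read as the recipient here; Cc and Bcc headers, attachments, and messages with unusual character sets would need additional handling.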

Issue Champions

Bill LeFurgy

Kari Smith

Possible Solution approaches
Brief brainstorm of possible approaches to solving the Issue. Each approach should be described in a single sentence as part of a bulleted list. Further detail can go in a dedicated Solution page.

  • The solution does not need to be cross-platform: we anticipate that it will be run periodically, and its users will have access to multiple operating systems if necessary.
  • We do not want to use commercial solutions to convert the files to MBOX.

Lessons Learned
Notes on Lessons Learned from tackling this Issue that might be useful to inform digital preservation best practice

OST archive with attachments - MIT IASC

Parsing PST OST file using TIKA

Converting PST & OST files to MBOX format

Labels: chapel_hill, issue, appraisal_assessment, unknown_characteristics