Identifying the content of Email Mailboxes

One line summary A single mailbox file (.mbox/.mbx/.pst) can contain many email messages, with or without attachments, and we want to identify them.

Detailed description The main focus will be on a Eudora mailbox. Eudora uses a variation of mboxo, which is one of the formats in the mbox family. Eudora separates attachments from the messages they were embedded in and stores them as individual files in a single folder. We want to know how many messages there are in the mailbox, how many attachments, which file formats the attachments have, etc.
MBOX File Specifications

We have the same problem with .pst files (Microsoft Outlook Personal Folder File Format). Each PST file represents a message store that contains an arbitrary hierarchy of Folder objects, which contain Message objects, which can contain Attachment objects. Information about Folder objects, Message objects, and Attachment objects is stored in properties, which collectively contain all of the information about the particular item.
Outlook Personal Folders (.pst) File Format specifications

The main reason we want to know this now is that we are running tests on different preservation tools for e-mail (like the PeDALS Email Extractor, the CERP Email Preservation Parser and other tools), and it is a lot of work to check manually whether a tool has processed all messages and attachments.
Issue champion Mette van Essen
Possible approaches MESSAGES:
1. Read mbox protocol
2. Identify start + end email messages (bytes)
3. Count them
4. Automate
5. Compare
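
A minimal sketch of the MESSAGES steps in Python. It assumes that every line starting with "From " marks the start of a message (the mboxo convention, where such lines inside a message body are escaped as ">From "). The function names are illustrative only, and real Eudora mailboxes may need a stricter separator test.

```python
def message_spans(data: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of each message in an mboxo-style file.

    Assumption: any line that begins with b"From " is a message separator.
    """
    starts = []
    pos = 0
    for line in data.splitlines(keepends=True):
        if line.startswith(b"From "):
            starts.append(pos)
        pos += len(line)
    # Each message ends where the next one starts; the last one ends at EOF.
    ends = starts[1:] + [len(data)]
    return list(zip(starts, ends))


def count_messages(path: str) -> int:
    """Count the messages in the mailbox file at `path`."""
    with open(path, "rb") as fh:
        return len(message_spans(fh.read()))


def matches_tool_output(path: str, reported_by_tool: int) -> bool:
    """Compare our own count with the number a preservation tool reported."""
    return count_messages(path) == reported_by_tool
```

For example, `matches_tool_output("In.mbx", 1250)` would check a mailbox against a tool that reported 1250 messages; both the file name and the number are made up.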
------------------------------------------------------------------------
ATTACHMENTS:
1. Identify attachment sequencing
2. Count
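
A rough Python sketch of the ATTACHMENTS steps. It assumes that Eudora marks each detached attachment with a body line of the form `Attachment Converted: "<full path>"` (the quotes may be absent in some versions) and that a latin-1 decode is good enough for the path. The file extension is only a weak hint of the real file format, so a format identification tool should still be run on the files themselves.

```python
import re
from collections import Counter
from pathlib import PureWindowsPath

# Assumed marker line that Eudora writes in place of a detached attachment.
ATTACH_RE = re.compile(rb'^Attachment Converted: "?([^"\r\n]+)"?\s*$', re.MULTILINE)


def attachment_paths(data: bytes) -> list[str]:
    """Return the attachment paths referenced in the raw mailbox bytes."""
    return [m.group(1).decode("latin-1").strip() for m in ATTACH_RE.finditer(data)]


def extension_counts(paths: list[str]) -> Counter:
    """Tally the referenced attachments by lower-cased file extension."""
    return Counter(PureWindowsPath(p).suffix.lower() or "(none)" for p in paths)
```

`len(attachment_paths(data))` then gives the number of attachment references, which can be compared with the number of files in the attachment folder and with the count a preservation tool reports.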
------------------------------------------------------------------------
Other Solutions:
- Normalize to IMAP
- Parse in email readers (Thunderbird)
- Apache virtual file system


Context  NANETH
AQuA Solutions Identifying the content of Email Mailboxes - Solution
Collections Email Mailbox Collections  
Labels: email, characterisation, migration, normalisation
  1. Jul 29, 2011

    Interesting this. Couple of thoughts.

    You'll probably be aware of this already, but for the PST to mbox migration you may want to check out the (open) libPST library:

    http://www.five-ten-sg.com/libpst/

    It includes a 'readpst' utility that converts a PST to mbox format. There's a link to a Windows binary at the bottom of the following page:

    http://blog.christophersmart.com/2009/12/01/build-libpst-for-windows/

    I recently used readpst myself to convert 2.5 years of work-related Outlook e-mail to mbox, and I was pretty impressed with the results. Behaviour for attachments, inline images and formatting all looked pretty good to me.
     
    As for Eudora attachments, there's one thing to watch out for in particular: as you correctly point out, Eudora stores all attachments in one folder (actually I think the later paid-for versions could also be configured to use multiple directories, but I may be wrong here). The tricky bit is that the references to attachments in messages are always stored as full paths. If the Eudora-to-mbox migration tool uses these paths to locate the attachments, this can result in loss of data. Here's why.

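    Suppose I have set up Eudora on my system. The client is configured to store all attachments in the following folder:

        C:\Documents and Settings\johan\Application Data\Qualcomm\Eudora\attach

    At some point I switch over to another PC. On the new PC the attachment folder is here:

        F:\data\Qualcomm\Eudora\attach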

    Now suppose that during the switch I've copied the contents of my old attachment directory (C:\Documents and Settings\johan\Application Data\Qualcomm\Eudora\attach) to the new one (F:\data\Qualcomm\Eudora\attach). In that case Eudora can read all my old e-mail messages, including attachments. However, any e-mail messages with attachments that originate from the 'old' system will still contain a link to the 'old' attachment directory!

    So the link to attachment "rubbish.pdf" may point to directory "C:\Documents and Settings\johan\Application Data\Qualcomm\Eudora\attach", whereas in reality the file will be in "F:\data\Qualcomm\Eudora\attach"! If the migration tool is not aware of this, these attachments will not be included in the migration! I remember that back in 2008 this was also the default behaviour of the Thunderbird importer. To get around this, I ended up writing a script that located all attachment references in each mailbox and replaced their file paths by the path to the actual attachment directory.
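
    For illustration, a script of that kind could be sketched as follows in Python. This is not the original script: it assumes the references appear as `Attachment Converted: "<full path>"` lines, that all attachments now live in a single directory, and that latin-1 is an acceptable encoding for the paths. The function name is made up.

```python
import re
from pathlib import PureWindowsPath

# Assumed form of Eudora's attachment reference line; captures the prefix,
# the path (with or without quotes) and an optional trailing carriage return.
ATTACH_RE = re.compile(
    rb'^(Attachment Converted: )"?([^"\r\n]+?)"?[ \t]*(\r?)$', re.MULTILINE
)


def relocate_attachment_refs(data: bytes, new_dir: str, encoding: str = "latin-1") -> bytes:
    """Point every attachment reference at `new_dir`, keeping the file name."""

    def fix(match: re.Match) -> bytes:
        # Keep only the file name from the stale path, then join it to new_dir.
        name = PureWindowsPath(match.group(2).decode(encoding)).name
        new_path = str(PureWindowsPath(new_dir) / name)
        return match.group(1) + b'"' + new_path.encode(encoding) + b'"' + match.group(3)

    return ATTACH_RE.sub(fix, data)
```

    It is safer to run something like this on a copy of the mailbox, since the rewritten file will usually differ in length from the original.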

    From what I remember there's a similar issue for embedded content (which is in the "Embedded" folder), but I don't remember much about that.

    I used to use Eudora myself for many years. In 2008 I switched over to Thunderbird, which forced me to find a solution for my archived e-mail from 1998 onward (which was all in Eudora format). In the end I managed to migrate all my Eudora mailboxes including attachments to a Thunderbird-readable mbox format, but this required quite a bit of fiddling around with custom-built scripts. Apart from the attachment problem, another thing I remember is that some of my Eudora mailboxes contained some non-text control characters, which were creating problems as well (probably some byte corruption issue). This is another thing that needed a custom built script. But maybe this situation has improved by now.
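
    As a rough idea of how such control characters can be located before a migration, here is a small Python sketch. It treats every byte below 0x20 other than tab, line feed and carriage return, plus 0x7F, as suspicious; that rule and the function names are assumptions, not what my original script did.

```python
# Control bytes that are normal in a text mailbox: tab, line feed, carriage return.
ALLOWED = {0x09, 0x0A, 0x0D}


def is_suspicious(byte: int) -> bool:
    """True for control bytes that should not normally occur in mail text."""
    return (byte < 0x20 and byte not in ALLOWED) or byte == 0x7F


def find_control_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every suspicious control byte."""
    return [(i, b) for i, b in enumerate(data) if is_suspicious(b)]


def strip_control_bytes(data: bytes) -> bytes:
    """Return a copy of the data with the suspicious control bytes removed."""
    return bytes(b for b in data if not is_suspicious(b))
```

    Listing the offsets first makes it possible to inspect what would be removed before anything in the mailbox is changed.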

    1. Aug 04, 2011

      Hi Johan,

      Thank you. This is very useful information.
      I knew about readPST but I haven't tried it yet (now I will for sure).

      If I get more information about this subject I will let you know.

      M.