Harvest webmail accounts

Title Harvest a web mail account and generate an ARC file out of it
Detailed description To harvest a web mail account, we use the JavaMail API (http://java.sun.com/developer/onlineTraining/JavaMail/contents.html#JavaMailIntro) with the POP3 protocol. The first issue is that POP3 access is usually SSL-protected, so the connection must be made over SSL (recent versions of JavaMail, >1.4.3, provide a pop3s protocol for SSL connections).
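
As an illustration of the SSL point above, here is a minimal sketch of the settings such a connection might use. The class and method names are hypothetical and are not taken from PopmailArchive.war; the host in the usage note is a placeholder. The property keys follow JavaMail's documented `mail.<protocol>.*` naming, and 995 is the standard port for POP3 over SSL.

```java
import java.util.Properties;

// Hypothetical helper, not part of the attached tool: it only assembles
// the settings a JavaMail session would read for a given store protocol.
final class MailStoreSettings {

    // protocol is a JavaMail store name such as "pop3s" (POP3 over SSL).
    static Properties forStore(String protocol, String host, int port) {
        Properties props = new Properties();
        // Default protocol used when the session is asked for its store.
        props.setProperty("mail.store.protocol", protocol);
        // Per-protocol connection settings.
        props.setProperty("mail." + protocol + ".host", host);
        props.setProperty("mail." + protocol + ".port", Integer.toString(port));
        return props;
    }
}
```

With JavaMail on the classpath, the result of `MailStoreSettings.forStore("pop3s", "pop.example.org", 995)` would typically be passed to `Session.getInstance(props)`, after which `session.getStore()` returns the store to connect to with the account's user name and password.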

Once connected to the POP3 store, we iterate over the messages and serialize their content to an ARC file using the Heritrix utility (http://crawler.archive.org/). The rationale for using ARC instead of the mbox format is that ARC is already in use for web archiving.

We chose the mailto scheme to identify each record (a message or a part of a message) in the ARC file.
For example, a message with 2 alternatives (one text, the other HTML) will appear as 3 records:
  1. mailto://<username>?subject=<subject>

    (multipart/alternative) : containing the headers (from, to, ... fields)

  2. mailto://<username>?subject=<subject>#part1

    (text/plain) : the text of the mail

  3. mailto://<username>?subject=<subject>#part2

    (text/html) : the html of the mail 
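
The naming pattern above can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and form-encoding the subject is an assumption made here so that the identifier contains no spaces; the page does not say how the tool treats special characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from PopmailArchive.war.
final class MailRecordIds {

    // Returns the identifier of the message record, followed by one
    // identifier per part, numbered from 1 (#part1, #part2, ...).
    static List<String> recordIds(String username, String subject, int partCount) {
        String base = "mailto://" + username + "?subject=" + encode(subject);
        List<String> ids = new ArrayList<>();
        ids.add(base);
        for (int i = 1; i <= partCount; i++) {
            ids.add(base + "#part" + i);
        }
        return ids;
    }

    // Assumption: the subject is form-encoded (spaces become '+').
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

For a message with two alternatives, `MailRecordIds.recordIds(username, subject, 2)` yields three identifiers, one for the multipart/alternative record and one for each part.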

A first extension was to add the IMAP protocol to go beyond the restrictions of POP3. The main advantage is that you can choose a specific folder to harvest (POP3 restricts you to the INBOX folder).

Solution Champion Thomas Ledoux
Corresponding Issue(s) Web based email "harvesting"
Tool/code link The code can be found in the attached WAR file: PopmailArchive.war
Tool Registry Link
Evaluation A needed extension is to work directly through HTTP(S) exchanges, in order to handle providers that do not offer POP3 or IMAP and to get around firewall restrictions. Some programs such as MrPostman (http://sourceforge.net/projects/mrpostman/) proxy from HTTP to POP3: the main problem with such a tool is maintaining the scripts that translate between POP3 commands and HTTP requests. Webmail sites change often and become more and more sophisticated (Ajax, ...). A solution could be to use the mobile version of such sites, which offers simpler interaction.
CO: ARC was not an original requirement, but this is OK.
CO: This is the prototype that was required, and CO can now try this out with some testers.
Labels: solution, data_capture