
| *Title* | Harvest a web mail account and generate an ARC file out of it |
| *Detailed description* | To harvest a web mail account, we use the JavaMail API ([http://java.sun.com/developer/onlineTraining/JavaMail/contents.html#JavaMailIntro]) with the POP3 protocol. The first issue is that this protocol is usually SSL protected, so POP3 has to be accessed through SSL (recent versions of JavaMail, >1.4.3, provide a *pop3s* protocol for SSL-protected servers). \\
\\
Once connected to the POP3 store, we iterate over the messages and serialize their content to an ARC file using the Heritrix utilities ([http://crawler.archive.org/]). The rationale for using ARC rather than the mbox format is that ARC is already in use for archiving the web. \\
\\
We chose the mailto scheme to identify each record (a message or a part of a message) in the ARC file. \\
For example, a message with 2 alternatives (one text, the other HTML) appears as 3 records: \\
# {noformat}mailto://<username>?subject=<subject>{noformat} (multipart/alternative): contains the headers (from, to, ... fields)
# {noformat}mailto://<username>?subject=<subject>#part1{noformat} (text/plain): the text of the mail
# {noformat}mailto://<username>?subject=<subject>#part2{noformat} (text/html): the HTML of the mail \\
\\
A first extension added IMAP support to go beyond the restrictions of POP3. The main advantage is that you can choose a specific folder to harvest (POP3 restricts you to the INBOX folder). |
| *Solution Champion* | Thomas Ledoux \\ |
| *Corresponding Issue(s)* | [REQ:Web based email "harvesting"] \\ |
| *Tool/code link* | The code can be found in the attached WAR file: [^PopmailArchive.war] |
| *[Tool Registry Link|http://wiki.opf-labs.org/display/TR/Home]* | * JavaMail : [http://www.oracle.com/technetwork/java/javamail/index.html]
* Heritrix : [http://crawler.archive.org/] |
| *Evaluation* | A needed extension is to harvest directly over HTTP(S) exchanges, to handle providers that offer neither POP3 nor IMAP, as well as firewall restrictions. Some programs such as MrPostman ([http://sourceforge.net/projects/mrpostman/|http://sourceforge.net/projects/mrpostman/]) proxy between HTTP and POP3: the main problem with such a tool is maintaining the scripts that translate between POP3 commands and HTTP requests. Webmail sites change often and are becoming more and more sophisticated (Ajax, ...). One solution could be to use the mobile version of such sites, where the interaction is simpler. \\
CO: ARC was not an original requirement, but this is OK. \\
CO: This is the prototype that was required, and CO can now try this out with some testers. \\ |
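
The record naming given in the detailed description (one record for the message headers, then one {{#partN}} record per body part) can be sketched as follows. This is only an illustration under stated assumptions, not the code shipped in the attached WAR file: the class name {{MailRecordUris}}, the method {{recordUris}} and the URL-encoding of the subject are all hypothetical, since the page does not say how the subject is escaped.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MailRecordUris {

    // Hypothetical helper: builds the record identifiers for one message
    // that has `partCount` body parts (e.g. 2 for text + HTML alternatives).
    public static List<String> recordUris(String username, String subject, int partCount) {
        // Assumption: the subject is URL-encoded so the identifier stays on one token.
        String base = "mailto://" + username + "?subject="
                + URLEncoder.encode(subject, StandardCharsets.UTF_8);

        List<String> uris = new ArrayList<>();
        uris.add(base); // container record (multipart/alternative) holding the headers
        for (int i = 1; i <= partCount; i++) {
            uris.add(base + "#part" + i); // one record per body part, numbered from 1
        }
        return uris;
    }

    public static void main(String[] args) {
        // Prints the three identifiers for a message with two alternatives.
        for (String uri : recordUris("jdoe", "Hello", 2)) {
            System.out.println(uri);
        }
    }
}
```

With the user name {{jdoe}} and the subject {{Hello}}, this yields the base identifier followed by its {{#part1}} and {{#part2}} variants, matching the three-record example above.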