
This part connects full-text search with warcbase, so that users can run queries against the archived Web and get aggregated views of the archive's content.


We ingested the sample warc files into an HBase table, which serves as the wayback machine end of the search interface. We indexed the sample arc files using the Terrier search engine. The index goes beyond a standard full-text index: it is enriched with metadata such as the MIME type and the detected language of each indexed document. There are two ways to add this metadata to the index. The first is to run a pre-processing step that uses a Pig user-defined function (UDF) to output the metadata of each document. The second is to use Tika during indexing to detect both the MIME type and the language.
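
As a rough illustration of why the metadata matters, the sketch below builds a toy in-memory index in which each document carries its MIME type and language, runs a query, and then aggregates the hits by a metadata field. It is a stand-in written in plain Python, not Terrier's or warcbase's API: the names `DOCS`, `build_index`, `search`, and `aggregate` are hypothetical, and the metadata values are hard-coded where the real pipeline would obtain them from the Pig UDF or from Tika.

```python
from collections import Counter, defaultdict

# Hypothetical sample documents: (doc_id, text, metadata).
# In the real setup the metadata would come from a Pig UDF or from Tika;
# here it is hard-coded for illustration.
DOCS = [
    ("d1", "web archive search", {"mime_type": "text/html", "language": "en"}),
    ("d2", "archief van het web", {"mime_type": "text/html", "language": "nl"}),
    ("d3", "web archive report", {"mime_type": "application/pdf", "language": "en"}),
]


def build_index(docs):
    """Build term postings plus a per-document metadata store."""
    postings = defaultdict(set)
    metadata = {}
    for doc_id, text, meta in docs:
        metadata[doc_id] = meta
        for term in text.lower().split():
            postings[term].add(doc_id)
    return postings, metadata


def search(postings, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(postings.get(terms[0], set()))
    for term in terms[1:]:
        result &= postings.get(term, set())
    return result


def aggregate(metadata, doc_ids, field):
    """Count the matching documents per value of one metadata field."""
    return Counter(metadata[doc_id][field] for doc_id in doc_ids)


postings, metadata = build_index(DOCS)
hits = search(postings, "web archive")
by_language = aggregate(metadata, hits, "language")
by_mime_type = aggregate(metadata, hits, "mime_type")
```

Keeping the metadata next to the postings is what makes aggregated views cheap in this sketch: a breakdown by language or MIME type is a single pass over the hit list, with no need to re-read the archived content.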

(copied in from Thaer Sammar)