Version 1 by Clemens Neudecker
on Dec 04, 2013 13:58.

The goal of this part is to connect full-text search with warcbase, so that users can run queries on the archived Web and get aggregated views of the archive's content.


We ingested the sample WARC files into an HBase table, which serves as the wayback machine end of the search interface. We indexed the sample ARC files with the Terrier search engine. The index goes beyond a plain full-text index: it is enriched with metadata such as the MIME type and the detected language of each indexed document. There are two ways to add this metadata to the index. The first is to run a pre-processing step that uses a Pig user-defined function to output the metadata of each document. The second is to use Tika during indexing to detect both the MIME type and the language.
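
To illustrate how a metadata-enriched index can support aggregated views, here is a minimal sketch in plain Python. It is not the Terrier, Pig, or Tika code used in this work: the `EnrichedIndex` class, its method names, and the sample documents are all hypothetical, and the MIME type and language are assumed to have been detected already by one of the two options described above.

```python
from collections import Counter, defaultdict


class EnrichedIndex:
    """Toy inverted index that keeps per-document metadata next to the postings.

    Hypothetical illustration only; it does not mirror Terrier's index format.
    """

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.metadata = {}                # document id -> {"mime": ..., "lang": ...}

    def add(self, doc_id, text, mime, lang):
        """Index the whitespace-separated terms of a document and store its metadata."""
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.metadata[doc_id] = {"mime": mime, "lang": lang}

    def search(self, query):
        """Return the ids of documents that contain every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

    def aggregate(self, query, field):
        """Count the matching documents per value of a metadata field."""
        return Counter(self.metadata[doc_id][field] for doc_id in self.search(query))


# Made-up sample documents; MIME type and language are supplied by the caller.
index = EnrichedIndex()
index.add("doc1", "web archive search", "text/html", "en")
index.add("doc2", "archive des webs", "text/html", "de")
index.add("doc3", "archive report", "application/pdf", "en")

print(index.aggregate("archive", "mime"))
print(index.aggregate("archive", "lang"))
```

Storing the metadata alongside the postings means an aggregated view, such as the number of hits per MIME type or per language, can be computed directly from a query's result set without going back to the archived documents.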