Current by Clemens Neudecker
on Dec 04, 2013 13:58.


We ingested the sample WARC files into an HBase table, which serves as the Wayback Machine end of the search interface. We indexed the sample ARC files using the Terrier search engine. The index goes beyond a normal index: it is enriched with metadata such as the MIME type and the detected language of each indexed document. There are two ways to add this metadata to the index. The first is to run a pre-processing step that uses a Pig user-defined function to output the metadata of each document. The second is to use Tika during indexing to detect both the MIME type and the language.
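As a rough illustration of what enriching a document with metadata means, here is a minimal Python sketch. It is not the Pig UDF or the Tika integration described above: the standard-library `mimetypes` module stands in for Tika's MIME type detection, and a toy stopword count stands in for its language detection. The function names, the field names `mimetype` and `language`, and the stopword lists are all assumptions made for this example.

```python
import mimetypes

# Toy stopword lists: a stand-in for a real language detector such as Tika's.
# These lists are illustrative assumptions, not part of the original setup.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "nl": {"de", "het", "en", "van", "een", "is"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def guess_mime_type(url):
    """Guess the MIME type from the URL's extension (stand-in for Tika)."""
    mime, _encoding = mimetypes.guess_type(url)
    return mime or "application/octet-stream"


def enrich(url, text):
    """Build the record that would be handed to the indexer, metadata included."""
    return {
        "url": url,
        "text": text,
        "mimetype": guess_mime_type(url),  # hypothetical metadata field name
        "language": guess_language(text),  # hypothetical metadata field name
    }


if __name__ == "__main__":
    print(enrich("http://example.org/page.html", "the cat is in the house"))
```

Whether this step runs as a separate pre-processing pass or inside the indexer, the result is the same kind of record: the document text plus a few extra fields that the search interface can filter on.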

(copied in from Thaer Sammar)