Current version by Clemens Neudecker on Dec 09, 2013 12:31.

h1. Working plan

Create UDFs for
* language detection using Tika
* MIME type identification using Tika

h2. Language detection

For now, clone [] and check out the pig-integration branch. Running the unit tests runs the above Pig Latin script on the provided gzipped test ARC file. The language distribution reported by Tika is:



h1. Pig scripts

Combined script for mime type and language detection

raw = load '...' using org.warcbase.pig.ArcLoader() as (url: chararray, date: chararray, mime: chararray, content: chararray);

-- Detect the mime type of the content using tika
a = foreach raw generate url,mime,content,SUBSTRING(date,0,12) as date,org.warcbase.pig.piggybank.DetectMimeType(content) as tikaMime;

-- Select the textual files
b = filter a by (tikaMime == 'text/html');

-- Strip the tags from the content
c = foreach b generate url,mime,tikaMime,date,org.warcbase.pig.piggybank.ExtractRawText(content) as txt;

-- Use tika to identify the language of the textual content
d = foreach c generate url,mime,tikaMime,date,org.warcbase.pig.piggybank.DetectLanguage(txt) as lang,txt;

store d into 'tmp' using PigStorage();

-- Group by language and date
e = group d by (lang,date);

-- Count the number of entries in the groups
f = foreach e generate $0, COUNT($1);

dump f;

Example output from above script: [^output.txt]
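
The grouping and counting steps of the script can be illustrated outside Pig. The following is a minimal plain-Java sketch, not part of warcbase: the class {{LanguageDateCount}}, the {{Entry}} type and the sample values are hypothetical and only mirror the {{lang}} and {{date}} fields of the script. It assumes the date is a digit string such as {{20131209123100}}, and it truncates the date inside the counting method, whereas the script does so earlier with {{SUBSTRING(date,0,12)}}.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LanguageDateCount {

    // Hypothetical record type holding the two fields the script groups on.
    public static final class Entry {
        public final String lang;
        public final String date;

        public Entry(String lang, String date) {
            this.lang = lang;
            this.date = date;
        }
    }

    // Keeps at most the first 12 characters of the date,
    // in the spirit of SUBSTRING(date,0,12) in the script.
    public static String truncateDate(String date) {
        return date.substring(0, Math.min(12, date.length()));
    }

    // Counts entries per (lang, truncated date) pair, analogous to
    // "group d by (lang,date)" followed by COUNT on each group.
    public static Map<String, Long> countByLangAndDate(List<Entry> entries) {
        Map<String, Long> counts = new TreeMap<>();
        for (Entry e : entries) {
            String key = e.lang + "\t" + truncateDate(e.date);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration only.
        List<Entry> sample = List.of(
            new Entry("en", "20131209123100"),
            new Entry("en", "20131209123145"),
            new Entry("de", "20131209123100"));
        countByLangAndDate(sample)
            .forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```

Because the date is cut to 12 characters, entries captured within the same minute fall into the same group.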