
Working plan

Pig examples

-- Simple language detection example
--register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
DEFINE DetectLanguage org.warcbase.pig.piggybank.DetectLanguage();
-- Load (url, date, mime, content) records from the ARC files
raw = load '$testArcFolder'
    using ArcLoader() as (url: chararray, date: chararray, mime: chararray, content: chararray);
-- Extract the raw text from each record's content before detecting its language
b = foreach raw generate url, mime, ExtractRawText(content) as content;
-- c = foreach b generate url,mime,DetectLanguage(content) as lang;
c1 = foreach b generate DetectLanguage(content) as lang;
-- Count the number of records per detected language
d = group c1 by lang;
e = foreach d generate group, COUNT(c1);
store e into '$experimentfolder/e';
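
The commented-out relation c in the script above keeps url and mime next to the detected language. As a minimal sketch of how that could be used, the following breaks the language counts down by MIME type. It assumes the DEFINE statements and relation b from the script above; the relation names c2, d2 and e2 and the output folder name lang_by_mime are made up for illustration and are not part of the warcbase code.

```pig
-- Sketch only: language distribution per MIME type.
-- Reuses relation b and the DetectLanguage UDF defined in the script above.
-- c2, d2, e2 and 'lang_by_mime' are illustrative names, not from the repository.
c2 = foreach b generate mime, DetectLanguage(content) as lang;
-- Group on the (mime, lang) pair so that each pair gets its own count
d2 = group c2 by (mime, lang);
-- flatten(group) turns the grouping tuple back into two separate columns
e2 = foreach d2 generate flatten(group) as (mime, lang), COUNT(c2) as n;
store e2 into '$experimentfolder/lang_by_mime';
```

Note that COUNT skips tuples whose first field is null, so records without a MIME type would not be counted here; COUNT_STAR could be used instead if those should be included.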

For now, clone https://github.com/perdalum/warcbase and check out the pig-integration branch. Running the unit tests runs the Pig Latin script above on the provided gzipped ARC test file. The language distribution reported by Tika is:

Goals

Create UDFs for

  • language detection using Tika
  • identification using Tika