h1. Working plan

Pig example

{code}
-- Simple language detection example

register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
DEFINE DetectLanguage org.warcbase.pig.piggybank.DetectLanguage();

-- Load the ARC records
raw = load 'arcfile.arc' using ArcLoader() as (url:chararray, date:chararray, mime:chararray, content:chararray);

-- Extract the raw text of each record and detect its language
b = foreach raw generate url, mime, ExtractRawText(content) as content;
c1 = foreach b generate DetectLanguage(content) as lang;

-- Count the number of records per language
d = group c1 by lang;
e = foreach d generate group, COUNT(c1);

store e into 'e';
-- dump e;
{code}


For now, clone [https://github.com/perdalum/warcbase] and check out the pig-integration branch. Running the unit tests executes the Pig Latin script above against the provided gzipped test ARC file. The language distribution reported by Tika is:

{code}
[ca, 1]
[en, 68]
[et, 8]
[hu, 34]
[it, 3]
[lt, 143]
[no, 35]
[pt, 2]
[ro, 6]
{code}
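For readers less familiar with Pig Latin, the group and COUNT steps of the script amount to a tally per language code. The following is a minimal plain-Java sketch of that idea with no Pig or Tika dependency. The class name {{LanguageTally}} and the method {{tally}} are hypothetical and not part of warcbase, and the sample input is made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of warcbase: mirrors `group c1 by lang` + COUNT.
public class LanguageTally {

    // Returns the number of occurrences of each language code, sorted by code.
    public static Map<String, Long> tally(List<String> langs) {
        Map<String, Long> counts = new TreeMap<>();
        for (String lang : langs) {
            // Simplifying assumption: skip records with no detected language.
            // Pig itself would collect null keys into a group of their own.
            if (lang == null) {
                continue;
            }
            counts.merge(lang, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample of detected language codes, one per record.
        List<String> langs = List.of("en", "lt", "en", "hu", "lt", "lt");
        tally(langs).forEach((lang, n) -> System.out.println("[" + lang + ", " + n + "]"));
    }
}
```

The {{TreeMap}} is used only so that the printed pairs come out sorted by language code, in the same shape as the listing above.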

The UDF is added to the piggybank of the warcbase project through the following class, which uses Tika's LanguageIdentifier:

{code}
package org.warcbase.pig.piggybank;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.tika.language.LanguageIdentifier;
import java.io.IOException;

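/**
 * Pig UDF that returns the language code Tika detects for a text field.
 */
public class DetectLanguage extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for missing or empty input instead of failing the job
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }

        // The first field of the tuple holds the text to analyze
        String text = (String) input.get(0);
        return new LanguageIdentifier(text).getLanguage();
    }
}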
{code}

h1. Goals

Create UDFs for
* language detection using Tika
* identification using Tika