Current by Clemens Neudecker
on Dec 09, 2013 12:31.

h1. Working plan

Create UDFs for
* language detection using Tika
* MIME type identification using Tika

h2. Language detection

Pig example



For now, clone [https://github.com/cneud/warcbase] and check out the pig-integration branch. Running the unit tests executes the above Pig Latin script on the provided gzipped test ARC file. The language distribution reported by Tika is:

{code}
{code}

The UDF is added to the piggy bank of the warcbase project through the following class, which uses Tika's {{LanguageIdentifier}}:

{code}
package org.warcbase.pig.piggybank;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.tika.language.LanguageIdentifier;
import java.io.IOException;

public class DetectLanguage extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {

        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }

        String text = (String) input.get(0);
        return new LanguageIdentifier(text).getLanguage();
    }
}
{code}

h2. MIME type detection

{code}
package org.warcbase.pig.piggybank;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

import org.apache.tika.Tika;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.parser.AutoDetectParser;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DetectMimeType extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String content = (String) input.get(0);

        InputStream is = new ByteArrayInputStream(content.getBytes());
        DefaultDetector detector = new DefaultDetector();
        AutoDetectParser parser = new AutoDetectParser(detector);
        return new Tika(detector, parser).detect(is);
    }
}
{code}
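
One caveat about the class above: {{content.getBytes()}} encodes with the platform default charset, so the bytes handed to Tika may differ from the bytes stored in the archive when the content is binary. The following is a minimal, standard-library-only sketch of that effect. The class name {{ByteRoundTrip}} and the helper {{roundTrip}} are hypothetical and not part of warcbase, and whether {{ArcLoader}} decodes record bodies with a particular charset is not stated here, so treat this as an illustration rather than a fix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteRoundTrip {

    // Decode bytes to a String and encode them back with the same charset.
    static byte[] roundTrip(byte[] original, Charset charset) {
        return new String(original, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // First four bytes of the PNG signature; 0x89 is not valid UTF-8 on its own.
        byte[] pngMagic = {(byte) 0x89, 'P', 'N', 'G'};

        // ISO-8859-1 maps every byte value to exactly one character, so it round-trips.
        System.out.println(Arrays.equals(pngMagic, roundTrip(pngMagic, StandardCharsets.ISO_8859_1)));

        // UTF-8 replaces the invalid byte with U+FFFD, so the bytes change.
        System.out.println(Arrays.equals(pngMagic, roundTrip(pngMagic, StandardCharsets.UTF_8)));
    }
}
```

If the detected types for binary records look wrong, passing an explicit charset to {{getBytes}} that matches how the content was decoded may be worth trying.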

h1. Extended pig script

Combined script for MIME type and language detection:

{code}
-- Combined mime type check and language detection on an arc file
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

-- Load arc file properties: url, date, mime, content
raw = load '/tmp/IAH-20080430204825-00000-blackbook.arc.gz' using
org.warcbase.pig.ArcLoader() as (url: chararray, date:chararray, mime:chararray, content:chararray);

-- Detect the mime type of the content using tika
a = foreach raw generate url,mime,content,SUBSTRING(date,0,12) as date,org.warcbase.pig.piggybank.DetectMimeType(content) as tikaMime;

-- Select the HTML files (Tika-detected type text/html)
b = filter a by (tikaMime == 'text/html');

-- Strip the tags from the content
c = foreach b generate url,mime,tikaMime,date,org.warcbase.pig.piggybank.ExtractRawText(content) as txt;

-- Use tika to identify the language of the textual content
d = foreach c generate url,mime,tikaMime,date,org.warcbase.pig.piggybank.DetectLanguage(txt) as lang;

-- Group by language and date
e = group d by (lang,date);

-- Count the number of entries in the groups
f = foreach e generate $0, COUNT($1);

dump f;
{code}
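
To make the last steps of the script concrete, here is a small plain-Java sketch of what {{SUBSTRING(date,0,12)}} followed by the group and count computes. It assumes the ARC date is a 14-digit {{yyyyMMddHHmmss}} string, so the first 12 characters group records by minute. The class {{LangDateCount}}, its methods, the tab-separated key, and the sample records are all hypothetical and only serve as illustration; this is not how Pig executes the script.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LangDateCount {

    // Mirrors SUBSTRING(date, 0, 12): keep at most the first 12 characters.
    static String truncateDate(String date) {
        return date.substring(0, Math.min(12, date.length()));
    }

    // Mirrors steps e and f: group by (lang, date) and count each group.
    // Each record is {lang, date}; the "lang<TAB>date" key is illustrative only.
    static Map<String, Long> countByLangAndDate(List<String[]> records) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] r : records) {
            String key = r[0] + "\t" + truncateDate(r[1]);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical records, not taken from the test ARC file.
        List<String[]> records = Arrays.asList(
            new String[] {"en", "20080430204825"},
            new String[] {"en", "20080430204859"},
            new String[] {"da", "20080430204901"});
        countByLangAndDate(records)
            .forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```

With these sample records, the two English entries fall into the same minute and are counted together, while the Danish entry forms its own group.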