
Working plan
Pig example
For now, clone https://github.com/perdalum/warcbase and check out the pig-integration branch. Running the unit tests runs the above Pig Latin script on the provided gzipped test ARC file. The language distribution reported by Tika is:
The UDF is added to the piggy bank of the warcbase project by adding this class, which leverages Tika:
package org.warcbase.pig.piggybank;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.tika.language.LanguageIdentifier;

// Pig UDF that returns the language code Tika detects for a text field.
public class DetectLanguage extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // No text to inspect: return null.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String text = (String) input.get(0);
        return new LanguageIdentifier(text).getLanguage();
    }
}
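Once every record carries a language code, the language distribution mentioned above amounts to a count per code. The following is a minimal sketch of that tally in plain Java, without Pig or Tika on the classpath. The class name LanguageDistribution, the "unknown" label for records with no code, and the sample codes are all hypothetical and are not output from the test ARC file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LanguageDistribution {

    // Count how many records were assigned each language code.
    // A null code (no language available for the record) is tallied
    // under the hypothetical label "unknown".
    public static Map<String, Integer> tally(List<String> codes) {
        // TreeMap keeps the codes sorted alphabetically for stable output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String code : codes) {
            String key = (code == null) ? "unknown" : code;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample codes, for illustration only.
        List<String> codes = Arrays.asList("en", "da", "en", null, "da", "en");
        System.out.println(tally(codes));
    }
}
```

In the Pig setting the same count would presumably come from grouping the records on the UDF's result and counting each group; the sketch only illustrates what is being computed.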
Goals
Create UDFs for:
- language detection using Tika
- identification using Tika