
Working plan
Language detection
Pig example
For now, clone https://github.com/cneud/warcbase and check out the pig-integration branch. Running the unit tests executes the above Pig Latin script on the provided gzipped test ARC file. The language distribution reported by Tika is:
The UDF is added to the piggy bank of the warcbase project by adding this class, which leverages Tika.
MIME type detection
Goals
Create UDFs for
- language detection using Tika
- MIME type identification using Tika
Pig scripts
Combined script for MIME type and language detection
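As a rough illustration of what a combined script would compute, the following Java sketch tallies how often each detected label occurs across a set of records. It does not call Tika or Pig: the detector argument is a hypothetical stand-in for a Tika-backed UDF, and the class name DetectionTally, the method name distribution, and the toy detectors in main are assumptions made for this example only. The same helper serves both cases, since a language code and a MIME type are both plain string labels.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Sketch only: counts detected labels (language codes or MIME types) over records. */
final class DetectionTally {

    private DetectionTally() {}

    /**
     * Applies a detector to every record and counts how often each label occurs.
     * The detector is a hypothetical stand-in for a Tika-backed UDF; records for
     * which it returns null or an empty string are counted as "unknown".
     */
    static <R> Map<String, Long> distribution(List<R> records, Function<R, String> detector) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys give a stable report order
        for (R record : records) {
            String label = detector.apply(record);
            if (label == null || label.isEmpty()) {
                label = "unknown";
            }
            counts.merge(label, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy records and toy detectors, NOT real detection logic.
        List<String> pages = Arrays.asList("the quick brown fox", "der schnelle Fuchs", "the end");
        Function<String, String> toyLanguage = text -> text.startsWith("the ") ? "en" : "de";
        Function<String, String> toyMimeType = text -> "text/plain";

        System.out.println("language: " + distribution(pages, toyLanguage));
        System.out.println("MIME type: " + distribution(pages, toyMimeType));
    }
}
```

In a Pig script the same tally would typically be expressed by applying the UDF in a FOREACH, then grouping on the resulting label and counting each group.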