Working plan
Create Pig UDFs for:
- language detection using Tika
- MIME type identification using Tika
Language detection
Pig example
For now, clone https://github.com/cneud/warcbase and check out the pig-integration branch. Running the unit tests runs the above Pig Latin script on the provided gzipped test ARC file. The language distribution reported by Tika is:
The UDF is added to the piggybank of the warcbase project through a class that leverages Tika.
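A language distribution here is a count of records per detected language code, which the Pig script would produce by grouping on the UDF's output. The Java sketch below shows that tally on its own, without Pig or Tika on the classpath. The class name `LanguageTally`, the `unknown` bucket, and the sample codes are assumptions for illustration; the codes are placeholders and not results from the test ARC file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LanguageTally {

    /** Counts how many records were assigned each language code. */
    public static Map<String, Long> tally(List<String> detectedCodes) {
        // TreeMap keeps the codes sorted so the output order is stable.
        Map<String, Long> counts = new TreeMap<>();
        for (String code : detectedCodes) {
            // Assumption: records without a detected language are grouped under "unknown".
            String key = (code == null || code.isEmpty()) ? "unknown" : code;
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder codes for illustration only; not output from the test ARC file.
        List<String> codes = Arrays.asList("en", "da", "en", null, "de", "en");
        tally(codes).forEach((lang, n) -> System.out.println(lang + "\t" + n));
    }
}
```

In the Pig script, the equivalent step would be a GROUP on the language field followed by a COUNT per group.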
MIME type detection
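Content-based MIME detection generally works by comparing the first bytes of a record against known signatures ("magic bytes"), which is one of the signals Tika uses. The Java sketch below illustrates that idea with a handful of well-known signatures. It is a toy stand-in and not Tika's implementation or API: the class name `MagicSniffer` and the choice of signatures are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MagicSniffer {

    /** A byte prefix and the MIME type it indicates. */
    private static final class Signature {
        final byte[] prefix;
        final String mimeType;

        Signature(byte[] prefix, String mimeType) {
            this.prefix = prefix;
            this.mimeType = mimeType;
        }
    }

    // A small hand-picked subset of signatures; a real detector knows far more.
    private static final Signature[] SIGNATURES = {
        new Signature("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"),
        new Signature("GIF87a".getBytes(StandardCharsets.US_ASCII), "image/gif"),
        new Signature("GIF89a".getBytes(StandardCharsets.US_ASCII), "image/gif"),
        new Signature(new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A}, "image/png"),
        new Signature(new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF}, "image/jpeg"),
        new Signature(new byte[] {0x1F, (byte) 0x8B}, "application/gzip"),
    };

    /** Returns the MIME type of the first matching signature, or a generic fallback. */
    public static String detect(byte[] content) {
        for (Signature s : SIGNATURES) {
            if (startsWith(content, s.prefix)) {
                return s.mimeType;
            }
        }
        return "application/octet-stream";
    }

    // True if content begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] content, byte[] prefix) {
        if (content.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (content[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf));
    }
}
```

A UDF built on Tika would delegate this decision to the library rather than maintain its own signature table; the sketch only shows why the record content, and not just the declared type, is needed as input.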
Extended Pig script
Combined script for MIME type and language detection