Working plan

Create UDFs for

  • language detection using Tika
  • MIME type identification using Tika

Language detection

Pig example

For now, clone the repository and check out the pig-integration branch. Running the unit tests runs the above Pig Latin script on the provided gzipped test ARC file. The language distribution reported by Tika is:

The UDF is added to the warcbase project's piggy bank by adding this class, which leverages Tika.
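To make the idea of a "language distribution" concrete, here is a minimal sketch in plain Java of the kind of tally a group-and-count step would produce over detected language codes. It is not the warcbase UDF itself: the class name `LanguageDistribution`, the `count` method, and the toy detector in `main` are all hypothetical, and the detector is passed in as a function so that a Tika-backed implementation could take its place.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical helper, not part of warcbase or Tika.
class LanguageDistribution {

    /**
     * Counts how many texts fall under each detected language code.
     * The detector is supplied by the caller; null or empty texts are
     * skipped, and a null detection result is tallied as "unknown".
     */
    static Map<String, Long> count(List<String> texts,
                                   Function<String, String> detector) {
        // TreeMap keeps the language codes in sorted order.
        Map<String, Long> counts = new TreeMap<>();
        for (String text : texts) {
            if (text == null || text.isEmpty()) {
                continue; // nothing to detect on
            }
            String lang = detector.apply(text);
            if (lang == null) {
                lang = "unknown";
            }
            counts.merge(lang, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy stand-in for a real detector (assumption for illustration
        // only): texts containing "the" are labelled "en", the rest "fr".
        Function<String, String> toyDetector =
                t -> t.contains("the") ? "en" : "fr";

        List<String> texts =
                Arrays.asList("the cat", "le chat", "over the hill");
        System.out.println(count(texts, toyDetector)); // prints {en=2, fr=1}
    }
}
```

Keeping the detector as a parameter means the counting logic can be unit tested without an ARC file, and the toy function can later be replaced by a call into Tika.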

MIME type detection

Extended Pig script

Combined script for MIME type and language detection

