
Working plan

Create UDFs for

  • language detection using Tika
  • MIME type identification using Tika

Language detection

Pig example

For now, clone https://github.com/cneud/warcbase and check out the pig-integration branch. Running the unit tests executes the Pig Latin script above against the provided gzipped test ARC file. The language distribution reported by Tika is:
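Such a distribution presumably amounts to a count of records per detected language code. A minimal plain-Java sketch of that counting step, assuming the detector yields one language code per record; the class name `LanguageDistribution`, the `unknown` bucket and the sample codes are hypothetical and are not taken from the test ARC file:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LanguageDistribution {

    // Tally how often each language code occurs.
    // Null or empty codes (no language detected) go into an "unknown" bucket.
    public static Map<String, Long> count(List<String> codes) {
        Map<String, Long> counts = new TreeMap<>(); // sorted by language code
        for (String code : codes) {
            String key = (code == null || code.isEmpty()) ? "unknown" : code;
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical detector output, one code per record (illustration only)
        List<String> detected = Arrays.asList("en", "de", "en", null, "fr", "en");
        System.out.println(count(detected)); // prints {de=1, en=3, fr=1, unknown=1}
    }
}
```

In the Pig script the same result would typically come from grouping on the UDF's output and counting each group; the sketch only makes that step explicit outside of Pig.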

The UDF is added to the piggy bank of the warcbase project through this class, which leverages Tika:

MIME type detection

Extended Pig script

Combined script for MIME type and language detection
