
Working plan

Language detection

Pig example

For now, clone the repository and check out the pig-integration branch. Running the unit tests executes the Pig Latin script above against the provided gzipped test ARC file. The language distribution reported by Tika is:

The UDF is added to the piggy bank of the warcbase project through this class, which uses Tika for language detection.
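The language distribution mentioned above is a count of records per detected language code. A minimal sketch of that aggregation in plain Java follows, assuming each record has already been tagged with a language code by the UDF. The class name `LanguageDistribution` and the method `tally` are hypothetical and are not part of warcbase, Pig, or Tika; the sample codes are made up and are not results from the test ARC file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: illustrates the per-language count only.
public class LanguageDistribution {

    // Count how many records carry each language code.
    // A TreeMap keeps the codes sorted so the output order is stable.
    public static Map<String, Long> tally(List<String> languageCodes) {
        Map<String, Long> counts = new TreeMap<>();
        for (String code : languageCodes) {
            counts.merge(code, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input, one language code per record.
        List<String> sample = Arrays.asList("en", "fr", "en", "de", "en");
        tally(sample).forEach((lang, n) -> System.out.println(lang + "\t" + n));
    }
}
```

In a Pig script the same result would typically come from grouping the records by the language field and counting each group; the Java version only makes that step explicit.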

MIME type detection


Create UDFs for

  • language detection using Tika
  • MIME type identification using Tika

Pig scripts

Combined script for MIME type and language detection

Example output from above script: output.txt
