AQuA Mashup Tool List


List of Tools

The following list includes all of the tools examined or used by participants in the AQuA Mashups. You can either:

  • Add a brief tool description to the table on this page, or
  • Add a richer entry to the Registry (this is almost as quick, but much better!)
Tool Name | URL | Description / Use | Link to AQuA solution where the tool is used
File Utility http://www.darwinsys.com/file/
Open source file format identification utility, written in C and packaged with most Unix-like distributions.
Bugs: http://bugs.gw.com/ and also http://bugs.debian.org/cgi-bin/pkgreport.cgi?package=file
 
Sanselan
http://commons.apache.org/sanselan/ From site:  This Pure-Java library reads and writes a variety of image formats, including fast parsing of image info (size, color space, icc profile, etc.) and metadata. Might be handy! :-)
tiff2RDF - visualising image collection consistency
Extractor
http://planetarium.hki.uni-koeln.de/planets_cms/extractor
No doubt you all know about this, but I'm just adding them as I find them! :-) Extracts technical metadata from a number of file formats including images and audio!
 
JHOVE http://hul.harvard.edu/jhove/ Extracts properties from files and attempts to validate against format spec. Supports AIFF ASCII GIF HTML JPEG JPEG2000 PDF TIFF UTF-8 WAVE XML. jp2 header analysis

JHOVE2 https://bitbucket.org/jhove2/main/wiki/Home Successor to JHOVE. Integrates DROID. Supports ICC NetCDF SGML Shapefile TIFF UTF-8 WAVE XML.  
DROID http://sourceforge.net/apps/mediawiki/droid/ Identifies files based on internal 'magic' signatures, or file extension. Notes if these are inconsistent. GUI.
 
ExifTool
http://www.sno.phy.queensu.ca/~phil/exiftool/
Metadata/properties extraction (and editing) tool that supports dozens of formats, with an emphasis on image formats. Might well be the best properties extraction tool in existence, but strangely ignored by most of the digital preservation community ....
 
PSTViewTool http://pstviewtool.codeplex.com/ and http://pstsdk.codeplex.com/ Open source projects from Microsoft: a PST viewer and the underlying PST access library (C++)
 
libPST
http://www.five-ten-sg.com/libpst/ PST manipulation/migration library. See JvdK's comment here.

Unpaper http://unpaper.berlios.de/ For post-processing scans. Can spot rotation, black marks, etc. and so may be of diagnostic use. http://unpaper.berlios.de/unpaper.html#imagefiles  
b2x Translator
http://b2xtranslator.sourceforge.net/ doc/ppt/xls to docx/pptx/xlsx conversion tools from a Microsoft partner.
 
JDeskew http://www.jdeskew.com/
Open source Java deskewing library
 
PeDALS http://sourceforge.net/projects/pedalsemailextr/
Email message to XML file extractor for digital preservation created by the Persistent Digital Archives and Library System (PeDALS) research project  
FITS http://code.google.com/p/fits/ The File Information Tool Set (FITS) identifies, validates, and extracts technical metadata for various file formats. It wraps several third-party open source tools, normalizes and consolidates their output, and reports any errors. Includes JHOVE, DROID, file, and others.
EAP File Verification
Identify compressed TIFFs and convert them to uncompressed TIFFs
tiff2RDF - visualising image collection consistency


ODF Converter
http://odf-converter.sourceforge.net/ The goal for this project is to provide translators to allow for interoperability between applications based on ODF (OpenDocument) standards (currently ODF 1.1) and Microsoft OpenXML based Office applications. ... Along with the add-ins for Microsoft Word, Excel and PowerPoint, we also provide a command line translator that allows doing batch conversions. These translators can also be run on the server side for certain scenarios.  
ODF Toolkit
http://odftoolkit.org/
Includes a validator. Mostly Java with some .Net code too.
 
PDFtk http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/    
pdf2xml http://sourceforge.net/projects/pdf2xml/
See also http://discerning.com/hacks/docutils/pdf2xml/readme.html  
PDFSSA4MET http://code.google.com/p/pdfssa4met/ PDF Structure and Syntactic Analysis for Metadata Extraction and Tagging. PDFSSA4MET attempts to provide metadata extraction and tagging based on structural and syntactic analysis of content in XML.  
pdftohtml http://pdftohtml.sourceforge.net/
   
pstoedit http://www.pstoedit.net/pstoedit
   
Multivalent http://multivalent.sourceforge.net/    
JODConverter http://www.artofsolving.com/opensource/jodconverter JODConverter, the Java OpenDocument Converter, converts documents between different office formats.
It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.
 
pdf2svg http://www.cityinthesky.co.uk/opensource/pdf2svg    
Email Preservation Parser http://siarchives.si.edu/cerp/parserdownload.htm    
pHash http://www.phash.org/ The open source perceptual hash library. A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are "close" to one another if the features are similar.
See also http://stackoverflow.com/questions/596262/image-fingerprint-to-compare-similarity-of-many-images
Identifying rotated, duplicate images using pHash
Fiji
http://pacific.mpi-cbg.de/wiki/
Fiji is an image processing package. It can be described as a distribution of ImageJ together with Java, Java 3D and a lot of plugins organized into a coherent menu structure.
See also http://fly.mpi-cbg.de/~saalfeld/Projects/javasift.html, http://rsb.info.nih.gov/ij/plugins/mssim-index.html
 
getID3() http://getid3.sourceforge.net/
getID3() is a PHP library that extracts useful information from MP3s & other multimedia file formats. AQUAdio - characterization of user-generated audio field recordings
The GIMP
http://www.gimp.org GIMP is the GNU Image Manipulation Program. Identify compressed TIFFs and convert them to uncompressed TIFFs
Taverna
http://www.taverna.org.uk/ Taverna is an open source Workflow Management System. It consists of a suite of tools used to design and execute scientific workflows.  
Cue
https://github.com/jdf/cue.language A small Java library for simple text analysis - counting strings, identifying languages, and removing stop words. Used in futureArch's very simple word cloud generation.
 
Apache Tika http://tika.apache.org/
The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
Characterising Externally Generated Content
AQDC - Document Compare
Apache Lucene http://lucene.apache.org/java/docs/index.html
Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Characterising Externally Generated Content
Analysis of Lucene Index Word Frequency
Java Image Comparison
http://mindmeat.blogspot.com/2008/07/java-image-comparison.html
Basic image comparison for duplication/differences based on block by block comparison
java image blocks comparison
BWF MetaEdit http://bwfmetaedit.sourceforge.net/
For extracting file-specific metadata (sample rate, sample bit-rate). Audio Auditing Script
jHears
http://jhears.org/
For audio fingerprinting (which also relies on SoX). Both client and server software are required. Audio Auditing Script
Kakadu
http://www.kakadusoftware.com/
JPEG2000 software framework
jp2 header analysis
ssdeep
http://ssdeep.sourceforge.net/
ssdeep is a program for computing context triggered piecewise hashes (CTPH). Also called fuzzy hashes, CTPH can match inputs that have homologies. Such inputs have sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length. ssdeep uses a rolling hash algorithm, hence changes to the file will result in only localized changes in the CTPH signature. ssdeep for duplicate image detection
pdiff: Perceptual Image Difference utility http://pdiff.sourceforge.net/
Image comparison/differencing tool
Perceptual Image Diff comparison
ImageMagick
http://www.imagemagick.org
Bitmap image software suite
tiff2RDF - visualising image collection consistency
OpenJPEG
http://www.openjpeg.org/
The OpenJPEG library is an open-source JPEG 2000 codec written in C language. It has been developed in order to promote the use of JPEG 2000, the new still-image compression standard from the Joint Photographic Experts Group (JPEG). Validating TIFF to JPEG2000 migration
Compare OCR results of the same source material in different formats (TIFF, JP2)
Apache POI
http://poi.apache.org/
Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). Apache POI Office Document Analyser
tesseract-ocr http://code.google.com/p/tesseract-ocr/
OCR engine
Compare OCR results of the same source material in different formats (TIFF, JP2)
PDFbox
http://pdfbox.apache.org/
Java PDF library for creation, manipulation and content extraction of PDF documents. Detect, extract and analyse embedded objects in PDFs
PDF Characterisation Tool
itext
http://itextpdf.com/
PDF library for manipulation, content extraction and creation
PDF Characterisation Tool
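
The DROID entry above describes identifying files from internal 'magic' signatures and noting when the file extension disagrees. A minimal Python sketch of that idea follows. The signature table is a small hand-picked subset chosen for illustration (DROID itself uses a much larger signature registry), and the names SIGNATURES, identify and identify_file are hypothetical, not part of any tool listed here.

```python
from pathlib import Path

# Illustrative table: leading bytes -> (format label, usual extensions).
# This is an assumed, hand-picked subset, not the registry DROID uses.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG", {".png"}),
    (b"%PDF-", "PDF", {".pdf"}),
    (b"GIF87a", "GIF", {".gif"}),
    (b"GIF89a", "GIF", {".gif"}),
    (b"II*\x00", "TIFF", {".tif", ".tiff"}),  # little-endian TIFF
    (b"MM\x00*", "TIFF", {".tif", ".tiff"}),  # big-endian TIFF
    (b"\xff\xd8\xff", "JPEG", {".jpg", ".jpeg"}),
]


def identify(name, data):
    """Return (format, consistent) for a file name and its leading bytes.

    format is None when no signature matches; consistent says whether the
    extension is one of the usual extensions for the matched format.
    """
    ext = Path(name).suffix.lower()
    for magic, fmt, exts in SIGNATURES:
        if data.startswith(magic):
            return fmt, ext in exts
    return None, None


def identify_file(path):
    """Read the first bytes of a file on disk and identify it."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    return identify(str(path), head)
```

Calling identify_file on each file in a collection and listing the entries whose second value is False gives a quick report of mislabelled files.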

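The pHash entry above notes that perceptual hashes are "close" to one another when the content is similar. The sketch below illustrates that property with a simple average hash, which is a deliberately simpler technique than the algorithms in the pHash library and is not rotation invariant. It assumes the image is already decoded into a rectangular grid of grayscale values (0-255) at least 8 pixels wide and high; the function names are hypothetical.

```python
def average_hash(pixels, size=8):
    """Compute a size*size-bit average hash of a grayscale pixel grid.

    pixels is a rectangular list of rows of ints (0-255). The image must
    be at least `size` pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Downsample by averaging the pixels in each cell of a size x size grid.
    for gy in range(size):
        y0, y1 = gy * height // size, (gy + 1) * height // size
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between two hashes suggests near-duplicate images, while a cryptographic hash of the same two files would differ completely. The threshold for "small" is a tuning choice that depends on the collection.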