
List of Tools
The following list includes all of the tools examined or used by participants in the AQuA Mashups. You can either:
- Add a brief tool description to the table on this page, or
- Add a richer entry to The Registry (this is almost as quick, but much better!)
Tool Name | URL | Description / Use | Link to AQuA solution where the tool is used |
File Utility | http://www.darwinsys.com/file/ | Open source file format identification utility, written in C and packaged with every Unix-like distribution. Bugs: http://bugs.gw.com/ | |
Sanselan | http://commons.apache.org/sanselan/ | From site: This Pure-Java library reads and writes a variety of image formats, including fast parsing of image info (size, color space, icc profile, etc.) and metadata. Might be handy! :-) | tiff2RDF - visualising image collection consistency |
Extractor | http://planetarium.hki.uni-koeln.de/planets_cms/extractor | No doubt you all know about this one, but I'm just adding tools as I find them! :-) Extracts technical metadata from a number of file formats, including images and audio. | |
JHOVE | http://hul.harvard.edu/jhove/ | Extracts properties from files and attempts to validate them against the format specification. Supports AIFF, ASCII, GIF, HTML, JPEG, JPEG2000, PDF, TIFF, UTF-8, WAVE and XML. | jp2 header analysis |
JHOVE2 | https://bitbucket.org/jhove2/main/wiki/Home | Successor to JHOVE. Integrates DROID. Supports ICC, NetCDF, SGML, Shapefile, TIFF, UTF-8, WAVE and XML. | |
DROID | http://sourceforge.net/apps/mediawiki/droid/ | Identifies files by internal 'magic' signatures or by file extension, and notes when the two are inconsistent. Has a GUI. | |
ExifTool | http://www.sno.phy.queensu.ca/~phil/exiftool/ | Metadata/properties extraction (and editing) tool that supports dozens of formats, with an emphasis on image formats. Might well be the best properties extraction tool in existence, but strangely ignored by most of the digital preservation community... | |
PSTViewTool | http://pstviewtool.codeplex.com/ | Open source projects from Microsoft: a PST viewer and the underlying PST access library (C++). | |
libPST | http://www.five-ten-sg.com/libpst/ | PST manipulation/migration library. See JvdK's comment here. | |
Unpaper | http://unpaper.berlios.de/ | For post-processing scans. Can spot rotation, black marks, etc., and so may be of diagnostic use. See http://unpaper.berlios.de/unpaper.html#imagefiles | |
b2x Translator | http://b2xtranslator.sourceforge.net/ | doc/ppt/xls to docx/pptx/xlsx conversion tools from a Microsoft partner. | |
JDeskew | http://www.jdeskew.com/ | Open source Java deskewing library | |
PeDALS | http://sourceforge.net/projects/pedalsemailextr/ | Email-message-to-XML extractor for digital preservation, created by the Persistent Digital Archives and Library System (PeDALS) research project. | |
FITS | http://code.google.com/p/fits/ | The File Information Tool Set (FITS) identifies, validates, and extracts technical metadata for various file formats. It wraps several third-party open source tools, normalizes and consolidates their output, and reports any errors. Includes JHOVE, DROID, file, and others. | EAP File Verification; Identify compressed TIFFs and convert them to uncompressed TIFFs; tiff2RDF - visualising image collection consistency |
ODF Converter | http://odf-converter.sourceforge.net/ | The goal for this project is to provide translators to allow for interoperability between applications based on ODF (OpenDocument) standards (currently ODF 1.1) and Microsoft OpenXML based Office applications. ... Along with the add-ins for Microsoft Word, Excel and PowerPoint, we also provide a command line translator that allows doing batch conversions. These translators can also be run on the server side for certain scenarios. | |
ODF Toolkit | http://odftoolkit.org/ | Includes a validator. Mostly Java, with some .NET code too. | |
PDFtk | http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ | | |
pdf2xml | http://sourceforge.net/projects/pdf2xml/ | See also http://discerning.com/hacks/docutils/pdf2xml/readme.html | |
PDFSSA4MET | http://code.google.com/p/pdfssa4met/ | PDF Structure and Syntactic Analysis for Metadata Extraction and Tagging. PDFSSA4MET attempts to provide metadata extraction and tagging based on structural and syntactic analysis of content in XML. | |
pdftohtml | http://pdftohtml.sourceforge.net/ | | |
pstoedit | http://www.pstoedit.net/pstoedit | | |
Multivalent | http://multivalent.sourceforge.net/ | | |
JODConverter | http://www.artofsolving.com/opensource/jodconverter | JODConverter, the Java OpenDocument Converter, converts documents between different office formats. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today. | |
pdf2svg | http://www.cityinthesky.co.uk/opensource/pdf2svg | | |
Email Preservation Parser | http://siarchives.si.edu/cerp/parserdownload.htm | | |
pHash | http://www.phash.org/ | The open source perceptual hash library. A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are "close" to one another if the features are similar. See also http://stackoverflow.com/questions/596262/image-fingerprint-to-compare-similarity-of-many-images | Identifying rotated, duplicate images using pHash |
Fiji | http://pacific.mpi-cbg.de/wiki/ | Fiji is an image processing package. It can be described as a distribution of ImageJ together with Java, Java 3D and a lot of plugins organized into a coherent menu structure. See also http://fly.mpi-cbg.de/~saalfeld/Projects/javasift.html | |
getID3() | http://getid3.sourceforge.net/ | getID3() is a PHP library that extracts useful information from MP3s & other multimedia file formats | AQUAdio - characterization of user-generated audio field recordings |
The GIMP | http://www.gimp.org | GIMP is the GNU Image Manipulation Program. | Identify compressed TIFFs and convert them to uncompressed TIFFs |
Taverna | http://www.taverna.org.uk/ | Taverna is an open source Workflow Management System. It consists of a suite of tools used to design and execute scientific workflows. | |
Cue | https://github.com/jdf/cue.language | A small Java library for simple text analysis - counting strings, identifying languages, and removing stop words. Used in futureArch's very simple word cloud generation. | |
Apache Tika | http://tika.apache.org/ | The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. | Characterising Externally Generated Content; AQDC - Document Compare |
Apache Lucene | http://lucene.apache.org/java/docs/index.html | Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. | Characterising Externally Generated Content; Analysis of Lucene Index Word Frequency |
Java Image Comparison | http://mindmeat.blogspot.com/2008/07/java-image-comparison.html | Basic image comparison for detecting duplicates/differences, based on block-by-block comparison | java image blocks comparison |
BWF MetaEdit | http://bwfmetaedit.sourceforge.net/ | For extracting file-specific metadata (sample rate, sample bit-rate). | Audio Auditing Script |
jHears | http://jhears.org/ | For audio fingerprinting (which also relies on SoX). Both client and server software are required. | Audio Auditing Script |
Kakadu | http://www.kakadusoftware.com/ | JPEG2000 software framework | jp2 header analysis |
ssdeep | http://ssdeep.sourceforge.net/ | ssdeep is a program for computing context triggered piecewise hashes (CTPH). Also called fuzzy hashes, CTPH can match inputs that have homologies. Such inputs have sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length. ssdeep uses a rolling hash algorithm, hence changes to the file will result in only localized changes in the CTPH signature. | ssdeep for duplicate image detection |
pdiff: Perceptual Image Difference utility | http://pdiff.sourceforge.net/ | Image comparison/differencing tool | Perceptual Image Diff comparison |
ImageMagick | http://www.imagemagick.org | Bitmap image software suite | tiff2RDF - visualising image collection consistency |
OpenJPEG | http://www.openjpeg.org/ | The OpenJPEG library is an open-source JPEG 2000 codec written in C language. It has been developed in order to promote the use of JPEG 2000, the new still-image compression standard from the Joint Photographic Experts Group (JPEG). | Validating TIFF to JPEG2000 migration; Compare OCR results of the same source material in different formats (TIFF, JP2) |
Apache POI | http://poi.apache.org/ | Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). | Apache POI Office Document Analyser |
tesseract-ocr | http://code.google.com/p/tesseract-ocr/ | OCR engine | Compare OCR results of the same source material in different formats (TIFF, JP2) |
PDFbox | http://pdfbox.apache.org/ | Java PDF library for creation, manipulation and content extraction of PDF documents | Detect, extract and analyse embedded objects in PDFs; PDF Characterisation Tool |
itext | http://itextpdf.com/ | PDF library for manipulation, content extraction and creation | PDF Characterisation Tool |
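
The pHash entry in the table contrasts cryptographic hashes (small input changes give drastically different output) with perceptual hashes (similar content gives "close" hashes). The sketch below illustrates that contrast only. It is not pHash's own algorithm or API: it substitutes a much simpler "average hash" over a toy 4x4 grayscale grid, and the helper names (`average_hash`, `hamming`, `sha256_hex`) and the pixel values are hypothetical, chosen purely for illustration.

```python
import hashlib


def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, 1 if brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def sha256_hex(pixels):
    """Cryptographic hash of the raw pixel bytes, for comparison."""
    data = bytes(p for row in pixels for p in row)
    return hashlib.sha256(data).hexdigest()


# Hypothetical 4x4 grayscale images (values 0-255): dark left, bright right.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# The same picture, uniformly brightened by a small amount.
brightened = [[min(255, p + 3) for p in row] for row in original]
# A different picture: dark top, bright bottom.
different = [
    [10, 20, 15, 25],
    [12, 22, 18, 28],
    [200, 210, 205, 215],
    [202, 212, 208, 218],
]

h_orig = average_hash(original)
print("near-duplicate distance:", hamming(h_orig, average_hash(brightened)))
print("different-image distance:", hamming(h_orig, average_hash(different)))
print("SHA-256 equal for near-duplicate:",
      sha256_hex(original) == sha256_hex(brightened))
```

With these inputs the brightened copy keeps the same average hash as the original while its SHA-256 digest changes, and the unrelated image lands several bits away. A real tool would first scale and convert actual image files, and an average hash like this one does not by itself handle rotation.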