FIDO usage guide

Audience

Developers and Practitioners

Preface

This usage guide is written for both Windows (cmd.exe) and Linux/Apple. Where functionality or invocation differs between platforms, this is stated.
Example output is only included where it differs from the default.

Purpose of FIDO

The main (and only) purpose of FIDO is to identify the file format of a digital object (a computer file), hence the abbreviation: Format Identification for Digital Objects.

It is designed for both command-line use and integration into digital preservation workflow systems.

The most important information FIDO returns is the PUID (PRONOM Persistent Unique Identifier) of a file format. This PUID can be used in conjunction with file format registry software (such as Tessella's PRONOM Technical Registry) to determine what actions are needed to successfully preserve a file or how to access it.

FIDO uses PRONOM signatures to try to determine the file format. If a signature is specific enough, FIDO is able to return the format version as well (e.g. PDF 1.3).

Note: FIDO is not able to validate a file, i.e. check whether it conforms to its format specification. This is a separate kind of operation that requires file format validation software such as JHOVE. FIDO is also not able to extract metadata from a file; use FITS or the NLNZ metadata extraction tool for that.

Matchtypes

FIDO tries to identify a file by scanning the file with all available signatures. If it is able to determine the format based on a signature, the matchtype returned is "signature".

If a successful signature match shows that the file is a "container" type file (ZIP or OLE2), FIDO tries to determine what kind of container format it is, using the PRONOM container signature file. It then returns matchtype "container".

If the file type cannot be identified using a signature, FIDO uses the file extension to determine the file type. If successful, it returns all PUIDs that use this extension, with matchtype "extension". Please note that this result may be ambiguous. For example, a file with the extension doc might be a Microsoft Word document, but it might just as well be a WordPerfect file or a plain-text file.

If FIDO fails to identify the file format with a signature or file extension, matchtype "fail" is returned.
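
The order of matchtypes described above can be summarised in a few lines of Python. This is a simplified sketch and not FIDO's actual code: the function name `classify_match` and the placeholder identifiers are hypothetical, and it assumes that a container match replaces the plain signature match that triggered it.

```python
def classify_match(signature_puids, container_puids, extension_puids):
    """Return (matchtype, puids) following the order described in this guide.

    All three arguments are lists of PUID strings that an (imaginary)
    earlier step has already collected for one file.
    """
    if signature_puids:
        # A container signature refines the signature hit (ZIP or OLE2).
        if container_puids:
            return "container", list(container_puids)
        return "signature", list(signature_puids)
    if extension_puids:
        # Extension matching can yield several candidate PUIDs.
        return "extension", list(extension_puids)
    return "fail", []


# Placeholder identifiers, not real PUIDs.
print(classify_match([], [], ["example/1", "example/2"]))
```

An "extension" result with more than one PUID is a signal to treat the identification as ambiguous.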

| Matchtype | Description | Order | Accuracy |
|-----------|-------------|-------|----------|
| signature | the object has been identified with (a) PRONOM signature(s) | 1 | High |
| container | the object has been identified with (a) PRONOM container signature(s) | 2 | High |
| extension | the object could only be identified by file extension | 3 | Medium |
| fail | the object could not be identified with a signature or a file extension | 4 | Low |

Daily usage

Installation

Download the latest release of FIDO

Explanation of release tags

Every commit to the master branch of FIDO is considered an official release and is therefore tagged individually.

Next to Python code, FIDO releases also contain PRONOM signatures, which are distributed by TNA; each signature release has its own version number. To distinguish FIDO releases that differ only in the PRONOM signature version, the following scheme is in use from version 1.3.1 onwards.

For example, release 1.3.1 has PRONOM version 70 distributed with it and is tagged '1.3.1-70'.
If a PRONOM update is available but there are no code changes, the next tag will be '1.3.1-71'. Please note that this is only reflected in release tags; FIDO itself still reports only its version number, without the PRONOM version number.
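
If you need to compare release tags in a script, the scheme can be taken apart as follows. This is a minimal sketch; the helper names `split_release_tag` and `same_code_release` are hypothetical and not part of FIDO.

```python
def split_release_tag(tag):
    """Split a tag such as '1.3.1-70' into (fido_version, pronom_version)."""
    fido_version, sep, pronom = tag.partition("-")
    if not sep or not pronom.isdigit():
        raise ValueError(f"not a '<version>-<pronom>' tag: {tag!r}")
    return fido_version, int(pronom)


def same_code_release(tag_a, tag_b):
    """True if two tags differ at most in their PRONOM signature version."""
    return split_release_tag(tag_a)[0] == split_release_tag(tag_b)[0]


print(split_release_tag("1.3.1-70"))
print(same_code_release("1.3.1-70", "1.3.1-71"))
```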

Installation using setuptools

FIDO is distributed to be installed with the Python package setuptools.

To install FIDO, run python setup.py install in the root folder of the unzipped distribution.
After installation, you can run FIDO as a Python module from any folder.

Manual installation

You can install FIDO anywhere you like. Just unzip the package and follow instructions below.

It is assumed both the FIDO installation folder and the Python interpreter are on your path.

Running "fido" (without the .py extension) from anywhere

Windows

Make sure Python scripts are associated with the Python interpreter and the .py extension is added to PATHEXT.

Linux/Apple

Make sure the fido.py script is executable (chmod +x). The shebang in fido.py is used to invoke the Python interpreter. Additionally, if you want to run FIDO without using the .py extension, rename the file or create a symlink in your home folder.

Furthermore, if you plan on using update_signatures.py and toxml.py, make sure they are executable (chmod +x).

Show all command line options

Analysing a single file

Analysing a folder non-recursively

Analysing a folder recursively

Analysing ZIP and TAR contents

Note: 32-bit ZIP files only.

In the output of FIDO, the names of folders and files in a ZIP or TAR are appended to the name of the ZIP or TAR, separated by an exclamation mark.
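
When post-processing FIDO output, a name containing an exclamation mark can be split back into the archive and the member inside it. The sketch below uses a hypothetical helper `split_container_name` and made-up file names; it splits at the first exclamation mark, so it assumes the archive's own path does not contain one.

```python
def split_container_name(name):
    """Split 'archive!member' into (archive, member).

    Returns (name, None) when the name does not refer to archive contents.
    """
    container, sep, member = name.partition("!")
    return (container, member) if sep else (name, None)


# Made-up names, for illustration only.
print(split_container_name("archive.zip!docs/report.pdf"))
print(split_container_name("plain.txt"))
```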

Analysing a folder recursively and analyse ZIP and TAR contents

Updating signatures using the update script

This interactive script updates FIDO's signatures to the latest release of the PRONOM signatures. Container signatures, however, are not updated automatically; updates to container signatures are shipped with FIDO when available. The signature extension file is not updated automatically either.

Updating signatures can take a long time, depending on your connection speed and remote availability of the PRONOM website. This website is maintained by The National Archives UK (TNA). For more information, visit PRONOM.

If you interrupt the update process, or it is interrupted (e.g. due to network failure), running the update script again will resume the download of signatures.

The actual conversion of PRONOM signatures is done by prepare.py, which is called by the update script.

Under normal circumstances you never have to run prepare.py yourself, because it is invoked by the signature update script.
It converts PRONOM-type signatures to FIDO-type signatures, which are basically "normal" regular expressions. It uses the PRONOM ZIP file as defined in "conf/versions.xml".

Using a proxy with the update script

Command Line Interface proxy usage

Redirection of FIDO output and error messages

Command Line Trickery

Usage in workflow systems

Python workflow systems

[TR: FIDO Python workflow implementation tips]

Java workflow systems

FidoJavaWrapper by Danish Data Archive

FidoJavaWrapper implementation tips

Advanced usage

Using printformat to control output

If you want to alter the output of FIDO for matches and non-matches, there is no need to change the code. FIDO uses 'printf' formatting to control the output of matches. Whether you want to output XML or advanced CSV or TSV formatting, thanks to this powerful feature you are in control!

Invoke FIDO with 'matchprintf' to control the output of matches and 'nomatchprintf' for non-matches.

matchprintf

Available fields:

| Field | Description |
|-------|-------------|
| info.alias | File format alias (PRONOM property) |
| info.apple_uti | Apple Uniform Type Identifier (PRONOM property) |
| info.count | File X of Y total files |
| info.filename | Filename; relative or absolute, depending on how the name of the folder or file was passed |
| info.filesize | Filesize (bytes) |
| info.formatname | File format name (PRONOM property) |
| info.group_index | Index within the match group, in case of multiple results per file |
| info.group_size | Size of the match group, in case of multiple results per file |
| info.matchtype | Type of match: extension, signature or container |
| info.mimetype | MIME type: IANA MIME media type (PRONOM property) |
| info.puid | PUID: PRONOM Persistent Unique Identifier (PRONOM property) |
| info.signaturename | Name of signature (PRONOM property) |
| info.time | Time used in milliseconds to identify the file |
| info.version | File version (PRONOM property) |

Default: info.time, info.puid, info.formatname, info.signaturename, info.filesize, info.filename, info.mimetype, info.matchtype
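
To get a feel for how such a format string behaves, the sketch below fills a template with named fields. It assumes the template uses Python's `%(name)s` mapping syntax with the field names from the table above; the values are made up, and the template is an illustration rather than FIDO's built-in default.

```python
# Made-up values keyed by the field names from the table above.
fields = {
    "info.puid": "example/1",
    "info.formatname": "Example Format",
    "info.filename": "sample.bin",
    "info.matchtype": "signature",
}

# Tab-separated output with the textual fields enclosed in quotes.
template = (
    '%(info.puid)s\t"%(info.formatname)s"\t'
    '"%(info.filename)s"\t%(info.matchtype)s\n'
)

line = template % fields
print(line, end="")
```

Fields that are not named in the template are simply left out of the output, so the same mechanism covers anything from a single column to an XML fragment.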

nomatchprintf

Available fields:

| Field | Description |
|-------|-------------|
| info.count | File X of Y total files |
| info.filename | Filename; relative or absolute, depending on how the name of the folder or file was passed |
| info.filesize | Filesize (bytes) |
| info.matchtype | Type of match: always "fail" (only with nomatchprintf) |
| info.time | Time used in milliseconds to identify the file |

Default: info.time, info.filesize, info.filename, info.matchtype

Tip
Enclose textual fields in quotes when outputting CSV or TSV, because some fields may contain delimiters such as commas and tabs.
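
The effect of quoting is easy to check with Python's csv module. This sketch uses made-up values and a hypothetical helper `to_csv_line`; it writes one row with every textual field quoted and reads it back to confirm that the embedded comma does not split the field.

```python
import csv
import io


def to_csv_line(values):
    """Write one CSV row, quoting every non-numeric field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(values)
    return buffer.getvalue()


# Made-up row: time, identifier, format name, size, filename.
row = [12, "example/1", "Format, with comma", 2048, "dir/sample.bin"]
line = to_csv_line(row)
parsed = next(csv.reader(io.StringIO(line)))

print(line, end="")
print(parsed[2])
```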

Only show files that could be identified

Only show files that could not be identified

Using a pre-defined filelist instead of a path

Windows

Linux/Apple

Reading a filelist from a pipe

Windows

Linux/Apple

Reading from STDIN

To read from STDIN on both Windows and Linux/Apple, you cannot call fido as-is. You have to invoke Python in unbuffered mode (python -u). On Windows this is especially important, as this switch also puts STDIN, STDOUT and STDERR in "binary" mode.

Windows/Linux/Apple

Please note that reading from STDIN is a bit quirky!

First of all, the matchtypes that can be returned when reading from STDIN depend on the platform:

  • on Windows: only a "signature" or "container" match is possible; FIDO cannot fall back to extension matching, since data read from STDIN has no filename and therefore no extension.
  • on Linux (and possibly Apple): a "signature" or "container" match is possible; falling back to extension matching works only if Python is able to read the system STDIN file descriptor /proc/self/fd/0, otherwise no extension is available.

Reading from STDIN and passing a filename

There is, however, a command line option to pass the filename to FIDO while reading from STDIN. It more or less overcomes the general quirkiness of identifying a file from STDIN by making the filename available for a possible fallback to extension matching. This option might sound a bit silly ("why not just read the file from disk by passing the filename?"), but it has a real-world purpose.

Imagine having a container such as BagIt, or a compressed file using an esoteric compression scheme not supported by FIDO. Unpacking all files, saving them to disk to have them identified by FIDO, and deleting them afterwards would take much longer than streaming each file to STDIN. This is the reason why this option has been introduced.
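
The streaming idea can be sketched in Python without touching the disk. Everything below is an illustration under stated assumptions: `stream_member` is a hypothetical helper, an in-memory ZIP stands in for the unsupported container, and a tiny Python one-liner that counts bytes stands in for the FIDO command, whose exact invocation is not reproduced here.

```python
import io
import subprocess
import sys
import zipfile


def stream_member(archive, member, command):
    """Pipe one archive member to the STDIN of `command`; return its STDOUT."""
    with archive.open(member) as source:
        data = source.read()  # for very large members, copy in chunks instead
    result = subprocess.run(command, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout


# Stand-in for the identification command: it only counts the bytes on STDIN.
STAND_IN = [sys.executable, "-c", "import sys; print(len(sys.stdin.buffer.read()))"]

# Build a small archive in memory (made-up member name and content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docs/note.txt", b"hello")

with zipfile.ZipFile(buffer) as archive:
    out = stream_member(archive, "docs/note.txt", STAND_IN)

print(out.strip())
```

In a real workflow the member name would be passed along as the filename, so that extension matching remains available as a fallback.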

Windows/Linux/Apple

Signatures

Use PRONOM signatures only, not using 'format_extensions.xml'

By default, FIDO loads the file 'format_extensions.xml'. This file is meant for signatures which are not (yet) available through PRONOM. This file can also be used to add your own signatures. To only use PRONOM signatures, use the switch '-pronom_only'.

Adding your own signatures

Overriding signatures

Defining signature versions

Defining custom signature files

Defining "FIDO style" signature files

Defining "PRONOM container signature style" files

Defining formats during identification

Include formats during identification

Exclude formats during identification

Defining read buffers

In FIDO, a read buffer is the number of bytes read into memory to try to match a signature. In most cases a file format can be identified by reading the first x bytes; in other cases it is also necessary to read the last x bytes. FIDO loads the first and last x bytes of a file, where x is the read buffer size in bytes.

FIDO is configured to use a "ballpark figure" of 128 KB, but sometimes it might be necessary to increase the read buffers.
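
The idea of reading only the first and last part of a file can be illustrated as follows. This is a minimal sketch and not FIDO's implementation: `read_head_and_tail` is a hypothetical helper, and the default of 128 * 1024 bytes simply mirrors the figure mentioned above.

```python
import os
import tempfile


def read_head_and_tail(path, bufsize=128 * 1024):
    """Return (first bufsize bytes, last bufsize bytes) of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        head = handle.read(bufsize)
        # For files no larger than the buffer, the tail overlaps the head.
        handle.seek(max(size - bufsize, 0))
        tail = handle.read(bufsize)
    return head, tail


# Demonstration with a 10-byte file and a 4-byte buffer.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as handle:
        handle.write(b"0123456789")
    head, tail = read_head_and_tail(sample, bufsize=4)

print(head, tail)
```

A signature that sits beyond the buffered regions is never seen, which is why larger buffers can help with the subtype cases described in this section.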

A reason to increase the read buffer might be to identify subtypes of supertype file formats. A good example is Adobe Illustrator (.ai), which is a PDF 1.5 subtype. For more information, read the discussion on the FIDO GitHub issue page.

Increasing buffer size might or might not slow down the identification process. Increasing it too much, for example to several gigabytes (if you are planning to handle such big files to begin with), might cause FIDO or your system to crash. Of course a general and obvious rule of thumb is not to increase buffer sizes to the maximum available memory or even more than your system has available.

Buffer sizes are specified in bytes.

File buffers on Windows/Linux/Apple

Container file buffers on Windows/Linux/Apple

Disabling "deep scan" of container files

By default, when FIDO detects that a file is a container (compound) object (ZIP or OLE2), it starts a deep (complete) scan of the file using the PRONOM container signatures. When identifying big files, this behaviour can cause FIDO to slow down significantly. You can disable deep scanning by invoking FIDO with the '-nocontainer' argument. While disabling deep scan speeds up identification, it may reduce accuracy.

Using the "to XML" script

Use this script to convert the output of FIDO to an XML file. You can pipe FIDO's output to the script directly, or use the script afterwards. Note that toxml.py expects the default, pre-programmed output of matches and non-matches; see "Using printformat to control output".

Using toxml directly

Using toxml afterwards

Windows

Linux/Apple
