
h1. Audience
Developers and Practitioners

h1. Preface
This usage guide is written to be suitable for both Windows (cmd.exe) and Linux/Apple. Where functionality or invocation differs between platforms, this is stated explicitly.
Example output will only be added if it differs from default.

h1. Purpose of FIDO
The main (and only) purpose of FIDO is to identify the file format of a digital object (a computer file), hence the abbreviation *F* ormat *I* dentification for *D* igital *O* bjects.

It is designed for both command line usage and integration into Digital Preservation workflow systems.

The most important information FIDO returns is the PUID ([PRONOM Persistent Unique Identifier|]) of a file format. This PUID can be used in conjunction with file format registry software (such as Tessella's PRONOM Technical Registry) to determine what actions are needed to successfully preserve a file or how to access it.

FIDO uses [PRONOM|] signatures to try and determine the file format. If a signature is specific enough, FIDO is able to return the format version as well (e.g. PDF 1.3).

Note: FIDO does not validate files, i.e. it cannot check whether a file conforms to its format specification. Validation requires dedicated software such as JHOVE. FIDO also does not extract metadata from files; use FITS or the NLNZ metadata extraction tool for that.

h2. Matchtypes
FIDO tries to identify a file by scanning the file with all available signatures. If it is able to determine the format based on a signature, the matchtype returned is "signature".

If a successful signature match shows that the file is a "container" type file (ZIP or OLE2), FIDO tries to determine what kind of container format it is, using the PRONOM container signature file. It will then return matchtype "container".

If the file type cannot be identified using a signature, FIDO uses the extension to determine the file type. If successful, it will return all PUIDs that use this extension. Please note that this result might be ambiguous. For example, the extension {{doc}} might be a Microsoft Word document, but it might as well be a WordPerfect file or a plain-text file.

If FIDO fails to identify the file format with a signature or file extension, matchtype "fail" is returned.

|signature|the object has been identified with (a) PRONOM signature(s)|1|High|
|container|the object has been identified with (a) PRONOM container signature(s)|2|High|
|extension|the object could only be identified by file extension|3|Medium|
|fail|the object could not be identified by signature or by file extension|4|Low|
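
The order in which the matchtypes are tried can be sketched as follows. This is a minimal illustration of the fallback described above, not FIDO's actual code: the function {{classify}} and its three arguments are hypothetical, and the PUID values in the usage note are purely illustrative.

```python
def classify(signature_puids, container_puids, extension_puids):
    """Return (matchtype, puids) following the fallback order described above.

    Hypothetical helper: each argument is the list of PUIDs found by that
    kind of lookup (empty when nothing matched).
    """
    if signature_puids:
        # A container lookup is only triggered by a successful signature match.
        if container_puids:
            return "container", container_puids
        return "signature", signature_puids
    if extension_puids:
        # May be ambiguous: every PUID that uses the extension is returned.
        return "extension", extension_puids
    return "fail", []
```

For example, {{classify([], [], ["puid-a", "puid-b"])}} returns matchtype "extension" with both candidate PUIDs, which shows why an extension match deserves less trust than a signature match.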

h1. Daily usage

h2. Installation
[Download the latest release of FIDO|]

h3. Explanation of release tags
Every commit to the [master branch of FIDO|] is considered an official release and is therefore tagged individually.

Next to Python code, FIDO releases also contain PRONOM signatures, which are distributed by TNA, each release with its own version number. To distinguish FIDO releases that only update the PRONOM signature version, the following scheme is in use from version 1.3.1 onwards.

[major].[minor].[patch]-[PRONOM version]

For example, release 1.3.1 has PRONOM version 70 distributed with it and is tagged '1.3.1-70'.
If a PRONOM update is available but there are no code changes, the next tag will be '1.3.1-71'. Please note that this is only reflected in release tags; FIDO itself still reports only its version number, without the PRONOM version number.
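
If you need to compare release tags in a script, the scheme above can be parsed as in the following sketch. The names {{ReleaseTag}} and {{parse_release_tag}} are hypothetical and not part of FIDO. The PRONOM part is treated as optional here on the assumption that tags older than 1.3.1 do not carry it.

```python
import re
from typing import NamedTuple, Optional


class ReleaseTag(NamedTuple):
    major: int
    minor: int
    patch: int
    pronom: Optional[int]  # None when the tag has no PRONOM suffix


_TAG_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-(\d+))?$")


def parse_release_tag(tag):
    """Split a tag such as '1.3.1-70' into its numeric components."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError("not a release tag: %r" % tag)
    major, minor, patch, pronom = match.groups()
    return ReleaseTag(int(major), int(minor), int(patch),
                      int(pronom) if pronom is not None else None)
```

Because {{ReleaseTag}} is a tuple, two tags that both carry a PRONOM version can be compared directly: the tag for '1.3.1-71' sorts after the tag for '1.3.1-70'.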

h3. Installation using setuptools
FIDO is distributed to be installed with the Python package [setuptools|].

To install FIDO, run {{python install}} in the root folder of the unzipped distribution.
After installation, you can run FIDO as a Python module from any folder using
python -m

h3. Manual installation
You can install FIDO anywhere you like. Just unzip the package and follow instructions below.

It is assumed both the FIDO installation folder and the Python interpreter are on your path.

h3. Running "fido" (without the .py extension) from anywhere

h4. Windows
Make sure Python scripts are associated with the Python interpreter and the .py extension is added to PATHEXT.

h4. Linux/Apple
Make sure the script is executable (chmod +x). The shebang in is used to invoke the Python interpreter. Additionally, if you want to run FIDO without using the .py extension, rename the file or create a symlink in your home folder.

Furthermore, if you plan on using {{}} and {{}}, make sure they are executable (chmod +x).

h2. Show all command line options
{code}fido -h{code}

h2. Analysing a single file
{code}fido /path/to/file.ext{code}

h2. Analysing a folder non-recursively
{code}fido /path/to/analyse{code}

h2. Analysing a folder recursively
{code}fido -recurse /path/to/analyse{code}

h2. Analysing ZIP and TAR contents
{status:colour=red|title=32 bit zipfiles only}
{code}fido -zip /path/to/{code}

In the output of FIDO, the names of folders and files in a ZIP or TAR are appended to the name of the ZIP or TAR, separated by an exclamation mark, for example:
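
When post-processing FIDO output, such a combined name can be split back into the archive and the member inside it. The sketch below only assumes the exclamation mark separator described above; the function {{split_archive_name}} and the names in the usage note are hypothetical.

```python
def split_archive_name(name):
    """Split 'archive!member' at the first exclamation mark.

    Returns (archive, member); member is None when the name does not refer
    to a file inside a ZIP or TAR.
    """
    archive, separator, member = name.partition("!")
    if not separator:
        return name, None
    return archive, member
```

For a hypothetical name such as {{archive.zip!folder/file.txt}} this yields {{archive.zip}} and {{folder/file.txt}}. Note that a file whose own name contains an exclamation mark cannot be told apart from an archive member this way.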


h2. Analysing a folder recursively, including ZIP and TAR contents
{code}fido -recurse -zip /path/to/analyse{code}

h2. Updating signatures using the update script
This interactive script updates FIDO's signatures to the latest release of PRONOM signatures. Container signatures, however, are not updated automatically; updates to container signatures are pushed with FIDO when available. The signature extension file is also not updated automatically.

Updating signatures can take a long time, depending on your connection speed and remote availability of the PRONOM website. This website is maintained by The National Archives UK (TNA). For more information, visit [PRONOM|].

If the update process is interrupted, either by you or by a failure (e.g. a network failure), running the update script again will resume the download of signatures.

The actual conversion of PRONOM signatures is done by which is called by the update script.

Under normal circumstances you would never have to run this script because it is invoked by the signature update script.
It converts PRONOM type signatures to FIDO type signatures which are basically "normal" regular expressions. It uses the PRONOM zip file as defined in "conf/versions.xml".
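
To give an idea of what such a conversion involves, here is a heavily simplified sketch. It is not the conversion script itself: it only handles plain hexadecimal byte pairs and the single-byte wildcard {{??}}, and it assumes the sequence must match at the beginning of the file. The function name {{pronom_hex_to_regex}} is hypothetical.

```python
import re


def pronom_hex_to_regex(sequence):
    """Convert a simple PRONOM-style hex sequence into a compiled bytes regex.

    Supported tokens: two hex digits (one literal byte) and '??' (any byte).
    """
    if len(sequence) % 2 != 0:
        raise ValueError("sequence must consist of two-character tokens")
    parts = []
    for index in range(0, len(sequence), 2):
        token = sequence[index:index + 2]
        if token == "??":
            parts.append(b".")  # any single byte (DOTALL includes newlines)
        else:
            parts.append(re.escape(bytes([int(token, 16)])))
    # \A anchors the pattern at the beginning of the data.
    return re.compile(b"\\A" + b"".join(parts), re.DOTALL)
```

For example, the sequence {{25504446}} becomes a pattern that matches data starting with the bytes {{%PDF}}.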

h3. Using a proxy with the update script
[PT:Command Line Interface proxy usage]

h2. Redirection of FIDO output and error messages
[PT:Command Line Trickery]

h1. Usage in workflow systems

h2. Python workflow systems

[TR: FIDO Python workflow implementation tips]

h2. Java workflow systems
[FidoJavaWrapper by Danish Data Archive|]

[FidoJavaWrapper implementation tips]

h1. Advanced usage

h2. Using printformat to control output

If you want to alter the output of FIDO for matches and non-matches, there is no need to change the code. FIDO uses 'printf' formatting to control the output of matches. Whether you want to output XML or advanced CSV or TSV formatting, thanks to this powerful feature you are in control!

Invoke FIDO with 'matchprintf' to control the output of matches and 'nomatchprintf' for non-matches.

h3. matchprintf
Available fields:
|info.alias|File format alias (PRONOM property)|
|info.apple_uti|Apple Uniform Type Identifier: [Introduction to Uniform Type Identifiers Overview|] (PRONOM property)|
|info.count|File X of Y total files|
|info.filename|Filename; relative or absolute, depends on how the name of the folder or file was passed|
|info.filesize|Filesize (bytes)|
|info.formatname|File format name (PRONOM property)|
|info.group_index|Count of match group: in case of multiple results per file|
|info.group_size|Size of match group: in case of multiple results per file|
|info.matchtype|Type of match: extension, signature or container|
|info.mimetype|Mime type: [IANA MIME Media Types|] (PRONOM property)|
|info.puid|PUID: [PRONOM Persistent Unique Identifier|] (PRONOM property)|
|info.signaturename|Name of signature (PRONOM property)|
|info.time|Time used in milliseconds to identify the file|
|info.version|File version (PRONOM property)|

Default: info.time, info.puid, info.formatname, info.signaturename, info.filesize, info.filename, info.mimetype, info.matchtype
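
The following sketch shows the general idea of 'printf' formatting with named fields. It assumes that the format string uses Python's {{%(name)s}} mapping syntax with the field names listed above; check {{fido -h}} for the exact default string of your version. The helper {{render}} and all field values are hypothetical.

```python
def render(template, info):
    """Fill a printf-style template with the fields of one result."""
    return template % info


# Illustrative values only; not real FIDO output.
example_info = {
    "info.puid": "fmt/18",
    "info.formatname": "Example Format",
    "info.filesize": 1024,
    "info.matchtype": "signature",
}

# Textual fields are enclosed in quotes, numeric fields are not.
csv_template = '%(info.puid)s,"%(info.formatname)s",%(info.filesize)s\n'
csv_line = render(csv_template, example_info)
```

Fields that are not referenced in the template are simply ignored, and an empty template produces no output at all.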

h3. nomatchprintf
Available fields:
|info.count|File X of Y total files|
|info.filename|Filename; relative or absolute, depends on how the name of the folder or file was passed|
|info.filesize|Filesize (bytes)|
|info.matchtype|Type of match: always "fail" (only with nomatchprintf)|
|info.time|Time used in milliseconds to identify the file|

Default: info.time, info.filesize, info.filename, info.matchtype


Enclose textual fields in quotes when outputting in CSV or TSV format, because some fields may contain delimiters such as commas and tabs.
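
The effect of quoting can be seen when the output is read back. The sketch below uses a hypothetical output line with a comma inside the filename and compares a naive split with Python's standard {{csv}} module.

```python
import csv
import io

# Hypothetical output line: time, PUID and a quoted filename containing a comma.
line = '12,fmt/18,"Report, final version.pdf"\n'

# A naive split breaks the filename into two pieces.
naive_fields = line.strip().split(",")

# A CSV parser honours the quotes and keeps the filename intact.
parsed_fields = next(csv.reader(io.StringIO(line)))
```

The naive split yields four fields, while the CSV parser yields the intended three.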

h3. Only show files that could be identified
fido -nomatchprintf "" [...]

h3. Only show files that could not be identified
fido -matchprintf "" [...]

h2. Using a pre-defined filelist instead of a path

h3. Windows
dir /b > files.txt
fido -input files.txt

h3. Linux/Apple
find . -type f > files.txt
fido -input files.txt
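
If you prefer to build the filelist in a platform-independent way, a short Python script can do the same job. This is a sketch under the assumption that the list contains one path per line, as produced by the commands above; the function {{write_file_list}} is hypothetical.

```python
import os


def write_file_list(root, list_path):
    """Write the path of every file below root to list_path, one per line.

    Returns the number of paths written.
    """
    count = 0
    with open(list_path, "w", encoding="utf-8") as out:
        for folder, _subfolders, names in os.walk(root):
            for name in sorted(names):
                out.write(os.path.join(folder, name) + "\n")
                count += 1
    return count
```

The resulting file can then be passed to FIDO with {{fido -input files.txt}}. Write the list outside the folder being walked if you do not want the list itself to be included.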

h2. Reading a filelist from a pipe

h3. Windows
dir /b | fido -input -

h3. Linux/Apple
find . -type f | fido -input -

h2. Reading from STDIN
To read from STDIN on both Windows and Linux/Apple, you cannot call {{fido}} as-is. You have to invoke Python in unbuffered mode ({{python -u}}). On Windows this is especially important, as this switch also puts STDIN, STDOUT and STDERR in "binary" mode.

h3. Windows/Linux/Apple
python -u /path/to/ - <[/path/to/file.ext | binary stream (socket, HTTP, ...)]

*Please note that reading from STDIN is a bit quirky!*

First of all, reading from STDIN can only return a match of type "signature" or "container" (and on Linux/Apple possibly also "extension"):
* on Windows: only signature or container matches are possible. FIDO cannot fall back to extension matching, because data read from STDIN has no filename and therefore no extension.
* on Linux (and possibly Apple): signature or container matches are possible. Falling back to extension matching works only if Python is able to read the system STDIN file descriptor {{/proc/self/fd/0}}; otherwise no extension is available.

h2. Reading from STDIN and passing a filename

There is, however, a command line option to pass the filename to FIDO while reading from STDIN. It more or less overcomes the general quirkiness of identifying a file from STDIN by making the filename available for a possible fallback to extension matching. This option might sound a bit silly ("why not just read the file from disk by passing the filename?"), but it has a real-world purpose.

Imagine having a container such as BagIt, or a file compressed with an esoteric compression scheme not supported by FIDO. Unpacking all files, saving them to disk, having them identified by FIDO and deleting them afterwards would take much longer than streaming each file to STDIN. This is why the option was introduced.

h3. Windows/Linux/Apple
python -u /path/to/ -filename "file.ext" - <[/path/to/file.ext | binary stream (socket, HTTP, ...)]
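
The streaming scenario described above can be sketched in Python as follows. The example streams the members of a TAR file, chosen only because Python can read it without extra libraries. The three function names and the {{fido_script}} argument (the path to the FIDO script on your system) are hypothetical; the command itself uses only the invocation shown above.

```python
import subprocess
import sys
import tarfile


def build_stdin_command(fido_script, filename):
    """Build the 'python -u <script> -filename <name> -' command as a list."""
    return [sys.executable, "-u", fido_script, "-filename", filename, "-"]


def iter_tar_members(fileobj):
    """Yield (name, content) for every regular file in a TAR stream."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


def identify_stream(fido_script, filename, data):
    """Send data to FIDO via STDIN and return whatever it prints."""
    result = subprocess.run(build_stdin_command(fido_script, filename),
                            input=data, stdout=subprocess.PIPE, check=False)
    return result.stdout.decode("utf-8", errors="replace")
```

A workflow would call {{identify_stream}} once for every pair yielded by {{iter_tar_members}}, so no member ever has to be written to disk. Reading each member fully into memory keeps the sketch short; for very large members you would stream the data in chunks instead.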

h2. Signatures

h3. Use PRONOM signatures only, not using 'format_extensions.xml'
By default, FIDO loads the file 'format_extensions.xml'. This file is meant for signatures which are not (yet) available through PRONOM. This file can also be used to add your own signatures. To only use PRONOM signatures, use the switch '-pronom_only'.
fido -pronom_only [...]

h3. Adding your own signatures

h3. Overriding signatures

h3. Defining signature versions

h3. Defining custom signature files

h4. Defining "FIDO style" signature files

h4. Defining "PRONOM container signature style" files

h3. Defining formats during identification

h4. Include formats during identification

h4. Exclude formats during identification

h2. Defining read buffers
In FIDO, a read buffer is the number of bytes read into memory to try and match a signature. In most cases, a file format can be identified by reading the first *x* bytes; in other cases it is also necessary to read the last *x* bytes. FIDO therefore loads the first and last *x* bytes of a file, where *x* is the size of the read buffer.
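
The idea of reading only the first and last *x* bytes can be sketched as follows. This is an illustration of the concept, not FIDO's implementation; the function {{read_head_and_tail}} is hypothetical and its default of 128 * 1024 bytes is only an assumed interpretation of the default buffer size.

```python
import os


def read_head_and_tail(path, bufsize=128 * 1024):
    """Return the first and last bufsize bytes of a file.

    For files smaller than bufsize, both values contain the whole file.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        head = handle.read(bufsize)
        # Jump to the start of the final block, but never before offset 0.
        handle.seek(max(size - bufsize, 0))
        tail = handle.read(bufsize)
    return head, tail
```

Signatures anchored at the beginning of a file are then matched against the first value and signatures anchored at the end against the second, which is why a pattern located deeper in the file than the buffer size cannot be found.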

FIDO is configured to use a "ballpark figure" of 128 kb, but sometimes it might be necessary to increase the read buffers.

A reason to increase the read buffer might be to identify subtypes of supertype file formats. A good example is Adobe Illustrator (.ai), which is a PDF 1.5 subtype. For more information, [read the discussion on the FIDO GitHub issue page|].

Increasing the buffer size might or might not slow down the identification process. Increasing it too much, for example to several gigabytes (if you are planning to handle such big files to begin with), might cause FIDO or your system to crash. As a rule of thumb, never set a buffer size close to or above the amount of memory your system has available.

Buffer sizes are specified in bytes.

h3. File buffers on Windows/Linux/Apple
fido -bufsize X [...]

h3. Container file buffers on Windows/Linux/Apple
fido -container_bufsize X [...]

h2. Disabling "deep scan" of container files
By default, when FIDO detects that a file is a container (compound) object (ZIP or OLE2), it will start a deep (complete) scan of the file using the PRONOM container signatures. When identifying big files, this behaviour can cause FIDO to slow down significantly. You can disable deep scanning by invoking FIDO with the '-nocontainer' argument. While disabling deep scan speeds up identification, it may reduce accuracy.

fido -nocontainer [...]

h2. Using the "to XML" script
Use this script to convert the output of FIDO to an XML file. This can be done by piping the result to the script directly. It is also possible to use the script afterwards. Note that {{}} expects the default pre-programmed output of matches and non-matches, see [using printformat to control output|FIDO usage guide#Using printformat to control output].

h3. Using toxml directly
fido [...] | toxml >xmlfile.xml

h3. Using toxml afterwards

h4. Windows
type fido_result.csv | toxml >xmlfile.xml

h4. Linux/Apple
cat fido_result.csv | toxml >xmlfile.xml