Identify compressed TIFFs and convert them to uncompressed TIFFs

One line summary Given a list of TIFF images, some of which are compressed as "Group 4 Fax" TIFF images, identify the compressed ones and convert them to uncompressed TIFFs. The compression causes issues in some application contexts, so it may be necessary to remove it from a large input set of TIFF images.

To achieve this, a workflow was created using the Taverna Workflow Design and Execution workbench. The workflow uses FITS (File Information Toolset) to identify which TIFF images are compressed with the "Group 4 Fax" compression scheme, and then converts these images using "The GIMP", an open source image manipulation tool, to remove the compression.
Detailed description The diagram below provides an overview of the components that are used in the workflow.



Notes in order to understand the diagram

The green boxes are operating nodes that apply a characterisation or file format conversion to a file.

The purple boxes are input parameters on the top of the diagram and output results on the bottom of it.

The violet boxes, such as "readFile, Read_Text_File, Flatten_List, Get_Image_From_URL", are local services that are available in Taverna by default, so they can be applied in a wide variety of data analysis and conversion contexts.

The dark blue boxes are so-called splitters, which create input slots that parameters can be connected to. The slots are derived from XML descriptions of these parameters, for example those available in a Web Services Description Language (WSDL) document.

Finally, the brown boxes (URL2List, Beanshell) are so-called Beanshells: customisable components that define their own input and output parameters and process them using a Java scripting language. External Java libraries can also be used by making them available to Taverna and declaring the dependency on the library for the Beanshell.

The "Get_list_of_images" component has a surrounding box because it is a nested workflow inside the containing workflow.

Parameters of the workflow

The workflow has various parameters that configure the workflow run:

"url_to_textfile_with_image_urls" is the URL of a text file containing URL references to the images that should be processed (a kind of batch process).

"csresult_regex" is a regular expression that is used to identify the compression scheme. For example, the expression .*Group 4 Fax.* finds the items for which FITS identified the compression scheme T6/Group 4 Fax.

"convert_compression" is an integer that indicates the compression scheme to apply when converting the images identified by the regular expression just mentioned. The possible values are None (0), LZW (1), PACKBITS (2), DEFLATE (3), JPEG (4), CCITT G3 Fax (5) and CCITT G4 Fax (6), so 0 removes the compression.

"convert_numcolors" is the number of colours that the target image should have.

Workflow execution

For batch processing, the workflow takes as input a URL reference to a text file that contains a list of URL references to the TIFF image files:

http://<someserver>/000001.tif
http://<someserver>/000002.tif
http://<someserver>/000003.tif
etc.
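
As a rough illustration of this input format, the following Java sketch splits the content of such a text file into a list of image URLs. It is not the code of the workflow's URL2List Beanshell; the class name ImageUrlList and the decision to skip blank lines and trim whitespace are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the workflow: turns the text file content
// (one URL reference per line) into a list of URL strings.
class ImageUrlList {

    static List<String> parse(String textFileContent) {
        List<String> urls = new ArrayList<>();
        // \R matches any line terminator (\n, \r\n, ...); requires Java 8 or later.
        for (String line : textFileContent.split("\\R")) {
            String trimmed = line.trim();
            // Assumption: blank lines carry no URL and are ignored.
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }
}
```

Each entry of the resulting list would correspond to one image that Taverna's list handling passes on individually.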

Taverna's list handling then hands these images over one by one to the FITS operation "characteriseFile", which tries to identify the file format and some file properties. It creates an XML description of the identification result, based on the set of identification tools that FITS uses (FITS wraps e.g. Droid and Jhove 1, among others, and normalizes the characterisation output).
The "Read_Text_File" component reads the XML identification result and uses an XPath expression to extract the compression scheme property value:

/default:fits/default:metadata/default:image/default:compressionScheme

In the example setting, most of the images have the compression scheme value "uncompressed", and some have the value "T6/Group 4 Fax".
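
The extraction step can be sketched in plain Java as follows. This is only an illustration of evaluating the XPath expression above, not the component used in the workflow. The class name CompressionSchemeExtractor is hypothetical, and the namespace URI bound to the "default" prefix is an assumption about the FITS output namespace that should be checked against an actual FITS result.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the compression scheme from a FITS XML result.
class CompressionSchemeExtractor {

    // Assumption: namespace URI declared on the root element of FITS output.
    static final String FITS_NS = "http://hul.harvard.edu/ois/xml/ns/fits/fits_output";

    static final String XPATH =
        "/default:fits/default:metadata/default:image/default:compressionScheme";

    static String extract(String fitsXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(fitsXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "default" prefix used in the expression to the FITS namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "default".equals(prefix) ? FITS_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });
            // Returns an empty string if the element is missing.
            return xpath.evaluate(XPATH, doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read FITS result", e);
        }
    }
}
```

The parser has to be namespace-aware, because otherwise the prefixed steps in the expression would not match any element.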
   
The intention is to identify those images that have the compression scheme value "T6/Group 4 Fax", therefore the "Beanshell" component is used to determine the images that have this property.

The Beanshell component has the characterisation results list charactres_in_list and the images list tiff_images_in_list as input and picks out those where the regular expression csresult_regex matches, e.g. the expression .*Group 4 Fax.* can be used.

This is the Java code snippet that is used in the Beanshell to select the "Group 4 Fax" compressed images.

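// Keep only the images whose characterisation result matches csresult_regex.
// Both input lists are expected to have the same length and order.
List tiff_images_out_list = new ArrayList();
for (int i = 0; i < tiff_images_in_list.size(); i++) {
        String item = (String) tiff_images_in_list.get(i);
        String charres = (String) charactres_in_list.get(i);
        if (charres.matches(csresult_regex)) {
            tiff_images_out_list.add(item);
        }
}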
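
Outside of Taverna, the same filtering logic can be tried as a small typed Java method. This is a sketch for experimenting with the regular expression; the class name CompressedImageFilter and the length check are additions made for this example and are not part of the workflow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone version of the Beanshell filter above.
class CompressedImageFilter {

    // Returns the images whose characterisation result at the same index
    // matches the given regular expression.
    static List<String> filter(List<String> images, List<String> results, String regex) {
        if (images.size() != results.size()) {
            throw new IllegalArgumentException("Both lists must have the same length");
        }
        List<String> matching = new ArrayList<>();
        for (int i = 0; i < images.size(); i++) {
            // String.matches requires the whole string to match the expression.
            if (results.get(i).matches(regex)) {
                matching.add(images.get(i));
            }
        }
        return matching;
    }
}
```

Because String.matches tests the entire string, the leading and trailing .* in .*Group 4 Fax.* are needed. Note also that . does not match line terminators by default, so a multi-line characterisation result would need the (?s) flag in the expression.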

The output list of the Beanshell component then contains only those images that have the "Group 4 Fax" compression scheme. These images are handed over to the operation convertTIFFtoTIFFByURL, a conversion service based on "The GIMP" image manipulation tool. This service is configured by the convert_compression and convert_numcolors parameters. In this scenario, convert_compression is set to 0 (NONE) and the number of colours is set to 2 (bitonal).

The GIMP service uses a Java wrapper which executes GIMP on the command line.

To execute the command, the Java class ProcessBuilder is used, which takes a string array as the command.
The following array of command strings is an example of a GIMP command that can be handed over to the ProcessBuilder for execution:

/usr/bin/gimp
--verbose
-c
-i
-d
-b
(convertTIFFtoTIFF "/tmp/tmpfilefromurl4002680093931603769.tmp" "/tmp/tmpfilefromurl4002680093931603769.tmp.out.tiff"  2 0)
(gimp-quit 0)

where /usr/bin/gimp is the GIMP executable, --verbose shows startup messages, -c prints warnings and errors on the console instead of in dialog boxes, -i means that no user interface is required, -d means that no data files (brushes, patterns, gradients, palettes) are loaded, and -b runs the commands that follow in batch mode. The "convertTIFFtoTIFF" script is then called with four parameters: the input file, the output file, the number of colours and the compression scheme to be used (0 := NONE). The Java wrapper takes care of handing the parameters from the workflow layer (Taverna) down to the Script-Fu command execution layer. Finally, (gimp-quit 0) exits the batch process.
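
A minimal sketch of how such a wrapper might assemble and run the command is shown below. It is not the actual wrapper code of the service. The class GimpCommand and its method names are hypothetical, the argument order simply mirrors the array listed above, and the sketch assumes that the file paths contain no double quotes or backslashes, since it does not escape them for the Script-Fu string literals.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper sketch: builds the GIMP batch command and executes it.
class GimpCommand {

    // Builds the command array in the same order as the example above.
    static List<String> build(String gimpExecutable, String inFile, String outFile,
                              int numColors, int compression) {
        List<String> command = new ArrayList<>();
        command.add(gimpExecutable);
        command.add("--verbose");
        command.add("-c");
        command.add("-i");
        command.add("-d");
        command.add("-b");
        // Assumption: paths need no escaping inside the Script-Fu string literals.
        command.add("(convertTIFFtoTIFF \"" + inFile + "\" \"" + outFile + "\" "
                + numColors + " " + compression + ")");
        command.add("(gimp-quit 0)");
        return command;
    }

    // Starts the process, forwards its output to the console and
    // returns the exit code once GIMP has finished.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

Passing each argument as a separate list element means that no shell quoting is needed, because ProcessBuilder hands the strings to the executable unchanged.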

The following Script-Fu code (Script-Fu is the GIMP scripting language) shows the source of the convertTIFFtoTIFF script, which does the actual image conversion:

; Copyright (C) 2011
; Author Sven Schlarb <shsschlarb-aqua@yahoo.de>

; convertTIFFtoTIFF
;   infile        STRING   Name of file to be loaded
;   outfile       STRING   Name of file to be saved
;   num-colors    INT32    Default: 256, The number of colors to quantize to
;   compression   INT32    Switch integer, Compression type: {None (0), LZW (1), PACKBITS(2), DEFLATE (3), JPEG (4), CCITT G3 Fax (5), CCITT G4 Fax (6)}

(define (convertTIFFtoTIFF infile outfile num-colors compression)
        (let* ((image (car (file-tiff-load 1 infile infile)))
          (drawable (car (gimp-image-active-drawable image)))
         )

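         ; flatten image if it has an alpha channel
         ; (PDB calls return a list, so the result is unwrapped with car before it is tested)
         (if (= (car (gimp-drawable-has-alpha drawable)) TRUE)
                (set! drawable (car (gimp-image-flatten image)))
         )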

         ; only convert to indexed if the original image is not already an indexed image
         (if (= (car (gimp-drawable-is-indexed drawable)) FALSE)
                (gimp-convert-indexed
                         image        ;  image         IMAGE    The image
                         0       ;  dither-type   INT32    Dither type { NO-DITHER (0), FS-DITHER (1), FSBLOWBLEED-DITHER (2), FIXED-DITHER (3)}
                         0      ;  palette-type  INT32    Palette type { MAKE-PALETTE (0), WEB-PALETTE (2), MONO-PALETTE (3), CUSTOM-PALETTE (4)}
                         num-colors   ;  num-cols      INT32    Default: 256, The number of colors to quantize to
                         FALSE        ;  alpha-dither  INT32    Default: 0, Dither transparency to fake partial opacity, Boolean integer, 0: No, 1: Yes
                         TRUE         ;  remove-unused INT32    Default: 0, Remove unused or duplicate color entries from final palette, Boolean integer, 0: No, 1: Yes
                         ""           ;  palette       STRING   The name of the custom palette to use, ignored unless (palette_type == GIMP_CUSTOM_PALETTE)
                )
     )

         ; file-tiff-save (Saves files in tiff file format)
         (file-tiff-save
                 1            ;   run-mode     INT32     Interactive, non-interactive
                 image        ;   image        IMAGE     Input image
                 drawable     ;   drawable     DRAWABLE  Drawable to save
                 outfile      ;   filename     STRING    file name to save
                 outfile      ;   raw-filename STRING    file name to save
                 compression  ;   compression  INT32     Compression type: {None (0), LZW (1), PACKBITS(2), DEFLATE (3), JPEG (4), CCITT G3 Fax (5), CCITT G4 Fax (6)}
                 )
        )
)
(script-fu-register
   "convertTIFFtoTIFF"
   "<Toolbox>/Xtns/Script-Fu/aqua/convertTIFFtoTIFF"
   "Convert TIFF to TIFF"
   "AQuA"
   "Copyright 2011"
   "2011-06-15"
   ""
   SF-FILENAME "Infile"        "infile.tiff"
   SF-FILENAME "Outfile"       "outfile.tiff"
   SF-VALUE    "num-colors"    "256"
   SF-VALUE    "compression"    "0"
)

Note that to make scripts available to GIMP, you have to "refresh scripts" in GIMP, even if you are only using the command line; otherwise GIMP will not be aware of the new script.

Finally, the operation characteriseFile uses FITS again to characterise the conversion result and verify that the compression has been removed correctly.

The workflow can be downloaded from myExperiment. Some of the services are password protected; please write an email to <shsschlarb-aqua@yahoo.de> if you would like to try out the workflow.
Solution champion
Sven Schlarb <shsschlarb-aqua@yahoo.de>
myExperiment link http://www.myexperiment.org/workflows/2174
Evaluation
  • Developed in Taverna, so proof of concept, rather than code that is ready to use.
  • However, demonstrates concepts well and provides a guide for a programmer to realise a similar solution
  • Will be fed into the SCAPE Project which will ultimately provide preservation workflows for users. See SCAPE Project.
  • Approach supports easy generation of a proof of concept using scripting, that can then be productised if the approach is successful
  • Documented approach should support easy reuse of components
  • Collection/Issue champion will implement within their workflow for this particular collection using Python scripts
  • Wider application of other components for use in other workflows
Tool (link) Taverna Workflow Design and Execution workbench, FITS (File Information Toolset), The GIMP
Issue BOPCRIS issue - Mix of compressed and uncompressed TIFFS
Labels: fits, format, characterisation, migration, conversion, gimp, fu-script, taverna, solution, quality_assurance, identification, aqua