
Below is an e-mail exchange between Ross Spencer (The National Archives) and Johan van der Knijff (most recent message at the top). Relevance: Ross' feedback points to some useful enhancements and additions for future versions of jpylyzer.



From: Johan van der Knijff
Sent: 15 December 2011 13:18
To: 'Spencer, Ross'
Cc: Gollins, Tim; Owens, Chris; 'Wheatley, Paul'; 'Jackson, Andrew'; 'Wilson, Carl'
Subject: RE: JP2 validator / properties extractor [UNCLASSIFIED]

Hi Ross,

OK, thanks for clearing that up! Would it perhaps be useful to include an option that extracts any embedded data, such as ICC profiles and the contents of XML and UUID boxes to file (exifTool already does something similar for ICC profiles)?

Checking for well-formedness of data in XML boxes is something I might add; not so sure about validation (could really slow the tool down in case of references to schemas that have to be accessed over Internet).

That would leave you the option to extract these to separate files and subsequently use any tool you like to validate/further process them. This is something I already had in mind for some time, and it’s something that could be implemented easily. For XML boxes it might also be feasible to add their contents directly to the output file (rather than to a separate profile), but I’ll have to see what’s the best solution.
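A well-formedness check of the kind mentioned above could be as small as the following sketch. It is only an illustration, not jpylyzer code: the function name `is_well_formed` is made up here, and it assumes the XML box payload has already been extracted as bytes. It uses Python's standard `xml.etree.ElementTree` parser, which checks well-formedness only and does no schema validation.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if xml_bytes parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        # Any syntax problem (unclosed tag, bad nesting, ...) ends up here.
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed(b"<metadata><title>Scan</title></metadata>"))
    print(is_well_formed(b"<metadata><title>Scan</metadata>"))
```

Validation against a schema would still be left to a separate tool further down the workflow.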

Also, the UUID boxes in particular can be used to embed pretty much anything. I remember Rob Buckley once told me of an institution that used this box to embed PDF inside JP2 (probably an extreme example). With a generic profile extraction option like the one I had in mind you could easily extract stuff like this.
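To make the extraction idea concrete, here is a rough sketch of walking the top-level boxes of a JP2 and collecting the payloads of XML and UUID boxes. It is not taken from jpylyzer; the function names are invented for illustration. It relies only on the general box layout (4-byte big-endian length, 4-byte type, with a length of 1 signalling an 8-byte extended length and 0 meaning "until end of file"). ICC profiles sit inside the Colour Specification box within the JP2 Header superbox, so they are not covered by this top-level walk.

```python
import struct


def iter_top_level_boxes(data):
    """Yield (box_type, payload) for each top-level box in a JP2 byte string."""
    offset = 0
    while offset + 8 <= len(data):
        length, box_type = struct.unpack(">I4s", data[offset:offset + 8])
        header = 8
        if length == 1:
            # Extended length: the real size is in the next 8 bytes.
            if offset + 16 > len(data):
                break
            (length,) = struct.unpack(">Q", data[offset + 8:offset + 16])
            header = 16
        elif length == 0:
            # Box runs to the end of the file.
            length = len(data) - offset
        if length < header or offset + length > len(data):
            break  # malformed or truncated box; stop rather than guess
        yield box_type, data[offset + header:offset + length]
        offset += length


def extract_embedded(data, wanted=(b"xml ", b"uuid")):
    """Return payloads of the wanted box types (hypothetical helper).

    For a UUID box the payload starts with the 16-byte UUID itself,
    followed by the embedded data.
    """
    return [(t, p) for t, p in iter_top_level_boxes(data) if t in wanted]
```

Each returned payload could then be written to its own file and passed to whatever tool is appropriate for it.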

Just an idea...

Cheers,

Johan  


From: Spencer, Ross
Sent: 15 December 2011 12:58
To: Johan van der Knijff
Cc: Gollins, Tim; Owens, Chris; 'Wheatley, Paul'; 'Jackson, Andrew'; 'Wilson, Carl'
Subject: RE: JP2 validator / properties extractor [UNCLASSIFIED]

Thanks for the info Johan.

I can see that an XSLT on top of that would be easy to write; I hadn’t realised you had a single field stating the validity of the file. Very useful. I’ll definitely take a more in-depth look at the validation work. It will be really useful for our work.

On the two points XML and ICC.

1) XML: I just mean when you’re reading and extracting the XML box from the file. We’re taking images with embedded XML data and it is important that it conforms to the spec insofar as it is well-formed and valid. We extract it and then run it through a validator against a schema. Once a tool like yours extracts it, we can then pass it along the workflow.

2) ICC Profiles: To characterise it better, I’d state the problem as: ‘JP2 contains an ICC color profile but the ICC color profile isn’t valid’. It’s really just me thinking out loud about what I’ve seen in the past few weeks. I appreciate you won’t validate the ICC profile but it raises a question for institutions ingesting JP2 about what to do and what that means for the validity of the file. Knowing we have a profile is the perfect start. It would then be up to the organisation’s workflows to extract it and validate it. That seems reasonable. It’s not really something I expect you to answer as part of this work! :-)

Thanks for following up my email.

Ross

From: Johan van der Knijff
Sent: 15 December 2011 11:41
To: Spencer, Ross
Cc: Gollins, Tim; Owens, Chris; Wheatley, Paul; Jackson, Andrew; Wilson, Carl
Subject: RE: JP2 validator / properties extractor [UNCLASSIFIED]

Hi Ross,

Thanks for the update and for your suggestions. Some first thoughts on this:

1. About the XML output: simplifying the reporting of validation results is something that I was planning to do anyway. Right now, everything under ‘tests’ is essentially an XML dump of every single test and check that is done. The nesting directly follows from the ‘box’ structure of JP2, but I agree this will be overly verbose and complicated for most users. One possibility would be to restrict the reporting of test results to only those tests that resulted in an error (i.e. returned “False”). A further refinement would be to link the somewhat cryptic descriptor fields (e.g. “minorVersionIsValid”) to more user-friendly error messages.

By the way, you don’t need to view/parse all these fields to see if a file is valid JP2; for that you can simply check the ‘isValidJP2’ field at the top level of the XML, e.g.:

<isValidJP2>True</isValidJP2>
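In a workflow script that check might look like the sketch below. Only the ‘isValidJP2’ element name comes from the output described here; the root element name in the sample, the absence of XML namespaces, and the function name `is_valid_jp2` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed output; the root element name is an assumption.
SAMPLE_OUTPUT = """<jpylyzer>
  <isValidJP2>True</isValidJP2>
</jpylyzer>"""


def is_valid_jp2(xml_text):
    """Return True if the output reports the file as valid JP2."""
    root = ET.fromstring(xml_text)
    # Search below the root, so the root element's name does not matter.
    element = root.find(".//isValidJP2")
    if element is None or element.text is None:
        raise ValueError("no isValidJP2 element found in output")
    return element.text.strip() == "True"


if __name__ == "__main__":
    print(is_valid_jp2(SAMPLE_OUTPUT))
```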

The purpose of the individual tests is only to provide more specific information on *what* is exactly wrong. In this context, reporting only that (as in your example) a File Type box is valid / not valid isn’t particularly informative, as it doesn’t tell you which of the 4 tests of this box failed. Again, reporting only the errors (either by default or as an option) would fix this.
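Until such an option exists, the same effect can be approximated by post-processing the current output. The sketch below is an illustration only: it assumes the test results are leaf elements whose text is “True” or “False” and that no XML namespaces are used. The sample reuses the File Type box element names from the example in this thread, with one test set to “False” for demonstration; the function name `failed_tests` is made up.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the File Type box example; the "False" value is invented.
SAMPLE_TESTS = """<tests>
  <fileTypeBox>
    <boxLengthIsValid>True</boxLengthIsValid>
    <brandIsValid>True</brandIsValid>
    <minorVersionIsValid>False</minorVersionIsValid>
    <compatibilityListIsValid>True</compatibilityListIsValid>
  </fileTypeBox>
</tests>"""


def failed_tests(xml_text):
    """Return slash-separated paths of all leaf tests whose value is 'False'."""
    root = ET.fromstring(xml_text)
    failures = []

    def walk(element, path):
        children = list(element)
        if not children:
            # Leaf element: an individual test result.
            if (element.text or "").strip() == "False":
                failures.append("/".join(path))
            return
        for child in children:
            walk(child, path + [child.tag])

    walk(root, [root.tag])
    return failures


if __name__ == "__main__":
    for path in failed_tests(SAMPLE_TESTS):
        print(path)
```

Keeping the full path preserves the box nesting, so the report still says which box the failing test belongs to.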

On a related note, you’ve probably noticed that the ‘properties’ output follows a similar tree structure. For the properties this is unlikely to change, as essentially the aim of the properties reporting is to provide an XML representation of the actual file structure and contents. Also, some boxes are repeatable (e.g. it is possible to have multiple Colour Specification boxes in one file, each containing a different colour space definition), which can result in multiple instances of property-value pairs. I would say that using the current structure would be the best way of dealing with this.    

2. About your comment “The key for us will be when XML support is added”. Not quite sure what you mean by this. Are you thinking of XML boxes inside a JP2, or is this related to the output format as well?

3. I’m not quite following your remarks about ICC profiles. Could you elaborate a bit on that or perhaps give an example? To be clear: jpylyzer does *not* validate ICC profiles, but it does check if embedded ICC profiles are allowed in JP2 (the format has some restrictions on which ICC profile classes and types are allowed).

On a more general note, one thing that  will probably make it somewhat difficult to interpret the results of jpylyzer at this stage is that there’s currently no documentation describing how jpylyzer actually validates a file and which criteria are used for labeling a file valid / not valid. This will all be covered by a user manual that’ll come out early 2012. All I can say about it at this point is that my aim has been to follow the standard as closely as possible, and be as complete as possible.

Meanwhile, if you’re interested in the more in-depth details you might want to have a closer look at one of the validator functions in the source code (the ‘validateImageHeaderBox’ would be a good starting point) and compare that against the description of that box in the standard (i.e. http://www.jpeg.org/public/15444-1annexi.pdf, and then section I.5.3.1 for the Image header box).

I hope this answers your questions to some extent. Please keep me informed in case of any new developments (or if this raises new questions).

Cheers,

Johan


From: Spencer, Ross
Sent: 14 December 2011 18:37
To: Johan van der Knijff
Cc: Gollins, Tim; Owens, Chris; 'Wheatley, Paul'; 'Jackson, Andrew'; 'Wilson, Carl'
Subject: RE: JP2 validator / properties extractor [UNCLASSIFIED]

Hi Johan,

Excellent work. I’ve tested on a couple of JP2s that we had found troublesome. I think the results look OK, but I need to look at them in more depth.

The key for us will be when XML support is added. I’m keen to understand the assertion of the tool when it validates. I presume following the spec, XML must be well-formed and valid and so you will check for this.

The ICC color profile is different. We have examples of JP2 files with an embedded profile that doesn’t validate against the ICC color profile tool. If we accepted embedded ICC profiles, that might be something we handle in our workflow rather than expecting the JP2 validator to declare the file invalid, but it’s an interesting example.

One more thing: there are a lot of XML fields that just exist to state that something exists and is valid. Can these be nested better, or could a less verbose mode be added? For example:

    <fileTypeBox>
      <boxLengthIsValid>
        True
      </boxLengthIsValid>
      <brandIsValid>
        True
      </brandIsValid>
      <minorVersionIsValid>
        True
      </minorVersionIsValid>
      <compatibilityListIsValid>
        True
      </compatibilityListIsValid>
    </fileTypeBox>

Becomes:

    <fileTypeBox>
      True
    </fileTypeBox>

I am not entirely sure of the rules for validation of the format, as my work had a slightly different angle, but my main request would be for the XML to be simplified, in whatever way works best.

Good work though. Look forward to seeing it developed further.

Ross



  1. Dec 15, 2011

    Hi Ross,

    OK, thanks for clearing that up! Would it perhaps be useful to include an option that extracts any embedded data, such as ICC profiles and the contents of XML and UUID boxes to file (exifTool already does something similar for ICC profiles)?

    I would be in favour of this. 

    Checking for well-formedness of data in XML boxes is something I might add; not so sure about validation (could really slow the tool down in case of references to schemas that have to be accessed over Internet).

    That would leave you the option to extract these to separate files and subsequently use any tool you like to validate/further process them. This is something I already had in mind for some time, and it’s something that could be implemented easily. For XML boxes it might also be feasible to add their contents directly to the output file (rather than to a separate profile), but I’ll have to see what’s the best solution.

    Yes, for XML and ICC I can see the benefit of exporting to a separate file, but for XML it might be equally easy to include it in the standard output. JHOVE did something similar. In terms of validation, I did find issues with the one I added to my tool, lxml. It had trouble with some regular expressions. Then again, I'm not sure validation is as important as just having access to the embedded XML and being able to use it.

    Also, the UUID boxes in particular can be used to embed pretty much anything. I remember Rob Buckley once told me of an institution that used this box to embed PDF inside JP2 (probably an extreme example). With a generic profile extraction option like the one I had in mind you could easily extract stuff like this.

    Just an idea...

    Cheers,

    Johan  

    UUID boxes can embed anything?! Doesn't surprise me. Interesting scenario. It would certainly be useful to know something is there. As for what the tool should do with it, I don't have any ideas.

    Hope some of these comments help. 

    Ross