Characterisation Hackathon Scratch Space


This page is for getting down any ideas for potential work activities at the hackathon. None of this is set in stone, and I expect the first hour of the event itself will be spent agreeing an approach and focus for development. Please contribute!

  • Expand on "Open FITS" work from #fileidhack, address maintenance/support issues etc.
  • Open FITS versus opening FITS on Google Code to more contributors. Discuss how to manage ongoing community involvement. Move the primary project to GitHub? What are the advantages over Google Code? (https://github.com/harvard-lts/fits)
  • Add Tika to FITS, and possibly other tools if specific needs are identified
  • Work with C3PO
  • Add other analysis/visualisation tools to work with the characterisation data (either FITS XML, or the C3PO datastore).
  • Some nice visualisation links here
  • Note on visualisation tools: Per recommends "R" (http://www.r-project.org/) and RStudio (http://www.rstudio.com/), demonstrated rather well (with some closely related, cool FITS stuff) here.
  • Utilise/exploit some of the characterisation results from previous SPRUCE Mashups, e.g. this RDF visualisation approach
  • Establish testing and evaluation approach, utilising LDS3
  • Anyone interested in using Apache Mahout to characterise collections?
  • Would love to see better PDF characterization, specifically metrics for scanned image + OCR files. Perhaps number of pages, number of images, number of words?
  • Couple of things from SCAPE:
    • Characterisation of large (8GB+) video files in an efficient way (just read header - does that work?)
    • Characterisation of large audio files
  • We might want to consider what to extract - so perhaps ensuring tools can extract minimum identified metadata:
  • Wouldn't mind delving into this http://www.sleuthkit.org/tsk_hadoop/ a little more.
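
On the SCAPE question above about reading just the header of very large video files: a minimal sketch of what "header only" could look like. It reads a fixed number of bytes and matches a few well-known container magic numbers. The table, the 64-byte cut-off and the function names are illustrative assumptions, not taken from any existing tool. Note that this only identifies the container; properties such as duration may live elsewhere in the file (an MP4 index can sit at the end), so it does not settle whether header-only reading is enough for full characterisation.

```python
import io

# Illustrative magic-number table (offset, bytes, label); deliberately incomplete.
MAGIC = [
    (0, b"RIFF", "RIFF container (e.g. AVI, WAV)"),
    (4, b"ftyp", "ISO base media file (e.g. MP4, MOV)"),
    (0, b"\x1a\x45\xdf\xa3", "EBML (e.g. Matroska, WebM)"),
    (0, b"fLaC", "FLAC audio"),
]

HEADER_BYTES = 64  # hypothetical cut-off; the rest of the file is never read


def identify_header(stream, limit=HEADER_BYTES):
    """Read at most `limit` bytes from a binary stream and match them against MAGIC."""
    head = stream.read(limit)
    for offset, magic, label in MAGIC:
        if head[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path):
    """Same check for a file on disk; cost does not depend on the file size."""
    with open(path, "rb") as fh:
        return identify_header(fh)


if __name__ == "__main__":
    # Stand-in for an 8GB+ file: only the first 64 bytes are ever consumed.
    sample = io.BytesIO(b"\x00\x00\x00\x18ftypmp42" + b"\x00" * 1000)
    print(identify_header(sample))
```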

JHOVE stuff?

I'm wondering if something useful could be done with JHOVE for this. I have a wish list a mile long for it, and perhaps better integration with other applications would be useful. – Gary McGath

Registry

Carl Wilson and I discussed an intermediate registry where you can pull signatures from in your desired format.
E.g.: add PRONOM and TIKA sigs, output as XML that DROID, FIDO and TIKA can understand.
I propose to skip the "high level" stuff about trusted resources, responsibility for data and hosting, etc and start with a prototype of an intermediate registry.

  • Just wondering: do we need a registry here, or just some tools that convert one signature format to another? Would a web form and these scripts do it?

Maurice de Rooij / 08-01-2013
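
To make the prototype idea above a bit more concrete, here is a rough sketch of an intermediate record plus one XML exporter. Everything in it is an assumption for discussion: the `Signature` fields, the element names and the example entries are made up, and the XML is not the real DROID, FIDO or Tika signature schema. A real prototype would need one exporter per target tool.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Signature:
    """Hypothetical neutral record held by the intermediate registry."""
    source: str      # where the signature came from, e.g. "PRONOM" or "TIKA"
    format_id: str   # identifier used by that source
    name: str
    offset: int      # byte offset of the magic sequence
    magic_hex: str   # magic bytes as a hex string


def to_xml(signatures):
    """Serialise records to a generic (made-up) XML layout."""
    root = ET.Element("signatures")
    for sig in signatures:
        el = ET.SubElement(root, "signature", source=sig.source, id=sig.format_id)
        ET.SubElement(el, "name").text = sig.name
        ET.SubElement(el, "magic", offset=str(sig.offset)).text = sig.magic_hex
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Example entries are illustrative only.
    registry = [
        Signature("PRONOM", "fmt/18", "PDF 1.4", 0, "255044462D312E34"),
        Signature("TIKA", "application/pdf", "PDF", 0, "255044462D"),
    ]
    print(to_xml(registry))
```

Whether this ends up as a registry or just a set of conversion scripts, the neutral record in the middle is the part both approaches would share.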

More on Visualisation

Interesting article - no, not just because it is about beer (though, of course, that helps) - http://blog.neo4j.org/2013/01/fun-with-beer-and-graphs.html - no reason an archivist couldn't follow a workflow like that for a collection...

Mentions Gephi (https://gephi.org/) which looks ace and perhaps a substitute for Welkin!! [Peter C]

Parallel processing

Moved content to separate page:
Parallel processing of identification and characterisation jobs

FITS profiling

I've added a new page summarizing some performance profiling on OpenFits: FITS profiling. – Gary McGath

jpylyzer

Two possible ideas:

  • Recent work by Andy Jackson with bitmashed JP2s showed that jpylyzer's robustness when handling seriously malformed files could do with some further improvement. Details here: https://github.com/openplanets/jpylyzer/issues/31. This issue could be addressed at the hackathon.
  • Creation of a set of open reference test images that jpylyzer contributors can use to test their modifications against. Currently I have a set of test images with various issues, but most of these contain copyrighted content and cannot easily be shared for that reason. Open reference images should reproduce the features of the copyrighted ones (which in most cases can be done with some pretty straightforward hex-editing).

(added by Johan van der Knijff)

JHOVE2 stuff

Coding possibilities

  • Create a new format module by creating wrappers for other tools (including non-Java tools, e.g. jpylyzer)
  • Create a new format module from scratch (for preview, see #Creating a New Format Module (.ppt) on JHOVE2 wiki)
  • Create a new identifier module: (2.1.0 release of JHOVE2 will include File-based identifier)
    • combined DROID + File
    • weighted DROID + File   
  • Mavenize JHOVE2 and publish it to the standard Maven repositories so it can be included via Maven in other projects
  • Use XSL for assessment 
  • Create exif-uncompressed profile for TIFF module
  • How to invoke characterization of already-identified files
  • Update to latest version of DROID
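
For the "weighted DROID + File" identifier idea above, one possible reading is a simple weighted vote. This is only a sketch for discussion, not JHOVE2 code: the weights and the `combine` function are hypothetical, and it assumes each tool's answer has already been mapped to a common format label (DROID reports PUIDs while File reports MIME types or descriptions, so that mapping is a separate problem).

```python
from collections import defaultdict

# Hypothetical integer weights: trust DROID twice as much as File.
WEIGHTS = {"droid": 2, "file": 1}


def combine(results, weights=WEIGHTS):
    """Pick the format with the highest total weight.

    `results` maps a tool name to the (already normalised) format label it
    reported, or None if the tool gave no answer.
    Returns (label, score), or (None, 0) if no tool answered.
    """
    scores = defaultdict(int)
    for tool, label in results.items():
        if label is not None:
            scores[label] += weights.get(tool, 0)
    if not scores:
        return None, 0
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    print(combine({"droid": "PDF", "file": "PDF"}))    # tools agree
    print(combine({"droid": "PDF", "file": "text"}))   # tools disagree
    print(combine({"droid": None, "file": "text"}))    # only File answered
```

The unweighted "combined DROID + File" variant is the same function with all weights set to 1, although ties would then need an explicit rule.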

Discussion topics?

  • Describe JHOVE2 framework
  • JHOVE2 governance
  • what JHOVE2 means by "clump" sources (multi-file digital objects like GIS files)
  • Wrapping code in JHOVE2 -- How do we "characterize" that code?  Is the unexamined code worth wrapping?  (all apropos of what parser to wrap for PDF module)
  • Beyond the unit test -- test corpora for format modules
  • What does JHOVE2 need done to it to make it play nicely with the SCAPE framework?

C3PO

Just some ideas about new features and optimisation of old ones. All other ideas and feedback are more than welcome. [Petar P]

CORE/ CLI

  • Cache the output of Map/Reduce Jobs per default
  • Think of a way to consolidate output of different adaptors
  • Write a TIKA Adaptor (or any other characterisation tool)
  • Fix profile generation bugs for sparse data
  • Bundle optional characterisation with the command line (e.g. provide path to data folder/or path to characterisation data folder)
  • Write a MetaDataGatherer for some repository (probably not possible because of unified interface)
  • Write ConflictResolution rules
  • ...

WEB App

  • More visualisations (not only histograms)
    • scatterplot (based on two characteristics)
    • bubble charts
    • represent conflicts in a weighted manner
  • Filtering/Slicing and Dicing
    • Fix filtering bug (integer properties)
    • Optimise filtering and Map/Reduce Job execution during filtering
    • Redesign filtering (maybe also UI)
  • ...

FITS

  • Write more XSLTs to get more properties out of JHOVE's output
  • Bundle Tika in FITS (maybe remove DROID)
  • Update DROID
  • Other new tools: Aduna Aperture, MediaInfo, FIDO.  https://code.google.com/p/fits/wiki/future_tools
  • Merge changes from previous FITS hackathon into core FITS (update exiftool)
  • How should FITS standard output be changed to better support video?
  • Better documentation - what is needed?  
  • Swithun Crowe contributed a patch to version 0.4.2 that adds an option to normalize FITS date and file size output, and adds PREMIS object and event XSL transforms. This should be incorporated into the latest version of FITS. See extras.zip
  • FITS depends on undocumented features of JHOVE, i.e., specific values in the XML output. Can this be made less fragile? Figuring out exactly what to document and keep from changing in JHOVE would be one answer. – Gary McGath
  • Better automated testing
  • ...
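
On normalizing date output (the item above about the 0.4.2 patch): a small sketch of what date normalization could mean in practice, trying a few candidate input formats and emitting ISO 8601. This is not taken from the patch itself, which I have not reproduced here; the format list and the `normalise_date` name are assumptions, and a real version would need to cover whatever the bundled tools actually emit.

```python
from datetime import datetime

# Candidate input formats, tried in order. Illustrative, not exhaustive.
CANDIDATE_FORMATS = [
    "%Y:%m:%d %H:%M:%S",   # EXIF-style, e.g. 2013:01:08 12:30:00
    "%Y-%m-%dT%H:%M:%S",   # already ISO 8601
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",            # date only; time defaults to midnight
]


def normalise_date(raw):
    """Return an ISO 8601 string, or None if no candidate format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).isoformat()
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    for value in ["2013:01:08 12:30:00", "2013-01-08", "last Tuesday"]:
        print(value, "->", normalise_date(value))
```

Returning None for unrecognised values (rather than guessing) keeps the original tool output available for conflict reporting.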

MUPPET: MUlti Pass file Properties Extraction Tool

MUPPET Pitch Document

  1. Jan 08, 2013

    Great suggestions so far guys, thanks for the input!

  2. Feb 09, 2013

    I just added some suggestions about C3PO. Feel free to add more and/or change these.

    Looking forward to meeting you all!