Use ohcount to detect source code text files

Title This solution uses a tool called ohcount to spot source code files, and compares the results with those from file.
Solution Champion Andrew Jackson
Corresponding Issue(s) Ability to automatically identify script files
Tool/code link
Tool Registry Link  ohcount
CO: Only tested on partially annotated test corpus, but looks very promising. Will be trying it out when back in the office.
Dev: Needs larger/broader test corpus to get a better impression of accuracy
CO: Potential to build an identification workflow that combines several tools to give a complete ID picture

Results Summary




  • The ohcount 'No Results' are all binary formats that were left in the test corpus, so they should not be counted as failures.
  • The 'False Positives' stem from a few sources (e.g. Perl CGI scripts being classed as "(null)") but are mostly due to problems with the test data or the tests!
    • Header files classed as C by ohcount but as C++ in the test corpus
    • Files classed as 'perl' in the test corpus not actually being Perl files.
    • Mixed files like html+javascript not being tested/categorised properly.
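The categories above can be tallied mechanically once every file has an expected label from the corpus and a detected label from the tool. The following is only a minimal sketch of such a tally, not the script used for these results: the function names, the sample rows, and the label strings "c" and "cpp" are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Row = Tuple[str, str, Optional[str]]  # (path, expected label, detected label)


def classify(path: str, expected: str, detected: Optional[str]) -> str:
    """Put one file into a result category."""
    if detected is None or not detected.strip():
        # The tool produced nothing, e.g. for a binary file left in the corpus.
        return "no result"
    exp, det = expected.strip().lower(), detected.strip().lower()
    if exp == det:
        return "match"
    # Assumption: a .h header labelled C on one side and C++ on the other
    # is a disagreement about the test data, not a detection error.
    if path.endswith(".h") and {exp, det} == {"c", "cpp"}:
        return "header ambiguity"
    return "false positive"


def summarise(rows: Iterable[Row]) -> Counter:
    """Count how many files fall into each category."""
    return Counter(classify(path, exp, det) for path, exp, det in rows)


# Hypothetical sample rows, not taken from the real test corpus.
SAMPLE_ROWS = [
    ("hello.pl", "perl", "perl"),
    ("util.h", "cpp", "c"),
    ("logo.png", "png", None),
    ("form.cgi", "perl", "(null)"),
]
```

Calling `summarise(SAMPLE_ROWS)` puts one file in each of the four categories. Separating header ambiguity from the other false positives makes it easier to see how many mismatches come from the test data rather than from the tool.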

The full test corpus cannot be shared, unfortunately, but the Perl files that failed were from mod_perl:

  1. Apr 12, 2012

    Note that later experience has shown that ohcount is only reliable when the file extension is correct! If you rename a Ruby source file to be .c instead of .rb, it thinks it is C and reports all the lines as matching (with zero comments).
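The renaming experiment can be turned into a repeatable check for any identification tool. The sketch below assumes the tool is wrapped in a hypothetical `detector` callable that takes a path and returns a language label; how ohcount itself would be invoked is deliberately left out. `suffix_only_detector` is a stand-in that reproduces the extension-only behaviour described above, not ohcount's real logic.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional

Detector = Callable[[Path], Optional[str]]


def detect_with_extension(detector: Detector, source: Path, new_suffix: str) -> Optional[str]:
    """Copy `source` under a different extension and run the detector on the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        renamed = Path(tmp) / (source.stem + new_suffix)
        shutil.copyfile(source, renamed)
        return detector(renamed)


def is_extension_driven(detector: Detector, source: Path, new_suffix: str) -> bool:
    """True if changing only the file extension changes the detected language."""
    return detector(source) != detect_with_extension(detector, source, new_suffix)


def suffix_only_detector(path: Path) -> Optional[str]:
    """Stand-in detector that looks at nothing but the extension (illustrative only)."""
    return {".rb": "ruby", ".c": "c", ".pl": "perl"}.get(path.suffix)
```

Running `is_extension_driven` over a corpus with a few deliberately wrong extensions gives a quick measure of how much a tool relies on file names rather than content.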

    This is very disappointing, as there is lots of clever logic encoded in the 'ragel' parsers that reflects the language formats, but it seems that this is only being used to distinguish code from comments when the format is already known by extension (rather than used to detect the overall format).

    I suspect a new approach is needed here, e.g. the Bayesian model suggested by:
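As a rough illustration of what a content-based Bayesian approach could look like in general, and not a reproduction of the model referred to above, here is a tiny naive Bayes classifier over source tokens. The class name, the tokeniser, and the training snippets in the usage note are all assumptions; a usable model would need a large training corpus per language.

```python
import math
import re
from collections import Counter, defaultdict
from typing import List, Optional

# Crude tokeniser: runs of letters/underscores, or single punctuation characters.
TOKEN = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokens(text: str) -> List[str]:
    return TOKEN.findall(text)


class NaiveBayesLanguageGuesser:
    """Guess a language from file content alone; file names are never consulted."""

    def __init__(self) -> None:
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.doc_counts = Counter()               # language -> training files seen
        self.vocabulary = set()

    def train(self, language: str, text: str) -> None:
        toks = tokens(text)
        self.token_counts[language].update(toks)
        self.doc_counts[language] += 1
        self.vocabulary.update(toks)

    def guess(self, text: str) -> Optional[str]:
        toks = tokens(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best, best_score = None, -math.inf
        for language, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Log prior plus log likelihood with add-one (Laplace) smoothing.
            score = math.log(self.doc_counts[language] / total_docs)
            for tok in toks:
                score += math.log((counts[tok] + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best, best_score = language, score
        return best
```

For example, after `train("ruby", "def hello\n  puts 'hi'\nend\n")` and `train("c", '#include <stdio.h>\nint main(void) { printf("hi"); return 0; }\n')`, calling `guess("puts 'hello'\nend")` picks the Ruby model, whatever the file happens to be called.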