|Title||This solution uses a tool called ohcount to spot source code files, and compares the results with those from the file command.|
|Detailed description||See also http://www.ohloh.net/p/planets-suite|
|Solution Champion||Andrew Jackson|
|Corresponding Issue(s)||Ability to automatically identify script files|
|Tool Registry Link||ohcount|
CO: Only tested on partially annotated test corpus, but looks very promising. Will be trying it out when back in the office.
Dev: Needs larger/broader test corpus to get a better impression of accuracy
CO: Potential to build an identification workflow that combines several tools to give a complete ID picture
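As a rough illustration of combining tools, the sketch below runs ohcount and file on each file under a directory and prints both answers side by side. It is not the script used for this Solution. It assumes that `ohcount -d` prints one line per file as the language name, a tab, then the filename, with "(null)" where no language was detected, and that `file --brief --mime-type` is available. The function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def parse_detect_line(line):
    """Split one 'language<TAB>filename' line (assumed `ohcount -d` format)."""
    language, _, filename = line.rstrip("\n").partition("\t")
    # Assumption: ohcount prints "(null)" when it cannot assign a language.
    return filename, (None if language == "(null)" else language)


def ohcount_detect(path):
    """Return the language ohcount reports for one file, or None."""
    result = subprocess.run(["ohcount", "-d", str(path)],
                            capture_output=True, text=True, check=True)
    for line in result.stdout.splitlines():
        if line.strip():
            return parse_detect_line(line)[1]
    return None


def file_mime(path):
    """Return the MIME type reported by the file command."""
    result = subprocess.run(["file", "--brief", "--mime-type", str(path)],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def identify(path):
    """Collect both identifications for one file."""
    return {"path": str(path),
            "ohcount": ohcount_detect(path),
            "file": file_mime(path)}


if __name__ == "__main__":
    # Usage (hypothetical script name): python identify.py corpus_dir
    for root in sys.argv[1:]:
        for p in sorted(Path(root).rglob("*")):
            if p.is_file():
                r = identify(p)
                print(f"{r['path']}\t{r['ohcount']}\t{r['file']}")
```

Keeping the two answers in separate columns, rather than merging them, leaves room to add further tools later and to see where they disagree.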
- The ohcount 'No Results' are all binary formats that were left in the test corpus, so they should not be counted as failures.
- The 'False Positives' stem from a few sources (e.g. Perl CGI scripts being classed as "(null)"), but most are due to problems with the test data or the tests themselves:
  - Header files classed as C by ohcount but as C++ in the test corpus.
  - Files classed as 'perl' in the test corpus that are not actually Perl files.
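One way to keep these cases apart when scoring a run is sketched below. This is an assumption about how the tallying could be done, not the test code used here. The category names are hypothetical, as is the use of `c` and `cpp` as language labels. Each row is (filename, expected label from the corpus annotation, label detected by ohcount), with None meaning "no language".

```python
from collections import Counter

# Assumption: C and C++ labels are treated as interchangeable for .h files.
HEADER_SUFFIXES = (".h",)
C_FAMILY = {"c", "cpp"}


def classify(filename, expected, detected):
    """Put one comparison into a coarse category."""
    if detected is None:
        # Binary files left in the corpus have no expected language,
        # so a missing result for them is not a failure.
        return "no_result" if expected is None else "missed"
    if detected == expected:
        return "match"
    if filename.endswith(HEADER_SUFFIXES) and {expected, detected} <= C_FAMILY:
        # A C versus C++ disagreement on a header is ambiguous, not wrong.
        return "header_ambiguity"
    return "mismatch"


def tally(rows):
    """Count categories over (filename, expected, detected) rows."""
    return Counter(classify(*row) for row in rows)
```

Rows that land in "mismatch" still need checking by hand, since some corpus labels may themselves be wrong.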
The full test corpus cannot be shared, unfortunately, but the Perl files that failed were from mod_perl: