|Title||This solution uses a tool called ohcount to identify source code files, and compares the results with those from the file utility.|
|Detailed description||See also http://www.ohloh.net/p/planets-suite|
|Solution Champion||Andrew Jackson|
|Corresponding Issue(s)||Ability to automatically identify script files|
|Tool Registry Link||ohcount|
CO: Only tested on partially annotated test corpus, but looks very promising. Will be trying it out when back in the office.
Dev: Needs larger/broader test corpus to get a better impression of accuracy
CO: Potential to build an identification workflow that combines several tools to give a complete ID picture
- The ohcount 'No Results' are all binary formats that were left in the test corpus, so should not be considered a failure.
- The 'False Positives' come from a few sources (e.g. Perl CGI scripts being classed as "(null)"), but most are due to problems with the test data or the tests themselves:
  - Header files classed as C by ohcount but labelled as C++ in the test corpus.
  - Files labelled 'perl' in the test corpus that are not actually Perl files.
The full test corpus cannot be shared, unfortunately, but the Perl files that failed were from mod_perl:
- ./Perl/mod_perl-1.30/t/net/perl/dirty-lib
- ./Perl/mod_perl-1.30/t/net/perl/dirty-script.cgi
- ./Perl/mod_perl-1.30/t/net/perl/dirty-test.cgi
- ./Perl/mod_perl-1.30/t/net/perl/echo
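The comparison behind the notes above can be sketched roughly as follows. This is not the original test harness: the function names, the outcome categories and the example file names are all assumptions made for illustration. It simply tallies, per file, whether a tool's label agrees with the corpus annotation, disagrees with it, or is missing altogether.

```python
from collections import Counter


def classify_outcome(expected, detected):
    """Bin one file's result (the category names are illustrative assumptions)."""
    if detected is None:
        # The tool produced no entry at all for this file.
        return "no_result"
    if detected.lower() == expected.lower():
        return "match"
    # Any other label, including "(null)", disagrees with the annotation.
    return "mismatch"


def tally(expected_labels, detected_labels):
    """Count outcomes over every annotated file in the corpus."""
    counts = Counter()
    for path, expected in expected_labels.items():
        counts[classify_outcome(expected, detected_labels.get(path))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical annotations and tool output, not taken from the real corpus.
    expected = {"x.rb": "ruby", "y.h": "c++", "z.bin": "binary", "w.cgi": "perl"}
    detected = {"x.rb": "ruby", "y.h": "c", "w.cgi": "(null)"}
    print(dict(tally(expected, detected)))
```

Running the same tally over the labels from a second tool gives the side-by-side comparison. Mismatches such as the header file above would still need manual review, since the annotation rather than the tool may be at fault.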
Apr 12, 2012
Note that later experience has shown that ohcount is only reliable when the file extension is correct! If you rename a Ruby source file to be .c instead of .rb, it treats the file as C and counts every line as C code (with zero comments).
This is very disappointing, as there is lots of clever logic encoded in the 'ragel' parsers that reflects the language formats, but it seems that this is only being used to distinguish code from comments when the format is already known from the extension (rather than to detect the overall format).
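The rename experiment is easy to repeat against any identification tool. Below is a minimal sketch, assuming a hypothetical `detect` callable (path in, label out) that wraps whichever tool is under test. Neither `detect` nor the helper names come from ohcount or file.

```python
import os
import shutil
import tempfile


def detect_with_renamed_copy(path, new_ext, detect):
    """Run `detect` on `path` and on a byte-identical copy ending in `new_ext`.

    `detect` is a hypothetical wrapper around the tool being tested.
    """
    with tempfile.TemporaryDirectory() as tmp:
        stem = os.path.splitext(os.path.basename(path))[0]
        renamed = os.path.join(tmp, stem + new_ext)
        shutil.copyfile(path, renamed)
        # Both calls happen before the temporary directory is removed.
        return detect(path), detect(renamed)


def depends_on_extension(path, new_ext, detect):
    """True if the label changes when only the extension changes."""
    original, renamed = detect_with_renamed_copy(path, new_ext, detect)
    return original != renamed
```

A tool that really inspects content should give the same answer for both copies, so `depends_on_extension` returning `True` reproduces the behaviour described above. The `detect` wrapper could, for instance, call the tool through `subprocess` and parse its output.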
I suspect a new approach is needed here, e.g. the Bayesian model suggested by: http://stackoverflow.com/questions/475033/detecting-programming-language-from-a-snippet
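To make the idea concrete, here is a minimal sketch of a token-based naive Bayes guesser with add-one smoothing. It is an illustration only, not the code from the linked answer and not anything ohcount does. The class name, the tokenizer and the tiny training samples are all assumptions, and a usable model would need a far larger training corpus.

```python
import math
import re
from collections import Counter, defaultdict

# Words, or single punctuation characters; digits are ignored.
TOKEN = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(text):
    return TOKEN.findall(text)


class NaiveBayesLanguageGuesser:
    """Hypothetical multinomial naive Bayes over source-code tokens."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.doc_counts = Counter()               # language -> training samples
        self.vocab = set()

    def train(self, language, text):
        tokens = tokenize(text)
        self.token_counts[language].update(tokens)
        self.doc_counts[language] += 1
        self.vocab.update(tokens)

    def guess(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for language, counts in self.token_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[language] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = language, score
        return best


if __name__ == "__main__":
    model = NaiveBayesLanguageGuesser()
    model.train("ruby", 'def greet(name)\n  puts "Hello #{name}"\nend\n')
    model.train("c", '#include <stdio.h>\nint main(void) {\n  printf("Hello\\n");\n  return 0;\n}\n')
    print(model.guess("def add(a, b)\n  puts a + b\nend\n"))
```

Because the guess depends only on file content, renaming a file would not change the result, which is the property ohcount lacks.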