Ability to automatically identify script files

Title
Ability to automatically identify script files
Detailed description Accurate, automated identification of file formats is necessary in order to manage and preserve digital objects effectively. At present, many file format identification tools do not identify text and programming language files with a sufficient level of accuracy (i.e. they identify on extension only, or by searching for an internal byte sequence which in practice is not always present in such files).
Issue champion Andrew Fetherston, The National Archives
Other interested parties
More accurate identification of script files would likely be of use to the wider digital preservation community, in particular those involved in web archiving.
Possible Solution approaches
  • Create traditional byte sequence signatures which could be used in existing tools (e.g. DROID). These would need sufficient variability and granularity to identify all script file formats accurately, without producing clashes and misidentifications. From previous experience this does not appear to be a practical approach.
  • Use regular file identification tools to filter out the files which are correctly identified, then run a 'scripting subset' signature on the remainder. This minimises misidentifications and clashes, but may still result in incorrect identifications. The scripting subset signature could include regular expressions matching common script declarations (e.g. #!/usr/bin/perl for a Perl file). Again, this would not identify valid Perl files which do not contain this declaration, but it could help to reduce the unknown sample before further processing with additional tools.
  • Use a lexical approach, examining the contents of the file in an attempt to identify the programming language. This may not work for all files (smaller files containing only general code may be harder to identify correctly).
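As a rough illustration of how the second and third approaches might be combined, the Python sketch below first looks for an interpreter declaration on the first line and, if none is found, falls back to counting language-specific keywords. It is only a sketch under stated assumptions: the function names, the interpreter and keyword tables, and the `min_hits` threshold are all hypothetical and are not taken from DROID or any other existing tool.

```python
import re
from collections import Counter

# Hypothetical lookup tables for illustration only; a real tool would need
# far broader coverage and careful tuning.
SHEBANG_LANGUAGES = {
    "perl": "Perl",
    "python": "Python",
    "python3": "Python",
    "sh": "Shell",
    "bash": "Shell",
}
KEYWORD_HINTS = {
    "Perl": {"my", "sub", "elsif", "foreach", "unless"},
    "Python": {"def", "import", "elif", "self", "None"},
    "Shell": {"fi", "esac", "echo", "done", "then"},
}

# Captures the program path and, optionally, the first argument.
SHEBANG_RE = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")


def identify_by_shebang(text):
    """Return a language name from the interpreter declaration, or None."""
    first_line = text.split("\n", 1)[0]
    match = SHEBANG_RE.match(first_line)
    if not match:
        return None
    program = match.group(1).rsplit("/", 1)[-1]
    # "#!/usr/bin/env perl" names the interpreter in the second token.
    if program == "env" and match.group(2):
        program = match.group(2)
    return SHEBANG_LANGUAGES.get(program)


def identify_by_keywords(text, min_hits=3):
    """Return the language whose keywords occur most often, or None."""
    tokens = Counter(re.findall(r"[A-Za-z_]+", text))
    scores = {
        lang: sum(tokens[word] for word in words)
        for lang, words in KEYWORD_HINTS.items()
    }
    best = max(scores, key=scores.get)
    # Small or generic files rarely reach the threshold and stay unidentified.
    return best if scores[best] >= min_hits else None


def identify_script(text):
    """Try the declaration first, then fall back to keyword counting."""
    return identify_by_shebang(text) or identify_by_keywords(text)
```

For example, `identify_script("#!/usr/bin/perl\nprint 1;\n")` is resolved by the declaration alone, whereas a Perl file without a declaration relies on the keyword count, and a very short file such as `x = 1` is left unidentified. This mirrors the limitations noted above for each approach.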
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Lessons Learned It seems clear that the most likely way to solve this issue is to use a combination of tools with different abilities and areas of focus.
It would also be worth investigating 'toolkits' and 'signature kits' for certain collection types, since not every file format signature needs to be run over every collection.
Datasets http://wiki.opf-labs.org/display/REQ/Script+and+Programming+Language+File+Identifications 
Solutions http://wiki.opf-labs.org/display/REQ/Use+ohcount+to+detect+source+code+text+files
Labels: issue, york_hackathon, identification, unknown_file_formats