Title |
Ability to automatically identify script files |
Detailed description | Accurate, automated identification of file formats is necessary to manage and preserve digital objects effectively. At present, many file format identification tools do not identify text and programming language files with a sufficient level of veracity: they rely on the file extension alone, or search for an internal byte sequence that in practice is not always present in such files. |
Issue champion | |
Other interested parties |
More accurate identification of script files would likely be of use to the wider digital preservation community, in particular those involved in web archiving. |
Possible Solution approaches |
|
Context | Details of the institutional context to the Issue. (May be expanded at a later date) |
Lessons Learned | It seems clear that the most likely approach to solving this issue is to use a combination of tools with different abilities and areas of focus. We would also like to investigate 'toolkits' and 'signature kits' for certain collection types, since not every file format signature needs to be run over every collection. |
Datasets | http://wiki.opf-labs.org/display/REQ/Script+and+Programming+Language+File+Identifications |
Solutions | http://wiki.opf-labs.org/display/REQ/Use+ohcount+to+detect+source+code+text+files |
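
The idea of combining several identification methods, rather than relying on the extension or a byte sequence alone, can be illustrated with a minimal Python sketch. It is not how ohcount or any existing identification tool works. The hint tables, function names and patterns are all hypothetical, and a real signature set would need to be far larger and tested against actual collections.

```python
import re

# Hypothetical hint tables for illustration only; not taken from any real tool.
EXTENSION_HINTS = {".py": "python", ".pl": "perl", ".sh": "shell",
                   ".js": "javascript", ".rb": "ruby"}
SHEBANG_HINTS = {"python": "python", "perl": "perl", "bash": "shell",
                 "sh": "shell", "node": "javascript", "ruby": "ruby"}
CONTENT_HINTS = [
    (re.compile(r"^\s*def \w+\(.*\):", re.M), "python"),
    (re.compile(r"^\s*(use strict;|my \$\w+)", re.M), "perl"),
    (re.compile(r"\bfunction\s+\w+\s*\(|\bvar\s+\w+\s*="), "javascript"),
]


def by_extension(filename):
    """Guess from the file extension alone (weak evidence)."""
    dot = filename.rfind(".")
    return EXTENSION_HINTS.get(filename[dot:].lower()) if dot != -1 else None


def by_shebang(text):
    """Guess from a '#!' interpreter line, which is often absent."""
    first_line = text.split("\n", 1)[0]
    if not first_line.startswith("#!"):
        return None
    tokens = first_line[2:].split()
    if not tokens:
        return None
    name = tokens[0].rsplit("/", 1)[-1]
    if name == "env" and len(tokens) > 1:
        name = tokens[1]                    # e.g. "#!/usr/bin/env python3"
    name = re.sub(r"[\d.]+$", "", name)     # drop version suffix: python3 -> python
    return SHEBANG_HINTS.get(name)


def by_content(text):
    """Guess from characteristic syntax found anywhere in the text."""
    for pattern, label in CONTENT_HINTS:
        if pattern.search(text):
            return label
    return None


def identify(filename, text):
    """Combine all methods; return (label, methods that agreed on it)."""
    votes = {}
    results = (("extension", by_extension(filename)),
               ("shebang", by_shebang(text)),
               ("content", by_content(text)))
    for method, label in results:
        if label:
            votes.setdefault(label, []).append(method)
    if not votes:
        return None, []
    # The label backed by the most methods wins; ties go to the earliest method.
    best = max(votes, key=lambda label: len(votes[label]))
    return best, votes[best]
```

For example, `identify("run", "#!/usr/bin/env python3\ndef main():\n    pass\n")` has no extension to go on, but the shebang and content checks agree on `python`. Returning the list of agreeing methods lets a caller treat an extension-only match as lower confidence than a match confirmed by the file's content.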