Script and Programming Language File Identifications

Description A collection of 600 script files from various sources, comprising C++, JavaScript, CSS, Perl, PHP and Python.
Owner Collection is owned by various organisations, and includes web-harvested content
Dataset Location Dataset currently restricted to use at Hackathon
Collection Champion Andrew Fetherston, The National Archives
Issues brainstorm Traditional digital file format identification tools do not identify text/script files reliably, which has implications for digital preservation repositories, web archiving, etc.
  • the nature of these file types means that their internal structure can vary widely; a file may or may not contain particular patterns (as matched by regular expressions), byte sequences or declarations, so they are not well suited to identification by internal byte sequence signatures.
  • it is difficult to obtain consistent and accurate results by looking for byte sequences alone.
  • may need an initial scan of files with a file format identification tool (e.g. DROID, Linux/Unix command: file) to separate out files that are unidentified or identified by extension only
  • can use lexical analysis to identify the probable script language used; this could also be used to identify sections of script embedded in other file types
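
The lexical approach in the last point could look something like the minimal Python sketch below. It is an illustration only, not a tool produced or used at the Hackathon: the cue patterns in CUES and the function names score_languages and guess_language are hypothetical and hand-picked, and a real classifier would derive and weight its cues from a labelled sample such as this collection.

```python
import re
from collections import Counter

# Hypothetical, hand-picked lexical cues per language (an assumption for
# illustration; not taken from any existing identification tool).
CUES = {
    "C++": [r"#include\s*<", r"\bstd::", r"\bint\s+main\s*\(", r"\bnamespace\b"],
    "JavaScript": [r"\bfunction\b", r"\b(?:var|let|const)\b", r"\bdocument\.", r"==="],
    "CSS": [r"\b(?:color|margin|padding|font-size|display)\s*:", r"^[.#][\w-]+\s*\{"],
    "Perl": [r"^#!.*\bperl\b", r"\buse\s+strict\b", r"\bmy\s+[$@%]\w+", r"=~"],
    "PHP": [r"<\?php", r"\$\w+\s*=", r"\becho\b", r"->\w+"],
    "Python": [r"^\s*def\s+\w+\s*\(.*\)\s*:", r"^\s*import\s+\w+",
               r"^\s*from\s+\w+\s+import\b", r"\bself\."],
}


def score_languages(text):
    """Count how many times each language's cue patterns occur in the text."""
    scores = Counter()
    for language, patterns in CUES.items():
        for pattern in patterns:
            scores[language] += len(re.findall(pattern, text, flags=re.MULTILINE))
    return scores


def guess_language(text):
    """Return the highest-scoring language, or None if no cue matched at all."""
    language, best = score_languages(text).most_common(1)[0]
    return language if best > 0 else None


print(guess_language("<?php\n$name = 'world';\necho $name;\n"))  # PHP
```

Cues overlap between languages (for example, a Perl assignment such as "my $x = 1;" also matches the PHP variable-assignment cue), so the scores are best read as a ranking of probable languages rather than a definite identification.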
List of Issues Need to consider the risk of mis-identifying files because of embedded content.
Overheads of running additional file identification tools
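
One possible way to reduce mis-identification caused by embedded content is to separate the embedded sections from their host file before any language analysis is applied. The sketch below is a minimal illustration using Python's standard html.parser module, assuming the host file is HTML and that only script and style elements are of interest; the class name EmbeddedCodeExtractor and the function extract_embedded are hypothetical.

```python
from html.parser import HTMLParser


class EmbeddedCodeExtractor(HTMLParser):
    """Collect the text inside <script> and <style> elements of an HTML file."""

    HOST_TAGS = ("script", "style")  # assumption: only these two are handled

    def __init__(self):
        super().__init__()
        self.regions = []     # list of (host tag, embedded source text)
        self._current = None  # host tag currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HOST_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if text:  # skip empty elements, e.g. scripts loaded via src
                self.regions.append((tag, text))
            self._current = None


def extract_embedded(html_text):
    """Return the embedded script/style sections found in html_text."""
    parser = EmbeddedCodeExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.regions
```

Each region is labelled with its host tag rather than a language, because a script element is not guaranteed to contain JavaScript; the extracted text would still need to be identified separately, which adds to the processing overhead noted above.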
