Title
IS17 Characterisation of text-based formats
Detailed description Problem: scientific journal articles (which are usually in PDF format) are increasingly accompanied by supplemental files. These are often research data, software source code or scripts. In the majority of cases such files have some text-based format. The actual formats are sometimes fairly obscure, and often they are known and used only within specific communities. Two (fairly random) examples are:
  • The MDL Molfile format, which is used to store information about chemicals
  • The Geo-EAS format, which is used for storing geospatial point data
Some more common examples:
  • Perl scripts
  • R scripts (see: http://www.r-project.org/ )

Correct identification of such files is problematic for a number of reasons. First, signature-based identification does not work particularly well for text-based formats. Second, even if we are able to identify these files as ‘text’, this is not very informative, as there are significant differences between, for instance, a file with tabular rainfall records and an R script that performs statistical analyses on population data. Third, for many of these formats no commonly used identifiers (be it PUID or even MIME type) exist. XML (which is also text-based) has similar issues, and a blog post by Asger Blekinge suggests improving the characterisation of XML by using namespace information.
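The namespace idea mentioned for XML can be illustrated with a minimal Python sketch. It is not taken from the blog post referred to above; the function name `root_namespace` is hypothetical, and the sketch only assumes that the namespace of the root element is a useful hint about the kind of XML document at hand.

```python
import io
import xml.etree.ElementTree as ET


def root_namespace(xml_bytes):
    """Return (namespace URI or None, local name) of the root element.

    Hypothetical helper for illustration only. iterparse is used so that
    parsing can stop at the first start tag instead of building the
    whole tree.
    """
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        tag = elem.tag
        if tag.startswith("{"):
            # ElementTree writes qualified names as "{namespace}local".
            namespace, _, local = tag[1:].partition("}")
            return namespace, local
        return None, tag
    return None, None


if __name__ == "__main__":
    sample = b'<svg xmlns="http://www.w3.org/2000/svg"><g/></svg>'
    print(root_namespace(sample))
```

A characterisation tool could map the returned namespace URI to a more specific format label than plain ‘XML’; which namespaces to recognise would have to come from a registry or similar source.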
Scalability Challenge
Solution should work for large volumes of data.
Issue champion Johan van der Knijff (KB)
Other interested parties
 
Possible Solution approaches A possible solution would be to come up with improved characterisation tools or methods that are able to provide more specific information on text-based formats. One possible approach would be to investigate the potential of automatic language identification algorithms (e.g. Python, Perl and R scripts each use a characteristic vocabulary). Since many of the affected formats are known only to specific audiences and are not covered by any existing registry, we would ideally also need a way to collect information about them. A possibility would be a lightweight registry (probably in the form of a simple Wiki) that would allow the specialised communities that are using these formats to contribute information on some of the more obscure formats directly. The scope of such a registry should not be limited to text-based formats, but should be open to any scientific data format.
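As a rough illustration of the vocabulary-based approach, the following Python sketch scores a text file against small keyword sets for Python, Perl and R. The keyword lists and the names `VOCABULARIES` and `guess_language` are assumptions made for this example only; a real tool would more likely derive its vocabularies from a training corpus and would need to cover many more formats.

```python
import re
from collections import Counter

# Hypothetical, hand-picked vocabularies (an assumption of this sketch);
# they are deliberately small and would need tuning on real data.
VOCABULARIES = {
    "python": {"def", "import", "elif", "self", "None", "lambda", "except"},
    "perl": {"my", "sub", "use", "foreach", "elsif", "unless", "chomp"},
    "r": {"function", "library", "TRUE", "FALSE", "NULL", "require", "paste"},
}

# Identifier-like tokens; numbers and punctuation are ignored.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def guess_language(text):
    """Return (best guess or None, per-language scores).

    The score of a language is the fraction of tokens in the text that
    occur in that language's vocabulary.
    """
    counts = Counter(TOKEN.findall(text))
    total = sum(counts.values())
    if total == 0:
        return None, {}
    scores = {
        lang: sum(counts[word] for word in vocab) / total
        for lang, vocab in VOCABULARIES.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # No vocabulary matched at all: report "unknown" rather than guess.
        return None, scores
    return best, scores


if __name__ == "__main__":
    perl_sample = "use strict;\nmy $x = 1;\nforeach my $y (@list) { chomp $y; }\n"
    print(guess_language(perl_sample)[0])
```

Because each file is scored independently with a single pass over its tokens, this kind of check should be straightforward to run in parallel over large collections. Tabular data files with few or no keywords return no guess, so they would need a separate heuristic.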
Context Note that this scenario could provide an interesting link to the Scientific Data Sets scenarios.
Lessons Learned Notes on Lessons Learned from tackling this Issue that might be useful to inform the development of Future Additional Best Practices, Task 8 (SCAPE TU.WP.1 Dissemination and Promotion of Best Practices)
Training Needs Is there a need for providing training for the Solution(s) associated with this Issue? Notes added here will provide guidance to the SCAPE TU.WP.3 Sustainability WP.
Datasets To be confirmed
Solutions Reference to the appropriate Solution page(s), by hyperlink

Evaluation

Objectives Coverage, preciseness, reliability, scalability
Success criteria Ability to make more precise distinctions between a wide range of text-based formats
Automatic measures Identify ## % of some_data_set correctly
Manual assessment Easily installable; supported by high quality user documentation; usable by non-developers
Actual evaluations links to actual evaluations of this Issue/Scenario
Labels: identification, lsdr, webarchive, issue, unknown_file_formats