|| IS31 Semantic checking of very large data files
|Detailed description||To ensure that the data to be preserved is of adequate quality, the contents of NeXus data files need to be validated for correctness against a given data model. Each data model is specified in a NeXus Definition Language (NXDL) file and contains assertions that define the expected content of a NeXus file. For example, a data model could define a metadata element (key-value pair) called “Integral” to represent the total integral monitor counts for a grazing incidence small-angle scattering (GISAS) diffractometer, for either X-rays or neutrons. In this data model, the data type of “Integral” would be an integer, so for any NeXus data file claiming conformance to it, it would be necessary to check that the value(s) assigned to “Integral” are indeed of integer type.|
|Issue champion||Simon Lambert (STFC)|
|Other interested parties||Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify each party with a link to their contact page on the SCAPE SharePoint site, and give their institution in brackets, e.g. Schlarb Sven (ONB).|
|Possible Solution approaches||Use of the NeXus validation toolkit - developed and used by the NeXus community - as part of the preservation ingest workflow.|
|Datasets||NeXus data files|
|Solutions||Reference to the appropriate Solution page(s), by hyperlink|
|Evaluation Objectives||Scalability, automation|
|Success criteria||Describe the success criteria for solving this issue: what are you able to do? What does the world look like?|
|Automatic measures||What automated measures would you like the solution to provide in order to evaluate it for this specific issue? Which measures are important?
If possible, specify very concrete measures and goals, e.g.:
* handles files of up to 100 GB without crashing
* fails safe: even if it fails, it fails gracefully rather than crashing, i.e. effective error handling is in place so that the environment hosting the tool can capture the error and notify other services that interact with the tool
* can check very large data files concurrently with semantic checking of small files (up to 5 concurrent threads)
* can perform semantic checking at a data rate of up to 100 MB/s|
|Manual assessment||N/A|
|Actual evaluations||Links to actual evaluations of this Issue/Scenario|
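The “Integral” type check described in the Detailed description can be sketched as follows. This is a minimal, simplified illustration: the model and the file metadata are represented as plain dictionaries (in practice the model would come from an NXDL file and the values from a NeXus/HDF5 file, e.g. read with h5py), and the field name and expected type are taken from the example above.

```python
# Hypothetical, simplified data model: maps required field names to
# the Python type their values must have. In reality this would be
# derived from the assertions in an NXDL definition file.
MODEL = {"Integral": int}  # total monitor counts must be an integer

def validate(metadata, model=MODEL):
    """Check metadata (name -> value) against the model.

    Returns a list of error messages; an empty list means the
    metadata conforms to the model.
    """
    errors = []
    for name, expected_type in model.items():
        if name not in metadata:
            errors.append(f"missing required field: {name}")
        elif not isinstance(metadata[name], expected_type):
            errors.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(metadata[name]).__name__}"
            )
    return errors
```

For example, `validate({"Integral": 12345})` returns no errors, while a file storing `"Integral"` as a floating-point value would fail the check.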
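The “fail safe” and concurrency measures listed under Automatic measures could be realised along these lines. This is a sketch, not the actual toolkit: `checker` stands in for a real NeXus semantic validator, and the per-file error capture is what allows a hosting environment to be notified of failures instead of crashing.

```python
# Validate many files with a bounded thread pool (up to 5 workers,
# matching the concurrency measure above), converting any per-file
# exception into an error record instead of letting it propagate.
from concurrent.futures import ThreadPoolExecutor

def safe_check(path, checker):
    """Run checker(path); never raise, always return a status record."""
    try:
        return {"path": path, "ok": checker(path), "error": None}
    except Exception as exc:  # graceful failure: capture, do not crash
        return {"path": path, "ok": False, "error": str(exc)}

def check_files(paths, checker, max_threads=5):
    """Check all paths concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(lambda p: safe_check(p, checker), paths))
```

A workflow engine ingesting the results can then report per-file errors to other services while the remaining files continue to be checked.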