
Collection:

Title
Nexus data files from instruments
Description These are data files captured directly from instruments. They contain measurements collected from instrument detectors. There is no typical detector count or data size for an instrument; at the STFC ISIS facility, for example, the number of detectors ranges from several thousand to a quarter of a million. The files are typically in RAW or NeXus format. The latter is an international standard for the neutron and synchrotron communities; the former is facility specific, and many historic data files use it. Increasingly, NeXus is being adopted as the standard format for instrument data.
Licensing See the STFC Data Policy for the SCAPE project
Owner STFC
Dataset Location https://scapeweb.esc.rl.ac.uk/

(please get in touch with STFC to access the data)

Collection expert Erica Yang (STFC)
Issues brainstorm These are individual data files produced by experiments; each file records a single experimental run. On their own they do not contain enough information for anyone to process them: in the STFC ISIS case they are essentially neutron counts. They are raw data, containing errors and noise that must be removed before analysis. They therefore have to be preserved alongside contextual information describing where they were produced (e.g. which instrument), when they were produced (which ISIS cycle), and for which experiment. This information makes it possible to link the raw files to related files generated at the same time during an experiment.

Other contextual information that needs to be preserved includes the software needed to process the files and the samples used to produce them.


List of Issues

Issues:

Issue 1

Title
Characterisation and validation of very large data files
Detailed description In order to ensure that the data to be preserved is of adequate quality, there is a need for structural/syntactic verification and characterisation when ingesting data into the repository.
Scalability Challenge
Content size: traditional NeXus format validation tools are not designed for large data files (tens to hundreds of GB per file): many such tools take a long time to validate a NeXus file, and some fail outright on large files.

Volume of content: at peak times, ISIS generates data files concurrently across 40+ instruments, amounting to hundreds of MB/s. In the coming years, as existing instruments are upgraded and new instruments are introduced, the rate is likely to increase to 1 GB/s.

Complexity of content: although the main data file format at ISIS is standardised (chiefly NeXus), each instrument also generates other types of data files that are essential for downstream processing. The complication is that these other file types vary between instruments.

Note that, because of the time taken to validate NeXus files, validation is in practice often done offline against data files already on disk, before they are ingested into a data archive. It is also possible to validate these files occasionally once they are in the archive.
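
For illustration: NeXus files are typically HDF5 containers, so a characterisation pass can walk the file's metadata tree without reading detector payloads, which keeps the cost low even for files of tens to hundreds of GB. The sketch below is a minimal Python example using h5py; it is not JHOVE or an existing ISIS tool, and the report fields it collects are illustrative assumptions.

import h5py

def characterise(path):
    """Walk the HDF5 tree and record structure without reading dataset payloads."""
    report = {"groups": [], "datasets": []}

    def visit(name, obj):
        if isinstance(obj, h5py.Group):
            # The NX_class attribute identifies NeXus group types, e.g. NXentry, NXdata.
            nx_class = obj.attrs.get("NX_class", "")
            if isinstance(nx_class, bytes):
                nx_class = nx_class.decode()
            report["groups"].append({"name": name, "NX_class": nx_class})
        elif isinstance(obj, h5py.Dataset):
            # Shape and dtype come from file metadata; detector counts are not loaded.
            report["datasets"].append(
                {"name": name, "shape": obj.shape, "dtype": str(obj.dtype)}
            )

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return report

Because only metadata is touched, the run time of such a walk depends on the number of groups and datasets rather than on the raw size of the detector data.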
Issue champion Erica Yang (STFC)
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify the party with a link to their contact page on the SCAPE Sharepoint site, as well as identifying their institution in brackets. Eg: Schlarb Sven (ONB)
Possible Solution approaches Use of existing characterisation tools, such as JHOVE, with an appropriate extension or plug-in for the NeXus data format.
Context Details of the institutional context to the Issue. (May be expanded at a later date)
Datasets NeXus data files
Solutions Reference to the appropriate Solution page(s), by hyperlink
Evaluation Objectives scalability, automation
Success criteria The tools work with very large data files.
Automatic measures What automated measures would you like the solution to give to evaluate the solution for this specific issue? Which measures are important?
If possible, specify very specific measures and your goal, e.g.:
* handle files of up to 100 GB without crashing
* fail safe: even if the tool fails, it fails gracefully rather than crashing, i.e. effective error handling is in place so that the environment hosting the tool can capture the error and notify other services that interact with the tool
* can characterise very large data files concurrently with characterisation of small files (up to 5 concurrent threads)
* can perform characterisation and validation at a data rate of up to 100 MB/s
Manual assessment N.A.
Actual evaluations Links to actual evaluations of this Issue/Scenario

Issue 2

Title
Fixity capturing and checking of very large data files
Detailed description Need to capture fixity information (e.g. checksums) to ensure the continuing integrity of the data files to be preserved.
Scalability Challenge
Content size: traditional checksum tools are not designed for large data files (tens to hundreds of GB per file): many take a long time to generate a checksum, and some fail outright on large files. In terms of implementation, many modern programming languages, such as Java, are constrained by the memory limits of their virtual machines; for example, a JVM may only be able to hold a file of up to around 4 GB in memory at a time. Processing is also bounded by the capacity of the machine that runs the program.

Volume of content: at peak times, ISIS generates data files concurrently across 40+ instruments, amounting to hundreds of MB/s. In the coming years, as existing instruments are upgraded and new instruments are introduced, the rate is likely to increase to 1 GB/s.

Complexity of content: although the main data file format at ISIS is standardised (chiefly NeXus), each instrument also generates other types of data files that are essential for downstream processing. The complication is that these other file types vary between instruments.

Note that, because of the time taken to generate checksums, fixity capture and validation are in practice often done offline against data files already on disk, after they have been catalogued and before they are ingested into a data archive. It is also possible to validate the integrity of these files occasionally once they are in the archive.
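
As an illustration of how the memory limitation can be avoided, the sketch below (Python; the function name and 8 MB chunk size are assumptions) computes a checksum in a streaming fashion, so memory use does not grow with file size. The same pattern applies to MD5, SHA-1 or any other algorithm supported by hashlib.

import hashlib

def file_checksum(path, algorithm="sha1", chunk_size=8 * 1024 * 1024):
    """Return the hex digest of the file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

With this approach the limiting factor becomes disk and hash throughput rather than available memory.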
Issue champion Erica Yang (STFC)
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify the party with a link to their contact page on the SCAPE Sharepoint site, as well as identifying their institution in brackets. Eg: Schlarb Sven (ONB)
Possible Solution approaches Use of commonly used fixity calculation algorithms (e.g. MD5, SHA-1)
Datasets NeXus data files
Solutions Reference to the appropriate Solution page(s), by hyperlink
Evaluation Objectives scalability, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to give to evaluate the solution for this specific issue? Which measures are important?
If possible, specify very specific measures and your goal, e.g.:
* handle files of up to 100 GB without crashing
* fail safe: even if the tool fails, it fails gracefully rather than crashing, i.e. effective error handling is in place so that the environment hosting the tool can capture the error and notify other services that interact with the tool
* can generate checksums for very large data files concurrently with checksum processing of small files (up to 5 concurrent threads); see the sketch after this list
* can perform fixity capture/checking at a data rate of up to 100 MB/s
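
The concurrency and fail-safe measures above could be exercised with a small harness along the following lines. This is only a sketch: it reuses the hypothetical file_checksum function shown earlier, and the pool of 5 workers mirrors the measure above rather than the behaviour of any existing tool.

from concurrent.futures import ThreadPoolExecutor, as_completed

def checksum_batch(paths, max_workers=5):
    """Checksum many files with up to 5 concurrent workers, failing gracefully per file."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(file_checksum, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # Fail safe: record the error for this file and keep processing the rest.
                errors[path] = str(exc)
    return results, errors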
Manual assessment N.A.
Actual evaluations Links to actual evaluations of this Issue/Scenario

Issue 3

Title
IS31 Semantic checking of very large data files
Detailed description In order to ensure that the data to be preserved is of adequate quality, the contents of NeXus data files need to be validated for correctness against a given data model. Each data model is specified in a NeXus Definition Language (NXDL) file and contains assertions that define the expected content of a NeXus file. For example, a data model could define a metadata element (key-value pair) called “Integral” to represent the total integral monitor counts for a grazing incidence small angle (GISAS) diffractometer, for either X-rays or neutrons. In this data model, “Integral” would have an integer data type, so for a NeXus data file conforming to the model it would be necessary to validate that the value(s) assigned to “Integral” are of the appropriate data type.
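
As an illustration of such a check against an HDF5-based NeXus file, the sketch below uses h5py and numpy. The dataset path and the hard-coded integer expectation are hypothetical stand-ins for information that would, in practice, be read from the NXDL definition.

import h5py
import numpy as np

def check_integer_field(path, dataset_path="entry/monitor/integral"):
    """Return True if the named dataset exists and holds integer values."""
    # In a real check the path and expected type would come from the NXDL model,
    # not be hard-coded as they are in this illustrative sketch.
    with h5py.File(path, "r") as f:
        if dataset_path not in f:
            return False
        return np.issubdtype(f[dataset_path].dtype, np.integer)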
Issue champion Simon Lambert (STFC)
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify the party with a link to their contact page on the SCAPE Sharepoint site, as well as identifying their institution in brackets. Eg: Schlarb Sven (ONB)
Possible Solution approaches Use of the NeXus validation toolkit - developed and used by the NeXus community - as part of the preservation ingest workflow.
Datasets NeXus data files
Solutions Reference to the appropriate Solution page(s), by hyperlink
Evaluation Objectives scalability, automation
Success criteria Describe the success criteria for solving this issue - what are you able to do? - what does the world look like?
Automatic measures What automated measures would you like the solution to give to evaluate the solution for this specific issue? Which measures are important?
If possible, specify very specific measures and your goal, e.g.:
* handle files of up to 100 GB without crashing
* fail safe: even if the tool fails, it fails gracefully rather than crashing, i.e. effective error handling is in place so that the environment hosting the tool can capture the error and notify other services that interact with the tool
* can check very large data files concurrently with semantic checking of small files (up to 5 concurrent threads)
* can perform semantic checking at a data rate of up to 100 MB/s
Manual assessment N.A.
Actual evaluations Links to actual evaluations of this Issue/Scenario

Solutions:

Solution 1

Title SO20 Extending JHOVE to characterise NeXus data format
Detailed description Use of existing characterisation tools, such as JHOVE, with an appropriate extension or plug-in for the NeXus data format.
Solution Champion
Holly Zhen (STFC)
Corresponding Issue(s)
Evaluation
 

Solution 2

Title Extending the NeXus validation toolkit to cope with very large data files
Detailed description Use of the NeXus validation toolkit - developed and used by the NeXus community - as part of the preservation ingest workflow.
Solution Champion
Erica Yang (STFC)
Corresponding Issue(s)
Evaluation
Any notes or links on how the solution performed. This will be developed and formalised by the Testbed SP.
Labels: scenario, researchdata, rdscenarios