|| Characterisation and validation of very large data files
|Detailed description||To ensure that the data to be preserved is of adequate quality, structural/syntactic verification and characterisation are needed when ingesting data into the repository|
| Scalability Challenge
|| Content size: traditional NeXus format validation tools are not designed for large data files (tens to hundreds of GB per file): many such tools take a long time to validate a NeXus file, and some fail outright when presented with large files.
Volume of content: at peak times, ISIS generates data files concurrently across 40+ instruments, amounting to hundreds of MB/s. In the coming years, as existing instruments are upgraded and new instruments are introduced, this volume is likely to increase to 1 GB/s.
Complexity of content: although the main data file format at ISIS is standardised (principally the NeXus format), each instrument also generates other types of data files that are essential for downstream processing. The complication is that these other file types vary between instruments.
Note that, because of the long time taken to validate NeXus files, in practice validation is often done offline on data files already on disk, before they are ingested into a data archive. It is also possible to validate these files occasionally once they are in the archive.
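Since NeXus files are HDF5 containers, one way to mitigate the cost of full validation is a cheap structural pre-check that rejects obviously broken files before the slow pass. A minimal sketch in Python, using only the standard library; it checks only the HDF5 superblock signature (which may sit at offset 0 or, if the file has a user block, at doubling offsets 512, 1024, ...), so it runs in constant time regardless of file size. This is an illustrative assumption, not a substitute for a real validation tool:

```python
import os

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # 8-byte signature defined by the HDF5 format spec

def quick_hdf5_check(path):
    """Cheap structural pre-check: confirm the HDF5 signature is present.

    The signature may appear at offset 0 or, when the file carries a user
    block, at offsets 512, 1024, 2048, ... (doubling). Only a handful of
    bytes are read, so the check is O(1) even for 100s-of-GB files.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            f.seek(offset)
            if f.read(8) == HDF5_SIGNATURE:
                return True
            offset = 512 if offset == 0 else offset * 2
    return False
```

A file failing this check can be quarantined immediately, leaving the expensive semantic validation to run only on structurally plausible files.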
|Issue champion||Erica Yang (STFC)|
| Other interested parties
||Any other parties who are also interested in applying Issue Solutions to their Datasets. Identify the party with a link to their contact page on the SCAPE SharePoint site, as well as identifying their institution in brackets, e.g. Schlarb Sven (ONB)|
|Possible Solution approaches||Use of existing characterisation tools, such as JHOVE with appropriate extension or plug-in for NeXus data format.|
|Context|| Details of the institutional context of the Issue. (May be expanded at a later date)
|Datasets|| NeXus data files
|Solutions||Reference to the appropriate Solution page(s), by hyperlink|
|Evaluation Objectives||scalability, automation|
|Success criteria||the tools validate and characterise very large data files (tens to hundreds of GB) reliably and within acceptable time|
|Automatic measures|| What automated measures would you like the solution to give to evaluate the solution for this specific issue? Which measures are important?
If possible, specify very specific measures and your goal, e.g.:
* handle files of up to 100 GB without crashing
* fail safe: even if the tool fails, it fails gracefully rather than crashing, i.e. effective error handling should be in place so that the environment hosting the tool can capture the error and notify other services that interact with it
* can characterise very large data files concurrently with the characterisation of small files (up to 5 concurrent threads)
* can perform characterisation and validation at a data rate of up to 100 MB/s
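The fail-safe and concurrency measures above can be sketched as a bounded thread pool in which a single file's failure is captured as a structured result rather than aborting the whole batch. A minimal illustration in Python; `characterise` is a hypothetical stand-in for the real tool invocation (e.g. a JHOVE run with a NeXus plug-in), not an actual API:

```python
import concurrent.futures
import traceback

def characterise(path):
    """Hypothetical placeholder for the real characterisation/validation
    routine (e.g. invoking JHOVE with a NeXus module)."""
    raise NotImplementedError

def characterise_all(paths, worker=characterise, max_workers=5):
    """Characterise many files with bounded concurrency (default 5 threads).

    Every exception raised by a worker is caught and recorded per file,
    so the batch never crashes and a hosting environment can inspect the
    error reports afterwards (the 'fail safe' measure above).
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        for fut in concurrent.futures.as_completed(futures):
            path = futures[fut]
            try:
                results[path] = {"status": "ok", "report": fut.result()}
            except Exception as exc:
                results[path] = {"status": "error", "error": repr(exc),
                                 "trace": traceback.format_exc()}
    return results
```

Mixing large and small files in the same pool exercises the concurrent-characterisation measure; the per-file error records give the automated evaluation something concrete to count.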
|Manual assessment|| N.A.
|Actual evaluations||links to actual evaluations of this Issue/Scenario|