Title
Fixity capturing and checking of very large data files
Detailed description The need to capture fixity information (e.g. checksums) to ensure the continuing integrity of data files to be preserved.
Scalability Challenge
Content size: traditional checksum tools are not designed for large data files (tens to hundreds of GB per file): many take a long time to generate a checksum, and some fail outright when presented with such files. In terms of implementation, many modern programming languages, such as Java, are constrained by the memory limits of their runtime; a 32-bit JVM, for example, cannot address more than about 4 GB of memory, so it cannot hold such a file in memory at once. Processing is also bounded by the memory of the machine that runs the program.
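One way around the memory limit is to stream the file through the digest in fixed-size chunks, so heap use stays constant regardless of file size. The sketch below is a minimal illustration of that idea using Java's standard MessageDigest API; the class name, the 1 MB buffer size, and the example file path are assumptions for illustration, not part of any SCAPE tool.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StreamingChecksum {

    /**
     * Computes a checksum of an arbitrarily large file by reading it in
     * fixed-size chunks, so heap use is constant regardless of file size.
     */
    public static String checksum(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[1024 * 1024]; // 1 MB chunk size (assumed)
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical file path, for illustration only.
        System.out.println(checksum(Paths.get("/data/run-12345.nxs"), "SHA-1"));
    }
}
```

Because only one buffer is resident at a time, the same code handles a 100 GB file and a 100 KB file; throughput is then bounded by disk I/O and the digest implementation rather than by heap size.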

Volume of content: at peak times, ISIS generates data files concurrently across 40+ instruments, amounting to hundreds of MB/s. In the coming years, with the upgrading of existing instruments and the introduction of new ones, the volume is likely to increase to 1 GB/s.

Complexity of content: although the main data file format in ISIS is standardised (chiefly the NeXus format), each instrument also generates other types of data files that are essential for downstream processing. The complication is that these other file types vary between instruments.

Note that: because of the long time taken to generate a checksum (at 100 MB/s, a 100 GB file takes roughly 17 minutes), in practice fixity capturing and validation are often performed offline on data files already on disk, after they have been catalogued and before they are ingested into a data archive. It is also possible to validate the integrity of these files occasionally once they are in the archive.
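The occasional re-validation mentioned above amounts to recomputing the checksum and comparing it with the value recorded at capture time. A minimal sketch, reusing the hypothetical StreamingChecksum.checksum helper from the previous example:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.security.NoSuchAlgorithmException;

public class FixityCheck {

    /**
     * Re-validates a file against the checksum recorded at capture time.
     * Returns true if the stored and freshly computed values match.
     */
    public static boolean verify(Path file, String recordedHex, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        String currentHex = StreamingChecksum.checksum(file, algorithm);
        return currentHex.equalsIgnoreCase(recordedHex);
    }
}
```

Where the recorded checksums live (catalogue entry, sidecar manifest, archive metadata) is a deployment decision outside the scope of this sketch.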
Issue champion Erica Yang (STFC)
Other interested parties
Any other parties who are also interested in applying Issue Solutions to their datasets. Identify the party with a link to their contact page on the SCAPE SharePoint site, and identify their institution in brackets, e.g. Schlarb Sven (ONB).
Possible Solution approaches Use of commonly used fixity calculation algorithms (e.g. MD5, SHA-1)
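Both algorithms are exposed through the same JDK API (MessageDigest.getInstance("MD5") and MessageDigest.getInstance("SHA-1")). Since reading a multi-hundred-GB file twice is expensive, one option, sketched below under the same chunked-reading assumptions as above, is to update several digests in a single pass over the file:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

public class MultiDigest {

    /** Computes MD5 and SHA-1 together so the file is read only once. */
    public static Map<String, byte[]> digests(Path file)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[1024 * 1024];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);  // both digests consume
                sha1.update(buffer, 0, read); // the same chunk
            }
        }
        Map<String, byte[]> result = new LinkedHashMap<String, byte[]>();
        result.put("MD5", md5.digest());
        result.put("SHA-1", sha1.digest());
        return result;
    }
}
```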
Datasets NeXus files
Solutions Reference to the appropriate Solution page(s), by hyperlink
Evaluation Objectives scalability, automation
Success criteria Describe the success criteria for solving this issue: what are you able to do? What does the world look like?
Automatic measures What automated measures would you like the solution to provide in order to evaluate it against this specific issue? Which measures are important?
If possible, specify very specific measures and your goal, e.g.:
* handle files of up to 100 GB without crashing
* fail safe - even if the tool fails, it fails gracefully rather than crashing, i.e. effective error handling should be in place so that the environment hosting the tool can capture the error and notify other services that interact with the tool
* generate checksums for very large data files concurrently with checksum processing of small files (up to 5 concurrent threads; see the sketch after this list)
* perform fixity capturing/checking at a data rate of up to 100 MB/s
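As a rough illustration of the last three measures, the sketch below runs fixity tasks through a bounded thread pool with fail-safe error handling. The pool size of 5 comes directly from the list above; the class name and the choice of SHA-1 are assumptions, and it again reuses the hypothetical StreamingChecksum helper.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentFixity {

    public static void checksumAll(List<Path> files) throws InterruptedException {
        // At most 5 digest tasks run at once, per the measure above; large
        // and small files simply share the same queue.
        ExecutorService pool = Executors.newFixedThreadPool(5);
        for (final Path file : files) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        String hex = StreamingChecksum.checksum(file, "SHA-1");
                        System.out.println(hex + "  " + file);
                    } catch (Exception e) {
                        // Fail safe: record the error and keep going; a real
                        // deployment would notify the hosting environment
                        // rather than let one bad file crash the run.
                        System.err.println("FIXITY-ERROR " + file + ": " + e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    }
}
```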
Manual assessment N.A.
Actual evaluations Links to actual evaluations of this Issue/Scenario
Labels: issue, integrity