Rune Ferneke-Nielsen, SB
In this evaluation, Evaluation 1, we would like to get an indication of how fast we can validate a set of JPEG2000 files against our institutional policy. The validation step is only one part of a larger workflow, but the result should still give us a good indication.
The image files are accessed via an NFS mount, which is a valid environment configuration at SB. How the image files come to appear on the NFS mount (e.g. from tape storage) before being accessed is out of scope for this experiment, and it may require several manual steps. Another important aspect is that the provenance data from the validation should be stored in a repository; this is not part of this evaluation. We will look further into this aspect in the second evaluation, Evaluation 2.
|Metric||Description||Metric baseline||Metric goal||February 24, 2014|
|NumberOfObjectsPerHour|| Performance efficiency - Capacity / Time behaviour
Number of newspaper pages (i.e. meta data) being validated against an organisational policy
(split size: 20000)
|NumberOfFailedFiles|| Reliability - Runtime stability
Number of files failed, in jpylyzer step and/or policy comparison step
Note: Zoomed version.
The above graph is generated from data:
|execution time (s)||2592||1344||1065||939||865||861||830||788||828||895||754||895||1710||2045||3960||6360|
When using a split size between 10 and 1600, it takes approximately 1000 seconds to process the entire data set (17978 files). This means that we can process well over 20000 files within an hour (the number is closer to 65000), and the metric goal has therefore been reached in the first iteration of this experiment.
In the worst-case scenario, where one process does all the work, it takes 6360 seconds to process the entire data set. This means that we can process approximately 10000 files within an hour, which is also enough to reach the metric goal.
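The hourly figures above follow from scaling the measured run times to 3600 seconds. A minimal sketch of that arithmetic, using only the numbers quoted in this evaluation (the function and constant names are illustrative and not part of the experiment code):

```python
# Figures taken from the text of this evaluation.
TOTAL_FILES = 17978          # size of the data set
METRIC_GOAL_PER_HOUR = 5000  # metric goal quoted in the conclusion
SECONDS_PER_HOUR = 3600


def files_per_hour(total_files: int, elapsed_seconds: float) -> float:
    """Scale one observed run to an hourly throughput."""
    return total_files / elapsed_seconds * SECONDS_PER_HOUR


# Approximate run time for split sizes between 10 and 1600.
typical = files_per_hour(TOTAL_FILES, 1000)
# Run time when a single process does all the work.
worst_case = files_per_hour(TOTAL_FILES, 6360)

print(f"typical:    {typical:,.0f} files/hour")
print(f"worst case: {worst_case:,.0f} files/hour")
print(f"goal met in worst case: {worst_case >= METRIC_GOAL_PER_HOUR}")
```

This gives roughly 64700 files per hour for the typical case and roughly 10200 for the worst case, matching the rounded figures in the text.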
Reliability - Stability indicators
- A Scape component for converting tool-specific output into a scape-generic format is under development, and could therefore not be used in this experiment. Instead, a minimal implementation was created as part of the experiment, so that it was possible to execute and evaluate the experiment (concern).
- A Scape component for comparing the scape-generic format with an organisational policy is under development, so a simple implementation was used. This component is also somewhat dependent on the tool-specific conversion component described above (concern).
- It is uncertain whether the Scape modules (for jpylyzer and organisational policy) have an active community (concern).
- Java and Hadoop both have proved usable as real-life systems, and have an active community (no concern).
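To illustrate what a simple policy comparison step of this kind might look like, the sketch below checks a flat set of extracted properties against expected values. It is not the implementation used in the experiment: the property names, the policy values and the function name are all hypothetical, and the real component works on the scape-generic format rather than a Python dictionary.

```python
from typing import Dict, List

# Hypothetical policy: property names and expected values are illustrative only.
POLICY: Dict[str, object] = {
    "isValid": True,
    "transformation": "5-3 reversible",
    "layers": 1,
}


def policy_violations(properties: Dict[str, object],
                      policy: Dict[str, object]) -> List[str]:
    """Return one message per policy property that is missing or differs."""
    violations = []
    for name, expected in policy.items():
        if name not in properties:
            violations.append(f"{name}: missing (expected {expected!r})")
        elif properties[name] != expected:
            violations.append(
                f"{name}: got {properties[name]!r}, expected {expected!r}"
            )
    return violations


# A file whose extracted properties differ from the policy in one place.
extracted = {"isValid": True, "transformation": "9-7 irreversible", "layers": 1}
for message in policy_violations(extracted, POLICY):
    print(message)
```

A file passes when the returned list is empty; under this assumption, any non-empty list would count the file towards NumberOfFailedFiles.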
Functional suitability - Correctness
- Software packages / modules handling organisational policy are under development and as such correctness cannot be verified (concern).
Organisational maturity - Dimensions of maturity: Awareness and Communication; Policies, Plans and Procedures; Tools and Automation; Skills and Expertise; Responsibility and Accountability; Goal Setting and Measurement
- No planning or monitoring is present.
Maintainability - Reusability
- The Jpylyzer tool is placed in a repository, making it easily accessible and reusable. Other Scape packages / modules are still to be made reusable. The OPF organisation is working towards a solution for handling the lifecycle of preservation tools (no concern).
- The policy validation can be handled in a number of different ways, but to make it automatic and machine-readable it still requires technical staff (no concern).
Maintainability - Organisational fit
- Given that the organisation has suitable technical staff, such as software developers, the approach is viable (no concern).
- The Jpylyzer tool is in use at the State and University Library, being used actively in a quality assurance process.
Functional suitability - Completeness
- Only one input format is in play, and there are no plans to expand.
Planning and monitoring efficiency - Information gathering and decision making effort
- No planning or monitoring is present.
The implementation can be found on GitHub: statsbiblioteket/scape-jp2-qa
Use the tag scape_evaluation_1 for actual code point: git checkout scape_evaluation_1
- Output from the Hadoop job can be found here.
- Timings for running the jpylyzer tool as a single process can be found here
- Timings for running the md5sum tool as a single process can be found here
From this first evaluation, we have found that the solution for policy-driven validation is appropriate and can easily handle the forecasted load. The processing handles around 65000 image files every hour, which is far more than the metric goal of 5000 image files every hour.
This is a good result, and we can move forward towards an integrated solution that incorporates extracting and storing data via our repository. The large margin between 5000 and 65000 is also very positive: extracting and storing data will take additional time, and the integrated solution must still reach the metric goal.
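The margin can also be expressed as a time budget per file. The sketch below uses only figures from this evaluation (17978 files in roughly 1000 seconds, and a goal of 5000 files per hour). It assumes that validation, extraction and storage would run one after the other for each file, which is a simplification, and the variable names are illustrative.

```python
TOTAL_FILES = 17978          # size of the data set
VALIDATION_SECONDS = 1000    # approximate validation time for the whole set
METRIC_GOAL_PER_HOUR = 5000  # metric goal

# Time available per file if exactly the metric goal is to be met.
budget_per_file = 3600 / METRIC_GOAL_PER_HOUR
# Time currently spent per file on validation alone.
validation_per_file = VALIDATION_SECONDS / TOTAL_FILES
# Time left per file for extracting and storing data.
remaining_per_file = budget_per_file - validation_per_file

print(f"budget per file:     {budget_per_file:.3f} s")
print(f"validation per file: {validation_per_file:.3f} s")
print(f"left for extraction and storage: {remaining_per_file:.3f} s")
```

Under these assumptions, validation uses about 0.06 s of a 0.72 s budget per file, leaving about 0.66 s per file for the repository steps.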