SCAPE has a work package dedicated to evaluation of results: TB.WP4 (WP18) Evaluation of Results.
This document completes the first milestone in this WP: MS76: Draft SCAPE evaluation methodology
The purpose of this WP is described in the DoW as:
- Monitor progress towards project-level objectives through the work carried out in the 3 Testbeds
- Periodically assess the state of progress and feed into development
This is done by carrying out the following tasks:
- Develop an evaluation methodology
- Produce a set of metrics, evaluation methods and timetable
  - Statistical measurements
  - Subjective assessment
- Define a process for the handover of evaluation data from the Testbeds (and all components within the testbed workflows) and other relevant components (e.g. Planning and Watch)
- Develop an evaluation plan
- Support the execution of the plan in each testbed
- Provide feedback and gap-analysis back to the developments in SCAPE
This initial milestone defines the first draft of the evaluation methodology and outlines the rough plans in an overall timetable. Both elements will evolve with the project throughout the entire timeframe of SCAPE.
The evaluations will be linked to Issues, Datasets and Solutions on the OPF-hosted WIKI where appropriate. The evaluation methodology also needs to support evaluation of the platform (most likely done indirectly through the testbeds) as well as Planning and Watch.
We need controlled datasets to be able to monitor progress in workflow and tool development on a set of controlled corpora. The Action Components WP has faced a similar problem and has already dealt with it.
We also need a stable computation environment to be able to analyze evaluation data over time. This should be supported by the central instances hosted at IMF and AIT.
The evaluation methodology needs to support evaluation of results from different instances of the platform with quite different technical specifications (number of CPUs, number of nodes, amount of RAM per node, etc.).
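One way to make results from differently sized platform instances comparable is to normalise raw throughput by the hardware it ran on. The sketch below assumes a normalisation by core-hours; that unit is an illustrative choice, not a metric prescribed by the methodology.

```python
def normalised_throughput(objects, seconds, nodes, cores_per_node):
    """Normalise raw throughput (objects processed) by the hardware it
    ran on, so that results from platform instances with different
    technical specifications can be compared.

    The chosen unit, objects per core-hour, is an assumption of this
    sketch, not a metric prescribed by the methodology.
    """
    core_hours = nodes * cores_per_node * seconds / 3600.0
    return objects / core_hours

# 1000 objects in one hour on a 2-node, 4-core-per-node instance:
# 8 core-hours, i.e. 125.0 objects per core-hour
print(normalised_throughput(1000, 3600, 2, 4))
```

A single-machine baseline and a large cluster run can then be compared on the same scale, at the cost of ignoring per-node differences such as RAM, which would need their own normalisation if they dominate.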
We need to evaluate / measure things like:
- Improved functionality
- File format coverage
- Stability / error-handling
- Reliability and robustness
- Number of known properties
- Handling of compound objects
- Scalability in terms of
  - Number of objects processed
  - Size of objects
  - Complexity of objects
  - Heterogeneity of collections
- Speed of single tools / workflows
- Operational costs – capacity utilization, prioritization
- Technical measures
  - Disk I/O-usage
- Usability of workflow design (mentioned in DoW) (COMMENT1)
- Do the solutions provided solve the listed issues (fully, partially, not at all...)
- Complexity of solution, e.g. can the solution be used by small institutions with few professional FTEs; the more FTEs the maintenance of the solution requires, the higher the cost will be.
- Integration tests – how the different tools act as part of a system. This is very important, as we are constantly facing bottlenecks in a specific tool that damage the overall performance.
As with everything else in SCAPE, as much as possible needs to be automated in order to scale the evaluation to the needed level.
COMMENT1: The meaning is: can the workflow be changed to become more usable, or is it perhaps not usable at all, so that the solution does not solve the issue? Usability is an important aspect of all software solutions, which we test regularly.
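To illustrate what automated collection of such measurements could look like, here is a minimal sketch that runs a tool once per input object and records wall-clock time and failure count (covering speed and stability / error-handling). The command wrapped here is a placeholder; no specific SCAPE tool or interface is assumed.

```python
import subprocess
import time

def measure_tool(cmd, inputs):
    """Run a tool (given as an argv prefix) once per input and collect
    basic performance and stability metrics: total wall-clock time,
    number of failed invocations, and derived throughput."""
    failures = 0
    start = time.perf_counter()
    for item in inputs:
        # Each invocation is isolated, so one crashing object does not
        # abort the whole measurement run.
        result = subprocess.run(cmd + [item], capture_output=True)
        if result.returncode != 0:
            failures += 1
    elapsed = time.perf_counter() - start
    return {
        "objects": len(inputs),
        "failures": failures,
        "seconds": round(elapsed, 3),
        "objects_per_hour": round(len(inputs) / elapsed * 3600, 1)
        if elapsed > 0 else None,
    }
```

For example, `measure_tool(["file", "--mime-type"], list_of_paths)` would yield one record per run; collecting such records over time is what lets the evaluation track progress rather than a single snapshot.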
Evaluation in SCAPE is done through the following steps:
- Define top-10 goals and objectives that should be evaluated - done at this page
- For each goal/objective pair select how to evaluate - in one or more of 4 possible ways (each instance called an evaluation-point)
  - System/Platform level using this template - used when evaluating things on a distributed system - e.g. for performance metrics
  - Component level using this template - used when evaluating things on a single machine - e.g. for accuracy metrics
  - Registering the evaluation with this template and using the Plato Tool built for evaluation - used primarily for Action Components
  - Registering the evaluation with this template and writing up a report with findings and results where no hard, uniquely defined metrics can be applied - e.g. for organisational goals and objectives
Evaluations should be linked to Scenarios where appropriate - see the templates for explanation. In the first round of evaluations we should try not to define too many evaluation-points, but for each goal/objective pair selected for evaluation there should be at least one. Some scenarios might be used to evaluate multiple goal/objective pairs.
Each evaluation will use a basic evaluation scheme:
1. Setup evaluation (see template)
2. Define metrics (using the Metrics Catalogue)
3. Define baseline (ground truth) - e.g. current state with a tool running on a single machine before SCAPE began
4. Define metric goal - what result do we want to achieve during SCAPE
5. Result of a given evaluation
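The scheme above can be sketched as a small record per evaluation-point, holding the baseline, the metric goal, and the results collected over successive evaluation rounds. The class and field names below are illustrative assumptions, not part of the methodology.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationPoint:
    """One evaluation-point: a metric with its baseline (ground truth),
    its goal, and the results recorded over successive rounds."""
    metric: str                     # name as taken from the Metrics Catalogue
    baseline: float                 # state before SCAPE began
    goal: float                     # target to reach during SCAPE
    results: list = field(default_factory=list)  # (milestone, value) pairs

    def record(self, milestone, value):
        """Add the result of one evaluation round, e.g. ('M20', 3500.0)."""
        self.results.append((milestone, value))

    def goal_reached(self):
        # Assumes a higher-is-better metric; invert the comparison for
        # metrics such as error rates.
        return any(value >= self.goal for _, value in self.results)
```

Repeating `record()` at each round directly yields the multiple values over time that the methodology calls for, and `goal_reached()` tells whether further rounds are still needed.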
For some objectives (e.g. organisational fit) it might be hard or even impossible to define precise, measurable, uniquely defined metrics and goals. For these, a more qualitative, human-readable evaluation (e.g. in the form of a report) will be done.
For many of the evaluations we foresee that they will be run multiple times during the project - ultimately at least until the defined metric goal (4) is reached - thus showing the progress of SCAPE developments through multiple values for the result (5) over time. See example evaluation.
The first round of evaluations will be carried out in M20-M22, so that the first evaluation report can be written up as a deliverable for M24. At a later stage, results (metrics) to be used in the evaluations will be queried from REF, but REF and the integration between components/workflows and REF are still under development and will not be ready for evaluation in the first round; REF will be integrated into the evaluation methodology in year 3. Results used for the first round of evaluation will therefore mostly be gathered manually and entered into the evaluation pages on this WIKI. For this reason we should also not define too many evaluation-points and corresponding metrics in the first round, to keep things manageable while the method still relies on manual interaction.