To unify metrics across all evaluations, every metric should be registered in this Metrics Catalogue. When picking metrics for an evaluation, go through the catalogue and reuse any metric that is already defined, or enter a new metric when needed.
|NumberOfObjectsPerHour||integer|| Number of objects that can be processed per hour
|| Can be used both for component evaluations on a single machine and for entire platform setups
||Statistical measure for binary evaluations (see the detailed specification below)|| 85 %
|| Between 0 and 100
|| The maximum file size that a workflow/component has handled
|| Specify in Gbytes
|| Number of hours it takes to build one preservation plan with Plato
|| Specify in hours
|| The throughput of data measured in Gbytes per minute
|| Specify in Gbytes per minute
|| Manual assessment of whether the experiment performed reliably and stably
|| Number of files that failed in the workflow
|QAFalseDifferentPercent||integer||Percentage of content comparisons that report the original and the migrated object as different, even though human spot checking judges them to be similar.|| 5%
||Between 0 and 100|
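As an illustration of how two of the rate metrics above could be derived from raw measurements, here is a minimal Python sketch. The function names, the rounding to an integer (chosen because the catalogue lists the datatype as integer), and the convention that 1 Gbyte equals 10^9 bytes are assumptions made for this example; they are not defined by the catalogue.

```python
def objects_per_hour(object_count: int, elapsed_seconds: float) -> int:
    """NumberOfObjectsPerHour: processed objects scaled to a one-hour window.

    Assumption: the result is rounded to the nearest integer.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return round(object_count * 3600 / elapsed_seconds)


def gbytes_per_minute(total_bytes: int, elapsed_seconds: float,
                      bytes_per_gbyte: int = 10**9) -> float:
    """Data throughput in Gbytes per minute.

    Assumption: 1 Gbyte = 10**9 bytes; pass 2**30 for binary gigabytes.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_bytes / bytes_per_gbyte * 60 / elapsed_seconds


# Hypothetical run: 1200 objects (90 Gbytes in total) processed in 30 minutes.
rate = objects_per_hour(1200, 1800)
throughput = gbytes_per_minute(90_000_000_000, 1800)
print(rate, throughput)
```

Whichever unit convention is chosen, it should be stated alongside the reported value so that results from different evaluations remain comparable.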
An attribute/measure catalogue is also being developed in PW; this evaluation metrics catalogue will be merged with the PW catalogue in year 3.
If you want a quick glance at the PW catalogue, it is located here (Google Docs): https://docs.google.com/spreadsheet/ccc?key=0An_F2fZCFRRtdGZ6NFg0eFI3b3NIdktMSzBtWmhKUHc&pli=1#gid=0
Write to Christoph Becker at [email protected] to ask for access to the Google doc.
If you are already familiar with the PW catalogue, you are of course most welcome to use existing metrics from it; this will make the merge in year 3 much easier. This is currently NOT a requirement, however.
We use sensitivity and specificity as statistical measures of the performance of a binary classification test, where
Sensitivity = Σ true different / (Σ true different + Σ false similar)
Specificity = Σ true similar / (Σ true similar + Σ false different)
and the F-measure is calculated on this basis as shown in the table below:
This is one suggested approach. It applies well when we test for binary correctness of calculations, i.e. it is applicable to characterisation and QA.
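The sensitivity and specificity formulas above can be sketched in Python as follows. Treating "different" as the positive class follows directly from those formulas. The F-measure shown here is the standard F1 score (harmonic mean of precision and recall), which is an assumption: the catalogue's own F-measure definition may differ. The function names and the example counts are hypothetical.

```python
def sensitivity(true_different: int, false_similar: int) -> float:
    """Share of truly different pairs that the test reports as different."""
    total = true_different + false_similar
    return true_different / total if total else 0.0


def specificity(true_similar: int, false_different: int) -> float:
    """Share of truly similar pairs that the test reports as similar."""
    total = true_similar + false_different
    return true_similar / total if total else 0.0


def f_measure(true_different: int, false_different: int,
              false_similar: int) -> float:
    """Standard F1 score with 'different' as the positive class.

    Assumption: F1 = 2 * precision * recall / (precision + recall),
    where recall equals the sensitivity defined above.
    """
    reported_different = true_different + false_different
    precision = true_different / reported_different if reported_different else 0.0
    recall = sensitivity(true_different, false_similar)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical QA run over 100 original/migrated pairs.
sens = sensitivity(true_different=40, false_similar=10)
spec = specificity(true_similar=45, false_different=5)
f1 = f_measure(true_different=40, false_different=5, false_similar=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} F1={f1:.4f}")
```

Returning 0.0 when a denominator is zero is a design choice for this sketch; an evaluation might instead prefer to report such cases as undefined.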