h2. Metrics catalogue
h4. Picking metrics
When picking metrics for an evaluation, run through the catalogue and pick those already defined; enter a new metric only when needed.
The attribute/measure catalogue developed in PW can be found at [Measures|http://ifs.tuwien.ac.at/dp/vocabulary/quality/measures].
An equivalent attribute/measure source is available as a Google Doc: [Measures by google doc|https://docs.google.com/spreadsheet/ccc?key=0An_F2fZCFRRtdGZ6NFg0eFI3b3NIdktMSzBtWmhKUHc&pli=1#gid=0] (write to Kresimir Duretec for access).
h4. Metrics in use as of first round of evaluations
|| Metric || Previously known as || URL ||
| [number of objects per second|http://purl.org/DP/quality/measures#418] | -NumberOfObjectsPerHour- | http://purl.org/DP/quality/measures#418 |
| [IdentificationCorrectnessInPercent|http://purl.org/DP/quality/measures#417] | -IdentificationCorrectnessInPercent- | http://purl.org/DP/quality/measures#417 |
| [max object size handled in bytes|http://purl.org/DP/quality/measures#404] | -MaxObjectSizeHandledInGbytes- | http://purl.org/DP/quality/measures#404 |
| [min object size handled in bytes|http://purl.org/DP/quality/measures#405] | -MinObjectSizeHandledInMbytes- | http://purl.org/DP/quality/measures#405 |
| [N/A|https://github.com/openplanets/policies/issues/6] | -PlanEfficiencyInHours- | see https://github.com/openplanets/policies/issues/6 |
| [throughput in bytes per second|http://purl.org/DP/quality/measures#406] | -ThroughputGbytesPerMinute- | http://purl.org/DP/quality/measures#406 |
| [throughput in bytes per second|http://purl.org/DP/quality/measures#406] | -ThroughputGbytesPerHour- | http://purl.org/DP/quality/measures#406 |
| [stability judgement|http://purl.org/DP/quality/measures#108] | -ReliableAndStableAssessment- | http://purl.org/DP/quality/measures#108 |
| [failed objects in percent|http://purl.org/DP/quality/measures#407] | -NumberOfFailedFiles- | http://purl.org/DP/quality/measures#407 |
| [N/A|https://github.com/openplanets/policies/issues/11] | -NumberOfFailedFilesAcceptable- | see https://github.com/openplanets/policies/issues/11 |
| [QAFalseDifferentPercent|http://purl.org/DP/quality/measures#416] | -QAFalseDifferentPercent- | http://purl.org/DP/quality/measures#416 |
| [N/A|https://github.com/openplanets/policies/issues/13] | -AverageRuntimePerItemInHours- | see https://github.com/openplanets/policies/issues/13 |
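Note that several of the renamings above also change units (objects per hour becomes objects per second, Gbytes per minute/hour become bytes per second), so legacy values must be converted before being recorded against the new measure URIs. A minimal sketch in Python; the function names and the decimal reading of "Gbyte" (10^9 bytes) are assumptions for illustration, not part of the catalogue:
{code}
# Converting values recorded against the legacy metrics into the units
# used by the new measures. GBYTE is assumed decimal (10^9 bytes) --
# check the measure definitions if 2^30 is intended.
GBYTE = 10 ** 9

def objects_per_hour_to_per_second(objects_per_hour):
    # NumberOfObjectsPerHour -> number of objects per second (measure #418)
    return objects_per_hour / 3600.0

def gbytes_per_minute_to_bytes_per_second(gbytes_per_minute):
    # ThroughputGbytesPerMinute -> throughput in bytes per second (measure #406)
    return gbytes_per_minute * GBYTE / 60.0

def gbytes_per_hour_to_bytes_per_second(gbytes_per_hour):
    # ThroughputGbytesPerHour -> throughput in bytes per second (measure #406)
    return gbytes_per_hour * GBYTE / 3600.0

print(objects_per_hour_to_per_second(250))      # ~0.069 objects/s
print(gbytes_per_hour_to_bytes_per_second(25))  # ~6.94e6 bytes/s
{code}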
{anchor:fmeasure}
h2. Binary evaluation method (F-measure)
We use _sensitivity_ and _specificity_ as statistical measures of the performance of the binary classification test, where
_Sensitivity_ = Σ {color:#99cc00}true different{color} / (Σ {color:#99cc00}true different{color} \+ Σ {color:#ff0000}false similar{color})
and
_Specificity_ = Σ {color:#99cc00}true similar{color} / (Σ {color:#99cc00}true similar{color} \+ Σ {color:#ff0000}false different{color})
and the F-measure is calculated on this basis as shown in the table below:
!BinaryEvaluation.png|border=1,width=551,height=201!
This is one suggested approach; it applies whenever we test for the binary correctness of results, i.e. it is applicable for characterisation and QA.
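For reference, a minimal Python sketch of the calculation. It assumes _different_ is the positive class (as in the sensitivity formula above) and that the F-measure is the usual harmonic mean of precision and sensitivity; the authoritative calculation is the one shown in the table image:
{code}
def binary_evaluation(true_different, false_similar, true_similar, false_different):
    # Sensitivity: share of genuinely different pairs flagged as different
    sensitivity = true_different / float(true_different + false_similar)
    # Specificity: share of genuinely similar pairs flagged as similar
    specificity = true_similar / float(true_similar + false_different)
    # Precision: share of "different" verdicts that were actually different
    precision = true_different / float(true_different + false_different)
    # F-measure: harmonic mean of precision and sensitivity (assumed formulation)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f_measure

# Example: 90 true different, 10 false similar, 95 true similar, 5 false different
print(binary_evaluation(90, 10, 95, 5))  # (0.9, 0.95, ~0.923)
{code}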
h2. History
h4. Previously used evaluation metrics
{code}Use CamelCase notation for metric names - e.g. NumberOfObjectsPerHour{code}
|| Metric || PW catalogue URI || Datatype || Description || Example || Comments ||
| NumberOfObjectsPerHour | | integer | Number of objects that can be processed per hour | 250 | Can be used both for component evaluations on a single machine and for entire platform setups |
| IdentificationCorrectnessInPercent | | integer | A statistical measure for binary evaluations - [see detailed specification below|#Metricscatalogue-fmeasure] | 85 | Between 0 and 100 |
| MaxObjectSizeHandledInGbytes | | integer | The maximum file size a workflow/component has handled | 80 | Specify in Gbytes |
| MinObjectSizeHandledInMbytes | | integer | The minimum file size a workflow/component has handled - combined with MaxObjectSizeHandledInGbytes, it illustrates the capability of running on heterogeneous file sizes | 20 | Specify in Mbytes |
| PlanEfficiencyInHours | | integer | Number of hours it takes to build one preservation plan with Plato | 20 | Specify in hours |
| ThroughputGbytesPerMinute | | integer | The throughput of data measured in Gbytes per minute | 5 | Specify in Gbytes per minute |
| ThroughputGbytesPerHour | | integer | The throughput of data measured in Gbytes per hour | 25 | Specify in Gbytes per hour |
| ReliableAndStableAssessment | | boolean | Manual assessment of whether the experiment performed reliably and stably | true | |
| NumberOfFailedFiles | | integer | Number of files that failed in the workflow | 0 | |
| NumberOfFailedFilesAcceptable | | boolean | Manual assessment of whether the number of files that fail in the workflow is acceptable | true | |
| QAFalseDifferentPercent | | integer | Number of content comparisons judged _original and migrated different_, even though human spot checking says _original and migrated similar_ | 5 | Between 0 and 100 |
| AverageRuntimePerItemInHours | | float | The average processing time in hours per item | 15 | Positive floating point number |