h2. Metrics catalogue

{info:title=Note}
We are currently merging the initial _evaluations_ metrics catalogue into the attribute/measure catalogue being developed in the PW work package. Once this has happened, all experiments should use only metrics from the latter catalogue.
Preferably, this page will also describe how to navigate the catalogue, so that the proper metrics are easy to find and more time can be spent on the experiments.
{info}

h4. Picking metrics

An equivalent attribute/measure source can be found in this Google doc: [Measures by google doc |https://docs.google.com/spreadsheet/ccc?key=0An_F2fZCFRRtdGZ6NFg0eFI3b3NIdktMSzBtWmhKUHc&pli=1#gid=0] (write to Kresimir Duretec for access to the Google doc).

h4. Metrics in use as of first round of evaluations

|| Metric || Previously known as ||
| [number of objects per second|http://purl.org/DP/quality/measures#418] | -NumberOfObjectsPerHour- |
| [IdentificationCorrectnessInPercent|http://purl.org/DP/quality/measures#417] | -IdentificationCorrectnessInPercent- |
| [max object size handled in bytes|http://purl.org/DP/quality/measures#404] | -MaxObjectSizeHandledInGbytes- |
| [min object size handled in bytes|http://purl.org/DP/quality/measures#405] | -MinObjectSizeHandledInMbytes- |
| [N/A|https://github.com/openplanets/policies/issues/6] | -PlanEfficiencyInHours- |
| [throughput in bytes per second|http://purl.org/DP/quality/measures#406] | -ThroughputGbytesPerMinute- |
| [throughput in bytes per second|http://purl.org/DP/quality/measures#406] | -ThroughputGbytesPerHour- |
| [stability judgement|http://purl.org/DP/quality/measures#108] | -ReliableAndStableAssessment- |
| [failed objects in percent|http://purl.org/DP/quality/measures#407] | -NumberOfFailedFiles- |
| [N/A|https://github.com/openplanets/policies/issues/11] | -NumberOfFailedFilesAcceptable- |
| [QAFalseDifferentPercent|http://purl.org/DP/quality/measures#416] | -QAFalseDifferentPercent- |
| [N/A|https://github.com/openplanets/policies/issues/13] | -AverageRuntimePerItemInHours- |
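Several of the renamed measures also changed units (e.g. -ThroughputGbytesPerHour- became _throughput in bytes per second_). A minimal Python sketch of the conversion, assuming decimal gigabytes (10{^}9{^} bytes); the function names are illustrative, not part of the catalogue:

```python
def gbytes_per_hour_to_bytes_per_second(gb_per_hour, gb=10**9):
    # Assumes decimal gigabytes (10^9 bytes); pass gb=2**30 for binary GiB.
    return gb_per_hour * gb / 3600.0

def gbytes_per_minute_to_bytes_per_second(gb_per_minute, gb=10**9):
    # Same conversion for the per-minute variant of the old measure.
    return gb_per_minute * gb / 60.0
```

For example, a legacy value of 3.6 GB/hour corresponds to 1,000,000 bytes/second under the decimal-gigabyte assumption.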

{anchor:fmeasure}

h2. Binary evaluation method (FMeasure)

We use _sensitivity_ and _specificity_ as statistical measures of the performance of the binary classification test where 
_Sensitivity_ = Σ {color:#99cc00}true different{color} / (Σ{color:#99cc00} true different{color} \+ Σ {color:#ff0000}false similar{color}) 
and 
_Specificity_ = Σ{color:#99cc00} true similar{color} / (Σ {color:#99cc00}true similar{color} \+ Σ{color:#ff0000} false different{color}) 
and the F-measure is calculated on this basis as shown in the table below:

  !BinaryEvaluation.png|border=1,width=551,height=201!


This is one suggested method, and it applies nicely when we test for the binary correctness of results, i.e. it is applicable to characterisation and QA.
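The two rates above can be computed directly from the four confusion counts. Below is a minimal Python sketch; note that the `f_measure` shown assumes the harmonic mean of sensitivity and specificity (one common variant), whereas the exact formula used in the evaluations is the one given in the table image above:

```python
def sensitivity(true_different, false_similar):
    # Sensitivity = sum(true different) / (sum(true different) + sum(false similar))
    return true_different / (true_different + false_similar)

def specificity(true_similar, false_different):
    # Specificity = sum(true similar) / (sum(true similar) + sum(false different))
    return true_similar / (true_similar + false_different)

def f_measure(sens, spec):
    # Assumed variant: harmonic mean of sensitivity and specificity.
    return 2 * sens * spec / (sens + spec)
```

For example, with 8 true-different, 2 false-similar, 9 true-similar and 1 false-different comparisons, sensitivity is 0.8 and specificity is 0.9.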

h2. History

h4. Previously used evaluation metrics

|| Metric || || Datatype || Description || Example || Valid values ||
| QAFalseDifferentPercent | | integer | Number of content comparisons resulting in _original and migrated different_, even though human spot checking says _original and migrated similar_. | 5% | Between 0 and 100 |
| AverageRuntimePerItemInHours | | float | The average processing time in hours per item | 15 | Positive floating point number |
