| *Automatic measures* | The automatic measure considers only throughput (books/time). We compare the runtime of the data preparation and quality assurance workflows on a single machine against a Hadoop Map/Reduce job running on a cluster, with the sample size increasing in steps (50, 500, 5000 books) up to a very large data set (50000 books). |
| *Manual assessment* | A set of book pairs is annotated and used as a gold standard to determine whether the quality assurance workflow was applied successfully. For each pair of an original and a redownloaded book, the annotation manually assigns a degree of difference, which is used for evaluating the solution. Based on a threshold on the degree of difference, the book pairs are split into two classes, "similar" and "different". The figure below illustrates a sample of 50 books where the first two rows represent book pairs classified as "different" ({color:#ff0000}*≠*{color}), and the remaining rows represent book pairs classified as "similar" ({color:#008000}*=*{color}). A possible output of the quality assurance classifier is shown by the red boxes highlighting 11 out of 50 book pairs that are supposed to be different. \\
\\ !evaluation_50bookpairs_v2.png|border=1,width=474,height=473!\\ \\
On the one hand, the quality assurance classifier correctly detects 8 out of 10 books as "different" ({color:#008000}true different{color}) and misses two books that are "different" but are classified as "similar" ({color:#ff0000}false similar{color}). On the other hand, the classifier detects 5 book pairs as "different" although they are actually "similar" ({color:#ff0000}false different{color}), and it correctly detects 37 books as "similar" ({color:#008000}true similar{color}) according to the gold standard. \\
We use _precision_ and _recall_ as statistical measures of the performance of the binary classification, where \\
\\
_precision_ = Σ {color:#008000}true different{color} / (Σ {color:#008000}true different{color} + Σ {color:#ff0000}false different{color}) \\
\\
and \\
\\
_recall_ = Σ {color:#008000}true different{color} / (Σ {color:#008000}true different{color} + Σ {color:#ff0000}false similar{color}) \\
We then calculate the combined _f-measure_ of precision and recall as\\
_f-measure_ = 2 * (precision * recall) / (precision + recall)\\ \\
This means that the _precision_ of the tool improves as more of the book pairs flagged as "different" really are different, i.e. as the number of similar pairs incorrectly flagged as different decreases. The _recall_ of the tool improves as fewer truly different book pairs are missed. The _f-measure_ expresses the balance between precision and recall.\\
\\
For the example above, the classification of the tool gives \\
_precision_ = 8 / ( 8 \+ 5 ) ≈ *0.62*\\
_and_ \\
\\
_recall_ = 8 / ( 8 + 2 ) = *0.80*\\
which results in \\
\\
_f-measure_ = 2 * (_0.62_ \* _0.80_) / (_0.62_ + _0.80_) ≈ *0.70*\\
\\
The resulting measures are summarised in the table below: \\
\\ !Books_evaluation_chart.png|border=1,width=480,height=173!\\ \\ |
| *Actual evaluations* | Links to actual evaluations of this issue/scenario |
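The threshold-based classification and the metric calculations described above can be sketched in Python as follows. This is a minimal illustration, not the actual tool: the threshold value 0.3 and all function names are assumptions, while the confusion counts (8 true different, 5 false different, 2 false similar, 37 true similar) are taken from the worked example.

```python
# Illustrative sketch of the evaluation workflow described above.
# The threshold value (0.3) is an assumed example, not the real cutoff.
THRESHOLD = 0.3  # degree-of-difference cutoff separating "similar" from "different"


def classify(degree_of_difference: float) -> str:
    """Classify a book pair from its annotated degree of difference."""
    return "different" if degree_of_difference >= THRESHOLD else "similar"


def evaluate(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """Compute precision, recall and f-measure for the 'different' class."""
    true_different = sum(1 for g, p in zip(gold, predicted)
                         if g == "different" and p == "different")
    false_different = sum(1 for g, p in zip(gold, predicted)
                          if g == "similar" and p == "different")
    false_similar = sum(1 for g, p in zip(gold, predicted)
                        if g == "different" and p == "similar")
    precision = true_different / (true_different + false_different)
    recall = true_different / (true_different + false_similar)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f-measure": f_measure}


# Confusion counts from the worked example: 8 true different,
# 5 false different, 2 false similar, 37 true similar.
gold = ["different"] * 8 + ["similar"] * 5 + ["different"] * 2 + ["similar"] * 37
pred = ["different"] * 8 + ["different"] * 5 + ["similar"] * 2 + ["similar"] * 37
result = evaluate(gold, pred)
print(result)  # precision ≈ 0.62, recall = 0.80, f-measure ≈ 0.70
```

Running the sketch on the example counts reproduces the values from the worked example (precision 8/13 ≈ 0.62, recall 8/10 = 0.80, f-measure ≈ 0.70).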