Evaluator(s)
Tomasz Hofmann (PSNC)
Evaluation points
The main goal of this evaluation was to run an analysis on the medical data stored at the Medical Data Center and to obtain statistics on the age of patients treated at WCPT in a given period. The age intervals are the input parameters of the analysis algorithm. Statistics were gathered on the PSNC Hadoop cluster using the MapReduce approach. The evaluation metric is the number of objects processed per second, where an object is a single record (row) in the HBase table and each row stores information about the age of a patient who visited the WCPT hospital.
Assessment of measurable points
Metric | Description | Metric baseline | Metric goal | July 21, 2014 [Test 1] | July 28, 2014 [Test 2] | July 30, 2014 [Test 3] |
---|---|---|---|---|---|---|
number of objects per second | number of records processed per second | - | - | 2592 [obj/s] | 2417 [obj/s] | 1465 [obj/s] |
Note: Metrics must be registered in the metrics catalogue
Visualisation of results
The chart below presents the results of the analysis for Test 2. Colours indicate the age ranges of the patients who visited the WCPT hospital (the exact age range is given in the middle-right part of the chart and on the chart itself). Each colour on the pie chart has a related entry (note) of the form X-Y = Z [P], where X-Y is the age range (patients aged between X and Y), Z is the number of patient visits for that age range in the analysed period, and P is the share of those visits in the total number of visits. For example, the entry for yellow, 41-60 = 309 [35%], means that patients aged 41 to 60 (inclusive) account for 309 visits, or 35% of all visits to the WCPT hospital in the period 2014-01-01 - 2014-05-01.
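The entry format described above can be illustrated with a short Python sketch. It is not taken from age.sh; the function name `format_entry` and all visit counts except the 41-60 value (309) are hypothetical and were chosen only so that the example reproduces the entry quoted in the text.

```python
def format_entry(low, high, visits, total_visits):
    """Build one legend entry in the form 'X-Y = Z [P%]'."""
    percent = round(100 * visits / total_visits)
    return f"{low}-{high} = {visits} [{percent}%]"


# Hypothetical visit counts per age range. Only the 41-60 count (309)
# appears in the report; the other numbers are made up for illustration.
counts = {(1, 20): 150, (21, 40): 200, (41, 60): 309, (61, 100): 224}
total = sum(counts.values())

for (low, high), visits in counts.items():
    print(format_entry(low, high, visits, total))
```

The percentage is rounded to a whole number here, which matches the quoted entry; the rounding rule actually used by the chart generator is not stated in the report.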
Additional information
Table 1 presents the processing time of each test executed in the evaluation. The statistics in the tables and the measurable points show that: a) the processing time depends on the number of records to be processed; b) the more records there are to process, the better the performance (more rows are processed per second). Tables 2 and 3 provide additional statistics: processing times for the map and reduce tasks, respectively. The map task consumes most of the processing time, because the mappers do the actual record processing, while the reduce tasks only calculate the summary.
Table 1. Overall statistics
Parameter | Test 1 | Test 2 | Test 3 |
---|---|---|---|
Analyzed period | 1.07.2012-1.07.2014 | 1.01.2013-31.12.2013 | 1.01.2014-1.05.2014 |
Processing time | 65 [s] | 61 [s] | 7 [s] |
Table 2. Statistics for map task
Parameter | Test 1 | Test 2 | Test 3 |
---|---|---|---|
Processing time (for all records) | 65 [s] | 61 [s] | 6 [s] |
Number of records | 167 893 | 148 207 | 9 462 |
Table 3. Statistics for reduce task
Parameter | Test 1 | Test 2 | Test 3 |
---|---|---|---|
Processing time (for all records) | 0.36 [s] | 0.34 [s] | 0.15 [s] |
Number of records | 8 259 | 7 744 | 592 |
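As a rough cross-check of the metric, the sketch below divides the number of map input records (Table 2) by the overall processing time (Table 1). This pairing is an assumption; the report does not state exactly which values were used. Because the times in the tables are rounded to whole seconds, the results only approximate the figures reported in the measurable points, but they show the same trend: larger inputs reach a higher throughput.

```python
# Values copied from Table 1 (overall processing time in seconds)
# and Table 2 (number of records read by the map task).
tests = {
    "Test 1": {"records": 167_893, "seconds": 65},
    "Test 2": {"records": 148_207, "seconds": 61},
    "Test 3": {"records": 9_462, "seconds": 7},
}


def objects_per_second(records, seconds):
    """Throughput metric: processed records divided by elapsed time."""
    return records / seconds


throughput = {name: objects_per_second(**t) for name, t in tests.items()}

for name, value in throughput.items():
    print(f"{name}: about {value:.0f} obj/s")
```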
Technical details
Workflow
The experiment is composed of the following steps (according to the MapReduce scheme):
- the map task [age.sh]:
  - for each tuple in the visits table:
    - if the visit belongs to the given period, then calculate the patient's age and add the pair Key=age, Value=visit_id to the context
- the reduce task [age.sh]:
  - for each value of age, aggregate all visit ids in a hash set in order to find the number of different visits
  - produce the pair Key=age, Value=number of different visits (size of the hash set)
- chart generation [age.sh]
- additional statistics (see the additional information section) are gathered by downloading and parsing log files [test.sh]
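The map and reduce steps listed above can be sketched as a small in-memory Python simulation. This is not the actual Hadoop job: the record fields (`visit_id`, `birth_date`, `admission`), the helper names, and the rule that a visit belongs to the period when its admission date falls inside it are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def age_on(birth_date, day):
    """Age in full years on the given day."""
    years = day.year - birth_date.year
    if (day.month, day.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def map_visit(visit, period_start, period_end):
    """Map step: emit (age, visit_id) for visits inside the period.

    Assumption: a visit belongs to the period if it was admitted in it.
    """
    if period_start <= visit["admission"] <= period_end:
        yield age_on(visit["birth_date"], visit["admission"]), visit["visit_id"]


def reduce_age(age, visit_ids):
    """Reduce step: count distinct visit ids using a hash set."""
    return age, len(set(visit_ids))


def run_job(visits, period_start, period_end):
    """Group mapper output by key, then apply the reducer per key."""
    grouped = defaultdict(list)
    for visit in visits:
        for age, visit_id in map_visit(visit, period_start, period_end):
            grouped[age].append(visit_id)
    return dict(reduce_age(age, ids) for age, ids in grouped.items())


# Hypothetical sample rows; "v1" is duplicated on purpose.
visits = [
    {"visit_id": "v1", "birth_date": date(1970, 3, 15), "admission": date(2014, 2, 1)},
    {"visit_id": "v1", "birth_date": date(1970, 3, 15), "admission": date(2014, 2, 1)},
    {"visit_id": "v2", "birth_date": date(1970, 6, 1), "admission": date(2014, 3, 1)},
    {"visit_id": "v3", "birth_date": date(1950, 1, 1), "admission": date(2013, 12, 31)},
    {"visit_id": "v4", "birth_date": date(2000, 1, 10), "admission": date(2014, 1, 10)},
]

result = run_job(visits, date(2014, 1, 1), date(2014, 5, 1))
print(result)
```

Collecting the ids in a set inside the reducer is what makes a repeated visit id count only once, which is the purpose of the hash set in the reduce task.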
Scripts used to execute evaluation
https://git.man.poznan.pl/stash/projects/SCAP/repos/test-scripts/browse/epidemic-jobs-tests/age.sh
https://git.man.poznan.pl/stash/projects/SCAP/repos/test-scripts/browse/epidemic-jobs-tests/test.sh
Execution commands
./age.sh -hospital wcpit -admission 20140101 -discharge 20140501 -destination ./test1/age.png -intervals "1-20;21-30;31-40;41-60;61-70;71-80;81-100" -width 800 -height 600
./test.sh age
where:
-admission : date of patient admission to hospital
-discharge : date of patient discharge from hospital
-destination : full path to the chart file, which is the Hadoop job result (only one per job execution)
-width : width of the chart in pixels
-height : height of the chart in pixels
-intervals : patients' age intervals
Important note: please change the -destination for each job execution.
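To make the format of the -intervals value concrete, here is a minimal Python sketch that parses such a string and looks up the interval for a given age. The function names are hypothetical and the code is not part of age.sh. The bounds are treated as inclusive, following the description of the chart entries.

```python
def parse_intervals(spec):
    """Turn '1-20;21-30' into [(1, 20), (21, 30)]."""
    intervals = []
    for part in spec.split(";"):
        low, high = part.split("-")
        intervals.append((int(low), int(high)))
    return intervals


def find_interval(age, intervals):
    """Return the (low, high) interval containing age, or None."""
    for low, high in intervals:
        if low <= age <= high:  # both bounds are inclusive
            return (low, high)
    return None


intervals = parse_intervals("1-20;21-30;31-40;41-60;61-70;71-80;81-100")
print(find_interval(45, intervals))
```

An age outside every interval (for example 0 with the intervals above) yields `None` in this sketch; how the real job treats such records is not described in the report.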
Hadoop job
https://git.man.poznan.pl/stash/projects/SCAP/repos/mr-jobs/browse/epidemic-jobs/age