

Tomasz Hofmann (PSNC)

Evaluation points

The main goal of this evaluation was to analyse the medical data stored at the Medical Data Center and obtain statistics on the average visit time of patients treated at WCPT in a given period and due to specific diseases (indicated by ICD10 codes). The period of time and the ICD10 codes are the input parameters of the analysis algorithm. Statistics were gathered using the PSNC Hadoop Platform and the MapReduce approach. The number of objects per second (i.e. the number of records processed per second) was used as the metric.

Assessment of measurable points
Metric                    number of objects per second
Description               number of records processed per second
Metric baseline           -
Metric goal               -
July 21, 2014 [Test 1]    2812 [obj/s]
July 28, 2014 [Test 2]    2569 [obj/s]
July 31, 2014 [Test 3]    1438 [obj/s]

Note: as an object we proposed to use one scanned cell in an HBase table (i.e. one record).

Metrics must be registered in the metrics catalogue

Visualisation of results

The chart below presents the results of the analysis for Test 2. Colours indicate the different ICD10 disease codes. The test was performed for patients who visited the WCPT hospital between 01-01-2013 and 31-12-2013. Each column indicates the average length of the patients' visits. Descriptions of the ICD10 codes investigated in this analysis are as follows:

  • A15.0 - Tuberculosis of lung, confirmed by sputum microscopy with or without culture
  • A15.1 - Tuberculosis of lung, confirmed by culture only
  • J85.1 - Abscess of lung with pneumonia


Additional information

Table 1 presents the processing time of the whole job per test. Tables 2 and 3 provide information on the execution time and the number of processed rows for the map and reduce tasks respectively. From the statistics and measurable points it can be seen that: a) the processing time depends on the number of records to be processed; b) the more records there are to process, the better the throughput (more rows are processed per second).

Table 1. Overall statistics

Parameter          Test 1                 Test 2                 Test 3
Analyzed period    1.07.2012-1.07.2014    1.01.2013-31.12.2013
Processing time    59 [s]                 57 [s]                 7 [s]

Table 2. Statistics for map task

Parameter                            Test 1     Test 2     Test 3
Processing time (for all records)    59 [s]     57 [s]     7 [s]
Number of records                    165 903    146 748    9 294
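As a sanity check (not part of the original evaluation), the throughput metric can be recomputed from the record counts and map-task times in Table 2. The recomputed values for Tests 2 and 3 differ slightly from the reported 2569 and 1438 [obj/s], presumably because the reported metric was derived from more precise task timings:

```python
# Recompute records-per-second throughput from the per-test
# record counts and processing times reported in Table 2.
tests = {
    "Test 1": (165_903, 59),  # (records, seconds)
    "Test 2": (146_748, 57),
    "Test 3": (9_294, 7),
}

for name, (records, seconds) in tests.items():
    throughput = records / seconds
    print(f"{name}: {throughput:.0f} obj/s")  # Test 1 gives ~2812 obj/s
```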

Table 3. Statistics for reduce task

Parameter                            Test 1      Test 2      Test 3
Processing time (for all records)    0.17 [s]    0.18 [s]    0.04 [s]
Number of records                    898         642

Technical details


The experiment is composed of the following steps (according to the MapReduce schema):

  1. the map task:
    1. for each tuple in the visits table:
      1. if the ICD10 code is in the set of given ICD10 codes and the visit belongs to the given period, then calculate the length of the patient's visit and add the pair to the context: Key=ICD10 code, Value=length of the visit
  2. the reduce task:
    1. for each ICD10 code, accumulate the lengths of the visits and then divide by the number of visits
    2. produce the pair: Key=ICD10 code, Value=average length of the visit
  3. statistics are gathered by downloading and parsing log files
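The actual job runs on Hadoop; the following is only a minimal Python sketch of the map/reduce logic described above. The visit tuples, field layout, and date handling are assumptions made for illustration, not the real table schema:

```python
from collections import defaultdict
from datetime import date

# Hypothetical visit tuples: (icd10 code, admission date, discharge date).
visits = [
    ("A15.0", date(2013, 3, 1), date(2013, 3, 15)),
    ("A15.1", date(2013, 5, 2), date(2013, 5, 9)),
    ("A15.0", date(2013, 6, 1), date(2013, 6, 11)),
    ("J85.1", date(2012, 1, 1), date(2012, 1, 20)),  # outside the period
]

def map_task(visits, icd10s, start, end):
    """Emit (icd10, visit length in days) for visits matching the filters."""
    for icd10, admission, discharge in visits:
        if icd10 in icd10s and start <= admission and discharge <= end:
            yield icd10, (discharge - admission).days

def reduce_task(pairs):
    """Average the visit lengths per icd10 code."""
    totals = defaultdict(lambda: [0, 0])  # icd10 -> [sum, count]
    for icd10, length in pairs:
        totals[icd10][0] += length
        totals[icd10][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}

averages = reduce_task(
    map_task(visits, {"A15.0", "A15.1", "J85.1"},
             date(2013, 1, 1), date(2013, 12, 31))
)
print(averages)  # → {'A15.0': 12.0, 'A15.1': 7.0}
```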

Scripts used to execute evaluation

Execution commands

./ -admission 20140101 -destination ./test2/casesaverage.png -discharge 20140501 -hospital wcpit -icd10s J85.1 -icd10s A15.0 -icd10s A15.1 -width 800 -height 600
./ caseAvr

-admission : date of patient admission to hospital
-discharge : date of patient discharge from hospital
-destination : full path to the result chart file (only one per job execution)
-width : width of the chart in pixels
-height : height of the chart in pixels
-icd10s : list of ICD10 codes (the flag may be repeated, once per code)

Important note: please change the -destination for each job execution.
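The repeated -icd10s flag in the command above suggests an append-style option. A hypothetical parser sketch (using Python's argparse, not the actual script) shows how such a command line could be consumed:

```python
import argparse

# Hypothetical re-creation of the script's CLI; option names follow
# the parameter list above, defaults and types are assumptions.
parser = argparse.ArgumentParser(description="Average visit time per ICD10 code")
parser.add_argument("-admission", help="date of patient admission (YYYYMMDD)")
parser.add_argument("-discharge", help="date of patient discharge (YYYYMMDD)")
parser.add_argument("-destination", help="full path to the result chart file")
parser.add_argument("-hospital", help="hospital identifier")
parser.add_argument("-width", type=int, help="width of the chart in pixels")
parser.add_argument("-height", type=int, help="height of the chart in pixels")
parser.add_argument("-icd10s", action="append", help="ICD10 code (repeatable)")

args = parser.parse_args(
    "-admission 20140101 -destination ./test2/casesaverage.png "
    "-discharge 20140501 -hospital wcpit -icd10s J85.1 -icd10s A15.0 "
    "-icd10s A15.1 -width 800 -height 600".split()
)
print(args.icd10s)  # → ['J85.1', 'A15.0', 'A15.1']
```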

Hadoop job
