APPLICATION OF DATA MINING TECHNIQUES IN
THE PREDICTION OF CORONARY ARTERY DISEASE:
USE OF ANAESTHESIA TIME-SERIES AND PATIENT
RISK FACTOR DATA

Ellen Pitt, B.Sc. (Hons), M.B.,B.S. (UQ), M.IT. (QUT)

Dr. Richi Nayak

Submitted in fulfilment of the requirements for the degree of
Master of Information Technology (Research)
School of Information Systems
Faculty of Science and Technology
Queensland University of Technology
[2009]

Keywords
Anaesthesia, physiological data, time-series, clustering, feature selection, predictors of
outcome, anaesthesia complications, cardiac risk factors, data mining.

Abstract

The high morbidity and mortality associated with atherosclerotic coronary vascular
disease (CVD) and its complications are being lessened by the increased knowledge of risk
factors, effective preventative measures and proven therapeutic interventions. However,
significant CVD morbidity remains and sudden cardiac death continues to be a presenting feature
for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause
of anaesthesia-related complications. Stress electrocardiography/exercise testing is predictive of
10-year risk of CVD events and the cardiovascular variables used to score this test are monitored
peri-operatively. Similar physiological time-series datasets are being subjected to data mining
methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors
of CVD using anaesthesia time-series data and patient risk factor data. Several pre-processing and
predictive data mining methods are applied to this data.
Physiological time-series data related to anaesthetic procedures are pre-processed by removal
of outliers, calculation of moving averages, and data summarisation and data abstraction.
Feature selection methods of both wrapper and filter types are applied to derived physiological
time-series variable sets alone and to the same variables combined with risk factor variables.
The ability of these methods to identify subsets of highly correlated but non-redundant
variables is assessed. The major dataset is derived from the entire anaesthesia population, and
subsets of this population are considered to be at increased anaesthesia risk based on their
need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads).
Because of the unbalanced class distribution in the data, majority class under-sampling and the
Kappa statistic, together with misclassification rate and area under the ROC curve (AUC), are
used for evaluation of models generated using different prediction algorithms.
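The outlier removal and moving-average steps described above can be illustrated with a minimal Python sketch. It assumes that an outlier is a point lying more than three standard deviations from the series mean (the threshold named in the caption of Table 4-6) and that smoothing uses a simple sliding window. The function names `remove_outliers` and `moving_average`, the window length and the sample heart-rate values are hypothetical and are not taken from the thesis.

```python
from statistics import mean, pstdev


def remove_outliers(values, k=3.0):
    """Drop points lying more than k standard deviations from the series mean.

    The 3 SD default is an assumption for illustration only.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series has no outliers under this definition.
        return list(values)
    return [v for v in values if abs(v - mu) <= k * sigma]


def moving_average(values, window=3):
    """Simple moving average over a sliding window.

    Returns len(values) - window + 1 averaged points.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


if __name__ == "__main__":
    # Invented heart-rate trace with one artefactual spike (not study data).
    heart_rate = [70] * 20 + [200]
    cleaned = remove_outliers(heart_rate)
    print(len(heart_rate), len(cleaned))
    print(moving_average([1, 2, 3, 4, 5], window=3))
```

With very short series a single extreme point cannot exceed three standard deviations, so a threshold of this kind is only meaningful for reasonably long traces.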
The performance of models derived from feature-reduced datasets reveals the filter method,
Cfs subset evaluation, to be most consistently effective, although Consistency-derived subsets
tended to give slightly increased accuracy but markedly increased complexity. The use of
misclassification rate (MR) for model performance evaluation is influenced by class distribution.
This influence could be eliminated by consideration of the AUC or Kappa statistic as well as by
evaluation of subsets with under-sampled majority class. The noise and outlier removal
pre-processing methods produced models with MR ranging from 10.69 to 12.62, with the lowest value
being for data from which both outliers and noise were removed (MR 10.69). For the raw
time-series dataset, MR is 12.34. Feature selection results in reduction in MR to 9.8 to 10.16,
with time segmented summary data (dataset F) MR being 9.8 and raw time-series summary data
(dataset A) being 9.92. However, for all time-series only based datasets, the complexity is high.
For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant
variables from the time-series alone datasets but models derived from these subsets are of one
leaf only. MR values are consistent with class distribution in the subset folds evaluated in the
n-fold cross-validation method.
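The dependence of MR on class distribution, and the reason a one-leaf model can still show a low MR, can be illustrated with a small calculation. The Python sketch below is illustrative only: the confusion-matrix counts are invented for the example and are not study data, and the function names are hypothetical. It computes MR and Cohen's Kappa for a model that always predicts the majority class.

```python
def misclassification_rate(tp, fp, fn, tn):
    """Fraction of cases assigned to the wrong class."""
    total = tp + fp + fn + tn
    return (fp + fn) / total


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa: agreement between prediction and truth beyond chance."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Chance agreement from the marginal totals of the confusion matrix.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    if expected == 1:
        # Only one class present in both prediction and truth.
        return 0.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Invented example: 1000 cases, 10% with disease, and a one-leaf model
    # that predicts "no disease" for every case.
    tp, fp, fn, tn = 0, 0, 100, 900
    print("MR:", misclassification_rate(tp, fp, fn, tn))
    print("Kappa:", cohen_kappa(tp, fp, fn, tn))
```

In this invented case the MR simply equals the minority class proportion, while Kappa is zero because the model agrees with the true labels no more often than chance would.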
For models based on Cfs selected time-series derived and risk factor (RF) variables, the
MR ranges from 8.83 to 10.36 with dataset RF_A (raw time-series data and RF) being 8.85 and
dataset RF_F (time segmented time-series variables and RF) being 9.09. The models based on
counts of outliers and counts of data points outside normal range (Dataset RF_E) and derived
variables based on time-series data transformed using Symbolic Aggregate Approximation (SAX) with
associated time-series pattern cluster membership (Dataset RF_G) perform the least well, with
MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour
(NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and
10.28 while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and
9.0 respectively. DT rules are most comprehensible and clinically relevant. The predictive
accuracy increase achieved by addition of risk factor variables to time-series variable based
models is significant. The addition of time-series derived variables to models based on risk factor
variables alone is associated with a trend to improved performance.
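For readers unfamiliar with the Symbolic Aggregate Approximation mentioned above, the following minimal Python sketch shows the general technique as it is usually described: z-normalisation, piecewise aggregate approximation (PAA), then mapping each segment mean to a letter using breakpoints that divide the standard normal distribution into equal-probability regions. It is not the implementation used in the thesis. The function name `sax`, the four-letter alphabet and the requirement that the series length be divisible by the number of segments are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series to a SAX word of length n_segments."""
    if n_segments < 1 or len(series) % n_segments != 0:
        # Simplifying assumption: equal-length segments only.
        raise ValueError("series length must be a multiple of n_segments")

    # Step 1: z-normalise the series.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        z = [0.0] * len(series)
    else:
        z = [(x - mu) / sigma for x in series]

    # Step 2: piecewise aggregate approximation (mean of each segment).
    seg_len = len(series) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # Step 3: breakpoints giving equal-probability regions under N(0, 1).
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]

    # Step 4: map each segment mean to the letter of its region.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


if __name__ == "__main__":
    # Invented rising series (not study data).
    print(sax([1, 1, 2, 2, 3, 3, 4, 4], n_segments=4))
```

The resulting words can then be compared or clustered as strings, which is one way pattern cluster membership of the kind described for Dataset RF_G could be derived.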
Data mining of feature-reduced anaesthesia time-series variables together with risk factor
variables can produce compact and moderately accurate models able to predict coronary vascular
disease. Decision tree analysis of time-series data combined with risk factor variables yields rules
which are more accurate than models based on time-series data alone. The limited additional
value provided by electrocardiographic variables when compared to use of risk factors alone is
consistent with recent suggestions that exercise electrocardiography (exECG) under standardised
conditions has limited additional diagnostic value over risk factor analysis and symptom pattern.
The pre-processing used in this study had limited effect when time-series variables and risk
factor variables were used together as model input. In the absence of risk factor input, the use
of time-series variables after outlier removal, and of time-series variables based on
physiological variable values being outside the accepted normal range, is associated with some
improvement in model performance.

Table of Contents
Keywords
Abstract
Table of Contents
List of Tables
List of Figures
List of Appendices
List of Abbreviations
Statement of Original Authorship
Acknowledgements
CHAPTER 1: INTRODUCTION
1.1 Background
1.2 Context
1.3 Research Objective
1.4 Research Questions
1.5 Thesis Outline
1.6 Significant Results
1.7 Other Findings

CHAPTER 2: LITERATURE REVIEW
2.1 Coronary Vascular Disease
    2.1.1 Impact of cardiovascular disease
    2.1.2 Risk factors and associated vascular disease
    2.1.3 Diagnostic methods
    2.1.4 Risk factor modification and revascularisation
2.2 Anaesthesia
    2.2.1 Anaesthesia risk and complications
    2.2.2 Anaesthesia monitoring
    2.2.3 Choice of anaesthetic agent
    2.2.4 Quality assurance
    2.2.5 Summary
2.3 Data Mining Process
    2.3.1 Data preparation
    2.3.2 Modelling
    2.3.3 Evaluation methods
2.4 Related Work
    2.4.1 Issues in data mining medical databases
    2.4.2 Application in medical domain
    2.4.3 Summary
2.5 Implications

CHAPTER 3: RESEARCH DESIGN
3.1 Data Acquisition
3.2 Data Selection
    3.2.1 Target variable selection
    3.2.2 Case selection and segmentation
    3.2.3 Variable selection
3.3 Data Pre-Processing
    3.3.1 Data exploration
    3.3.2 Pre-processing tasks
3.4 Data Modelling
    3.4.1 Datasets
    3.4.2 Feature selection / dimension reduction
    3.4.3 Modelling methods
3.5 Performance Evaluation
3.6 Post-Processing
3.7 Ethics and Limitations
    3.7.1 Ethical considerations
    3.7.2 Limitations
3.8 Conclusion

CHAPTER 4: DATA EXPLORATION
4.1 Database Description
    4.1.1 Variable groups
4.2 Time-Series Data
4.3 Demographic and Clinical Characteristics
    4.3.1 Missing data
    4.3.2 Gender
    4.3.3 Age
    4.3.4 ASA class
    4.3.5 Case duration
    4.3.6 Vascular disease
    4.3.7 Risk factor characteristics
    4.3.8 Weight distribution
    4.3.9 Primary diagnosis group
4.4 Conclusion

CHAPTER 5: DATA PRE-PROCESSING
5.1 Data Selection
    5.1.1 Target selection
    5.1.2 Variable selection
    5.1.3 Case selection and risk segmentation
5.2 Data Preparation
    5.2.1 Outlier and noise removal
    5.2.2 Time-series data reduction
    5.2.3 Imputation of missing values
    5.2.4 Feature selection
    5.2.5 Time-series dimension reduction methods
5.3 Conclusions

CHAPTER 6: DATA MODELLING AND ANALYSIS
6.1 Feature Selection Methods
    6.1.1 Model accuracy
    6.1.2 Model complexity
    6.1.3 Effect of feature selection on other measures of model performance
6.2 Datasets
6.3 Evaluation Measures
    6.3.1 Misclassification rate for balanced and unbalanced data
    6.3.2 Area under ROC curve in balanced and unbalanced data
    6.3.3 Sensitivity, specificity and predictive values
    6.3.4 Kappa statistic in unbalanced data
6.4 Effect of ASA and its imputation
6.5 Prediction Methods
    6.5.1 Comparison of methods from each class of prediction algorithms
    6.5.2 Comparison of decision tree and rule based prediction algorithms
6.6 Effect of A Priori Risk Stratification
6.7 Model Complexity
    6.7.1 Effect of dataset
    6.7.2 Effect of pre-processing method and risk factor data
6.8 Comparison of Models for Prediction of corVD and anyVD
6.9 Effect of Non Coronary Vascular Disease Status on Prediction of corVD
6.10 Primary hypotheses and Statistical analyses
6.11 Summary of Findings
6.12 Summary

CHAPTER 7: DISCUSSION AND CONCLUSIONS
7.1 Comparison with Exercise ECG
7.2 Known Difficulties with Prediction Based on Stress/Exercise ECG
7.3 Challenges of Time-Series Datasets
    7.3.1 Pre-processing methods
    7.3.2 Predictive method
7.4 Limitations
7.5 Research Design Lessons and Future Studies
    7.5.1 Research Study Design Lessons
    7.5.2 Future work
7.6 Conclusions

BIBLIOGRAPHY
APPENDICES
Appendix A: Time-series characteristics for Phase 1 and Phase 2
Appendix B: Description of ICD code groups
Appendix C: Examples of ST depression and elevation (Yanowitz 1996)
Appendix D: Variations of ST depression (Yanowitz 1996)
Appendix E: ECG lead placement (Yanowitz 1996)
Appendix F: Distribution of risk factor variables (shows proportion of type 1 diabetes, current and past smoking and various lipid disorders)
Appendix G: Distribution of vascular diseases burden per case
Appendix H: Examples of time-series derived variables
Appendix I: Effect of data pre-processing on count of heart rate values reaching adequate predicted maximum heart rate (phase 2, general populations and high risk subset)
Appendix J: Distribution of HR values in different patient groups
Appendix K: Effect of removal of outliers and noise on the number of cases considered to have significant ST deviation
Appendix L: Comparison of DT and rule based methods for selected datasets (AUC)
Appendix M: Comparison of DT and rule based methods for selected datasets (MR)
Appendix N: Comparison of methods and dataset for risk category subsets (MR)
Appendix O: Comparison of methods and datasets for risk category subsets (AUC)
Appendix P: Comparison of methods and datasets for risk category subsets (Kappa statistic)
Appendix Q: Comparison of methods in prediction of anyVD in selected subsets
Appendix R: Description of ST classes
Appendix S: Examples of J48 decision tree rule sets
Appendix T: Examples of decision trees
Appendix U: Comparison of DT and rule based methods for prediction of anyVD (MR and AUC)

List of Tables
Table 2-1: Diagnostic tests for coronary artery disease
Table 2-2: Confusion matrix for binary classification
Table 4-1: Variable groups
Table 4-2: Number of data rows in Phase 1 and Phase 2 datasets
Table 4-3: Count of cases for which each of the trend variables was measured
Table 4-4: Comparison of time-series characteristics for total valid time-series and time-series with clinical data available (Phase 1)
Table 4-5: Time-series characteristics for risk groups
Table 4-6: Outlier characteristics (at 3 standard deviations from mean)
Table 4-7: Available demographic data for each of the datasets
Table 4-8: Available data (count and percent) with ranges for age, gender, ASA class and weight
Table 4-9: Distribution of cardiovascular diagnoses of relevance to the study
Table 4-10: Distribution of vascular disease count
Table 4-11: Vascular diagnoses
Table 5-1: Components of case selection process
Table 5-2: Study populations and description
Table 5-3: Summary of datasets with variable examples
Table 6-1: Other performance measures for models based on all datasets either with or without RF variables (general population)
Table 6-2: Other performance measures for models based on all datasets either with or without RF variables (high risk population)
Table 6-3: Other performance measures for models based on all datasets either with or without RF variables (general population)
Table 6-4: Other performance measures for models based on all datasets either with or without RF variables (high risk population)
Table 6-5: Model complexity for general population (all variables, corVD prediction)
Table 6-6: Model complexity for general population (all variables, corVD prediction)
Table 6-7: Model complexity for all variables in high risk population (corVD)
Table 6-8: Model complexity for Cfs variables in high risk subset (corVD)
Table 6-9: Summary of model complexity for general population and prediction of corVD showing Cfs variables and subset merit
Table 6-10: Summary of model complexity for high risk population and prediction of corVD showing Cfs variables and subset merit
Table 6-11: Summary of model complexity for general population and prediction of anyVD showing Cfs variables and subset merit
Table 6-12: Summary of model complexity for high risk population and prediction of anyVD showing Cfs variables and subset merit
Table 6-13: Statistical analysis of hypotheses

List of Figures
Figure 2-1: Example of an electrocardiogram showing a normal heart beat (Yanowitz 1996) ......... 16
Figure 2-2: Examples of ST/HR correlation plots associated with exercise stress tests. (Hamasaki,
Nakano et al. 1998) ................................................................................................................ 19
Figure 2-3: The CRISP-DM process model (CRISP-DM 2000) ............................................................. 33
Figure 2-4: Symbolic representation of a time-series dataset (Lin, Keogh et al. 2003)....................... 37
Figure 3-1: Study design flow diagram (dataset B contains variables related to time-series from which
outliers have been removed and dataset E contains counts of outliers and abnormal values) .. 62
Figure 4-1: Distribution of monitored variables (Phase 1=Test; Phase 2=Validn)............................... 69
Figure 4-2: Distribution of NIBP measurements .............................................................................. 73
Figure 4-3: Distribution of NIBP measurements as a percent of case duration.................................. 74
Figure 4-4: Heart Rate control chart for Phase 1dataset.................................................................. 74
Figure 4-5: ST segment level control chart for Phase 1 dataset ....................................................... 75
Figure 4-6: SpO2 control chart for cases in Phase 1 dataset ........................................................... 75
Figure 4-7: Mean HR values for entire case and for initial one fifth of case in VD categories ............ 76
Figure 4-8: Standard deviation for heart rates in the VD categories ................................................ 76
Figure 4-9: Mean ST values for entire case and for initial one fifth of case in VD categories............. 77
Figure 4-10: Gender distribution for all cases compared to low, high and very high risk cases ......... 80
Figure 4-11: Age distribution in phase 1 and phase 2 ...................................................................... 80
Figure 4-12: Age distribution in risk subsets.................................................................................... 81
Figure 4-13: Distribution of ASA classes in phase 1 and phase 2 cases ............................................. 81
Figure 4-14: Distribution of emergency ASA class cases as a percent of cases in each class (phase 1
and phase 2 cases). ................................................................................................................ 82
Figure 4-15: ASA class distribution for risk categories ..................................................................... 82
Figure 4-16: ASA class distribution for emergency cases as percentage of total cases (risk subsets) . 83
Figure 4-17: Patient characteristics in relation to vascular disease burden ...................................... 84
Figure 4-18: Distribution of case duration for cases of Trend Length 30 minutes or greater and
without evidence of case segmentation .................................................................................. 85
Figure 4-19: Case duration for risk subsets ..................................................................................... 85

Figure 4-20: Case duration in relation to vascular disease status ..................................................... 86
Figure 4-21: Distribution of vascular disease location (phase 1 and phase 2) .................................. 88

Figure 4-22: Distribution of vascular disease, general, low risk, high risk and very high risk populations
............................................................................................................................................... 90
Figure 4-23: Distribution of vascular disease ................................................................................... 90
Figure 4-24: Risk factor distribution in phase 1 and phase 2 ............................................................ 92
Figure 4-25: Distribution of risk factor as percentage of cases ......................................................... 92
Figure 4-26: Distribution of risk factor count (phase 1 and phase 2) ................................................. 93
Figure 4-27: Distribution of risk factor count for risk subsets ........................................................... 93
Figure 4-28: Distribution of risk factor as percentage of cases ........................................................ 94
Figure 4-29: Weight distribution for phase 1 and phase 2 cases ....................................................... 94
Figure 4-30: Distribution of primary diagnostic groups (ICD code groups) ........................................ 95
Figure 5-1: Derived variables......................................................................................................... 106
Figure 5-2: Distribution of corVD in stClass_4A subsets ................................................................. 107
Figure 5-3: Distribution of anyVD amongst stClass_4a (high risk population) .................................. 108
Figure 5-4: Distribution of corVD amongst stClass_4a categories (low risk population) ................... 109
Figure 5-5: Example of time-series and correlation plots for HR and ST segments .......................... 110
Figure 5-6: Examples of ST/HR correlation based on raw time-series data (dataset A, upper section),
outlier removed data (dataset B, mid section) and data with outliers removed and smoothed
(dataset C, lower section) ...................................................................................................... 111
Figure 5-7: Time-series and correlation plots for raw data, data with outliers removed and data with
outliers and noise removed (Case A) ..................................................................................... 112
Figure 5-8: Time-series and ST/HR correlation plot (Case B) .......................................................... 112
Figure 5-9: Time-series and ST/HR correlation plot (Case C) .......................................................... 113
Figure 5-10: Time-series and ST/HR correlation plots (case D) ....................................................... 113
Figure 5-11: Time-series and correlation plots (Case E).................................................................. 114

Figure 5-12: Time-series and ST/HR correlation plots (Case F) ....................................................... 114
Figure 5-13: Example time-series plots consistent with repeated episodes of stress and recovery but
magnitude of ST depression was not significant (Case F) ........................................................ 115
Figure 5-14: Box plot of ST segment levels in Dataset A, Dataset B and Dataset C for cases in which
three ECG leads were monitored ........................................................................................... 115
Figure 5-15: Box plot of HR in Dataset A, Dataset B and Dataset C for cases in which three ECG leads
were monitored .................................................................................................................... 116
Figure 5-16: Clusters for heart rate time-series patterns in Phase 1 cases in general population...... 119
Figure 5-17: ST segment time-series clusters (phase 2, high risk subset) ........................................ 119
Figure 6-1: Comparison of feature selection method for high risk subset cases using dataset A and MR
............................................................................................................................................. 123

Figure 6-2: Comparison of feature selection method for high risk subset cases using dataset A and
AUC...................................................................................................................................... 124
Figure 6-3: Comparison of feature selection methods (model size) ............................................... 125
Figure 6-4: Comparison of feature selection methods (other model performance measures) ......... 125
Figure 6-5: Comparison of models based on time-series data alone or in combination with risk factor
variables (corVD, general population) (MR)........................................................................... 127
Figure 6-6: Comparison of models based on time-series data alone or in combination with risk factor
variables (corVD, general population) (AUC) ......................................................................... 128
Figure 6-7: Comparison of models based on time-series data alone or in combination with risk factor
variables (corVD, high risk population) (MR) ......................................................................... 128
Figure 6-8: Comparison of models based on time-series data alone or in combination with risk factor
variables (corVD, high risk population) (AUC) ........................................................................ 129
Figure 6-9: Prediction of coronary vascular disease using both stratified cross validation and
SpreadSubsample cross validation with J48 decision tree (misclassification rate) .................. 130
Figure 6-10: Prediction of coronary vascular disease using both stratified cross validation and
SpreadSubsample cross validation with J48 decision tree (misclassification rate) .................. 131
Figure 6-11: Prediction of coronary vascular disease using both stratified cross validation and
SpreadSubsample cross validation with J48 decision tree (AUC) ............................................ 132
Figure 6-12: Prediction of any vascular disease using both stratified cross validation and
SpreadSubsample cross validation with J48 decision tree (AUC) ............................................ 132
Figure 6-13: Use of MR for evaluation of model performance based on selected subsets (RF_only,
RF_A and RF_F) in the general and risk subset groups (DT J48, corVD)................................... 136
Figure 6-14: Use of AUC for evaluation of model performance based on selected subsets (RF_only,
RF_A and RF_F) in the general and risk subset groups (DT J48, corVD ) .................................. 136
Figure 6-15: Use of Kappa statistic for evaluation of model performance based on selected subsets
(RF_only, RF_A and RF_F) in the general and risk subset groups (DT J48, corVD )................... 137
Figure 6-16: Use of MR for evaluation of model performance based on selected subsets (RF_only,
RF_A and RF_F) in the general and risk subset groups (DT J48, anyVD ) ................................. 137
Figure 6-17: Use of AUC for evaluation of model performance based on selected subsets (RF_only,
RF_A and RF_F) in the general and risk subset groups (DT J48, anyVD ) ................................ 138
Figure 6-18: Effect of inclusion of ASA class in corVD prediction models for general populations (MR)
............................................................................................................................................ 139
Figure 6-19: Effect of ASA class and ASA class imputation in prediction of corVD in general population
(AUC) ................................................................................................................................... 140
Figure 6-20: Effect of ASA class and ASA class imputation in prediction of corVD in general populations
(Kappa statistic).................................................................................................................... 140
Figure 6-21: Performance evaluation using MR for low and very high risk subsets and selected
datasets ............................................................................................................................... 142

Figure 6-22: Performance evaluation using AUC for low risk and very high risk subsets and selected
datasets ................................................................................................................................ 142
Figure 6-23: Performance evaluation using the Kappa statistic for select data subsets in the low risk
and very high risk categories ................................................................................................. 143
Figure 6-24: Comparison of methods in prediction of corVD in selected datasets and using stratified
cross validation in general population and high risk subset (MR).......................................... 144
Figure 6-25: Comparison of methods in prediction of corVD in selected datasets and using stratified
cross validation (AUC) ........................................................................................................... 144
Figure 6-26: Comparison of DT and rule based methods in the prediction of coronary VD in low risk
and high risk populations following Cfs feature reduction (MR) ............................................. 145
Figure 6-27: Comparison of DT and rule based methods in the prediction of coronary VD in low and
high risk subsets following Cfs feature reduction (AUC) ......................................................... 146
Figure 6-28: Effect of risk category (RiskC) on prediction of corVD (MR) ........................................ 148
Figure 6-29: Effect of risk category on prediction of corVD (Sensitivity and PPV) ........................... 148
Figure 6-30: Effect of risk category on prediction of corVD (AUC and Kappa) .................................. 149
Figure 6-31: Effect of risk category on prediction of corVD (Specificity and NPV) ............................ 149
Figure 6-32: Measure of model complexity in prediction of corVD, all variables.............................. 152
Figure 6-33: Measures of model complexity in prediction of corVD using Cfs data ......................... 152
Figure 6-34: Comparison models for prediction of corVD and anyVD in general population (Sensitivity)
............................................................................................................................................. 164
Figure 6-35: Comparison models for prediction of corVD and anyVD in general population (AUC) . 165
Figure 6-36: Comparison models for prediction of corVD and anyVD in general population (PPV)... 165
Figure 6-37: Comparison of models based on time-series data alone or in combination with risk factor
variables (anyVD, general population, Cfs) (MR) .................................................................... 166
Figure 6-38: Comparison of models based on time-series data alone or in combination with risk factor
variables (anyVD, general population) (AUC) ......................................................................... 166
Figure 6-39: Comparison of models based on time-series data alone or in combination with risk factor
variables (anyVD, high risk population, Cfs) (MR) ................................................................... 167
Figure 6-40: Comparison of models based on time-series data alone or in combination with risk factor
variables (anyVD, high risk population, Cfs) (AUC) ................................................................. 167
Figure 6-41: Effect of nonCorVD status and risk category on prediction of corVD ........................... 168
Figure 6-42: Effect of nonCorVD status and risk category on prediction of corVD (AUC) ................. 169
Figure 6-43: Effect of nonCorVD status and risk category on prediction of corVD (Kappa) .............. 169

List of Appendices
Appendix A: Time-series characteristics for Phase 1 and Phase 2 .................................................. 203
Appendix B: Description of ICD code groups ................................................................................. 204
Appendix C: Examples of ST depression and elevation (Yanowitz 1996) ....................................... 205
Appendix D: Variations of ST depression (Yanowitz 1996) ............................................................ 208
Appendix E: ECG lead placement (Yanowitz 1996) ....................................................................... 209
Appendix F: Distribution of risk factor variables (shows proportion of type 1 diabetes, current and
past smoking and various lipid disorders).............................................................................. 210
Appendix G: Distribution of vascular disease burden per case ..................................................... 211
Appendix H: Examples of time-series derived variables ................................................................ 212
Appendix I: Effect of data pre-processing on count of heart rate values reaching adequate predicted
maximum heart rate (phase 2, general populations and high risk subset).............................. 213
Appendix J: Distribution of HR values in different patient groups .................................................. 214
Appendix K: Effect of removal of outliers and noise on the number of cases considered to have
significant ST deviation ......................................................................................................... 216
Appendix L: Comparison of DT and rule based methods for selected datasets (AUC) ..................... 217
Appendix M: Comparison of DT and rule based methods for selected datasets (MR) .................... 218
Appendix N: Comparison of methods and dataset for risk category subsets (MR) .......................... 219
Appendix O: Comparison of methods and datasets for risk category subsets (AUC) ...................... 220
Appendix P: Comparison of methods and datasets for risk category subsets (Kappa statistic) ...... 221
Appendix Q: Comparison of methods in prediction of anyVD in selected subsets........................... 222
Appendix R: Description of ST classes ........................................................................................... 225
Appendix S: Examples of J48 decision tree rule sets ...................................................................... 226
Appendix T: Examples of decision trees ........................................................................................ 230
Appendix U: Comparison of DT and rule based methods for prediction of anyVD.......................... 233

List of Abbreviations

Abbreviation              Description

AA                        Aortic aneurysm
AAA                       Abdominal aortic aneurysm
AARK                      Automated anaesthesia record keeping
ABPI                      Ankle brachial pressure index
ACB                       Aorto-coronary bypass
ACS                       Acute coronary syndrome
AHA/ACC                   American Heart Association / American College of Cardiology
AIM                       Anaesthetic information management
ANN                       Artificial neural networks
ASA                       American Society of Anesthesiologists
AUC                       Area under ROC curve
BP / SBP                  Blood pressure / systolic blood pressure
bpm                       Beats per minute
Bradycardia               Heart rate below normal range
CABG                      Coronary artery bypass grafting
CAD                       Coronary artery disease = CVD
CART                      Classification and regression tree
CCTA                      Coronary computed tomographic angiography
cerVD                     Cerebral vascular disease
CHD                       Coronary heart disease = CVD
Chronotropic HR response  Ability to increase heart rate in response to increased demand
Claudication              Pain related to ischaemic tissue associated with atherosclerotic vascular disease, frequently in legs and initially with exercise only
CO2ET, FI                 Inspired (FI) and expired (end tidal, ET) concentration of carbon dioxide
CombinVD                  Coronary vascular disease and non coronary vascular disease
corVD                     Term representing the presence of coronary vascular disease in the models developed here. It has the same meaning as CVD, CHD
CRP                       C-reactive protein, a marker of inflammation
CVD                       Cardiovascular disease
CVE                       Cardiovascular events
CVP                       Central venous pressure measured via a venous line extended to a central vein
DESET, FI                 Inspired (fraction inspired, FI) and expired (end-tidal, ET) concentration of Desflurane, a volatile anaesthetic agent
DM                        Diabetes mellitus / data mining
DT                        Decision tree
ECG                       Electrocardiography / electrocardiogram
exECG                     Exercise/stress electrocardiography / exercise tolerance test
FiO2                      Fraction of inspired oxygen
FORF                      Random forests
FRS                       Framingham risk score
FS                        Feature selection
HR                        Heart rate
HR recovery               Rate at which the HR returns to baseline level following exercise or pharmacological stimulus
HRminDiff                 HR at which there is least variability in ST segment level
HRR                       Heart rate recovery
HRV                       Heart rate variability
HSM                       Health stability measure
HT                        Hypertension
Hypertension              BP above normal range
Hyperventilation          Rapid respiratory rate
Hypokalemia               Low serum potassium level
Hypotension               BP below normal range
Hypovolaemia              Reduced volume of blood in the intravascular space
IL-6                      Interleukin-6, a marker of inflammation
LogR / LR                 Logistic regression
LV                        Left ventricle
LVEF                      Left ventricular ejection fraction, a measure of heart pump function
maxDiff                   Maximum variability of ST segment level at particular heart rate
MCDA                      Multi-criteria decision analysis
MET                       Metabolic equivalents, a measure of exercise performed
MI                        Myocardial infarction
minDiff                   Minimum variability of ST segment level at specific HR
NB                        Naïve Bayes
NIBP(sys)                 Non invasive Blood Pressure (systolic)
nonCorVD                  Non coronary vascular disease (either cerebral or peripheral)
O2ET, FI                  Inspired (FI) and expired (end tidal, ET) concentration of oxygen
OR                        Odds ratio
PAA                       Piecewise aggregate approximation
PAD                       Peripheral arterial disease
PART                      Partial regression tree
PCA                       Piecewise constant approximation
perVD                     Peripheral vascular disease
PLA                       Piecewise linear approximation
PLR                       Piecewise linear regression
PMV                       Prolonged mechanical ventilation
PTCA                      Percutaneous transluminal coronary angioplasty (reducing lipid plaque intrusion into vessel lumen using a balloon catheter introduced via a peripheral artery)
Q wave                    Component of electrocardiogram, suggestive of myocardial necrosis
r/o                       Removal of
RCRI                      Revised Cardiac Risk Index
RF                        Risk factor
ROC                       Receiver operating characteristics
RPP                       Rate pressure product (HR x SBP)
RR                        Respiratory rate
R-R interval              Distance between two consecutive R waves on the ECG
SAX                       Symbolic aggregate approximation
SCD                       Sudden cardiac death
Serum creatinine          Measure of kidney disease
SEVET / FI                Inspired (fraction inspired, FI) and expired (end-tidal, ET) concentration of Sevoflurane
SIRS                      Systemic inflammatory response syndrome
SMO                       Optimised support vector machines
SpO2                      Peripheral oxygen saturation
SRI                       Stress recovery index
ST                        Level of ST segment in ECG tracing
SVM                       Support vector machines
T, A, T/A                 Thoracic, abdominal, thoraco-abdominal (aorta)
T wave                    Final component of ECG tracing, reflects repolarisation
Tachycardia               Heart rate above normal range
TAN                       Tree augmented Naïve Bayes
TWA                       T-wave alternans, represents a beat to beat alteration in the shape, amplitude and timing of the ST segment and T wave
Urea                      Another measure of kidney function
VD                        Vascular disease

Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements
for an award at this or any other higher education institution. To the best of my knowledge and
belief, the thesis contains no material previously published or written by another person except
where due reference is made.

(Ellen Pitt)

Date: 13 August, 2009
