Probability and Statistics Major Assignment, Bach Khoa University (ĐH BK)


Probability and Statistics Project

CC04

Instructor: Prof. Nguyễn Tiến Dũng

Table of Contents
I. Topic
  1. Context
  2. Attribute Information
II. Theory
  1. Logistic Regression Analysis
  2. Method of Maximum Likelihood
  3. Optimal model selections
  4. Information about data
    a. Types of Angina (ChestPainType)
    b. Serum cholesterol concentration
    c. Resting ECG results (RestingECG)
    d. Exercise intensity (ST_Slope)
III. Code:
  1. Import data
  2. Clean the data
  3. Data structure overview
  4. A table of quantity statistics
    a. Sex
    b. ChestPainType
    c. FastingBS
    d. RestingECG
    e. ExerciseAngina
    f. ST_Slope
    g. HeartDisease
  5. Plot the Histogram and Plot the Barplot
    a. Age and RestingBP histogram
    b. Cholesterol and MaxHR histogram
    c. Oldpeak histogram
    d. Sex and CPT barplots
    e. FastingBloodSugar and Resting ECG barplots
    f. ExerciseAngina, ST_Slope and HeartDisease barplots
    g. Sex and CPT barplots for HeartDisease
    h. FastingBS and RestingECG barplots for HeartDisease
    i. ExerciseAngina and ST_Slope barplots for HeartDisease
    j. Age vs HeartDisease histogram
    k. RestingBP vs HeartDisease histogram
    l. Cholesterol vs HeartDisease histogram
    m. MaxHR vs HeartDisease histogram
    n. Oldpeak vs HeartDisease histogram
  6. Model
    a.
    b.
    c.
    d.
    e.
    f. The prediction for training set
    g. The prediction for testing set
IV. References


I. Topic
1. Context
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an
estimated 17.9 million lives each year, which accounts for 32% of all deaths
worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third
of these deaths occur prematurely in people under 70 years of age. Heart failure
is a common event caused by CVDs, and this dataset contains 11 features that can be
used to predict possible heart disease.
People with cardiovascular disease or who are at high cardiovascular risk (due to the
presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia
or already established disease) need early detection and management wherein a
machine learning model can be of great help.
2. Attribute Information
Age: age of the patient [years]
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mg/dl]
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: ST depression (oldpeak) [Numeric value]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease: output class [1: Heart disease, 0: Normal]
II. Theory
1. Logistic Regression Analysis
Regression methods have become an integral component of any data analysis
concerned with describing the relationship between a response variable and one or
more explanatory variables. It is often the case that the outcome variable is discrete,
taking on two or more possible values. Over the last decade, the logistic regression
model has become, in many fields, the standard method of analysis in this situation.
In linear regression analysis and analysis of variance, we model the relationship
between a continuous dependent variable and one or more independent variables, which
may be either continuous or discrete. But in many cases, the dependent variable is not
a continuous variable but a binary measure: yes/no, ill/healthy, deceased/alive,
occurred/didn't occur, etc., while the independent variables can be continuous or
discrete. We want to find out the relationship between the independent and dependent
variables.
We'll focus on the case of a binary response variable coded using 0 and 1. In practice,
these 0s and 1s will code for two classes such as yes/no, win/lose, ill/healthy, etc.
$$Y = \begin{cases} 1, & \text{yes} \\ 0, & \text{no} \end{cases}$$
Given an event frequency $x$ recorded from $n$ subjects, we can calculate the probability
of that event as:
$$p = \frac{x}{n}$$

First, we define some notation that we will use throughout.
$$p(x) = P[Y = 1 \mid X = x]$$
With a binary (Bernoulli) response, we'll mostly focus on the case when $Y = 1$, since
with only two possibilities, it is trivial to obtain probabilities when $Y = 0$.
$$P[Y = 0 \mid X = x] + P[Y = 1 \mid X = x] = 1$$
$$P[Y = 0 \mid X = x] = 1 - p(x)$$
Another way of expressing risk is the odds. The odds of an event are defined as the
ratio of the probability of the event occurring to the probability of the event not
occurring:
$$\text{odds} = \frac{p}{1 - p}$$
We now define the logistic regression model.
$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$
The relationship between $p$ and $\text{logit}(p)$ is continuous and has the following
form:
$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}$$
Immediately we notice some similarities to ordinary linear regression, in particular,
the right-hand side. This is our usual linear combination of the predictors.
The left-hand side is called the log(odds), which is the log of the odds. The odds are
the probability for a positive event (𝑌 = 1) divided by the probability of a negative
event (𝑌 = 0). So when the odds are 1, the two events have equal probability. Odds
higher than 1 favor a positive event. The opposite is true when the odds are lower than
1.
With logistic regression, which uses the Bernoulli distribution, we only need to
estimate the Bernoulli distribution’s single parameter 𝑝(x), which happens to be its
mean.
So even though we introduced ordinary linear regression first, in some ways, logistic
regression is actually simpler. Note that applying the inverse logit transformation
allows us to obtain an expression for $p(x)$:
$$p(x) = P[Y = 1 \mid X = x] = \frac{e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}}}{1 + e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}}}$$
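The logit and inverse logit transformations can be illustrated with a short sketch. It is written in Python for illustration only (the report's own analysis uses R), and the function names `logit`, `inv_logit` and `predict_probability` are assumptions introduced here, not names from the report.

```python
import math

def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse logit: maps any real number eta to a value in (0, 1)."""
    # Two branches keep exp() from overflowing for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def predict_probability(beta, x):
    """p(x) for coefficients beta = [b0, b1, ...] and predictors x = [x1, ...]."""
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return inv_logit(eta)
```

Because `inv_logit` undoes `logit`, a linear predictor of 0 corresponds to a probability of 0.5, that is, odds of 1.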

With $n$ observations, we write the model indexed with $i$ to note that it is being applied
to each observation.
$$\log\left(\frac{p(x_i)}{1 - p(x_i)}\right) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i(p-1)}$$


We can apply the inverse logit transformation to obtain $P[Y_i = 1 \mid X_i = x_i]$ for each
observation. Since these are probabilities, it's good that we used a function that
returns values between 0 and 1.
$$p(x_i) = P[Y_i = 1 \mid X_i = x_i] = \frac{e^{\beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i(p-1)}}}$$
$$1 - p(x_i) = P[Y_i = 0 \mid X_i = x_i] = \frac{1}{1 + e^{\beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i(p-1)}}}$$

Given a single independent variable $x$ ($x$ can be continuous or discrete), the logistic
regression model states that:
$$\text{logit}(p) = \alpha + \beta x$$
Similar to the linear regression model, $\alpha$ and $\beta$ are two parameters that need to
be estimated from the research data. But the meaning of these parameters, especially
$\beta$, is very different from the meaning we are used to with linear regression
models: $\beta$ is the change in the log-odds, not in the response itself, for a one-unit
increase in $x$.
$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \alpha + \beta x$$
$$\text{odds}(p) = \frac{p}{1 - p} = e^{\alpha + \beta x}$$
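A consequence of the odds formula is that increasing x by one unit multiplies the odds by e^β, whatever the starting value of x. The Python sketch below checks this numerically; the values of `alpha` and `beta` are hypothetical and are not estimates from the heart disease data.

```python
import math

def odds(alpha, beta, x):
    """Odds p / (1 - p) implied by logit(p) = alpha + beta * x."""
    return math.exp(alpha + beta * x)

# Hypothetical coefficients chosen only for illustration.
alpha, beta = -1.0, 0.5

# Odds ratio for a one-unit increase in x, at two different starting points.
ratio_at_2 = odds(alpha, beta, 3.0) / odds(alpha, beta, 2.0)
ratio_at_10 = odds(alpha, beta, 11.0) / odds(alpha, beta, 10.0)
```

Both ratios equal e^β, which is why e^β is usually reported as the odds ratio associated with a predictor.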
2. Method of Maximum Likelihood
The parameters of the model can be estimated by the method of maximum likelihood.
This is a quite general technique, similar to the least-squares method in that it finds a
set of parameters that optimizes a goodness-of-fit criterion (in fact, least squares
coincides with maximum likelihood when the errors are assumed to be independent and
normally distributed with constant variance). The likelihood function $L(\beta)$ is simply
the probability of the entire observed data set, viewed as a function of the parameters.
To "fit" this model, that is, to estimate the $\beta$ parameters, we will use maximum
likelihood.
$$\beta = [\beta_0, \beta_1, \beta_2, \ldots, \beta_{p-1}]$$
We first write the likelihood given the observed data.
$$L(\beta) = \prod_{i=1}^{n} P[Y_i = y_i \mid X_i = x_i]$$

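This is already technically a function of the $\beta$ parameters, but we'll do some
rearrangement to make this more explicit.
$$L(\beta) = \prod_{i=1}^{n} p(x_i)^{y_i} \left(1 - p(x_i)\right)^{(1 - y_i)}$$
$$L(\beta) = \prod_{i : y_i = 1} p(x_i) \prod_{j : y_j = 0} \left(1 - p(x_j)\right)$$
$$L(\beta) = \prod_{i : y_i = 1} \frac{e^{\beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i(p-1)}}} \prod_{j : y_j = 0} \frac{1}{1 + e^{\beta_0 + \beta_1 x_{j1} + \ldots + \beta_{p-1} x_{j(p-1)}}}$$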
Unfortunately, unlike ordinary linear regression, there is no analytical solution for this
maximization problem. Instead, it will need to be solved numerically. This is where
we will require the help of computer software. For our problem, R can take care of
this for us using an iteratively reweighted least squares algorithm. We'll leave the
details for a machine learning or optimization course, which would likely also discuss
alternative optimization strategies.
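To make the numerical solution concrete, here is a minimal Python sketch (not the R code used in this report) for the one-predictor model logit(p) = α + βx. It uses Newton-Raphson steps, which for the logit link coincide with iteratively reweighted least squares. The function name `fit_simple_logistic` and the eight data points are hypothetical and serve only as an illustration.

```python
import math

def fit_simple_logistic(xs, ys, iterations=25, tol=1e-10):
    """Estimate (alpha, beta) in logit(p) = alpha + beta * x by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood (score)
        h00 = h01 = h11 = 0.0    # entries of the information matrix X'WX
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)    # IRLS weight for this observation
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (X'WX) * delta = score in closed form.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b

# Hypothetical toy data: the two classes overlap, so a finite maximum exists.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
a_hat, b_hat = fit_simple_logistic(xs, ys)
```

At the returned estimates the score equations are satisfied up to numerical tolerance, which is the defining property of the maximum likelihood solution. If the classes were perfectly separated by x, no finite maximum would exist and the iteration would not settle.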
Logistic regression analysis belongs to the class of generalized linear models. These
models are characterized by their response distribution (here the binomial distribution)
and a link function, which transfers the mean value to a scale in which the relation to
background variables is described as linear and additive. In a logistic regression
analysis, the link function is
$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$

3. Optimal model selections
One of the difficult problems in multivariable logistic regression analysis is
choosing a model that can adequately describe the data. Suppose we are conducting a
study with a dependent variable y and 3 independent variables x1, x2 and x3. We can
obtain the following models to predict y: y = f(x1), y = f(x2), y = f(x3),
y = f(x1, x2), y = f(x1, x3), y = f(x2, x3), and y = f(x1, x2, x3), where f is
a function. In general, with k independent variables x1, x2, x3, ..., xk, we can have up to
2^k - 1 different models to choose from.
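The count of 2^k - 1 can be checked by listing every non-empty subset of the predictors. This is a small Python sketch for illustration; the helper name `candidate_models` is an assumption introduced here.

```python
from itertools import combinations

def candidate_models(predictors):
    """Return every non-empty subset of predictors, one tuple per candidate model."""
    models = []
    for size in range(1, len(predictors) + 1):
        models.extend(combinations(predictors, size))
    return models

# The three-predictor example from the text.
models = candidate_models(["x1", "x2", "x3"])
```

For three predictors this yields the 7 models listed above. The count doubles (plus one) with each additional predictor, which is why an automatic criterion is needed once k grows.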
It is often the case that when a model has too many independent variables, some of
these will not contribute to the outcome of the model, as there may be no association
between the independent variable and the dependent variable. In this situation, it is
recommended that we detect and remove these unnecessary variables so that the
estimates are not distorted. A further advantage of removing these variables is that
the model becomes much easier to interpret. When a model with 3 independent variables
predicts the data as well as or better than a model with 5 independent variables, the
first model is chosen for its simplicity. It must be noted, however, that the removed
independent variables must be confirmed to have no statistically significant effect on
the dependent variable, since it is counterproductive to give up accuracy for clarity.
The criterion of adequacy here means that the model must describe the data
satisfactorily, i.e. it must predict values close (or as close as possible) to the
actual observed value of the dependent variable y. If the observed value of y is 10,
and one model predicts 9 while another predicts 6, the former must be considered more
adequate.
The criterion of "practical significance", as it is called, means that the model must be
supported by theory or have biological significance (in biological research), clinical
significance (in clinical studies), etc. It is possible that phone numbers are somehow
related to fracture rates, but of course such a model makes no sense, as correlation
does not imply causation. This is an important criterion, because if a statistical
analysis results in a model that makes a lot of mathematical sense but has no practical
significance, then the model is just a numbers game, without any real scientific value.
The criterion of practical significance belongs to the realm of theory, and we will not
discuss it here. An important and useful metric for deciding on a simple and adequate
model is the Akaike Information Criterion (AIC). The formula for calculating the AIC
value is:
$$AIC = -2 \log(\text{Likelihood}) + 2k$$
where $k$ is the number of parameters.
A simple and adequate model should be one with as low an AIC value as possible, and
its independent variables must be statistically significant. Thus, the problem of
finding a simple and adequate model is actually that of looking for the model (or
models) with the lowest or near-lowest AIC value.
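The AIC formula can be applied directly once each candidate model's maximized log-likelihood is known. The Python sketch below compares two models; the names `model_A` and `model_B` and their log-likelihood values are hypothetical and are not results from this project.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: -2 * log-likelihood + 2 * k."""
    return -2.0 * log_likelihood + 2 * k

# Hypothetical (maximized log-likelihood, number of parameters) pairs.
candidates = {
    "model_A": (-250.0, 4),
    "model_B": (-248.5, 7),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # the model with the lowest AIC
```

Here `model_B` fits slightly better, but its three extra parameters cost more than the improvement in log-likelihood is worth, so `model_A` has the lower AIC (508 against 511).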
4. Information about data
a. Types of Angina (ChestPainType)
+ Typical angina (TA): Typical angina consists of the following 3 features: a
strangling pain, tightness or pressure in the left chest or behind the breastbone,
which may radiate to the chin or left hand; a regular pattern, with pain that
increases after exertion, strong emotions, or exposure to cold; and a duration that is
usually 3 to 5 minutes.
+ Atypical angina (ATA): Atypical angina differs from classic angina. There are
no signs of an ischemic heart attack. The symptoms of atypical angina can range from
dull or sharp pain to tearing pain, or include symptoms such as shortness of breath or
back pain.
+ Non-anginal pain (NAP): Unstable chest pain caused by blockage of a coronary
artery without having a heart attack. Symptoms include chest discomfort with or
without shortness of breath, nausea, and sweating. Diagnosis is by
electrocardiogram and the presence or absence of serologic findings.
+ Asymptomatic (ASY)
b. Serum cholesterol concentration
The higher the level of HDL cholesterol in the blood, the lower the risk of
cardiovascular disease. Conversely, when HDL cholesterol levels fall below 40 mg/dL,
the risk of cardiovascular disease increases.
c. Resting ECG results (RestingECG)
+ ST: The ST segment indicates that the depolarization of the ventricular myocardium
has completed. Usually, the ST segment is level with the isoelectric line, like the PR
(or TP) interval. Sometimes it lies slightly above the isoelectric line.
d. Exercise intensity (ST_Slope)
ST_Slope records the slope of the ST segment at peak exercise, as listed in the
attribute information:
+ Up: the ST segment slopes upward (upsloping).
+ Flat: the ST segment is flat.
+ Down: the ST segment slopes downward (downsloping).
And some other medical information.
III. Code:
1. Import data


2. Clean the data

Comments: There are no missing values to process in the data frame df.
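A check of this kind can be sketched as follows. This is a Python illustration, not the R code used in the report; the three CSV rows in `SAMPLE` are hypothetical, with two gaps inserted on purpose, and the helper name `count_missing` is an assumption introduced here.

```python
import csv
import io

# Hypothetical rows with two deliberately missing entries (an empty field and "NA").
SAMPLE = """Age,Sex,RestingBP,Cholesterol,HeartDisease
40,M,140,289,0
49,F,160,,1
37,M,NA,283,0
"""

def count_missing(csv_text, missing_tokens=("", "NA")):
    """Count missing entries per column in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value is None or value.strip() in missing_tokens:
                counts[name] += 1
    return counts

missing = count_missing(SAMPLE)
```

A column whose count is zero needs no imputation or row removal. Note that a check like this only finds explicitly empty or `NA` entries; a placeholder value such as 0 would not be counted.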

3. Data structure overview

4. A table of quantity statistics

a. Sex

There are 178 female patients participating in the test.
There are 655 male patients participating in the test.

b. ChestPainType

There are 450 asymptomatic patients.
There are 160 patients having atypical angina.
There are 185 patients having non-anginal pain.
There are 38 patients having typical angina.
c. FastingBS

There are 647 patients having fasting blood sugar <= 120 mg/dl.
There are 186 patients having fasting blood sugar > 120 mg/dl.

d. RestingECG

There are 171 patients whose resting ECG results show probable or definite left
ventricular hypertrophy by Estes' criteria.
There are 499 patients whose resting ECG results are normal.
There are 163 patients whose resting ECG results show ST-T wave abnormality.
e. ExerciseAngina

There are 498 patients not having exercise-induced angina.
There are 335 patients having exercise-induced angina.

f. ST_Slope

There are 62 patients whose peak exercise ST segment slopes down.
There are 413 patients whose peak exercise ST segment is flat.
There are 358 patients whose peak exercise ST segment slopes up.
g. HeartDisease

There are 378 patients not having heart disease.
There are 455 patients having heart disease.
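As a consistency check, the counts reported in this section can be summed per variable; every variable should account for the same number of patients. The Python sketch below uses only the numbers stated above. The level labels for FastingBS and HeartDisease ("0"/"1") follow the coding given in the attribute information.

```python
# Counts copied from the frequency tables reported in this section.
reported_counts = {
    "Sex": {"F": 178, "M": 655},
    "ChestPainType": {"ASY": 450, "ATA": 160, "NAP": 185, "TA": 38},
    "FastingBS": {"0": 647, "1": 186},
    "RestingECG": {"LVH": 171, "Normal": 499, "ST": 163},
    "ExerciseAngina": {"N": 498, "Y": 335},
    "ST_Slope": {"Down": 62, "Flat": 413, "Up": 358},
    "HeartDisease": {"0": 378, "1": 455},
}

# Total number of patients implied by each variable's table.
totals = {name: sum(levels.values()) for name, levels in reported_counts.items()}

# Share of patients with heart disease in the whole sample.
prevalence = reported_counts["HeartDisease"]["1"] / totals["HeartDisease"]
```

Every table sums to 833 patients, so the reported counts are mutually consistent, and about 55% of the patients in the cleaned data have heart disease.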

5. Plot the Histogram and Plot the Barplot
a. Age and RestingBP histogram

Comments:
The age histogram shows that most patients are aged between 50 and 60. The
highest number of patients is in the 55-60 age group (184 people), while the number of
patients aged around 30 is the lowest (12 people). The distribution has the
symmetrical bell shape of a normal distribution.
The resting BP histogram shows that the majority of patients have a resting BP from
125 to 150 mmHg. The most and least common resting BP ranges among patients are
110 – 125 mmHg (344 people) and 75 – 90 mmHg (1 person), respectively. Unusually,
1 patient has a recorded resting BP of 0 – 10 mmHg, which is not physiologically
plausible and likely indicates a missing or mis-recorded measurement.

b. Cholesterol and MaxHR histogram

Comments:
In the cholesterol histogram, most patients have serum cholesterol from 200
to 300 mg/dl. The 250 – 275 mg/dl range contains the most patients (238), while the
500 – 575 mg/dl range contains the fewest (2 patients). Unusually, 148 people have a
recorded serum cholesterol of 0 – 25 mg/dl, which is not physiologically plausible
and likely indicates missing measurements.
The max HR histogram shows that the majority of patients have a max HR from 125 to
175. The most common max HR range is 125 – 150 (224 people), and the least common
ranges are 50 – 75 and around 200 (4 people). The distribution has the symmetrical
bell shape of a normal distribution.

c. Oldpeak histogram

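Comments:
The oldpeak histogram shows that most patients have an oldpeak between 0
and 3.
The highest number of patients have an oldpeak from -1 to 0 (404 people),
while the number of patients with an oldpeak under -2 is the lowest (1 person). With so
many values concentrated in the -1 to 0 bin and the remainder spread toward higher
values, the distribution appears right-skewed rather than having the symmetrical bell
shape of a normal distribution.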

d. Sex and CPT barplots

e. FastingBloodSugar and Resting ECG barplots

f. ExerciseAngina, ST_Slope and HeartDisease barplots

g. Sex and CPT barplots for HeartDisease

Comments:
The percentage of patients with heart disease is higher in the male group than
in the female group.
The proportion of patients with heart disease is higher in the group of patients
with chest pain than among patients with other symptoms.
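Comparisons like these rest on the proportion of heart disease within each group. The Python sketch below shows the computation; the six records are hypothetical toy data, not rows from the project's dataset, and the helper name `disease_rate_by_group` is an assumption introduced here.

```python
from collections import defaultdict

def disease_rate_by_group(records, group_key, outcome_key="HeartDisease"):
    """Proportion of records with outcome 1 within each level of group_key."""
    totals = defaultdict(int)
    cases = defaultdict(int)
    for rec in records:
        totals[rec[group_key]] += 1
        cases[rec[group_key]] += rec[outcome_key]
    return {group: cases[group] / totals[group] for group in totals}

# Hypothetical toy records for illustration only.
records = [
    {"Sex": "M", "HeartDisease": 1},
    {"Sex": "M", "HeartDisease": 1},
    {"Sex": "M", "HeartDisease": 0},
    {"Sex": "F", "HeartDisease": 0},
    {"Sex": "F", "HeartDisease": 1},
    {"Sex": "F", "HeartDisease": 0},
]
rates = disease_rate_by_group(records, "Sex")
```

Comparing proportions rather than raw counts matters here because the groups differ greatly in size (for example, 655 male against 178 female patients).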

h. FastingBS and RestingECG barplots for HeartDisease

Comments:
The proportion of patients with heart disease is higher in the group with
fasting blood sugar above 120 mg/dl than in the group with fasting blood sugar of
120 mg/dl or less.
The percentage of patients with heart disease is higher in the group with abnormal
ECG results than in patients with other ECG results. However, because all three
resting ECG groups have a higher proportion of people with the disease than without
it, this variable does not help us much in predicting the probability of a person
having the disease.

i. ExerciseAngina and ST_Slope barplots for HeartDisease

Comments:
The percentage of patients with heart disease is higher in the group with
exercise-induced angina than in the group without it.
The proportion of patients with heart disease is higher in the group with a flat
ST slope than in the other groups.

Based on the plots, the variables ST_Slope, ExerciseAngina, FastingBS,
ChestPainType, Oldpeak, MaxHR, Cholesterol and Age have an effect on predicting
disease probability.

j. Age vs HeartDisease histogram

Comments:
The average age of patients with the disease is higher than that of patients
without the disease. It can be said that older people have a higher risk of getting
heart disease.
The highest percentage of heart disease is in the 56-57 age group. In contrast,
the lowest is in the under-30 age group.

This is not a normal distribution because the counts rise on the right side of
the mean.

k. RestingBP vs HeartDisease histogram

Comments:
The mean resting blood pressure of patients with the disease is higher than that
of patients without the disease. In general, however, the frequency distributions of
people with and without the disease are comparable (almost identical). Therefore,
resting blood pressure does not help predict the probability of a person developing
cardiovascular disease.

This is not a normal distribution because, on the right side of the mean, the
values are higher and fluctuate.

l. Cholesterol vs HeartDisease histogram

Comments:
The mean cholesterol of patients with the disease is lower than that of
patients without the disease. However, among patients with cholesterol above 300, the
number with the disease is always higher than the number without it. Therefore,
cholesterol is a relatively useful predictor of the probability of a person having
heart disease.

This is not a normal distribution because, on the left side of the mean, there
are higher values and they change unpredictably.

m. MaxHR vs HeartDisease histogram

Comments:
The mean maximum heart rate of patients with the disease is lower than that
of patients without the disease. The maximum heart rate of people with the disease is
distributed over lower values than that of people without the disease, which helps us
predict the probability of a person having heart disease.

This is not a normal distribution because it does not peak in the middle and
fall away gradually on both sides.

n. Oldpeak vs HeartDisease histogram

Comments:
The mean depression index (Oldpeak) of patients with the disease is higher than
that of patients without the disease. The distribution of the depression index for
people with the disease is shifted toward higher values than for people without the
disease. Therefore, this is a factor that helps us predict the probability of a person
having the disease.

This is not a normal distribution because it does not peak in the middle and
fall away gradually on both sides.
