
PROBABILITY AND STATISTICS PROJECT

GROUP 05

TABLE OF CONTENTS
LIST OF FIGURES
ACKNOWLEDGMENT
1. INTRODUCTION
1.1. Topic introduction and requirements
1.1.1. Subject
1.1.2. R-studio
1.1.3. Our problems
1.2. Theoretical basis
2. DATA CORRECTION
2.1. Import data
2.2. Data cleaning
2.3. Data clarification
2.4. Logistic Regression
2.5. Prediction
3. CODE R
REFERENCES

LIST OF FIGURES
Figure 1: R code and results after reading data
Figure 2: R code and results when checking missing data in file "diabetes"
Figure 3: R code and results when performing descriptive statistics
Figure 4: R code and results when performing quantitative statistics for the variable "Outcome"
Figure 5: The result when plotting the histograms of the variables "Pregnancies" and "Glucose"
Figure 6: The result when plotting the histograms of the variables "Blood Pressure" and "Skin Thickness"
Figure 7: The result when plotting the histograms of the variables "Insulin" and "BMI"
Figure 8: The result when plotting the histograms of the variables "Diabetes Pedigree Function" and "Age"
Figure 9: R code
Figure 10: The resulting histogram shows the distribution of the number of pregnancies for people having and not having diabetes
Figure 11: R code
Figure 12: The resulting histogram shows the distribution of skin thickness for people having and not having diabetes
Figure 13: R code
Figure 14: The resulting histogram shows the distribution of glucose level for people having and not having diabetes
Figure 15: R code
Figure 16: The resulting histogram shows the distribution of blood pressure for people having and not having diabetes
Figure 17: R code
Figure 18: The resulting histogram shows the distribution of insulin level for people having and not having diabetes
Figure 19: R code
Figure 20: The resulting histogram shows the distribution of BMI (body mass index) for people having and not having diabetes
Figure 21: R code
Figure 22: The resulting histogram shows the distribution of diabetes pedigree function for people having and not having diabetes
Figure 23: R code
Figure 24: The resulting histogram shows the distribution of age for people having and not having diabetes
Figure 25: R code and results
Figure 26: R code and results when removing the SkinThickness variable from model 1
Figure 27: R code and results when removing the Insulin variable from model 2
Figure 28: R code and results when removing the Age variable from model 3
Figure 29: R code and results when comparing the efficiency of model 1 and model 2
Figure 30: R code and results when comparing the efficiency of model 2 and model 3
Figure 31: R code and results when comparing the efficiency of model 3 and model 4
Figure 32: R code
Figure 33: Results when building an equation with all 8 variables
Figure 34: Result when removing the SkinThickness variable from the first model
Figure 35: R code and summary of model results
Figure 36: R code and results
Figure 37: R code and results
Figure 38: R code and the results of forecasting based on the original data set, saved in the file "diabetes"
Figure 39: R code and statistical results
Figure 40: R code and statistical results
Figure 41: R code and comparison results
Figure 42: R code and test results
Figure 43: R code and test results
Figure 44: R code and evaluation results



ACKNOWLEDGMENT
First of all, we would like to express our gratitude to Professor Nguyen Tien Dung for giving our
group the chance to work with the RStudio software, and for sharing with us an abundance of
knowledge about Probability and Statistics. This project is an opportunity for us to use RStudio in
practice. We also understand that RStudio is an important tool in the world of mathematics today;
the software broadens not only our knowledge but also our ideas for future projects.


PROJECT IN PROBABILITY AND STATISTICS FOR CHEMICAL
ENGINEERING (MT2013)

1. INTRODUCTION
1.1. Topic introduction and requirements
1.1.1. Subject
Probability is the branch of mathematics concerned with numerical descriptions of how likely an
event is to occur, or how likely it is that a proposition is true. The probability of an event is a
number between 0 and 1, where 0 denotes impossibility and 1 denotes certainty. Probability is
widely applied in fields such as mathematics, statistics, economics, gambling, science
(particularly physics), artificial intelligence, machine learning, computer science, and philosophy.
Statistics is the study of several disciplines, including data analysis, interpretation, presentation,
and organization. It plays a critical part in the research process by providing analytically meaningful
statistics that help analysts obtain the most accurate results for problems arising in social activities.
To sum up, Probability and Statistics is becoming increasingly significant in modern life,
especially for students majoring in the natural sciences, technology, and economics.

1.1.2. R-studio
R is a programming language and environment that is widely used in statistical computing, data
analysis, and scientific research. It is a popular language for data collection, cleaning, analysis,
graphing, and visualization.
R is, in fact, the next-generation successor of the S language, which allows users, including
engineering and technology students, to compute with and manipulate data. As a language, R can
be used to develop specialized software for a particular computational problem.

1.1.3. Our problems
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney
Diseases. The objective of the dataset is to diagnostically predict whether a patient has diabetes,
based on certain diagnostic measurements included in the dataset. Several constraints were placed
on the selection of these instances from a larger database. In particular, all patients here are females
at least 21 years old of Pima Indian heritage.
Information about the dataset attributes:
• Pregnancies: the number of pregnancies
• Glucose: the glucose level in blood
• Blood Pressure: the blood pressure measurement
• Skin Thickness: the thickness of the skin
• Insulin: the insulin level in blood

• BMI: the body mass index
• Diabetes Pedigree Function: a score of the likelihood of diabetes based on family history
• Age: the age in years
• Outcome: the final result (1 = has diabetes, 0 = does not have diabetes)

Implementation steps:
• Import data: diabetes.csv
• Data cleaning: NA (missing data)
• Data visualization
o Convert the variables (if necessary)
o Descriptive statistics: using sample statistics and graphs
• Logistic regression model: using a suitable logistic regression model to evaluate the factors
affecting diabetes

1.2. Theoretical basis
Logistic regression (often referred to simply as binomial logistic regression) is used to predict the
probability that an observation falls into one of two categories of the dependent variable, based on
one or more independent variables that may be continuous or categorical. If, instead, the dependent
variable is a count, the statistical method to consider is Poisson regression; and if the dependent
variable has more than two categories, multinomial logistic regression should be used. For
example, binomial logistic regression can be used to understand whether exam performance can
be predicted from revision time and test anxiety, where the dependent variable "exam performance"
is dichotomous ("pass" or "fail") and there are two independent variables, "revision time" and "test
anxiety".
Logistic regression model
Logistic regression models are used to predict a categorical variable from one or more continuous
or categorical independent variables. The dependent variable can be binary, ordinal or multinomial.
The independent variables can be interval/scale, dichotomous, discrete, or a mixture. The logistic
regression equation (in the case of a binary dependent variable) is:
P(Y_i = 1) = e^(β0 + β1x_1i + β2x_2i + ⋯ + βk x_ki) / (1 + e^(β0 + β1x_1i + β2x_2i + ⋯ + βk x_ki))


Where:
- P is the probability of observing case i with outcome Y = 1;
- e is Euler's constant, approximately 2.71828;
- the β are the regression coefficients corresponding to the observed variables.
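As a quick numeric check of this formula, here is a minimal sketch (in Python, for verifiability; the report's own analysis uses R, and the coefficient values below are invented purely for illustration):

```python
import math

def logistic_p(beta0, betas, xs):
    """P(Y=1) = e^z / (1 + e^z), where z = beta0 + sum of beta_j * x_j."""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical coefficients for two predictors (say, Glucose and BMI); values invented
p = logistic_p(-5.0, [0.03, 0.05], [120, 32])
print(round(p, 3))  # 0.55 -- always strictly between 0 and 1
```

Whatever the coefficients, the output is always a valid probability between 0 and 1, which is exactly why this functional form is used.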


We often use regression models to estimate the effect of the X variables on the odds that Y = 1.
Effects in logistic regression
For estimation and prediction purposes, probabilities are severely limited. First, they are bounded
to the range 0 to 1, so if the real effect of a variable X would push the predicted outcome beyond 1,
interpretation becomes problematic. Second, a probability cannot be negative, so if the effect of an
independent variable on Y is negative, interpreting a regression coefficient on the probability scale
is meaningless; on that scale the fitted value can only be positive.
To solve these two problems, we take a two-step approach involving two changes of variable.
First, we convert the probability P into odds (O):

O = P / (1 − P) = (probability of the event happening) / (probability of the event not happening)

That is, the odds that an event will occur is the ratio of the number of times the event is expected
to happen to the number of times it is expected not to happen. There is a direct relationship between
Odds(Y = 1) and the probability that Y = 1:

P = O / (1 + O), where O is the odds and P is the probability.
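The two conversions, from probability to odds and back, can be sketched and round-tripped numerically (a Python illustration; the report's analysis itself uses R):

```python
def odds(p):
    """O = P / (1 - P): odds of the event happening versus not happening."""
    return p / (1 - p)

def prob(o):
    """P = O / (1 + O): recover the probability from the odds."""
    return o / (1 + o)

p = 0.8
o = odds(p)
print(round(o, 6))        # 4.0 -- the event is four times as likely to happen as not
print(round(prob(o), 6))  # 0.8 -- the round trip recovers the original probability
```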

Thus, given that odds can grow without bound, working with odds instead of probabilities allows
the regression coefficients to take any value.
The next step solves the second problem by expanding the relationship between odds and
probability slightly.
Algebraically, we can restate the Odds (O) formula above in terms of the logarithm of Odds (Y=1):
𝑙𝑜𝑔𝑒 [𝑂𝑑𝑑(𝑌𝑖 = 1)] = 𝛽0 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + ⋯ + 𝛽𝑘 𝑥𝑘𝑖
To see how this works, consider a random case in the population and code the dependent variable
Y as 1 or 0 (for example, in the 2008 US election: 1 = vote for Obama, 0 = vote for McCain).
Suppose P(Y = 1) = 0.218, so 1 − P = 0.782. We calculate the odds as Odds = 0.218/0.782 = 0.279.
This value only gives us the odds; to allow the regression coefficients to point in either direction,
we take the natural logarithm (log_e, written ln) of the odds: ln 0.279 = −1.276. The log-odds of
voting for Obama is therefore −1.276, a negative number, whereas stopping at probabilities or odds
would always give a positive value.
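The arithmetic in this example can be verified directly (a Python sketch using the numbers quoted in the paragraph above):

```python
import math

p = 0.218                  # P(Y=1): probability of voting for Obama, from the example
o = p / (1 - p)            # odds = 0.218 / 0.782
log_odds = math.log(o)     # natural logarithm (ln) of the odds
print(round(o, 3))         # 0.279
print(round(log_odds, 3))  # -1.277, matching the -1.276 quoted up to rounding
```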


Second, working on the probability scale tends to underestimate the true effect of the covariates
involved. The main advantage of the log-odds is that the coefficients are unconstrained: they can
be negative as well as positive, ranging from negative infinity to positive infinity. Stated this way,
the right-hand side of the log-odds equation looks exactly like multiple regression. The left-hand
side, however, is not the score of Y; it is the logarithm of Odds(Y = 1). This means that each
one-unit change in X shifts the log-odds of Y by β.
Estimation of a logistic regression model with maximum likelihood
Because logistic regression operates on a categorical dependent variable, the ordinary least squares
(OLS) method cannot be used (it assumes a normally distributed dependent variable). Therefore, a
more general estimator is used to find a good fit for the parameters: maximum likelihood
estimation. Maximum likelihood is an iterative estimation technique that selects the parameter
estimates which maximize the likelihood of observing the sample dataset. In logistic regression,
maximum likelihood selects the coefficient estimates that maximize the logarithm of the
probability of observing the particular set of values of the dependent variable in the sample for a
given set of X values.
Because logistic regression uses maximum likelihood, the coefficient of determination (R²) cannot
be estimated directly. Thus, we face two questions when interpreting logistic regression: First,
how do we measure the goodness of fit – a general null hypothesis? Second, how do we estimate
the partial effect of each variable X?
Statistical inference and null hypothesis
First question: how do we measure the goodness of fit – a general null hypothesis? The statistical
inference, together with the null hypothesis, proceeds in the following steps:
• The first step in interpreting the regression is to evaluate the global null hypothesis that the
independent variables have no relationship with Y. In OLS regression this corresponds to testing
whether R² is 0 in the population using an F-test. Logistic regression instead uses maximum
likelihood (not OLS): the null hypothesis H0 is β0 = β1 = β2 = 0, and we measure the size of the
residuals from this null model with a log-likelihood statistic.
• We then estimate the model again, assuming that the null hypothesis is false, finding the
maximum likelihood values of the coefficients β in the sample. Again, we measure the size of the
residuals from this model with a log-likelihood statistic.
• Finally, we compare the two statistics by computing a test statistic: −2[ln(L_null) − ln(L_model)]
This statistic tells us how much residual (prediction error) is reduced by using the X variables.
The null hypothesis says the reduction is 0; if the statistic is large enough (in a chi-square

test with df = number of independent variables), we reject the null hypothesis and conclude
that at least one independent variable has an effect on the log-odds.
SPSS also reports R² statistics to help evaluate the strength of association, but these are pseudo-R²
values and should be interpreted with caution, because logistic regression does not use R² in the
way linear regression does.
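The likelihood-ratio test just described can be sketched numerically (a Python sketch; the two log-likelihood values are invented for illustration, and the 5% chi-square critical value for df = 8 is taken from standard tables):

```python
# Hypothetical log-likelihoods: the null model and a fitted model with 8 predictors
ln_L_null = -496.0
ln_L_model = -470.0

# Test statistic: -2[ln(L_null) - ln(L_model)]
lr_stat = -2 * (ln_L_null - ln_L_model)
print(lr_stat)  # 52.0

# Chi-square critical value at the 5% level with df = 8 (from standard tables)
critical_value = 15.51
print(lr_stat > critical_value)  # True: reject the global null hypothesis
```

Since the fitted model improves the log-likelihood substantially, the statistic far exceeds the critical value and we would conclude that at least one predictor matters.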
Second question: how do we estimate the partial effect of each variable X? When the general null
hypothesis is rejected, we evaluate the partial effects of the predictors.
As in multiple linear regression, this means testing a null hypothesis for each independent variable
included in the equation: that its regression coefficient is zero, i.e. that it has no effect on the
log-odds.
Each coefficient estimate B has a standard error – the extent to which, on average, we would
expect B to vary from one sample to another by chance. To test the significance of B, a test
statistic is calculated (not a t-test, but a Wald chi-squared statistic) with 1 df (degree of freedom).
It should be remembered that the coefficient B expresses the effect of a one-unit change in X on
the log-odds. For example, if the effect of education is positive, then as education increases, the
log-odds also increase.
The Exp(B) value of an independent variable X describes the change in the odds of the event
occurring for a one-unit change in that variable, holding all other independent variables constant.
It indicates that when X increases by one, the odds of the "yes" event are multiplied by Exp(B),
that is, e raised to the power B (for example, Exp(B) = 1.05 means a 5% increase in the odds).
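A tiny numeric illustration of Exp(B) (Python sketch; B is an invented coefficient, chosen here so that e^B equals 1.05 exactly):

```python
import math

B = math.log(1.05)           # invented coefficient: chosen so that exp(B) = 1.05
odds_ratio = math.exp(B)
print(round(odds_ratio, 2))  # 1.05: a one-unit increase in X multiplies the odds by 1.05,
                             # i.e. raises the odds of the "yes" event by 5%

# A two-unit increase multiplies the odds by exp(2B) = 1.05 squared
print(round(math.exp(2 * B), 4))  # 1.1025
```

Note that the effect is multiplicative on the odds scale: each additional unit of X multiplies the odds by the same factor rather than adding a fixed amount.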
Optimal model selection
One of the difficult problems in multivariable logistic regression analysis is choosing a model that
adequately describes the data. In a study with a dependent variable y and 3 independent variables
x1, x2 and x3, we can fit the following models to predict y: y = f(x1), y = f(x2), y = f(x3),
y = f(x1, x2), y = f(x1, x3), y = f(x2, x3), and y = f(x1, x2, x3), where f is a function. In general,
with k independent variables x1, x2, x3, . . . , xk, there are many (on the order of 2^k) possible
models for predicting y.
An optimal model must meet the following three criteria:
• Simple
• Adequate
• Practically meaningful
The simplicity criterion favors a model with few independent variables, because too many
variables make interpretation difficult and sometimes impractical. By analogy, if we spend 50,000
VND to buy 500 pages of a book, that is better than spending 60,000 VND for the same number

of pages. Similarly, if a model with 3 independent variables describes the data as well as a model
with 5 independent variables, the first model is chosen. A simple model is, in short, an economical
one (in English, a parsimonious model).
The adequacy criterion means that the model must describe the data satisfactorily, i.e. it must
predict values close (or as close as possible) to the actually observed values of the dependent
variable y. If the observed value of y is 10, then a model that predicts 9 must be considered more
adequate than one that predicts 6.
The criterion of "practical significance" means that the model must be supported by theory or
have biological significance (if it is biological research), clinical significance (if it is a clinical
study), and so on. It is possible that phone numbers are somehow correlated with fracture rates,
but such a model obviously makes no sense. This is an important criterion, because if a statistical
analysis produces a model that is mathematically meaningful but has no practical significance,
then the model is just a numbers game with no real scientific value.
The third criterion (of practical significance) belongs to the theoretical realm, and we will not
discuss it here.
We will discuss the simplicity and adequacy criteria. An important and useful metric for deciding
on a simple and adequate model is the Akaike Information Criterion (AIC).
The formula for calculating the AIC value:
AIC = −2 × log(likelihood) + 2 × k = 2[k − log(likelihood)]
A simple and adequate model should have an AIC value as low as possible, with its independent
variables statistically significant. So the problem of finding a simple and adequate model is really
one of finding the model (or models) with the lowest, or near lowest, AIC value.
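A sketch of how AIC would be used to compare two candidate models (Python, for illustration; the log-likelihood values are invented, and in the report itself this comparison is done in R):

```python
def aic(log_likelihood, k):
    """AIC = -2*log(likelihood) + 2*k = 2*(k - log(likelihood))."""
    return 2 * (k - log_likelihood)

# Hypothetical fits: model A keeps 8 predictors, model B keeps only 6 (invented values)
aic_a = aic(-235.0, 8)  # 2 * (8 - (-235.0)) = 486.0
aic_b = aic(-236.0, 6)  # 2 * (6 - (-236.0)) = 484.0

# Model B fits slightly worse but is simpler, and its AIC is lower, so it is preferred
print(aic_a, aic_b)     # 486.0 484.0
print(aic_b < aic_a)    # True
```

This is exactly the trade-off the two criteria describe: AIC penalizes each extra parameter by 2, so a simpler model wins unless the extra variables improve the fit enough.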


2. DATA CORRECTION
2.1. Import data
Read the file "diabetes.csv" and assign it to the name diabetes:
diabetes <- read.csv("~/Desktop/diabetes.csv")

head(diabetes)

Figure 1: R code and results after reading data

2.2. Data cleaning
Check for missing data in the file:
apply(is.na(diabetes), 2, which)

Figure 2: R code and results when checking missing data in file "diabetes"

Comment: We see that the file "diabetes" contains no missing data to be processed.


2.3. Data clarification
Calculate descriptive statistics for the variables
For the continuous variables "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
"Insulin", "BMI", "DiabetesPedigreeFunction", and "Age", descriptive statistics are computed and
the results are output in tabular form.
mean <- apply(diabetes[,1:8],2,mean)
sd <- apply(diabetes[,1:8],2,sd)
min <- apply(diabetes[,1:8],2,min)
max <- apply(diabetes[,1:8],2,max)
Q1<- apply(diabetes[,1:8],2,quantile,probs=0.25)
median <- apply(diabetes[,1:8],2,median)

Q3<- apply(diabetes[,1:8],2,quantile,probs=0.75)
t(data.frame(mean,sd,min,max,Q1,median,Q3))

Figure 3: R code and results when performing descriptive statistics

Make a statistical table for each categorical variable:
For the categorical variable "Outcome", we tabulate the counts.
table(diabetes$Outcome)

Figure 4: R code and results when performing quantitative statistics for the variable "Outcome"


Comment:
• There are 500 survey participants who do not have diabetes.
• There are 268 survey participants who have diabetes.
Draw a histogram showing the distribution of quantitative variables
• Pregnancies and Glucose:
par(mfrow = c(1,2))
hist(diabetes$Pregnancies, xlab="Pregnancies", main="Histogram of Pregnancies", col="pink", label=T, ylim=c(0,400))
hist(diabetes$Glucose, xlab="Glucose", main="Histogram of Glucose", col="pink", label=T, ylim=c(0,250))

Figure 5: The result when plotting the histogram of the variable ”Pregnancies” and ”Glucose”

Comment:

• From the graph of the variable "Pregnancies", we can see that the number of pregnancies is
concentrated mostly in the range of 0-5, with the highest bar at 0-2 (349 people) and the lowest in
the range 10-15. The distribution is skewed to the right (a long right tail).
• The graph does not follow a normal distribution; the values from 0-2 are heavily concentrated,
which can adversely affect the logistic regression model. From the graph of the variable "Glucose",
we can see that glucose levels are concentrated between 80 and 160 mg/dL, the highest

at 100 - 120 mg/dL, and the lowest at the range of 0-60 mg/dL. There is an anomaly (possibly
extraneous because it is unlikely) at 0 - 20 mg/dL. Besides, the graph has the relative shape of the
normal distribution.
• Blood Pressure and Skin Thickness:
par(mfrow = c(1,2))
hist(diabetes$BloodPressure, xlab="BloodPressure", main="Histogram of BloodPressure", col="pink", label=T, ylim=c(0,250))
hist(diabetes$SkinThickness, xlab="SkinThickness", main="Histogram of SkinThickness", col="pink", label=T, ylim=c(0,250))

Figure 6: The result when plotting the histogram of the variable ”Blood Pressure” and ”Skin Thickness”

Comment:
• Based on the graph of the variable "Blood Pressure", we find that blood pressure values are
mostly concentrated between 50 and 90 mmHg, the highest at 70-80 mmHg and the lowest at
10-40 and 110-130 mmHg. The graph has roughly the shape of a normal distribution. However,
there is an abnormality: the number of people with blood pressure in the range 0-10 mmHg is quite
high (35 people), which likely indicates miscoded or missing values.
• Based on the graph of the variable "Skin Thickness", we find that skin thickness values are
highly concentrated at 0-50 mm, the highest at 0-10 mm and the lowest at 50-100 mm. The graph
does not follow a normal distribution.

• Insulin and BMI:
par(mfrow = c(1,2))
hist(diabetes$Insulin, xlab="Insulin", main="Histogram of Insulin", col="pink", label=T, ylim=c(0,600))
hist(diabetes$BMI, xlab="BMI", main="Histogram of BMI", col="pink", label=T, ylim=c(0,250))

Figure 7: The result when plotting the histogram of the variable ”Insulin” and ”BMI”

Comment:
• Based on the graph of the variable "Insulin", we find that insulin values are concentrated mainly
at 0-200 mu U/ml, the highest at 0-100 mu U/ml and the lowest at 300-900 mu U/ml. The
distribution is skewed to the right.
• Based on the graph of the variable "BMI", we can see that BMI (body mass index) values are
strongly concentrated at 20-40 kg/m2, the highest at 30-35 kg/m2 and the lowest at 5-15 and
55-70 kg/m2. The graph has roughly the shape of a normal distribution. Besides, there is an
anomaly (most likely erroneous data, since such values are physiologically implausible) at
0-10 kg/m2.


• Diabetes Pedigree Function and Age:
par(mfrow = c(1,2))
hist(diabetes$DiabetesPedigreeFunction, xlab="DiabetesPedigreeFunction", main="Histogram of DiabetesPedigreeFunction", col="pink", label=T, ylim=c(0,300))
hist(diabetes$Age, xlab="Age", main="Histogram of Age", col="pink", label=T, ylim=c(0,300))

Figure 8: The result when plotting the histograms of the variables "Diabetes Pedigree Function" and "Age"

Comment:
• From the graph of the variable "Diabetes Pedigree Function", we can see that the values of the
diabetes pedigree are concentrated mainly between 0 and 1, the highest at 0.2-0.4 and the lowest
in the range 1.5-2.5. The graph does not follow a normal distribution; the values from 0.2-0.4 are
heavily concentrated.
• From the graph of the variable "Age", we can see that ages are highly concentrated from 20 to
45 years, the highest at 20-30, and the lowest in the range 70-80. The graph does not follow a
normal distribution; the values from 20-30 are heavily concentrated.


Plot a histogram showing the distribution of the number of pregnancies of people with/without
diabetes:
library(ggplot2)
library(plyr)

mu_Pregnancies <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(Pregnancies))
ggplot(diabetes, aes(x=Pregnancies, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_Pregnancies, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of Pregnancies for diabetes", x="Pregnancies", y="Frequency") +
  theme_classic()

Figure 9: R code

Figure 10: Histogram showing the distribution of the number of pregnancies for people with and without diabetes

Comment: The average number of pregnancies of people with diabetes is higher than that of
people without diabetes. It can be said that people with more pregnancies have a
higher risk of diabetes. Besides, because the two mean lines differ, this factor can help identify
diabetes.
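The visual gap between the two dashed mean lines can also be checked numerically with a two-sample t-test. A minimal sketch, using a small hypothetical sample in place of the full `diabetes` data frame:

```r
# Hypothetical mini-sample; in the project, the full diabetes data frame is used
diabetes <- data.frame(
  Pregnancies = c(1, 2, 0, 3, 1, 6, 8, 5, 7, 9),
  Outcome     = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Welch two-sample t-test: does the mean number of pregnancies
# differ between the non-diabetic (0) and diabetic (1) groups?
t_preg <- t.test(Pregnancies ~ Outcome, data = diabetes)
t_preg$estimate  # the two group means
t_preg$p.value   # a small p-value supports a real difference in means
```

A small p-value would support the visual impression that the group means genuinely differ.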
Plot a histogram showing the distribution of skin thickness of people with/without diabetes:
library(ggplot2)
library(plyr)

mu_SkinThickness <- ddply(diabetes, "Outcome", summarise, grp.mean = mean(SkinThickness))
ggplot(diabetes, aes(x = SkinThickness, color = as.factor(Outcome), fill = as.factor(Outcome))) +
  geom_histogram(position = "identity", alpha = 0.5) +
  geom_vline(data = mu_SkinThickness,
             aes(xintercept = grp.mean, color = as.factor(Outcome)),
             linetype = "dashed") +
  scale_color_manual(values = c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values = c("steelblue1", "brown1", "#56B4E9")) +
  labs(title = "Histogram of SkinThickness for diabetes",
       x = "SkinThickness", y = "Frequency") +
  theme_classic()

Figure 11: R code

Figure 12: Histogram showing the distribution of skin thickness for people with and without diabetes

Comment: The average skin thickness of people with diabetes is higher than that of people
without diabetes. In general, the frequency distributions of people with and without the disease are
comparable. Therefore, measuring skin thickness alone does not predict the probability of a person
having diabetes.
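The claim that the two frequency distributions are comparable can be assessed more formally, for example with a two-sample Kolmogorov–Smirnov test. A minimal sketch, using hypothetical samples in place of the real skin-thickness measurements:

```r
# Hypothetical skin-thickness samples for the two outcome groups
skin_no_diab <- c(20, 25, 30, 22, 28, 26, 24, 29)
skin_diab    <- c(21.5, 27.5, 31.5, 23.5, 29.5, 25.5, 26.5, 30.5)

# Two-sample KS test compares the two empirical distributions;
# a large p-value gives no evidence that the distributions differ
ks_res <- ks.test(skin_no_diab, skin_diab)
ks_res$p.value
```

When the p-value is large, the two distributions overlap too much for the variable to separate the groups well, matching the comment above.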
Plot a histogram showing the distribution of glucose level of people with/without diabetes:
library(ggplot2)
library(plyr)

mu_Glucose <- ddply(diabetes, "Outcome", summarise, grp.mean = mean(Glucose))
ggplot(diabetes, aes(x = Glucose, color = as.factor(Outcome), fill = as.factor(Outcome))) +
  geom_histogram(position = "identity", alpha = 0.5) +
  geom_vline(data = mu_Glucose,
             aes(xintercept = grp.mean, color = as.factor(Outcome)),
             linetype = "dashed") +
  scale_color_manual(values = c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values = c("steelblue1", "brown1", "#56B4E9")) +
  labs(title = "Histogram of Glucose for diabetes",
       x = "Glucose", y = "Frequency") +
  theme_classic()

Figure 13: R code

Figure 14: Histogram showing the distribution of glucose level for people with and without diabetes


Comment: The average glucose level of people with diabetes is higher than that of people without
diabetes. Because the two mean lines clearly differ, this factor can help determine diabetes.


Plot a histogram showing the distribution of blood pressure of people with/without diabetes:
library(ggplot2)
library(plyr)

mu_BloodPressure <- ddply(diabetes, "Outcome", summarise, grp.mean = mean(BloodPressure))
ggplot(diabetes, aes(x = BloodPressure, color = as.factor(Outcome), fill = as.factor(Outcome))) +
  geom_histogram(position = "identity", alpha = 0.5) +
  geom_vline(data = mu_BloodPressure,
             aes(xintercept = grp.mean, color = as.factor(Outcome)),
             linetype = "dashed") +
  scale_color_manual(values = c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values = c("steelblue1", "brown1", "#56B4E9")) +
  labs(title = "Histogram of BloodPressure for diabetes",
       x = "BloodPressure", y = "Frequency") +
  theme_classic()

Figure 15: R code

Figure 16: Histogram showing the distribution of blood pressure for people with and without diabetes

Comment: The average blood pressure of people with diabetes is slightly higher than that of
people without diabetes. Because the two mean lines are almost the same, this factor is not able to
determine diabetes.


Plot a histogram showing the distribution of insulin level of people with/without diabetes:
library(ggplot2)
library(plyr)

mu_Insulin <- ddply(diabetes, "Outcome", summarise, grp.mean = mean(Insulin))
ggplot(diabetes, aes(x = Insulin, color = as.factor(Outcome), fill = as.factor(Outcome))) +
  geom_histogram(position = "identity", alpha = 0.5) +
  geom_vline(data = mu_Insulin,
             aes(xintercept = grp.mean, color = as.factor(Outcome)),
             linetype = "dashed") +
  scale_color_manual(values = c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values = c("steelblue1", "brown1", "#56B4E9")) +
  labs(title = "Histogram of Insulin for diabetes",
       x = "Insulin", y = "Frequency") +
  theme_classic()

Figure 17: R code

Figure 18: Histogram showing the distribution of insulin level for people with and without diabetes

Comment: The average insulin level of people with diabetes is higher than that of people without
diabetes. Because the two mean lines differ, this factor can help determine diabetes.


Plot a histogram showing the distribution of Body mass index (BMI) of people with/without
diabetes:
library(ggplot2)
library(plyr)

mu_BMI <- ddply(diabetes, "Outcome", summarise, grp.mean = mean(BMI))
ggplot(diabetes, aes(x = BMI, color = as.factor(Outcome), fill = as.factor(Outcome))) +
  geom_histogram(position = "identity", alpha = 0.5) +
  geom_vline(data = mu_BMI,
             aes(xintercept = grp.mean, color = as.factor(Outcome)),
             linetype = "dashed") +
  scale_color_manual(values = c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values = c("steelblue1", "brown1", "#56B4E9")) +
  labs(title = "Histogram of BMI for diabetes",
       x = "BMI", y = "Frequency") +
  theme_classic()

Figure 19: R code

Figure 20: Histogram showing the distribution of BMI (body mass index) for people with and without diabetes

Comment: The average BMI of people with diabetes is higher than that of people without
diabetes. Because the two mean lines differ, this factor can help determine diabetes.


Plot a histogram showing the distribution of diabetes pedigree function of people with/without
diabetes:
library(ggplot2)
library(plyr)

mu_DiabetesPedigreeFunction <- ddply(diabetes, "Outcome", summarise,
                                     grp.mean = mean(DiabetesPedigreeFunction))
ggplot(diabetes, aes(x = DiabetesPedigreeFunction,
                     color = as.factor(Outcome), fill = as.factor(Outcome))) +
  geom_histogram(position = "identity", alpha = 0.5) +
  geom_vline(data = mu_DiabetesPedigreeFunction,
             aes(xintercept = grp.mean, color = as.factor(Outcome)),
             linetype = "dashed") +
  scale_color_manual(values = c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values = c("steelblue1", "brown1", "#56B4E9")) +
  labs(title = "Histogram of DiabetesPedigreeFunction for diabetes",
       x = "DiabetesPedigreeFunction", y = "Frequency") +
  theme_classic()


Figure 21: R code

Figure 22: Histogram showing the distribution of diabetes pedigree function for people with and without diabetes

Comment: The average diabetes pedigree function of people with diabetes is higher than that of
people without diabetes. Because the two mean lines differ, this factor can help determine
diabetes.

Plot a histogram showing the distribution of age of people with/without diabetes:
library(ggplot2)
library(plyr)

mu_Age <- ddply(diabetes, "Outcome", summarise, grp.mean = mean(Age))
ggplot(diabetes, aes(x = Age, color = as.factor(Outcome), fill = as.factor(Outcome))) +
  geom_histogram(position = "identity", alpha = 0.5) +
  geom_vline(data = mu_Age,
             aes(xintercept = grp.mean, color = as.factor(Outcome)),
             linetype = "dashed") +
  scale_color_manual(values = c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values = c("steelblue1", "brown1", "#56B4E9")) +
  labs(title = "Histogram of Age for diabetes",
       x = "Age", y = "Frequency") +
  theme_classic()

Figure 23: R code

Figure 24: Histogram showing the distribution of age for people with and without diabetes

Comment: The average age of people with diabetes is higher than that of people without diabetes.
Because the two mean lines differ, this factor can help determine diabetes.


2.4. Logistic Regression
Building a logistic regression model to predict diabetes using the glm command (Generalized Linear Model).
The generalized linear model (GLM) extends multiple regression analysis to response variables that are not
normally distributed; here the binary variable Outcome is modeled with the binomial family (logit link).
model_1 <- glm(Outcome ~ Pregnancies + SkinThickness + Glucose + BloodPressure +
                 Insulin + BMI + DiabetesPedigreeFunction + Age,
               data = diabetes, family = binomial)
summary(model_1)

Figure 25: R code and results

Therefore, the optimal logistic regression model has the form:
ln(odds) = ln(p / (1 − p)) = β₀ + β₁·Pregnancies + β₂·SkinThickness + β₃·Glucose + β₄·BloodPressure
+ β₅·Insulin + β₆·BMI + β₇·DiabetesPedigreeFunction + β₈·Age + ε

From the analysis results, we get:
β̂₀ = −8.4046964; β̂₁ = 0.1231823; β̂₂ = 0.0006190; β̂₃ = 0.0351637; β̂₄ = −0.0132955;
β̂₅ = −0.0011917; β̂₆ = 0.0897010; β̂₇ = 0.9451797; β̂₈ = 0.0148690
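These estimates can be used directly to compute a predicted probability via the inverse-logit transformation. A sketch for one illustrative patient profile (the predictor values below are hypothetical, chosen only for the example):

```r
# Fitted coefficients reported above, in the order:
# intercept, Pregnancies, SkinThickness, Glucose, BloodPressure,
# Insulin, BMI, DiabetesPedigreeFunction, Age
beta <- c(-8.4046964, 0.1231823, 0.0006190, 0.0351637, -0.0132955,
          -0.0011917, 0.0897010, 0.9451797, 0.0148690)

# Hypothetical patient: 2 pregnancies, skin thickness 29, glucose 120,
# blood pressure 70, insulin 80, BMI 32.0, pedigree 0.5, age 33
x <- c(1, 2, 29, 120, 70, 80, 32.0, 0.5, 33)

log_odds <- sum(beta * x)          # linear predictor ln(p / (1 - p))
p_hat <- 1 / (1 + exp(-log_odds))  # inverse logit gives the probability
p_hat
```

The inverse logit always maps the linear predictor into (0, 1), so p_hat can be read directly as an estimated probability of diabetes for that profile.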


Thus, the estimated regression line is given by the following equation:

ln(p / (1 − p)) = −8.4046964 + 0.1231823·Pregnancies + 0.0006190·SkinThickness + 0.0351637·Glucose
− 0.0132955·BloodPressure − 0.0011917·Insulin + 0.0897010·BMI + 0.9451797·DiabetesPedigreeFunction
+ 0.0148690·Age