
Ho Chi Minh City University of Technology
Faculty of Computer Science and Engineering

Probability & Statistics
Project Report

A Poultry Dataset Analysis
Lecturer: Prof. Nguyen Tien Dung
Team members: Nguyen Nho Gia Phuc – 2052214
Tran Pham Minh Hung – 2053067
Le Khanh Duy – 2052003
Pham Nguyen Nam Khoa – 2052009
Nguyen Minh Khoa – 2052538

Ho Chi Minh City, May 25, 2022



Contents

1 Introduction
  1.1 Data description
  1.2 Steps to analyze the given dataset
2 Data Processing and Visualization
3 Data analysis
  3.1 Population average weight test
  3.2 Evaluation of feed effect on chicks' weight
4 Normality test
  4.1 Q-Q plots
  4.2 Shapiro-Wilk test



1 Introduction
1.1 Data description

The dataset chicken_feed.csv provides weight measurements and the feedstuff fed to a total of 71 chickens.
The main variables in the dataset are as follows:
• weight: the weight measurement of a chicken. An early look reveals some entries with N/A values.
• feed: the feedstuff fed to a chicken. There are 6 items in total: horsebean, linseed, soybean, sunflower, meatmeal, and casein.
The data entries are grouped by feed item in the order listed above. Further description is given in Section 2 and Section 3.

1.2 Steps to analyze the given dataset

• Data import – chicken_feed.csv.
• Data cleaning: clearing out entries with N/A information.
• Summarization of data – before and after cleaning.

• Data visualization.
• t-test to compare the chosen data with results of other research.
• One-way ANOVA to assess the effect of the feedstuff on the chickens' weight.
• Normality test using Q-Q plots and Shapiro-Wilk test.
The function setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) is included at the top of our R script for better code portability. When the dataset and the script are in the same folder, calling this function sets the working directory to wherever the R script is, avoiding the need to hard-code a dataset path that is specific to the computer running the analysis. For this solution to work, RStudio is mandatory, since one of its APIs is called.
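For running the script outside RStudio, a guarded fallback can be added. The sketch below is our own suggestion and not part of the original script; it only calls the API when it is actually available:

# use the RStudio API when available; otherwise keep the current working directory
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
}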

2 Data Processing and Visualization
Null data can affect statistical calculations. Data processing aims to clear out null entries that provide little to no use for our analysis.
The way the chicken_feed.csv dataset was collected makes it convenient to identify such missing pieces: in the weight column, some entries are marked N/A (Not Available). With the help of R, we can remove them. First, the dataset is imported into RStudio.
# load data
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
chickFeedData <- read.csv("chicken_feed.csv")
summary(chickFeedData)


Listing 1: Importing chicken_feed.csv and first-step summary.
The summary() function summarizes the key information of our dataset. We need not pay attention to the first column of the output, because it merely indexes the data entries and does not contain any values we aim to analyze.
> summary(chickFeedData)
       X            weight          feed
 Min.   : 1.0   Min.   :108.0   Length:71
 1st Qu.:18.5   1st Qu.:206.0   Class :character
 Median :36.0   Median :260.0   Mode  :character
 Mean   :36.0   Mean   :263.6
 3rd Qu.:53.5   3rd Qu.:325.0
 Max.   :71.0   Max.   :423.0
                NA's   :2
Listing 2: Summary of imported dataset.
We can see that the weight column contains two NA's, so we aim to clear those entries. This is done using the na.omit() function, which returns the data frame without any rows that contain NA values. This is a simple way to purge incomplete records from an analysis.

cleanData <- na.omit(chickFeedData)
summary(cleanData)

Listing 3: Eliminating NA entries using na.omit().
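For reference, an equivalent filter can be written with complete.cases(); this is an alternative sketch, not the approach used in the report:

# keep only rows that have no missing values in any column
cleanData <- chickFeedData[complete.cases(chickFeedData), ]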
Once again, we use the summary() function to see what changes have been made to the
cleaned version of our dataset.

> summary(cleanData)
       X             weight          feed
 Min.   : 1.00   Min.   :108.0   Length:69
 1st Qu.:20.00   1st Qu.:206.0   Class :character
 Median :37.00   Median :260.0   Mode  :character
 Mean   :36.72   Mean   :263.6
 3rd Qu.:54.00   3rd Qu.:325.0
 Max.   :71.00   Max.   :423.0
Listing 4: Summary of cleaned dataset.
Notice that the NA entries have been cleared, and the number of entries has been reduced from 71 to 69 (see Length in the feed column).
It is also worthwhile to look at each feed item's statistics by subsetting the data on the feed variable and then calling summary() on each of the subsets. The source code is as follows:
horsebean <- subset(cleanData, cleanData$feed == "horsebean")
linseed <- subset(cleanData, cleanData$feed == "linseed")
soybean <- subset(cleanData, cleanData$feed == "soybean")
sunflower <- subset(cleanData, cleanData$feed == "sunflower")
meatmeal <- subset(cleanData, cleanData$feed == "meatmeal")
casein <- subset(cleanData, cleanData$feed == "casein")
summary(horsebean)
summary(linseed)
summary(soybean)
summary(sunflower)
summary(meatmeal)
summary(casein)

Listing 5: Data subsetting.
> summary(horsebean)
       X              weight          feed
 Min.   : 1.000   Min.   :108.0   Length:9
 1st Qu.: 3.000   1st Qu.:136.0   Class :character
 Median : 6.000   Median :143.0   Mode  :character
 Mean   : 5.556   Mean   :153.9
 3rd Qu.: 8.000   3rd Qu.:168.0
 Max.   :10.000   Max.   :227.0

> summary(linseed)
       X            weight          feed
 Min.   :11.00   Min.   :141.0   Length:11
 1st Qu.:13.50   1st Qu.:192.0   Class :character
 Median :16.00   Median :229.0   Mode  :character
 Mean   :16.45   Mean   :225.2
 3rd Qu.:19.50   3rd Qu.:258.5
 Max.   :22.00   Max.   :309.0

> summary(soybean)
       X            weight          feed
 Min.   :23.00   Min.   :158.0   Length:14
 1st Qu.:26.25   1st Qu.:206.8   Class :character
 Median :29.50   Median :248.0   Mode  :character
 Mean   :29.50   Mean   :246.4
 3rd Qu.:32.75   3rd Qu.:270.0
 Max.   :36.00   Max.   :329.0

> summary(sunflower)
       X            weight          feed
 Min.   :37.00   Min.   :226.0   Length:12
 1st Qu.:39.75   1st Qu.:312.8   Class :character
 Median :42.50   Median :328.0   Mode  :character
 Mean   :42.50   Mean   :328.9
 3rd Qu.:45.25   3rd Qu.:340.2
 Max.   :48.00   Max.   :423.0

> summary(meatmeal)
       X           weight          feed
 Min.   :49.0   Min.   :153.0   Length:11
 1st Qu.:51.5   1st Qu.:249.5   Class :character
 Median :54.0   Median :263.0   Mode  :character
 Mean   :54.0   Mean   :276.9
 3rd Qu.:56.5   3rd Qu.:320.0
 Max.   :59.0   Max.   :380.0

> summary(casein)
       X            weight          feed
 Min.   :60.00   Min.   :216.0   Length:12
 1st Qu.:62.75   1st Qu.:277.2   Class :character
 Median :65.50   Median :342.0   Mode  :character
 Mean   :65.50   Mean   :323.6
 3rd Qu.:68.25   3rd Qu.:370.8
 Max.   :71.00   Max.   :404.0

Listing 6: Information of each feed item's data.
Let us plot the proportions of feed items in a 3D pie chart with the following source code: first install and load the two libraries plotrix and colorspace, then choose a color palette and plot with a legend. The percentage of each item is calculated and used as labels via piepercent, rounded to 2 decimal places.
library(plotrix)
library(colorspace)
x <- c(length(horsebean$feed), length(linseed$feed), length(soybean$feed),
       length(sunflower$feed), length(meatmeal$feed), length(casein$feed))
labels <- c("horsebean", "linseed", "soybean", "sunflower", "meatmeal", "casein")
piepercent <- round(100 * x / sum(x), 2)
pie3D(x, mar = c(5, 5, 5, 5), height = .1, theta = pi / 3,
      radius = 1, explode = 0.1, shade = 0.6, labels = piepercent,
      main = "feed item proportions",
      col = heat_hcl(6, h = c(0, -100), l = c(75, 40), c = c(40, 80), power = 6))
legend("right", xpd = TRUE, inset = c(-.4, 0), labels, cex = 1.5,
       fill = heat_hcl(6, h = c(0, -100), l = c(75, 40), c = c(50, 80), power = 6),
       bty = "n")

Listing 7: 3D pie chart source code.
The generated plot is as follows:

Figure 1: 3D pie chart of feed item proportions.

It can be seen that chickens fed with soybean make up the largest share of the recorded figures, while those that consume horsebean are the smallest. Furthermore, there are the same number of chickens fed with sunflower and casein; the same is true for the meatmeal and linseed groups.
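The group counts behind the chart can also be read off directly with a one-line check (a small sketch on the cleaned data):

# number of chickens per feed type; matches the pie chart proportions
table(cleanData$feed)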
To aid the upcoming testing process, we next plot a box plot and a histogram of our dataset. The box plot is generated with the following command:

boxplot(cleanData$weight ~ cleanData$feed, xlab = "feed item",
        ylab = "Chicken Weight", main = "Box plot of chicken weight by feedstuff",
        col = "pink")

Listing 8: Plotting the box plot of chicken weight by feed type.

Figure 2: Box plot of chicken weight by feedstuff.

Interpreting this box plot, we can see considerable variability within the groups. We can also identify one outlier in the horsebean group and three outliers in the sunflower group (outliers are observations numerically distant from the rest of the data). Taking the box plot as a whole, there are likely differences between the groups in terms of weight, because the median line of one group's box often lies outside the box of another group, for example horsebean versus linseed, or linseed versus meatmeal.
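The outliers flagged in the plot can be listed explicitly; boxplot.stats() applies the same 1.5 × IQR whisker rule as boxplot() (a short sketch for the two groups of interest):

# weights falling outside the whiskers of each group's box
boxplot.stats(horsebean$weight)$out
boxplot.stats(sunflower$weight)$out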

Next, the histogram of our dataset is generated as follows:

hist(cleanData$weight, main = "Histogram of weight measurements",
     xlab = "Chicken weight measurements", col = "pink", ylim = c(0, 20))
abline(h = seq(0, 20, 2), col = "orange", lty = "dotted")

Listing 9: Source code to generate histogram.

Figure 3: Histogram of weight measurements.

The weight measurements form a roughly symmetric and unimodal (one-peak) distribution, with observations clustered towards the middle and fewer at the two tails. This resembles the normal distribution in a sense, though not perfectly. One can often judge whether data follow a normal distribution by looking at the histogram, but we decided it is best to carry out formal tests for this condition in Section 4. That way, we can verify normality, a crucial requirement in our tests, for the individual groups as well as for the dataset as a whole.


3 Data analysis
In Section 1, we learnt the basic categories of the chicken_feed dataset, namely weight and feed. Further investigation suggests that the original source dates back to 1948, in volume 4 of the renowned journal Biometrics (not to be mistaken for Biometrika, as in the R documentation) [7]. There we found a proper description of the dataset: it encompasses body weights of chicks from an unknown strain after a 6-week feeding trial on 6 types of protein supplements given in feed. The weight column is recorded in grams.
Given the date, duration, and subjects of the experiment, we can compare its results with similar data.

3.1 Population average weight test

Since the strain is unknown, one might be interested in testing the population mean weight of the birds. In an article published in 2014, M. J. Zuidhof and his colleagues investigated the growth of broiler chickens from 1957 to 2005 [6]. One helpful result states that the body weight of the Alberta Meat Control (AMC) strain (a genetically unmodified strain) in 1957 had a mean of 586 grams at 6 weeks. Another statistic is the body weight of the Athens-Canadian Random Bred (ACRB) strain in the same year, 1957, which stood at 508 grams at 6 weeks [2]. We take the mean of these figures, 547 grams, as the population mean to conduct a hypothesis test on the average 6-week weight of the strain in our dataset.

Testing
We first review some core statistics of the dataset.
> summary(cleanData$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  108.0   206.0   260.0   263.6   325.0   423.0
> sdall <- sd(cleanData$weight)
> print(sdall)
[1] 77.79334

It can be seen from the result that our sample mean is 263.6, with a standard deviation of 77.79334. These can be employed to perform the test. Since the chicks' weights from the two 1957 strains are roughly double ours, we suspect the population mean differs greatly from the proposed figure. The hypotheses are:

H0 : µ = 547 grams at week 6
H1 : µ ≠ 547 grams at week 6

Since the population variance is unknown, and normality is assumed (to be verified in Section 4), a T-test best suits our purpose here. The result is demonstrated below, via the function t.test():

> t.test(cleanData$weight, mu = 547)

        One Sample t-test

data:  cleanData$weight
t = -30.262, df = 68, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 547
95 percent confidence interval:
 244.9062 282.2822
sample estimates:
mean of x
 263.5942
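As a sanity check, the t statistic can be reproduced by hand from the summary statistics above (a minimal sketch reusing the sdall and cleanData objects defined earlier):

# one-sample t statistic: (sample mean - hypothesized mean) / standard error
n <- length(cleanData$weight)              # 69 observations after cleaning
tStat <- (mean(cleanData$weight) - 547) / (sdall / sqrt(n))
tStat                                      # about -30.26, matching t.test()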

Conclusion
We can see from the test result that 547 falls outside the 95% confidence interval, along with a near-zero p-value. Hence, we reject the null hypothesis and conclude that the population mean for our 42-day birds does not equal 547 grams.
As a fascinating side fact, the average weight of chickens has increased drastically over time, according to Zuidhof and Havenstein. Current breeds outweigh our strain's figure by roughly four times; their 6-week weight can now reach over 1600 grams at a bare minimum. This means no further test is needed to compare the 1948 chicks with today's monstrous poultry, as the statistic will almost certainly fall outside the accepted range. The reason for such a huge difference is that chicken genetics after World War II were constantly selected and improved, in conjunction with advancements in antibiotics and industrial feed. Figure 4 illustrates the evolution of poultry body size at market time over the last 50 years.

Figure 4: [Poultry body size at market time over the last 50 years.]


3.2 Evaluation of feed effect on chicks' weight

In this part, we investigate the effect of the given feed on the birds' weight. Seventy-one specimens were chosen and fed one of the six feed types, comprising horsebean, linseed, soybean, sunflower, meatmeal, and casein, in the chicken_feed.csv dataset. We assume the selection of specimens and their corresponding feed type was done randomly.

Test selection
If one's interest is to compare two types of feed to find the superior one, then a two-sample T-test is the way to go. However, there are 6 types of feed in total in our dataset, which necessitates a more flexible, robust technique to compare them with one another. For this purpose, a one-way ANOVA test fits better.
One-way ANOVA is used when one has collected data on one categorical independent variable and one quantitative dependent variable. The test indicates whether the dependent variable changes according to the level of the independent variable. Since we want to investigate the effect of chicken feed on weight, an ANOVA test is a good choice. Note that the test for this dataset is an unbalanced ANOVA design, since the number of specimens for each type of feed differs slightly.
As the feed types vary in protein content, we suspect that some feeds are better than others. The hypotheses may be stated as:

H0 : τ1 = τ2 = · · · = τ6 = 0
H1 : τi ≠ 0 for at least one i

where τi stands for the effect of each treatment on the chicks' weight.

Assumptions
Before proceeding further, we have to ensure the three major assumptions of one-way ANOVA hold for this test:
• Independence
There is no way to verify independence of observations. This assumption can only be satisfied by a correctly randomized experimental design. In our case, we assume the selection of specimens and their corresponding feed types was done randomly.
• Homogeneity of variance (homoscedasticity)
Homoscedasticity is a formal requirement for one-way ANOVA. The requirement is not too critical for a balanced test with equal treatment sizes. In our case, an unbalanced version is applied, leading to a greater risk of inaccurate results if the property does not hold.
• Normality
ANOVA is based on the F-statistic, which requires the dependent variable to be normally distributed in each group. We will justify the validity of normality in the next section and take this assumption for granted for the moment.

Homoscedasticity test
To test the homogeneity of variance, we may use a ratio test between the standard deviations, or a more advanced technique called Bartlett's test.
For the ratio test, we compute the ratio between two treatments' standard deviations and then make a decision based on that number. If it stays within the range [0.5, 2], we assume the two standard deviations are equal, which is the same as assuming the variances are equal. The standard deviation for each treatment is calculated as follows:
# calculate the sd
sdHorsebean <- sd(horsebean$weight)
sdLinseed <- sd(linseed$weight)
sdSoybean <- sd(soybean$weight)
sdSunflower <- sd(sunflower$weight)
sdMeatmeal <- sd(meatmeal$weight)
sdCasein <- sd(casein$weight)

> sdHorsebean
[1] 35.0765
> sdLinseed
[1] 49.55163
> sdSoybean
[1] 54.12907
> sdSunflower
[1] 48.83638
> sdMeatmeal
[1] 64.90062
> sdCasein
[1] 64.43384

As seen from the output, the highest sample standard deviation is 64.90062 and the smallest is 35.0765. Examining the ratio of these two numbers, we get 64.90062/35.0765 = 1.85025, which falls within [0.5, 2]. Therefore, our homogeneity assumption is justified.
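For completeness, the same check can be scripted instead of read off by eye (a small sketch using the standard deviation objects defined above):

# ratio of the largest to the smallest group standard deviation
groupSDs <- c(sdHorsebean, sdLinseed, sdSoybean, sdSunflower, sdMeatmeal, sdCasein)
max(groupSDs) / min(groupSDs)   # about 1.85, inside the [0.5, 2] rule of thumb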
The ratio test is just a rough technique for confirming homogeneity between variances. To evaluate this property more rigorously, we turn to Bartlett's test. Bartlett's test of homoscedasticity uses the test statistic B, which is approximately Chi-square, to compare the group variances to one another. The test operates on the null hypothesis that the categories vary to the same extent. Because the Chi-square distribution is beyond the scope of this report, we will not focus on the details of Bartlett's test; instead, we simply show how it is applied in RStudio. Below is the demonstration:


# make feed type vector
feedType <- c(cleanData$feed)
# create data frame
df <- data.frame(cleanData$weight, feedType)
# homogeneity of variance
bartlett.test(cleanData$weight ~ feedType, data = df)

> bartlett.test(cleanData$weight ~ feedType, data = df)

        Bartlett test of homogeneity of variances

data:  cleanData$weight by feedType
Bartlett's K-squared = 4.0099, df = 5, p-value = 0.548

Since the p-value is much larger than 0.05, we fail to reject the null hypothesis, which supports the homogeneity assumption, consistent with the conclusion from the ratio test.

ANOVA test
Next, we will perform the one-way ANOVA test on the treatments. This is done using
the function aov() provided by R.
# ANOVA
> chickenFeedData.aov = aov(cleanData$weight ~ feedType, data = df)
> summary(chickenFeedData.aov)
            Df Sum Sq Mean Sq F value   Pr(>F)
feedType     5 225012   45002    15.2 8.83e-10 ***
Residuals   63 186511    2960
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With the p-value scoring 8.83e-10 (much smaller than the significance level α = 0.05), we reject the null hypothesis H0, thereby supporting the alternative hypothesis H1, i.e., the feed has an effect on the chicks' weight. However, the test does not tell us which feed actually imparts stronger (or weaker) growth.
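The reported p-value can be reproduced from the F statistic and its degrees of freedom (a quick sketch using base R's F distribution):

# upper-tail probability of F = 15.2 with df1 = 5 (feed types) and df2 = 63 (residuals)
pf(15.2, df1 = 5, df2 = 63, lower.tail = FALSE)   # about 8.8e-10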


Post-hoc comparison
It remains our task to analyse statistics from the ANOVA test in order to find out which treatment yields the best effect. This process is called post-hoc comparison. It is usually achieved by computing a specific measure named LSD (least significant difference) and comparing it to the difference of each pair of treatment means.
Since this test is an unbalanced ANOVA design, the function LSD.test usually employed to calculate the LSD does not function properly. Calculating LSD without a supporting function is tedious, due to the recalculation needed for each pair of treatments we wish to compare. This can be eased by employing a single LSD formula that does not hurt the power of the method when the treatment sizes are close to each other, as shown in recent research [1]:


LSD_d = t_{α/2, df} · √(2 · MSE / n̄)

where:
• LSD_d is the LSD for unequal numbers of replications.
• n̄ is the mean number of observations per treatment.
• t_{α/2, df} is the critical value with respect to a confidence level of 100(1 − α)% (two-sided) and degrees of freedom df.
• MSE is the mean squared error.
Both df and MSE can be taken from the summary of chickenFeedData.aov. The calculation of LSD proceeds as follows:

MSE <- 2960                                  # mean squared error from the ANOVA table
tValue <- qt(0.05, 68, lower.tail = FALSE)   # critical t value
meanN <- length(cleanData$weight) / 6        # mean group size: 69 / 6 = 11.5
LSD <- tValue * sqrt(2 * MSE / meanN)
> LSD
[1] 37.83525
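As a cross-check on the grouping below, base R's Tukey HSD procedure handles unbalanced designs directly; a one-line sketch on the fitted model (our suggestion, not part of the original analysis):

# all pairwise comparisons with family-wise error control
TukeyHSD(chickenFeedData.aov)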

Once LSD is obtained, we move on to performing the comparison. To produce results quickly, we arrange the mean weights in increasing order and only consider the differences between consecutive feeds (sometimes one column apart). If the difference is larger than LSD, we place the better feed in a higher group. For example, E(linseed) − E(horsebean) = 71.3 > 37.835 = LSD, so horsebean is in group 1 and linseed in group 2. The mean weights (from Section 2) and group numbers of the treatments can be summarized as:
Feed type | Horsebean | Linseed | Soybean | Meatmeal | Casein | Sunflower
Mean      | 153.9     | 225.2   | 246.4   | 276.9    | 323.6  | 328.9
Group     | 1         | 2       | 2       | 2, 3     | 4      | 4

There is a bit of ambiguity here, as meatmeal is considered both a group 2 and a group 3 element. This is because the average chick raised on meatmeal does not deviate much in weight from those fed soybean, which falls in group 2, but is considerably heavier than the birds in the linseed category. For better interpretation, a visualization of the table might help.
[Figure: number line of treatment mean weights from 150 to 350 grams, marking horsebean, linseed, soybean, meatmeal, casein, and sunflower from left to right.]
The graph tells us that casein and sunflower have the greatest effect on chicks' weight, and their means do not differ much from each other. It also reaffirms the difficulty of categorizing meatmeal, whose distance from soybean scores just slightly less than LSD, while its distance from linseed is quite significant. Finally, horsebean yields the least effect on the growth of our birds, netting an average weight of just over 150 grams after 6 weeks, which deviates substantially from the other groups. This wide discrepancy between horsebean and sunflower might be attributable to their nutritional content. We know from the original source that all rations here share the same amount of crude protein. However, other contributions such as fat and calories account for a much larger portion in sunflower seed than in horsebean. In fact, there are 584 kcal of energy per 100 grams of sunflower seed, a figure roughly six times larger than that of horsebean, according to Wikipedia.


4 Normality test
In using statistical tools like ANOVA and the t-test to analyse the poultry feed dataset, we assumed its population follows a normal distribution. This may not be true, so a test is needed to verify the normality assumption. In this section, we present two major techniques for this purpose: Quantile-Quantile (Q-Q) plots and the Shapiro-Wilk test.

4.1 Q-Q plots

A Q-Q plot is a scatterplot generated by plotting two sets of quantiles against each other. If we can roughly fit a line through the plotted points, then both sets of quantiles can be said to come from the same distribution, provided one of them is normal.

PROPOSITION: If a Q-Q plot is approximately linear and one of the datasets is normal, then both are normally distributed.
PROOF:
Let X and Y respectively denote the random variables of the first and second datasets. Consider a common quantile α and its associated point (xi, yi).
Since the plot is almost linear, we expect yi ≈ m·xi + n for some m, n ∈ R. Then:

P(X ≤ xi) = α = P(Y ≤ yi) ≈ P(Y ≤ m·xi + n) = P((Y − n)/m ≤ xi)

From this expression, we have:

X ∼ (Y − n)/m, or equivalently Y ∼ mX + n

Without loss of generality, assume X is normally distributed. Then mX + n is also normally distributed due to linearity, and hence so is Y. ■

Demonstration
R provides a number of functions that visualize Q-Q plots, such as qqnorm() and qqplot(). We can also gauge how well the dataset approximates a normal distribution by adding a reference line via qqline(). This line differs from a linear regression fit: by default it passes through the first and third quartiles of the sample and theoretical distributions rather than minimizing the vertical distance to the points. An example of console code for horsebean is given as follows:


qqnorm(horsebean$weight, main = "horsebean", xlab = "", ylab = "")
qqline(horsebean$weight, col = "blue")
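The rest is similar: just replace horsebean with any other label name. A compact way to produce all seven plots in one window is a small loop; the sketch below, including the 2 × 4 grid layout, is our own and not part of the original script:

# draw every group's Q-Q plot, plus the combined data, in one 2 x 4 grid
groups <- list(horsebean = horsebean$weight, linseed = linseed$weight,
               soybean = soybean$weight, sunflower = sunflower$weight,
               meatmeal = meatmeal$weight, casein = casein$weight,
               total = cleanData$weight)
par(mfrow = c(2, 4))
for (name in names(groups)) {
  qqnorm(groups[[name]], main = name, xlab = "", ylab = "")
  qqline(groups[[name]], col = "blue")
}
par(mfrow = c(1, 1))   # restore the default layout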

All the resulting graphs are displayed below:

[Figure: Q-Q plots for each feed group and for the combined cleanData (labeled total); x-axes show theoretical quantiles, y-axes show sample quantiles.]

One may conclude that not much can be drawn from the visuals, because the sample sizes are too small and the lines barely fit. The only plot with a good fit is the combined cleanData one, namely total.

We thus need a quantitative method that yields an accurate evaluation of a dataset against the normal distribution even when the sample size is small. The Shapiro-Wilk test is one such method.


4.2 Shapiro-Wilk test

The Shapiro-Wilk test is a statistical test of the hypothesis that a group of values comes from a normal distribution (the mean and variance of this normal distribution need not be 0 and 1, respectively). Empirically, this test appears to have the best power among all normality tests [5].
Assume that the data are y1 ≤ y2 ≤ · · · ≤ yN ∈ R and that we want to test whether they are normally distributed. The test statistic is:

W = ( Σ_{i=1}^{N} a_i y_i )² / Σ_{i=1}^{N} (y_i − ȳ)²

wherein:
• ȳ = (1/N) Σ_{i=1}^{N} y_i is the sample mean.

• a1, a2, . . . , aN are constants generated via a complex formula beyond the scope of this text.
The null hypothesis is that the data are normally distributed. We reject it when W is significantly smaller than 1, unlike in most tests we have dealt with so far.
The basic idea behind this statistic is rooted in an analysis of variance of the aforementioned Q-Q plots: we want to measure how well the regression (trend) line fits the data. Tracking back to the proof of the Proposition in Section 4.1, if the x-dataset stands for standard normal quantiles, we can model yi as m·xi + n, with xi being the i-th quantile data point: P(X ≤ xi) = i/(N + 1). In fact, when y is distributed close to normal, (m, n) becomes (σ, µ), the population standard deviation and mean respectively. The numerator of the statistic can then be interpreted as an estimator of the squared slope σ², up to a constant, while the denominator is known to approximate (N − 1)σ². S. S. Shapiro and M. B. Wilk proved in their paper [4] that the numerator and denominator are independent of one another, along with some other properties. If the data truly approximate a normal distribution, W approaches 1; we reject the null hypothesis when W deviates substantially from 1.
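For intuition, a closely related quantity, the squared correlation between the ordered data and the expected normal quantiles (essentially the Shapiro-Francia statistic), can be computed in two lines. This is an illustrative sketch, not the exact Shapiro-Wilk computation:

# squared correlation of the Q-Q plot points, here for the casein group
y <- sort(casein$weight)
cor(y, qnorm(ppoints(length(y))))^2   # close to 1 when the data look normal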

Test result

For this project, we divided the chicken_feed dataset into 6 subsets by the feed variable. The test is performed on each of them, as well as on the combined data, to see whether the data form a normal distribution.
# shapiro-wilk test for each feed
shapiro.test(horsebean$weight)
shapiro.test(linseed$weight)
shapiro.test(soybean$weight)
shapiro.test(sunflower$weight)
shapiro.test(meatmeal$weight)
shapiro.test(casein$weight)
shapiro.test(cleanData$weight)

Listing 10: Shapiro-Wilk test in R.
After performing the test on these sub-datasets, we receive the results:
# Shapiro-Wilk normality test
data:  horsebean$weight
W = 0.93885, p-value = 0.5698

data:  linseed$weight
W = 0.98672, p-value = 0.9921

data:  soybean$weight
W = 0.9464, p-value = 0.5064

data:  sunflower$weight
W = 0.92809, p-value = 0.3603

data:  meatmeal$weight
W = 0.97914, p-value = 0.9612

data:  casein$weight
W = 0.91663, p-value = 0.2592

data:  cleanData$weight
W = 0.97785, p-value = 0.2585

The default significance level is 0.05. From the output, the p-values of all the feeding variants are larger than 0.05, implying that the distributions of the data are not significantly different from a normal distribution. In other words, we can assume normality.
Closely inspecting the result, one might notice that cleanData.W > casein.W (0.97785 > 0.91663) but cleanData.p < casein.p (0.2585 < 0.2592). This seems to contradict our intuition that the further the statistic deviates from 1, the smaller its p-value becomes. The intuition is almost correct; however, the sample size N affects the distribution of W, and here cleanData has N = 69 while casein has N = 12. As N grows, the null distribution of W clumps towards 1, as shown in Figure 5. Consequently, for large samples, W does not have to be much smaller than 1 for the test to be significant.
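The clumping of W towards 1 for larger N can be seen by simulating the null distribution (a sketch with an arbitrary seed and 10,000 replicates, mirroring the two sample sizes above):

# null distribution of W under normality for N = 12 and N = 69
set.seed(1)
w12 <- replicate(10000, shapiro.test(rnorm(12))$statistic)
w69 <- replicate(10000, shapiro.test(rnorm(69))$statistic)
quantile(w12, 0.05)   # 5% critical value for N = 12
quantile(w69, 0.05)   # 5% critical value for N = 69, noticeably closer to 1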

Figure 5: Distribution of W based on sample sizes N .


References
[1] Al-Fahham, A. A. (2018). Development of new LSD formula when numbers of observations are unequal. Open Journal of Statistics, 8(2), 258-263.
[2] Havenstein, G. B. (2006). Performance changes in poultry and livestock following 50 years of genetic selection. Lohmann Information, 41, 30-37.
[3] R color cheatsheet - National Center for Ecological Analysis and Synthesis. (n.d.). Retrieved May 11, 2022, from colorPaletteCheatsheet.pdf.
[4] Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4), 591-611.
[5] Razali, N. M., & Wah, Y. B. (2011). Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics, 2(1), 21-33.
[6] Zuidhof, M. J., Schneider, B. L., Carney, V. L., Korver, D. R., & Robinson, F. E. (2014). Growth, efficiency, and yield of commercial broilers from 1957, 1978, and 2005. Poultry Science, 93(12), 2970-2982.
[7] 60. Query. (1948). Biometrics, 4(3), 213-215.