Phan Tich Kinh Te Bang Excel.docx

<span class="text_page_counter">Trang 1</span><div class="page_container" data-page="1">

CHAPTER 1: DATA CLEANSING

Checking for missing data in a variable in Excel

We can use the COUNTBLANK function to count the number of missing values in a variable in Excel.

We see that the variable Miles has 1 missing value.

To find where the missing value is, we sort the Miles column from lowest to highest. Blank cells always end up at the bottom of the sorted range.
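As a cross-check outside Excel, the same count and sort can be sketched in Python. The values and the name `miles` below are hypothetical, not the data from the worksheet, and `None` stands in for a blank cell.

```python
# Hypothetical stand-in for the Miles column; None marks a blank cell.
miles = [12500.0, 48210.0, None, 30975.0, 2900.0]

# Rough equivalent of =COUNTBLANK(range): count the blank entries.
missing_count = sum(1 for value in miles if value is None)

# Rough equivalent of an ascending sort in Excel: blanks are placed last.
sorted_miles = sorted(miles, key=lambda value: (value is None, value or 0.0))

print(missing_count)
print(sorted_miles)
```

The sort key puts every non-blank value ahead of the blanks, which mirrors how the missing value shows up in the last rows of the sorted table.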

Identifying outliers and erroneous values in the data set

We can identify outliers in the data by using the mean and standard deviation of the quantitative variables.

In the example data shown in the figure above, we enter the formula =AVERAGE(C2:C457) in cell H3.

Then we enter the formula =STDEV.S(C2:C457) in cell H4.

We do the same for the remaining cells.

The results show that the mean and standard deviation of each variable look reasonable, so they reveal no obvious outliers.

</div><span class="text_page_counter">Trang 2</span><div class="page_container" data-page="2">

We can also compute the minimum and maximum values for each variable. Enter the following formula in cell H5:

<i>= MIN(C2:C457)</i>

Enter the following formula in cell H6:

<i>= MAX(C2:C457)</i>

We see that the minimum and maximum values of the variable Life of Tire (Months) are 1.8 months and 601.0 months. This maximum (about 50 years) is not plausible for the life of a tire. To identify which automobile has this outlier, we sort the entire data set by Life of Tire (Months) and scroll to the last rows of the data.
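The same summary checks can be sketched in Python. The list `life_of_tire` is hypothetical; only the values 1.8, 60.1, and 601.0 echo numbers mentioned in the text.

```python
import statistics

# Hypothetical Life of Tire (Months) values; 601.0 mimics a misplaced decimal.
life_of_tire = [1.8, 20.5, 33.2, 45.0, 58.7, 60.1, 601.0]

mean = statistics.mean(life_of_tire)       # like =AVERAGE(...)
std_dev = statistics.stdev(life_of_tire)   # like =STDEV.S(...), sample standard deviation
lowest = min(life_of_tire)                 # like =MIN(...)
highest = max(life_of_tire)                # like =MAX(...)

# Sorting pushes extreme values to the ends, where they are easy to inspect.
largest_three = sorted(life_of_tire)[-3:]

print(lowest, highest)
print(largest_three)
```

A maximum that sits far above the next largest value, as 601.0 does here, is the kind of entry worth checking against the original record.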

We see in Figure 2.34 that the observation with Life of Tire (Months) value of

Tire (Months) for the other three tires from this automobile is 60.1. This suggests that the decimal for Life of Tire (Months) for this automobile’s left rear tire value

</div><span class="text_page_counter">Trang 3</span><div class="page_container" data-page="3">

rear tire value is also misplaced. Both of these erroneous entries can now be corrected. By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the minimum and maximum values

are in error and if so, what might be the correct value.

Not all erroneous values in a data set are extreme; these erroneous values are much more difficult to find. However, if the variable with suspected erroneous values has a relatively strong relationship with another variable in the data, we can use this knowledge

to look for erroneous values. Here we will consider the variables Tread Depth and Miles;

because more miles driven should lead to less tread depth on an automobile tire, we expect these two variables to have a negative relationship. A scatter chart will

</div><span class="text_page_counter">Trang 4</span><div class="page_container" data-page="4">

whether any of the tires in the data set have values for Tread Depth and Miles that are

counter to this expectation

The red ellipse in Figure 2.35 shows the region in which the points representing

</div><span class="text_page_counter">Trang 5</span><div class="page_container" data-page="5">

Closer examination of outliers and potential erroneous values may reveal an error

</div><span class="text_page_counter">Trang 6</span><div class="page_container" data-page="6">

technique. Dimension reduction is the process of removing variables from the analysis without losing crucial information. One simple method for reducing the number of variables is to examine pairwise correlations to detect variables or

may supply similar information. Such variables can be aggregated or removed to allow

more parsimonious model development.

A critical part of data mining is determining how to represent the measurements of the

variables and which variables to consider. The treatment of categorical variables is particularly important. Typically, it is best to encode categorical variables with 0–1

</div><span class="text_page_counter">Trang 7</span><div class="page_container" data-page="7">

different categories results in a large number of variables. In these cases, the use of PivotTables is helpful in identifying categories that are similar and can possibly be

reduce the number of 0–1 dummy variables. For example, some categorical

code, product model number) may have many possible categories such that, for the purpose of model building, there is no substantive difference between multiple

therefore the number of categories may be reduced by combining categories.

Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship. Financial data supplying information on stock

may be as useful as the derived variable representing the price/earnings (PE) ratio.

</div><span class="text_page_counter">Trang 8</span><div class="page_container" data-page="8">

Building a frequency distribution for a categorical variable in Excel

Suppose we have the following data table.

We want to build a frequency distribution for each type of soft drink, so we use the COUNTIF function as follows.

The result is as follows.
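For comparison, a categorical frequency distribution can be sketched in Python with `collections.Counter`. The list `purchases` is hypothetical and does not reproduce the worksheet data.

```python
from collections import Counter

# Hypothetical soft drink purchases (not the data from the worksheet).
purchases = ["Coca-Cola", "Sprite", "Pepsi", "Coca-Cola",
             "Diet Coke", "Pepsi", "Coca-Cola"]

# Plays the role of one =COUNTIF(range, category) formula per category.
frequency = Counter(purchases)

# Relative frequency: each count divided by the number of observations.
relative_frequency = {drink: count / len(purchases)
                      for drink, count in frequency.items()}

for drink, count in frequency.most_common():
    print(drink, count, round(relative_frequency[drink], 3))
```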

</div><span class="text_page_counter">Trang 9</span><div class="page_container" data-page="9">

Building a frequency distribution for a quantitative variable in Excel

To build a frequency distribution for a quantitative variable, we use the FREQUENCY function.
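The binning logic can be sketched in Python. This assumes that each bin is identified by its upper limit, that a value equal to a limit falls in that bin, and that one extra count collects values above the last limit. The helper name `frequency` and the data are hypothetical.

```python
from bisect import bisect_left

def frequency(data, bin_upper_limits):
    """Count values per bin (previous limit, limit], plus one overflow bin."""
    counts = [0] * (len(bin_upper_limits) + 1)
    for value in data:
        # Index of the first upper limit that is >= value.
        counts[bisect_left(bin_upper_limits, value)] += 1
    return counts

# Hypothetical data (20 values) and bin upper limits (5 bins).
data = [12, 15, 20, 22, 14, 14, 15, 27, 21, 18,
        19, 18, 22, 33, 16, 18, 17, 23, 28, 13]
bins = [14, 19, 24, 29, 34]

counts = frequency(data, bins)
print(counts)
```

The bin limits must be sorted in ascending order for `bisect_left` to place each value correctly.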

Drawing a histogram for a quantitative variable

To draw a histogram, we need the Data Analysis ToolPak add-in.

<b>Step 1. Click the Data tab in the Ribbon</b>

<b>Step 2. Click Data Analysis in the Analyze group</b>

<b>Step 3. When the Data Analysis dialog box opens, choose Histogram from the</b>

</div><span class="text_page_counter">Trang 10</span><div class="page_container" data-page="10">

list of

<b>Analysis Tools, and click OK</b>

In the <b>Input Range:</b> box, enter the data range; in this example we enter <i>A2:A21</i>.

In the <b>Bin Range:</b> box, enter the upper limits of each bin; here we enter <i>C2:C6</i>.

Under <b>Output Options:</b>, select <b>New Worksheet Ply:</b>

Select the check box for <b>Chart Output</b> (see Figure 2.13)

Click <b>OK</b>

The result is a new worksheet containing the histogram.

</div><span class="text_page_counter">Trang 11</span><div class="page_container" data-page="11">

Drawing a box plot for a quantitative variable

The step-by-step directions below illustrate how to create boxplots in Excel for a single variable and for multiple variables. First we will create a boxplot for a single variable.

<b>Step 1. Select cells B1:B13</b>

<b>Step 2. Click the Insert tab on the Ribbon</b>

<b>Click the Insert Statistic Chart button in the Charts group</b>

<b>Choose the Box and Whisker chart from the drop-down menu</b>

The resulting boxplot created in Excel is shown in Figure 2.24. Comparing this figure to

Figure 2.22, we see that all the important elements of a boxplot are generated here. Excel

orients the boxplot vertically, and by default it also includes a marker for the mean.

<i>Next we will use the HomeSalesComparison file to create boxplots in Excel for</i> multiple variables similar to what is shown in Figure 2.26.

</div><span class="text_page_counter">Trang 12</span><div class="page_container" data-page="12">

<b>Step 1. Select cells B1:F11</b>

<b>Step 2. Click the Insert tab on the Ribbon</b>

<b>Click the Insert Statistic Chart button in the Charts group</b>

<b>Choose the Box and Whisker chart from the drop-down menu</b>

The boxplot created in Excel is shown in Figure 2.25. Excel again orients the

</div><span class="text_page_counter">Trang 14</span><div class="page_container" data-page="14">

CHAPTER 2: DESCRIPTIVE STATISTICS

Computing measures of central tendency for a quantitative variable

We can compute the mean, median, and mode of a quantitative variable using the AVERAGE, MEDIAN, and MODE functions.

In the table above, our data series has two modes, 138 and 25400000, so we must use the MODE.MULT function (MULT stands for multimodal). If the data series has only one mode, we use the MODE.SNGL function.
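The three measures can be sketched in Python with the standard `statistics` module. The list `values` is hypothetical and was chosen to have two modes.

```python
import statistics

# Hypothetical data with two modes (3 and 7).
values = [3, 7, 3, 9, 7, 12]

mean = statistics.mean(values)         # like =AVERAGE(...)
median = statistics.median(values)     # like =MEDIAN(...)
modes = statistics.multimode(values)   # like =MODE.MULT(...): every most frequent value

print(mean, median, modes)
```

`statistics.multimode` returns a list, so it handles both the single-mode and the multimodal case.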

We can also compute the geometric mean in Excel.

The geometric mean is often used in analyzing growth rates in financial data. In these

types of situations, the arithmetic mean or average value will provide misleading results.

</div><span class="text_page_counter">Trang 15</span><div class="page_container" data-page="15">

To illustrate the use of the geometric mean, consider Table 2.10, which shows the percentage annual returns, or growth rates, for a mutual fund over the past 10 years. We refer to 0.779 as the growth factor for year 1 in Table 2.10. We can compute the balance at the end of year 1 by multiplying the value invested in the fund at the beginning of year 1 by the growth factor for year 1: $100(0.779) = $77.90. The balance in the fund at the end of year 1, $77.90, now becomes the beginning balance in year 2. So, with a percentage annual return for year 2 of 28.7%, the

In other words, the balance at the end of year 2 is just the initial investment at the beginning of year 1 times the product of the first two growth factors. This result can be generalized to show that the balance at the end of year 10 is the initial

</div><span class="text_page_counter">Trang 16</span><div class="page_container" data-page="16">

investment times the

compute the balance at the end of year 10 for any amount of money invested at the beginning of year 1 by multiplying the value of the initial investment by 1.335. For

</div><span class="text_page_counter">Trang 17</span><div class="page_container" data-page="17">

What was the mean percentage annual return or mean rate of growth for this investment over the 10-year period? The geometric mean of the 10 growth factors

answer this question. Because the product of the 10 growth factors is 1.335, the geometric

The geometric mean tells us that annual returns grew at an average annual rate of (1.029 − 1)100, or 2.9%. In other words, with an average annual growth rate of 2.9%, a $100 investment in the fund at the beginning of year 1 would grow to $100(1.029)<sup>10</sup> = $133.09 at the end of 10 years. We can use Excel to calculate the

geometric mean for the data in Table 2.10 by using the function GEOMEAN. In Figure 2.17, the value for the geometric mean in cell C13 is found using the formula

<i>= GEOMEAN(C2:C11). </i>
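The arithmetic above can be checked with a short Python sketch. The product 1.335 comes from the text; the list `factors` is a hypothetical three-year example (0.779 and 1.287 follow from the returns quoted for years 1 and 2, and 1.109 is made up) and is not the full Table 2.10.

```python
import math
import statistics

# From the text: the product of the 10 growth factors is 1.335.
product_of_factors = 1.335
geometric_mean = product_of_factors ** (1 / 10)   # 10th root of the product
mean_growth_rate = (geometric_mean - 1) * 100     # percent per year

# Same idea on a short hypothetical list, mirroring =GEOMEAN(range).
factors = [0.779, 1.287, 1.109]
check = statistics.geometric_mean(factors)

print(round(geometric_mean, 3), round(mean_growth_rate, 1))
print(check, math.prod(factors) ** (1 / 3))
```

The geometric mean of n growth factors is the nth root of their product, which is why the 10th root of 1.335 gives the mean annual growth factor.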

It is important to understand that the arithmetic mean of the percentage annual returns does not provide the mean annual growth rate for this investment. The sum of

the 10 percentage annual returns in Table 2.10 is 50.4. Thus, the arithmetic mean

</div><span class="text_page_counter">Trang 18</span><div class="page_container" data-page="18">

mean is appropriate only for an additive process. For a multiplicative process, such as applications involving growth rates, the geometric mean is the appropriate

</div><span class="text_page_counter">Trang 19</span><div class="page_container" data-page="19">

applications include changes in the populations of species, crop yields, pollution

death rates. The geometric mean can also be applied to changes that occur over any number of successive periods of any length. In addition to annual changes, the

often applied to find the mean rate of change over quarters, months, weeks, and even days.

Computing measures of variability for a quantitative variable

We can use the following formulas to compute measures of variability for a quantitative variable.

Note that to compute the standard deviation and the variance we use the .S versions of the functions, which treat the data as a sample rather than as the entire population.
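The difference between the sample (.S) and population (.P) versions can be sketched in Python. The list `sample` is hypothetical.

```python
import statistics

# Hypothetical sample of five observations (mean is 44).
sample = [46, 54, 42, 46, 32]

sample_variance = statistics.variance(sample)        # like =VAR.S(...): divides by n - 1
sample_std_dev = statistics.stdev(sample)            # like =STDEV.S(...)
population_variance = statistics.pvariance(sample)   # like =VAR.P(...): divides by n

print(sample_variance, sample_std_dev, population_variance)
```

The squared deviations sum to 256 in both cases; only the divisor changes, which is why the sample variance is the larger of the two.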

<b>Variance is a measure of variability in the values of a random variable. It is a</b>

weighted

</div><span class="text_page_counter">Trang 20</span><div class="page_container" data-page="20">

average of the squared deviations of a random variable from its mean where the

<i>referred to as the variance. The notations Var(x) and σ<sup>2</sup> are both used to denote the</i>

variance of a random variable

The calculation of the variance of the number of payments made per year by a mortgage

customer is summarized in Table 4.12. We see that the variance is 42.360. The

<b>deviation, σ, is defined as the positive square root of the variance. Thus, the</b>

standard deviation for the number of payments made per year by a mortgage

</div><span class="text_page_counter">Trang 21</span><div class="page_container" data-page="21">

The Excel function SUMPRODUCT can be used to easily calculate equation

a custom discrete random variable. We illustrate the use of the SUMPRODUCT

We can also use Excel to find the variance directly from the data when the values

occur with relative frequencies that correspond to the probability distribution of the random

variable. Cell F305 in Figure 4.12 shows that we use the Excel formula

<i>=VAR.P(F2:F301) </i>to calculate the variance from the complete data. This formula

which is the same as that calculated in Table 4.12 and Figure 4.13. Similarly, we

<i>formula =STDEV.P(F2:F301) to calculate the standard deviation of 6.508.</i>

As with the AVERAGE function and expected value, we cannot use the Excel functions

<i>VAR.P and STDEV.P directly on the x values to calculate the variance and standard deviation of a custom discrete random variable if the x values are not</i>

Instead we must either use the formula from equation (4.14) or use the Excel

the entire data set as shown in Figure 4.12

</div><span class="text_page_counter">Trang 23</span><div class="page_container" data-page="23">

variables, each with different standard deviations and different means.

Computing distribution measures for a quantitative variable

A z-score allows us to measure the relative location of a value in the data set. More

<i>specifically, a z-score helps us determine how far a particular value is from the</i>

mean relative to the data set’s standard deviation. Suppose we have a sample of n

values denoted by x<sub>1</sub>, x<sub>2</sub>, …, x<sub>n</sub>. In addition, assume that the sample mean, x̄, and

<i>the sample standard deviation, s, are already computed. Associated with each</i>

<i>value called its z-score. Equation (2.8) shows how the z-score is computed for each x<sub>i</sub>: </i>
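For reference, the z-score is conventionally defined as below. This is a sketch of the standard definition, assuming equation (2.8) takes the usual form, with x̄ the sample mean and s the sample standard deviation:

```latex
z_i = \frac{x_i - \bar{x}}{s}
```

A value equal to the mean gives z<sub>i</sub> = 0, and a value one standard deviation above the mean gives z<sub>i</sub> = 1.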

</div><span class="text_page_counter">Trang 24</span><div class="page_container" data-page="24">

<i>The z-score is often called the standardized value. The z-score, z<sub>i</sub>, can be</i>

that the value of the observation is equal to the mean.

<i>The z-scores for the class size data are computed in Table 2.13. Recall the</i>

</div><span class="text_page_counter">Trang 25</span><div class="page_container" data-page="25">

deviations below the mean.

<i>The z-score can be calculated in Excel using the function STANDARDIZE. Figure</i>

</div><span class="text_page_counter">Trang 26</span><div class="page_container" data-page="26">

Computing the covariance between two quantitative variables

<b>Covariance is a descriptive measure of the linear association between two</b>

the deviation of each x<sub>i</sub> from its sample mean (x<sub>i</sub> − x̄) by the deviation of the corresponding y<sub>i</sub> from its sample mean (y<sub>i</sub> − ȳ); this sum is then divided by n − 1.

<i>To measure the strength of the linear relationship between the high temperature x</i>

that for our calculations, x̄ = 84.6 and ȳ = 26.3

The covariance calculated in Table 2.15 is <i>s<sub>xy</sub> = 12.8</i>

than 0, it indicates a positive relationship between the high temperature and sales

water. This verifies the relationship we saw in the scatter chart in Figure 2.26 that

</div><span class="text_page_counter">Trang 27</span><div class="page_container" data-page="27">

high temperature for a day increases, sales of bottled water generally increase. The sample covariance can also be calculated in Excel using the COVARIANCE.S function. Figure 2.27 shows the data from Table 2.14 entered into an Excel

<i>covariance is calculated in cell B17 using the formula =COVARIANCE.S(A2:A15,B2:B15). A2:A15 defines the range for the x variable (high temperature), and</i>

<i>range for the y variable (sales of bottled water). </i>
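The sample covariance calculation described above can be sketched in Python. The helper name `sample_covariance` and the five data pairs are hypothetical and do not reproduce Table 2.14.

```python
import statistics

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(products) / (len(x) - 1)

# Hypothetical daily high temperatures and bottled water sales.
high_temp = [78, 79, 80, 80, 82]
sales = [23, 22, 24, 22, 24]

s_xy = sample_covariance(high_temp, sales)
print(s_xy)
```

A positive result, as here, points to a positive linear association between the two variables.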

</div><span class="text_page_counter">Trang 28</span><div class="page_container" data-page="28">

For the bottled water, the covariance is positive, indicating that higher

<i>are associated with higher sales (y). If the covariance is near 0, then the x and y</i>

<i>are not linearly related. If the covariance is less than 0, then the x and y variables are negatively related, which means that as x increases, y generally decreases.</i>

Figure 2.28 demonstrates several possible scatter charts and their associated

One problem with using covariance is that the magnitude of the covariance value is difficult to interpret. Larger s<sub>xy</sub> values do not necessarily mean a stronger linear

</div><span class="text_page_counter">Trang 31</span><div class="page_container" data-page="31">

Computing the correlation coefficient between two quantitative variables

The correlation coefficient measures the relationship between two variables, and, unlike

covariance, the relationship between two variables is not affected by the units of

<i>measurement for x and y. For sample data, the correlation coefficient is defined as</i>

</div><span class="text_page_counter">Trang 32</span><div class="page_container" data-page="32">

scales the correlation coefficient so that it will always take values between −1 and +1.
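The scaling can be illustrated with a Python sketch that divides the sample covariance by the product of the two sample standard deviations. The helper name `sample_correlation` and the data are hypothetical.

```python
import statistics

def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    return s_xy / (statistics.stdev(x) * statistics.stdev(y))

# Hypothetical data: one perfectly increasing and one perfectly decreasing line.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

r_up = sample_correlation(x, y_up)
r_down = sample_correlation(x, y_down)
print(r_up, r_down)
```

Points that fall exactly on an upward-sloping line give a coefficient of +1 and points on a downward-sloping line give −1, up to floating-point rounding.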

Let us now compute the sample correlation coefficient for bottled water sales at Queensland Amusement Park. Recall that we calculated s<sub>xy</sub> = 12.8 using equation (2.9).

<i>Using data in Table 2.14, we can compute sample standard deviations for x and y </i>

The sample correlation coefficient is computed from equation (2.10) as follows: 12.8

The correlation coefficient can take only values between −1 and +1. Correlation

<i>coefficient values near 0 indicate no linear relationship between the x and y</i>

variables. Correlation coefficients greater than 0 indicate a positive linear

<i>variables. The closer the correlation coefficient is to +1, the closer the x and y</i>

to forming a straight line that trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative linear relationship between

<i>The closer the correlation coefficient is to −1, the closer the x and y values are to</i>

</div><span class="text_page_counter">Trang 33</span><div class="page_container" data-page="33">

we can see in Figure 2.26, one could draw a straight line with a positive slope that would

close to all of the data points in the scatter chart. Because the correlation coefficient defined here measures only the strength of the

cooling) and the daily high outside temperature for 100 consecutive days. The sample correlation coefficient for these data is r<sub>xy</sub> = −0.007 and indicates that there is no linear relationship between the two variables. However, Figure 2.29 provides

strong visual evidence of a nonlinear relationship. That is, we can see that as the daily high

</div><span class="text_page_counter">Trang 34</span><div class="page_container" data-page="34">

outside temperature increases, the money spent on environmental control first

less heating is required and then increases as greater cooling is required. We can compute correlation coefficients using the Excel function CORREL. The correlation coefficient in Figure 2.27 is computed in cell B18 for the sales of

</div><span class="text_page_counter">Trang 35</span><div class="page_container" data-page="35">

CHAPTER 3: STATISTICAL INFERENCE

Sampling from a Finite Population

Statisticians recommend selecting a probability sample when sampling from a finite population because a probability sample allows you to make valid statistical

<i>random sample of size n from a finite population of size N is defined as follows.</i>

Procedures used to select a simple random sample from a finite population are

generated is called a random number because the mathematical procedure used by the RAND function guarantees that every number between 0 and 1 has the same

</div><span class="text_page_counter">Trang 36</span><div class="page_container" data-page="36">

<i>N involves two steps.</i>

<b>Step 1. Assign a random number to each element of the population.</b>

<i><b>Step 2. Select the n elements corresponding to the n smallest random numbers.</b></i>

<i>Because each set of n elements in the population has the same probability of being assigned the n smallest random numbers, each set of n elements has the same </i>

probability of

being selected for the sample. If we select the sample using this two-step procedure, every

<i>sample of size n has the same probability of being selected; thus, the sample </i>

selected satisfies the definition of a simple random sample.
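The two-step procedure can be sketched in Python. The function name `simple_random_sample`, the numeric employee IDs, and the fixed seed are assumptions made for illustration; the population size of 2500 and sample size of 30 follow the example in the text.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Tag each element with a random number, then keep the n smallest tags."""
    rng = random.Random(seed)
    # Step 1: assign a random number in [0, 1) to each element.
    tagged = [(rng.random(), element) for element in population]
    # Step 2: select the n elements with the smallest random numbers.
    tagged.sort(key=lambda pair: pair[0])
    return [element for _, element in tagged[:n]]

# Hypothetical population of 2500 employee IDs and a sample of 30.
employees = list(range(1, 2501))
sample = simple_random_sample(employees, 30, seed=42)
print(sample)
```

Fixing the seed only makes the sketch reproducible; in practice the seed would be left unset so that each run draws a fresh sample.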

Let us consider the process of selecting a simple random sample of 30 EAI

<i><b>Step 1. In cell D1, enter the text Random Numbers</b></i>

<b>Step 2. In cells D2:D2501, enter the formula =RAND()</b>

<b>Step 3. Select the cell range D2:D2501</b>

<b>Step 4. In the Home tab in the Ribbon:</b>

<b>Click Copy in the Clipboard group</b>

<b>Click the arrow below Paste in the Clipboard group. When the Paste window appears, click Values in the Paste Values area</b>

<b>Press the Esc key</b>

</div><span class="text_page_counter">Trang 37</span><div class="page_container" data-page="37">

<b>Step 5. Select cells A1:D2501</b>

<b>Step 6. In the Data tab on the Ribbon, click Sort in the Sort & Filter group</b>

<b>Step 7. When the Sort dialog box appears:</b>

<b>Select the check box for My data has headers</b>

<b>In the first Sort by dropdown menu, select Random Numbers</b>

30 random numbers that were generated. Hence, this group of 30 employees is a simple random sample. Note that the random numbers shown on the right in Figure

sample, and employee 13 in the population (see row 14 of the worksheet on the left) has been

included as the 22nd observation in the sample (row 23 of the worksheet on the right)

</div><span class="text_page_counter">Trang 38</span><div class="page_container" data-page="38">

Sampling from an Infinite Population

Sometimes we want to select a sample from a population, but the population is

</div><span class="text_page_counter">Trang 39</span><div class="page_container" data-page="39">

infinite population case. With an infinite population, we cannot select a simple random

sample because we cannot construct a frame consisting of all the elements. In the infinite

population case, statisticians recommend selecting what is called a random sample.

Care and judgment must be exercised in implementing the selection process for obtaining a random sample from an infinite population. Each case may require a different

selection procedure. Let us consider two examples to see what we mean by the conditions: (1) Each element selected comes from the same population, and (2)

</div><span class="text_page_counter">Trang 40</span><div class="page_container" data-page="40">

With a production operation such as this, the biggest concern in selecting a random sample is to make sure that condition 1, the sampled elements are selected from the

selecting some boxes when the process is operating properly and other boxes when the process is not operating properly and is underfilling or overfilling the boxes. With a production process such as this, the second condition, each element is selected independently,

is satisfied by designing the production process so that each box of cereal is filled independently. With this assumption, the quality-control inspector need only worry about satisfying the same population condition.

As another example of selecting a random sample from an infinite population,

</div>
