
Business Analytics: Data Analysis and Decision Making

Chapter 11
Regression Analysis: Statistical Inference

© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.


Introduction
 Two basic problems are discussed in this chapter:
 Population regression model
 Inferring its characteristics—that is, its intercept and slope term(s)—from the
corresponding terms estimated by least squares

 Determining which explanatory variables belong in the equation
 Inferring whether there is any population regression equation worth pursuing

 Prediction
 Predicting values of the dependent variable for new observations
 Calculating prediction intervals to measure the accuracy of the predictions




The Statistical Model
(slide 1 of 7)

 To perform statistical inference in a regression context, a statistical
model is required—that is, we must first make several assumptions
about the population.

 These assumptions represent an idealization of reality and are never
likely to be entirely satisfied for the population in any real study.

 From a practical point of view, all we can ask is that they represent a close
approximation to reality.

 If the assumptions are grossly violated, statistical inferences that are based
on these assumptions should be viewed with suspicion.



The Statistical Model
(slide 2 of 7)

 Regression assumptions:
 There is a population regression line.
 It joins the means of the dependent variable for all values of the explanatory
variables.

 For any fixed values of the explanatory variables, the mean of the errors is zero.


 For any values of the explanatory variables, the variance (or standard
deviation) of the dependent variable is a constant, the same for all such
values.

 For any values of the explanatory variables, the dependent variable is
normally distributed.

 The errors are probabilistically independent.



The Statistical Model
(slide 3 of 7)

 The first assumption is probably the most important.
 It implies that for some set of explanatory variables, there is an exact linear
relationship in the population between the means of the dependent variable
and the values of the explanatory variables.

 Equation for population regression line joining means:
Mean of Y = α + β1X1 + β2X2 + ... + βkXk
 α is the intercept term, and the βs are the slope terms. (Greek letters are used to denote that they are unobservable population parameters.)

 Most individual Ys do not lie on the population regression line.
 The vertical distance from any point to the line is an error.
 Equation for population regression line with error:
Y = α + β1X1 + β2X2 + ... + βkXk + ε
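To make the model concrete, here is a minimal simulation sketch (not from the textbook; the parameter values are arbitrary assumptions) that generates data from a population regression line with normally distributed errors and shows that least squares roughly recovers the unknown α and β.

```python
# Sketch only: simulate Y = alpha + beta*X + error and recover the parameters by
# least squares. alpha, beta, and sigma below are hypothetical, not textbook values.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 10.0, 2.0, 3.0      # unobservable population parameters (assumed)
n = 200

x = rng.uniform(0, 50, size=n)           # values of the explanatory variable
errors = rng.normal(0, sigma, size=n)    # mean-zero, constant-variance, independent errors
y = alpha + beta * x + errors            # individual Ys scatter vertically around the line

b, a = np.polyfit(x, y, 1)               # least-squares slope and intercept
print(f"estimated intercept a = {a:.2f}, estimated slope b = {b:.2f}")
```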




The Statistical Model
(slide 4 of 7)

 Assumption 2 concerns variation around the population regression
line.
 It states that the variation of the Ys about the regression line is the same,
regardless of the values of the Xs.

 The technical term for this property is homoscedasticity.
 A simpler term is constant error variance.

 This assumption is often questionable—the variation in Y often increases as
X increases.

 Heteroscedasticity means that the variability of Y values is larger for
some X values than for others.

 A simpler term for this is nonconstant error variance.
 The easiest way to detect nonconstant error variance is through a visual
inspection of a scatterplot.
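A quick way to see this check in practice: the sketch below (simulated data, an illustrative assumption only) plots residuals against fitted values; a fan or funnel shape is the classic sign of nonconstant error variance.

```python
# Sketch only: residuals-vs-fitted scatterplot for detecting heteroscedasticity.
# The data are simulated so that the error spread grows with X.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(1, 10, size=n)
y = 5 + 2 * x + rng.normal(0, 0.5 * x)      # error standard deviation increases with x

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Widening spread of residuals suggests nonconstant error variance")
plt.show()
```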



The Statistical Model
(slide 5 of 7)

 Assumption 3 is equivalent to stating that the errors are normally
distributed.

 You can check this by forming a histogram (or a Q-Q plot) of the residuals.
 If assumption 3 holds, the histogram should be approximately symmetric and bell-shaped, and the points of a Q-Q plot should be close to a 45-degree line.

 If there is an obvious skewness or some other nonnormal property, this indicates a
violation of assumption 3.
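As an illustration, the sketch below (simulated data, not a textbook example) forms both diagnostics from the residuals of a fitted line; scipy's probplot draws the normal Q-Q plot.

```python
# Sketch only: check assumption 3 with a histogram and a normal Q-Q plot of residuals.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 3 + 1.5 * x + rng.normal(0, 2, 200)     # simulated data with normal errors

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(residuals, bins=20)
ax1.set_title("Histogram of residuals (roughly bell-shaped if assumption 3 holds)")
stats.probplot(residuals, dist="norm", plot=ax2)   # points should hug the 45-degree line
ax2.set_title("Normal Q-Q plot of residuals")
plt.tight_layout()
plt.show()
```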



The Statistical Model
(slide 6 of 7)

 Assumption 4 requires probabilistic independence of the errors.
 This assumption means that information on some of the errors provides no
information on the values of the other errors.

 For cross-sectional data, this assumption is usually taken for granted.
 For time-series data, this assumption is often violated.
 This is because of a property called autocorrelation.
 The Durbin-Watson statistic is one measure of autocorrelation and thus measures
the extent to which assumption 4 is violated.
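For reference, the Durbin-Watson statistic can be computed directly from the residuals, as in the sketch below (the helper function is an illustration, not StatTools output); values near 2 suggest little autocorrelation, while values well below 2 suggest positive autocorrelation.

```python
# Sketch only: Durbin-Watson statistic computed from a series of residuals.
import numpy as np

def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    diffs = np.diff(residuals)
    return np.sum(diffs ** 2) / np.sum(residuals ** 2)

# Simulated residuals with positive autocorrelation (AR(1) errors)
rng = np.random.default_rng(3)
e = np.zeros(100)
for t in range(1, 100):
    e[t] = 0.7 * e[t - 1] + rng.normal(0, 1)

print(f"Durbin-Watson statistic: {durbin_watson(e):.2f}")   # typically well below 2 here
```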



The Statistical Model
(slide 7 of 7)

 One other assumption is important for numerical calculations: No explanatory variable can be an exact linear combination of any other explanatory variables.
 The violation occurs if one of the explanatory variables can be written as a
weighted sum of several of the others.

 This is called exact multicollinearity.
 If it exists, there is redundancy in the data.

 A more common and serious problem is multicollinearity, where explanatory
variables are highly, but not exactly, correlated.
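One common numerical check for this problem is the variance inflation factor (VIF), sketched below on simulated data (the "VIF above 10" rule of thumb and the data are illustrative assumptions, not part of the textbook output).

```python
# Sketch only: variance inflation factors. Each VIF is 1 / (1 - R^2) from regressing
# one explanatory variable on all the others; large values signal multicollinearity.
import numpy as np

def vif(X):
    """Return the VIF for each column of the explanatory-variable matrix X."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        y_j = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y_j, rcond=None)
        resid = y_j - others @ beta
        r2 = 1 - resid.var() / y_j.var()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                    # unrelated to the others
print([round(v, 1) for v in vif(np.column_stack([x1, x2, x3]))])  # x1, x2 huge; x3 near 1
```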



Inferences about the
Regression Coefficients
 In the equation for the population regression line, α and the βs are
called the regression coefficients.
 There is one other unknown constant in the model: the variance of the
errors, labeled σ2.
 The choice of relevant explanatory variables is almost never obvious.
 Two guiding principles are relevance and data availability.
 One overriding principle is parsimony—to explain the most with the least.
 It favors a model with fewer explanatory variables, assuming that this model
explains the dependent variable almost as well as a model with additional
explanatory variables.



Sampling Distribution of the Regression Coefficients
 The sampling distribution of any estimate derived from sample data is
the distribution of this estimate over all possible samples.
 Sampling distribution of a regression coefficient:
If the regression assumptions are valid, the standardized value
(b − β) / sb
has a t distribution with n − k − 1 degrees of freedom.

 This result has three important implications:
 The estimate b is unbiased in the sense that its mean is β, the true but
unknown value of the slope.

 The estimated standard deviation of b is labeled sb.
 It is usually called the standard error of a regression coefficient, or the
standard error of b.

 It measures how much the bs would vary from sample to sample.

 The shape of the distribution of b is symmetric and bell-shaped.
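The simulation sketch below (parameter values are arbitrary assumptions) illustrates this result for simple regression: across many samples, (b − β)/sb behaves like a t random variable with n − k − 1 degrees of freedom, and the average b is close to β.

```python
# Sketch only: sampling distribution of the slope estimate in simple regression (k = 1).
import numpy as np

rng = np.random.default_rng(5)
alpha_true, beta_true, sigma, n = 4.0, 1.5, 2.0, 30

t_values = []
for _ in range(5000):                                  # repeat over many samples
    x = rng.uniform(0, 10, n)
    y = alpha_true + beta_true * x + rng.normal(0, sigma, n)
    b, a = np.polyfit(x, y, 1)                         # least-squares estimates
    resid = y - (a + b * x)
    s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))        # k = 1, so df = n - k - 1 = n - 2
    s_b = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of b
    t_values.append((b - beta_true) / s_b)

# Mean near 0 (b is unbiased); spread matches a t distribution with 28 degrees of freedom.
print(f"mean = {np.mean(t_values):.3f}, std dev = {np.std(t_values):.3f}")
```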


Example 11.1:
Overhead Costs.xlsx

(slide 1 of 2)

 Objective: To use standard regression output to make inferences about the
regression coefficients of machine hours and production runs in the equation
for overhead costs.


 Solution: The dependent variable is Overhead and the explanatory variables
are Machine Hours and Production Runs.

 The output from StatTools’s Regression procedure is shown below.
 The estimates of the regression coefficients appear under the label Coefficient.
 The column labeled Standard Error shows the sb values.
 Each b represents a point estimate of the corresponding β. The corresponding sb
indicates the accuracy of this point estimate.



Example 11.1:
Overhead Costs.xlsx

(slide 2 of 2)

 The sample data can be used to obtain a confidence interval for a
regression coefficient.

 A confidence interval for any β is of the form:
b ± t-multiple × sb
where the t-multiple depends on the confidence level and the degrees of freedom.

 StatTools always provides these 95% confidence intervals for the regression
coefficients automatically, as shown at the bottom right of the figure on the
previous slide.
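The calculation itself is short; the sketch below uses hypothetical values for b, sb, n, and k (not the Overhead Costs output) to show how the t-multiple and the interval are formed.

```python
# Sketch only: 95% confidence interval for a regression coefficient, b ± t-multiple × sb.
from scipy import stats

b, s_b = 43.5, 3.6      # hypothetical coefficient estimate and its standard error
n, k = 36, 2            # hypothetical sample size and number of explanatory variables

t_multiple = stats.t.ppf(0.975, df=n - k - 1)   # 95% confidence level, two-sided
lower, upper = b - t_multiple * s_b, b + t_multiple * s_b
print(f"95% confidence interval for beta: ({lower:.2f}, {upper:.2f})")
```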




Hypothesis Tests for the Regression Coefficients
and p-Values
 There is another important piece of information in regression outputs:
the t-values for the individual regression coefficients.
 Each t-value is the ratio of the estimated coefficient to its standard error.

 It indicates how many standard errors the regression coefficient is from
zero.

 A t-value can be used in a hypothesis test for a regression
coefficient.
 If a variable’s coefficient is zero, there is no point in including this variable
in the equation.

 To run this test, simply compare the t-value in the regression output with a
tabulated t-value and reject the null hypothesis only if the t-value from the
computer output is greater in magnitude than the tabulated t-value.
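Equivalently, the test can be run with a p-value, as in the sketch below (the numbers are hypothetical): the t-value is the coefficient divided by its standard error, and the two-tailed p-value comes from a t distribution with n − k − 1 degrees of freedom.

```python
# Sketch only: t test for a single regression coefficient (H0: beta = 0).
from scipy import stats

b, s_b = 43.5, 3.6      # hypothetical estimate and standard error
n, k = 36, 2

t_value = b / s_b                                      # standard errors away from zero
p_value = 2 * stats.t.sf(abs(t_value), df=n - k - 1)   # two-tailed p-value
print(f"t = {t_value:.2f}, p-value = {p_value:.4f}")
# Reject H0 at the 5% level if p < 0.05 (equivalently, |t| exceeds the tabulated t-value).
```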



A Test for the Overall Fit:
The ANOVA Table (slide 1 of 3)
 It is conceivable that none of the explanatory variables in the regression
equation explains the dependent variable.
 An indication of this problem is a very small R2 value.

 An equation has no explanatory power if the same value of Y is predicted regardless of the values of the Xs.

 The null hypothesis is that all coefficients of the explanatory variables are zero.
 The alternative is that at least one of these coefficients is not zero.



A Test for the Overall Fit:
The ANOVA Table (slide 2 of 3)
 To test the null hypothesis, use an F test, a formal procedure for
testing whether the explained variation is large compared to the
unexplained variation.
 This is also called the ANOVA (analysis of variance) test because the
elements for calculating the required F-value are shown in an ANOVA table
for regression.

 The ANOVA table splits the total variation of the Y variable (SST):
SST = Σ(Yi − Ȳ)²
into the part unexplained by the regression equation (SSE):
SSE = Σ(Yi − Ŷi)²
and the part that is explained (SSR):
SSR = Σ(Ŷi − Ȳ)²



A Test for the Overall Fit:
The ANOVA Table (slide 3 of 3)
 The required F-ratio for the test is:
F = MSR / MSE
where
MSR = SSR / k
and
MSE = SSE / (n − k − 1)
 If the F-ratio is small, the explained variation is small relative to the
unexplained variation, and there is evidence that the regression equation
provides little explanatory value.

 The F-ratio has an associated p-value that allows you to run the test
easily; it is reported in most regression outputs.
 Reject the null hypothesis—and conclude that the X variables have at least
some explanatory value—if the F-value in the ANOVA table is large and the
corresponding p-value is small.
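The sketch below (simulated data, illustrative only) assembles the ANOVA quantities by hand so the pieces of the F test are visible: SST splits into SSE and SSR, and F = MSR/MSE is compared to an F distribution with k and n − k − 1 degrees of freedom.

```python
# Sketch only: build the ANOVA table quantities and the F test for overall fit.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, k = 80, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, k))])   # intercept plus k Xs
y = X @ np.array([5.0, 2.0, -1.0]) + rng.normal(0, 3, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

SST = np.sum((y - y.mean()) ** 2)     # total variation
SSE = np.sum((y - y_hat) ** 2)        # unexplained variation
SSR = SST - SSE                       # explained variation

MSR = SSR / k
MSE = SSE / (n - k - 1)
F = MSR / MSE
p_value = stats.f.sf(F, k, n - k - 1)
print(f"F = {F:.2f}, p-value = {p_value:.2g}")   # small p-value: the Xs have explanatory value
```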



Multicollinearity
 Multicollinearity occurs when there is a fairly strong linear
relationship among a set of explanatory variables.
 In this case, the relationship between the explanatory variable X and the
dependent variable Y is not always accurately reflected in the coefficient of
X; it depends on which other Xs are included or not included in the
equation.

 There are various degrees of multicollinearity, but in each of them, there is
a linear relationship between two or more explanatory variables.

 The symptoms of multicollinearity can be “wrong” signs of the coefficients,
smaller-than-expected t-values, and larger-than-expected (insignificant) p-values.




Example 11.2:
Heights Simulation.xlsx

(slide 1 of 2)

 Objective: To illustrate the problem of multicollinearity when both
foot length variables are used in a regression for height.

 Solution: The dependent variable is Height, and the explanatory
variables are Right and Left, the length of the right foot and the left
foot, respectively.

 Simulation is used to generate a hypothetical data set of heights and
left and right foot lengths.

 Height is approximately 31.8 plus 3.2 times foot length (all expressed
in inches).



Example 11.2:
Heights Simulation.xlsx

(slide 2 of 2)

 The regression output when both Right and Left are entered in the equation
for Height appears at the bottom right of the figure below.
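The flavor of this example can be reproduced with a short simulation (the sketch below uses its own random data, not the data in Heights Simulation.xlsx): because Right and Left are nearly identical, their two coefficients split the foot-length effect unpredictably, while either variable alone gives a sensible slope near 3.2.

```python
# Sketch only: simulated heights with two nearly collinear predictors (right/left foot length).
import numpy as np

rng = np.random.default_rng(7)
n = 100
foot = rng.normal(11, 1, n)                    # underlying foot length (inches)
right = foot + rng.normal(0, 0.1, n)           # two nearly identical measurements
left = foot + rng.normal(0, 0.1, n)
height = 31.8 + 3.2 * foot + rng.normal(0, 2, n)

# Both predictors together: individual coefficients are unstable
X_both = np.column_stack([np.ones(n), right, left])
coef_both, *_ = np.linalg.lstsq(X_both, height, rcond=None)

# One predictor at a time: a sensible slope near 3.2
X_right = np.column_stack([np.ones(n), right])
coef_right, *_ = np.linalg.lstsq(X_right, height, rcond=None)

print("Right and Left together:", np.round(coef_both, 2))
print("Right alone:            ", np.round(coef_right, 2))
```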




Include/Exclude Decisions
 The t-values of regression coefficients can be used to make
include/exclude decisions for explanatory variables in a regression
equation.
 Finding the best Xs to include in a regression equation is the most
difficult part of any real regression analysis.
 You are always trying to get the best fit possible, but the principle of parsimony suggests using as few variables as possible.

 This presents a trade-off, where there are not always easy answers.
 To help with this decision, several guidelines are presented on the next
slide.



Guidelines for Including/Excluding Variables in a
Regression Equation
 Look at a variable’s t-value and its associated p-value. If the
p-value is above some accepted significance level, such as 0.05, this
variable is a candidate for exclusion.
 Check whether a variable’s t-value is less than 1 or greater than 1 in
magnitude. If it is less than 1, then it is a mathematical fact that se
will decrease (and adjusted R2 will increase) if this variable is
excluded from the equation.
 Look at t-values and p-values, rather than correlations, when making include/exclude decisions. An explanatory variable can have a fairly high correlation with the dependent variable, but because of other variables included in the equation, it might not be needed.
 When there is a group of variables that are in some sense logically
related, it is sometimes a good idea to include all of them or exclude
all of them.
 Use economic and/or physical theory to decide whether to include or
exclude variables, and put less reliance on t-values and/or p-values.


Example 11.3:
Catalog Marketing.xlsx

(slide 1 of 2)

 Objective: To see which potential explanatory variables are useful for
explaining current year spending amounts at HyTex with multiple
regression.

 Solution: Data file contains data on 1000 customers who purchased
mail-order products from HyTex Company.

 For each customer, data on several variables are included.
 Base the regression on the first 750 observations and use the other
250 for validation.

 Enter all of the potential explanatory variables.
 Then exclude unnecessary variables based on their t-values and p-values.
 Four variables, Age, Gender, Own Home, and Married, have p-values well above 0.05 and are obvious candidates for exclusion.

 Exclude variables one at a time, starting with the variable that has the
highest p-value, and rerun the regression after each exclusion.
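A hedged sketch of that exclusion strategy appears below; it is not the StatTools procedure used in the textbook, and it assumes a pandas DataFrame with numeric columns, a hypothetical dependent variable named "AmountSpent", and statsmodels for the regressions.

```python
# Sketch only: drop the variable with the highest p-value above 0.05, refit, and repeat.
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df: pd.DataFrame, dependent: str, threshold: float = 0.05):
    """Repeatedly remove the least significant explanatory variable, then refit."""
    predictors = [c for c in df.columns if c != dependent]
    while True:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[dependent], X).fit()
        pvals = model.pvalues.drop("const")
        if pvals.empty or pvals.max() <= threshold:
            return model, predictors
        predictors.remove(pvals.idxmax())      # exclude the variable with the highest p-value

# Hypothetical usage (column names are assumptions, not the actual HyTex file layout):
# model, kept = backward_eliminate(df, dependent="AmountSpent")
```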



Example 11.3:
Catalog Marketing.xlsx

(slide 2 of 2)

 The resulting output appears below.



Stepwise Regression
 Many statistical packages provide some assistance in include/exclude
decisions by including automatic equation-building options.
 These options estimate a series of regression equations by successively
adding (or deleting) variables according to prescribed rules.

 Generically, these methods are referred to as stepwise regression.
 There are three types of equation-building procedures:
 Forward—begins with no explanatory variables in the equation and successively adds one at a time until no remaining variables make a significant contribution.


 Backward—begins with all potential explanatory variables in the equation
and deletes them one at a time until further deletion would do more harm
than good.

 Stepwise—is much like a forward procedure, except that it also considers
possible deletions along the way.

 All of these procedures have the same basic objective—to find an
equation with a small se and a large R2 (or adjusted R2).
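For concreteness, here is a minimal forward-selection sketch in the same spirit (it does not reproduce StatTools's exact entry and exit rules; the p-value entry threshold is an assumption): at each step the most significant remaining candidate is added, and the procedure stops when no candidate is significant.

```python
# Sketch only: generic forward selection by p-value, using statsmodels for each fit.
import pandas as pd
import statsmodels.api as sm

def forward_select(df: pd.DataFrame, dependent: str, p_enter: float = 0.05):
    """Add the most significant remaining variable until none passes the entry threshold."""
    remaining = [c for c in df.columns if c != dependent]
    chosen = []
    while remaining:
        best_var, best_p = None, p_enter
        for var in remaining:
            X = sm.add_constant(df[chosen + [var]])
            p = sm.OLS(df[dependent], X).fit().pvalues[var]
            if p < best_p:
                best_var, best_p = var, p
        if best_var is None:       # no remaining variable makes a significant contribution
            break
        chosen.append(best_var)
        remaining.remove(best_var)
    return chosen
```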

