

Introduction to

Probability and Statistics
for Science, Engineering,
and Finance

Walter A. Rosenkrantz
Department of Mathematics and Statistics
University of Massachusetts at Amherst


Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487‑2742
© 2009 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid‑free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number‑13: 978‑1‑58488‑812‑3 (Hardcover)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978‑750‑8400. CCC is a not‑for‑profit organization that provides licenses and registration for a variety of users. For orga‑
nizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Library of Congress Cataloging‑in‑Publication Data
Rosenkrantz, Walter A.
Introduction to probability and statistics for science, engineering, and finance / Walter A.
Rosenkrantz.
p. cm.
Includes bibliographical references and index.
ISBN 978‑1‑58488‑812‑3 (alk. paper)
1. Probabilities. 2. Mathematical statistics. I. Title.
QA273.R765 2008
519.5--dc22

2008013044

Visit the Taylor & Francis Web site at

and the CRC Press Web site at


Preface

Student Audience and Prerequisites
This book is written for undergraduate students majoring in engineering, computer science, mathematics, economics, and finance who are required, or urged, to take a one- or two-semester course in probability and statistics to satisfy major or distributional requirements.
The mathematical prerequisites are two semesters of single variable calculus. Although some

multivariable calculus is used in Chapter 5, it can be safely omitted without destroying the
continuity of the text. Indeed, the topics have been arranged so that the instructor can
always adjust the mathematical sophistication to a level the students can feel comfortable
with. Chapters and sections marked with an asterisk (*) are optional; they contain material
of independent interest, but are not essential for a first course.

Objectives
My primary goal in writing this book is to integrate into the traditional one- or two-term statistics course some of the more interesting and widely used concepts in financial
engineering. For example, the volatility of a stock is the standard deviation of its returns;
value at risk (VaR) is essentially a confidence interval; a stock’s β is the slope of the
regression line obtained when one performs a linear regression of the stock’s returns against
the returns of the S&P500 index (the S&P500 index, itself, is used as a proxy for the market
portfolio). The binomial distribution, it is worth noting, plays a fundamental role in the
Cox-Ross-Rubinstein (CRR) model, also called the binomial lattice model, of stock price
fluctuations. A passage to the limit via the central limit theorem yields the lognormal
distribution for stock prices as well as the famous Black-Scholes option pricing formula.
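Several of the financial notions named above reduce to short computations once a price series is converted to returns. The sketch below is my own illustration, not taken from the book; the price series are made-up numbers chosen only for the example. It computes a stock's volatility as the standard deviation of its returns and its beta as the slope of the least-squares regression line of stock returns on index returns.

```python
# Illustrative sketch, not from the book: volatility and beta from price data.
# The price series below are hypothetical numbers chosen only for the example.
from statistics import mean, stdev

def returns(prices):
    """Simple one-period returns: r_t = (P_t - P_{t-1}) / P_{t-1}."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def beta(stock_returns, market_returns):
    """Slope of the least-squares regression of stock returns on market returns."""
    mx, my = mean(market_returns), mean(stock_returns)
    sxx = sum((x - mx) ** 2 for x in market_returns)
    sxy = sum((x - mx) * (y - my) for x, y in zip(market_returns, stock_returns))
    return sxy / sxx

stock_prices = [100.0, 101.5, 99.8, 102.3, 103.1]        # hypothetical stock
index_levels = [1500.0, 1512.0, 1498.0, 1520.0, 1525.0]  # hypothetical index proxy

r_stock, r_index = returns(stock_prices), returns(index_levels)
volatility = stdev(r_stock)   # "volatility" = standard deviation of the returns
b = beta(r_stock, r_index)    # the stock's beta = regression slope
```

With real data one would of course use weekly or monthly closing prices, with the S&P500 serving as the market proxy, in the spirit of the book's treatment of computing a stock's beta via simple linear regression.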

Organization of the Book
Beginning with the first chapter on data analysis, I introduce the basic concepts a student needs in order to understand and create the tables and graphs produced by standard
statistical software packages such as MINITAB, SAS, and JMP. The data sets themselves
have been carefully selected to illustrate the role and scope of statistics in science, engineering, public health, and finance. The text then takes students through the traditional
topics of a first course in statistics. Novel features include: (i) applications of traditional
statistical concepts and methods to the analysis and interpretation of financial data; (ii) an
introduction to modern portfolio theory; (iii) mean-standard deviation (r − σ) diagram of
a collection of portfolios; and (iv) computing a stock’s β via simple linear regression.
For the benefit of instructors using this text, I have included the technical, even tedious, details needed to derive various theorems, including the famous Black-Scholes option pricing formula, because, in my opinion, one cannot explain this formula to students without a thorough understanding of the fundamental concepts, methods, and theorems used to derive it. These computational details, which can safely be omitted on a first reading, are contained in a section titled “Mathematical Details and Derivations,” placed at the end of most chapters.

Examples
The text introduces the student to the most important concepts by using suitably chosen
examples of independent interest. Applications to engineering (queueing theory, reliability
theory, acceptance sampling), computer performance analysis, public health, and finance are
included as soon as the statistical concepts have been developed. Numerous examples, using
both statistical software packages and scientific calculators, help to reinforce the student’s
mastery of the basic concepts.

Problems
The problems (there are 675 of them), which range from the routine to the challenging,
help students master the basic concepts and give them a glimpse of the vast range of
applications to a variety of disciplines.

Supplements
An Instructor’s Solutions Manual containing carefully worked-out solutions to all 675 problems is available to adopters of the textbook. All data sets, including those used in the worked-out examples, are available on a CD-ROM to users of this textbook.

Contacting the Author
In spite of the copy editor’s and my best efforts, it is almost impossible to eliminate every error and typo. I therefore encourage all users of this text to send their comments and criticisms to me at

Acknowledgments
The publication of a statistics textbook, containing many tables and graphs, is not possible without the cooperation of a large number of highly talented individuals, so it is a great

pleasure for me to have this opportunity of thanking them. First, I want to thank my editors
at Chapman-Hall: Sunil Nair for initiating this project, and Theresa Delforn and Michele
Dimont for guiding and prodding me to a successful conclusion. Shashi Kumar’s technical advice with LaTeX is deeply appreciated, and Theresa Gandolph, of the Instructional Technology Lab at George Washington University, gave me valuable assistance with the Freehand
graphics software package. Professors Alan Durfee of Mount Holyoke College and Michael
Sullivan of the University of Massachusetts (Amherst) provided me with some insights and
ideas on financial engineering that were very useful to me in the writing of this book.
I am also grateful to the American Association for the Advancement of Science, the
American Journal of Clinical Nutrition, the American Journal of Epidemiology, the American Statistical Association, the Biometrika Trustees, Cambridge University Press, Elsevier
Science, Iowa State University Press, Richard D. Irwin, McGraw-Hill, Oxford University
Press, Prentice-Hall, Routledge, Chapman & Hall, the Royal Society of Chemistry, Journal
of Chemical Education, and John Wiley & Sons for permission to use copyrighted material.
I have made every effort to secure permission from the original copyright holders for each
data set, and would be grateful to my readers for calling my attention to any omissions so
they can be corrected by the publisher.


Finally, I dedicate this book to my wife, Linda, for her patient support while I was almost
always busy writing it.




Contents

1 Data Analysis . . . 1
1.1 Orientation . . . 1
1.2 The Role and Scope of Statistics in Science and Engineering . . . 2
1.3 Types of Data: Examples from Engineering, Public Health, and Finance . . . 5
1.3.1 Univariate Data . . . 5
1.3.2 Multivariate Data . . . 7
1.3.3 Financial Data: Stock Market Prices and Their Time Series . . . 9
1.3.4 Stock Market Returns: Definition and Examples . . . 13
1.4 The Frequency Distribution of a Variable Defined on a Population . . . 17
1.4.1 Organizing the Data . . . 17
1.4.2 Graphical Displays . . . 18
1.4.3 Histograms . . . 22
1.5 Quantiles of a Distribution . . . 26
1.5.1 The Median . . . 26
1.5.2 Quantiles of the Empirical Distribution Function . . . 27
1.6 Measures of Location (Central Value) and Variability . . . 32
1.6.1 The Sample Mean . . . 32
1.6.2 Sample Standard Deviation: A Measure of Risk . . . 33
1.6.3 Mean-Standard Deviation Diagram of a Portfolio . . . 36
1.6.4 Linear Transformations of Data . . . 37
1.7 Covariance, Correlation, and Regression: Computing a Stock’s Beta . . . 38
1.7.1 Fitting a Straight Line to Bivariate Data . . . 40
1.8 Mathematical Details and Derivations . . . 43
1.9 Chapter Summary . . . 44
1.10 Problems . . . 44
1.11 Large Data Sets . . . 65
1.12 To Probe Further . . . 70

2 Probability Theory . . . 71
2.1 Orientation . . . 71
2.2 Sample Space, Events, Axioms of Probability Theory . . . 72
2.2.1 Probability Measures . . . 78
2.3 Mathematical Models of Random Sampling . . . 84
2.3.1 Multinomial Coefficients . . . 93
2.4 Conditional Probability and Bayes’ Theorem . . . 94
2.4.1 Conditional Probability . . . 94
2.4.2 Bayes’ Theorem . . . 97
2.4.3 Independence . . . 99
2.5 The Binomial Theorem . . . 100
2.6 Chapter Summary . . . 101
2.7 Problems . . . 101
2.8 To Probe Further . . . 111

3 Discrete Random Variables and Their Distribution Functions . . . 113
3.1 Orientation . . . 113
3.2 Discrete Random Variables . . . 114
3.2.1 Functions of a Random Variable . . . 120
3.3 Expected Value and Variance of a Random Variable . . . 121
3.3.1 Moments of a Random Variable . . . 125
3.3.2 Variance of a Random Variable . . . 128
3.3.3 Chebyshev’s Inequality . . . 130
3.4 The Hypergeometric Distribution . . . 130
3.5 The Binomial Distribution . . . 134
3.5.1 A Coin Tossing Model for Stock Market Returns . . . 140
3.6 The Poisson Distribution . . . 144
3.7 Moment Generating Function: Discrete Random Variables . . . 146
3.8 Mathematical Details and Derivations . . . 148
3.9 Chapter Summary . . . 150
3.10 Problems . . . 151
3.11 To Probe Further . . . 160

4 Continuous Random Variables and Their Distribution Functions . . . 161
4.1 Orientation . . . 161
4.2 Random Variables with Continuous Distribution Functions: Definition and Examples . . . 162
4.3 Expected Value, Moments, and Variance of a Continuous Random Variable . . . 167
4.4 Moment Generating Function: Continuous Random Variables . . . 171
4.5 The Normal Distribution: Definition and Basic Properties . . . 172
4.6 The Lognormal Distribution: A Model for the Distribution of Stock Prices . . . 177
4.7 The Normal Approximation to the Binomial Distribution . . . 179
4.7.1 Distribution of the Sample Proportion p̂ . . . 185
4.8 Other Important Continuous Distributions . . . 185
4.8.1 The Gamma and Chi-Square Distributions . . . 185
4.8.2 The Weibull Distribution . . . 188
4.8.3 The Beta Distribution . . . 188
4.9 Functions of a Random Variable . . . 189
4.10 Mathematical Details and Derivations . . . 191
4.11 Chapter Summary . . . 192
4.12 Problems . . . 192
4.13 To Probe Further . . . 202
5 Multivariate Probability Distributions . . . 205
5.1 Orientation . . . 205
5.2 The Joint Distribution Function: Discrete Random Variables . . . 206
5.2.1 Independent Random Variables . . . 211
5.3 The Multinomial Distribution . . . 212
5.4 Mean and Variance of a Sum of Random Variables . . . 213
5.4.1 The Law of Large Numbers for Sums of Independent and Identically Distributed (iid) Random Variables . . . 220
5.4.2 The Central Limit Theorem . . . 222
5.5 Why Stock Prices Have a Lognormal Distribution: An Application of the Central Limit Theorem . . . 224
5.5.1 The Binomial Lattice Model as an Approximation to a Continuous Time Model for Stock Market Prices . . . 227



5.6 Modern Portfolio Theory . . . 230
5.6.1 Mean-Variance Analysis of a Portfolio . . . 230
5.7 Risk Free and Risky Investing . . . 232
5.7.1 Present Value Analysis of Risk Free and Risky Returns . . . 232
5.7.2 Present Value Analysis of Deterministic and Random Cash Flows . . . 235
5.8 Theory of Single and Multi-Period Binomial Options . . . 237
5.8.1 Black-Scholes Option Pricing Formula: Binomial Lattice Model . . . 237
5.9 Black-Scholes Formula for Multi-Period Binomial Options . . . 240
5.9.1 Black-Scholes Pricing Formula for Stock Prices Governed by a Lognormal Distribution . . . 242
5.10 The Poisson Process . . . 243
5.10.1 The Poisson Process and the Gamma Distribution . . . 246
5.11 Applications of Bernoulli Random Variables to Reliability Theory . . . 248
5.12 The Joint Distribution Function: Continuous Random Variables . . . 251
5.12.1 Functions of Random Vectors . . . 254
5.12.2 Conditional Distributions and Conditional Expectations: Continuous Case . . . 256
5.12.3 The Bivariate Normal Distribution . . . 257
5.13 Mathematical Details and Derivations . . . 258
5.14 Chapter Summary . . . 263
5.15 Problems . . . 263
5.16 To Probe Further . . . 275

6 Sampling Distribution Theory . . . 277
6.1 Orientation . . . 277
6.2 Sampling from a Normal Distribution . . . 277
6.3 The Distribution of the Sample Variance . . . 282
6.3.1 Student’s t Distribution . . . 284
6.3.2 The F Distribution . . . 285
6.4 Mathematical Details and Derivations . . . 286
6.5 Chapter Summary . . . 287
6.6 Problems . . . 287
6.7 To Probe Further . . . 290

7 Point and Interval Estimation . . . 291
7.1 Orientation . . . 291
7.2 Estimating Population Parameters: Methods and Examples . . . 292
7.2.1 Some Properties of Estimators: Bias, Variance, and Consistency . . . 294
7.3 Confidence Intervals for the Mean and Variance . . . 296
7.3.1 Confidence Intervals for the Mean of a Normal Distribution: Variance Unknown . . . 299
7.3.2 Confidence Intervals for the Mean of an Arbitrary Distribution . . . 300
7.3.3 Confidence Intervals for the Variance of a Normal Distribution . . . 302
7.3.4 Value at Risk (VaR): An Application of Confidence Intervals to Risk Management . . . 303
7.4 Point and Interval Estimation for the Difference of Two Means . . . 304
7.4.1 Paired Samples . . . 305
7.5 Point and Interval Estimation for a Population Proportion . . . 307
7.5.1 Confidence Intervals for p1 − p2 . . . 309
7.6 Some Methods of Estimation . . . 310
7.6.1 Method of Moments . . . 310


7.6.2 Maximum Likelihood Estimators . . . 312
7.7 Chapter Summary . . . 316
7.8 Problems . . . 316
7.9 To Probe Further . . . 324

8 Hypothesis Testing . . . 325
8.1 Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.2 Tests of Statistical Hypotheses: Basic Concepts and Examples . . . . . . . 326
8.2.1 Significance Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
8.2.2 Power Function and Sample Size . . . . . . . . . . . . . . . . . . . . 338
8.2.3 Large Sample Tests Concerning the Mean of an Arbitrary Distribution 339
8.2.4 Tests Concerning the Mean of a Distribution with Unknown Variance 340
8.3 Comparing Two Populations . . . . . . . . . . . . . . . . . . . . . . . . . . 344
8.3.1 The Wilcoxon Rank Sum Test for Two Independent Samples . . . . 347
8.3.2 A Test of the Equality of Two Variances . . . . . . . . . . . . . . . . 350
8.4 Normal Probability Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
8.5 Tests Concerning the Parameter p of a Binomial Distribution . . . . . . . . 355
8.5.1 Tests of Hypotheses Concerning Two Binomial Distributions: Large Sample Size . . . 359
8.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
8.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
8.8 To Probe Further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
9 Statistical Analysis of Categorical Data . . . 373
9.1 Orientation . . . 373
9.2 Chi-Square Tests . . . 373
9.2.1 Chi-Square Tests When the Cell Probabilities Are Not Completely Specified . . . 376
9.3 Contingency Tables . . . 377
9.4 Chapter Summary . . . 383
9.5 Problems . . . 383
9.6 To Probe Further . . . 388

10 Linear Regression and Correlation . . . 389
10.1 Orientation . . . 389
10.2 Method of Least Squares . . . 390
10.2.1 Fitting a Straight Line via Ordinary Least Squares . . . 392
10.3 The Simple Linear Regression Model . . . 398
10.3.1 The Sampling Distribution of β̂1, β̂0, SSE, and SSR . . . 399
10.3.2 Tests of Hypotheses Concerning the Regression Parameters . . . 402
10.3.3 Confidence Intervals and Prediction Intervals . . . 403
10.3.4 Displaying the Output of a Regression Analysis in an ANOVA Table . . . 406
10.3.5 Curvilinear Regression . . . 408
10.4 Model Checking . . . 411
10.5 Correlation Analysis . . . 416
10.5.1 Computing the Market Risk of a Stock . . . 417
10.5.2 The Shapiro–Wilk Test for Normality . . . 421
10.6 Mathematical Details and Derivations . . . 422
10.7 Chapter Summary . . . 426
10.8 Problems . . . 426
10.9 Large Data Sets . . . 437


10.10 To Probe Further . . . 439

11 Multiple Linear Regression . . . 441
11.1 Orientation . . . 441
11.2 The Matrix Approach to Simple Linear Regression . . . 442
11.2.1 Sampling Distribution of the Least Squares Estimators . . . 447
11.2.2 Geometric Interpretation of the Least Squares Solution . . . 449
11.3 The Matrix Approach to Multiple Linear Regression . . . 450
11.3.1 Normal Equations, Fitted Values, and ANOVA Table for the Multiple Linear Regression Model . . . 454
11.3.2 Testing Hypotheses about the Regression Model . . . 457
11.3.3 Model Checking . . . 460
11.3.4 Confidence Intervals and Prediction Intervals in Multiple Linear Regression . . . 462
11.4 Mathematical Details and Derivations . . . 464
11.5 Chapter Summary . . . 464
11.6 Problems . . . 465
11.7 To Probe Further . . . 468

12 Single Factor Experiments: Analysis of Variance . . . 469
12.1 Orientation . . . 469
12.2 The Single Factor ANOVA Model . . . 469
12.2.1 Estimating the ANOVA Model Parameters . . . 473
12.2.2 Testing Hypotheses about the Parameters . . . 475
12.2.3 Model Checking via Residual Plots . . . 478
12.2.4 Unequal Sample Sizes . . . 480
12.3 Confidence Intervals for the Treatment Means; Contrasts . . . 482
12.3.1 Multiple Comparisons of Treatment Means . . . 485
12.4 Random Effects Model . . . 487
12.5 Mathematical Derivations and Details . . . 489
12.6 Chapter Summary . . . 490
12.7 Problems . . . 490
12.8 To Probe Further . . . 496

13 Design and Analysis of Multi-Factor Experiments . . . 497
13.1 Orientation . . . 497
13.2 Randomized Complete Block Designs . . . 498
13.2.1 Confidence Intervals and Multiple Comparison Procedures . . . 507
13.2.2 Model Checking via Residual Plots . . . 508
13.3 Two Factor Experiments with n > 1 Observations per Cell . . . 508
13.3.1 Confidence Intervals and Multiple Comparisons . . . 520
13.4 2^k Factorial Designs . . . 522
13.5 Chapter Summary . . . 540
13.6 Problems . . . 540
13.7 To Probe Further . . . 548

14 Statistical Quality Control . . . 551
14.1 Orientation . . . 551
14.2 x̄ and R Control Charts . . . 552
14.2.1 Detecting a Shift in the Process Mean . . . 557
14.3 p Charts and c Charts . . . 559
14.4 Chapter Summary . . . 562
14.5 Problems . . . 562
14.6 To Probe Further . . . 565

A Tables . . . 567
A.1 Cumulative Binomial Distribution . . . 568
A.2 Cumulative Poisson Distribution . . . 570
A.3 Standard Normal Probabilities . . . 572
A.4 Critical Values t_ν(α) of the t Distribution . . . 574
A.5 Quantiles Q_ν(p) = χ²_ν(1 − p) of the χ² Distribution . . . 575
A.6 Critical Values of the F_{ν1,ν2}(α) Distribution . . . 576
A.7 Critical Values of the Studentized Range q(α; n, ν) . . . 580
A.8 Factors for Estimating σ, s, or σ_RMS and σ_R from R̄ . . . 584
A.9 Factors for Determining from R̄ the Three-Sigma Control Limits for X̄ and R Charts . . . 585
A.10 Factors for Determining from σ the Three-Sigma Control Limits for X̄, R, and s or σ_RMS Charts . . . 586

Answers to Selected Odd-Numbered Problems . . . 589
1 Data Analysis . . . 591
2 Probability Theory . . . 597
3 Discrete Random Variables and Their Distribution Functions . . . 601
4 Continuous Random Variables and Their Distribution Functions . . . 609
5 Multivariate Probability Distributions . . . 617
6 Sampling Distribution Theory . . . 623
7 Point and Interval Estimation . . . 625
8 Hypothesis Testing . . . 631
9 Statistical Analysis of Categorical Data . . . 637
10 Linear Regression and Correlation . . . 641
11 Multiple Linear Regression . . . 645
12 Single Factor Experiments: Analysis of Variance . . . 647
13 Design and Analysis of Multi-Factor Experiments . . . 653
14 Statistical Quality Control . . . 659

Index . . . 661


Chapter 1
Data Analysis

Information, that is imperfectly acquired, is generally as imperfectly retained; and a man who has carefully investigated a printed table, finds, when done, that he has only a very faint and partial idea of what he has read; and that like a figure imprinted on sand, is soon totally erased and defaced.
William Playfair (1786), Commercial and Political Atlas

1.1 Orientation

Statistics is the science and art of collecting, displaying, and interpreting data in order to
test theories and make inferences concerning all kinds of phenomena. In brief, its main goal
is to transform data into knowledge. Scientists, engineers, and economists use statistics
to summarize and interpret the data before they can make inferences from it. Statistical
software packages such as MINITAB¹ and SAS² are helpful because they produce graphical
displays that are very effective for visualizing and interpreting the data. It is therefore
of no small importance that we be able to understand and interpret the current methods
of graphically displaying statistical data. However, it is worth reminding ourselves now
and then that the goal of a statistical analysis is primarily insight and understanding, not
mindless, formal calculations using statistical software packages.
In this chapter we develop the concepts and techniques of data analysis in the context
of suitable examples taken from science, engineering, and finance. In detail, the chapter is
organized as follows:
Organization of Chapter
1. Section 1.2: The Role and Scope of Statistics in Science and Engineering
2. Section 1.3: Types of Data: Examples from Engineering, Public Health, and Finance
3. Section 1.3.1: Univariate Data
4. Section 1.3.2: Multivariate Data
5. Section 1.3.3: Financial Data, Stock Market Prices, and Their Time Series
6. Section 1.3.4: Stock Market Returns: Definition, Examples
7. Section 1.4: The Frequency Distribution of a Variable Defined on a Population

8. Section 1.5: Quantiles of a Distribution: Medians, Quartiles, Box Plots.
1 MINITAB is a registered trademark of Minitab, Inc.
2 SAS is a trademark of the SAS Institute, Inc., Cary, North Carolina, USA.


9. Section 1.6: Measures of Location and Variability: Sample Mean and Sample Variance
10. Section 1.6.2: Standard Deviation: A Measure of Risk
11. Section 1.6.3: Mean-Standard Deviation Diagram of a Portfolio
12. Section 1.7: Covariance, Correlation, and Regression: Computing a Stock’s Beta
13. Section 1.8: Mathematical Details and Derivations
14. Section 1.9: Chapter Summary

1.2  The Role and Scope of Statistics in Science and Engineering

A scientific theory, according to the philosopher of science Karl Popper, is characterized by "its falsifiability, or refutability, or testability" (Karl R. Popper (1965), Conjectures and Refutations: The Growth of Scientific Knowledge, Harper Torchbooks). In practical terms this means that a scientist tests a theory by performing an experiment in which he observes and records the values of one or more variables of interest. The variable of interest for a chemist might be the atomic weight of an element; for an engineer it might be the lifetime of a battery for a laptop computer; for an investor it might be the weekly or annual rate of return of an investment; for the Bureau of the Census it is the population of each state, so that political power can be reapportioned in accordance with the U.S. Constitution. The next three examples illustrate how we use statistical models to validate scientific theories, interpret data, and grapple with the variability inherent in all experiments.
Example 1.1
Einstein’s special relativity theory, published in 1905, asserts that the speed of light is
the same constant value, independent of position, inertial frame of reference, direction, or
time. The physicists A. A. Michelson and E. W. Morley (1881 and 1887) and D. Miller
(1924) performed a series of experiments to determine the velocity of light. They obtained
contradictory results. Michelson and Morley could not detect significant differences in the
velocity of light, thus validating relativity theory, while Miller’s experiments led him to the
opposite conclusion. The most plausible explanation for the different results appears to be
that the experimental apparatus had to detect differences in the velocity of light as small as
one part in 100,000,000. Unfortunately, temperature changes in the laboratory as small as
1/100 of a degree could produce an effect three times as large as the effect Miller was trying
to detect. In addition, Miller’s interferometer was so sensitive that “A tiny movement of
the hand, or a slight cough, made the interference fringes so unstable that no readings were
possible”(R. Clark (1971), Einstein: The Life and Times, World Publishing, p. 329). In
spite of their contradictory results, both scientists were rewarded for their work. Michelson
won the Nobel Prize for Physics in 1907 and Miller received the American Association for
the Advancement of Science Award in 1925. Reviewing this curious episode in the history of
science, Collins and Pinch (1993) (The Golem: What Everyone Should Know about Science,
Cambridge University Press, p. 40) write, “Thus, although the famous Michelson–Morley
experiment of 1887 is regularly taken as the first, if inadvertent, proof of relativity, in 1925,
a more refined and complete version of the experiment was widely hailed as, effectively,
disproving relativity.”




Although the debate between Michelson and Miller continued—they confronted each other at a scientific meeting in 1928 and agreed to differ—the physics community now accepts special relativity theory as correct. This demonstrates that the validity of a scientific experiment strongly depends upon the theoretical framework within which the data are collected, analyzed, and interpreted. It is, perhaps, the British astronomer Sir Arthur Eddington (1882-1944) who put it best when he said: ". . . no experiment should be believed until confirmed by theory." From the statistician's perspective this also shows how experimental error can invalidate experimental results obtained by highly skilled scientists using the best available state-of-the-art equipment; that is, if the design of the experiment is faulty, then no reliable conclusions can be drawn from the data.
Example 1.2
Our second example is the famous breeding experiment of the geneticist Mendel, who classified peas according to their shape (round (r) or wrinkled (w)) and color (yellow (y) or green (g)). Each seed was classified into one of four categories: (ry) = (round, yellow), (rg) = (round, green), (wy) = (wrinkled, yellow), and (wg) = (wrinkled, green). According to Mendelian genetics, the frequency counts of seeds of each type produced from this experiment occur in the following ratios:

Mendel's predicted ratios: ry : rg : wy : wg = 9 : 3 : 3 : 1.

Thus, the ratio of the number of (ry) peas to (rg) peas should be equal to 9 : 3, and similarly for the other ratios. An unusual feature of Mendel's model should be noted: it predicts a set of frequency counts (called a frequency distribution) instead of a single number. The actual and predicted counts obtained by Mendel for n = 556 peas appear in Table 1.1. The non-integer values appearing in the third column come from dividing the 556 peas into four categories in the ratios 9 : 3 : 3 : 1. This means that the first category has 556 × (9/16) = 312.75 peas, the next two categories each have 556 × (3/16) = 104.25, and the last category has 556 × (1/16) = 34.75 of them.
Table 1.1 Mendel's data

Seed   Observed    Predicted
type   frequency   frequency
ry     315         312.75
wy     101         104.25
rg     108         104.25
wg     32          34.75

Looking at the results in Table 1.1 we see that the observed counts are close to, but not exactly equal to, those predicted by Mendel. This leads us to one of the most fundamental scientific questions: How much agreement is there between the collected data and the scientific model that explains it? Are the discrepancies between the predicted counts and the observed counts small enough that they can be attributed to chance, or are they so large as to cast doubt on the theory itself? One problem with Mendel's data, first pointed out by Sir R. A. Fisher, Annals of Science (1936), pp. 115-137, is that every one of his data sets fit his predictions extremely well—too well in fact. Statisticians measure how well experimental data fit a theoretical model by using the χ2 (chi-square) statistic, which we will study in Chapter 9. A large value of χ2 provides strong evidence against the model. Fisher noted that the χ2 values for Mendel's numerous experiments had more small values than would be expected from random sampling; that is, Mendel's data were too good to be true.3 Fisher's method of analyzing Mendel's data is a good example of inferential statistics, which is the science of making inferences from data based upon the theory of probability. This theory will be presented in Chapters 2 through 6.
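As a preview of the Chapter 9 material, the predicted counts and the chi-square statistic for Table 1.1 can be computed directly. The Python sketch below is an illustration only, not the book's formal treatment:

```python
# Mendel's pea data from Table 1.1 and the 9:3:3:1 prediction.
observed = {"ry": 315, "wy": 101, "rg": 108, "wg": 32}
ratios = {"ry": 9, "wy": 3, "rg": 3, "wg": 1}   # 9 : 3 : 3 : 1
n = sum(observed.values())                       # 556 peas in all

# Predicted count for each category: n * (ratio / 16).
predicted = {k: n * r / 16 for k, r in ratios.items()}

# Chi-square statistic: sum of (observed - predicted)^2 / predicted.
chi_square = sum((observed[k] - predicted[k]) ** 2 / predicted[k]
                 for k in observed)

print(predicted["ry"])          # 312.75, as in Table 1.1
print(round(chi_square, 3))     # about 0.47: a very close fit
```

The small χ2 value (about 0.47 on 3 degrees of freedom) is exactly the sort of "too good" agreement Fisher found suspicious across Mendel's experiments.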
The scope of modern statistics, however, extends far beyond the task of validating scientific theories, as important as this task might be. It is also a search for mathematical
models that explain the given data and that are useful for predicting new data. In brief, it
is a highly useful tool for inductive learning from experimental data. The next example,
which is quite different from the preceding ones on relativity and genetics, illustrates how
misinterpreting statistical data can seriously harm a corporation, its reputation, and its
employees.
Example 1.3
The lackluster performance of the American economy in the last quarter of the twentieth
century has profoundly affected its corporations, its workers, and their families. Massive
layoffs, the household remedy of America's managers for falling sales and mounting losses, do not appear to have solved the basic problem of the modern corporation, which is to produce high-quality products at competitive prices. Even the ultimate symbol of success, the CEO of a major American corporation, was no longer welcome at many college commencements–"too unpopular among students who cannot find entry-level jobs."4 W. E. Deming (1900–
1993), originally trained as a physicist but employed as a statistician, became a highly
respected, if not always welcomed, consultant to America’s largest corporations. “The
basic cause of sickness in American industry and resulting unemployment,” he argued, “is
failure of top management to manage.” This was, and still is, a sharp departure from the
prevalent custom of blaming workers first when poor product quality leads to decline in
sales, profits, and dividends. To illustrate this point he devised a simple experiment that is
really a parable because it is in essence a simple story with a simple moral lesson.

Deming’s Parable of the Red Bead Experiment
(This example is adapted from W. Edwards Deming (1982), Out of the Crisis, MIT press,
pp. 109-112.) Consider a bowl filled with 800 red beads and 3200 white beads. The beads
in the bowl represent a shipment of parts from a supplier; the red beads represent the
defective parts. Each worker stirs the beads and then, while blindfolded, inserts a special
tool into the mixture with which he draws out exactly 50 beads. The aim of this process
is to produce white beads; red beads, as previously noted, are defective and will not be
accepted by the customers. The results for six workers are shown in Table 1.2. It is clear
that the workers' skills in stirring the beads are irrelevant to the final results, the observed variation between them being due solely to chance; or, as Deming writes:
It would be difficult to construct physical circumstances so nearly equal for six people,
yet to the eye, the people vary greatly in performance.
For example, looking at the data in Table 1.2 we are tempted to conclude that Jack, who
produced only 4 red beads, is an outstanding employee and Terry, who produced nearly 4
times as many red beads as did Jack, is incompetent and should be immediately dismissed.
To explain the data, Deming goes on to calculate what he calls the limits of variation
attributable to the system. Omitting the technical details—we will explain his mathematical
model, methods and results in Chapter 4 (Example 4.20)—Deming concludes that 99% of
3 In spite of his fudged data, Mendelian genetics survived and is today regarded as one of the outstanding
scientific discoveries of all time.
4 NY Times, May 29, 1995.


Data Analysis

5

the observed variation in the workers’ performance is due to chance. He then continues:
“The six employees obviously all fall within the calculated limits of variation that could
arise from the system that they work in. There is no evidence in these data that Jack will

in the future be a better performer than Terry. Everyone should accordingly receive the
same raise. ...It would obviously be a waste of time to try and find out why Terry made 15
red beads, or why Jack made only 4.”
Table 1.2 Data from Deming's red bead experiment

Name     Number of red beads produced
Mike     9
Peter    5
Terry    15
Jack     4
Louise   10
Gary     8
Total    51
In other words, the quality of the final product will not be improved by mass firings of
poorly performing workers. The solution is to improve the system. Management should
remove the workers’ blindfolds or switch to a new supplier who will ship fewer red beads.
This example illustrates one of the most important tasks of the engineer, which is to identify,
control, and reduce the sources of variation in the manufacture of a commercial product.
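Deming's bowl is easy to simulate. The sketch below is an illustration of the chance mechanism only (Deming's own calculation of the limits of variation appears in Example 4.20): each of six hypothetical workers draws exactly 50 beads without replacement from a bowl of 800 red and 3200 white beads, and the spread in their red-bead counts arises from chance alone.

```python
import random

def draw_red_count(rng):
    """Draw 50 beads from a bowl of 800 red and 3200 white; count the reds."""
    bowl = [1] * 800 + [0] * 3200   # 1 = red (defective), 0 = white
    return sum(rng.sample(bowl, 50))  # sampling without replacement

rng = random.Random(1)  # fixed seed so the run is reproducible
counts = [draw_red_count(rng) for _ in range(6)]  # six "workers"

# Every worker follows an identical procedure, yet the counts differ;
# the variation is attributable entirely to the system, not the workers.
print(counts)
```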

1.3  Types of Data: Examples from Engineering, Public Health, and Finance

The correct statistical analysis of a data set depends on its type, which may consist of a
single list of numbers (univariate data), data collected in chronological order (time series),
multivariate data consisting of two or more columns of numbers, columns of categorical data
(non-numerical data), such as ethnicity, gender, and socioeconomic status, as well as many
other data structures of which there are too many to be listed here.

1.3.1  Univariate Data

The key to data analysis is to construct a mathematical model so that information can
be efficiently organized and interpreted. Formally, we begin to construct the mathematical
model of a univariate data set by stating the following definition.
Definition 1.1
1. A population is a set, denoted S, of well-defined distinct objects, the elements of which are denoted by s.
2. A sample is a subset of S, denoted by A ⊂ S. We represent the sample by listing its elements as a sequence A = {s1, . . . , sn}.
To illustrate these concepts we now consider an example of a sample drawn from a
population.


6

Introduction to Probability and Statistics for Science, Engineering, and Finance

Example 1.4

The federal government requires manufacturers to monitor the amount of radiation emitted
through the closed doors of a microwave oven. One manufacturer measured the radiation
emitted by 42 microwave ovens and recorded these values in Table 1.3. The data in Table 1.3
is a data set. Statisticians call this raw data—a list of measurements whose values have not
been manipulated in any way. The set of microwave ovens produced by this manufacturer
is an example of a population; the subset of 42 microwave ovens whose emitted radiations
were measured is a sample. We denote its elements by {s1, . . . , s42}.

Defining a Variable on a Population. We denote the radiation emitted by the microwave oven s by X(s), where X is a function with domain S. It is worth noting that the observed variation in the values of the emitted radiation comes from the sampling procedure, while the observed value of the radiation is a function of the particular microwave oven that is tested. From this point of view Table 1.3 lists the values of the function X evaluated on the sample {s1, . . . , s42}. In particular, X(s1) = 0.15, X(s2) = 0.09, . . . , X(s42) = 0.05. Less formally, and more simply, one describes X as a variable defined on a population. Engineers are usually interested in several variables defined on the same population. For example, the manufacturer also measured the amount of radiation emitted through the open doors of the 42 microwave ovens. This data set is listed in Problem 1.1. In this text we will denote the ith item in the data set by the symbol xi, where X(si) = xi, and the data set of which it is a member by X = {x1, . . . , xn}. Note that we use the same symbol X to denote both the data set and the function X(s) that produces it. The sequential listing corresponds to reading from left to right across the successive rows of Table 1.3. In detail:

x1 = 0.15, x2 = 0.09, . . . , x8 = 0.05, . . . , x42 = 0.05.
We summarize these remarks in the following formal definition.
Definition 1.2
1. The elements of a data set are the values of a real-valued function X(si) = xi defined on a sample A = {s1, . . . , sn}.
2. The number of elements in the sample is called the sample size.
Table 1.3 Raw data for the radiation emitted by 42 microwave ovens

0.15  0.09  0.18  0.10  0.05  0.12  0.08
0.05  0.08  0.10  0.07  0.02  0.01  0.10
0.10  0.10  0.02  0.10  0.01  0.40  0.10
0.05  0.03  0.05  0.15  0.10  0.15  0.09
0.08  0.18  0.10  0.20  0.11  0.30  0.02
0.20  0.20  0.30  0.30  0.40  0.30  0.05
(Source: R. A. Johnson and D. W. Wichern (1992), Applied Multivariate Statistical Analysis, 3rd ed., Prentice Hall, p. 156. Used with permission.)
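The correspondence between the sample {s1, . . . , s42} and the data set X = {x1, . . . , x42} can be made concrete in a few lines of Python; this is a sketch, with the values transcribed from Table 1.3 in row order, so that x[i-1] plays the role of xi = X(si).

```python
# The 42 closed-door radiation measurements of Table 1.3, row by row.
x = [0.15, 0.09, 0.18, 0.10, 0.05, 0.12, 0.08,
     0.05, 0.08, 0.10, 0.07, 0.02, 0.01, 0.10,
     0.10, 0.10, 0.02, 0.10, 0.01, 0.40, 0.10,
     0.05, 0.03, 0.05, 0.15, 0.10, 0.15, 0.09,
     0.08, 0.18, 0.10, 0.20, 0.11, 0.30, 0.02,
     0.20, 0.20, 0.30, 0.30, 0.40, 0.30, 0.05]

n = len(x)  # the sample size of Definition 1.2

# x1 = 0.15, x2 = 0.09, x8 = 0.05, x42 = 0.05, as noted in the text.
print(n, x[0], x[1], x[7], x[41])
```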
In many situations of practical interest we would like to compare two populations with respect to some numerical characteristic. For example, we might be interested in determining
which of two gasoline blends yields more miles per gallon (mpg). Problems of this sort lead
to data sets with a more complex structure than those considered in the previous section.
We now look at an example where the data have been obtained from two populations.
Example 1.5 Data analysis and Lord Rayleigh’s discovery of argon


Data Analysis

7

Carefully comparing two samples can sometimes lead to a Nobel Prize. The discovery of the new element argon by Lord Rayleigh (1842-1919), Nobel Laureate (1904), provides an interesting example. He prepared volumes of pure nitrogen by two different chemical methods: (1) from an air sample, from which he removed all oxygen; (2) by chemical decomposition of nitrogen from nitrous oxide, nitric oxide, or ammonium nitrite. Table 1.4 lists the masses of nitrogen gas obtained by the two methods. Example 1.23 examines Rayleigh's data using the techniques of exploratory data analysis, which were not yet known in his time.
Table 1.4 Mass of nitrogen gas (g) isolated by Lord Rayleigh

From air (g)   From chemical decomposition (g)
2.31017        2.30143
2.30986        2.29890
2.31010        2.29816
2.31001        2.30182
2.31024        2.29869
2.31010        2.29940
2.31028        2.29849
               2.29889

(Source: R.D. Larsen, Lessons Learned from Lord Rayleigh on the Importance of Data
Analysis, J. Chem. Educ., vol. 67, no. 11, pp. 926-928, 1990. Used with permission.)
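Even before the exploratory techniques of Example 1.23, a simple numerical comparison of the two samples in Table 1.4 reveals the discrepancy Rayleigh noticed. The sketch below (not part of the original text) computes the two sample means; the "from air" nitrogen is systematically heavier, and the excess mass turned out to be argon.

```python
# Rayleigh's nitrogen masses (g) from Table 1.4.
from_air = [2.31017, 2.30986, 2.31010, 2.31001, 2.31024, 2.31010, 2.31028]
from_chem = [2.30143, 2.29890, 2.29816, 2.30182, 2.29869, 2.29940,
             2.29849, 2.29889]

mean_air = sum(from_air) / len(from_air)
mean_chem = sum(from_chem) / len(from_chem)

# The two means differ in the third decimal place -- far larger than the
# spread within either sample.
print(round(mean_air, 5), round(mean_chem, 5))   # 2.31011 2.29947
print(round(mean_air - mean_chem, 5))
```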

Example 1.6 Comparing lifetimes of two light bulb filaments
A light bulb manufacturer tested 10 light bulbs that contained filaments of type A and
10 light bulbs that contained filaments of type B. The values for the observed lifetimes
of the 20 light bulbs are recorded in Table 1.5. The purpose of the experiment was to
determine which filament had longer lifetimes. In this example we have two samples taken
from two populations, each one consisting of n = 10 light bulbs. The lifetimes of light bulbs
containing type A and type B filaments are denoted by the variables X, Y, where X =
lifetime of type A light bulb and Y = lifetime of type B light bulb.
Table 1.5 Lifetimes (in hours) of light bulbs with type A and type B filaments
Type A filaments: 1293, 1380, 1614, 1497, 1340, 1643, 1466, 1627, 1383, 1711
Type B filaments: 1061, 1065, 1092, 1017, 1021, 1138, 1143, 1094, 1270, 1028
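A first, informal comparison of the two samples in Table 1.5 is to compute the sample means of X and Y. The sketch below is illustrative only (the formal two-sample methods come later in the book):

```python
# Lifetimes (hours) of the two filament types from Table 1.5.
type_a = [1293, 1380, 1614, 1497, 1340, 1643, 1466, 1627, 1383, 1711]
type_b = [1061, 1065, 1092, 1017, 1021, 1138, 1143, 1094, 1270, 1028]

mean_a = sum(type_a) / len(type_a)   # sample mean of X (type A lifetimes)
mean_b = sum(type_b) / len(type_b)   # sample mean of Y (type B lifetimes)

# Type A bulbs lasted about 400 hours longer on average in this sample.
print(mean_a, mean_b)   # 1495.4 1092.9
```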

1.3.2  Multivariate Data

In the previous section we defined the concept of a single variable X defined on a population S; multivariate data sets arise when one studies the values of one variable on two or
more different populations, or two or more variables defined on the same population. Examples 1.7 and 1.8 below are typical of the kinds of multivariate data sets one encounters
in public health studies.
Example 1.7 Smoking during pregnancy and its effect on breast-milk volume


8

Introduction to Probability and Statistics for Science, Engineering, and Finance

The influence of cigarette smoking on daily breast-milk volume (g/day) and infant weight gain (grams) over 14 days was measured for 10 smoking and 10 non-smoking mothers. Smoking and non-smoking mothers are examples of two populations, and breast-milk volume is the variable defined on them. The first column (Group) represents the values of the Group variable, which identifies the mother as smoking (S) or non-smoking (NS). It is an example of a categorical variable; that is, its values are non-numerical. The two variables of interest for the infant population are birth weight and weight gain, respectively.
Table 1.6 Biomedical characteristics of smoking (S)/non-smoking (NS) mothers and their infants

Group  Mother's     Mother's     Milk           Infant birth  Weight
       height (cm)  weight (kg)  volume (g/d)   weight (kg)   gain (kg)
S      154          55.3         621            3.03          0.27
S      155          53           793            3.18          0.15
S      159          60.5         593            3.4           0.35
S      157          65           545            3.25          0.15
S      160          53           753            2.9           0.44
S      155          56.5         655            3.44          0.4
S      153          52           895            3.22          0.56
S      152          53.1         767            3.52          0.62
S      157          56.5         714            3.49          0.33
S      160          70           598            2.95          0.11
NS     150          51           947            3.9           0.5
NS     146          54.2         945            3.5           0.35
NS     153          57.1         1086           3.5           0.51
NS     157          64           1202           3             0.56
NS     150          52           973            3.2           0.4
NS     155          55           981            2.9           0.78
NS     149          52.5         930            3.3           0.57
NS     149          49.5         745            3.3           0.54
NS     153          53           903            3.3           0.7
NS     148          53           899            3.06          0.58

(Source: F. Vio, G. Salazar, and C. Infante, Smoking During Pregnancy and Lactation and
its Effects on Breast Milk Volume, Amer. J. of Clinical Nutrition, 1991, pp. 1011-1016.
Used with permission.)
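As a preview of the comparisons developed later in the chapter, the milk-volume variable of Table 1.6 can be compared across the two populations by computing the sample mean for each group. This Python sketch is illustrative only and is not part of the original study's analysis:

```python
# Daily breast-milk volumes (g/day) from Table 1.6, split by Group.
smoking = [621, 793, 593, 545, 753, 655, 895, 767, 714, 598]
non_smoking = [947, 945, 1086, 1202, 973, 981, 930, 745, 903, 899]

mean_s = sum(smoking) / len(smoking)
mean_ns = sum(non_smoking) / len(non_smoking)

# The non-smoking group's mean volume is markedly higher in this sample;
# whether the difference is real or due to chance is a question for the
# inferential methods of later chapters.
print(mean_s, mean_ns)   # 693.4 961.1
```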
Example 1.8 DNA damage from an occupational hazard
The data in Table 1.7 come from a study of possible DNA damage caused by occupational exposure to the chemical 1,3-butadiene. The researchers compared the N-1-THB-ADE levels (a measure of DNA damage) among the exposed and control workers, labeled E and C respectively. Note that the data for this example are formatted in a single table consisting of two columns, labeled Group and N-1-THB-ADE. Each column represents the values of a variable defined on the worker population. The first column (Group) is another example of a categorical variable; that is, its values (E = Exposed, C = Control) are non-numerical. It is clearly a question of no small importance to determine whether or not cigarette smoking influences breast-milk volume, or exposure to the chemical 1,3-butadiene causes DNA damage. Statisticians have developed a variety of graphical and analytical methods to answer these and similar questions, as will be explained later in this chapter, beginning with Section 1.4.


Data Analysis

9

Table 1.7 DNA damage of exposed (E) and non-exposed (C) workers

Group  N-1-THB-ADE
E      0.3
E      0.5
E      1
E      0.8
E      1
E      12.5
E      0.3
E      4.3
E      1.5
E      0.1
E      0.3
E      18
E      25
E      0.3
E      1.3
C      0.1
C      0.1
C      2.3
C      3.5
C      0.1
C      0.1
C      1.8
C      0.5
C      0.1
C      0.2
C      0.1
(Source: C. Zhao, et al., Human DNA adducts of 1,3 butadiene, an important environmental carcinogen, Carcinogenesis, vol. 21, no. 1, pp. 107-111, 2000. Used with
permission.)
Example 1.9 Parental socioeconomic data and their children’s development
The original StatLab population consists of 1296 member families of the Kaiser Foundation Health Plan living in the San Francisco Bay area during the years 1961–1972. These
families participated in a Child Health and Development Studies project conducted under
the supervision of the School of Public Health, University of California, Berkeley. The purpose of the study was “to investigate how certain biological and socioeconomic attributes
of parents affect the development of their offspring.” The 36 families listed in Table 1.16 in
Section 1.11 at the end of this chapter, come from a much larger data set of 648 families to
whom a baby girl had been born between 1 April 1961 and 15 April 1963. Ten years later,
between 5 April 1971 and 4 April 1972, measurements were taken on 32 variables for each
family, only 8 of which are listed here. Key to the variables: Reading from left to right
the eight variables are: girl’s height (in), girl’s weight (lbs), girl’s score on the Peabody
Picture Vocabulary Test (PEA), girl’s score on the Raven Progressive Matrices test (RAV),
mother’s height (in), mother’s weight (lbs), father’s height (in), father’s weight (lbs).

1.3.3  Financial Data: Stock Market Prices and Their Time Series

Variables recorded in chronological order are called time series. A time series plot is a
plot of the variable against time. Financial and economic data, for example, are obtained


10

Introduction to Probability and Statistics for Science, Engineering, and Finance

by recording—at equally spaced intervals of time, such as weeks, months, and years—the

values of a variable, such as the stock price of the General Electric (GE) company, the S&P
500 index, the gross domestic product, the unemployment rate, etc. The S&P500 is a value-weighted average of 500 large stocks; more precisely, the S&P500 consists of 500 stocks weighted according to their market capitalization (= number of shares × share price). For additional details concerning the makeup and computation of the S&P500 index, consult the references in Section 1.12 at the end of this chapter. It is an example of a portfolio, which is a collection of securities, such as stocks, bonds, and cash.
Example 1.10 Weekly closing prices of the S&P500 index and other securities
for the year 1999
Table 1.17 in Section 1.11 gives the closing weekly prices for the S&P 500 index, the General
Electric (GE) company, PIMCO total return bond mutual fund (PTTRX), and the Prudent
Bear fund (BEARX) for the year 1999. These data were downloaded from the YAHOO
website http://finance.yahoo.com/ . GE, PTTRX, and BEARX are the stock symbols that
are used in order to view the price history of a security. Figure 1.1 is the time series plot of
the closing averages of the S&P500 index and closing prices of GE stock for the year 1999
and Figure 1.2 displays the time series plot of the closing prices of the PTTRX and BEARX
mutual funds for the year 1999. Formally, the price of the security is X(ti), where ti denotes the ith week and X denotes its price. Its time series is the graph (ti, X(ti)), i = 1, 2, . . . , n. In practice we ignore the underlying population on which the variable is defined and denote the sequence of closing prices by Xi, i = 1, 2, . . . , n. The year 1999 is an example of what financial analysts call a bull market; it is characterized by increasing investor optimism concerning future economic performance and its favorable impact on the stock market. The opposite of a bull market is a bear market, which is characterized by increasing investor pessimism concerning future economic performance and its potentially unfavorable impact on the stock market. Clearly, investors in an S&P500 index fund were more pleased than those who invested in, say, the BEARX mutual fund. The year 2002 is a sobering example of a bear market (see Example 1.11). Financial data of the sort listed in Table 1.17 are used by investors to evaluate the returns (see Section 1.3.4 for the definition of return) and risks of an investment portfolio.
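The return concept previewed here, and defined formally in Section 1.3.4 (Equation 1.1), is the relative price change over a holding period. The sketch below illustrates the idea with made-up prices, not values taken from Table 1.17:

```python
def simple_return(p0, p1):
    """Simple (arithmetic) return of a security whose price moves from p0 to p1."""
    return (p1 - p0) / p0

# Hypothetical prices for illustration only.
print(simple_return(100.0, 105.0))   # 0.05, i.e., a 5% gain
print(simple_return(100.0, 75.0))    # -0.25, i.e., a 25% loss
```

A return of −0.25 corresponds to the confidence-shaking −25% the S&P index posted in 2002 (Example 1.11).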
The Prudent Bear Fund’s investment policy is most unconventional: it sells selected stocks

when most investors are buying them and buys them when most investors are avoiding them.
It “seeks capital appreciation,” according to its prospectus, “primarily through short sales
of equity securities when overall market valuations are high and through long positions in
value-oriented equity securities when overall market valuations are low." It is an interesting example of a security that is negatively correlated with the stock market, as will be explained in Section 1.7.


[Figure: two time series panels, S&P 500 index level and GE share price plotted against week number]
FIGURE 1.1: Weekly closing prices of the S&P 500 and GE for the year 1999

[Figure: two time series panels, PTTRX and BEARX closing prices plotted against week number]
FIGURE 1.2: Weekly closing prices of the PTTRX and BEARX mutual funds for the year 1999



[Figure: two time series panels, S&P 500 index level and BEARX closing price plotted against week number]
FIGURE 1.3: Weekly closing prices of the S&P500 and BEARX mutual funds for the year 2002

Example 1.11 The year 2002: An example of a bear market
The time series plots of the closing prices for the S&P500 and BEARX mutual fund for the year 2002 are displayed in Figure 1.3, and Table 1.18 contains the data from which the time series were computed. For most, but not all, investors, the results were a grim contrast to the bull market of 1999, as Figure 1.3 clearly demonstrates. Indeed, the return (see Equation 1.1 in Section 1.3.4) on the S&P index was a confidence-shaking −25%.5 Investors in the BEARX mutual fund, which suffered a −22.5% loss in 1999, received more cheerful news; their fund earned a 65% return.
investing in financial markets, but also suggest ways of reducing them. For example, instead
of putting all your money into an S&P500 index fund, you could put 10% of your original
capital into a stock or mutual fund that is negatively correlated with the S&P500. This

certainly would have reduced your profits in 1999, but might have reduced your losses in
2002. We discuss this portfolio in some detail in Example 1.12 below. A more penetrating
analysis of financial data requires the concept of return, which is discussed next (Section
1.3.4).
Example 1.12 The effect of diversification on portfolio performance
Looking at Figures 1.1 and 1.2 we see that the S&P500 performed well and BEARX poorly
in 1999, while in 2002 the reverse was true (see Figure 1.3). This suggests that if we had
invested a small portion of our capital in BEARX we would have reduced our gains in 1999
but also have reduced our losses in 2002. To fix our ideas, consider portfolio A consisting
of a $9,000 investment in the S&P500 and a $1,000 investment in the BEARX mutual fund
for the years 1999 and 2002; that is, 90% of our capital is allocated to the S&P500 and 10%
to the BEARX mutual fund. With the benefit of hindsight, which investment performed
5 Note: The actual return—including dividends and reinvestments, but excluding commissions—was
−22.1%.

