Applied Multivariate
Statistical Analysis
FIFTH EDITION

RICHARD A. JOHNSON
University of Wisconsin-Madison

DEAN W. WICHERN
Texas A&M University

PRENTICE HALL, Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data
Johnson, Richard Arnold.
Applied multivariate statistical analysis/Richard A. Johnson.--5th ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-092553-5
1. Multivariate analysis. I. Wichern, Dean W. II. Title.
QA278.J63 2002
519.5'35--dc21
2001036199
Acquisitions Editor: Quincy McDonald
Editor-in-Chief: Sally Yagan
Vice President/Director of Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Assistant Managing Editor: Bayani DeLeon
Production Editor: Steven S. Pawlowski
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Editorial Assistant/Supplements Editor: Joanne Wendelken
Managing Editor, Audio/Video Assets: Grace Hazeldine
Art Director: Jayne Conte
Cover Designer: Bruce Kenselaar
Illustrator: Marita Froimson
© 2002, 1998, 1992, 1988, 1982 by Prentice-Hall, Inc.
Upper Saddle River, NJ 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means,
without permission in writing from the publisher.
Printed in the United States of America
10 9 8 7 6 5 4 3 2
ISBN 0-13-092553-5
Pearson Education LTD., London
Pearson Education Australia PTY, Limited, Sydney
Pearson Education Singapore, Pte. Ltd
Pearson Education North Asia Ltd, Hong Kong
Pearson Education Canada, Ltd., Toronto
Pearson Education de Mexico, S.A. de C.V.
Pearson Education-Japan, Tokyo
Pearson Education Malaysia, Pte. Ltd
To the memory of my mother and my father.
R. A. J.
To Dorothy, Michael, and Andrew.
D. W. W.
Contents

PREFACE xv

1  ASPECTS OF MULTIVARIATE ANALYSIS 1
   1.1  Introduction 1
   1.2  Applications of Multivariate Techniques 3
   1.3  The Organization of Data 5
        Arrays, 5
        Descriptive Statistics, 6
        Graphical Techniques, 11
   1.4  Data Displays and Pictorial Representations 19
        Linking Multiple Two-Dimensional Scatter Plots, 20
        Graphs of Growth Curves, 24
        Stars, 25
        Chernoff Faces, 28
   1.5  Distance 30
   1.6  Final Comments 38
   Exercises 38
   References 48

2  MATRIX ALGEBRA AND RANDOM VECTORS 50
   2.1  Introduction 50
   2.2  Some Basics of Matrix and Vector Algebra 50
        Vectors, 50
        Matrices, 55
   2.3  Positive Definite Matrices 61
   2.4  A Square-Root Matrix 66
   2.5  Random Vectors and Matrices 67
   2.6  Mean Vectors and Covariance Matrices 68
        Partitioning the Covariance Matrix, 74
        The Mean Vector and Covariance Matrix for Linear Combinations of Random Variables, 76
        Partitioning the Sample Mean Vector and Covariance Matrix, 78
   2.7  Matrix Inequalities and Maximization 79
   Supplement 2A: Vectors and Matrices: Basic Concepts 84
        Vectors, 84
        Matrices, 89
   Exercises 104
   References 111

3  SAMPLE GEOMETRY AND RANDOM SAMPLING 112
   3.1  Introduction 112
   3.2  The Geometry of the Sample 112
   3.3  Random Samples and the Expected Values of the Sample Mean and Covariance Matrix 120
   3.4  Generalized Variance 124
        Situations in which the Generalized Sample Variance Is Zero, 130
        Generalized Variance Determined by |R| and Its Geometrical Interpretation, 136
        Another Generalization of Variance, 138
   3.5  Sample Mean, Covariance, and Correlation As Matrix Operations 139
   3.6  Sample Values of Linear Combinations of Variables 141
   Exercises 145
   References 148

4  THE MULTIVARIATE NORMAL DISTRIBUTION 149
   4.1  Introduction 149
   4.2  The Multivariate Normal Density and Its Properties 149
        Additional Properties of the Multivariate Normal Distribution, 156
   4.3  Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation 168
        The Multivariate Normal Likelihood, 168
        Maximum Likelihood Estimation of μ and Σ, 170
        Sufficient Statistics, 173
   4.4  The Sampling Distribution of X̄ and S 173
        Properties of the Wishart Distribution, 174
   4.5  Large-Sample Behavior of X̄ and S 175
   4.6  Assessing the Assumption of Normality 177
        Evaluating the Normality of the Univariate Marginal Distributions, 178
        Evaluating Bivariate Normality, 183
   4.7  Detecting Outliers and Cleaning Data 189
        Steps for Detecting Outliers, 190
   4.8  Transformations to Near Normality 194
        Transforming Multivariate Observations, 198
   Exercises 202
   References 209

5  INFERENCES ABOUT A MEAN VECTOR 210
   5.1  Introduction 210
   5.2  The Plausibility of μ₀ as a Value for a Normal Population Mean 210
   5.3  Hotelling's T² and Likelihood Ratio Tests 216
        General Likelihood Ratio Method, 219
   5.4  Confidence Regions and Simultaneous Comparisons of Component Means 220
        Simultaneous Confidence Statements, 223
        A Comparison of Simultaneous Confidence Intervals with One-at-a-Time Intervals, 229
        The Bonferroni Method of Multiple Comparisons, 232
   5.5  Large Sample Inferences about a Population Mean Vector 234
   5.6  Multivariate Quality Control Charts 239
        Charts for Monitoring a Sample of Individual Multivariate Observations for Stability, 241
        Control Regions for Future Individual Observations, 247
        Control Ellipse for Future Observations, 248
        T²-Chart for Future Observations, 248
        Control Charts Based on Subsample Means, 249
        Control Regions for Future Subsample Observations, 251
   5.7  Inferences about Mean Vectors when Some Observations Are Missing 252
   5.8  Difficulties Due to Time Dependence in Multivariate Observations 256
   Supplement 5A: Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids 258
   Exercises 260
   References 270

6  COMPARISONS OF SEVERAL MULTIVARIATE MEANS 272
   6.1  Introduction 272
   6.2  Paired Comparisons and a Repeated Measures Design 272
        Paired Comparisons, 272
        A Repeated Measures Design for Comparing Treatments, 278
   6.3  Comparing Mean Vectors from Two Populations 283
        Assumptions Concerning the Structure of the Data, 283
        Further Assumptions when n₁ and n₂ Are Small, 284
        Simultaneous Confidence Intervals, 287
        The Two-Sample Situation when Σ₁ ≠ Σ₂, 290
   6.4  Comparing Several Multivariate Population Means (One-Way MANOVA) 293
        Assumptions about the Structure of the Data for One-Way MANOVA, 293
        A Summary of Univariate ANOVA, 293
        Multivariate Analysis of Variance (MANOVA), 298
   6.5  Simultaneous Confidence Intervals for Treatment Effects 305
   6.6  Two-Way Multivariate Analysis of Variance 307
        Univariate Two-Way Fixed-Effects Model with Interaction, 307
        Multivariate Two-Way Fixed-Effects Model with Interaction, 309
   6.7  Profile Analysis 318
   6.8  Repeated Measures Designs and Growth Curves 323
   6.9  Perspectives and a Strategy for Analyzing Multivariate Models 327
   Exercises 332
   References 352

7  MULTIVARIATE LINEAR REGRESSION MODELS 354
   7.1  Introduction 354
   7.2  The Classical Linear Regression Model 354
   7.3  Least Squares Estimation 358
        Sum-of-Squares Decomposition, 360
        Geometry of Least Squares, 361
        Sampling Properties of Classical Least Squares Estimators, 363
   7.4  Inferences About the Regression Model 365
        Inferences Concerning the Regression Parameters, 365
        Likelihood Ratio Tests for the Regression Parameters, 370
   7.5  Inferences from the Estimated Regression Function 374
        Estimating the Regression Function at z₀, 374
        Forecasting a New Observation at z₀, 375
   7.6  Model Checking and Other Aspects of Regression 377
        Does the Model Fit?, 377
        Leverage and Influence, 380
        Additional Problems in Linear Regression, 380
   7.7  Multivariate Multiple Regression 383
        Likelihood Ratio Tests for Regression Parameters, 392
        Other Multivariate Test Statistics, 395
        Predictions from Multivariate Multiple Regressions, 395
   7.8  The Concept of Linear Regression 398
        Prediction of Several Variables, 403
        Partial Correlation Coefficient, 406
   7.9  Comparing the Two Formulations of the Regression Model 407
        Mean Corrected Form of the Regression Model, 407
        Relating the Formulations, 409
   7.10 Multiple Regression Models with Time Dependent Errors 410
   Supplement 7A: The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model 415
   Exercises 417
   References 424

8  PRINCIPAL COMPONENTS 426
   8.1  Introduction 426
   8.2  Population Principal Components 426
        Principal Components Obtained from Standardized Variables, 432
        Principal Components for Covariance Matrices with Special Structures, 435
   8.3  Summarizing Sample Variation by Principal Components 437
        The Number of Principal Components, 440
        Interpretation of the Sample Principal Components, 444
        Standardizing the Sample Principal Components, 445
   8.4  Graphing the Principal Components 450
   8.5  Large Sample Inferences 452
        Large Sample Properties of λ̂ᵢ and êᵢ, 452
        Testing for the Equal Correlation Structure, 453
   8.6  Monitoring Quality with Principal Components 455
        Checking a Given Set of Measurements for Stability, 455
        Controlling Future Values, 459
   Supplement 8A: The Geometry of the Sample Principal Component Approximation 462
        The p-Dimensional Geometrical Interpretation, 464
        The n-Dimensional Geometrical Interpretation, 465
   Exercises 466
   References 475

9  FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES 477
   9.1  Introduction 477
   9.2  The Orthogonal Factor Model 478
   9.3  Methods of Estimation 484
        The Principal Component (and Principal Factor) Method, 484
        A Modified Approach-the Principal Factor Solution, 490
        The Maximum Likelihood Method, 492
        A Large Sample Test for the Number of Common Factors, 498
   9.4  Factor Rotation 501
        Oblique Rotations, 509
   9.5  Factor Scores 510
        The Weighted Least Squares Method, 511
        The Regression Method, 513
   9.6  Perspectives and a Strategy for Factor Analysis 517
   9.7  Structural Equation Models 524
        The LISREL Model, 525
        Construction of a Path Diagram, 525
        Covariance Structure, 526
        Estimation, 527
        Model-Fitting Strategy, 529
   Supplement 9A: Some Computational Details for Maximum Likelihood Estimation 530
        Recommended Computational Scheme, 531
        Maximum Likelihood Estimators of ρ = LzL′z + Ψz, 532
   Exercises 533
   References 541

10 CANONICAL CORRELATION ANALYSIS 543
   10.1 Introduction 543
   10.2 Canonical Variates and Canonical Correlations 543
   10.3 Interpreting the Population Canonical Variables 551
        Identifying the Canonical Variables, 551
        Canonical Correlations as Generalizations of Other Correlation Coefficients, 553
        The First r Canonical Variables as a Summary of Variability, 554
        A Geometrical Interpretation of the Population Canonical Correlation Analysis, 555
   10.4 The Sample Canonical Variates and Sample Canonical Correlations 556
   10.5 Additional Sample Descriptive Measures 564
        Matrices of Errors of Approximations, 564
        Proportions of Explained Sample Variance, 567
   10.6 Large Sample Inferences 569
   Exercises 573
   References 580

11 DISCRIMINATION AND CLASSIFICATION 581
   11.1 Introduction 581
   11.2 Separation and Classification for Two Populations 582
   11.3 Classification with Two Multivariate Normal Populations 590
        Classification of Normal Populations When Σ₁ = Σ₂ = Σ, 590
        Scaling, 595
        Classification of Normal Populations When Σ₁ ≠ Σ₂, 596
   11.4 Evaluating Classification Functions 598
   11.5 Fisher's Discriminant Function-Separation of Populations 609
   11.6 Classification with Several Populations 612
        The Minimum Expected Cost of Misclassification Method, 613
        Classification with Normal Populations, 616
   11.7 Fisher's Method for Discriminating among Several Populations 628
        Using Fisher's Discriminants to Classify Objects, 635
   11.8 Final Comments 641
        Including Qualitative Variables, 641
        Classification Trees, 641
        Neural Networks, 644
        Selection of Variables, 645
        Testing for Group Differences, 645
        Graphics, 646
        Practical Considerations Regarding Multivariate Normality, 646
   Exercises 647
   References 666

12 CLUSTERING, DISTANCE METHODS, AND ORDINATION 668
   12.1 Introduction 668
   12.2 Similarity Measures 670
        Distances and Similarity Coefficients for Pairs of Items, 670
        Similarities and Association Measures for Pairs of Variables, 676
        Concluding Comments on Similarity, 677
   12.3 Hierarchical Clustering Methods 679
        Single Linkage, 681
        Complete Linkage, 685
        Average Linkage, 689
        Ward's Hierarchical Clustering Method, 690
        Final Comments-Hierarchical Procedures, 693
   12.4 Nonhierarchical Clustering Methods 694
        K-means Method, 694
        Final Comments-Nonhierarchical Procedures, 698
   12.5 Multidimensional Scaling 700
        The Basic Algorithm, 700
   12.6 Correspondence Analysis 709
        Algebraic Development of Correspondence Analysis, 711
        Inertia, 718
        Interpretation in Two Dimensions, 719
        Final Comments, 719
   12.7 Biplots for Viewing Sampling Units and Variables 719
        Constructing Biplots, 720
   12.8 Procrustes Analysis: A Method for Comparing Configurations 723
        Constructing the Procrustes Measure of Agreement, 724
   Supplement 12A: Data Mining 731
        Introduction, 731
        The Data Mining Process, 732
        Model Assessment, 733
   Exercises 738
   References 745

APPENDIX 748
DATA INDEX 758
SUBJECT INDEX 761
Preface
INTENDED AUDIENCE
This book originally grew out of our lecture notes for an "Applied Multivariate Analysis" course offered jointly by the Statistics Department and the School of Business at the University of Wisconsin-Madison. Applied Multivariate Statistical Analysis, Fifth Edition, is concerned with statistical methods for describing and analyzing multivariate data. Data analysis, while interesting with one variable, becomes truly fascinating and challenging when several variables are involved. Researchers in the biological, physical, and social sciences frequently collect measurements on several variables. Modern computer packages readily provide the numerical results to rather complex statistical analyses. We have tried to provide readers with the supporting knowledge necessary for making proper interpretations, selecting appropriate techniques, and understanding their strengths and weaknesses. We hope our discussions will meet the needs of experimental scientists, in a wide variety of subject matter areas, as a readable introduction to the statistical analysis of multivariate observations.
LEVEL
Our aim is to present the concepts and methods of multivariate analysis at a level that is readily understandable by readers who have taken two or more statistics courses. We emphasize the applications of multivariate methods and, consequently, have attempted to make the mathematics as palatable as possible. We avoid the use of calculus. On the other hand, the concepts of a matrix and of matrix manipulations are important. We do not assume the reader is familiar with matrix algebra. Rather, we introduce matrices as they appear naturally in our discussions, and we then show how they simplify the presentation of multivariate models and techniques.

The introductory account of matrix algebra, in Chapter 2, highlights the more important matrix algebra results as they apply to multivariate analysis. The Chapter 2 supplement provides a summary of matrix algebra results for those with little or no previous exposure to the subject. This supplementary material helps make the book self-contained and is used to complete proofs. The proofs may be ignored on the first reading. In this way we hope to make the book accessible to a wide audience.
In our attempt to make the study of multivariate analysis appealing to a large audience of both practitioners and theoreticians, we have had to sacrifice a consistency of level. Some sections are harder than others. In particular, we have summarized a voluminous amount of material on regression in Chapter 7. The resulting presentation is rather succinct and difficult the first time through. We hope instructors will be able to compensate for the unevenness in level by judiciously choosing those sections, and subsections, appropriate for their students and by toning them down if necessary.
ORGANIZATION AND APPROACH
The methodological "tools" of multivariate analysis are contained in Chapters 5 through 12. These chapters represent the heart of the book, but they cannot be assimilated without much of the material in the introductory Chapters 1 through 4. Even those readers with a good knowledge of matrix algebra or those willing to accept the mathematical results on faith should, at the very least, peruse Chapter 3, "Sample Geometry," and Chapter 4, "Multivariate Normal Distribution."
Our approach in the methodological chapters is to keep the discussion direct and uncluttered. Typically, we start with a formulation of the population models, delineate the corresponding sample results, and liberally illustrate everything with examples. The examples are of two types: those that are simple and whose calculations can be easily done by hand, and those that rely on real-world data and computer software. These will provide an opportunity to (1) duplicate our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the data using methods other than the ones we have used or suggested.
The division of the methodological chapters (5 through 12) into three units allows instructors some flexibility in tailoring a course to their needs. Possible sequences for a one-semester (two-quarter) course are indicated schematically. Each instructor will undoubtedly omit certain sections from some chapters to cover a broader collection of topics than is indicated by these two choices.
Getting Started
    Chapters 1-4
        |
        either: Inference About Means (Chapters 5-7),
                then Analysis of Covariance Structure (Chapters 8-10)
        or:     Classification and Grouping (Chapters 11 and 12),
                then Analysis of Covariance Structure (Chapters 8-10)
For most students, we would suggest a quick pass through the first four chapters (concentrating primarily on the material in Chapter 1; Sections 2.1, 2.2, 2.3, 2.5, 2.6, and 3.6; and the "assessing normality" material in Chapter 4) followed by a selection of methodological topics. For example, one might discuss the comparison of mean vectors, principal components, factor analysis, discriminant analysis, and clustering. The discussions could feature the many "worked out" examples included in these sections of the text. Instructors may rely on diagrams and verbal descriptions to teach the corresponding theoretical developments. If the students have uniformly strong mathematical backgrounds, much of the book can successfully be covered in one term.
We have found individual data-analysis projects useful for integrating material from several of the methods chapters. Here, our rather complete treatments of multivariate analysis of variance (MANOVA), regression analysis, factor analysis, canonical correlation, discriminant analysis, and so forth are helpful, even though they may not be specifically covered in lectures.
CHANGES TO THE FIFTH EDITION
New material. Users of the previous editions will notice that we have added several exercises and data sets, some new graphics, and expanded discussions of the dimensionality of multivariate data, growth curves, and classification and regression trees (CART). In addition, the algebraic development of correspondence analysis has been redone, and a new section on data mining has been added to Chapter 12. We put the data mining material in Chapter 12 since much of data mining, as it is now applied in business, has a classification and/or grouping objective. As always, we have tried to improve the exposition in several places.
Data CD. Recognizing the importance of modern statistical packages in the
analysis of multivariate data, we have added numerous real-data sets. The full data sets
used in the book are saved as ASCII files on the CD-ROM that is packaged with
each copy of the book. This format will allow easy interface with existing statistical
software packages and provide more convenient hands-on data analysis opportunities.
Instructor's Solutions Manual. An Instructor's Solutions Manual (ISBN 0-13-092555-1) containing complete solutions to most of the exercises in the book is available free upon adoption from Prentice Hall.
For information on additional for-sale supplements that may be used with the book or additional titles of interest, please visit the Prentice Hall Web site at www.prenhall.com.
ACKNOWLEDGMENTS
We thank our many colleagues who helped improve the applied aspect of the book by contributing their own data sets for examples and exercises. A number of individuals helped guide this revision, and we are grateful for their suggestions: Steve Coad, University of Michigan; Richard Kiltie, University of Florida; Sam Kotz, George Mason University; Shyamal Peddada, University of Virginia; K. Sivakumar, University of Illinois at Chicago; Eric Smith, Virginia Tech; and Stanley Wasserman, University of Illinois at Urbana-Champaign. We also acknowledge the feedback of the students we have taught these past 30 years in our applied multivariate analysis courses. Their comments and suggestions are largely responsible for the present iteration of this work. We would also like to give special thanks to Wai Kwong Cheang for his help with the calculations for many of the examples.
We must thank Dianne Hall for her valuable work on the CD-ROM and Solutions Manual, Steve Verrill for computing assistance throughout, and Alison Pollack for implementing a Chernoff faces program. We are indebted to Cliff Gilman for his assistance with the multidimensional scaling examples discussed in Chapter 12. Jacquelyn Forer did most of the typing of the original draft manuscript, and we appreciate her expertise and willingness to endure the cajoling of authors faced with publication deadlines. Finally, we would like to thank Quincy McDonald, Joanne Wendelken, Steven Scott Pawlowski, Pat Daly, Linda Behrens, Alan Fischer, and the rest of the Prentice Hall staff for their help with this project.
R. A. Johnson
rich@stat.wisc.edu
D. W. Wichern
CHAPTER 1
Aspects of Multivariate Analysis
1.1 INTRODUCTION
Scientific inquiry is an iterative learning process. Objectives pertaining to the explanation of a social or physical phenomenon must be specified and then tested by gathering and analyzing data. In turn, an analysis of the data gathered by experimentation or observation will usually suggest a modified explanation of the phenomenon. Throughout this iterative learning process, variables are often added to or deleted from the study. Thus, the complexities of most phenomena require an investigator to collect observations on many different variables. This book is concerned with statistical methods designed to elicit information from these kinds of data sets. Because the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.
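As a small, hypothetical illustration of what such a data set looks like (in the spirit of the data-organization material later in this chapter), multivariate data can be held as an n × p array, with one row per item and one column per variable. The sketch below uses NumPy purely for convenience; the values are invented for illustration (four receipts measured on dollar sales and number of books sold):

```python
import numpy as np

# Hypothetical data array: n = 4 items measured on p = 2 variables
# (dollar sales, number of books); values invented for illustration.
X = np.array([[42.0, 4.0],
              [52.0, 5.0],
              [48.0, 4.0],
              [58.0, 3.0]])

x_bar = X.mean(axis=0)             # sample mean vector
S = np.cov(X, rowvar=False)        # sample covariance matrix (divisor n - 1)
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix

print(x_bar)      # mean of each variable: [50.  4.]
print(S[0, 1])    # sample covariance between the two variables
print(R[0, 1])    # sample correlation between the two variables
```

The same summaries (mean vector, covariance matrix, correlation matrix) are developed formally in Section 1.3.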
The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject. Often, the human mind is overwhelmed by the sheer bulk of the data. Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. We have chosen to provide explanations based upon algebraic concepts and to avoid the derivations of statistical results that require the calculus of many variables. Our objective is to introduce several useful multivariate techniques in a clear manner, making heavy use of illustrative examples and a minimum of mathematics. Nonetheless, some mathematical sophistication and a desire to think quantitatively will be required.
Most of our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made. Only in Chapters 6 and 7 shall we treat a few experimental plans (designs) for generating data that prescribe the active manipulation of important variables. Although the experimental design is ordinarily the most important part of a scientific investigation, it is frequently impossible to control the generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.) You should consult [7] and [8] for detailed accounts of design principles that, fortunately, also apply to multivariate situations.
It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or commonsense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.
Multivariate analysis is a "mixed bag." It is difficult to establish a classification scheme for multivariate techniques that both is widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. In Section 1.2, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study. These problems, plus the examples in the text, should provide you with an appreciation for the applicability of multivariate techniques across different fields.
The objectives of scientific investigations to which multivariate methods most
naturally lend themselves include the following:
1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?
4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.
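As a sketch of the first objective, data reduction: the first sample principal component (the topic of Chapter 8) replaces several correlated variables with a single summary score. The data below are simulated here for illustration, not taken from the book:

```python
import numpy as np

# Simulated data: three strongly correlated variables driven by
# one underlying quantity, plus a little measurement noise.
rng = np.random.default_rng(0)
driver = rng.normal(size=100)
X = np.column_stack([driver + 0.1 * rng.normal(size=100) for _ in range(3)])

# Center the data, then take the eigenvector of the sample
# covariance matrix associated with the largest eigenvalue.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending order
scores = Xc @ eigvecs[:, -1]        # one summary score per item

prop = eigvals[-1] / eigvals.sum()  # share of total variance retained
print(f"variance retained by one component: {prop:.2f}")
```

Because the three variables are nearly redundant, almost all of their joint variability survives in the single column of scores, which is the sense in which the structure has been "simplified."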
We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should keep it in mind whenever you attempt or read about a data analysis. It allows one to maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:

    If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.
1.2 APPLICATIONS OF MULTIVARIATE TECHNIQUES
The published applications of multivariate methods have increased tremendously in
recent years. It is now difficult to cover the variety of real-world applications of these
methods with brief discussions, as we did in earlier editions of this book. However,
in order to give some indication of the usefulness of multivariate techniques, we offer
the following short descriptions of the results of studies from several disciplines.
These descriptions are organized according to the categories of objectives given in the
previous section. Of course, many of our examples are multifaceted and could be
placed in more than one category.
Data reduction or simplification

• Using data on several variables related to cancer patient responses to radiotherapy, a simple measure of patient response to radiotherapy was constructed. (See Exercise 1.15.)
• Track records from many nations were used to develop an index of performance for both male and female athletes. (See [10] and [22].)
• Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images (pictures) of a shoreline in two dimensions. (See [23].)
• Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants. (See [14].)
• A matrix of tactic similarities was developed from aggregate data derived from professional mediators. From this matrix the number of dimensions by which professional mediators judge the tactics they use in resolving disputes was determined. (See [21].)
Sorting and grouping

• Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)
• Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from nonalcoholics. (See [26].)
• Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.)
• The U.S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not. (See [31].)
Investigation of the dependence among variables

• Data on several variables were used to identify factors that were responsible for client success in hiring external consultants. (See [13].)
• Measurements of variables related to innovation, on the one hand, and variables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not. (See [5].)
• Data on variables representing the outcomes of the 10 decathlon events in the Olympics were used to determine the physical factors responsible for success in the decathlon. (See [17].)
• The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance. (See [18].)
Prediction

• The associations between test scores and several high school performance variables and several college performance variables were used to develop predictors of success in college. (See [11].)
• Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments. (See [9] and [20].)
• Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insurers. (See [28].)
• Data on several variables for chickweed plants were used to develop a method for predicting the species of a new plant. (See [4].)
Hypotheses testing

• Several pollution-related variables were measured to determine whether levels for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and weekends. (See Exercise 1.6.)
• Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores. (See [27].)
• Data on many variables were used to investigate the differences in structure of American occupations to determine the support for one of two competing sociological theories. (See [16] and [25].)