Applied Statistics Using SPSS, STATISTICA,
MATLAB and R
With 195 Figures and a CD
Joaquim P. Marques de Sá
Typesetting: by the editors
Production: Integra Software Services Pvt. Ltd., India
Cover design: WMX design, Heidelberg
Printed on acid-free paper SPIN: 11908944 42/3100/Integra 5 4 3 2 1 0
Library of Congress Control Number: 2007926024
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations
are liable for prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2007
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant pro-
tective laws and regulations and therefore free for general use.
ISBN 978-3-540-71971-7 Springer Berlin Heidelberg New York
Prof. Dr. Joaquim P. Marques de Sá
Universidade do Porto
Fac. Engenharia
Rua Dr. Roberto Frias s/n
4200-465 Porto
Portugal
e-mail:
To Wiesje and Carlos.
Contents
Preface to the Second Edition xv
Preface to the First Edition xvii
Symbols and Abbreviations xix
1 Introduction 1
1.1 Deterministic Data and Random Data 1
1.2 Population, Sample and Statistics 5
1.3 Random Variables 8
1.4 Probabilities and Distributions 10
1.4.1 Discrete Variables 10
1.4.2 Continuous Variables 12
1.5 Beyond a Reasonable Doubt 13
1.6 Statistical Significance and Other Significances 17
1.7 Datasets 19
1.8 Software Tools 19
1.8.1 SPSS and STATISTICA 20
1.8.2 MATLAB and R 22
2 Presenting and Summarising the Data 29
2.1 Preliminaries 29
2.1.1 Reading in the Data 29
2.1.2 Operating with the Data 34
2.2 Presenting the Data 39
2.2.1 Counts and Bar Graphs 40
2.2.2 Frequencies and Histograms 47
2.2.3 Multivariate Tables, Scatter Plots and 3D Plots 52
2.2.4 Categorised Plots 56
2.3 Summarising the Data 58
2.3.1 Measures of Location 58
2.3.2 Measures of Spread 62
2.3.3 Measures of Shape 64
2.3.4 Measures of Association for Continuous Variables 66
2.3.5 Measures of Association for Ordinal Variables 69
2.3.6 Measures of Association for Nominal Variables 73
Exercises 77
3 Estimating Data Parameters 81
3.1 Point Estimation and Interval Estimation 81
3.2 Estimating a Mean 85
3.3 Estimating a Proportion 92
3.4 Estimating a Variance 95
3.5 Estimating a Variance Ratio 97
3.6 Bootstrap Estimation 99
Exercises 107
4 Parametric Tests of Hypotheses 111
4.1 Hypothesis Test Procedure 111
4.2 Test Errors and Test Power 115
4.3 Inference on One Population 121
4.3.1 Testing a Mean 121
4.3.2 Testing a Variance 125
4.4 Inference on Two Populations 126
4.4.1 Testing a Correlation 126
4.4.2 Comparing Two Variances 129
4.4.3 Comparing Two Means 132
4.5 Inference on More than Two Populations 141
4.5.1 Introduction to the Analysis of Variance 141
4.5.2 One-Way ANOVA 143
4.5.3 Two-Way ANOVA 156
Exercises 166
5 Non-Parametric Tests of Hypotheses 171
5.1 Inference on One Population 172
5.1.1 The Runs Test 172
5.1.2 The Binomial Test 174
5.1.3 The Chi-Square Goodness of Fit Test 179
5.1.4 The Kolmogorov-Smirnov Goodness of Fit Test 183
5.1.5 The Lilliefors Test for Normality 187
5.1.6 The Shapiro-Wilk Test for Normality 187
5.2 Contingency Tables 189
5.2.1 The 2×2 Contingency Table 189
5.2.2 The r×c Contingency Table 193
5.2.3 The Chi-Square Test of Independence 195
5.2.4 Measures of Association Revisited 197
5.3 Inference on Two Populations 200
5.3.1 Tests for Two Independent Samples 201
5.3.2 Tests for Two Paired Samples 205
5.4 Inference on More Than Two Populations 212
5.4.1 The Kruskal-Wallis Test for Independent Samples 212
5.4.2 The Friedman Test for Paired Samples 215
5.4.3 The Cochran Q test 217
Exercises 218
6 Statistical Classification 223
6.1 Decision Regions and Functions 223
6.2 Linear Discriminants 225
6.2.1 Minimum Euclidian Distance Discriminant 225
6.2.2 Minimum Mahalanobis Distance Discriminant 228
6.3 Bayesian Classification 234
6.3.1 Bayes Rule for Minimum Risk 234
6.3.2 Normal Bayesian Classification 240
6.3.3 Dimensionality Ratio and Error Estimation 243
6.4 The ROC Curve 246
6.5 Feature Selection 253
6.6 Classifier Evaluation 256
6.7 Tree Classifiers 259
Exercises 268
7 Data Regression 271
7.1 Simple Linear Regression 272
7.1.1 Simple Linear Regression Model 272
7.1.2 Estimating the Regression Function 273
7.1.3 Inferences in Regression Analysis 279
7.1.4 ANOVA Tests 285
7.2 Multiple Regression 289
7.2.1 General Linear Regression Model 289
7.2.2 General Linear Regression in Matrix Terms 289
7.2.3 Multiple Correlation 292
7.2.4 Inferences on Regression Parameters 294
7.2.5 ANOVA and Extra Sums of Squares 296
7.2.6 Polynomial Regression and Other Models 300
7.3 Building and Evaluating the Regression Model 303
7.3.1 Building the Model 303
7.3.2 Evaluating the Model 306
7.3.3 Case Study 308
7.4 Regression Through the Origin 314
7.5 Ridge Regression 316
7.6 Logit and Probit Models 322
Exercises 327
8 Data Structure Analysis 329
8.1 Principal Components 329
8.2 Dimensional Reduction 337
8.3 Principal Components of Correlation Matrices 339
8.4 Factor Analysis 347
Exercises 350
9 Survival Analysis 353
9.1 Survivor Function and Hazard Function 353
9.2 Non-Parametric Analysis of Survival Data 354
9.2.1 The Life Table Analysis 354
9.2.2 The Kaplan-Meier Analysis 359
9.2.3 Statistics for Non-Parametric Analysis 362
9.3 Comparing Two Groups of Survival Data 364
9.4 Models for Survival Data 367
9.4.1 The Exponential Model 367
9.4.2 The Weibull Model 369
9.4.3 The Cox Regression Model 371
Exercises 373
10 Directional Data 375
10.1 Representing Directional Data 375
10.2 Descriptive Statistics 380
10.3 The von Mises Distributions 383
10.4 Assessing the Distribution of Directional Data 387
10.4.1 Graphical Assessment of Uniformity 387
10.4.2 The Rayleigh Test of Uniformity 389
10.4.3 The Watson Goodness of Fit Test 392
10.4.4 Assessing the von Misesness of Spherical Distributions 393
10.5 Tests on von Mises Distributions 395
10.5.1 One-Sample Mean Test 395
10.5.2 Mean Test for Two Independent Samples 396
10.6 Non-Parametric Tests 397
10.6.1 The Uniform Scores Test for Circular Data 397
10.6.2 The Watson Test for Spherical Data 398
10.6.3 Testing Two Paired Samples 399
Exercises 400
Appendix A - Short Survey on Probability Theory 403
A.1 Basic Notions 403
A.1.1 Events and Frequencies 403
A.1.2 Probability Axioms 404
A.2 Conditional Probability and Independence 406
A.2.1 Conditional Probability and Intersection Rule 406
A.2.2 Independent Events 406
A.3 Compound Experiments 408
A.4 Bayes’ Theorem 409
A.5 Random Variables and Distributions 410
A.5.1 Definition of Random Variable 410
A.5.2 Distribution and Density Functions 411
A.5.3 Transformation of a Random Variable 413
A.6 Expectation, Variance and Moments 414
A.6.1 Definitions and Properties 414
A.6.2 Moment-Generating Function 417
A.6.3 Chebyshev Theorem 418
A.7 The Binomial and Normal Distributions 418
A.7.1 The Binomial Distribution 418
A.7.2 The Laws of Large Numbers 419
A.7.3 The Normal Distribution 420
A.8 Multivariate Distributions 422
A.8.1 Definitions 422
A.8.2 Moments 425
A.8.3 Conditional Densities and Independence 425
A.8.4 Sums of Random Variables 427
A.8.5 Central Limit Theorem 428
Appendix B - Distributions 431
B.1 Discrete Distributions 431
B.1.1 Bernoulli Distribution 431
B.1.2 Uniform Distribution 432
B.1.3 Geometric Distribution 433
B.1.4 Hypergeometric Distribution 434
B.1.5 Binomial Distribution 435
B.1.6 Multinomial Distribution 436
B.1.7 Poisson Distribution 438
B.2 Continuous Distributions 439
B.2.1 Uniform Distribution 439
B.2.2 Normal Distribution 441
B.2.3 Exponential Distribution 442
B.2.4 Weibull Distribution 444
B.2.5 Gamma Distribution 445
B.2.6 Beta Distribution 446
B.2.7 Chi-Square Distribution 448
B.2.8 Student’s t Distribution 449
B.2.9 F Distribution 451
B.2.10 Von Mises Distributions 452
Appendix C - Point Estimation 455
C.1 Definitions 455
C.2 Estimation of Mean and Variance 457
Appendix D - Tables 459
D.1 Binomial Distribution 459
D.2 Normal Distribution 465
D.3 Student’s t Distribution 466
D.4 Chi-Square Distribution 467
D.5 Critical Values for the F Distribution 468
Appendix E - Datasets 469
E.1 Breast Tissue 469
E.2 Car Sale 469
E.3 Cells 470
E.4 Clays 470
E.5 Cork Stoppers 471
E.6 CTG 472
E.7 Culture 473
E.8 Fatigue 473
E.9 FHR 474
E.10 FHR-Apgar 474
E.11 Firms 475
E.12 Flow Rate 475
E.13 Foetal Weight 475
E.14 Forest Fires 476
E.15 Freshmen 476
E.16 Heart Valve 477
E.17 Infarct 478
E.18 Joints 478
E.19 Metal Firms 479
E.20 Meteo 479
E.21 Moulds 479
E.22 Neonatal 480
E.23 Programming 480
E.24 Rocks 481
E.25 Signal & Noise 481
E.26 Soil Pollution 482
E.27 Stars 482
E.28 Stock Exchange 483
E.29 VCG 484
E.30 Wave 484
E.31 Weather 484
E.32 Wines 485
Appendix F - Tools 487
F.1 MATLAB Functions 487
F.2 R Functions 488
F.3 Tools EXCEL File 489
F.4 SCSize Program 489
References 491
Index 499
Preface to the Second Edition
Four years have passed since the first edition of this book. During this time I have
had the opportunity to apply it in classes obtaining feedback from students and
inspiration for improvements. I have also benefited from many comments by users
of the book. For the present second edition large parts of the book have undergone
major revision, although the basic concept – concise but sufficiently rigorous
mathematical treatment with emphasis on computer applications to real datasets –
has been retained.
The second edition improvements are as follows:
• Inclusion of R as an application tool. As a matter of fact, R is a free
software product which has nowadays reached a high level of maturity
and is being increasingly used by many people as a statistical analysis
tool.
• Chapter 3 has an added section on bootstrap estimation methods, which
have gained a large popularity in practical applications.
• A revised explanation and treatment of tree classifiers in Chapter 6 with
the inclusion of the QUEST approach.
• Several improvements of Chapter 7 (regression), namely: details
concerning the meaning and computation of multiple and partial
correlation coefficients, with examples; a more thorough treatment and
exemplification of the ridge regression topic; more attention dedicated to
model evaluation.
• Inclusion in the book CD of additional MATLAB functions as well as a
set of R functions.
• Extra examples and exercises have been added in several chapters.
• The bibliography has been revised and new references added.
I have also tried to improve the quality and clarity of the text as well as notation.
Regarding notation I follow in this second edition the more widespread use of
denoting random variables with italicised capital letters, instead of using small
cursive font as in the first edition. Finally, I have also paid much attention to
correcting errors, misprints and obscurities of the first edition.
J.P. Marques de Sá
Porto, 2007
Preface to the First Edition
This book is intended as a reference book for students, professionals and research
workers who need to apply statistical analysis to a large variety of practical
problems using STATISTICA, SPSS and MATLAB. The book chapters provide a
comprehensive coverage of the main statistical analysis topics (data description,
statistical inference, classification and regression, factor analysis, survival data,
directional statistics) that one faces in practical problems, discussing their solutions
with the mentioned software packages.
The only prerequisite to use the book is an undergraduate knowledge level of
mathematics. While it is expected that most readers employing the book will have
already some knowledge of elementary statistics, no previous course in probability
or statistics is needed in order to study and use the book. The first two chapters
introduce the basic needed notions on probability and statistics. In addition, the
first two Appendices provide a short survey on Probability Theory and
Distributions for the reader needing further clarification on the theoretical
foundations of the statistical methods described.
The book is partly based on tutorial notes and materials used in data analysis
disciplines taught at the Faculty of Engineering, Porto University. One of these
disciplines is attended by students of a Master’s Degree course on information
management. The students in this course have a variety of educational backgrounds
and professional interests, which generated and brought about datasets and analysis
objectives which are quite challenging concerning the methods to be applied and
the interpretation of the results. The datasets used in the book examples and
exercises were collected from these courses as well as from research. They are
included in the book CD and cover a broad spectrum of areas: engineering,
medicine, biology, psychology, economy, geology, and astronomy.
Every chapter explains the relevant notions and methods concisely, and is
illustrated with practical examples using real data, presented with the distinct
intention of clarifying sensible practical issues. The solutions presented in the
examples are obtained with one of the software packages STATISTICA, SPSS or
MATLAB; therefore, the reader has the opportunity to closely follow what is being
done. The book is not intended as a substitute for the STATISTICA, SPSS and
MATLAB user manuals. It does, however, provide the necessary guidance for
applying the methods taught without having to delve into the manuals. This
includes, for each topic explained in the book, a clear indication of which
STATISTICA, SPSS or MATLAB tools to be applied. These indications appear in
specific “Commands” frames together with a complementary description on how to
use the tools, whenever necessary. In this way, a comparative perspective of the
capabilities of those software packages is also provided, which can be quite useful
for practical purposes.
STATISTICA, SPSS or MATLAB do not provide specific tools for some of the
statistical topics described in the book. These range from such basic issues as the
choice of the optimal number of histogram bins to more advanced topics such as
directional statistics. The book CD provides these tools, including a set of
MATLAB functions for directional statistics.
I am grateful to many people who helped me during the preparation of the book.
Professor Luís Alexandre provided help in reviewing the book contents. Professor
Willem van Meurs provided constructive comments on several topics. Professor
Joaquim Góis contributed with many interesting discussions and suggestions,
namely on the topic of data structure analysis. Dr. Carlos Felgueiras and Paulo
Sousa gave valuable assistance in several software issues and in the development
of some software tools included in the book CD. My gratitude also to Professor
Pimenta Monteiro for his support in elucidating some software tricks during the
preparation of the text files. A lot of people contributed with datasets. Their names
are mentioned in Appendix E. I express my deepest thanks to all of them. Finally, I
would also like to thank Alan Weed for his thorough revision of the texts and the
clarification of many editing issues.
J.P. Marques de Sá
Porto, 2003
Symbols and Abbreviations
Sample Sets
A event
A set (of events)
$\{A_1, A_2, \ldots\}$ set constituted of events $A_1, A_2, \ldots$
$\bar{A}$ complement of {A}
$A \cup B$ union of {A} with {B}
$A \cap B$ intersection of {A} with {B}
E set of all events (universe)
φ empty set
Functional Analysis
∃ there is
∀ for every
∈ belongs to
∉ doesn’t belong to
≡ equivalent to
|| || Euclidian norm (vector length)
⇒ implies
→ converges to
ℜ real number set
$\Re^{+}$ [0, +∞ [
[a, b] closed interval between and including a and b
]a, b] interval between a and b, excluding a
[a, b[ interval between a and b, excluding b
]a, b[ open interval between a and b (excluding a and b)
$\sum_{i=1}^{n}$ sum for index i = 1,…, n
$\prod_{i=1}^{n}$ product for index i = 1,…, n
$\int_{a}^{b}$ integral from a to b
k! factorial of k, $k! = k(k-1)(k-2) \cdots 2 \cdot 1$
$\binom{n}{k}$ combinations of n elements taken k at a time
| x | absolute value of x
$\lfloor x \rfloor$ largest integer smaller or equal to x
$g_X(a)$ function g of variable X evaluated at a
$\frac{dg}{dX}$ derivative of function g with respect to X
$\left. \frac{d^n g}{dX^n} \right|_a$ derivative of order n of g evaluated at a
ln(x) natural logarithm of x
log(x) logarithm of x in base 10
sgn(x) sign of x
mod(x,y) remainder of the integer division of x by y
Vectors and Matrices
x vector (column vector), multidimensional random vector
x' transpose vector (row vector)
$[x_1 \; x_2 \; \ldots \; x_n]$ row vector whose components are $x_1, x_2, \ldots, x_n$
$x_i$ i-th component of vector x
$x_{k,i}$ i-th component of vector $x_k$
∆x vector x increment
x'y inner (dot) product of x and y
A matrix
$a_{ij}$ i-th row, j-th column element of matrix A
A' transpose of matrix A
$A^{-1}$ inverse of matrix A
|A| determinant of matrix A
tr(A) trace of A (sum of the diagonal elements)
I unit matrix
$\lambda_i$ eigenvalue i
Probabilities and Distributions
X random variable (with value denoted by the same lower case letter, x)
P(A) probability of event A
P(A|B) probability of event A conditioned on B having occurred
P(x) discrete probability of random vector x
$P(\omega_i | x)$ discrete conditional probability of $\omega_i$ given x
f(x) probability density function f evaluated at x
$f(x | \omega_i)$ conditional probability density function f evaluated at x given $\omega_i$
X ~ f X has probability density function f
X ~ F X has probability distribution function (is distributed as) F
Pe probability of misclassification (error)
Pc probability of correct classification
df degrees of freedom
$x_{df,\alpha}$ α-percentile of X distributed with df degrees of freedom
$b_{n,p}$ binomial probability for n trials and probability p of success
$B_{n,p}$ binomial distribution for n trials and probability p of success
u uniform probability or density function
U uniform distribution
$g_p$ geometric probability (Bernoulli trial with probability p)
$G_p$ geometric distribution (Bernoulli trial with probability p)
$h_{N,D,n}$ hypergeometric probability (sample of n out of N with D items)
$H_{N,D,n}$ hypergeometric distribution (sample of n out of N with D items)
$p_\lambda$ Poisson probability with event rate λ
$P_\lambda$ Poisson distribution with event rate λ
$n_{\mu,\sigma}$ normal density with mean μ and standard deviation σ
$N_{\mu,\sigma}$ normal distribution with mean μ and standard deviation σ
$\varepsilon_\lambda$ exponential density with spread factor λ
$\mathrm{E}_\lambda$ exponential distribution with spread factor λ
$w_{\alpha,\beta}$ Weibull density with parameters α, β
$W_{\alpha,\beta}$ Weibull distribution with parameters α, β
$\gamma_{a,p}$ Gamma density with parameters a, p
$\Gamma_{a,p}$ Gamma distribution with parameters a, p
$\beta_{p,q}$ Beta density with parameters p, q
$\mathrm{B}_{p,q}$ Beta distribution with parameters p, q
$\chi^2_{df}$ Chi-square density with df degrees of freedom
$\mathrm{X}^2_{df}$ Chi-square distribution with df degrees of freedom
$t_{df}$ Student’s t density with df degrees of freedom
$T_{df}$ Student’s t distribution with df degrees of freedom
$f_{df_1,df_2}$ F density with $df_1$, $df_2$ degrees of freedom
$F_{df_1,df_2}$ F distribution with $df_1$, $df_2$ degrees of freedom
Statistics
$\hat{x}$ estimate of x
$\mathrm{E}[X]$ expected value (average, mean) of X
$\mathrm{V}[X]$ variance of X
$\mathrm{E}[x | y]$ expected value of x given y (conditional expectation)
$m_k$ central moment of order k
μ mean value
σ standard deviation
$\sigma_{XY}$ covariance of X and Y
ρ correlation coefficient
$\boldsymbol{\mu}$ mean vector
Σ covariance matrix
$\bar{x}$ arithmetic mean
v sample variance
s sample standard deviation
$x_\alpha$ α-quantile of X ($F_X(x_\alpha) = \alpha$)
med(X) median of X (same as $x_{0.5}$)
S sample covariance matrix
α significance level (1−α is the confidence level)
$x_\alpha$ α-percentile of X
ε tolerance
Abbreviations
FNR False Negative Ratio
FPR False Positive Ratio
iff if and only if
i.i.d. independent and identically distributed
IRQ inter-quartile range
pdf probability density function
LSE Least Square Error
ML Maximum Likelihood
MSE Mean Square Error
PDF probability distribution function
RMS Root Mean Square Error
r.v. Random variable
ROC Receiver Operating Characteristic
SSB Between-group Sum of Squares
SSE Error Sum of Squares
SSLF Lack of Fit Sum of Squares
SSPE Pure Error Sum of Squares
SSR Regression Sum of Squares
SST Total Sum of Squares
SSW Within-group Sum of Squares
TNR True Negative Ratio
TPR True Positive Ratio
VIF Variance Inflation Factor
Tradenames
EXCEL Microsoft Corporation
MATLAB The MathWorks, Inc.
SPSS SPSS, Inc.
STATISTICA Statsoft, Inc.
WINDOWS Microsoft Corporation
1 Introduction
1.1 Deterministic Data and Random Data
Our daily experience teaches us that some data are generated in accordance with
known and precise laws, while other data seem to occur in a purely haphazard way.
Data generated in accordance with known and precise laws are called deterministic
data. An example of such type of data is the fall of a body subject to the Earth’s
gravity. When the body is released at a height h, we can calculate precisely where
the body stands at each time t. The physical law, assuming that the fall takes place
in an empty space, is expressed as:
$$h = h_0 - \tfrac{1}{2} g t^2,$$
where $h_0$ is the initial height and g is the Earth’s gravity acceleration at the point
where the body falls.
Figure 1.1 shows the behaviour of h with t, assuming an initial height of 15
meters.
[Figure 1.1: plot of h versus t, shown alongside the following table]
   t      h
0.00  15.00
0.20  14.80
0.40  14.22
0.60  13.24
0.80  11.86
1.00  10.10
1.20   7.94
1.40   5.40
1.60   2.46
Figure 1.1. Body in free-fall, with height in meters and time in seconds, assuming
g = 9.8 m/s². The h column is an example of deterministic data.
In the case of the body fall there is a law that allows the exact computation of
one of the variables h or t (for given h
0
and g) as a function of the other one.
Moreover, if we repeat the body-fall experiment under identical conditions, we
consistently obtain the same results, within the precision of the measurements.
These are the attributes of deterministic data: the same data will be obtained,
within the precision of the measurements, under repeated experiments in well-
defined conditions.
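As a minimal sketch of this reproducibility, the free-fall law can be evaluated directly. The function name `height` and the time grid are illustrative choices made here; the values of h0 = 15 m and g = 9.8 m/s² are those of Figure 1.1. Running the code any number of times yields the same h column.

```python
# Sketch: evaluate h = h0 - (1/2) g t^2 on the time grid of Figure 1.1.
# The name `height` and the grid construction are illustrative assumptions.

def height(t, h0=15.0, g=9.8):
    """Height (m) at time t (s) of a body released from h0 in empty space."""
    return h0 - 0.5 * g * t ** 2

# Times 0.0, 0.2, ..., 1.6 seconds
times = [round(0.2 * i, 1) for i in range(9)]

# Deterministic data: the same table is produced on every run
table = [(t, round(height(t), 2)) for t in times]

for t, h in table:
    print(f"{t:4.2f}  {h:5.2f}")
```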
Imagine now that we were dealing with Stock Exchange data, such as, for
instance, the daily share value throughout one year of a given company. For such
data there is no known law to describe how the share value evolves along the year.
Furthermore, the possibility of experiment repetition with identical results does not
apply here. We are, thus, in the presence of what is called random data.
Classical examples of random data are:
− Thermal noise generated in electrical resistances, antennae, etc.;
− Brownian motion of tiny particles in a fluid;
− Weather variables;
− Financial variables such as Stock Exchange share values;
− Gambling game outcomes (dice, cards, roulette, etc.);
− Conscript height at military inspection.
In none of these examples can a precise mathematical law describe the data.
Also, there is no possibility of obtaining the same data in repeated experiments,
performed under similar conditions. This is mainly due to the fact that several
unforeseeable or immeasurable causes play a role in the generation of such data.
For instance, in the case of the Brownian motion, we find that, after a certain time,
the trajectories followed by several particles that have departed from exactly the
same point are completely different from one another. Moreover, it is found that such
differences largely exceed the precision of the measurements.
When dealing with a random dataset, especially if it relates to the temporal
evolution of some variable, it is often convenient to consider such dataset as one
realization (or one instance) of a set (or ensemble) consisting of a possibly infinite
number of realizations of a generating process. This is the so-called random
process (or stochastic process, from the Greek “stochastikos” = method or
phenomenon composed of random parts). Thus:
− The wandering voltage signal one can measure in an open electrical
resistance is an instance of a thermal noise process (with an ensemble of
infinitely many continuous signals);
− The trajectory of a tiny particle in a fluid is an instance of a Brownian
process (with an ensemble of infinitely many continuous trajectories);
− The succession of face values when tossing n times a die is an instance of a
die tossing process (with an ensemble of finitely many discrete sequences).
[Figure 1.2: plot of h versus t]
Figure 1.2. Three “body fall” experiments, under identical conditions as in Figure
1.1, with measurement errors (random data components). The dotted line
represents the theoretical curve (deterministic data component). The solid circles
correspond to the measurements made.
We might argue that if we knew all the causal variables of the “random data” we
could probably find a deterministic description of the data. Furthermore, if we
didn’t know the mathematical law underlying a deterministic experiment, we might
conclude that a random dataset was present. For example, imagine that we did not
know the “body fall” law and attempted to describe it by running several
experiments in the same conditions as before, performing the respective
measurement of the height h for several values of the time t, obtaining the results
shown in Figure 1.2. The measurements of each single experiment display a
random variability due to measurement errors. These are always present in any
dataset that we collect, and we can only hope that by averaging out such errors we
get the “underlying law” of the data. This is a central idea in statistics: that certain
quantities give the “big picture” of the data, averaging out random errors. As a
matter of fact, statistics were first used as a means of summarising data, namely
social and state data (the word “statistics” coming from the “science of state”).
Scientists’ attitude towards the “deterministic vs. random” dichotomy has
undergone drastic historical changes, triggered by major scientific discoveries.
Paramount among these changes in recent years has been the development of the
quantum description of physical phenomena, which yields a granular-all-
connectedness picture of the universe. The well-known “uncertainty principle” of
Heisenberg, which states a limit to our capability of ever decreasing the
measurement errors of experiment-related variables (e.g. position and velocity),
also supports a critical attitude towards determinism.
Even now the deterministic vs. random phenomenal characterization is subject
to controversies and often statistical methods are applied to deterministic data. A
good example of this is the so-called chaotic phenomena, which are described by a
precise mathematical law, i.e., such phenomena are deterministic. However, the
sensitivity of these phenomena to changes of causal variables is so large that the
precision of the result cannot be properly controlled by the precision of the causes.
To illustrate this, let us consider the following formula used as a model of
population growth in ecology studies, where $p_n$ is the population of a species at
instant n, expressed as a fraction of a limiting population size, and k is a constant
that depends on ecological conditions, such as the amount of food present:
$$p_{n+1} = p_n (1 + k(1 - p_n)), \quad k > 0.$$
Imagine we start (n = 1) with a population percentage of 50% ($p_1 = 0.5$) and
wish to know the percentage of population at the following three time instants,
with k = 1.9:
$$p_2 = p_1 (1 + 1.9 \times (1 - p_1)) = 0.9750$$
$$p_3 = p_2 (1 + 1.9 \times (1 - p_2)) = 1.0213$$
$$p_4 = p_3 (1 + 1.9 \times (1 - p_3)) = 0.9800$$
It seems that after an initial growth the population dwindles back. As a matter of
fact, the evolution of $p_n$ shows some oscillation until stabilising at the value 1, the
limiting number of population. However, things get drastically more complicated
when k = 3, as shown in Figure 1.3. A mere deviation in the value of $p_1$ of only
$10^{-6}$ has a drastic influence on $p_n$. For practical purposes, for k around 3 we are
unable to predict the value of $p_n$ after some time, since it is so sensitive to very
small changes of the initial condition $p_1$. In other words, the deterministic $p_n$
process can be dealt with as a random process for some values of k.
[Figure 1.3: two panels, a and b, each plotting $p_n$ versus time (0 to 80)]
Figure 1.3. Two instances of the population growth process for k = 3: a) $p_1$ = 0.1;
b) $p_1$ = 0.100001.
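The sensitivity to the initial condition can be checked numerically. The following is a minimal sketch, not the procedure used to produce Figure 1.3: the function name `growth` and the choice of 80 time instants (read off the figure axis) are assumptions made here. It first reproduces the three values computed for k = 1.9 and then compares the two trajectories for k = 3.

```python
# Sketch: iterate p_{n+1} = p_n * (1 + k * (1 - p_n)).
# `growth` and the 80-instant horizon are illustrative assumptions.

def growth(p1, k, steps):
    """Return the list [p_1, ..., p_steps] of the population growth process."""
    p = [p1]
    for _ in range(steps - 1):
        p.append(p[-1] * (1 + k * (1 - p[-1])))
    return p

# k = 1.9, p_1 = 0.5: the values p_2, p_3, p_4 computed in the text
first = growth(0.5, 1.9, 4)
print([round(v, 4) for v in first[1:]])

# k = 3: two starting values differing by 1e-6
a = growth(0.1, 3.0, 80)
b = growth(0.100001, 3.0, 80)

# Gap between the two trajectories, instant by instant
gaps = [abs(x - y) for x, y in zip(a, b)]
print("gap at n = 5:", gaps[4])
print("largest gap over 80 instants:", max(gaps))
```

The gap is still tiny after a few instants but grows to the same order of magnitude as $p_n$ itself well before the end of the run, which is why the two panels look unrelated.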
The random-like behaviour exhibited by some iterative series is also present in
the so-called “random number generator routine” used in many computer
programs. One such routine iteratively generates $x_n$ as follows:
$$x_{n+1} = \alpha x_n \bmod m.$$
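A minimal sketch of such a routine follows. The text does not give values for α and m; the ones used here (α = 16807, m = 2³¹ − 1, a widely used "minimal standard" choice) and the names `congruential` and `seed` are assumptions for illustration only.

```python
# Sketch of the recurrence x_{n+1} = alpha * x_n mod m.
# alpha = 16807 and m = 2**31 - 1 are assumed values, not taken from the text.

def congruential(seed, alpha=16807, m=2**31 - 1):
    """Yield successive values of the recurrence, starting from x = seed."""
    x = seed
    while True:
        x = (alpha * x) % m
        yield x

gen = congruential(seed=1)
sample = [next(gen) for _ in range(5)]

# Dividing by m maps the integers into the interval ]0, 1[
uniform = [x / (2**31 - 1) for x in sample]

print(sample)
print(uniform)
```

The sequence is fully determined by the seed, yet its values look haphazard, which is the point made above about deterministic series with random-like behaviour.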