
Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics



<b>Springer Texts in Statistics</b>


Anirban DasGupta

Probability for Statistics and Machine Learning Fundamentals and Advanced Topics



Springer New York Dordrecht Heidelberg London <small>Library of Congress Control Number: 2011924777</small>

<small>© Springer Science+Business Media, LLC 2011</small>

<small>All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.</small>

<small>The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.</small>

<small>Printed on acid-free paper</small>

<small>Springer is part of Springer Science+Business Media (www.springer.com)</small>


<i>To Persi Diaconis, Peter Hall, Ashok Maitra, and my mother, with affection</i>


This is the companion second volume to my undergraduate text <i>Fundamentals of Probability: A First Course</i>. The purpose of my writing this book is to give graduate students, instructors, and researchers in statistics, mathematics, and computer science a lucidly written unique text at the confluence of probability, advanced stochastic processes, statistics, and key tools for machine learning. Numerous topics in probability and stochastic processes of current importance in statistics and machine learning that are widely scattered in the literature in many different specialized books are all brought together under one fold in this book. This is done with an extensive bibliography for each topic, and numerous worked-out examples and exercises. Probability, with all its models, techniques, and its poignant beauty, is an incredibly powerful tool for anyone who deals with data or randomness. The content and the style of this book reflect that philosophy; I emphasize lucidity, a wide background, and the far-reaching applicability of probability in science.

The book starts with a self-contained and fairly complete review of basic probability, and then traverses its way through the classics, to advanced modern topics and tools, including a substantial amount of statistics itself. Because of its nearly encyclopaedic coverage, it can serve as a graduate text for a year-long probability sequence, or for focused short courses on selected topics, for self-study, and as a nearly unique reference for research in statistics, probability, and computer science. It provides an extensive treatment of most of the standard topics in a graduate probability sequence, and integrates them with the basic theory and many examples of several core statistical topics, as well as with some tools of major importance in machine learning. This is done with unusually detailed bibliographies for the reader who wants to dig deeper into a particular topic, and with a huge repertoire of worked-out examples and exercises. The total number of worked-out examples in this book is 423, and the total number of exercises is 808. An instructor can rotate the exercises between semesters, and use them for setting exams, and a student can use them for additional exam preparation and self-study. I believe that the book is unique in its range, unification, bibliographic detail, and its collection of problems and examples.

Topics in core probability, such as distribution theory, asymptotics, Markov chains, martingales, Poisson processes, random walks, and Brownian motion, are covered in the first 14 chapters. In these chapters, a reader will also find basic



coverage of such core statistical topics as confidence intervals, likelihood functions, maximum likelihood estimates, posterior densities, sufficiency, hypothesis testing, variance stabilizing transformations, and extreme value theory, all illustrated with many examples. In Chapters 15, 16, and 17, I treat three major topics of great application potential: empirical processes and VC theory, probability metrics, and large deviations. Chapters 18, 19, and 20 are specifically directed to the statistics and machine-learning community, and cover simulation, Markov chain Monte Carlo, the exponential family, bootstrap, the EM algorithm, and kernels.

The book does not make formal use of measure theory. I do not intend to minimize the role of measure theory in a rigorous study of probability. However, I believe that a large amount of probability can be taught, understood, enjoyed, and applied without needing formal use of measure theory. We do it around the world every day. At the same time, some theorems cannot be proved without at least a mention of some measure theory terminology. Even some definitions require a mention of some measure theory notions. I include some unavoidable mention of measure-theoretic terms and results, such as the strong law of large numbers and its proof, the dominated convergence theorem, monotone convergence, Lebesgue measure, and a few others, but only in the advanced chapters in the book.

Following the table of contents, I have suggested some possible courses with different themes using this book. I have also marked the nonroutine and harder exercises in each chapter with an asterisk. Likewise, some specialized sections with reference value have also been marked with an asterisk. Generally, the exercises and the examples come with a caption, so that the reader will immediately know the content of an exercise or an example. The end of the proof of a theorem has been marked by a sign.

My deepest gratitude and appreciation are due to Peter Hall. I am lucky that the style and substance of this book are significantly molded by Peter's influence. Out of habit, I sent him the drafts of nearly every chapter as I was finishing them. It didn't matter where exactly he was; I always received his input and gentle suggestions for improvement. I have found Peter to be a concerned and warm friend, teacher, mentor, and guardian, and for this, I am extremely grateful.

Mouli Banerjee, Rabi Bhattacharya, Burgess Davis, Stewart Ethier, Arthur Frazho, Evarist Giné, T. Krishnan, S. N. Lahiri, Wei-Liem Loh, Hyun-Sook Oh, B. V. Rao, Yosi Rinott, Wen-Chi Tsai, Frederi Viens, and Larry Wasserman graciously went over various parts of this book. I am deeply indebted to each of them. Larry Wasserman, in particular, suggested the chapters on empirical processes, VC theory, concentration inequalities, the exponential family, and Markov chain Monte Carlo. The Springer series editors, Peter Bickel, George Casella, Steve Fienberg, and Ingram Olkin have consistently supported my efforts, and I am so very thankful to them. Springer's incoming executive editor Marc Strauss saw through the final production of this book extremely efficiently, and I have much enjoyed working with him. I appreciated Marc's gentility and his thoroughly professional handling of the transition of the production of this book to his oversight. Valerie Greco did an astonishing job of copyediting the book. The presentation, display, and the grammar of the book are substantially better because of the incredible care


and thoughtfulness that she put into correcting my numerous errors. The staff at SPi Technologies, Chennai, India, did an astounding and marvelous job of producing this book. Six anonymous reviewers gave extremely gracious and constructive comments, and their input has helped me in various dimensions to make this a better book. Doug Crabill is the greatest computer systems administrator, and with an infectious pleasantness has bailed me out of my stupidity far too many times. I also want to mention my fond memories and deep-rooted feelings for the Indian Statistical Institute, where I had all of my college education. It was just a wonderful place for research, education, and friendships. Nearly everything that I know is due to my years at the Indian Statistical Institute, and for this I am thankful.

This is the third time that I have written a book in contract with John Kimmel. John is much more than a nearly unique person in the publishing world. To me, John epitomizes sensitivity and professionalism, a singular combination. I have now known John for almost six years, and it is very, very difficult not to appreciate and admire him a whole lot for his warmth, style, and passion for the subjects of statistics and probability. Ironically, the day that this book entered production, the news came that John was leaving Springer. I will remember John's contribution to my professional growth with enormous respect and appreciation.


<b>Suggested Courses with Different Themes . . . xix</b>

<b>1 Review of Univariate Probability . . . .</b> 1

1.1 Experiments and Sample Spaces . . . . 1

1.2 Conditional Probability and Independence.. . . . 5

1.3 Integer-Valued and Discrete Random Variables . . . . 8

1.3.1 CDF and Independence.. . . . 9

1.3.2 Expectation and Moments. . . 13

1.4 Inequalities . . . 19

1.5 Generating and Moment-Generating Functions . . . 22

1.6  Applications of Generating Functions to a Pattern Problem ... 26

1.7 Standard Discrete Distributions . . . 28

1.8 Poisson Approximation to Binomial . . . 34

1.9 Continuous Random Variables. . . 36

1.10 Functions of a Continuous Random Variable . . . 42

1.10.1 Expectation and Moments. . . 45

1.10.2 Moments and the Tail of a CDF . . . 49

1.11 Moment-Generating Function and Fundamental Inequalities .. . . 51

1.11.1  Inversion of an MGF and Post’s Formula ... 53

1.12 Some Special Continuous Distributions . . . 54

1.13 Normal Distribution and Confidence Interval for a Mean .. . . 61

1.14 Stein’s Lemma .. . . 66

1.15  Chernoff’s Variance Inequality ... 68

1.16  Various Characterizations of Normal Distributions ... 69

1.17 Normal Approximations and Central Limit Theorem . . . 71

1.17.1 Binomial Confidence Interval .. . . 74


<b>2 Multivariate Discrete Distributions . . . 95</b>

2.1 Bivariate Joint Distributions and Expectations of Functions . . . 95

2.2 Conditional Distributions and Conditional Expectations .. . . .100

2.2.1 Examples on Conditional Distributions and Expectations .. . . .101

2.3 Using Conditioning to Evaluate Mean and Variance . . . .104

2.4 Covariance and Correlation . . . .107

3.4 Conditional Densities and Expectations.. . . .140

3.4.1 Examples on Conditional Densities and Expectations .. . . . .142

3.5 Posterior Densities, Likelihood Functions, and Bayes Estimates . . . .147

3.6 Maximum Likelihood Estimates. . . .152

3.7 Bivariate Normal Conditional Distributions . . . .154

3.8  Useful Formulas and Characterizations for Bivariate Normal ...155

3.8.1 Computing Bivariate Normal Probabilities .. . . .157

3.9  Conditional Expectation Given a Set and Borel’s Paradox ...158

References . . . .165

<b>4 Advanced Distribution Theory . . . .167</b>

4.1 Convolutions and Examples . . . .167

4.2 Products and Quotients and the t- and F-Distributions . . . .172

4.3 Transformations . . . .177

4.4 Applications of Jacobian Formula .. . . .178

4.5 Polar Coordinates in Two Dimensions . . . .180

4.6  n-Dimensional Polar and Helmert’s Transformation ...182

4.6.1 Efficient Spherical Calculations with Polar Coordinates . . . .182

4.6.2 Independence of Mean and Variance in Normal Case . . . .185

4.6.3 The t Confidence Interval . . . .187

4.7 The Dirichlet Distribution . . . .188

4.7.1  Picking a Point from the Surface of a Sphere ...191

4.7.2 Poincaré's Lemma ...191

4.8  Ten Important High-Dimensional Formulas for Easy Reference . . . .191

References . . . .197


<b>5 Multivariate Normal and Related Distributions . . . .199</b>

5.1 Definition and Some Basic Properties .. . . .199

5.2 Conditional Distributions . . . .202

5.3 Exchangeable Normal Variables . . . .205

5.4 Sampling Distributions Useful in Statistics . . . .207

5.4.1  Wishart Expectation Identities ...208

5.4.2 * Hotelling's T<sup>2</sup> and Distribution of Quadratic Forms . . . . .209

5.4.3  Distribution of Correlation Coefficient ...212

5.5 Noncentral Distributions . . . .213

5.6 Some Important Inequalities for Easy Reference . . . .214

References . . . .218

<b>6 Finite Sample Theory of Order Statistics and Extremes . . . .221</b>

6.1 Basic Distribution Theory . . . .221

6.2 More Advanced Distribution Theory .. . . .225

6.3 Quantile Transformation and Existence of Moments . . . .229

<b>7 Essential Asymptotics and Applications . . . .249</b>

7.1 Some Basic Notation and Convergence Concepts . . . .250

7.2 Laws of Large Numbers . . . .254


<b>8 Characteristic Functions and Applications . . . .293</b>

8.1 Characteristic Functions of Standard Distributions . . . .294

8.2 Inversion and Uniqueness .. . . .298

8.3 Taylor Expansions, Differentiability, and Moments . . . .302

8.4 Continuity Theorems .. . . .303

8.5 Proof of the CLT and the WLLN . . . .305

8.6  Producing Characteristic Functions ...306

8.7 Error of the Central Limit Theorem . . . .308

8.8 Lindeberg–Feller Theorem for General Independent Case . . . .311

8.9  Infinite Divisibility and Stable Laws ...315

8.10  Some Useful Inequalities ...317

References . . . .322

<b>9 Asymptotics of Extremes and Order Statistics . . . .323</b>

9.1 Central-Order Statistics . . . .323

9.1.1 Single-Order Statistic. . . .323

9.1.2 Two Statistical Applications . . . .325

9.1.3 Several Order Statistics. . . .326

9.2 Extremes .. . . .328

9.2.1 Easily Applicable Limit Theorems . . . .328

9.2.2 The Convergence of Types Theorem . . . .332

9.3  Fisher–Tippett Family and Putting it Together ...333

References . . . .338

<b>10 Markov Chains and Applications . . . .339</b>

10.1 Notation and Basic Definitions . . . .340

10.2 Examples and Various Applications as a Model . . . .340

10.3 Chapman–Kolmogorov Equation .. . . .345

10.4 Communicating Classes . . . .349

10.5 Gambler’s Ruin . . . .352

10.6 First Passage, Recurrence, and Transience .. . . .354

10.7 Long Run Evolution and Stationary Distributions . . . .359

References . . . .374

<b>11 Random Walks . . . .375</b>

11.1 Random Walk on the Cubic Lattice . . . .375

11.1.1 Some Distribution Theory .. . . .378

11.1.2 Recurrence and Transience . . . .379

11.1.3 Pólya's Formula for the Return Probability ...382

11.2 First Passage Time and Arc Sine Law . . . .383

11.3 The Local Time . . . .387

11.4 Practically Useful Generalizations . . . .389

11.5 Wald’s Identity . . . .390

11.6 Fate of a Random Walk . . . .392


11.7 Chung–Fuchs Theorem . . . .394

11.8 Six Important Inequalities . . . .396

References . . . .400

<b>12 Brownian Motion and Gaussian Processes . . . .401</b>

12.1 Preview of Connections to the Random Walk . . . .402

12.2 Basic Definitions . . . .403

12.2.1 Condition for a Gaussian Process to be Markov . . . .406

12.2.2  Explicit Construction of Brownian Motion ...407

12.3 Basic Distributional Properties . . . .408

12.3.1 Reflection Principle and Extremes . . . .410

12.3.2 Path Properties and Behavior Near Zero and Infinity .. . . .412

12.3.3  Fractal Nature of Level Sets ...415

12.4 The Dirichlet Problem and Boundary Crossing Probabilities .. . . .416

12.4.1 Recurrence and Transience . . . .418

12.5 The Local Time of Brownian Motion . . . .419

12.6 Invariance Principle and Statistical Applications . . . .421

12.7 Strong Invariance Principle and the KMT Theorem .. . . .425

12.8 Brownian Motion with Drift and Ornstein–Uhlenbeck Process. . . .427

12.8.1 Negative Drift and Density of Maximum.. . . .427

12.8.2  Transition Density and the Heat Equation ...428

12.8.3  The Ornstein–Uhlenbeck Process ...429

References . . . .435

<b>13 Poisson Processes and Applications . . . .437</b>

13.1 Notation .. . . .438

13.2 Defining a Homogeneous Poisson Process. . . .439

13.3 Important Properties and Uses as a Statistical Model . . . .440

13.4  Linear Poisson Process and Brownian Motion: A Connection ....448

13.5 Higher-Dimensional Poisson Point Processes . . . .450

13.5.1 The Mapping Theorem . . . .452

13.6 One-Dimensional Nonhomogeneous Processes . . . .453

13.7  Campbell’s Theorem and Shot Noise ...456

13.7.1 Poisson Process and Stable Laws . . . .458

References . . . .462

<b>14 Discrete Time Martingales and Concentration Inequalities . . . .463</b>

14.1 Illustrative Examples and Applications in Statistics . . . .463

14.2 Stopping Times and Optional Stopping . . . .468

14.2.1 Stopping Times . . . .469

14.2.2 Optional Stopping . . . .470

14.2.3 Sufficient Conditions for Optional Stopping Theorem . . . . .472

14.2.4 Applications of Optional Stopping . . . .474


14.3 Martingale and Concentration Inequalities.. . . .477

14.3.1 Maximal Inequality .. . . .477

14.3.2  Inequalities of Burkholder, Davis, and Gundy ...480

14.3.3 Inequalities of Hoeffding and Azuma . . . .483

14.3.4  Inequalities of McDiarmid and Devroye ...485

14.3.5 The Upcrossing Inequality . . . .488

14.4 Convergence of Martingales . . . .490

14.4.1 The Basic Convergence Theorem .. . . .490

14.4.2 Convergence in L1 and L2 . . . .493

14.5  Reverse Martingales and Proof of SLLN ...494

14.6 Martingale Central Limit Theorem .. . . .497

References . . . .503

<b>15 Probability Metrics . . . .505</b>

15.1 Standard Probability Metrics Useful in Statistics . . . .505

15.2 Basic Properties of the Metrics . . . .508

15.3 Metric Inequalities . . . .515

15.4 Differential Metrics for Parametric Families . . . .519

15.4.1  Fisher Information and Differential Metrics ...520

15.4.2  Rao’s Geodesic Distances on Distributions ...522

References . . . .525

<b>16 Empirical Processes and VC Theory . . . .527</b>

16.1 Basic Notation and Definitions . . . .527

16.2 Classic Asymptotic Properties of the Empirical Process . . . .529

16.2.1 Invariance Principle and Statistical Applications . . . .531

16.2.2  Weighted Empirical Process ...534

16.2.3 The Quantile Process . . . .536

16.2.4 Strong Approximations of the Empirical Process . . . .537

16.3 Vapnik–Chervonenkis Theory . . . .538

16.3.1 Basic Theory .. . . .538

16.3.2 Concrete Examples . . . .540

16.4 CLTs for Empirical Measures and Applications . . . .543

16.4.1 Notation and Formulation .. . . .543

16.4.2 Entropy Bounds and Specific CLTs. . . .544

16.4.3 Concrete Examples . . . .547

16.5 Maximal Inequalities and Symmetrization .. . . .547

16.6  Connection to the Poisson Process ...551

References . . . .557

<b>17 Large Deviations . . . .559</b>

17.1 Large Deviations for Sample Means . . . .560

17.1.1 The Cramér–Chernoff Theorem in <i>R</i> ...560

17.1.2 Properties of the Rate Function . . . .564

17.1.3 Cramér's Theorem for General Sets . . . .566


17.2 The Gärtner–Ellis Theorem and Markov Chain Large Deviations . . . .567

17.3 The t-Statistic . . . .570

17.4 Lipschitz Functions and Talagrand’s Inequality . . . .572

17.5 Large Deviations in Continuous Time. . . .574

17.5.1  Continuity of a Gaussian Process...576

17.5.2  Metric Entropy of T and Tail of the Supremum ...577

References . . . .582

<b>18 The Exponential Family and Statistical Applications . . . .583</b>

18.1 One-Parameter Exponential Family . . . .583

18.1.1 Definition and First Examples . . . .584

18.2 The Canonical Form and Basic Properties . . . .589

18.2.1 Convexity Properties . . . .590

18.2.2 Moments and Moment Generating Function .. . . .591

18.2.3 Closure Properties . . . .594

18.3 Multiparameter Exponential Family. . . .596

18.4 Sufficiency and Completeness .. . . .600

18.4.1  Neyman–Fisher Factorization and Basu’s Theorem ...602

18.4.2  Applications of Basu’s Theorem to Probability...604

18.5 Curved Exponential Family . . . .607

References . . . .612

<b>19 Simulation and Markov Chain Monte Carlo . . . .613</b>

19.1 The Ordinary Monte Carlo. . . .615

19.1.1 Basic Theory and Examples. . . .615

19.1.2 Monte Carlo P-Values . . . .622

19.1.3 Rao–Blackwellization . . . .623

19.2 Textbook Simulation Techniques .. . . .624

19.2.1 Quantile Transformation and Accept–Reject . . . .624

19.2.2 Importance Sampling and Its Asymptotic Properties . . . .629

19.2.3 Optimal Importance Sampling Distribution .. . . .633

19.2.4 Algorithms for Simulating from Common Distributions . . . .634

19.3 Markov Chain Monte Carlo . . . .637

19.3.1 Reversible Markov Chains . . . .639

19.3.2 Metropolis Algorithms . . . .642

19.4 The Gibbs Sampler . . . .645

19.5 Convergence of MCMC and Bounds on Errors . . . .651


19.6 MCMC on General Spaces . . . .662

19.6.1 General Theory and Metropolis Schemes . . . .662

19.6.2 Convergence . . . .666

19.6.3 Convergence of the Gibbs Sampler . . . .670

19.7 Practical Convergence Diagnostics . . . .673

20.1.3  Higher-Order Accuracy of the Bootstrap...699

20.1.4 Bootstrap for Dependent Data . . . .701

20.2 The EM Algorithm . . . .704

20.2.1 The Algorithm and Examples . . . .706

20.2.2 Monotone Ascent and Convergence of EM . . . .711

20.2.3  Modifications of EM ...714

20.3 Kernels and Classification . . . .715

20.3.1 Smoothing by Kernels . . . .715

20.3.2 Some Common Kernels in Use . . . .717

20.3.3 Kernel Density Estimation . . . .719

20.3.4 Kernels for Statistical Classification . . . .724

20.3.5 Mercer’s Theorem and Feature Maps. . . .732

References . . . .744

<b>A Symbols, Useful Formulas, and Normal Table . . . .747</b>

A.1 Glossary of Symbols . . . .747

A.2 Moments and MGFs of Common Distributions . . . .750

A.3 Normal Table . . . .755

<b>Author Index . . . .757</b>

<b>Subject Index . . . .763</b>


<b>Suggested Courses with Different Themes</b>

<small>15 weeks: Special topics for statistics students (Chapters 9, 10, 15, 16, 17, 18, 20)
15 weeks: Special topics for computer science students (Chapters 4, 11, 14, 16, 17, 18, 19)
8 weeks: Summer course for statistics students (Chapters 11, 12, 14, 20)
8 weeks: Summer course for computer science students (Chapters 14, 16, 18, 20)
8 weeks: Summer course on modeling and simulation (Chapters 4, 10, 13, 19)</small>



<b>Chapter 1</b>

<b>Review of Univariate Probability</b>

Probability is a universally accepted tool for expressing degrees of confidence or doubt about some proposition in the presence of incomplete information or uncertainty. By convention, probabilities are calibrated on a scale of 0 to 1; assigning something a zero probability amounts to expressing the belief that we consider it impossible, whereas assigning a probability of one amounts to considering it a certainty. Most propositions fall somewhere in between. Probability statements that we make can be based on our past experience, or on our personal judgments. Whether our probability statements are based on past experience or subjective personal judgments, they obey a common set of rules, which we can use to treat probabilities in a mathematical framework, and also for making decisions on predictions, for understanding complex systems, or as intellectual experiments and for entertainment. Probability theory is one of the most applicable branches of mathematics. It is used as the primary tool for analyzing statistical methodologies; it is used routinely in nearly every branch of science, such as biology, astronomy and physics, medicine, economics, chemistry, sociology, ecology, finance, and many others. A background in the theory, models, and applications of probability is almost a part of basic education. That is how important it is.

For a classic and lively introduction to the subject of probability, we recommend Feller (1968, 1971). Among numerous other expositions of the theory of probability, a variety of examples on various topics can be seen in Ross (1984), Stirzaker (1994), Pitman (1992), Bhattacharya and Waymire (2009), and DasGupta (2010). Ash (1972), Chung (1974), Breiman (1992), Billingsley (1995), and Dudley (2002) are masterly accounts of measure-theoretic probability.

<b>1.1 Experiments and Sample Spaces</b>

Treatment of probability theory starts with the consideration of a <i>sample space</i>.

The sample space is the set of all possible outcomes in some physical experiment. For example, if a coin is tossed twice and after each toss the face that shows is recorded, then the possible outcomes of this particular coin-tossing experiment, say

<i><small>A. DasGupta, Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics, Springer Texts in Statistics, DOI 10.1007/978-1-4419-9634-3 1,</small></i>





Ω, are HH, HT, TH, TT, with H denoting the occurrence of heads and T denoting the occurrence of tails. We call

Ω = {HH, HT, TH, TT}

the sample space of the experiment.

In general, a sample space is a general set Ω, finite or infinite. An easy example where the sample space Ω is infinite is to toss a coin until the first time heads show up and record the number of the trial at which the first head appeared. In this case, the sample space Ω is the countably infinite set

Ω = {1, 2, 3, ...}.

Sample spaces can also be uncountably infinite; for example, consider the experiment of choosing a number at random from the interval [0, 1]. The sample space of this experiment is Ω = [0, 1]. In this case, Ω is an uncountably infinite set. In all cases, individual elements of a sample space are denoted as ω. The first task is to define <i>events</i> and to explain the meaning of the probability of an event.

<b>Definition 1.1.</b> Let Ω be the sample space of an experiment. Then any subset A of Ω, including the empty set ∅ and the entire sample space Ω, is called an <i>event</i>.

Events may contain even one single sample point ω, in which case the event is a singleton set {ω}. We want to assign probabilities to events. But we want to assign probabilities in a way that they are logically consistent. In fact, this cannot be done in general if we insist on assigning probabilities to arbitrary collections of sample points, that is, arbitrary subsets of the sample space Ω. We can only define probabilities for such subsets of Ω that are tied together like a family, the exact concept being that of a σ-field. In most applications, including those cases where the sample space Ω is infinite, events that we would want to normally think about will be members of such an appropriate σ-field. So we do not mention the need for consideration of σ-fields any further, and get along with thinking of events as subsets of the sample space Ω, including in particular the empty set ∅ and the entire sample space Ω itself.

Here is a definition of what counts as a legitimate probability on events.

<b>Definition 1.2.</b> Given a sample space Ω, a <i>probability</i> or a <i>probability measure</i> on Ω is a function P on subsets of Ω such that

(a) P(A) ≥ 0 for any A ⊆ Ω;
(b) P(Ω) = 1;
(c) given disjoint subsets A1, A2, ... of Ω, P(A1 ∪ A2 ∪ ⋯) = P(A1) + P(A2) + ⋯.

Property (c) is known as <i>countable additivity</i>. Note that it is not something that can be proved, but it is like an assumption or an axiom. In our experience, we have seen that operating as if the assumption is correct leads to useful and credible answers in many problems, and so we accept it as a reasonable assumption. Not all



probabilists agree that countable additivity is natural; but we do not get into that debate in this book. One important point is that finite additivity is subsumed in countable additivity; that is, if there are some finite number m of disjoint subsets A1, A2, ..., Am of Ω, then P(A1 ∪ A2 ∪ ⋯ ∪ Am) = P(A1) + P(A2) + ⋯ + P(Am). Also, it is useful to note that the last two conditions in the definition of a probability measure imply that P(∅), the probability of the empty set or the <i>null event</i>, is zero.

One notational convention is that strictly speaking, for an event that is just a singleton set {ω}, we should write P({ω}) to denote its probability. But to reduce clutter, we simply use the more convenient notation P(ω).
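The axioms can be checked concretely on a small finite sample space. The following sketch (my own encoding, not from the text) represents the equally likely measure on the two-toss sample space Ω = {HH, HT, TH, TT} and verifies nonnegativity, P(Ω) = 1, P(∅) = 0, and finite additivity for a pair of disjoint events.

```python
from fractions import Fraction

# The two-toss sample space, with all four sample points equally likely.
omega = {"HH", "HT", "TH", "TT"}

def P(event):
    """Probability of an event (a subset of omega) under equal likelihood."""
    assert event <= omega, "events must be subsets of the sample space"
    return Fraction(len(event), len(omega))

# Axiom (b), and the consequence P(null event) = 0:
assert P(omega) == 1
assert P(set()) == 0

# Finite additivity for the disjoint events {HH} and {HT, TH}:
A, B = {"HH"}, {"HT", "TH"}
assert P(A | B) == P(A) + P(B)
```

Using `Fraction` keeps the probabilities exact, so the additivity check is an identity rather than a floating-point approximation.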

One pleasant consequence of the axiom of countable additivity is the following basic result. We do not prove it here as it is a simple result; see DasGupta (2010) for a proof.

<b>Theorem 1.1.</b> <i>Let A1 ⊇ A2 ⊇ A3 ⊇ ⋯ be an infinite family of subsets of a sample space Ω such that An ↓ A. Then, P(An) → P(A) as n → ∞.</i>

Next, the concept of <i>equally likely</i> sample points is a very fundamental one.

<b>Definition 1.3.</b> Let Ω be a finite sample space consisting of N sample points. We say that the sample points are <i>equally likely</i> if P(ω) = 1/N for each sample point ω.

An immediate consequence, due to the additivity axiom, is the following useful formula.

<b>Proposition.</b> <i>Let Ω be a finite sample space consisting of N equally likely sample points. Let A be any event and suppose A contains n distinct sample points. Then</i>

P(A) = n/N = (number of sample points favorable to A)/(total number of sample points).
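The proposition is pure counting, and a two-line sketch makes it concrete. The die-roll example below is my own choice, not one from the text.

```python
from fractions import Fraction

# One roll of a fair die: six equally likely sample points.
omega = {1, 2, 3, 4, 5, 6}
A = {n for n in omega if n % 2 == 0}   # the event "an even number shows"

# P(A) = (favorable points) / (total points) = 3/6 = 1/2.
P_A = Fraction(len(A), len(omega))
assert P_A == Fraction(1, 2)
```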

Let us see some examples.

<i>Example 1.1 (The Shoe Problem).</i> Suppose there are five pairs of shoes in a closet and four shoes are taken out at random. What is the probability that among the four that are taken out, there is at least one complete pair?

The total number of sample points is $\binom{10}{4} = 210$. Because the selection was done completely at random, we assume that all sample points are equally likely. At least one complete pair would mean two complete pairs, or exactly one complete pair and two other nonconforming shoes. Two complete pairs can be chosen in $\binom{5}{2} = 10$ ways; exactly one complete pair can be chosen in $5 \times \binom{4}{2} \times 2 \times 2 = 120$ ways, where the $\binom{4}{2}$ term is for choosing two incomplete pairs, and then from each incomplete pair, one chooses the left or the right shoe. Thus, the probability that there will be at least one complete pair among the four shoes chosen is $(10 + 120)/210 = 13/21 = .62$.
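These counts are easy to confirm by brute-force enumeration. The sketch below (in Python; the pair/side labels are ours, chosen only for bookkeeping) checks the answer $13/21$:

```python
from fractions import Fraction
from itertools import combinations

# Label the ten shoes as (pair, side): five pairs, each with a left and a right shoe.
shoes = [(pair, side) for pair in range(5) for side in ("L", "R")]

# All C(10, 4) = 210 equally likely selections of four shoes.
selections = list(combinations(shoes, 4))

def has_complete_pair(selection):
    # A complete pair means some pair label appears twice among the four shoes.
    labels = [pair for (pair, _) in selection]
    return any(labels.count(pair) == 2 for pair in set(labels))

favorable = sum(has_complete_pair(s) for s in selections)
prob = Fraction(favorable, len(selections))
print(len(selections), favorable, prob)  # 210 130 13/21
```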

<small>1 Review of Univariate Probability</small>

<i>Example 1.2 (Five-Card Poker).</i> In five-card poker, a player is given 5 cards from a full deck of 52 cards at random. Various named hands of varying degrees of rarity exist. In particular, we want to calculate the probabilities of $A$ = two pairs and $B$ = a flush. Two pairs is a hand with 2 cards each of 2 different denominations and the fifth card of some other denomination; a flush is a hand with 5 cards of the same suit, but the cards cannot be of denominations in a sequence.

To find $P(A)$, choose the 2 denominations that are paired, then 2 suits within each, and finally the fifth card from the 44 cards of the remaining 11 denominations:

$$P(A) = \frac{\binom{13}{2}\binom{4}{2}^2 \times 44}{\binom{52}{5}} = .0475.$$

To find $P(B)$, note that there are 10 ways to select 5 cards from a suit such that the cards are in a sequence, namely, $\{A,2,3,4,5\}, \{2,3,4,5,6\}, \ldots, \{10,J,Q,K,A\}$, and so,

$$P(B) = \frac{4\left[\binom{13}{5} - 10\right]}{\binom{52}{5}} = .002.$$

These are basic examples of counting arguments that are useful whenever there is a finite sample space and we assume that all sample points are equally likely.
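Both probabilities are quick to reproduce with `math.comb`; the variable names in this sketch are ours:

```python
from math import comb

hands = comb(52, 5)  # 2,598,960 equally likely five-card hands

# Two pairs: 2 paired denominations, 2 suits within each, and a fifth card
# drawn from the 44 cards of the remaining 11 denominations.
two_pairs = comb(13, 2) * comb(4, 2) ** 2 * 44

# Flush: 5 cards from one suit, minus the 10 straight sequences per suit.
flush = 4 * (comb(13, 5) - 10)

print(round(two_pairs / hands, 4), round(flush / hands, 5))  # 0.0475 0.00197
```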

A major result in combinatorial probability is the <i>inclusion–exclusion formula</i>, which says the following.

<i><b>Theorem 1.2.</b> Let $A_1, A_2, \ldots, A_n$ be $n$ general events. Let
$$S_1 = \sum_{i=1}^{n} P(A_i), \quad S_2 = \sum_{i<j} P(A_i \cap A_j), \quad \ldots, \quad S_j = \sum_{i_1 < i_2 < \cdots < i_j} P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_j}), \quad \ldots$$
Then
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = S_1 - S_2 + S_3 - \cdots + (-1)^{n+1} S_n.$$</i>

<i>Example 1.3 (Missing Suits in a Bridge Hand).</i> Consider a specific player, say North, in a Bridge game. We want to calculate the probability that North's hand is void in at least one suit. Towards this, denote the suits as 1, 2, 3, 4 and let

$$A_i = \text{North's hand is void in suit } i.$$

Then, by the inclusion–exclusion formula,

$$P(\text{North's hand is void in at least one suit}) = 4\,\frac{\binom{39}{13}}{\binom{52}{13}} - \binom{4}{2}\frac{\binom{26}{13}}{\binom{52}{13}} + 4\,\frac{\binom{13}{13}}{\binom{52}{13}} \approx .051;$$

note that $S_4 = 0$, because a 13-card hand cannot be void in all four suits.
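The inclusion–exclusion computation is easy to check numerically (a Python sketch; names ours):

```python
from math import comb

hands = comb(52, 13)  # all possible 13-card hands for North

# Inclusion-exclusion over the four suits; the fourth term S4 = 0 because
# a 13-card hand cannot avoid all four suits.
s1 = 4 * comb(39, 13)
s2 = comb(4, 2) * comb(26, 13)
s3 = 4 * comb(13, 13)

p_void = (s1 - s2 + s3) / hands
print(round(p_void, 4))  # 0.0511
```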


The inclusion–exclusion formula can be hard to apply exactly, because the quantities $S_j$ for large indices $j$ can be difficult to calculate. Fortunately, however, the inclusion–exclusion formula leads to bounds in both directions for the probability of the union of $n$ general events. We have the following series of bounds.

<i><b>Theorem 1.3 (Bonferroni Bounds).</b> Given $n$ events $A_1, A_2, \ldots, A_n$, let $p_n = P\big(\bigcup_{i=1}^{n} A_i\big)$. Then, for each $k \geq 1$,
$$p_n \leq S_1 - S_2 + S_3 - \cdots + S_{2k+1}, \qquad p_n \geq S_1 - S_2 + S_3 - \cdots - S_{2k};$$
that is, truncating the inclusion–exclusion expansion after an odd number of terms gives an upper bound on $p_n$, and truncating it after an even number of terms gives a lower bound.</i>

<b>1.2 Conditional Probability and Independence</b>

Both conditional probability and independence are fundamental concepts for probabilists and statisticians alike. Conditional probabilities correspond to updating one's beliefs when new information becomes available. Independence corresponds to irrelevance of a piece of new information, even when it is made available. In addition, the assumption of independence can and does significantly simplify the development, mathematical analysis, and justification of tools and procedures.

<b>Definition 1.4.</b> Let $A, B$ be general events with respect to some sample space $\Omega$, and suppose $P(A) > 0$. The <i>conditional probability of $B$ given $A$</i> is defined as

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}.$$

Some immediate consequences of the definition of a conditional probability are the following.

<i><b>Theorem 1.4.</b> (a) <b>(Multiplicative Formula)</b> For any two events $A, B$ such that $P(A) > 0$, one has $P(A \cap B) = P(A)P(B \mid A)$;

(b) for any two events $A, B$ such that $0 < P(A) < 1$, one has $P(B) = P(B \mid A)P(A) + P(B \mid A^c)P(A^c)$;

(c) <b>(Total Probability Formula)</b> if $A_1, A_2, \ldots, A_k$ form a partition of the sample space (i.e., $A_i \cap A_j = \emptyset$ for all $i \neq j$, and $\bigcup_{i=1}^{k} A_i = \Omega$), then for any event $B$, $P(B) = \sum_{i=1}^{k} P(B \mid A_i)P(A_i)$;</i>


<i>(d) <b>(Hierarchical Multiplicative Formula)</b> Let $A_1, A_2, \ldots, A_k$ be $k$ general events in a sample space. Then
$$P(A_1 \cap A_2 \cap \cdots \cap A_k) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) \cdots P(A_k \mid A_1 \cap A_2 \cap \cdots \cap A_{k-1}).$$</i>

<i>Example 1.4.</i> One of two urns has $a$ red and $b$ black balls, and the other has $c$ red and $d$ black balls. One ball is chosen at random from each urn, and then one of these two balls is chosen at random. What is the probability that this ball is red?

If each ball selected from the two urns is red, then the final ball is definitely red. If exactly one of those two balls is red, then the final ball is red with probability 1/2. If neither of those two balls is red, then the final ball cannot be red. Therefore, by the total probability formula,

$$P(\text{final ball is red}) = \frac{a}{a+b}\cdot\frac{c}{c+d} + \frac{1}{2}\left[\frac{a}{a+b}\cdot\frac{d}{c+d} + \frac{b}{a+b}\cdot\frac{c}{c+d}\right] = \frac{1}{2}\left(\frac{a}{a+b} + \frac{c}{c+d}\right).$$

For instance, if $a = 99, b = 1, c = 1, d = 1$, this equals $\frac{1}{2}(.99 + .5) = .745$. Although the total percentage of red balls in the two urns is then more than 98%, the chance that the final ball selected would be red is just about 75%.
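The answer $\frac{1}{2}\big(\frac{a}{a+b} + \frac{c}{c+d}\big)$ can be checked by Monte Carlo simulation; in this sketch the urn composition and function names are illustrative choices of ours:

```python
import random

random.seed(1)

def final_ball_is_red(a, b, c, d):
    """Draw one ball at random from each urn, then pick one of those two at random."""
    ball1_red = random.random() < a / (a + b)
    ball2_red = random.random() < c / (c + d)
    return random.choice([ball1_red, ball2_red])

a, b, c, d = 99, 1, 1, 1          # illustrative composition
trials = 100_000
estimate = sum(final_ball_is_red(a, b, c, d) for _ in range(trials)) / trials
exact = 0.5 * (a / (a + b) + c / (c + d))
print(round(exact, 3))  # 0.745
```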

<i>Example 1.5 (A Clever Conditioning Argument).</i> Coin $A$ gives heads with probability $s$ and coin $B$ gives heads with probability $t$. They are tossed alternately, starting off with coin $A$. We want to find the probability that the first head is obtained on coin $A$.

We find this probability by conditioning on the outcomes of the first two tosses; more precisely, define

$$A_1 = \{H\} = \text{first toss gives } H; \quad A_2 = \{TH\}; \quad A_3 = \{TT\}.$$

Let also

$$A = \text{the first head is obtained on coin } A.$$

One of the three events $A_1, A_2, A_3$ must happen, and they are also mutually exclusive. Therefore, by the total probability formula,

$$P(A) = P(A \mid A_1)P(A_1) + P(A \mid A_2)P(A_2) + P(A \mid A_3)P(A_3) = 1 \cdot s + 0 \cdot (1-s)t + P(A)(1-s)(1-t),$$

because if both of the first two tosses fail to produce a head, the game starts afresh. Solving for $P(A)$, we get $P(A) = \frac{s}{s + t - st}$.


As an example, let $s = .4, t = .5$. Note that coin $A$ is biased against heads. Even then, $s/(s + t - st) = .57 > .5$. We see that there is an advantage in starting first.
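A short simulation confirms the conditioning argument (the function name and trial count below are ours):

```python
import random

random.seed(0)

def first_head_on_A(s, t):
    """Toss coin A (P(heads) = s), then coin B (P(heads) = t), alternately,
    until a head appears; report whether it came on coin A."""
    while True:
        if random.random() < s:
            return 1        # first head on coin A
        if random.random() < t:
            return 0        # first head on coin B

s, t = 0.4, 0.5
trials = 100_000
estimate = sum(first_head_on_A(s, t) for _ in range(trials)) / trials
exact = s / (s + t - s * t)
print(round(exact, 4))  # 0.5714
```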

<b>Definition 1.5.</b> A collection of events $A_1, A_2, \ldots, A_n$ is said to be <i>mutually independent</i> (or just <i>independent</i>) if for each $k, 1 \leq k \leq n$, and any $k$ of the events, $A_{i_1}, \ldots, A_{i_k}$, $P(A_{i_1} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k})$. They are called <i>pairwise independent</i> if this property holds for $k = 2$.

<i>Example 1.6 (Lotteries).</i> Although many people buy lottery tickets out of an expectation of good luck, probabilistically speaking, buying lottery tickets is usually a waste of money. Here is an example. Suppose in a weekly state lottery, five of the numbers $00, 01, \ldots, 49$ are selected without replacement at random, and someone holding exactly those numbers wins the lottery. Then, the probability that someone holding one ticket will be the winner in a given week is

$$\frac{1}{\binom{50}{5}} = \frac{1}{2118760} = 4.72 \times 10^{-7}.$$

Suppose this person buys a ticket every week for 40 years. Then, the probability that he will win the lottery on at least one week is $1 - (1 - 4.72 \times 10^{-7})^{52 \times 40} = .00098 < .001$, still a very small probability. We assumed in this calculation that the weekly lotteries are all mutually independent, a reasonable assumption. The calculation would fall apart if we did not make this independence assumption.
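The numbers are easy to reproduce (a Python sketch):

```python
from math import comb

p_week = 1 / comb(50, 5)                 # one ticket in one weekly drawing
p_ever = 1 - (1 - p_week) ** (52 * 40)   # 40 years of independent weekly plays
print(comb(50, 5), round(p_ever, 5))  # 2118760 0.00098
```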

It is not uncommon to see the conditional probabilities $P(A \mid B)$ and $P(B \mid A)$ confused with each other. Suppose in some group of lung cancer patients, we see a large percentage of smokers. If we define $B$ to be the event that a person is a smoker, and $A$ to be the event that a person has lung cancer, then all we can conclude is that in our group of people $P(B \mid A)$ is large. But we cannot conclude from just this information that smoking increases the chance of lung cancer, that is, that $P(A \mid B)$ is large. In order to calculate a conditional probability $P(A \mid B)$ when we know the other conditional probability $P(B \mid A)$, a simple formula known as <i>Bayes' theorem</i> is useful. Here is a statement of a general version of Bayes' theorem.

<i><b>Theorem 1.5.</b> Let $\{A_1, A_2, \ldots, A_m\}$ be a partition of a sample space $\Omega$. Let $B$ be some fixed event. Then
$$P(A_j \mid B) = \frac{P(B \mid A_j)P(A_j)}{\sum_{i=1}^{m} P(B \mid A_i)P(A_i)}.$$</i>

<i>Example 1.7 (Multiple Choice Exams).</i> Suppose that the questions in a multiple choice exam have five alternatives each, of which a student has to pick one as the correct alternative. A student either knows the truly correct alternative, with probability $.7$, or she randomly picks one of the five alternatives as her choice. Suppose a particular problem was answered correctly. We want to know what the probability is that the student really knew the correct answer.


Define $A$ = the student knew the correct answer, and $B$ = the student answered the question correctly.

We want to compute $P(A \mid B)$. By Bayes' theorem,

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)} = \frac{1 \times .7}{1 \times .7 + .2 \times .3} = .921.$$

Before the student answered the question, our probability that she would know the correct answer to the question was $.7$; but once she answered it correctly, the posterior probability that she knew the correct answer increases to $.921$. This is exactly what Bayes' theorem does; it updates our <i>prior</i> belief to the <i>posterior</i> belief when new evidence becomes available.
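The same computation can be packaged as a small function for the two-event partition $\{A, A^c\}$; the function and argument names here are ours:

```python
def posterior(prior, likelihood, likelihood_complement):
    """P(A | B) from P(A), P(B | A), and P(B | A^c) via Bayes' theorem."""
    numer = likelihood * prior
    return numer / (numer + likelihood_complement * (1 - prior))

# She answers correctly for sure if she knows the answer, and with
# probability 1/5 = .2 if she guesses among the five alternatives.
p_knew = posterior(prior=0.7, likelihood=1.0, likelihood_complement=0.2)
print(round(p_knew, 3))  # 0.921
```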

<b>1.3 Integer-Valued and Discrete Random Variables</b>

In some sense, the entire subject of probability and statistics is about distributions of random variables. Random variables, as the very name suggests, are quantities that vary, over time or from individual to individual, and the reason for the variability is some underlying random process. Depending on exactly how the underlying experiment ends, the random variable takes different values. In other words, the value of the random variable is determined by the sample point $\omega$ that prevails when the underlying experiment is actually conducted. We cannot know a priori the value of the random variable, because we do not know a priori which sample point $\omega$ will prevail when the experiment is conducted. We try to understand the behavior of a random variable by analyzing the probability structure of the underlying random experiment.

Random variables, like probabilities, originated in gambling. Therefore, the random variables that come to us most naturally are integer-valued random variables; for example, the sum of the two rolls when a die is rolled twice. Integer-valued random variables are special cases of what are known as discrete random variables. Discrete or not, a common mathematical definition of all random variables is the following.

<b>Definition 1.6.</b> Let $\Omega$ be a sample space corresponding to some experiment, and let $X : \Omega \to \mathbb{R}$ be a function from the sample space to the real line. Then $X$ is called a <i>random variable</i>.

Discrete random variables are those that take a finite or a countably infinite number of possible values. In particular, all integer-valued random variables are discrete. From the point of view of understanding the behavior of a random variable, the important thing is to know the probabilities with which X takes its different possible values.


<b>Definition 1.7.</b> Let $X : \Omega \to \mathbb{R}$ be a discrete random variable taking a finite or countably infinite number of values $x_1, x_2, x_3, \ldots$. The <i>probability distribution</i> or the <i>probability mass function (pmf)</i> of $X$ is the function $p(x) = P(X = x)$, $x = x_1, x_2, x_3, \ldots$, and $p(x) = 0$ otherwise.

It is common not to explicitly mention the phrase "$p(x) = 0$ otherwise," and we generally follow this convention. Some authors use the phrase <i>mass function</i> instead of <i>probability mass function</i>.

For any pmf, one must have $p(x) \geq 0$ for any $x$, and $\sum_i p(x_i) = 1$. Any function satisfying these two properties for some set of numbers $x_1, x_2, x_3, \ldots$ is a valid pmf.

<i><b>1.3.1 CDF and Independence</b></i>

A second important definition is that of a <i>cumulative distribution function (CDF)</i>. The CDF gives the probability that a random variable $X$ is less than or equal to any given number $x$. It is important to understand that the notion of a CDF is universal to all random variables; it is not limited to only the discrete ones.

<b>Definition 1.8.</b> The <i>cumulative distribution function</i> of a random variable $X$ is the function $F(x) = P(X \leq x)$, $x \in \mathbb{R}$.

<b>Definition 1.9.</b> Let $X$ have the CDF $F(x)$. Any number $m$ such that $P(X \leq m) \geq .5$, and also $P(X \geq m) \geq .5$, is called a <i>median</i> of $F$, or equivalently, a median of $X$.

<i>Remark.</i> The median of a random variable need not be unique. A simple way to characterize all the medians of a distribution is available.

<i><b>Proposition.</b> Let $X$ be a random variable with the CDF $F(x)$. Let $m_0$ be the first $x$ such that $F(x) \geq .5$, and let $m_1$ be the last $x$ such that $P(X \geq x) \geq .5$. Then a number $m$ is a median of $X$ if and only if $m \in [m_0, m_1]$.</i>

The CDF of any random variable satisfies a set of properties. Conversely, any function satisfying these properties is a valid CDF; that is, it will be the CDF of some appropriately chosen random variable. These properties are given in the next result.

<i><b>Theorem 1.6.</b> A function $F(x)$ is the CDF of some real-valued random variable $X$ if and only if it satisfies all of the following properties.

(a) $0 \leq F(x) \leq 1 \;\; \forall x \in \mathbb{R}$.

(b) $F(x) \to 0$ as $x \to -\infty$, and $F(x) \to 1$ as $x \to \infty$.

(c) Given any real number $a$, $F(x) \downarrow F(a)$ as $x \downarrow a$.

(d) Given any two real numbers $x, y$ with $x < y$, $F(x) \leq F(y)$.</i>

Property (c) is called continuity from the right, or simply right continuity. It is clear that a CDF need not be continuous from the left; indeed, for discrete random variables, the CDF has a jump at each value of the random variable, and at the jump points the CDF is not left continuous. More precisely, one has the following result.


<i><b>Proposition.</b> Let $F(x)$ be the CDF of some random variable $X$. Then, for any $x$,

(a) $P(X = x) = F(x) - \lim_{y \uparrow x} F(y) = F(x) - F(x-)$, including those points $x$ for which $P(X = x) = 0$;

(b) $P(X \geq x) = P(X > x) + P(X = x) = (1 - F(x)) + (F(x) - F(x-)) = 1 - F(x-)$.</i>

<i>Example 1.8 (Bridge).</i> Consider the random variable

$$X = \text{number of aces in North's hand in a Bridge game}.$$

Clearly, $X$ can take any of the values $x = 0, 1, 2, 3, 4$. If $X = x$, then the other $13 - x$ cards in North's hand must be non-ace cards. Thus, the pmf of $X$ is

$$P(X = x) = \frac{\binom{4}{x}\binom{48}{13-x}}{\binom{52}{13}}, \quad x = 0, 1, 2, 3, 4.$$

The CDF of $X$ is a jump function, taking jumps at the values $0, 1, 2, 3, 4$, namely the possible values of $X$: $F(x) = 0$ for $x < 0$, $F(x) = \sum_{k=0}^{\lfloor x \rfloor} P(X = k)$ for $0 \leq x < 4$, and $F(x) = 1$ for $x \geq 4$.
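The pmf and the CDF jumps are easy to tabulate exactly with `math.comb` and `Fraction` (a sketch; names ours):

```python
from fractions import Fraction
from math import comb

# Hypergeometric pmf: x aces and 13 - x non-aces in a random 13-card hand.
pmf = {x: Fraction(comb(4, x) * comb(48, 13 - x), comb(52, 13)) for x in range(5)}

# The CDF jumps exactly at the possible values 0, 1, 2, 3, 4.
cdf, running = {}, Fraction(0)
for x in range(5):
    running += pmf[x]
    cdf[x] = running

print(round(float(pmf[0]), 4), round(float(pmf[4]), 5))  # 0.3038 0.00264
```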

<i>Example 1.9 (Indicator Variables).</i> Consider the experiment of rolling a fair die twice and now define a random variable $Y$ as follows:

$$Y = 1 \text{ if the sum of the two rolls } X \text{ is an even number}; \quad Y = 0 \text{ if the sum of the two rolls } X \text{ is an odd number}.$$

If we let $A$ be the event that $X$ is an even number, then $Y = 1$ if $A$ happens, and $Y = 0$ if $A$ does not happen. Such random variables are called <i>indicator random variables</i> and are immensely useful in mathematical calculations in many complex situations.


<b>Definition 1.10.</b> Let $A$ be any event in a sample space $\Omega$. The <i>indicator random variable</i> for $A$ is defined as

$$I_A = 1 \text{ if } A \text{ happens}; \quad I_A = 0 \text{ if } A \text{ does not happen}.$$

Thus, the distribution of an indicator variable is simply $P(I_A = 1) = P(A)$; $P(I_A = 0) = 1 - P(A)$.

An indicator variable is also called a <i>Bernoulli variable with parameter $p$</i>, where $p$ is just $P(A)$. We later show examples of uses of indicator variables in the calculation of expectations.

In applications, we are sometimes interested in the distribution of a function, say $g(X)$, of a basic random variable $X$. In the discrete case, the distribution of a function is found in the obvious way.

<i><b>Proposition.</b> Let $X$ be a discrete random variable and $Y = g(X)$ a real-valued function of $X$. Then $P(Y = y) = \sum_{x:\, g(x) = y} P(X = x)$.</i>

<i>Example 1.10.</i> Suppose $X$ takes the values $0, \pm 1, \pm 2, \pm 3$ with the pmf $p(x) = \frac{c}{x^2 + 1}$; summing the pmf over these seven values and equating the total to one gives $c = 5/13$. Consider the two functions $Y = g(X) = X^3$ and $Z = h(X) = \sin\big(\frac{\pi X}{2}\big)$. Note that $g(X)$ is a one-to-one function of $X$, but $h(X)$ is not one-to-one. The values of $Y$ are $0, \pm 1, \pm 8, \pm 27$. For example, $P(Y = 0) = P(X = 0) = c = 5/13$; $P(Y = 1) = P(X = 1) = c/2 = 5/26$, and so on. In general, for $y = 0, \pm 1, \pm 8, \pm 27$, $P(Y = y) = P(X = y^{1/3})$.


So, for example, $P(Z = 0) = P(X = -2) + P(X = 0) + P(X = 2) = \frac{7}{5}c = 7/13$. The pmf of $Z = h(X)$ is:

$z$: $\;-1$, $\;0$, $\;1$; $\qquad P(Z = z)$: $\;3/13$, $\;7/13$, $\;3/13$.
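The preimage-summing recipe translates directly into code. In this sketch the function name is ours, and the pmf $p(x) = c/(x^2+1)$ with $c = 5/13$ is the one whose values ($c$, $c/2$, $7c/5$, etc.) appear in the example:

```python
from fractions import Fraction
from math import pi, sin

c = Fraction(5, 13)
p = {x: c / (x * x + 1) for x in range(-3, 4)}  # pmf on 0, +/-1, +/-2, +/-3

def pmf_of_function(p, g):
    """P(g(X) = y) by summing P(X = x) over the preimage {x : g(x) = y}."""
    q = {}
    for x, px in p.items():
        y = g(x)
        q[y] = q.get(y, Fraction(0)) + px
    return q

q_cube = pmf_of_function(p, lambda x: x ** 3)                  # one-to-one
q_sine = pmf_of_function(p, lambda x: round(sin(pi * x / 2)))  # many-to-one
print(q_sine[-1], q_sine[0], q_sine[1])  # 3/13 7/13 3/13
```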

A key concept in probability is that of independence of a collection of random variables. The collection could be finite or infinite. In the infinite case, we want each finite subcollection of the random variables to be independent. The definition of independence of a finite collection is as follows.

<b>Definition 1.11.</b> Let $X_1, X_2, \ldots, X_k$ be $k \geq 2$ discrete random variables defined on the same sample space $\Omega$. We say that $X_1, X_2, \ldots, X_k$ are <i>independent</i> if
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = P(X_1 = x_1)P(X_2 = x_2) \cdots P(X_k = x_k) \quad \forall\, x_1, x_2, \ldots, x_k.$$

It follows from the definition of independence of random variables that if $X_1, X_2$ are independent, then any function of $X_1$ and any function of $X_2$ are also independent. In fact, we have a more general result.

<i><b>Theorem 1.7.</b> Let $X_1, X_2, \ldots, X_k$ be $k \geq 2$ discrete random variables, and suppose they are independent. Let $U = f(X_1, X_2, \ldots, X_i)$ be some function of $X_1, X_2, \ldots, X_i$, and $V = g(X_{i+1}, \ldots, X_k)$ be some function of $X_{i+1}, \ldots, X_k$. Then $U$ and $V$ are independent.</i>

This result is true for any types of random variables $X_1, X_2, \ldots, X_k$, not just discrete ones.

A common notation of wide use in probability and statistics is now introduced. If $X_1, X_2, \ldots, X_k$ are independent, and moreover have the same CDF, say $F$, then we say that $X_1, X_2, \ldots, X_k$ are iid (or IID) and write $X_1, X_2, \ldots, X_k \overset{\text{iid}}{\sim} F$. The abbreviation iid (IID) means independent and identically distributed.

<i>Example 1.11 (Two Simple Illustrations).</i> Consider the experiment of tossing a fair coin (or any coin) four times. Suppose $X_1$ is the number of heads in the first two tosses, and $X_2$ is the number of heads in the last two tosses. Then it is intuitively clear that $X_1, X_2$ are independent, because the last two tosses carry no information regarding the first two tosses. The independence can easily be verified mathematically by using the definition of independence.

Next, consider the experiment of drawing 13 cards at random from a deck of 52 cards. Suppose $X_1$ is the number of aces and $X_2$ is the number of clubs among the 13 cards. Then $X_1, X_2$ are not independent. For example, $P(X_1 = 4, X_2 = 0) = 0$, but $P(X_1 = 4)$ and $P(X_2 = 0)$ are both $> 0$, and so $P(X_1 = 4)P(X_2 = 0) > 0$. So $X_1, X_2$ cannot be independent.


<i><b>1.3.2 Expectation and Moments</b></i>

By definition, a random variable takes different values on different occasions. It is natural to want to know what value it takes on average. Averaging is a very primitive concept. A simple average of just the possible values of the random variable can be misleading, because some values may have so little probability that they are relatively inconsequential. The average or mean value, also called the expected value, of a random variable is a weighted average of the different values of $X$, weighted according to how important the value is. Here is the definition.

<b>Definition 1.12.</b> <i>Let $X$ be a discrete random variable. We say that the expected value of $X$ exists if $\sum_x |x|\, p(x) < \infty$, in which case the expected value is defined as
$$E(X) = \mu = \sum_x x\, p(x),$$
the sum being over all possible values $x$ of $X$. The expected value is also known as the expectation or the mean of $X$.</i>

If the set of possible values of $X$ is infinite, then the infinite sum $\sum_x x\, p(x)$ must converge absolutely for the expectation to exist. If the sample space $\Omega$ of the underlying experiment is finite or countably infinite, then we can also calculate the expectation by averaging directly over the sample space.

<i><b>Proposition.</b> Suppose $\Omega$ is finite or countably infinite and $X$ is a discrete random variable with expectation $\mu$. Then
$$\mu = \sum_{\omega \in \Omega} X(\omega) P(\omega),$$
where $P(\omega)$ is the probability of the sample point $\omega$.</i>

<b>Important Point.</b> Although it is not the focus of this chapter, in applications we are often interested in more than one variable at the same time. To be specific, consider two discrete random variables $X, Y$ defined on a common sample space $\Omega$. Then we could construct new random variables out of $X$ and $Y$, for example, $XY$, $X + Y$, $X^2 + Y^2$, and so on. We can then talk of their expectations as well. Here is a general definition of the expectation of a function of more than one random variable.

<b>Definition 1.13.</b> Let $X_1, X_2, \ldots, X_n$ be $n$ discrete random variables, all defined on a common sample space $\Omega$ with a finite or countably infinite number of sample points. We say that the expectation of a function $g(X_1, X_2, \ldots, X_n)$ exists if $\sum_{\omega \in \Omega} |g(X_1(\omega), \ldots, X_n(\omega))| P(\omega) < \infty$, in which case
$$E[g(X_1, X_2, \ldots, X_n)] = \sum_{\omega \in \Omega} g(X_1(\omega), \ldots, X_n(\omega))\, P(\omega).$$


<i><b>Proposition.</b> (a) If there exists a finite constant $c$ such that $P(X = c) = 1$, then $E(X) = c$.

(b) If $X, Y$ are random variables defined on the same sample space $\Omega$ with finite expectations, and if $P(X \leq Y) = 1$, then $E(X) \leq E(Y)$.

(c) If $X$ has a finite expectation, and if $P(X \geq c) = 1$, then $E(X) \geq c$. If $P(X \leq c) = 1$, then $E(X) \leq c$.</i>

<i><b>Proposition (Linearity of Expectations).</b> Let $X_1, X_2, \ldots, X_n$ be random variables defined on the same sample space, and $c_1, c_2, \ldots, c_n$ any real-valued constants. Then, provided $E(X_i)$ exists for every $X_i$,
$$E\left(\sum_{i=1}^{n} c_i X_i\right) = \sum_{i=1}^{n} c_i E(X_i);$$
in particular, $E(cX) = cE(X)$ and $E(X_1 + X_2) = E(X_1) + E(X_2)$, whenever the expectations exist.</i>

The following fact also follows easily from the definition of the pmf of a function of a random variable. The result says that the expectation of a function of a random variable $X$ can be calculated directly using the pmf of $X$ itself, without having to calculate the pmf of the function.

<i><b>Proposition (Expectation of a Function).</b> Let $X$ be a discrete random variable on a sample space $\Omega$ with a finite or countable number of sample points, and let $Y = g(X)$ be a function of $X$. Then
$$E(Y) = \sum_x g(x)\, p(x),$$
provided $E(Y)$ exists.</i>

<b>Caution.</b> If $g(X)$ is a linear function of $X$, then, of course, $E(g(X)) = g(E(X))$. But, in general, the two things are not equal. For example, $E(X^2)$ is not the same as $(E(X))^2$; indeed, $E(X^2) > (E(X))^2$ for any random variable $X$ that is not a constant.

A very important property of independent random variables is the following factorization result on expectations.

<i><b>Theorem 1.8.</b> Suppose $X_1, X_2, \ldots, X_n$ are independent random variables. Then, provided each expectation exists,
$$E(X_1 X_2 \cdots X_n) = E(X_1)E(X_2) \cdots E(X_n).$$</i>

Let us now see some more illustrative examples.


<i>Example 1.12.</i> Let $X$ be the number of heads obtained in two tosses of a fair coin. The pmf of $X$ is $p(0) = p(2) = 1/4$, $p(1) = 1/2$. Therefore, $E(X) = 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} = 1$. Because the coin is fair, we expect it to show heads 50% of the number of times it is tossed, which is 50% of 2, that is, 1.

<i>Example 1.13 (Dice Sum).</i> Let $X$ be the sum of the two rolls when a fair die is rolled twice. The pmf of $X$ is $p(2) = p(12) = 1/36$; $p(3) = p(11) = 2/36$; $p(4) = p(10) = 3/36$; $p(5) = p(9) = 4/36$; $p(6) = p(8) = 5/36$; $p(7) = 6/36$. Therefore, $E(X) = 2 \cdot \frac{1}{36} + 3 \cdot \frac{2}{36} + 4 \cdot \frac{3}{36} + \cdots + 12 \cdot \frac{1}{36} = 7$. This can also be seen by letting $X_1$ = the face obtained on the first roll and $X_2$ = the face obtained on the second roll, and by using $E(X) = E(X_1 + X_2) = E(X_1) + E(X_2) = 3.5 + 3.5 = 7$.

Let us now make this problem harder. Suppose that a fair die is rolled 10 times and $X$ is the sum of all 10 rolls. The pmf of $X$ is no longer so simple; it would be cumbersome to write it down. But if we let $X_i$ = the face obtained on the $i$th roll, it is still true by the linearity of expectations that $E(X) = E(X_1 + X_2 + \cdots + X_{10}) = E(X_1) + E(X_2) + \cdots + E(X_{10}) = 3.5 \times 10 = 35$. We can easily compute the expectation, although the pmf would be difficult to write down.
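Both computations are easy to check in exact arithmetic (a sketch):

```python
from fractions import Fraction
from itertools import product

# Two rolls: exact E(X) by enumerating all 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))
e_two = Fraction(sum(a + b for a, b in outcomes), len(outcomes))

# Ten rolls: enumerating 6^10 outcomes is clumsy, but linearity is immediate.
e_one = Fraction(sum(range(1, 7)), 6)   # 7/2 per roll
e_ten = 10 * e_one
print(e_two, e_ten)  # 7 35
```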

<i>Example 1.14 (A Random Variable Without a Finite Expectation).</i> Let $X$ take the positive integers $1, 2, 3, \ldots$ as its values with the pmf

$$p(x) = P(X = x) = \frac{1}{x(x+1)}, \quad x = 1, 2, 3, \ldots.$$

This is a valid pmf, because obviously $\frac{1}{x(x+1)} > 0$ for any $x = 1, 2, 3, \ldots$, and also the infinite series $\sum_{x=1}^{\infty} \frac{1}{x(x+1)}$ sums to 1, a fact from calculus (the series telescopes). Now,

$$E(X) = \sum_{x=1}^{\infty} x \cdot \frac{1}{x(x+1)} = \sum_{x=1}^{\infty} \frac{1}{x+1} = \infty,$$

also a fact from calculus (the harmonic series diverges).

This example shows that not all random variables have a finite expectation. Here, the reason for the infiniteness of $E(X)$ is that $X$ takes large integer values $x$ with probabilities $p(x)$ that are not adequately small. The large values are realized sufficiently often that on average $X$ becomes larger than any given finite number.
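The contrast between the convergent mass and the divergent mean is easy to see in partial sums (a sketch; function names ours; the mean's partial sums grow like $\log n$):

```python
# p(x) = 1/(x(x+1)) has total mass 1 (the sum telescopes), but the partial
# sums of E(X) = sum 1/(x+1) grow without bound.
def mass(n):
    return sum(1.0 / (x * (x + 1)) for x in range(1, n + 1))

def partial_mean(n):
    return sum(x * (1.0 / (x * (x + 1))) for x in range(1, n + 1))

for n in (100, 10_000, 1_000_000):
    print(n, round(mass(n), 6), round(partial_mean(n), 2))
```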

The zero–one nature of indicator random variables is extremely useful for calculating expectations of certain integer-valued random variables whose distributions are sometimes so complicated that it would be difficult to find their expectations directly from the definition. We describe the technique and some illustrations of it below.

<i><b>Proposition.</b> Let $X$ be an integer-valued random variable such that it can be represented as $X = \sum_{i=1}^{m} c_i I_{A_i}$ for some $m$, constants $c_1, c_2, \ldots, c_m$, and suitable events $A_1, A_2, \ldots, A_m$. Then $E(X) = \sum_{i=1}^{m} c_i P(A_i)$.</i>


<i>Example 1.15 (Coin Tosses).</i> Suppose a coin that has probability $p$ of showing heads in any single toss is tossed $n$ times, and let $X$ denote the number of times in the $n$ tosses that a head is obtained. Then $X = \sum_{i=1}^{n} I_{A_i}$, where $A_i$ is the event that a head is obtained in the $i$th toss. Therefore, $E(X) = \sum_{i=1}^{n} P(A_i) = \sum_{i=1}^{n} p = np$. A direct calculation of the expectation would involve finding the pmf of $X$ and obtaining the sum $\sum_{x=0}^{n} x P(X = x)$; it can also be done that way, but that is a much longer calculation.

The random variable $X$ of this example is a <i>binomial</i> random variable with parameters $n$ and $p$. Its pmf is given by the formula $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, $x = 0, 1, 2, \ldots, n$.

<i>Example 1.16 (Consecutive Heads in Coin Tosses).</i> Suppose a coin with probability $p$ for heads in a single toss is tossed $n$ times. How many times can we expect to see a head followed by at least one more head? For example, if $n = 5$ and we see the outcomes HTHHH, then we see a head followed by at least one more head twice.

Define $A_i$ = the $i$th and the $(i+1)$th toss both result in heads. Then $X$ = number of times a head is followed by at least one more head $= \sum_{i=1}^{n-1} I_{A_i}$, and so $E(X) = \sum_{i=1}^{n-1} P(A_i) = \sum_{i=1}^{n-1} p^2 = (n-1)p^2$. For example, if a fair coin is tossed 20 times, we can expect to see a head followed by another head about five times ($19 \times .5^2 = 4.75$).
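The indicator answer $(n-1)p^2$ can be confirmed by exact enumeration over all $2^n$ outcomes, feasible for small $n$ (a sketch; the function name is ours):

```python
from itertools import product

def expected_hh(n, p):
    """Exact E(#{i : toss i and toss i+1 are both heads}) over all 2^n outcomes."""
    total = 0.0
    for seq in product((0, 1), repeat=n):   # 1 = head, 0 = tail
        prob = 1.0
        for toss in seq:
            prob *= p if toss else 1 - p
        count = sum(seq[i] * seq[i + 1] for i in range(n - 1))
        total += count * prob
    return total

print(expected_hh(10, 0.5), 9 * 0.5 ** 2)  # 2.25 2.25
```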

Another useful technique for calculating expectations of nonnegative integer-valued random variables is based on the CDF of the random variable, rather than directly on the pmf. This method is useful when calculating probabilities of the form $P(X > x)$ is logically more straightforward than directly calculating $P(X = x)$. Here is the expectation formula based on the tail CDF.

<i><b>Theorem 1.9 (Tailsum Formula).</b> Let $X$ take values $0, 1, 2, \ldots$. Then
$$E(X) = \sum_{n=0}^{\infty} P(X > n) = \sum_{n=1}^{\infty} P(X \geq n).$$</i>

<i>Example 1.17 (Family Planning).</i> Suppose a couple will have children until they have at least one child of each sex. How many children can they expect to have? Let $X$ denote the childbirth at which they have a child of each sex for the first time. Suppose the probability that any particular childbirth will be a boy is $p$, and that all births are independent. Then, for $n \geq 1$,

$$P(X > n) = P(\text{the first } n \text{ children are all boys or all girls}) = p^n + (1-p)^n.$$

Therefore, $E(X) = 2 + \sum_{n=2}^{\infty} [p^n + (1-p)^n] = 2 + p^2/(1-p) + (1-p)^2/p = \frac{1}{p(1-p)} - 1$. If boys and girls are equally likely on any childbirth, then this says that a couple waiting to have a child of each sex can expect to have three children.
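The tailsum computation is easy to check numerically (a sketch; the truncation point of the infinite sum is an arbitrary choice of ours):

```python
def expected_children(p, terms=5_000):
    """E(X) via the tailsum formula; P(X > 0) = 1 and
    P(X > n) = p^n + (1-p)^n for n >= 1."""
    q = 1 - p
    return 1.0 + sum(p ** n + q ** n for n in range(1, terms))

p = 0.5
closed_form = 1 / (p * (1 - p)) - 1
print(round(expected_children(p), 6), closed_form)  # 3.0 3.0
```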


The expected value is calculated with the intention of understanding what a typical value of a random variable is. But two very different distributions can have exactly the same expected value. A common example is that of a return on an investment in a stock. Two stocks may have the same average return, but one may be much riskier than the other, in the sense that the variability in the return is much higher for that stock. In that case, most risk-averse individuals would prefer to invest in the stock with less variability. Measures of risk or variability are of course not unique. Some natural measures that come to mind are E(|X − μ|), known as the <i>mean absolute deviation</i>, or P(|X − μ| > k) for some suitable k. However, neither of these two is the most common measure of variability. The most common measure is the <i>standard deviation</i> of a random variable.

<i><b>Definition 1.14. </b>Let a random variable X have a finite mean μ. The variance of X is defined as

σ² = E[(X − μ)²],

and the standard deviation of X is defined as σ = √σ².</i>

It is easy to prove that σ² < ∞ if and only if E(X²), the second moment of X, is finite. It is not uncommon to mistake the standard deviation for the mean absolute deviation, but they are not the same. In fact, an inequality always holds.

<b>Proposition. </b><i>σ ≥ E(|X − μ|), and σ is strictly greater unless X is a constant random variable, namely, P(X = μ) = 1.</i>

<i>We list some basic properties of the variance of a random variable.</i>

<i>(a) Var(cX) = c² Var(X) for any real c.</i>
<i>(b) Var(X + k) = Var(X) for any real k.</i>
<i>(c) Var(X) ≥ 0 for any random variable X, and equals zero only if P(X = c) = 1 for some real constant c.</i>
<i>(d) Var(X) = E(X²) − μ².</i>

The quantity E(X²) is called the <i>second moment</i> of X. The definition of a general moment is as follows.

<b>Definition 1.15. </b>Let X be a random variable, and k ≥ 1 a positive integer. Then E(X^k) is called the kth <i>moment</i> of X, and E(X^(−k)) is called the kth <i>inverse moment</i> of X, provided they exist.

We therefore have the following relationships involving moments and the variance:

Variance = Second Moment − (First Moment)².
Second Moment = Variance + (First Moment)².
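These identities are easy to verify numerically for any concrete pmf. The Python sketch below is an added illustration, not part of the text; the pmf chosen is the two-fair-coin-toss distribution used elsewhere in this section. It checks Var(X) = E(X²) − μ² and the inequality σ ≥ E(|X − μ|) from the proposition above.

```python
# Moments of a discrete pmf, illustrating Var(X) = E(X^2) - mu^2 and the
# inequality sigma >= E|X - mu|.
from math import sqrt

pmf = {0: 0.25, 1: 0.5, 2: 0.25}   # number of heads in two fair-coin tosses

mean = sum(x * p for x, p in pmf.items())
second_moment = sum(x ** 2 * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
mad = sum(abs(x - mean) * p for x, p in pmf.items())   # mean absolute deviation
sigma = sqrt(variance)

print(mean)                                # 1.0
print(variance, second_moment - mean**2)   # both 0.5
print(sigma, mad)                          # 0.707..., 0.5; sigma > mad
```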

Statisticians often use the third moment around the mean as a measure of lack of symmetry in the distribution of a random variable. The point is that if a random variable X has a symmetric distribution and a finite mean μ, then all odd moments around the mean, namely E[(X − μ)^(2k+1)], will be zero whenever the moment exists.


In particular, E[(X − μ)³] will be zero. Likewise, statisticians also use the fourth moment around the mean as a measure of how spiky the distribution is around the mean. To make these indices independent of the choice of unit of measurement (e.g., inches or centimeters), they use certain scaled measures of asymmetry and peakedness. Here are the definitions.

<b>Definition 1.16. </b>(a) Let X be a random variable with E[|X|³] < ∞. The <i>skewness</i> of X is defined as

β = E[(X − μ)³]/σ³.

(b) Let X be a random variable with E(X⁴) < ∞. The <i>kurtosis</i> of X is defined as

γ = E[(X − μ)⁴]/σ⁴ − 3.

The skewness β is zero for symmetric distributions, but the converse need not be true. The kurtosis γ is necessarily ≥ −2, but can be arbitrarily large, with spikier distributions generally having a larger kurtosis. But a very good interpretation of γ is not really available. We later show that γ = 0 for all normal distributions; hence the motivation for subtracting 3 in the definition of γ.

<i>Example 1.18 (Variance of Number of Heads). </i>Consider the experiment of two tosses of a fair coin and let X be the number of heads obtained. Then, we have seen that p(0) = p(2) = 1/4 and p(1) = 1/2. Thus, E(X²) = 0 × 1/4 + 1 × 1/2 + 4 × 1/4 = 3/2, and E(X) = 1. Therefore, Var(X) = E(X²) − μ² = 3/2 − 1 = 1/2, and the standard deviation is σ = √.5 ≈ .707.

<i>Example 1.19 (A Random Variable with an Infinite Variance). </i>If a random variable has a finite variance, then it can be shown that it must have a finite mean. This example shows that the converse need not be true.

Let X be a discrete random variable with the pmf

p(x) = c/x³, x = 1, 2, 3, …,

where c is the normalizing constant. Then E(X) = c Σ_{x=1}^∞ 1/x² is finite, but E(X²) = c Σ_{x=1}^∞ 1/x, and the harmonic series is not finitely summable, a fact from calculus. Because E(X²) is infinite, but E(X) is finite, σ² = E(X²) − [E(X)]² must also be infinite.
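Partial sums make the divergence visible. The Python sketch below is an added illustration that assumes a pmf proportional to 1/x³ on x = 1, 2, 3, …, so the mean series is Σ 1/x² (convergent) while the second-moment series is the divergent harmonic series Σ 1/x; the normalizing constant is omitted since it does not affect convergence.

```python
# Partial sums for a pmf proportional to 1/x^3: the mean series converges,
# the second-moment series diverges like log(terms).

def partial_sums(terms):
    mean_sum = sum(1.0 / x ** 2 for x in range(1, terms + 1))   # sum of x * (1/x^3)
    second_sum = sum(1.0 / x for x in range(1, terms + 1))      # sum of x^2 * (1/x^3)
    return mean_sum, second_sum

for terms in (10 ** 2, 10 ** 4, 10 ** 6):
    m, s = partial_sums(terms)
    print(terms, round(m, 6), round(s, 3))
# The mean series stabilizes near pi^2/6 = 1.6449...,
# while the harmonic second-moment series keeps growing without bound.
```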

If a collection of random variables is independent, then, just like the expectation, the variance also adds up. Precisely, one has the following very useful fact.

<i><b>Theorem 1.10. </b>Let X₁, X₂, …, Xₙ be n independent random variables. Then,

Var(X₁ + X₂ + ⋯ + Xₙ) = Var(X₁) + Var(X₂) + ⋯ + Var(Xₙ).</i>

An important corollary of this result is the following variance formula for the mean, X̄, of n independent and identically distributed random variables.

<i><b>Corollary 1.1. </b>Let X₁, X₂, …, Xₙ be independent random variables with a common variance σ² < ∞. Let X̄ = (X₁ + ⋯ + Xₙ)/n. Then Var(X̄) = σ²/n.</i>
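Corollary 1.1 can be checked by simulation. The Python sketch below is an illustration added here, not from the text; it averages n = 10 fair-die rolls, for which the single-roll variance is σ² = 35/12, and compares the empirical variance of the sample mean with σ²/n.

```python
# Checking Var(X-bar) = sigma^2 / n by simulation with fair-die rolls.
import random

def sample_variance(values):
    # Unbiased sample variance of a list of numbers.
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(7)
n = 10                       # sample size being averaged
reps = 50000                 # number of simulated sample means
means = [sum(rng.randint(1, 6) for _ in range(n)) / n for _ in range(reps)]

print(35 / 12 / n)                # theoretical Var(X-bar) = 0.2916...
print(sample_variance(means))     # simulated variance, close to it
```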

<b>1.4 Inequalities</b>

The mean and the variance, together, have earned the status of being the two most common summaries of a distribution. A relevant question is whether μ and σ are useful summaries of the distribution of a random variable. The answer is a qualified yes. The inequalities below suggest that knowing just the values of μ and σ, it is in fact possible to say something useful about the full distribution.

<i><b>Theorem 1.11. (a) (Chebyshev's Inequality). </b>Suppose X has mean μ and variance σ², assumed to be finite. Let k be any positive number. Then

P(|X − μ| ≥ kσ) ≤ 1/k².

<b>(b) (Markov's Inequality). </b>Suppose X takes only nonnegative values, and suppose E(X) = μ, assumed to be finite. Let c be any positive number. Then,

P(X ≥ c) ≤ μ/c.</i>

<i>The virtue of these two inequalities is that they make no restrictive assumptions on the random variable X. Whenever μ and σ are finite, Chebyshev's inequality is applicable, and whenever μ is finite, Markov's inequality applies, provided the random variable is nonnegative. However, the universal nature of these inequalities also makes them typically quite conservative.</i>
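A small computation shows just how conservative the bounds can be. The Python sketch below is illustrative code, not from the text; it compares both bounds with the exact tail probability P(X ≥ 70) for X ~ Bin(100, 1/2), where μ = 50 and σ = 5.

```python
# Chebyshev and Markov bounds versus an exact binomial tail probability
# for the number of heads in 100 tosses of a fair coin.
from math import comb, sqrt

n, p = 100, 0.5
mu = n * p
sigma = sqrt(n * p * (1 - p))

def binom_tail(c):
    # Exact P(X >= c) for X ~ Binomial(n, p).
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c, n + 1))

c = 70                        # ask about P(X >= 70)
k = (c - mu) / sigma          # 70 is 4 standard deviations above the mean
exact = binom_tail(c)
chebyshev = 1 / k ** 2        # bounds P(|X - mu| >= k sigma)
markov = mu / c               # bounds P(X >= c)

print(exact)                  # tiny compared with the bounds
print(chebyshev)              # 0.0625
print(markov)                 # 0.714...
```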

<i>Although Chebyshev’s inequality usually gives conservative estimates for tailprobabilities, it does imply a major result in probability theory in a special case.</i>

</div>
