A Gentle Introduction to Stata
4th Edition



ALAN C. ACOCK
Oregon State University

A Stata Press Publication
StataCorp LP
College Station, Texas


Copyright © 2006, 2008, 2010, 2012, 2014 by StataCorp LP
All rights reserved.
First edition 2006
Second edition 2008
Third edition 2010
Revised third edition 2012
Fourth edition 2014

Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in LaTeX 2ε
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1


ISBN-10: 1-59718-142-0
ISBN-13: 978-1-59718-142-6
Library of Congress Control Number: 2014935652
No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any
form or by any means—electronic, mechanical, photocopy, recording, or otherwise—without
the prior written permission of StataCorp LP.
Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LP.

Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations.
LaTeX 2ε is a trademark of the American Mathematical Society.


Contents
List of figures
List of tables
List of boxed tips
Preface
Support materials for the book

1 Getting started
1.1 Conventions
1.2 Introduction
1.3 The Stata screen
1.4 Using an existing dataset
1.5 An example of a short Stata session
1.6 Summary
1.7 Exercises

2 Entering data
2.1 Creating a dataset
2.2 An example questionnaire
2.3 Developing a coding system
2.4 Entering data using the Data Editor
2.4.1 Value labels
2.5 The Variables Manager
2.6 The Data Editor (Browse) view
2.7 Saving your dataset
2.8 Checking the data
2.9 Summary
2.10 Exercises

3 Preparing data for analysis
3.1 Introduction
3.2 Planning your work
3.3 Creating value labels
3.4 Reverse-code variables
3.5 Creating and modifying variables
3.6 Creating scales
3.7 Saving some of your data
3.8 Summary
3.9 Exercises

4 Working with commands, do-files, and results
4.1 Introduction
4.2 How Stata commands are constructed
4.3 Creating a do-file
4.4 Copying your results to a word processor
4.5 Logging your command file
4.6 Summary
4.7 Exercises

5 Descriptive statistics and graphs for one variable
5.1 Descriptive statistics and graphs
5.2 Where is the center of a distribution?
5.3 How dispersed is the distribution?
5.4 Statistics and graphs—unordered categories
5.5 Statistics and graphs—ordered categories and variables
5.6 Statistics and graphs—quantitative variables
5.7 Summary
5.8 Exercises

6 Statistics and graphs for two categorical variables
6.1 Relationship between categorical variables
6.2 Cross-tabulation
6.3 Chi-squared test
6.3.1 Degrees of freedom
6.3.2 Probability tables
6.4 Percentages and measures of association
6.5 Odds ratios when dependent variable has two categories
6.6 Ordered categorical variables
6.7 Interactive tables
6.8 Tables—linking categorical and quantitative variables
6.9 Power analysis when using a chi-squared test of significance
6.10 Summary
6.11 Exercises

7 Tests for one or two means
7.1 Introduction to tests for one or two means
7.2 Randomization
7.3 Random sampling
7.4 Hypotheses
7.5 One-sample test of a proportion
7.6 Two-sample test of a proportion
7.7 One-sample test of means
7.8 Two-sample test of group means
7.8.1 Testing for unequal variances
7.9 Repeated-measures t test
7.10 Power analysis
7.11 Nonparametric alternatives
7.11.1 Mann–Whitney two-sample rank-sum test
7.11.2 Nonparametric alternative: Median test
7.12 Summary
7.13 Exercises

8 Bivariate correlation and regression
8.1 Introduction to bivariate correlation and regression
8.2 Scattergrams
8.3 Plotting the regression line
8.4 An alternative to producing a scattergram, binscatter
8.5 Correlation
8.6 Regression
8.7 Spearman’s rho: Rank-order correlation for ordinal data
8.8 Summary
8.9 Exercises

9 Analysis of variance
9.1 The logic of one-way analysis of variance
9.2 ANOVA example
9.3 ANOVA example using survey data
9.4 A nonparametric alternative to ANOVA
9.5 Analysis of covariance
9.6 Two-way ANOVA
9.7 Repeated-measures design
9.8 Intraclass correlation—measuring agreement
9.9 Power analysis with ANOVA
9.9.1 One-way ANOVA
9.9.2 Power analysis for two-way ANOVA
9.9.3 Power analysis for repeated-measures ANOVA
9.9.4 Summary of power analysis for ANOVA
9.10 Summary
9.11 Exercises

10 Multiple regression
10.1 Introduction to multiple regression
10.2 What is multiple regression?
10.3 The basic multiple regression command
10.4 Increment in R-squared: Semipartial correlations
10.5 Is the dependent variable normally distributed?
10.6 Are the residuals normally distributed?
10.7 Regression diagnostic statistics
10.7.1 Outliers and influential cases
10.7.2 Influential observations: DFbeta
10.7.3 Combinations of variables may cause problems
10.8 Weighted data
10.9 Categorical predictors and hierarchical regression
10.10 A shortcut for working with a categorical variable
10.11 Fundamentals of interaction
10.12 Nonlinear relations
10.12.1 Fitting a quadratic model
10.12.2 Centering when using a quadratic term
10.12.3 Do we need to add a quadratic component?
10.13 Power analysis in multiple regression
10.14 Summary
10.15 Exercises

11 Logistic regression
11.1 Introduction to logistic regression
11.2 An example
11.3 What is an odds ratio and a logit?
11.3.1 The odds ratio
11.3.2 The logit transformation
11.4 Data used in the rest of the chapter
11.5 Logistic regression
11.6 Hypothesis testing
11.6.1 Testing individual coefficients
11.6.2 Testing sets of coefficients
11.7 More on interpreting results from logistic regression
11.8 Nested logistic regressions
11.9 Power analysis when doing logistic regression
11.10 Summary
11.11 Exercises

12 Measurement, reliability, and validity
12.1 Overview of reliability and validity
12.2 Constructing a scale
12.2.1 Generating a mean score for each person
12.3 Reliability
12.3.1 Stability and test–retest reliability
12.3.2 Equivalence
12.3.3 Split-half and alpha reliability—internal consistency
12.3.4 Kuder–Richardson reliability for dichotomous items
12.3.5 Rater agreement—kappa (κ)
12.4 Validity
12.4.1 Expert judgment
12.4.2 Criterion-related validity
12.4.3 Construct validity
12.5 Factor analysis
12.6 PCF analysis
12.6.1 Orthogonal rotation: Varimax
12.6.2 Oblique rotation: Promax
12.7 But we wanted one scale, not four scales
12.7.1 Scoring our variable
12.8 Summary
12.9 Exercises

13 Working with missing values—multiple imputation
13.1 The nature of the problem
13.2 Multiple imputation and its assumptions about the mechanism for missingness
13.3 What variables do we include when doing imputations?
13.4 Multiple imputation
13.5 A detailed example
13.5.1 Preliminary analysis
13.5.2 Setup and multiple-imputation stage
13.5.3 The analysis stage
13.5.4 For those who want an R² and standardized βs
13.5.5 When impossible values are imputed
13.6 Summary
13.7 Exercises

14 The sem and gsem commands
14.1 Ordinary least-squares regression models using sem
14.1.1 Using the SEM Builder to fit a basic regression model
14.2 A quick way to draw a regression model and a fresh start
14.2.1 Using sem without the SEM Builder
14.3 The gsem command for logistic regression
14.3.1 Fitting the model using the logit command
14.3.2 Fitting the model using the gsem command
14.4 Path analysis and mediation
14.5 Conclusions and what is next for the sem command
14.6 Exercises

A What’s next?
A.1 Introduction to the appendix
A.2 Resources
A.2.1 Web resources
A.2.2 Books about Stata
A.2.3 Short courses
A.2.4 Acquiring data
A.3 Summary

References
Author index
Subject index


Figures
1.1 Stata menu
1.2 Stata’s opening screen
1.3 The toolbar in Stata for Windows
1.4 The toolbar in Stata for Mac
1.5 Stata command to open cancer.dta
1.6 The summarize dialog box
1.7 Histogram of age
1.8 The histogram dialog box
1.9 The tabs on the histogram dialog box
1.10 The Titles tab of the histogram dialog box
1.11 First attempt at an improved histogram
1.12 Final histogram of age
2.1 Example questionnaire
2.2 The Data Editor
2.3 Data Editor (Edit) and Data Editor (Browse) icons on the toolbar
2.4 Variable name and variable label
2.5 Data Editor with a complete dataset
2.6 The Variables Manager icon on the Stata toolbar
2.7 Using the Variables Manager to add a label for gender
2.8 Variables Manager with value labels added
2.9 Dataset shown in the Data Editor (Browse) mode
2.10 The describe dialog box
3.1 The Variables Manager
3.2 The Variables Manager with value labels assigned
3.3 recode: specifying recode rules on the Main tab
3.4 recode: specifying new variable names on the Options tab
3.5 The generate dialog box
3.6 Two-way tabulation dialog box
3.7 The Main tab for the egen dialog box
3.8 The by/if/in tab for the egen dialog box
4.1 The Do-file Editor icon on the Stata menu
4.2 The Do-file Editor of Stata for Windows
4.3 The Do-file Editor toolbar of Stata for Windows
4.4 Highlighting in the Do-file Editor
4.5 Commands in the Do-file Editor window of Stata for Mac
5.1 How many children do families have?
5.2 Distributions with same M = 1000 but SDs = 100 or 200
5.3 Dialog box for frequency tabulation
5.4 The Options tab for pie charts (by category)
5.5 Pie charts of marital status in the United States
5.6 The Graph Editor
5.7 Using the histogram dialog box to make a bar chart
5.8 Bar chart of marital status of U.S. adults
5.9 Histogram of political views of U.S. adults
5.10 Histogram of time spent on the World Wide Web
5.11 Histogram of time spent on the World Wide Web (fewer than 25 hours a week, by gender)
5.12 The Main tab for the tabstat dialog box
5.13 Box plot of time spent on the World Wide Web (fewer than 25 hours a week, by gender)
6.1 The Main tab for creating a cross-tabulation
6.2 Results of search chitable
6.3 Entering data for a table
6.4 Summarizing a quantitative variable by categories of a categorical variable
6.5 The Bar label properties dialog box
6.6 Bar graph summarizing a quantitative variable by categories of a categorical variable
7.1 Restrict observations to those who score 1 on wrkstat
7.2 Two-sample t test using groups dialog box
7.3 Cohen’s d effects
7.4 Power and sample-size control panel
8.1 Dialog box for a scattergram
8.2 Scattergram of son’s education on father’s education
8.3 Scattergram of son’s education on father’s education with “jitter”
8.4 Scattergram of son’s education on father’s education with a regression line
8.5 Scattergram relating hourly wage to job tenure
8.6 Average wage by tenure
8.7 Relationship between wages and tenure with a discontinuity in the relationship at 3 years
8.8 Relationship between wages and tenure with a discontinuity in the relationship at 3 years; whites shown with solid lines and blacks shown with dashed lines
8.9 The Model tab of the regress dialog box
8.10 Confidence band around regression prediction
9.1 One-way analysis-of-variance dialog box
9.2 Bar graph of relationship between prestige and mobility
9.3 Bar graph of support for stem cell research by political party identification
9.4 Box plot of support for stem cell research by political party identification
9.5 The Specification 1 dialog box under margins
9.6 Hours of TV watching by whether the person works full time
9.7 Hours of TV watching by whether the person is married
9.8 Hours of TV watching by whether the person is married and whether the person works full time
9.9 Effect size for power of 0.80, alpha of 0.05 for N’s from 40 to 500
9.10 Effect size for power of 0.80 with two rows in each of the three columns for N’s from 100 to 300
9.11 Effect size for power of 0.80, alpha of 0.05, four repeated measurements, and a 0.60 correlation between measurements for N’s from 100 to 300
10.1 The Model tab for multiple regression
10.2 The Main tab of the pcorr dialog box
10.3 Histogram of dependent variable, env_con
10.4 Hanging rootogram of dependent variable, env_con
10.5 Heteroskedasticity of residuals
10.6 Residual-versus-fitted plot
10.7 Actual value of environmental concern regressed on the predicted value
10.8 Collinearity
10.9 Education and gender predicting income, no interaction
10.10 Education and gender predicting income, with interaction term
10.11 Five quadratic curves
10.12 Graph of quadratic model
10.13 binscatter representation of nonlinear relationship between the log of wages and total years of experience
10.14 Quadratic model of relationship between total experience and log of income
10.15 Quadratic model relating log of income to total experience where experience is centered
10.16 Comparison of linear and quadratic models
11.1 Positive feedback and divorce
11.2 Predicted probability of positive feedback and divorce
11.3 Predicted probability of positive feedback and logit of divorce
11.4 Positive feedback and divorce using OLS regression
11.5 Dialog box for doing logistic regression
11.6 Risk factors associated with teen drinking
11.7 Estimated probability that an adolescent drank in last month adjusted for age, race, and frequency of family meals
12.1 Scree plot: National priorities
14.1 SEM Builder on a Mac
14.2 Initial SEM diagram
14.3 Adding variable names and correlations of independent variables
14.4 Result without any reformatting
14.5 Intermediate results
14.6 The SEM Text dialog box in Stata for Mac
14.7 Final result
14.8 Regression component dialog box
14.9 Quick drawing of regression model
14.10 Maximum likelihood estimation of model using listwise deletion
14.11 A logistic regression model with the outcome, obese, clicked to highlight it
14.12 Initial results
14.13 Dialog box for changing information in a textbox
14.14 Final results for logistic regression
14.15 BMI predicted without using the quickfood variable
14.16 A path model with the quickfood variable mediating part of the effect of educ and incomeln on bmi
14.17 Direct effects without the mediator
14.18 Final mediation model
14.19 More complex path model
A.1 Growth of downloads of files from Statistical Software Components (source: />item=repec:boc:bocode)
A.2 A path model


Tables
2.1 Example codebook
2.2 Example coding sheet
2.3 New variable names and labels
3.1 Sample project task outline
3.2 NLSY97 sample codebook entries
3.3 Reverse-coding plan
3.4 Arithmetic symbols
4.1 Relational operators used by Stata
5.1 Level of measurement and choice of average
9.1 Hypothetical data—wide view
9.2 Hypothetical data—long view
10.1 Regression equation and Stata output
10.2 Effect size of f² and R²
12.1 Four kinds of reliability and the appropriate statistical measure
12.2 Correlations you might expect for one factor
12.3 Correlations you might expect for two factors
14.1 Selected families available with gsem
14.2 Direct and indirect effects of mother’s education and family income on her BMI



Boxed tips
Why do we show the dot prompt with these commands?
Setting how much output is in the Results window
Work along with the book
Searching for help
Internet access to datasets
Clearing the Results window: The cls command
When to use Submit and when to use OK
Variables and items
I typed the letter l for the number 1
Saving data and different versions of Stata
Scrolling the results
Working with Excel files
What is a Stata dictionary file?
Stata and capitalization
More on recoding rules
Beyond egen
Deciding among different ways to do something
What is a command? What is a do-file?
Stata do-files for this book
Saving tabular output
Tabulating a series of variables and including missing values
Obtaining both numbers and value labels
Independent and dependent variables . . . . . . . . . . . . . . . . . . . . . 124
Reporting chi-squared results . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Why can φ be negative? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Random sample and randomization . . . . . . . . . . . . . . . . . . . . . . 151
Distinguishing between two p-values . . . . . . . . . . . . . . . . . . . . . . 157
Proportions and percentages . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Degrees of freedom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Effect size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
How can you get the same result each time? . . . . . . . . . . . . . . . . . 191

Predictors and outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Statistical and substantive significance . . . . . . . . . . . . . . . . . . . . . 202
Multiple-comparison procedures with correlations . . . . . . . . . . . . . . 206
Can Stata give me an F table? . . . . . . . . . . . . . . . . . . . . . . . . . 221
What are categorical covariates and what are continuous covariates? . . . . 233
Estimating the effect size and omega-squared, ω² . . . . . . . . . . . . . . 241
Estimating the effect size and omega-squared, ω², continued . . . . . . . . 242
Names for categorical variables . . . . . . . . . . . . . . . . . . . . . . . . . 292
More on testing a set of parameter estimates . . . . . . . . . . . . . . . . . 297
Tabular presentation of hierarchical regression models . . . . . . . . . . . . 299
Centering quantitative predictors before computing interaction terms . . . 305
Do not compare correlations across populations . . . . . . . . . . . . . . . . 308
Predicting a count variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Using Stata as a calculator . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Odds ratio versus relative-risk ratio . . . . . . . . . . . . . . . . . . . . . . 345
Requiring a 75% completion rate . . . . . . . . . . . . . . . . . . . . . . . . 364
A problem generating a total scale score . . . . . . . . . . . . . . . . . . . . 366
Alpha, average correlation, number of items . . . . . . . . . . . . . . . . . . 371



What is a strong kappa? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374

What’s in a name? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382



Preface
This book was written with a particular reader in mind. This reader is learning social
statistics and needs to learn Stata but has no prior experience with other statistical
software packages. When I learned Stata, I found there were no books written explicitly
for this type of reader. There are certainly excellent books on Stata, but they assume
extensive prior experience with other packages, such as SAS or IBM SPSS Statistics; they
also assume a fairly advanced working knowledge of statistics. These books moved
quickly to advanced topics and left my intended reader in the dust. Readers who have
more background in statistical software and statistics will be able to read chapters
quickly and even skip sections. The goal is to move the true beginner to a level of
competence using Stata.
With this target reader in mind, I make far more use of the menus and dialog boxes
in Stata’s interface than do any other books about Stata. Advanced users may not
see the value in using the interface, and the more people learn about Stata, the less
they will rely on the interface. Also, even when you are using the interface, it is still
important to save a record of the sequence of commands you run. Although I rely on
the commands much more than the dialog boxes in the interface in my own work, I still
find value in the interface. The dialog boxes in the interface include many options that
I might not have known or might have forgotten.
To illustrate the interface as well as graphics, I have included more than 100 figures,
many of which show dialog boxes. I present many tables and extensive Stata “results”
as they appear on the screen. I interpret these results substantively in the belief that
beginning Stata users need to learn more than just how to produce the results—users
also need to be able to interpret them.
I have tried to use real data. There are a few examples where it is much easier to
illustrate a point with hypothetical data, but for the most part, I use data that are in
the public domain. For example, I use the General Social Surveys for 2002 and 2006
in many chapters, as well as the National Survey of Youth, 1997. I have simplified the
files by dropping many of the variables in the original datasets, but I have kept all the
observations. I have tried to use examples from several social-science fields, and I have
included a few extra variables in several datasets so that instructors, as well as readers,
can make additional examples and exercises that are tailored to their disciplines. People
who are used to working with statistics books that have contrived data with just a few
observations, presumably so work can be done by hand, may be surprised to see more
than 1,000 observations in this book’s datasets. Working with these files provides better

