O’Reilly-5980006 janert5980006˙fm October 28, 2010 22:1
Data Analysis with Open Source Tools
Data Analysis with
Open Source Tools
Philipp K. Janert
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Data Analysis with Open Source Tools
by Philipp K. Janert
Copyright © 2011 Philipp K. Janert. All rights reserved. Printed in the United States of America.
Published by O’Reilly Media, Inc. 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (). For more information,
contact our corporate/institutional sales department: (800) 998-9938 or
Editor: Mike Loukides
Production Editor: Sumita Mukherji
Copyeditor: Matt Darnell
Production Services: MPS Limited, a Macmillan Company, and Newgen North America, Inc.
Indexer: Fred Brown
Cover Designer: Karen Montgomery
Interior Designer: Edie Freedman and Ron Bilodeau
Illustrator: Philipp K. Janert
Printing History:
November 2010: First Edition.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Analysis with Open Source
Tools, the image of a common kite, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc.
was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author
assume no responsibility for errors or omissions, or for damages resulting from the use of the
information contained herein.
ISBN: 978-0-596-80235-6
Furious activity is no substitute for understanding.
—H. H. Williams
CONTENTS

PREFACE

1 INTRODUCTION
Data Analysis
What’s in This Book
What’s with the Workshops?
What’s with the Math?
What You’ll Need
What’s Missing

PART I Graphics: Looking at Data

2 A SINGLE VARIABLE: SHAPE AND DISTRIBUTION
Dot and Jitter Plots
Histograms and Kernel Density Estimates
The Cumulative Distribution Function
Rank-Order Plots and Lift Charts
Only When Appropriate: Summary Statistics and Box Plots
Workshop: NumPy
Further Reading

3 TWO VARIABLES: ESTABLISHING RELATIONSHIPS
Scatter Plots
Conquering Noise: Smoothing
Logarithmic Plots
Banking
Linear Regression and All That
Showing What’s Important
Graphical Analysis and Presentation Graphics
Workshop: matplotlib
Further Reading

4 TIME AS A VARIABLE: TIME-SERIES ANALYSIS
Examples
The Task
Smoothing
Don’t Overlook the Obvious!
The Correlation Function
Optional: Filters and Convolutions
Workshop: scipy.signal
Further Reading

5 MORE THAN TWO VARIABLES: GRAPHICAL MULTIVARIATE ANALYSIS
False-Color Plots
A Lot at a Glance: Multiplots
Composition Problems
Novel Plot Types
Interactive Explorations
Workshop: Tools for Multivariate Graphics
Further Reading

6 INTERMEZZO: A DATA ANALYSIS SESSION
A Data Analysis Session
Workshop: gnuplot
Further Reading

PART II Analytics: Modeling Data

7 GUESSTIMATION AND THE BACK OF THE ENVELOPE
Principles of Guesstimation
How Good Are Those Numbers?
Optional: A Closer Look at Perturbation Theory and Error Propagation
Workshop: The Gnu Scientific Library (GSL)
Further Reading

8 MODELS FROM SCALING ARGUMENTS
Models
Arguments from Scale
Mean-Field Approximations
Common Time-Evolution Scenarios
Case Study: How Many Servers Are Best?
Why Modeling?
Workshop: Sage
Further Reading

9 ARGUMENTS FROM PROBABILITY MODELS
The Binomial Distribution and Bernoulli Trials
The Gaussian Distribution and the Central Limit Theorem
Power-Law Distributions and Non-Normal Statistics
Other Distributions
Optional: Case Study—Unique Visitors over Time
Workshop: Power-Law Distributions
Further Reading

10 WHAT YOU REALLY NEED TO KNOW ABOUT CLASSICAL STATISTICS
Genesis
Statistics Defined
Statistics Explained
Controlled Experiments Versus Observational Studies
Optional: Bayesian Statistics—The Other Point of View
Workshop: R
Further Reading

11 INTERMEZZO: MYTHBUSTING—BIGFOOT, LEAST SQUARES, AND ALL THAT
How to Average Averages
The Standard Deviation
Least Squares
Further Reading

PART III Computation: Mining Data

12 SIMULATIONS
A Warm-Up Question
Monte Carlo Simulations
Resampling Methods
Workshop: Discrete Event Simulations with SimPy
Further Reading

13 FINDING CLUSTERS
What Constitutes a Cluster?
Distance and Similarity Measures
Clustering Methods
Pre- and Postprocessing
Other Thoughts
A Special Case: Market Basket Analysis
A Word of Warning
Workshop: Pycluster and the C Clustering Library
Further Reading

14 SEEING THE FOREST FOR THE TREES: FINDING IMPORTANT ATTRIBUTES
Principal Component Analysis
Visual Techniques
Kohonen Maps
Workshop: PCA with R
Further Reading

15 INTERMEZZO: WHEN MORE IS DIFFERENT
A Horror Story
Some Suggestions
What About Map/Reduce?
Workshop: Generating Permutations
Further Reading

PART IV Applications: Using Data

16 REPORTING, BUSINESS INTELLIGENCE, AND DASHBOARDS
Business Intelligence
Corporate Metrics and Dashboards
Data Quality Issues
Workshop: Berkeley DB and SQLite
Further Reading

17 FINANCIAL CALCULATIONS AND MODELING
The Time Value of Money
Uncertainty in Planning and Opportunity Costs
Cost Concepts and Depreciation
Should You Care?
Is This All That Matters?
Workshop: The Newsvendor Problem
Further Reading

18 PREDICTIVE ANALYTICS
Introduction
Some Classification Terminology
Algorithms for Classification
The Process
The Secret Sauce
The Nature of Statistical Learning
Workshop: Two Do-It-Yourself Classifiers
Further Reading

19 EPILOGUE: FACTS ARE NOT REALITY

A PROGRAMMING ENVIRONMENTS FOR SCIENTIFIC COMPUTATION AND DATA ANALYSIS
Software Tools
A Catalog of Scientific Software
Writing Your Own
Further Reading

B RESULTS FROM CALCULUS
Common Functions
Calculus
Useful Tricks
Notation and Basic Math
Where to Go from Here
Further Reading

C WORKING WITH DATA
Sources for Data
Cleaning and Conditioning
Sampling
Data File Formats
The Care and Feeding of Your Data Zoo
Skills
Terminology
Further Reading

INDEX
Preface
This book grew out of my experience of working with data for various companies in the tech
industry. It is a collection of those concepts and techniques that I have found to be the
most useful, including many topics that I wish I had known earlier—but didn’t.
My degree is in physics, but I also worked as a software engineer for several years. The
book reflects this dual heritage. On the one hand, it is written for programmers and others
in the software field: I assume that you, like me, have the ability to write your own
programs to manipulate data in any way you want.
On the other hand, the way I think about data has been shaped by my background and
education. As a physicist, I am not content merely to describe data or to make black-box
predictions: the purpose of an analysis is always to develop an understanding for the
processes or mechanisms that give rise to the data that we observe.
The instrument to express such understanding is the model: a description of the system
under study (in other words, not just a description of the data!), simplified as necessary
but nevertheless capturing the relevant information. A model may be crude (“Assume a
spherical cow”), but if it helps us develop better insight into how the system works, it is
a successful model nevertheless. (Additional precision can often be obtained at a later
time, if it is really necessary.)
This emphasis on models and simplified descriptions is not universal: other authors and
practitioners will make different choices. But it is essential to my approach and point of
view.
This is a rather personal book. Although I have tried to be reasonably comprehensive, I
have selected the topics that I consider relevant and useful in practice—whether they are
part of the “canon” or not. Also included are several topics that you won’t find in any
other book on data analysis. Although neither new nor original, they are usually not used
or discussed in this particular context—but I find them indispensable.
Throughout the book, I freely offer specific, explicit advice, opinions, and assessments.
These remarks are reflections of my personal interest, experience, and understanding. I do
not claim that my point of view is necessarily correct: evaluate what I say for yourself and
feel free to adapt it to your needs. In my view, a specific, well-argued position is of greater
use than a sterile laundry list of possible algorithms—even if you later decide to disagree
with me. The value is not in the opinion but rather in the arguments leading up to it. If
your arguments are better than mine, or even just more agreeable to you, then I will have
achieved my purpose!
Data analysis, as I understand it, is not a fixed set of techniques. It is a way of life, and it
has a name: curiosity. There is always something else to find out and something more to
learn. This book is not the last word on the matter; it is merely a snapshot in time: things I
knew about and found useful today.
“Works are of value only if they give rise to better ones.”
(Alexander von Humboldt, writing to Charles Darwin, 18 September 1839)
Before We Begin
More data analysis efforts seem to go bad because of an excess of sophistication rather
than a lack of it.
This may come as a surprise, but it has been my experience again and again. As a
consultant, I am often called in when the initial project team has already gotten stuck.
Rarely (if ever) does the problem turn out to be that the team did not have the required
skills. On the contrary, I usually find that they tried to do something unnecessarily
complicated and are now struggling with the consequences of their own invention!
Based on what I have seen, two particular risk areas stand out:
• The use of “statistical” concepts that are only partially understood (and given the
relative obscurity of most of statistics, this includes virtually all statistical concepts)
• Complicated (and expensive) black-box solutions when a simple and transparent
approach would have worked at least as well or better
I strongly recommend that you make it a habit to avoid all statistical language. Keep it
simple and stick to what you know for sure. There is absolutely nothing wrong with
speaking of the “range over which points spread,” because this phrase means exactly what
it says: the range over which points spread, and only that! Once we start talking about
“standard deviations,” this clarity is gone. Are we still talking about the observed width of
the distribution? Or are we talking about one specific measure for this width? (The
standard deviation is only one of several that are available.) Are we already making an
implicit assumption about the nature of the distribution? (The standard deviation is only
suitable under certain conditions, which are often not fulfilled in practice.) Or are we even
confusing the predictions we could make if these assumptions were true with the actual
data? (The moment someone talks about “95 percent anything” we know it’s the latter!)
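This point can be made concrete. The sketch below uses made-up numbers (not from the book) to compare three ways of quantifying “the range over which points spread”; note how a single outlier inflates the standard deviation while barely moving the interquartile range:

```python
import statistics

# Synthetic sample: a tight cluster plus one outlier (illustrative values only)
points = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 25.0]

# Three different measures of "the range over which points spread":
full_range = max(points) - min(points)         # the literal range
stdev = statistics.stdev(points)               # sample standard deviation
quartiles = statistics.quantiles(points, n=4)  # [Q1, Q2, Q3]
iqr = quartiles[2] - quartiles[0]              # interquartile range

print(round(full_range, 2), round(stdev, 2), round(iqr, 2))
```

The range is 15.3 and the standard deviation about 5.3, both dominated by the single outlier, while the interquartile range stays near 0.35: three honest answers to the vague question “how wide is it?”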
I’d also like to remind you not to discard simple methods until they have been proven
insufficient. Simple solutions are frequently rather effective: the marginal benefit that
more complicated methods can deliver is often quite small (and may be in no reasonable
relation to the increased cost). More importantly, simple methods have fewer
opportunities to go wrong or to obscure the obvious.
True story: a company was tracking the occurrence of defects over time. Of course, the
actual number of defects varied quite a bit from one day to the next, and they were
looking for a way to obtain an estimate for the typical number of expected defects. The
solution proposed by their IT department involved a compute cluster running a neural
network! (I am not making this up.) In fact, a one-line calculation (involving a moving
average or single exponential smoothing) is all that was needed.
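For concreteness, here is what that one-line calculation looks like in Python, run on invented defect counts (the numbers are mine, not the company’s). Single exponential smoothing keeps a running level and updates it once per observation:

```python
# Synthetic daily defect counts (illustrative numbers, not from the story)
defects = [12, 15, 9, 14, 11, 30, 10, 13, 12, 16]

# Single exponential smoothing: the whole "model" is one line per observation.
alpha = 0.3            # smoothing weight; a judgment call, not a fitted parameter
level = defects[0]
for x in defects[1:]:
    level = alpha * x + (1 - alpha) * level  # the one-line calculation

print(round(level, 1))  # estimate of the typical number of expected defects
```

With alpha = 0.3, the one outlier day nudges the level upward but does not dominate it; tuning alpha trades responsiveness against smoothness. No cluster, no neural network.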
I think the primary reason for this tendency to make data analysis projects more
complicated than they are is discomfort: discomfort with an unfamiliar problem space and
uncertainty about how to proceed. This discomfort and uncertainty creates a desire to
bring in the “big guns”: fancy terminology, heavy machinery, large projects. In reality, of
course, the opposite is true: the complexities of the “solution” overwhelm the original
problem, and nothing gets accomplished.
Data analysis does not have to be all that hard. Although there are situations when
elementary methods will no longer be sufficient, they are much less prevalent than you
might expect. In the vast majority of cases, curiosity and a healthy dose of common sense
will serve you well.
The attitude that I am trying to convey can be summarized in a few points:
Simple is better than complex.
Cheap is better than expensive.
Explicit is better than opaque.
Purpose is more important than process.
Insight is more important than precision.
Understanding is more important than technique.
Think more, work less.
Although I do acknowledge that the items on the right are necessary at times, I will give
preference to those on the left whenever possible.
It is in this spirit that I am offering the concepts and techniques that make up the rest of
this book.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, and email addresses
Constant width
Used to refer to language and script elements

Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this
book in your programs and documentation. You do not need to contact us for permission
unless you’re reproducing a significant portion of the code. For example, writing a
program that uses several chunks of code from this book does not require permission.
Selling or distributing a CD-ROM of examples from O’Reilly books does require
permission. Answering a question by citing this book and quoting example code does not
require permission. Incorporating a significant amount of example code from this book
into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Data Analysis with Open Source Tools, by Philipp
K. Janert. Copyright 2011 Philipp K. Janert, 978-0-596-80235-6.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at
Safari® Books Online
Safari Books Online is an on-demand digital library that lets you easily search
over 7,500 technology and creative reference books and videos to find the
answers you need quickly.
With a subscription, you can read any page and watch any video from our library online.
Read books on your cell phone and mobile devices. Access new titles before they are
available for print, and get exclusive access to manuscripts in development and post
feedback for the authors. Copy and paste code samples, organize your favorites, download
chapters, bookmark key sections, create notes, print out pages, and benefit from tons of
other time-saving features.
O’Reilly Media has uploaded this book to the Safari Books Online service. To have full
digital access to this book and others on similar topics from O’Reilly and other publishers,
sign up for free at .
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at:
To comment or ask technical questions about this book, send email to:

For more information about our books, conferences, Resource Centers, and the O’Reilly
Network, see our website at:

Acknowledgments
It was a pleasure to work with O’Reilly on this project. In particular, O’Reilly has been
most accommodating with regard to the technical challenges raised by my need to include
(for an O’Reilly book) an uncommonly large amount of mathematical material in the
manuscript.
Mike Loukides has accompanied this project as the editor since its beginning. I have
enjoyed our conversations about life, the universe, and everything, and I appreciate his
comments about the manuscript—either way.
I’d like to thank several of my friends for their help in bringing this book about:
• Elizabeth Robson, for making the connection
• Austin King, for pointing out the obvious
• Scott White, for suffering my questions gladly
• Richard Kreckel, for much-needed advice
As always, special thanks go to Paul Schrader (Bremen).
The manuscript benefited from the feedback I received from various reviewers. Michael E.
Driscoll, Zachary Kessin, and Austin King read all or parts of the manuscript and provided
valuable comments.
I enjoyed personal correspondence with Joseph Adler, Joe Darcy, Hilary Mason, Stephen
Weston, Scott White, and Brian Zimmer. All very generously provided expert advice on
specific topics.
Particular thanks go to Richard Kreckel, who provided uncommonly detailed and
insightful feedback on most of the manuscript.
During the preparation of this book, the excellent collection at the University of
Washington libraries was an especially valuable resource to me.
Authors usually thank their spouses for their “patience and support” or words to that
effect. Unless one has lived through the actual experience, one cannot fully comprehend
how true this is. Over the last three years, Angela has endured what must have seemed
like a nearly continuous stream of whining, frustration, and desperation—punctuated by
occasional outbursts of exhilaration and grandiosity—all of which before the background
of the self-centered and self-absorbed attitude of a typical author. Her patience and
support were unfailing. It’s her turn now.
CHAPTER ONE
Introduction
Imagine your boss comes to you and says: “Here are 50 GB of logfiles—find a way to improve our
business!”
What would you do? Where would you start? And what would you do next?
It’s this kind of situation that the present book wants to help you with!
Data Analysis
Businesses sit on data, and every second that passes, they generate some more. Surely,
there must be a way to make use of all this stuff. But how, exactly—that’s far from clear.
The task is difficult because it is so vague: there is no specific problem that needs to be
solved. There is no specific question that needs to be answered. All you know is the
overall purpose: improve the business. And all you have is “the data.” Where do you start?
You start with the only thing you have: “the data.” What is it? We don’t know! Although
50 GB sure sounds like a lot, we have no idea what it actually contains. The first thing,
therefore, is to take a look.
And I mean this literally: the first thing to do is to look at the data by plotting it in different
ways and looking at graphs. Looking at data, you will notice things—the way data points
are distributed, or the manner in which one quantity varies with another, or the large
number of outliers, or the total absence of them. I don’t know what you will find, but
there is no doubt: if you look at data, you will observe things!
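To make “just look at it” concrete: even before reaching for a plotting package, a few lines of Python produce a crude text histogram that reveals the shape of a distribution. The data here is synthetic, invented purely for illustration:

```python
import random

random.seed(42)
# Stand-in for "the data": 1,000 synthetic measurements (invented for illustration)
data = [random.gauss(100, 15) for _ in range(1000)]

# Bin the values and print one line per bin: a crude text histogram
lo, hi, nbins = min(data), max(data), 12
width = (hi - lo) / nbins
counts = [0] * nbins
for x in data:
    i = min(int((x - lo) / width), nbins - 1)  # clamp the maximum into the last bin
    counts[i] += 1

for i, c in enumerate(counts):
    print(f"{lo + i * width:8.1f}  {'*' * (c // 5)}")
```

Even this crude picture shows whether the data is single-peaked, skewed, or riddled with outliers; a proper plotting tool (gnuplot, matplotlib) then makes the same inspection faster and prettier.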
These observations should lead to some reflection. “Ten percent of our customers drive
ninety percent of our revenue.” “Whenever our sales volume doubles, the number of
returns goes up by a factor of four.” “Every seven days we have a production run that has
twice the usual defect rate, and it’s always on a Thursday.” How very interesting!
Now you’ve got something to work with: the amorphous mass of “data” has turned into
ideas! To make these ideas concrete and suitable for further work, it is often useful to
capture them in a mathematical form: a model. A model (the way I use the term) is a
mathematical description of the system under study. A model is more than just a
description of the data—it also incorporates your understanding of the process or the
system that produced the data. A model therefore has predictive power: you can predict
(with some certainty) that next Thursday the defect rate will be high again.
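A minimal sketch of such a model, using fabricated log data in which Thursdays are given twice the baseline defect rate by construction. Averaging by weekday recovers the pattern, and the per-weekday averages are the model: they predict next Thursday’s spike:

```python
import random
from collections import defaultdict
from datetime import date, timedelta

random.seed(1)
# Fabricated production log: baseline defect counts, doubled on Thursdays
start = date(2010, 1, 4)  # a Monday
log = []
for day in range(56):     # eight weeks of daily observations
    d = start + timedelta(days=day)
    base = random.randint(8, 12)
    log.append((d, base * 2 if d.weekday() == 3 else base))

# The "model": average defect count per weekday
by_weekday = defaultdict(list)
for d, n in log:
    by_weekday[d.weekday()].append(n)
means = {wd: sum(v) / len(v) for wd, v in by_weekday.items()}

print(round(means[3], 1), round(means[0], 1))  # Thursday vs. Monday average
```

The numbers are invented, but the structure is the point: the model summarizes the mechanism (something happens on Thursdays), not merely the individual data points.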
It’s at this point that you may want to go back and alert the boss of your findings: “Next
Thursday, watch out for defects!”
Sometimes, you may already be finished at this point: you found out enough to help
improve the business. At other times, however, you may need to work a little harder.
Some data sets do not yield easily to visual inspection—especially if you are dealing with
data sets consisting of many different quantities, all of which seem equally important. In
such cases, you may need to employ more-sophisticated methods to develop enough
intuition before being able to formulate a relevant model. Or you may have been able to
set up a model, but it is too complicated to understand its implications, so that you want
to implement the model as a computer program and simulate its results. Such
computationally intensive methods are occasionally useful, but they always come later in
the game. You should only move on to them after having tried all the simple things first.
And you will need the insights gained from those earlier investigations as input to the
more elaborate approaches.
And finally, we need to come back to the initial agenda. To “improve the business” it is
necessary to feed our understanding back into the organization—for instance, in the form
of a business plan, or through a “metrics dashboard” or similar program.
What's in This Book
The program just described reflects the outline of this book.
We begin in Part I with a series of chapters on graphical techniques, starting in Chapter 2
with simple data sets consisting of only a single variable (or considering only a single
variable at a time), then moving on in Chapter 3 to data sets of two variables. In Chapter 4
we treat the particularly important special case of a quantity changing over time, a
so-called time series. Finally, in Chapter 5, we discuss data sets comprising more than two
variables and some special techniques suitable for such data sets.
In Part II, we discuss models as a way to not only describe data but also to capture the
understanding that we gained from graphical explorations. We begin in Chapter 7 with a
discussion of order-of-magnitude estimation and uncertainty considerations. This may
seem odd but is, in fact, crucial: all models are approximate, so we need to develop a sense
for the accuracy of the approximations that we use. In Chapters 8 and 9 we introduce
basic building blocks that are useful when developing models.
Chapter 10 is a detour. For too many people, “data analysis” is synonymous with
“statistics,” and “statistics” is usually equated with a class in college that made no sense at
all. In this chapter, I want to explain what statistics really is, what all the mysterious
concepts mean and how they hang together, and what statistics can (and cannot) do for
us. It is intended as a travel guide should you ever want to read a statistics book in the
future.
Part III discusses several computationally intensive methods, such as simulation and
clustering in Chapters 12 and 13. Chapter 14 is, mathematically, the most challenging
chapter in the book: it deals with methods that can help select the most relevant variables
from a multivariate data set.
In Part IV we consider some ways that data may be used in a business environment. In
Chapter 16 we talk about metrics, reporting, and dashboards—what is sometimes referred
to as “business intelligence.” In Chapter 17 we introduce some of the concepts required to
make financial calculations and to prepare business plans. Finally, in Chapter 18, we
conclude with a survey of some methods from classification and predictive analytics.
At the end of each part of the book you will find an “Intermezzo.” These intermezzos are
not really part of the course; I use them to go off on some tangents, or to explain topics
that often remain a bit hazy. You should see them as an opportunity to relax!
The appendices contain some helpful material that you may want to consult at various
times as you go through the text. Appendix A surveys some of the available tools and
programming environments for data manipulation and analysis. In Appendix B I have
collected some basic mathematical results that I expect you to have at least passing
familiarity with. I assume that you have seen this material at least once before, but in this
appendix, I put it together in an application-oriented context, which is more suitable for
our present purposes. Appendix C discusses some of the mundane tasks that—like it or
not—make up a large part of actual data analysis and also introduces some data-related
terminology.
What's with the Workshops?
Every full chapter (after this one) includes a section titled “Workshop” that contains some
programming examples related to the chapter’s material. I use these Workshops for two
purposes. On the one hand, I’d like to introduce a number of open source tools and
libraries that may be useful for the kind of work discussed in this book. On the other
hand, some concepts (such as computational complexity and power-law distributions)
must be seen to be believed: the Workshops are a way to demonstrate these issues and
allow you to experiment with them yourself.
Among the tools and libraries is quite a bit of Python and R. Python has become
somewhat the scripting language of choice for scientific applications, and R is the most
popular open source package for statistical applications. This choice is neither an endorsement
nor a recommendation but primarily a reflection of the current state of available software.
(See Appendix A for a more detailed discussion of software for data analysis and related
purposes.)
My goal with the tool-oriented Workshops is rather specific: I want to enable you to
decide whether a given tool or library is worth spending time on. (I have found that
evaluating open source offerings is a necessary but time-consuming task.) I try to
demonstrate clearly what purpose each particular tool serves. Toward this end, I usually
give one or two short, but not entirely trivial, examples and try to outline enough of the
architecture of the tool or library to allow you to take it from there. (The documentation
for many open source projects has a hard time making the bridge from the trivial,
cut-and-paste “Hello, World” example to the reference documentation.)

What's with the Math?
This book contains a certain amount of mathematics. Depending on your personal
predilection you may find this trivial, intimidating, or exciting.
The reality is that if you want to work analytically, you will need to develop some
familiarity with a few mathematical concepts. There is simply no way around it. (You can
work with data without any math skills—look at what any data modeler or database
administrator does. But if you want to do any sort of analysis, then a little math becomes a
necessity.)
I have tried to make the text accessible to readers with a minimum of previous knowledge.
Some college math classes on calculus and similar topics are helpful, of course, but are by
no means required. Some sections of the book treat material that is either more abstract or
will likely be unreasonably hard to understand without some previous exposure. These
sections are optional (they are not needed in the sequel) and are clearly marked as such.
A somewhat different issue concerns the notation. I use mathematical notation wherever
it is appropriate and it helps the presentation. I have made sure to use only a very small
set of symbols; check Appendix B if something looks unfamiliar.
Couldn’t I have written all the mathematical expressions as computer code, using Python
or some sort of pseudo-code? The answer is no, because quite a few essential mathematical
concepts cannot be expressed in a finite, floating-point oriented machine (anything
having to do with a limit process—or real numbers, in fact). But even if I could write all
math as code, I don’t think I should. Although I wholeheartedly agree that mathematical
notation can get out of hand, simple formulas actually provide the easiest, most succinct
way to express mathematical concepts.
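A quick demonstration of why limit processes defeat floating point. The derivative of exp at 1 is defined as a limit as h goes to 0, but shrinking h on the machine eventually destroys the estimate instead of improving it:

```python
import math

# f'(1) for f = exp is defined by the limit of (f(1+h) - f(1)) / h as h -> 0.
f = math.exp
exact = math.exp(1.0)  # the true derivative of exp at x = 1

for h in (1e-4, 1e-8, 1e-12, 1e-16):
    approx = (f(1.0 + h) - f(1.0)) / h
    print(f"h = {h:.0e}   error = {abs(approx - exact):.2e}")
```

Shrinking h first improves the estimate; then cancellation takes over, and at h = 1e-16 the difference in the numerator rounds away entirely, so the “derivative” comes out as zero. The limit exists in mathematics but not on the machine.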