Exploratory Analysis of Spatial and Temporal Data
A Systematic Approach

Natalia Andrienko · Gennady Andrienko

With 245 Figures and 34 Tables

Authors
Natalia Andrienko
Gennady Andrienko
Fraunhofer Institute AIS
Schloss Birlinghoven
53754 Sankt Augustin, Germany

Library of Congress Control Number: 2005936053

ACM Computing Classification (1998): J.2, H.3

ISBN-10 3-540-25994-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-25994-7 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations
are liable for prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typeset by the authors
Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig
Cover design: KünkelLopka Werbeagentur, Heidelberg
Printed on acid-free paper

45/3142/YL - 5 4 3 2 1 0

Preface

This book is based upon the extensive practical experience of the authors
in designing and developing software tools for visualisation of spatially
referenced data and applying them in various problem domains. These
tools include methods for cartographic visualisation; non-spatial graphs;
devices for querying, search, and classification; and computer-enhanced
visual techniques. A common feature of all the tools is their high user interactivity, which is essential for exploratory data analysis. The tools can
be used conveniently in various combinations; their cooperative functioning is enabled by manifold coordination mechanisms.
Typically, our ideas for new tools or extensions of existing ones have
arisen from contemplating particular datasets from various domains. Understanding the properties of the data and the relationships between the
components of the data triggered a vision of the appropriate ways of visualising and exploring the data. This resulted in many original techniques,
which were, however, designed and implemented so as to be applicable not
only to the particular dataset that had incited their development but also to
other datasets with similar characteristics. For this purpose, we strove to
think about the given data in terms of the generic characteristics of some
broad class that the data belonged to rather than stick to their specifics.
From many practical cases of moving from data to visualisation, we
gained a certain understanding of what characteristics of data are relevant
for choosing proper visualisation techniques. We learned also that an essential stage on the way from data to the selection or design of proper exploratory tools is to envision the questions an analyst might seek to answer
in exploring this kind of data, or, in other words, the data analysis tasks.
Knowing the questions (or, rather, types of questions), one may look at
familiar techniques from the perspective of whether they could help one to
find answers to those questions. It may happen in some cases that there is a
subset of existing tools that covers all potential question types. It may also
happen that for some tasks there are no appropriate tools. In that case, the
nature of the tasks gives a clue as to what kind of tool would be helpful.
This is an important initial step in designing a new tool.
Having passed along the way from data through tasks to tools many
times, we found it appropriate to share the knowledge that we gained from
this process with other people. We would like to describe what components may exist in spatially referenced data, how these components may
relate to each other, and what effect various properties of these components and relationships between them may have on tool selection. We
would also like to show how to translate the characteristics of data and
structures into potential analysis tasks, and enumerate the widely accepted
principles and our own heuristics that usually help us in proceeding from
the tasks to the appropriate approaches to accomplishing them, and to the
tools that could support this. In other words, we propose a methodological
framework for the design, selection, and application of visualisation techniques and tools for exploratory analysis of spatially referenced data. Particular attention is paid to spatio-temporal data, i.e. data having both spatial and temporal components.
We expect this book to be useful to several groups of readers. People
practising analysis of spatially referenced data should be interested in becoming familiar with the proposed illustrated catalogue of the state-of-the-art exploratory tools. The framework for selecting appropriate analysis
tools might also be useful to them. Students (undergraduate and postgraduate) in various geography-related disciplines could gain valuable information about the possible types of spatial data, their components, and the relationships between them, as well as the impact of the characteristics of the
data on the selection of appropriate visualisation methods. Students could
also learn about various methods of data exploration using visual, highly
interactive tools, and acknowledge the value of a conscious, systematic
approach to exploratory data analysis. The book may be interesting to researchers in computer cartography, especially those imbued with the ideas
of cartographic visualisation, in particular, the ideas widely disseminated
by the special Commission on Visualisation of the International Cartographic Association. Our tools are in full accord with these ideas, and our
data- and task-analytic approach to tool design offers a way of putting
these ideas into practice. It can also be expected that the book will be interesting to researchers and practitioners dealing with any kind of visualisation, not necessarily the visualisation of spatial data. Many of the ideas
and approaches presented are not restricted to only spatially referenced
data, but have a more general applicability.
The topic of the book is much more general than the consideration of
any particular software: we investigate the relations between the characteristics of data, exploratory tasks (questions), and data exploration techniques. We do this first on a theoretical level and then using practical examples. In the examples, we may use particular implementations of the
techniques, either our own implementations or freely available demonstrators. However, the main purpose is not to instruct readers in how to use
this or that particular tool but to allow them to better understand the ideas
of exploratory data analysis.
The book is intended for a broad reader community and does not require
a solid background in mathematics, statistics, geography, or informatics,
but only a general familiarity with these subjects. However, we hope that
the book will be interesting and useful also to those who do have a solid
background in any or all of these disciplines.

Acknowledgements
This book is a result of a theoretical generalisation of our research over
more than 15 years. During this period, many people helped us to establish
ourselves and grow as scientists. We would like to express our gratitude to
our scientific “parents” Nadezhda Chemeris, Yuri Pechersky, and Sergey
Soloview, without whom our research careers would not have started. We
are also grateful to our colleagues and partners who significantly influenced and encouraged our work from its early stages, namely Leonid Mikulich, Alexander Komarov, Valeri Gitis, Maria Palenova, and Hans Voss.
Since 1997 we have been working at GMD, the German National Research Centre for Information Technology, which was later transformed
into the AIS (Autonomous Intelligent Systems) Fraunhofer Institute. Institute directors Thomas Christaller and Stefan Wrobel and department heads
Hans Voss and Michael May always supported and approved our work.
All our colleagues were always cooperative and helpful. We are especially
grateful to Dietrich Wettschereck, Alexandr Savinov, Peter Gatalsky, Ivan
Denisovich, Mark Ostrovsky, Simon Scheider, Vera Hernandez, Andrey
Martynkin, and Willi Kloesgen for fruitful discussions and cooperation.
Our research was developed in the framework of numerous international
projects. We acknowledge funding from the European Commission and
the friendly support of all our partners. We owe much to Robert Peckham,
Jackie Carter, Jim Petch, Oleg Chertov, Andreas Schuck, Risto Paivinen,
Frits Mohren, Mauro Salvemini, and Matteo Villa. Our work was also
greatly inspired by a fruitful (although informal) cooperation with Piotr
Jankowski and Alexander Lotov.
Our participation in the ICA commissions on Visualisation and Virtual
Environments, Maps and the Internet, and Theoretical Cartography had a
strong influence on the formation and refinement of our ideas. Among all
the members of these commissions, we are especially grateful to Alan
MacEachren, Menno-Jan Kraak, Sara Fabrikant, Jason Dykes, David Fairbain, Terry Slocum, Mark Gahegan, Jürgen Döllner, Monica Wachowicz,
Corne van Elzakker, Michael Peterson, Georg Gartner, Alexander Volodtschenko, and Hans Schlichtmann.
Discussions with Ben Shneiderman, Antony Unwin, Robert Haining,
Werner Kuhn, Jonathan Roberts, and Alfred Inselberg were a rich source
of inspiration and provided apt occasions to verify our ideas. Special
thanks are due to the scientists whose books were formative for our research, namely John Tukey, Jacques Bertin, George Klir, and Rudolf Arnheim.
The authors gratefully acknowledge the encouraging comments of the
reviewers, the painstaking work of the copyeditor, and the friendly cooperation of Ralf Gerstner and other people of Springer-Verlag.
We thank our family for the patience during the time that we used for
discussing and writing the book in the evenings, weekends, and during vacations.
Almost all of the illustrations in the book were produced using the
CommonGIS system and some other research prototypes developed in our
institute. Online demonstrators of these systems are available on our Web
site and on the web site of our institute department. People interested in
using the software should visit the site of CommonGIS.
The datasets used in the book were provided by our partners in various
projects.
1. Portuguese census. The data set was provided by CNIG (Portuguese
National Centre for Geographic Information) within the EU-funded project CommonGIS (Esprit project 28983). The data were prepared by
Joana Abreu, Fatima Bernardo, and Joana Hipolito.
2. Forests in Europe. The dataset was created within the project “Combining Geographically Referenced Earth Observation Data and Forest
Statistics for Deriving a Forest Map for Europe” (15237-1999-08 F1ED
ISP FI). The data were provided to us by EFI (the European Forest Institute) within the project EFIS (European Forest Information System), contract number 17186-2000-12 F1ED ISP FI.
3. Earthquakes in Turkey. The dataset was provided within the project
SPIN! (Spatial Mining for Data of Public Interest) (IST Programme,
project IST-1999-10536) by Valery Gitis and his colleagues.
4. Migration of white storks. The data were provided by the German Research Centre for Ornithology of the Max Planck Society within a German school project called “Naturdetektive”. The data were prepared by
Peter Gatalsky.
5. Weather in Germany. The dataset was published by Deutscher Wetterdienst at the URL />online/nat/index_monatswerte.htm. Simon Scheider prepared the data
for application of the tools.
6. Crime in the USA. The dataset was published by the US Department of
Justice, URL The data were
prepared by Mohammed Islam.
7. Forest management scenarios. The dataset was created in the project
SILVICS (Silvicultural Systems for Sustainable Forest Resources Management) (INTAS EU-funded project). The data were prepared for
analysis by Alexey Mikhaylov and Peter Gatalsky.
8. Forest fires in Umbria. The dataset was provided within the NEFIS
(Network for a European Forest Information Service) project, an accompanying measure in the Quality of Life and Management of Living
Resources Programme of the European Commission (contract number
QLK5-CT-2002-30638). The data were collected by Regione
dell’Umbria, Servizio programmazione forestale, Perugia, Italy; the survey was performed by Corpo Forestale dello Stato, Italy.
9. Health care in Idaho. The dataset was provided by Piotr Jankowski
within an informal cooperation project between GMD and the University of Idaho, Moscow, ID.

August 2005
Sankt Augustin, Germany

Natalia Andrienko

Gennady Andrienko

Contents

1 Introduction
1.1 What Is Data Analysis?
1.2 Objectives of the Book
1.3 Outline of the Book
1.3.1 Data
1.3.2 Tasks
1.3.3 Tools
1.3.4 General Principles
References

2 Data
Abstract
2.1 Structure of Data
2.1.1 Functional View of Data Structure
2.1.2 Other Approaches
2.2 Properties of Data
2.2.1 Other Approaches
2.3 Examples of Data
2.3.1 Portuguese Census
2.3.2 Forests in Europe
2.3.3 Earthquakes in Turkey
2.3.4 Migration of White Storks
2.3.5 Weather in Germany
2.3.6 Crime in the USA
2.3.7 Forest Management Scenarios
Summary
References

3 Tasks
Abstract
3.1 Jacques Bertin’s View of Tasks
3.2 General View of a Task
3.3 Elementary Tasks
3.3.1 Lookup and Comparison
3.3.2 Relation-Seeking
3.3.3 Recap: Elementary Tasks
3.4 Synoptic Tasks
3.4.1 General Notes
3.4.2 Behaviour and Pattern
3.4.3 Types of Patterns
3.4.3.1 Association Patterns
3.4.3.2 Differentiation Patterns
3.4.3.3 Arrangement Patterns
3.4.3.4 Distribution Summary
3.4.3.5 General Notes
3.4.4 Behaviours over Multidimensional Reference Sets
3.4.5 Pattern Search and Comparison
3.4.6 Inverse Comparison
3.4.7 Relation-Seeking
3.4.8 Recap: Synoptic Tasks
3.5 Connection Discovery
3.5.1 General Notes
3.5.2 Properties and Formalisation
3.5.3 Relation to the Former Categories
3.6 Completeness of the Framework
3.7 Relating Behaviours: a Cognitive-Psychology Perspective
3.8 Why Tasks?
3.9 Other Approaches
Summary
References

4 Tools
Abstract
4.1 A Few Introductory Notes
4.2 The Value of Visualisation
4.3 Visualisation in a Nutshell
4.3.1 Bertin’s Theory and Its Extensions
4.3.2 Dimensions and Variables of Visualisation
4.3.3 Basic Principles of Visualisation
4.3.4 Example Visualisations
4.4 Display Manipulation
4.4.1 Ordering
4.4.2 Eliminating Excessive Detail
4.4.3 Classification
4.4.4 Zooming and Focusing
4.4.5 Substitution of the Encoding Function
4.4.6 Visual Comparison
4.4.7 Recap: Display Manipulation
4.5 Data Manipulation
4.5.1 Attribute Transformation
4.5.1.1 “Relativisation”
4.5.1.2 Computing Changes
4.5.1.3 Accumulation
4.5.1.4 Neighbourhood-Based Attribute Transformations
4.5.2 Attribute Integration
4.5.2.1 An Example of Integration
4.5.2.2 Dynamic Integration of Attributes
4.5.3 Value Interpolation
4.5.4 Data Aggregation
4.5.4.1 Grouping Methods
4.5.4.2 Characterising Aggregates
4.5.4.3 Visualisation of Aggregate Sizes
4.5.4.4 Sizes Are Not Only Counts
4.5.4.5 Visualisation and Use of Positional Measures
4.5.4.6 Spatial Aggregation and Reaggregation
4.5.4.7 A Few Words About OLAP
4.5.4.8 Data Aggregation: a Few Concluding Remarks
4.5.5 Recap: Data Manipulation
4.6 Querying
4.6.1 Asking Questions
4.6.1.1 Spatial Queries
4.6.1.2 Temporal Queries
4.6.1.3 Asking Questions: Summary
4.6.2 Answering Questions
4.6.2.1 Filtering
4.6.2.2 Marking
4.6.2.3 Marking Versus Filtering
4.6.2.4 Relations as Query Results
4.6.3 Non-Elementary Queries
4.6.4 Recap: Querying
4.7 Computational Tools
4.7.1 A Few Words About Statistical Analysis
4.7.2 A Few Words About Data Mining
4.7.3 The General Paradigm for Using Computational Tools
4.7.4 Example: Clustering
4.7.5 Example: Classification
4.7.6 Example: Data Preparation
4.7.7 Recap: Computational Tools
4.8 Tool Combination and Coordination
4.8.1 Sequential Tool Combination
4.8.2 Concurrent Tool Combination
4.8.3 Recap: Tool Combination
4.9 Exploratory Tools and Technological Progress
Summary
References

5 Principles
Abstract
5.1 Motivation
5.2 Components of the Exploratory Process
5.3 Some Examples of Exploration
5.4 General Principles of Selection of the Methods and Tools
5.4.1 Principle 1: See the Whole
5.4.1.1 Completeness
5.4.1.2 Unification
5.4.2 Principle 2: Simplify and Abstract
5.4.3 Principle 3: Divide and Group
5.4.4 Principle 4: See in Relation
5.4.5 Principle 5: Look for Recognisable
5.4.6 Principle 6: Zoom and Focus
5.4.7 Principle 7: Attend to Particulars
5.4.8 Principle 8: Establish Linkages
5.4.9 Principle 9: Establish Structure
5.4.10 Principle 10: Involve Domain Knowledge
5.5 General Scheme of Data Exploration: Tasks, Principles, and Tools
5.5.1 Case 1: Single Referrer, Holistic View Possible
5.5.1.1 Subcase 1.1: a Homogeneous Behaviour
5.5.1.2 Subcase 1.2: a Heterogeneous Behaviour
5.5.2 Case 2: Multiple Referrers
5.5.2.1 Subcase 2.1: Holistic View Possible
5.5.2.2 Subcase 2.2: Behaviour Explored by Slices and Aspects
5.5.3 Case 3: Multiple Attributes
5.5.4 Case 4: Large Data Volume
5.5.5 Final Remarks
5.6 Applying the Scheme (an Example)
Summary
References

6 Conclusion

Appendix I: Major Definitions
I.1 Data
I.2 Tasks
I.3 Tools
Appendix II: A Guide to Our Major Publications Relevant to This Book
References
Appendix III: Tools for Visual Analysis of Spatio-Temporal Data Developed at the AIS Fraunhofer Institute
References
Index

1 Introduction

1.1 What Is Data Analysis?

It seems curious that we have not found a general definition of this term in
the literature. In statistics, for example, data analysis is understood as “the
process of computing various summaries and derived values from the
given collection of data” (Hand 1999, p. 3). It is specially stressed that the
process is iterative: “One studies the data, examines it using some analytic
technique, decides to look at it another way, perhaps modifying it in the
process by transformation or partitioning, and then goes back to the beginning and applies another data analytic tool. This can go round and round
many times. Each technique is being used to probe a slightly different aspect of the data – to ask a slightly different question of the data” (Hand
1999, p. 3).
In the area of geographic information systems (GIS), data analysis is often defined as “a process for looking at geographic patterns in your data
and at relationships between features” (Mitchell 1999, p. 11). It starts with
formulating the question that needs to be answered, followed by choosing
a method on the basis of the question, the type of data available, and the
level of information required (this may raise a need for additional data).
Then the data are processed with the use of the chosen method and the results are displayed. This allows the analyst to decide whether the information obtained is valid or useful, or whether the analysis should be redone
using different parameters or even a different method.
Let us look at what is common to these two definitions. Both of them define data analysis as an iterative process consisting of the following activities:
• formulate questions;
• choose analysis methods;
• prepare the data for application of the methods;
• apply the methods to the data;
• interpret and evaluate the results obtained.
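
The activities listed above can be pictured as a loop in which each pass puts one question to the data. The following Python fragment is only a minimal sketch of that idea and is not taken from the cited handbooks: the toy dataset, the question texts, and the helper names `prepare` and `analyse` are all assumptions introduced for illustration.

```python
from statistics import mean, median

# Hypothetical toy data: monthly burglary counts per district (invented values).
DATA = {"North": [12, 15, 11], "South": [30, 28, 35], "East": [7, 9, 8]}

# Each question is paired with the method assumed to answer it.
METHODS = {
    "What is the typical level per district?": mean,
    "What is the middle value per district?": median,
    "What is the largest value per district?": max,
}

def prepare(data):
    """Prepare the data: here, simply drop missing (None) observations."""
    return {key: [v for v in values if v is not None] for key, values in data.items()}

def analyse(data, questions):
    """Make one pass of the cycle per question and collect the findings."""
    findings = []
    for question in questions:                 # formulate a question
        method = METHODS[question]             # choose an analysis method
        prepared = prepare(data)               # prepare the data
        result = {k: method(v) for k, v in prepared.items()}  # apply the method
        top = max(result, key=result.get)      # interpret: which district stands out?
        findings.append((question, top, result[top]))
    return findings

findings = analyse(DATA, list(METHODS))
for question, district, value in findings:
    print(f"{question} -> highest in {district}: {value}")
```

In a real analysis the list of questions would not be fixed in advance: the interpretation of one result typically suggests the next question, which is what makes the process iterative.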

The difference between statistical analysis and GIS analysis seems to lie
only in the types of data that they deal with and in the methods used. In
both cases, data analysis appears to be driven by questions: the questions
motivate one to do analysis, determine the choice of data and methods, and
affect the interpretation of the results. Since the questions are so important,
what are they?
Neither statistical nor GIS handbooks provide any classification of possible questions but they give instead a few examples. Here are some examples from a GIS handbook (Mitchell 1999):
• Where were most of the burglaries last month?
• How much forest is in each watershed?
• Which parcels are within 500 feet of this liquor store?
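
The last of these questions, for instance, translates almost directly into a distance query. The Python sketch below illustrates this under simplifying assumptions: the parcel names and coordinates are invented, each parcel is reduced to a single point, and distances are straight-line distances in a planar coordinate system measured in feet. The function name `within_distance` is likewise hypothetical and does not refer to any particular GIS package.

```python
from math import hypot

# Hypothetical parcel centroids in a planar coordinate system measured in feet.
PARCELS = {"P1": (100.0, 200.0), "P2": (450.0, 300.0), "P3": (900.0, 900.0)}
STORE = (300.0, 250.0)  # hypothetical location of the store

def within_distance(points, centre, radius):
    """Return the names of points whose straight-line distance to centre is at most radius."""
    cx, cy = centre
    return sorted(
        name for name, (x, y) in points.items()
        if hypot(x - cx, y - cy) <= radius
    )

nearby = within_distance(PARCELS, STORE, 500.0)
print(nearby)
```

A GIS would normally work with parcel polygons rather than points and would account for the map projection; the sketch only shows how the wording of the question already points to a particular kind of method.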
For a comparison, here are some examples from a statistical handbook for
geographers (Burt and Barber 1996):
• What major explanatory variables account for the variation in individual
house prices in cities?
• Are locational variables more or less important than the characteristics
of the house itself or of the neighbourhood in which it is located?
• How do these results compare across cities?
It can be noticed that the example questions in the two groups have discernible flavours of the particular methods available in GIS and statistical
analysis, respectively, i.e. the questions have been formulated with certain
analysis methods in mind. This is natural for handbooks, which are intended to teach their readers to use methods, but how does this match the
actual practice of data analysis?
We believe that questions oriented towards particular analysis methods
may indeed exist in many situations, for example, when somebody performs routine analyses of data of the same type and structure. But what
happens when an analyst encounters new data that do not resemble anything dealt with so far? It seems clear that the analyst needs to get acquainted with the data before he/she can formulate questions like those
cited in the handbooks, i.e. questions that already imply what method to
use.
“Getting acquainted with data” is the topic pursued in exploratory data
analysis, or EDA. As has been said in an Internet course in statistics, “Often when working with statistics we wish to answer a specific question
such as does smoking cigars lead to an increased risk of lung cancer? Or
does the number of keys carried by men exceed those carried by women?
... However sometimes we just wish to explore a data set to see what it
might tell us. When we do this we are doing Exploratory Data Analysis”
(STAT 2005).
Although EDA emerged from statistics, it is not a set of specific techniques, unlike statistics itself, but rather a philosophy of how data analysis
should be carried out. This philosophy was defined by John Tukey (Tukey
1977) as a counterbalance to a long-term bias in statistical research towards developing mathematical methods for hypothesis testing. As Tukey
saw it, EDA was a return to the original goals of statistics, i.e. detecting
and describing patterns, trends, and relationships in data. Or, in other
words, EDA is about hypothesis generation rather than hypothesis testing.
The concept of EDA is strongly associated with the use of graphical representations of data. As has been said in an electronic handbook on engineering statistics, “Most EDA techniques are graphical in nature with a
few quantitative techniques. The reason for the heavy reliance on graphics
is that by its very nature the main role of EDA is to open-mindedly explore, and graphics gives the analysts unparalleled power to do so, enticing
the data to reveal its structural secrets, and being always ready to gain
some new, often unsuspected, insight into the data. In combination with
the natural pattern-recognition capabilities that we all possess, graphics
provides, of course, unparalleled power to carry this out”
(NIST/SEMATECH 2005).
Is the process of exploratory data analysis also question-driven, like traditional statistical analysis and GIS analysis? On the one hand, it is hardly
imaginable that someone would start exploring data without having any
question in mind; why then start at all? On the other hand, if any questions
are asked, they must be essentially different from the examples cited
above. They cannot be so specific and cannot imply what analysis method
will be used. Appropriate examples can be found in George Klir’s explanation of what empirical investigation is (Klir 1985).
According to Klir, a meaningful empirical investigation implies an object of investigation, a purpose of the investigation of the object, and constraints imposed upon the investigation. “The purpose of investigation can
be viewed as a set of questions regarding the object which the investigator
(or his client) wants to answer. For example, if the object of investigation
is New York City, the purpose of the investigation might be represented by
questions such as ‘How can crime be reduced in the city?’ or ‘How can
transportation be improved in the city?’; if the object of investigation is a
computer installation, the purpose of investigation might be to answer
questions ‘What are the bottlenecks in the installation?’, ‘What can be
done to improve performance?’, and the like; if a hospital is investigated,
the question might be ‘How can the ability to give immediate care to all
emergency cases be increased?’, ‘How can the average time spent by a
patient in the hospital be reduced?’, or ‘What can be done to reduce the
cost while preserving the quality of services?’; if the object of interest of a
musicologist is a musical composer, say Igor Stravinsky, his question is
likely to be ‘What are the basic characteristics of Stravinsky’s compositions which distinguish him from other composers?’ ” (Klir 1985, p. 83).
Although Klir does not use the term “exploratory data analysis”, it is clear
that exploratory analysis starts after collecting data about the object of investigation, and the questions representing the purpose of investigation
remain relevant.
According to the well-known “Information Seeking Mantra” introduced
by Ben Shneiderman (Shneiderman 1996), EDA can be generalised as a
three-step process: “Overview first, zoom and filter, and then details-on-demand”. In the first step, an analyst needs to get an overview of the entire
data collection. In this overview, the analyst identifies “items of interest”.
In the second step, the analyst zooms in on the items of interest and filters
out uninteresting items. In the third step, the analyst selects an item or
group of items for “drilling down” and accessing more details. Again, the
process is iterative, with many returns to the previous steps. Although
Shneiderman does not explicitly state this, it seems natural that it is the
general goal of investigation that determines what items will be found “interesting” and deserving of further examination.
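The three steps of the mantra can be illustrated with a minimal Python sketch. The records, the attribute names, and the choice of "above the mean" as the criterion of interest are all invented for illustration; they are assumptions and do not come from Shneiderman or from any dataset discussed in this book.

```python
from statistics import mean

# Hypothetical records, one per district (names and values are invented).
records = [
    {"district": "A", "population": 1200, "burglaries": 14},
    {"district": "B", "population": 800, "burglaries": 3},
    {"district": "C", "population": 1500, "burglaries": 27},
    {"district": "D", "population": 950, "burglaries": 5},
]


def overview(items, attribute):
    """Step 1: summarise the whole collection for one attribute."""
    values = [item[attribute] for item in items]
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def zoom_and_filter(items, attribute, threshold):
    """Step 2: keep only the items of interest (here: values above a threshold)."""
    return [item for item in items if item[attribute] > threshold]


def details_on_demand(items, district):
    """Step 3: return the full record of one selected item."""
    return next(item for item in items if item["district"] == district)


summary = overview(records, "burglaries")
interesting = zoom_and_filter(records, "burglaries", summary["mean"])
detail = details_on_demand(interesting, "C")
```

In practice the criterion used in the second step would follow from the purpose of investigation, and the analyst would return to the overview many times.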
On this basis, we adopt the following view of EDA. The analyst has a
certain purpose of investigation, which motivates the analysis. The purpose
is specified as a general question or a set of general questions. The analyst
starts the analysis by looking for what is interesting in the data, where “interestingness” is understood as relevance to the purpose of investigation.
When something interesting is detected, new, more specific questions appear, which motivate the analyst to look for details. These questions affect
what details will be viewed and in what ways. Hence, questions play an
important role in EDA and can determine the choice of analysis methods.
There are a few distinctions in comparison with the example questions
given in textbooks on statistics and GIS:
• EDA essentially involves many different questions;
• the questions vary in their level of generality;
• most of the questions arise in the course of analysis rather than being formulated in advance.

These peculiarities make it rather difficult to formulate any guidelines for
successful data exploration, any instructions concerning what methods to
use in what situation. Still, we want to try.
There is an implication of the multitude and diversity of questions involved in exploratory data analysis: this kind of analysis requires multiple
tools and techniques to be used in combination, since no single tool can
provide answers to all the questions. Ideally, a software system intended to
support EDA must contain a set of tools that could help an analyst to answer any possible question (of course, only if the necessary information is
available in the data). This ideal will, probably, never be achieved, but a
designer conceiving a system or tool kit for data analysis needs to anticipate the potential questions and at least make a rational choice concerning
which of them to support.

1.2 Objectives of the Book

This is a book about exploratory data analysis and, in particular, exploratory analysis of spatial and temporal data. The originator of EDA, John
Tukey, begins his seminal book by comparing exploratory data analysis
to detective work, and dwells further upon this analogy: “A detective investigating a crime needs both tools and understanding. If he has no fingerprint powder, he will fail to find fingerprints on most surfaces. If he
does not understand where the criminal is likely to have put his fingers, he
will not look in the right places. Equally, the analyst of data needs both
tools and understanding” (Tukey 1977, p. 1).
Like Tukey, we also want to talk about tools and understanding. We
want to consider current computer-based tools suitable for exploratory
analysis of spatial and spatio-temporal data. By “tools”, we do not mean
primarily ready-to-use software executables; we also mean approaches,
techniques, and methods that have, for example, been demonstrated on
pilot prototypes but have not yet come to real implementation.
Unlike Tukey, we have not set ourselves the goal of describing each tool
in detail and explaining how to use it. Instead, we aim to systemise the
tools (which are quite numerous) into a sort of catalogue and thereby lead
readers to an understanding of the principles of choosing appropriate tools.
The ultimate goal is that an analyst can easily determine what tools would
be useful in any particular case of data exploration.
The most important factors for tool selection are the data to be analysed
and the question(s) to be answered by means of analysis. Hence, these two
factors must form part of the basis of our systemisation, in spite of the fact
that every dataset is different and the number of possible questions is infinite. To cope with this multitude, it is necessary to think about data and
questions in a general, domain-independent manner. First, we need to determine what general characteristics of data are essential to the problem of
choosing the right exploratory tools. We want not only to be domain-independent
but also to put aside any specifics of data collection, organisation, storage, and representation formats. Second, we need to abstract a
reasonable number of general question types, or data analysis tasks, from
the myriad particular questions. While any particular question is formulated in terms of a specific domain that the data under analysis are relevant
to, a general task is defined in terms of structural components of the data
and relations between them.
Accordingly, we start by developing a general view of data structure and
characteristics and then, on this basis, build a general task typology. After
that, we try to extend the generality attained to the consideration of existing methods and techniques for exploratory data analysis. We abstract
from the particular tools and functions available in current software packages to types of tools and general approaches. The general tool typology
uses the major concepts of the data framework and of the task typology.
Throughout all this general discussion, we give many concrete examples,
which should help in understanding the abstract concepts.
Although each subsequent element in the chain “data → tasks → tools” refers to the major concepts of the previous element(s), this sort of linkage
does not provide explicit guidelines for choosing tools and approaches in
the course of data exploration. Therefore, we complete the chain by revealing the general principles of exploratory data analysis, which include recommendations for choosing tools and methods but extend beyond this by
suggesting a kit of generic procedures for data exploration and by encouraging a certain amount of discipline in dealing with data.
In this way, we hope to accomplish our goal: to enumerate the tools and
to give an understanding of how to choose and use them. In parallel, we hope
to give some useful guidelines for tool designers. We expect that the general typology of data and tasks will help them to anticipate the typical
questions that may arise in data exploration. In the catalogue of techniques,
designers may find good solutions that could be reused. If this is not the
case (we expect that our cataloguing work will expose some gaps in the
data-task space that are not covered by the existing tools), the general
principles and approaches should be helpful in designing new tools.

1.3 Outline of the Book

1.3.1 Data

As we said earlier, we begin by introducing a general view of the structure and properties of the data; this is done in the next chapter, entitled
“Data”. The most essential point is to distinguish between characteristic
and referential components of data: the former reflect observations or
measurements while the latter specify the context in which the observations or measurements were made, for example place and/or time. It is
proposed that we view a dataset as a function (in a mathematical sense)
establishing linkages between references (i.e. particular indications of
place, time, etc.) and characteristics (i.e. particular measured or observed
values). The function may be represented symbolically as follows (Fig.
1.1):
[Figure: the function f maps the references r1, r2, r3, r4 of the set R to the characteristics c1, c2, c3, c4 of the set C]

Fig. 1.1. The functional view of a dataset
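The functional view can be made concrete with a small Python sketch in which a dictionary plays the role of the function f. The references, here pairs of a place code and a year, and all the values are hypothetical and serve only as an illustration; the name `f` simply mirrors the notation of Fig. 1.1.

```python
# Hypothetical dataset: references are (place, year) pairs,
# characteristics are measured values. All entries are invented.
dataset = {
    ("P1", 1991): 10.2,
    ("P1", 2001): 11.5,
    ("P2", 1991): 20.1,
    ("P2", 2001): 19.8,
}


def f(reference):
    """Return the characteristic that the dataset assigns to a reference."""
    return dataset[reference]


references = set(dataset)                 # the set R
characteristics = set(dataset.values())   # the values of C that actually occur
```

A dictionary is a convenient stand-in because each key has exactly one value, just as a function assigns exactly one characteristic to each reference.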

The major theoretical concepts are illustrated by examples of seven specific datasets. Pictures such as the following one (Fig. 1.2) represent visually the structural components of the data:
References (referrers)              Characteristics (attribute)
              Forest types          % of covered land
19_07         Broadleaved           5.0
19_07         Coniferous            13.8
19_07         Mixed                 3.3
19_07         Other                 34.9
19_08         Broadleaved           7.3
19_08         Coniferous            4.4
…             …                     …

Fig. 1.2. A visual representation of the structure of a dataset

Those readers who tend to be bored by abstract discussions or cannot
invest much time in reading may skip the theoretical part and proceed from
the abstract material immediately to the examples, which, we hope, will
reflect the essence of the data framework. These examples are frequently
referred to throughout the book, especially those relating to the Portuguese
census and the US crime statistics. If unfamiliar terms occur in the descriptions of the examples, they may be looked up in the list of major definitions in Appendix I.

1.3.2 Tasks

Chapter 3 is intended to propound a comprehensive typology of the possible data analysis tasks, that is, questions that need to be answered by
means of data analysis. Tasks are defined in terms of data components.
Thus, Fig. 1.3 represents schematically the tasks “What are the characteristics corresponding to the given reference?” and “What is the reference corresponding to the given characteristics?”
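[Figure: two schematic diagrams of the function f from R to C. In the first, a reference r is given and the corresponding characteristic is unknown (?); in the second, a characteristic c is given and the corresponding reference is unknown (?)]

Fig. 1.3. Two types of tasks are represented schematically on the basis of the functional view of data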
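The two tasks can be sketched in Python as a lookup and an inverse lookup over a dictionary. The example rows follow Fig. 1.2, with values paired to references in the order in which they appear there, which is an assumption; the function names are hypothetical.

```python
# Example rows after Fig. 1.2: (code, forest type) -> % of covered land.
# The pairing of values with references is an assumption.
forest_cover = {
    ("19_07", "Broadleaved"): 5.0,
    ("19_07", "Coniferous"): 13.8,
    ("19_07", "Mixed"): 3.3,
    ("19_07", "Other"): 34.9,
    ("19_08", "Broadleaved"): 7.3,
    ("19_08", "Coniferous"): 4.4,
}


def characteristic_of(data, reference):
    """Direct task: given a reference, find the corresponding characteristic."""
    return data[reference]


def references_with(data, predicate):
    """Inverse task: find all references whose characteristic meets a condition."""
    return [ref for ref, value in data.items() if predicate(value)]


mixed_share = characteristic_of(forest_cover, ("19_07", "Mixed"))
above_ten = references_with(forest_cover, lambda value: value > 10)
```

The direct task always has exactly one answer, whereas the inverse task may return several references or none, which is why it is written here as a search returning a list.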

An essential point is the distinction between elementary and synoptic
tasks. “Elementary” does not mean “simple”, although elementary tasks
are usually simpler than synoptic ones. Elementary tasks deal with elements of data, i.e. individual references and characteristics. Synoptic tasks
deal with sets of references and the corresponding configurations of characteristics, both being considered as unified wholes. We introduce the
terms “behaviour” and “pattern”. “Behaviour” denotes a particular, objectively existing configuration of characteristics, and “pattern” denotes the
way in which we see and interpret a behaviour and present it to other people. For example, we can qualify the behaviour of the midday air temperature during the first week of April as an increasing trend. Here, “increasing
trend” is the pattern resulting from our perception of the behaviour.
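The distinction can be illustrated with a short Python sketch in which the behaviour is a sequence of values and the pattern is a label produced by a summarising function. The temperatures are invented, and the rule used to assign the label (strict monotonicity) is only one crude, assumed way of forming a pattern.

```python
# Behaviour: hypothetical midday temperatures (°C) for 1-7 April.
behaviour = [8.5, 9.0, 10.2, 11.1, 11.9, 13.4, 14.0]


def describe_trend(values):
    """Summarise a sequence of values as a coarse pattern label."""
    if len(values) < 2:
        return "undefined"
    # Differences between each value and its predecessor.
    steps = [later - earlier for earlier, later in zip(values, values[1:])]
    if all(step > 0 for step in steps):
        return "increasing trend"
    if all(step < 0 for step in steps):
        return "decreasing trend"
    if all(step == 0 for step in steps):
        return "constant"
    return "no monotonic trend"


pattern = describe_trend(behaviour)
```

The same behaviour could be summarised by other patterns, for instance an average rate of increase, depending on what the analyst wants to communicate.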
The major goal of exploratory data analysis may be viewed generally as
building an appropriate pattern from the overall behaviour defined by the
entire dataset, for example, “What is the behaviour of forest structures in
the territory of Europe?” or “What is the behaviour of the climate of Germany during the period from 1991 to 2003?”

We consider the complexities that arise in exploring multidimensional
data, i.e. data with two or more referential components, for example space
and time. Thus, in the following two images (Fig. 1.4), the same space- and time-referenced data are viewed as a spatial arrangement of local behaviours over time and as a temporal sequence of momentary behaviours
over the territory:

Fig. 1.4. Two possible views of the same space- and time-referenced data

This demonstrates that the behaviour of multidimensional data may be
viewed from different perspectives, and each perspective reveals some aspect of it, which may be called an “aspectual” behaviour. In principle, each
aspectual behaviour needs to be analysed, but the number of such behaviours grows rapidly with an increasing number of referential components:
6 behaviours in three-dimensional data, 24 in four-dimensional data, 120
in five-dimensional data, and so on.
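The two views of space- and time-referenced data, and the counts quoted above, can be sketched in Python. The data are invented. The counts 6, 24 and 120 coincide with 3!, 4! and 5!; reading each aspectual behaviour as one order in which the referential components are nested is an interpretation offered here as an assumption, not a statement taken from the text.

```python
from itertools import permutations

# Hypothetical values referenced by place and year.
data = {
    ("P1", 1991): 4.0, ("P1", 1992): 5.0,
    ("P2", 1991): 7.0, ("P2", 1992): 6.0,
}


def local_behaviours(data):
    """View 1: for each place, the sequence of values over time."""
    view = {}
    for (place, year), value in sorted(data.items()):
        view.setdefault(place, []).append((year, value))
    return view


def momentary_behaviours(data):
    """View 2: for each moment, the distribution of values over places."""
    view = {}
    for (place, year), value in sorted(data.items()):
        view.setdefault(year, []).append((place, value))
    return view


def nesting_orders(components):
    """All orders in which the referential components can be nested."""
    return list(permutations(components))


by_place = local_behaviours(data)
by_year = momentary_behaviours(data)
counts = [len(nesting_orders(range(n))) for n in (3, 4, 5)]
```

Both views are built from the same dictionary; only the component chosen as the outer grouping differs.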
We introduce and describe various types of elementary and synoptic
tasks and give many examples. The description is rather extended, and we
shall again make a recommendation for readers who wish to save time but
still get the essence. At the end of the section dealing with elementary
tasks, we summarise what has been said in a subsection named “Recap:
Elementary Tasks”. Analogously, there is a summary of the discussion of
synoptic tasks, named “Recap: Synoptic Tasks”. Readers may proceed
from the abstract of the chapter directly to the first recap and then to the
second. The formal notation in the recaps may be ignored, since it encodes
symbolically what has been said verbally. If unfamiliar terms are encountered, they may be looked up in Appendix I.
After the recaps, we recommend reading the introduction to
connection discovery tasks (Sect. 3.5), which refer to relations between
behaviours such as correlations, dependencies, and structural links between components of a complex behaviour. The section “Other approaches” is intended for those who are interested in knowing how our
approach compares with others.

1.3.3 Tools

Chapter 4 systemises and describes the tools that may be used for exploratory data analysis. We divide the tools into five broad categories: visualisation, display manipulation, data manipulation, querying, and computation.
We discuss the tools on a conceptual level, as “pure” ideas distilled from
any specifics of the implementation, rather than describe any particular
software systems or prototypes.
One of our major messages is that the main instrument of EDA is the
brain of a human explorer, and that all other tools are subsidiary. Among
these subsidiary tools, the most important role belongs to visualisation, which
provides the necessary material for the explorer’s observation and thinking. The outcomes of all other tools need to be visualised in order to be
utilised by the explorer.
In considering visualisation tools, we formulate the general concepts
and principles of data visualisation. Our treatment is based mostly upon
the previous research and systemising work done in this area by other researchers, above all Jacques Bertin. We begin with a very brief overview
of that work. For those who still find this overview too long, we suggest
that they skip it and go immediately to our synopsis of the basic principles
of visualisation. If any unknown terms are encountered, readers may, as
before, consult Appendix I.
After the overview of the general principles of visualisation, we consider several examples, such as the visualisation of the movement of white
storks flying from Europe to Africa for a winter vacation (Fig. 1.5).
In the next section, we discuss display manipulation – various interactive operations that modify the encoding of data items in visual elements
of a display and thereby change the appearance of the display. We are interested in operations that can facilitate the analysis and help in
grasping general patterns or essential distinctions, rather than just “beautifying” the picture (Fig. 1.6).

Fig. 1.5. A visualisation of the movement of white storks

Fig. 1.6. An example of a display manipulation technique: focusing

Data manipulation basically means derivation of new data from existing
data for more convenient or more comprehensive analysis. One of the
classes of data manipulation, attribute transformation, involves deriving
new attributes on the basis of existing attributes. For example, from values
of a time-referenced numeric attribute, it is possible to compute absolute
and relative amounts of change with respect to previous moments in time
or selected moments (Fig. 1.7).
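As a minimal sketch of such an attribute transformation, the following Python functions derive absolute and relative changes from a time-ordered list of values, either with respect to the previous moment or with respect to a selected moment. The function names and the example series are hypothetical; a relative change is reported as `None` when the base value is zero, which is a design choice made here and not something prescribed by the text.

```python
def change_from_previous(values):
    """Absolute and relative change of each value with respect to the previous one."""
    result = []
    for previous, current in zip(values, values[1:]):
        absolute = current - previous
        relative = absolute / previous if previous != 0 else None
        result.append((absolute, relative))
    return result


def change_from_selected(values, base_index):
    """Absolute and relative change of each value with respect to a selected moment."""
    base = values[base_index]
    return [
        (value - base, (value - base) / base if base != 0 else None)
        for value in values
    ]


# Hypothetical values of a numeric attribute at four consecutive moments.
series = [100.0, 110.0, 99.0, 132.0]
step_changes = change_from_previous(series)
changes_since_start = change_from_selected(series, 0)
```

Each derived pair can then be treated as two new attributes attached to the same time references as the original values.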
Besides new attributes, it is also possible to derive new references. We
pay much attention to data aggregation, where multiple original references
are substituted by groups considered as wholes. This approach allows an
explorer to handle very large amounts of data. The techniques for data aggregation and for analysis on the basis of aggregation are quite numerous
and diverse; here we give just a few example pictures (Fig. 1.8).
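The idea of substituting groups for individual references can be sketched in Python as follows. The records, the region assignment, and the use of the mean as the summary are invented for illustration; they are assumptions and represent only one of many possible aggregation schemes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (place, year, value).
records = [
    ("P1", 1991, 4.0), ("P1", 1992, 6.0),
    ("P2", 1991, 7.0), ("P2", 1992, 9.0),
    ("P3", 1991, 1.0),
]

# Hypothetical assignment of places to larger regions.
regions = {"P1": "North", "P2": "North", "P3": "South"}


def aggregate(records, group_of, summarise=mean):
    """Replace individual references by groups and summarise the values per group."""
    groups = defaultdict(list)
    for place, year, value in records:
        groups[group_of(place, year)].append(value)
    return {group: summarise(values) for group, values in groups.items()}


by_year = aggregate(records, lambda place, year: year)
by_region = aggregate(records, lambda place, year: regions[place])
```

Passing the grouping rule as a function lets the same routine aggregate over time, over space, or over both, and another summary such as `max` or `len` can be supplied in place of the mean.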
