Quantitative Methods and Applications in GIS - Chapter 7

7 Principal Components, Factor, and Cluster Analyses, and Application in Social Area Analysis

This chapter discusses three important multivariate statistical analysis methods:
principal components analysis (PCA), factor analysis (FA), and cluster analysis
(CA). PCA and FA are often used together for data reduction by structuring many
variables into a limited number of components (factors). The techniques are particularly
useful for eliminating variable collinearity and uncovering latent variables.
Applications of the methods are widely seen in socioeconomic studies (also see case
study 8 in Section 8.4). While the PCA and FA group variables, the CA classifies
many observations into categories according to similarity among their attributes.
In other words, given a dataset as a table, the PCA and FA reduce the number of
columns and the CA reduces the number of rows.
Social area analysis is used to illustrate the techniques, as it employs all three
methods. The interpretation of social area analysis results also leads to a review and
comparison of three classic models on urban structure, namely, the concentric zone
model, the sector model, and the multinuclei model. The analysis demonstrates how
analytical statistical methods synthesize descriptive models into one framework.
Beijing, the capital city of China, on the verge of forming its social areas after
decades under a socialist regime, is chosen as the study area for a case study. Usage
of GIS in this case study is limited to mapping for spatial patterns.
Section 7.1 discusses principal components and factor analysis. Section 7.2
explains cluster analysis. Section 7.3 reviews social area analysis. A case study on
the social space in Beijing is presented in Section 7.4 to provide a new perspective
on the fast-changing urban structure in China. The chapter concludes with a
discussion and brief summary in Section 7.5.

7.1 PRINCIPAL COMPONENTS AND FACTOR ANALYSIS

Principal components and factor analysis are often used together for data reduction.
Benefits of this approach include uncovering latent variables for easy interpretation
and removing multicollinearity for subsequent regression analysis. In many socioeconomic
applications, variables extracted from census data are often correlated
with each other, and thus contain duplicated information to some extent. Principal
components and factor analysis use fewer factors to represent the original variables,
and thus simplify the structure for analysis. Resulting component or factor scores
are uncorrelated with each other (if not rotated or orthogonally rotated), and thus can
be used as explanatory variables in regression analysis.

2795_C007.fm Page 127 Friday, February 3, 2006 12:14 PM
© 2006 by Taylor & Francis Group, LLC

Despite the commonalities, principal components and factor analysis are “both
conceptually and mathematically very different” (Bailey and Gatrell, 1995, p. 225).
Principal components analysis uses the same number of variables (components) to
simply transform the original data, and thus is a mathematical transformation (strictly
speaking, not a statistical operation). Factor analysis uses fewer variables (factors)
to capture most of the variation among the original variables (with error terms), and
thus is a statistical analysis process. Principal components analysis attempts to explain the
variance of observed variables, whereas factor analysis intends to explain their
intercorrelations (Hamilton, 1992, p. 252). In many applications (as in ours), the
two methods are used together. In SAS, principal components analysis is offered as
an option under the procedure for factor analysis.

7.1.1 PRINCIPAL COMPONENTS FACTOR MODEL

In formula, principal components analysis (PCA) transforms original data on $K$
observed variables $Z_k$ to data on $K$ principal components $F_k$ that are independent
from (uncorrelated with) each other:

$$Z_k = l_{k1}F_1 + l_{k2}F_2 + \cdots + l_{kj}F_j + \cdots + l_{kK}F_K \tag{7.1}$$

Retaining only the $J$ largest components ($J < K$), we have

$$Z_k = l_{k1}F_1 + l_{k2}F_2 + \cdots + l_{kJ}F_J + v_k \tag{7.2}$$

where the discarded components are represented by the residual term $v_k$, such as

$$v_k = l_{k,J+1}F_{J+1} + l_{k,J+2}F_{J+2} + \cdots + l_{kK}F_K \tag{7.3}$$

Equations 7.2 and 7.3 represent a model termed principal components factor
analysis (PCFA). The PCFA retains the largest components to capture most of the
variance while discarding minor components with small variance. The PCFA is the
method used in social area analysis (Cadwallader, 1996, p. 137) and is simply
referred to as factor analysis in the remainder of this chapter.

In a true factor analysis (FA), the residual (error) term, denoted as $u_k$ to
distinguish it from $v_k$ in a PCFA, is unique to each variable $Z_k$:

$$Z_k = l_{k1}F_1 + l_{k2}F_2 + \cdots + l_{kJ}F_J + u_k$$

The $u_k$ are termed unique factors (in contrast to common factors $F_j$). In the PCFA,
the residual $v_k$ is a linear combination of the discarded components
($F_{J+1}, \ldots, F_K$) and thus cannot be uncorrelated like the $u_k$ in a true FA
(Hamilton, 1992, p. 252).


7.1.2 FACTOR LOADINGS, FACTOR SCORES, AND EIGENVALUES

For convenience, the original data of observed variables $Z_k$ are first standardized¹
prior to the PCA and FA analysis, and the initial values for components (factors)
are also standardized. When both $Z_k$ and $F_j$ are standardized, the $l_{kj}$ in
Equations 7.1 and 7.2 are standardized coefficients in the regression of variables
$Z_k$ on components (factors) $F_j$, also termed factor loadings. For example, $l_{k1}$
is the loading of variable $Z_k$ on standardized component $F_1$. Factor loading
reflects the strength of relations between variables and components.

Conversely, the components $F_j$ can be reexpressed as a linear combination of
the original variables $Z_k$:

$$F_j = a_{1j}Z_1 + a_{2j}Z_2 + \cdots + a_{Kj}Z_K \tag{7.4}$$

Estimates of these components (factors) are termed factor scores. Estimates of
$a_{kj}$ are factor score coefficients, i.e., coefficients in the regression of factors
on variables.

The components $F_j$ are constructed to be uncorrelated with each other and are
ordered such that the first component $F_1$ has the largest sample variance
($\lambda_1$), $F_2$ the second largest, and so on. The variances $\lambda_j$
corresponding to various components are termed eigenvalues, and
$\lambda_1 > \lambda_2 > \cdots$.

Since standardized variables have variances of 1, the total variance of all
variables also equals the number of variables, such as

$$\lambda_1 + \lambda_2 + \cdots + \lambda_K = K \tag{7.5}$$

Therefore, the proportion of total variance explained by the $j$th component is
$\lambda_j / K$.

Eigenvalues provide a basis for judging which components (factors) are important
and which are not, and thus deciding how many components to retain. One may
also follow a rule of thumb that only eigenvalues greater than 1 are important (Griffith
and Amrhein, 1997, p. 169). Since the variance of each standardized variable is 1,
a component with $\lambda < 1$ accounts for less than an original variable's variation, and
thus does not serve the purpose of data reduction.

The eigenvalue-1 rule is arbitrary. A scree graph plots eigenvalues against
component (factor) number and provides more useful guidance (Hamilton, 1992,
p. 258). For example, Figure 7.1 shows the scree graph of eigenvalues in a case of
14 components (using the result from case study 7 in Section 7.4). The graph levels
off after component 4, indicating that components 5 to 14 account for relatively
little additional variance. Therefore, four components may be retained as principal
components.

Outputs from statistical analysis software such as SAS include important information,
such as factor loadings, eigenvalues, and proportions (of total variance).
Factor scores can be saved in a predefined external file. The factor analysis procedure
in SAS also outputs a correlation matrix between the observed variables for analysts
to examine their relations.
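As a quick check of Equation 7.5 and the eigenvalue-1 rule, the following Python sketch recomputes the variance proportions from the eigenvalues reported for the Beijing case study (Table 7.3 in Section 7.4). It is an illustration only and not part of the SAS workflow used in this chapter; the variable names are chosen here for convenience.

```python
# Eigenvalues copied from Table 7.3 (14 standardized variables).
eigenvalues = [4.9231, 2.1595, 1.4799, 1.2904, 0.8823, 0.8286, 0.6929,
               0.5903, 0.3996, 0.2742, 0.1681, 0.1472, 0.1033, 0.0608]

K = len(eigenvalues)  # number of variables (and of components)

# Equation 7.5: the eigenvalues of standardized data sum to K
# (up to rounding in the published table).
total = sum(eigenvalues)

# Proportion of total variance explained by component j: lambda_j / K.
proportions = [lam / K for lam in eigenvalues]

# Cumulative proportion of variance explained.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

# Eigenvalue-1 rule: retain the components whose eigenvalue exceeds 1.
retained = [j + 1 for j, lam in enumerate(eigenvalues) if lam > 1.0]

print(f"sum of eigenvalues = {total:.4f} (K = {K})")
for j in retained:
    print(f"component {j}: proportion = {proportions[j - 1]:.4f}, "
          f"cumulative = {cumulative[j - 1]:.4f}")
```

Under the eigenvalue-1 rule the sketch retains the first four components, in line with the reading of the scree graph described above.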



7.1.3 ROTATION

Initial results from PCFA are often hard to interpret as variables load across factors.
While fitting the data equally well, rotation generates simpler structure and more
interpretable factors by maximizing the loading (positive or negative) of each variable
on one factor and minimizing the loadings on the others. As a result, we can
detect which factor (latent variable) captures the information contained in what
variables (observed), and subsequently label the factors adequately.

Orthogonal rotation generates independent (uncorrelated) factors, an important
property for many applications. A widely used orthogonal rotation method is Varimax
rotation, which maximizes the variance of the squared loadings for each factor, and
thus polarizes loadings (either high or low on factors). Varimax rotation is often the
rotation technique used in social area analysis. Oblique rotation (e.g., promax rotation)
generates even greater polarization, but allows correlation between factors. In SAS,
an option is provided to specify which rotation to use.

As a summary, Figure 7.2 illustrates the process of PCFA:
1. The original dataset of $K$ observed variables with $n$ records is first
standardized to a dataset of $Z$ scores with the same number of variables
and records.
2. PCA then uses $K$ uncorrelated components to explain all the variance of
the $K$ variables.
3. PCFA keeps only $J$ ($J < K$) principal components to capture most of
the variance.
4. A rotation method is used to load each variable strongly on one factor
(and near zero on the others) for easier interpretation.

The SAS procedure for factor analysis (FA) is FACTOR, which also reports
the principal components analysis (PCA) results preceding those of FA. The
following sample SAS statements implement the factor analysis that uses four
factors to capture the structure of 14 variables, x1 through x14, and adopts the
Varimax rotation technique:


FIGURE 7.1 Scree graph for principal components analysis (horizontal axis: component, 1 to 14; vertical axis: variance proportion, 0 to 0.4).


proc factor out=FACTSCORE (replace=yes)
     nfact=4 rotate=varimax;   /* keep four factors; Varimax rotation */
  var x1-x14;                  /* the 14 observed variables */
run;
The SAS data set FACTSCORE has the factor scores, which can be saved to an
external file. Note that a SAS program is not case sensitive.
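
For readers who wish to trace steps 1 to 3 of Figure 7.2 outside SAS, the sketch below works through the simplest possible case of two variables in Python. It is a minimal illustration under stated assumptions: the data values are hypothetical, the function names are illustrative, Z scores use the population standard deviation, and the closed-form eigenvalues (1 + r and 1 − r) hold only for a 2 × 2 correlation matrix with 0 < r < 1. A statistical package is needed for the general case.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Convert raw values to Z scores (mean 0, population SD 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def corr(a, b):
    """Correlation of two series that are already standardized."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

# Hypothetical raw data: two variables observed on five records.
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]

# Step 1: standardize the original variables.
z1, z2 = standardize(x1), standardize(x2)
r = corr(z1, z2)  # this sketch assumes 0 < r < 1

# Step 2: PCA. For the correlation matrix [[1, r], [r, 1]] the eigenvalues
# are 1 + r and 1 - r, with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
lam1, lam2 = 1 + r, 1 - r
F1 = [(a + b) / math.sqrt(2 * lam1) for a, b in zip(z1, z2)]  # standardized scores
F2 = [(a - b) / math.sqrt(2 * lam2) for a, b in zip(z1, z2)]

# Loadings l_kj are the correlations between variable Z_k and component F_j.
l11, l12 = corr(z1, F1), corr(z1, F2)
l21, l22 = corr(z2, F1), corr(z2, F2)

# Equation 7.1 with K = 2: both components together reproduce Z_1 exactly.
z1_rebuilt = [l11 * f1 + l12 * f2 for f1, f2 in zip(F1, F2)]

# Step 3: PCFA with J = 1 keeps F1 only; the residual v_1 is the
# discarded component term l12 * F2 (Equation 7.3).
v1 = [l12 * f2 for f2 in F2]

print(f"r = {r:.3f}, eigenvalues = {lam1:.3f}, {lam2:.3f}")
print(f"share of variance on F1 = {lam1 / 2:.3f}")
print(f"loadings on F1: {l11:.4f}, {l21:.4f}")
```

The two component scores are uncorrelated by construction, and the eigenvalues sum to the number of variables, as in Equation 7.5.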
7.2 CLUSTER ANALYSIS
Cluster analysis (CA) groups observations according to similarity among their
attributes. As a result, the observations within a cluster are more similar than
observations between clusters, as measured by the clustering criterion. Note the
difference between CA and another similar multivariate analysis technique —
discriminant function analysis (DFA). Both group observations into categories based
on the characteristic variables. Categories are unknown in CA but known in DFA.
See Appendix 7A for further discussion on DFA.
Geographers have a long-standing interest in cluster analysis (CA) that has been
developed in applications such as regionalization and city classification. In the case
of social area analysis, cluster analysis is used to further analyze the results from
factor analysis (i.e., factor scores of various components across space) and group
areas into different types of social areas.
A key element in deciding assignment of observations to clusters is distance,
measured in various ways. The most commonly used distance measure is
Euclidean distance:

$$d_{ij} = \left( \sum_{k=1}^{K} (x_{ik} - x_{jk})^2 \right)^{1/2} \tag{7.6}$$

where $x_{ik}$ and $x_{jk}$ are the $k$th variable value of the $K$-dimensional observations for
individuals $i$ and $j$. When $K = 2$, Euclidean distance is simply the straight-line
distance between observations $i$ and $j$ in a two-dimensional space. Like the various
distance measures discussed in Chapter 2, distance measures here also include
Manhattan (or city block) distance and others (e.g., Minkowski distance, Canberra
distance) (Everitt et al., 2001, p. 40).

FIGURE 7.2 Data processing steps in principal components factor analysis: (1) standardize the original data set ($n$ records by $K$ variables) to Z scores ($n$ records by $K$ variables); (2) PCA yields component loadings ($K$ variables by $K$ components); (3) PCFA yields factor loadings ($K$ variables by $J$ factors, $J < K$); (4) rotation maximizes loadings on one factor ($K$ variables by $J$ factors).

The most widely used clustering methods are the agglomerative hierarchical
methods (AHMs). The methods produce a series of groupings: the first consists of
single-member clusters, and the last consists of a single cluster of all members. The
results of these algorithms can be summarized with a dendrogram, a tree diagram
showing the history of the sequential grouping process. See Figure 7.3 for the example
illustrated below. In the diagram, the clusters are nested and each cluster is a member
of a larger, higher-level cluster.

FIGURE 7.3 Dendrogram for the clustering analysis example (horizontal axis: data points 1 to 4; vertical axis: distance, 1.0 to 5.0; clusters C1, C2, and C3).

For illustration, an example is used to explain a simple AHM, the single-linkage
method or the nearest-neighbor method. Consider a dataset of four observations with
the following distance matrix:

$$D_1 = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix} \begin{bmatrix} 0 & & & \\ 3 & 0 & & \\ 6 & 5 & 0 & \\ 9 & 7 & 4 & 0 \end{bmatrix}$$

The smallest nonzero entry in the above matrix $D_1$ is (2 → 1) = 3, and therefore
individuals 1 and 2 are grouped together to form the first cluster C1. Distances
between this cluster and the other two individuals are defined according to the
nearest-neighbor criterion:

$$d_{(12)3} = \min\{d_{13}, d_{23}\} = d_{23} = 5$$

$$d_{(12)4} = \min\{d_{14}, d_{24}\} = d_{24} = 7$$

A new matrix is now obtained with cells representing distances between cluster
C1 and individuals 3 and 4, or between individuals 3 and 4:

$$D_2 = \begin{matrix} (12) \\ 3 \\ 4 \end{matrix} \begin{bmatrix} 0 & & \\ 5 & 0 & \\ 7 & 4 & 0 \end{bmatrix}$$

The smallest nonzero entry in $D_2$ is (4 → 3) = 4, and thus individuals 3 and 4
are grouped to form a cluster C2. Finally, clusters C1 and C2 are grouped together,
with distance equal to 5, to form one cluster C3 containing all four members. The
process is summarized in a dendrogram in Figure 7.3, where the height represents
the distance at which each fusion is made.

Similarly, the complete linkage (farthest-neighbor) method uses the maximum
distance between pairs of objects (one in one cluster and one in the other); the
average linkage method uses the average distance between pairs of objects; and the
centroid method uses squared Euclidean distance between individuals and cluster
means (centroids).

Another commonly used AHM is Ward's method. The objective at each stage is
to minimize the increase in the total within-cluster error sum of squares given by

$$E = \sum_{c=1}^{C} E_c$$

where

$$E_c = \sum_{i=1}^{n_c} \sum_{k=1}^{K} (x_{ck,i} - \bar{x}_{ck})^2$$

in which $x_{ck,i}$ is the value for the $k$th variable for the $i$th observation in the $c$th cluster,
and $\bar{x}_{ck}$ is the mean of the $k$th variable in the $c$th cluster.

Each clustering method has its advantages and disadvantages. A desirable clustering
should produce clusters of similar size, densely located, compact in shape,
and internally homogeneous (Griffith and Amrhein, 1997, p. 217). The single-linkage
method tends to produce unbalanced and straggly clusters and should be avoided in
most cases. If outliers are a major concern, the centroid method should be used. If
compactness of clusters is a primary objective, the complete linkage method should
be used. Ward's method tends to find same-size and spherical clusters and is recommended
if no single overriding property is desired (Griffith and Amrhein, 1997,
p. 220). The case study in this chapter also uses Ward's method.
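
The four-observation example above can be replayed with a short Python sketch. It is a bare-bones illustration rather than the algorithm of any particular package: the function names are illustrative, ties are broken by order of appearance, and only the single and complete linkage rules described above are covered. The distances are those of the example's distance matrix.

```python
from itertools import combinations

# Pairwise distances among observations 1 to 4 from the worked example.
D1 = {
    (1, 2): 3, (1, 3): 6, (2, 3): 5,
    (1, 4): 9, (2, 4): 7, (3, 4): 4,
}

def dist(i, j):
    """Look up the distance between two observations."""
    return D1[(min(i, j), max(i, j))]

def agglomerate(points, linkage=min):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    linkage=min gives single linkage (nearest neighbor);
    linkage=max gives complete linkage (farthest neighbor).
    """
    clusters = [(p,) for p in points]  # start with single-member clusters
    history = []

    def between(a, b):
        # Distance between two clusters under the chosen linkage rule.
        return linkage(dist(i, j) for i in a for j in b)

    while len(clusters) > 1:
        # Fuse the pair of clusters that are closest to each other.
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        history.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return history

single = agglomerate([1, 2, 3, 4], linkage=min)
complete = agglomerate([1, 2, 3, 4], linkage=max)

for a, b, d in single:
    print(f"single linkage: merge {a} and {b} at distance {d}")
```

With single linkage the fusions occur at distances 3, 4, and 5, matching the example. Switching to complete linkage leaves the first two fusions unchanged but raises the final fusion distance to 9, the largest distance between a member of C1 and a member of C2.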
The choice for the number of clusters depends on objectives of specific applications.
Similar to the selection of factors based on the eigenvalues in factor analysis,
one may also use a scree plot to assist in the decision. In the case of Ward's method,
a graph of $R^2$ vs. the number of clusters helps choose the number beyond which
additional clusters add little homogeneity.
In SAS, the procedure CLUSTER implements the cluster analysis and the
procedure TREE generates the dendrogram. The following sample SAS statements
use Ward's method for clustering and cut off the dendrogram at nine clusters:
proc cluster method=ward outtree=tree;
  id subdist_id;                 /* variable for labeling ids */
  var factor1-factor4;           /* variables used */
run;
proc tree out=bjcluster ncl=9;   /* cut the dendrogram at nine clusters */
  id subdist_id;
run;
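
To make Ward's criterion concrete, the following Python sketch evaluates the error sum of squares and the increase caused by each candidate merger. It illustrates the criterion only and does not reproduce PROC CLUSTER: the three clusters and their two-dimensional scores are hypothetical, and the function names are illustrative. The $R^2$ computed here is one minus the ratio of the within-cluster to the total sum of squares, which is our reading of the quantity plotted against the number of clusters.

```python
from itertools import combinations

def centroid(cluster):
    """Mean of each variable over the observations in a cluster."""
    k = len(cluster[0])
    return [sum(obs[d] for obs in cluster) / len(cluster) for d in range(k)]

def ess(cluster):
    """E_c: squared deviations from the cluster mean, summed over variables."""
    c = centroid(cluster)
    return sum((obs[d] - c[d]) ** 2 for obs in cluster for d in range(len(c)))

def total_ess(clusters):
    """E: sum of E_c over all clusters."""
    return sum(ess(c) for c in clusters)

def ward_increase(a, b):
    """Increase in E if clusters a and b were merged."""
    return ess(a + b) - ess(a) - ess(b)

# Hypothetical clusters of observations described by two factor scores.
clusters = {
    "A": [(0.0, 0.0), (2.0, 0.0)],
    "B": [(1.0, 3.0)],
    "C": [(8.0, 8.0), (8.0, 10.0)],
}

# Ward's method merges the pair with the smallest increase in E.
increases = {
    (p, q): ward_increase(clusters[p], clusters[q])
    for p, q in combinations(clusters, 2)
}
best = min(increases, key=increases.get)

# R-squared: share of the total sum of squares explained by the clustering.
all_points = [obs for members in clusters.values() for obs in members]
r2_before = 1 - total_ess(clusters.values()) / ess(all_points)
r2_after = 1 - (total_ess(clusters.values()) + increases[best]) / ess(all_points)

print(f"merge {best}: E rises by {increases[best]:.2f}")
print(f"R-squared before = {r2_before:.4f}, after = {r2_after:.4f}")
```

Each merger can only lower $R^2$, so plotting it against the number of clusters shows where further division stops paying off.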
7.3 SOCIAL AREA ANALYSIS
The social area analysis was developed by Shevky and Williams (1949) in a study
of Los Angeles and was later elaborated on by Shevky and Bell (1955) in a study
of San Francisco. The basic thesis is that the changing social differentiation of society
leads to residential differentiation within cities. The studies classified census tracts
into types of social areas based on three basic constructs: economic status (social
rank), family status (urbanization), and segregation (ethnic status). Originally the
three constructs were measured by six variables: economic status was captured by
occupation and education; family status by fertility, women's labor participation, and
single-family houses; and ethnic status by percentage of minorities (Cadwallader,
1996, p. 135). In a factor analysis, an idealized factor loadings matrix probably looks
like Table 7.1. Subsequent studies using a large number and variety of measures
generally confirmed the validity of the three constructs (Berry, 1972, p. 285;
Hartshorn, 1992, p. 235).

Geographers made an important advancement in social area analysis by analyzing
the spatial patterns associated with these dimensions (e.g., Rees, 1970; Knox,
1987). The socioeconomic status factor tends to exhibit a sector pattern: tracts with
high values for variables, such as income and education, form one or more sectors,
and low-status tracts form other sectors. The family status factor tends to form
concentric zones: inner zones are dominated by tracts with small families with
either very young or very old household heads, and tracts in outer zones are mostly
occupied by large families with middle-age household heads. The ethnic status
factor tends to form clusters, each of which is dominated by a particular ethnic
group. Superimposing the three constructs generates a complex urban mosaic,
which can be grouped into various social areas by cluster analysis. See Figure 7.4.
By studying the spatial patterns from social area analysis, three classic models for
urban structure — Burgess’s (1925) concentric zone model, Hoyt’s (1939) sector
model, and the Ullman–Harris (Harris and Ullman, 1945) multinuclei model —
are synthesized into one framework. In other words, each of the three models reflects
one specific dimension of urban structure and is complementary to the others.
There are at least three criticisms of the factorial ecological approach to understanding
residential differentiation in cities (Cadwallader, 1996, p. 151). First, the
analysis results are sensitive to research design, such as variables selected and
measured, analysis units, and factor analysis methods. Second, it is still a descriptive
form of analysis and fails to explain the underlying process that causes the patterns.
Third, the social areas identified by the studies are merely homogeneous, but not
necessarily functional regions or cohesive communities. Despite the criticisms, social
area analysis helps us understand residential differentiation within cities, and serves
as an important instrument for studying intraurban social spatial structure. Applications
of social area analysis are particularly abundant for cities in developed countries,
especially in North America (see a review by Davies and Herbert, 1993), and also exist for
some cities in developing countries (e.g., Berry and Rees, 1969; Abu-Lughod, 1969).

TABLE 7.1
Idealized Factor Loadings in Social Area Analysis

Variable                      Economic Status   Family Status   Ethnic Status
Occupation                    I                 O               O
Education                     I                 O               O
Fertility                     O                 I               O
Female labor participation    O                 I               O
Single-family house           O                 I               O
Minorities                    O                 O               I

Note: I denotes a number close to 1 or –1; O denotes a number close to 0.

7.4 CASE STUDY 7: SOCIAL AREA ANALYSIS IN BEIJING
This case study is developed on the basis of a research project reported in Gu et al.
(2005). Detailed research design and interpretation of the results can be found in
the original paper. This section shows the procedures to implement the study, with
emphasis on illustrating the three statistical methods. In addition, the study illustrates
how to test the spatial structure of factors by regression models with dummy variables.
Since the 1978 economic reforms in China, and particularly the 1984 urban
reforms, including the urban land use reform and the housing reform, urban
landscape in China has changed significantly. Many large cities have been on the
transition from self-contained work unit neighborhood systems to more differentiated
urban space. As the capital city of China, Beijing offers an interesting case to look
into this important change in urban structure in China.
The study area is the contiguous urbanized area of Beijing, with 107 subdistricts
(jiedao), excluding the 2 remote suburban districts (Mentougou and Fangshan) and
23 subdistricts on the periphery of inner suburbs (which are also rural and lack complete
data). See Figure 7.5. The study area had a total population of 5.9 million, and the
subdistricts had an average population of 55,200 in 1998. The subdistrict has been the
basic administrative unit in Beijing for decades, and also the lowest geographic level
reported in government statistical reports accessible by the public. Therefore, it was
the analysis unit used in this research. Because of the lack of socioeconomic data
in the national census of population, most of the data used in this research were
extracted from the 1998 statistical yearbooks of individual districts in Beijing. Some
data, such as personal income and individual living space, were obtained through a
survey of households conducted in 1998.
FIGURE 7.4 Conceptual model for urban mosaic: (a) socioeconomic status (SES), with high-SES and low-SES areas; (b) family status, with small families and large families; (c) ethnic status, with ethnic enclaves; (d) urban mosaic, the overlay of (a) to (c).
The following datasets are provided:
1. Shapefile bjsa contains 107 urban subdistricts (jiedao) in Beijing.
2. Text file bjattr.csv is the attribute dataset.
In the shapefile bjsa, the field sector identifies four sectors (= 1 for NE, 2
for SE, 3 for SW, and 4 for NW) and the field ring identifies four zones (= 1 for
the innermost zone, 2 for the next inner zone, and so on). Sectors and zones are
needed for testing spatial structures of social space. Both the shapefile bjsa and
the attribute file bjattr.csv contain a field ref_id (subdistrict IDs) as the
common key for joining them together. The text file bjattr.csv has 14 socioeconomic
variables (X1 to X14) for social area analysis. Variable names and
basic statistics for these 14 variables are summarized in Table 7.2.
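
The role of ref_id as the common key can be pictured with a small Python sketch. In this case study the join is performed in ArcGIS, so the code is only an illustration of the logic: the rows, their values, and the function names are hypothetical, and only the field names ref_id, sector, ring, X1, and X5 come from the datasets described above.

```python
import csv
import io

# Hypothetical excerpt of the shapefile's attribute table.
shape_rows_csv = """ref_id,sector,ring
101,1,1
102,2,1
103,4,3
"""

# Hypothetical excerpt of the attribute file (two of the 14 variables).
attr_rows_csv = """ref_id,X1,X5
101,25000.0,2.8
103,1200.5,3.4
104,900.0,3.1
"""

def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by field name."""
    return list(csv.DictReader(io.StringIO(text)))

def join_on_key(left, right, key="ref_id"):
    """Keep every left row and add the right-hand fields where the key matches."""
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        merged = dict(row)
        merged.update(lookup.get(row[key], {}))
        joined.append(merged)
    return joined

joined = join_on_key(read_rows(shape_rows_csv), read_rows(attr_rows_csv))
for row in joined:
    print(row)
```

Subdistricts without a matching attribute record keep only their own fields, which is a convenient way to spot missing or mistyped IDs before mapping.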

FIGURE 7.5 Study area for Beijing's social area analysis (map showing district boundaries and the study area; labeled districts: Shijingshan, Haidian, Fengtai, Xicheng, Xuanwu, Chongwen, Dongcheng, and Chaoyang).

1. Executing principal components factor analysis in SAS: See the sample
SAS program FA_Clust.sas in Appendix 7B (also provided on the CD).
The first part of the program uses the SAS procedure PROC FACTOR to
execute the principal components factor analysis (PCFA). The program
reads the attribute dataset bjattr.csv, and uses four factors to capture
most of the information in the original 14 variables (x1, x2, …, x14). The
output factor scores are saved in a text file factscore.csv, containing
the original 14 variables and the 4 factor scores.
The SAS procedure FACTOR reports the result of principal components
analysis (PCA) preceding that of factor analysis (FA). The number of
factors (four, in our case) is determined by analyzing the eigenvalues from
PCA (see Table 7.3).² In deciding the number of components (factors) to
include in the next step of FA, one has to make a trade-off between the
total variance explained (higher by including more components) and
interpretability of factors (better with fewer components). Following the
eigenvalue-1 rule (see Section 7.1.2), we retain four factors. The four components
explain over 70% of the total variance. The choice of retaining four
components is also supported by analyzing the scree graph in Figure 7.1.

TABLE 7.2
Basic Statistics for Socioeconomic Variables in Beijing (n = 107)

Index  Variable                                 Mean        Std. Dev.    Minimum    Maximum
X1     Population density (persons/km²) [a]     14,797.09   13,692.93    245.86     56,378.00
X2     Natural growth rate (%)                  –1.11       2.79         –16.41     8.58
X3     Sex ratio (M/F)                          1.03        0.08         0.72       1.32
X4     Labor participation ratio (%) [b]        0.60        0.06         0.47       0.73
X5     Household size (persons/household)       2.98        0.53         2.02       6.55
X6     Dependency ratio [c]                     1.53        0.22         1.34       2.14
X7     Income (yuan/person)                     29,446.49   127,223.03   7505.00    984,566.00
X8     Public service density (no./km²) [d]     8.35        8.60         0.05       29.38
X9     Industry density (no./km²)               1.66        1.81         0.00       10.71
X10    Office/retail density (no./km²)          14.90       15.94        0.26       87.86
X11    Ethnic enclave (0, 1) [e]                0.10        0.31         0.00       1.00
X12    Floating population ratio (%) [a]        6.81        7.55         0.00       65.59
X13    Living space (m²/person)                 8.89        1.71         7.53       15.10
X14    Housing price (yuan/m²)                  6686.54     3361.22      1400.00    18,000.00

[a] The household registration (hukou) system in China classifies residents into two categories: permanent
and temporal residents. Temporal residents were those migrants from rural areas who had not obtained
the permanent residence status in a city, also called floating population. Population density here is
measured as number of permanent residents per square kilometer.
[b] Labor participation ratio is percentage of persons in the labor force out of the total population at the
working ages, i.e., 18 to 60 years old for males and 18 to 55 years old for females.
[c] Dependency ratio is the number of dependents divided by the number of persons in the labor force.
[d] Public service density is the number of governmental agencies, nonprofit organizations, educational units,
hospitals, and postal and telecommunication units per square kilometer.
[e] Ethnic enclave is a dummy variable that identifies whether a minority (mainly Muslims in Beijing) or
migrant concentrated area was present in a subdistrict.

TABLE 7.3
Eigenvalues from Principal Components Analysis

Component   Eigenvalue   Proportion   Cumulative
1           4.9231       0.3516       0.3516
2           2.1595       0.1542       0.5059
3           1.4799       0.1057       0.6116
4           1.2904       0.0922       0.7038
5           0.8823       0.0630       0.7668
6           0.8286       0.0592       0.8260
7           0.6929       0.0495       0.8755
8           0.5903       0.0422       0.9176
9           0.3996       0.0285       0.9462
10          0.2742       0.0196       0.9658
11          0.1681       0.0120       0.9778
12          0.1472       0.0105       0.9883
13          0.1033       0.0074       0.9957
14          0.0608       0.0043       1.0000

TABLE 7.4
Factor Loadings in Social Area Analysis

Variables                    Land Use    Neighborhood   Socioeconomic   Ethnicity
                             Intensity   Dynamics       Status
Public service density       0.8887      0.0467         0.1808          0.0574
Population density           0.8624      0.0269         0.3518          0.0855
Labor participation ratio    –0.8557     0.2909         0.1711          0.1058
Office/retail density        0.8088      –0.0068        0.3987          0.2552
Housing price                0.7433      –0.0598        0.1786          –0.1815
Dependency ratio             0.7100      0.1622         –0.4873         –0.2780
Household size               0.0410      0.9008         –0.0501         0.0931
Floating population ratio    0.0447      0.8879         0.0238          –0.1441
Living space                 –0.5231     0.6230         –0.0529         0.0275
Income                       0.1010      0.1400         0.7109          –0.1189
Natural growth rate          –0.2550     0.2566         –0.6271         0.1390
Ethnic enclave               0.0030      –0.1039        –0.1263         0.6324
Sex ratio                    –0.2178     0.2316         –0.1592         0.5959
Industry density             0.4379      –0.1433        0.3081          0.5815

The Varimax rotation technique is used to maximize the loading of a
variable on one factor and minimize the loadings on all others. Table 7.4
presents the rotated factor structure (variables are reordered to highlight
the factor loading structure). The four factors are labeled to reflect major
variables loaded to each factor:
a. “Land use intensity” is by far the most important factor, explaining
35.16% of the total variance and capturing mainly six variables: three
density measures (population density, public service density, and office
and retail density), housing price, and two demographic variables (labor
participation ratio and dependency ratio).
b. “Neighborhood dynamics” accounts for 15.42% of the total variance
and includes three variables: floating population ratio, household size,
and living space.
c. “Socioeconomic status” accounts for 10.57% of the total variance and
includes two variables: average annual income per capita and population
natural growth rate.
d. “Ethnicity” accounts for 9.22% of the total variance and includes three
variables: ethnic enclave, sex ratio, and industry density.
2. Executing cluster analysis in SAS: The second part of the program
FA_Clust.sas uses the SAS procedure PROC CLUSTER to execute
the cluster analysis (CA) and produces a complete dendrogram tree of
clustering. The procedure PROC TREE uses the option NCL=5 to define
the number of clusters, based on which the dendrogram tree is cut off.
The program saves the result in a text file cluster5.csv (rename the
field name cluster to cluster5 for clarification). Repeat the cluster
analysis by changing the option to NCL=9, and save the result to
cluster9.csv (rename the field name cluster to cluster9
for clarification).

The study begins with five basic clusters and expands to nine clusters
revealing more detailed spatial patterns. For instance, cluster 2 identified
in the five-cluster scenario is further divided into clusters 2, 4, and 5 in the
nine-cluster scenario. Each cluster represents a social area.
3. Mapping factor patterns in ArcGIS: Open the shapefile bjsa in ArcGIS
and join the text file factscore.csv to it based on the common key
ref_id. Map the field factor1 (factor score for “land use intensity”)
as shown in Figure 7.6a, factor2 (factor score for “neighborhood
dynamics”) as in Figure 7.6b, factor3 (factor score for “socioeconomic
status”) as in Figure 7.6c, and factor4 (factor score for “ethnicity”) as
in Figure 7.6d.
4. Mapping social areas in ArcGIS: Similar to step 3, join both
cluster9.csv and cluster5.csv to the shapefile bjsa in ArcGIS,
and map the social areas as shown in Figure 7.7. The five basic social
areas are shown in different area patterns, and the nine detailed social
areas are identified by their cluster numbers.
To understand the characteristics of each social area, use the Summarize
tool on the merged table in ArcGIS to compute the mean values of
factor scores within each cluster. The results are reported in Table 7.5.
The clusters are labeled by analyzing the factor scores and the locations
relative to the city center.
5. Testing spatial structure by regressions with dummy variables: Regression
models can be constructed to test whether the spatial pattern of a
factor is better characterized as zones or sectors (Cadwallader, 1981).
Based on the circular ring roads, Beijing is divided into four zones, coded
by three dummy variables ($x_2$, $x_3$, and $x_4$). Similarly, three additional
dummy variables ($y_2$, $y_3$, and $y_4$) are used to code the four sectors (NE, SE,
SW, and NW). Table 7.6 shows how the zones and sectors are coded by
the dummy variables.
A linear regression model for testing the zonal structure can be
written as
$$F_i = b_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 \tag{7.7}$$
FIGURE 7.6 Spatial patterns of factor scores: (a) factor 1 scores, (b) factor 2
scores, (c) factor 3 scores, and (d) factor 4 scores.
FIGURE 7.7 Social areas in Beijing. (Map of subdistricts labeled by nine-cluster
number; area patterns distinguish the five basic clusters.)
TABLE 7.5
Characteristics of Social Areas (Clusters)
The last four columns report averages of factor scores.
| Five-cluster index | Nine-cluster index and label | No. of subdistricts | Land use intensity | Neighborhood dynamics | Socioeconomic status | Ethnicity |
|---|---|---|---|---|---|---|
| 1 | 1. Suburban moderate density | 21 | -0.2060 | 0.6730 | -0.6932 | 0.3583 |
| 2 | 2. Inner suburban moderate income | 23 | -0.4921 | -0.5159 | -0.0522 | 0.4143 |
| 2 | 4. Inner city moderate income | 22 | 0.8787 | -0.1912 | 0.5541 | 0.1722 |
| 2 | 5. Outer suburban moderate income | 21 | -0.8928 | -0.8811 | 0.0449 | -0.7247 |
| 3 | 3. Outer suburban manufacturing with high floating population | 6 | -1.4866 | 2.0667 | 0.3611 | 0.1847 |
| 3 | 9. Outer suburban with highest floating population | 1 | 0.1041 | 5.7968 | -0.2505 | -1.8765 |
| 4 | 7. Inner city high income | 2 | 0.7168 | 0.9615 | 5.1510 | -0.8112 |
| 4 | 8. Inner city ethnic enclave | 1 | 1.8731 | -0.0147 | 1.8304 | 4.3598 |
| 5 | 6. Inner city low income | 10 | 2.0570 | 0.0335 | -1.1423 | -0.7591 |
where $F_i$ is the score of a factor ($i$ = 1, 2, 3, and 4), the constant term $b_1$
is the average factor score in zone 1 (when $x_2 = x_3 = x_4 = 0$, also referred
to as the reference zone), and the coefficient $b_2$, $b_3$, or $b_4$ is the difference in
average factor scores between zone 2, 3, or 4, respectively, and zone 1.
Similarly, a regression model for testing the sectoral structure can be
written as
$$F_i = c_1 + c_2 y_2 + c_3 y_3 + c_4 y_4 \tag{7.8}$$
where notations have interpretations similar to those in Equation 7.7.
Export the attribute table of shapefile bjsa (joined with factscore.csv)
to an external file zone_sect.dbf containing the factor scores, and
the fields ring (identifying zones) and sector (identifying sectors). In
the file zone_sect.dbf, use Excel or SAS to create and compute the
dummy variables $x_2$, $x_3$, $x_4$, $y_2$, $y_3$, and $y_4$ according to Table 7.6, and run
the regression models in Equations 7.7 and 7.8. The results are presented in
Table 7.7. A sample SAS program BJreg.sas is provided on the CD
for reference.
TABLE 7.6
Zones and Sectors Coded by Dummy Variables
| Zone index and location | Zone codes | Sector index and location | Sector codes |
|---|---|---|---|
| 1. Inside second ring | $x_2 = x_3 = x_4 = 0$ | 1. NE | $y_2 = y_3 = y_4 = 0$ |
| 2. Between second and third rings | $x_2 = 1$, $x_3 = x_4 = 0$ | 2. SE | $y_2 = 1$, $y_3 = y_4 = 0$ |
| 3. Between third and fourth rings | $x_3 = 1$, $x_2 = x_4 = 0$ | 3. SW | $y_3 = 1$, $y_2 = y_4 = 0$ |
| 4. Outside fourth ring | $x_4 = 1$, $x_2 = x_3 = 0$ | 4. NW | $y_4 = 1$, $y_2 = y_3 = 0$ |
7.5 DISCUSSION AND SUMMARY
$R^2$ in Table 7.7 indicates whether the zonal or sectoral model is a good fit, and an
individual t statistic (in parentheses) indicates whether a coefficient is statistically
significant (i.e., whether a zone or a sector is significantly different from the reference
zone or sector). Clearly, the land use intensity pattern fits the zonal model well, and
the negative coefficients $b_2$, $b_3$, and $b_4$ are all statistically significant and indicate
declining land use intensity from inner to outer zones. The neighborhood dynamics
pattern is better characterized by the sectoral model, and the positive coefficient $c_4$
(statistically significant) confirms high proportions of floating population in the
northwest sector of Beijing. The socioeconomic status factor displays both zonal and
sectoral patterns, but a stronger sectoral structure. The negative coefficients $b_3$ and $b_4$
(statistically significant) in the socioeconomic status model imply that factor scores
decline toward the third and fourth zones, and the positive coefficient $c_3$ (statistically
significant) indicates higher factor scores in the southwest sector, mainly because
of two high-income subdistricts in Xuanwu District. The ethnicity factor does not
conform to either the zonal or sectoral model. Ethnic enclaves scatter citywide and
may be best characterized by a multiple nuclei model.
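The dummy-variable regression in Equation 7.7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the factor scores and ring codes below are hypothetical (not the Beijing data), the helper names `ols` and `zone_dummies` are invented for this sketch, and the book's own BJreg.sas program is not reproduced here. The sketch fits the zonal model by ordinary least squares and computes $R^2$.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def zone_dummies(ring):
    """Code ring 1..4 as (x2, x3, x4); ring 1 is the reference zone."""
    return [1.0 if ring == level else 0.0 for level in (2, 3, 4)]


# Hypothetical factor scores for eight subdistricts, two per ring zone
rings = [1, 1, 2, 2, 3, 3, 4, 4]
scores = [1.2, 1.4, 0.1, 0.3, -0.6, -0.4, -1.0, -0.8]

X = [[1.0] + zone_dummies(r) for r in rings]  # leading 1.0 is the intercept
b1, b2, b3, b4 = ols(X, scores)

# Goodness of fit: R^2 = 1 - SSE / SST
fitted = [sum(c * x for c, x in zip((b1, b2, b3, b4), row)) for row in X]
mean_score = sum(scores) / len(scores)
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))
sst = sum((y - mean_score) ** 2 for y in scores)
r_squared = 1 - sse / sst

print([round(b, 4) for b in (b1, b2, b3, b4)], round(r_squared, 4))
```

With this coding the intercept equals the mean score of zone 1, and each slope equals the mean of its zone minus the mean of zone 1, which is the interpretation given for $b_1$ through $b_4$. The sectoral model in Equation 7.8 follows the same pattern with the sector codes in place of the ring codes.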
Land use intensity is clearly the primary factor forming the concentric social
spatial structure in Beijing. From the inner city (clusters 4, 6, 7, and 8 in Table 7.5)
to inner suburbs (clusters 1 and 2) and to remote suburbs (clusters 3, 5, and 9),
population densities as well as densities of public services, offices, and retail
decline along with land prices. The neighborhood dynamics, mainly the influence of
floating population, is the second factor shaping the formation of social areas.
Migrants are attracted to economic opportunities in the fast-growing Haidian District
(cluster 1) and manufacturing jobs in the Shijingshan District (cluster 3). The effects
of the third factor (socioeconomic status) can be found in the emergence of the
high-income areas in two inner city subdistricts (cluster 7), and the differentiation
between middle-income (cluster 1) and low-income areas (clusters 2, 3, and 5) in
suburbs. The fourth factor of ethnicity does not come into play until the number of
clusters is expanded to nine.
TABLE 7.7
Regressions for Testing Zonal vs. Sectoral Structures (n = 107)
| Coefficient | Land use intensity | Neighborhood dynamics | Socioeconomic status | Ethnicity |
|---|---|---|---|---|
| Zonal model | | | | |
| $b_1$ | 1.2980*** (12.07) | -0.1365 (-0.72) | 0.4861** (2.63) | -0.0992 (-0.51) |
| $b_2$ | -1.2145*** (-7.98) | 0.0512 (0.19) | -0.4089 (-1.57) | 0.1522 (0.56) |
| $b_3$ | -1.8009*** (-11.61) | -0.0223 (-0.08) | -0.8408** (-3.16) | -0.0308 (-0.11) |
| $b_4$ | -2.1810*** (-14.47) | 0.4923 (1.84) | -0.7125** (-2.75) | 0.2596 (0.96) |
| $R^2$ | 0.697 | 0.046 | 0.105 | 0.014 |
| Sectoral model | | | | |
| $c_1$ | 0.1929 (1.14) | -0.3803** (-2.88) | -0.3833** (-2.70) | -0.2206 (-1.32) |
| $c_2$ | -0.1763 (-0.59) | -0.3511 (-1.52) | 0.4990* (2.01) | 0.6029* (2.06) |
| $c_3$ | -0.2553 (-0.86) | 0.0212 (0.09) | 1.6074*** (6.47) | 0.4609 (1.58) |
| $c_4$ | -0.3499 (-1.49) | 1.2184*** (6.65) | 0.1369 (0.69) | 0.1452 (0.63) |
| $R^2$ | 0.022 | 0.406 | 0.313 | 0.051 |
Note: t statistics are in parentheses. *, significant at 0.05; **, significant at 0.01;
and ***, significant at 0.001.
In Western cities, the socioeconomic status construct is a dominant force in
forming a sectoral pattern, along with the family structure construct featuring a zonal
pattern and the ethnicity construct exhibiting a multinuclei pattern. In Beijing, the
factors of socioeconomic status and ethnicity remain effective but move to less
prominent roles, and the family status factor is almost absent. Census
data and corresponding spatial data (e.g., TIGER files in the U.S.) are conveniently
available for almost any city in developed countries, and implementing social area
analysis in these cities is fairly easy. However, the lack of reliable data sources is
often a large obstacle for social area studies in cities in developing countries, and
future studies can certainly benefit from data of better quality, i.e., data with more
socioeconomic, demographic, and housing variables and in smaller geographic units.
APPENDIX 7A: DISCRIMINANT FUNCTION ANALYSIS
Objects that belong to known categories have characteristics, each of which can be
measured quantitatively. The goal of discriminant function analysis (DFA) is to find a
linear function of the characteristic variables and use the function to classify future
observations into the known categories. DFA differs from cluster analysis,
in which the categories are unknown. For example, we know that females and males
have different bone structures. If skeletal remains are found, DFA can be used
to identify the sex of each individual.
Here we use a two-category example to illustrate the concept. Say we have two
types of objects, A and B, measured by p characteristic variables. The first type has
m observations and the second type has n observations. In other words, the observed
data are
$X_{ijA}$ (i = 1, 2, …, m; j = 1, 2, …, p)
$X_{ijB}$ (i = 1, 2, …, n; j = 1, 2, …, p)
The objective is to find a discriminant function R such that
$$R = \sum_{k=1}^{p} c_k X_k - R_0 \tag{A7.1}$$
where $c_k$ (k = 1, 2, …, p) and $R_0$ are constants.
After substituting all m observations $X_{ijA}$ into R, we have m values of R(A).
Similarly, we have n values of R(B). The R(A)’s have a statistical distribution, and
so do the R(B)’s. The goal is to find a function R such that the distributions of
R(A) and R(B) are as far apart from each other as possible. This goal is met by two
conditions:
1. The mean difference $Q = \bar{R}(A) - \bar{R}(B)$ is maximized.
2. The variances $F = S_A^2 + S_B^2$ are minimized (i.e., with narrow bands of
distribution curves).
That is equivalent to maximizing V = Q/F by selecting coefficients $c_k$. Once the
$c_k$’s are obtained, we simply use the pooled average of estimated means of R(A) and
R(B) to represent $R_0$:
$$R_0 = [m\bar{R}(A) + n\bar{R}(B)]/(m + n) \tag{A7.2}$$
For any given sample, we can calculate its R value and compare it to $R_0$. If it is
greater than $R_0$, it is category A; otherwise, it is category B.
DFA is implemented in PROC DISCRIM, or other procedures such as PROC
STEPDISC or PROC CANDISC in SAS.
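The classification step described above can be sketched in Python. This is a minimal illustration, not the SAS implementation: the observations are hypothetical, the coefficients $c_k$ are assumed to have been obtained already (the values below are made up), and the function names are invented for this sketch. Following the rule in the text, the weighted sum $\sum_k c_k X_k$ of a new observation is compared with the cutoff $R_0$ from Equation A7.2.

```python
from statistics import mean


def discriminant_score(obs, coefs):
    """Weighted sum of the characteristic variables: sum over k of c_k * X_k."""
    return sum(c * x for c, x in zip(coefs, obs))


def cutoff(scores_a, scores_b):
    """Pooled average of the two group means of scores (Equation A7.2)."""
    m, n = len(scores_a), len(scores_b)
    return (m * mean(scores_a) + n * mean(scores_b)) / (m + n)


def classify(obs, coefs, r0):
    """Assign category A if the score exceeds the cutoff, otherwise B."""
    return "A" if discriminant_score(obs, coefs) > r0 else "B"


# Hypothetical observations on p = 2 characteristic variables
group_a = [(5.0, 3.0), (6.0, 4.0), (7.0, 3.0)]   # m = 3 observations of type A
group_b = [(2.0, 1.0), (3.0, 2.0)]               # n = 2 observations of type B
coefs = (1.0, 0.5)                               # hypothetical c_1, c_2

scores_a = [discriminant_score(o, coefs) for o in group_a]
scores_b = [discriminant_score(o, coefs) for o in group_b]
r0 = cutoff(scores_a, scores_b)

print(r0, classify((6.0, 2.0), coefs, r0), classify((3.0, 1.0), coefs, r0))
```

In practice the coefficients would come from a procedure such as PROC DISCRIM rather than being supplied by hand.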
APPENDIX 7B: SAMPLE SAS PROGRAM FOR FACTOR AND
CLUSTER ANALYSES
/* FA_Clust.SAS runs Factor Analysis & Cluster Analysis
   for social area analysis in Beijing */
/* By Fahui Wang on 2-4-05 */

/* read the attribute data */
proc import
    datafile="c:\gis_quant_book\projects\bj\bjattr.csv"
    out=bj1 dbms=dlm replace;
    delimiter=', ';
    getnames=yes;
run;
proc means;
run;
/* Run factor analysis */
proc factor out=fscore(replace=yes)
    nfact=4 rotate=varimax; /* 4 factors used */
    var x1-x14;
run;
/* export factor score data */
proc export data=fscore dbms=csv
    outfile="c:\gis_quant_book\projects\bj\factscore.csv";
run;
/* Run cluster analysis */
/* Factor scores are first weighted by relative importance
   i.e., variance portions accounted for (based on FA) */
data clust; set fscore;
    factor1 = 0.3516*factor1;
    factor2 = 0.1542*factor2;
    factor3 = 0.1057*factor3;
    factor4 = 0.0922*factor4;
run;
proc cluster method=ward outtree=tree;
    id ref_id; var factor1-factor4; /*plot dendrogram */
run;
proc tree out=bjclus ncl=9; /*cut the tree at 9 clusters*/
    id ref_id;
run;
/* export the cluster analysis result */
proc export data=bjclus dbms=csv
    outfile="c:\gis_quant_book\projects\bj\cluster9.csv";
run;
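For readers without SAS, the weighting and clustering steps of the program can be sketched in plain Python. This is a naive illustration under stated assumptions: the factor scores are hypothetical, the function names are invented, and the weights are the ones used in the SAS program above. It implements Ward's minimum-variance criterion directly, merging at each step the two clusters whose union gives the smallest increase in the within-cluster sum of squares. It is not meant to reproduce PROC CLUSTER output exactly (for example, ties may be resolved differently).

```python
from itertools import combinations


def centroid(rows):
    """Mean of each variable over the rows of a cluster."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def ward_cost(rows_a, rows_b):
    """Increase in within-cluster sum of squares if the two clusters merge."""
    ca, cb = centroid(rows_a), centroid(rows_b)
    dist2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
    na, nb = len(rows_a), len(rows_b)
    return na * nb / (na + nb) * dist2


def ward_clusters(rows, n_clusters):
    """Agglomerate observations (by index) until n_clusters remain."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: ward_cost([rows[i] for i in pair[0]],
                                       [rows[i] for i in pair[1]]),
        )
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a + b)
    return clusters


# Weights: variance portions used in the SAS program
weights = (0.3516, 0.1542, 0.1057, 0.0922)

# Hypothetical factor scores (factor1..factor4) for five subdistricts
raw = [
    (2.0, 0.1, 0.0, 0.2),
    (1.8, -0.2, 0.3, 0.0),
    (-1.5, 0.4, -0.1, 0.1),
    (-1.7, 0.0, 0.2, -0.3),
    (-1.6, 3.0, 0.1, 0.0),
]
weighted = [tuple(w * f for w, f in zip(weights, row)) for row in raw]

groups = ward_clusters(weighted, 2)
print(sorted(sorted(g) for g in groups))
```

Because the first factor carries the largest weight, differences in land use intensity dominate the grouping in this toy example, which mirrors the purpose of the weighting step in the program.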
NOTES
1. Data standardization involves the process of converting a series of data x to a new
series x′ with mean equal to 0 and standard deviation equal to 1: $x' = (x - \bar{x})/\sigma$.
2. The choice of factor number does not affect the result of PCA. One may start with
an arbitrary number (say, 3 or 5) of factors, and results remain the same.
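The standardization in Note 1 can be sketched as follows. The data values are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption of this sketch, since the note does not say whether a sample or population standard deviation is intended.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert a series to mean 0 and standard deviation 1: (x - mean) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; an assumption, see the lead-in
    return [(v - mu) / sigma for v in values]


x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data series
z = standardize(x)
print(z, mean(z), pstdev(z))
```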