
Medical Image Analysis Methods (Part 10)


2089_book.fm copy Page 363 Tuesday, May 10, 2005 9:34 PM

10

Graph-Based Analysis of
Amino Acid Sequences
Luciano da Fontoura Costa

CONTENTS
10.1 Introduction
10.2 Complex-Networks Concepts and Tools
10.2.1 Brief Historic Perspective
10.2.2 Basic Mathematical Concepts
10.2.2.1 Graph Theory Basics
10.2.2.2 Probabilistic Concepts
10.2.2.3 Random Graph Models
10.2.2.4 Small-World and Scale-Free Models
10.3 Complex-Networks Approaches to Bioinformatics
10.4 Sequences of Amino Acids as Weighted, Directed Complex
Networks
10.5 Results
10.5.1 Zebra Fish
10.5.2 Xenopus
10.5.3 Rat
10.6 Discussion
10.7 Concluding Remarks and Future Work
Acknowledgments
References

10.1 INTRODUCTION
One of the most essential features underlying natural phenomena and dynamical
systems is the set of connections, implications, and causalities between the several
involved elements and processes. For instance, the whole dynamics of gene activation
can be understood as a highly complex network of interactions, in the sense that
some genes are enhanced while others are inhibited by several environmental factors,
including the current biochemical composition of the individual (such as the presence
of specific genes/proteins) as well as external effects such as temperature and
interaction with other individuals. Interestingly, such a network of effects extends
much beyond the individual in time and space, in the sense that any living being is

Copyright 2005 by Taylor & Francis Group, LLC


affected by history (i.e., evolutionary processes) and spatial interactions (i.e., ecology). Although biology can only be fully understood and explained by considering
the whole of such an intricate network of effects, reductionist approaches can still
provide many insights about biological phenomena that are more localized in time
and space, such as the genetic dynamics during an individual lifetime or an infectious
process.
The large masses of data produced by experimental work in biology, molecular
biology, and genetics can only be properly organized, analyzed, and modeled by
using computer concepts including databases, networks, parallel computing, and
artificial intelligence, with special emphasis placed on signal processing and pattern
recognition. The incorporation of such modern computer concepts and tools into
biology and genetics has been called bioinformatics [1]. The applications of this
new area to genetics are manifold, ranging from nucleotide analysis to animal
development. Among the several signal-processing methods considered in bioinformatics [2] are the application of Markov random fields to model sequences
of nucleotides, the use of correlation and covariance to characterize sequences of
nucleotides and amino acids, and wavelets [2, 3].
One particularly important problem concerns the analysis of proteins, the basic
blocks of life [4, 5]. Constituted by sequences of amino acids, proteins participate
in all vital processes, acting as catalysts; providing the mechanical scaffolding for
cells, organs, and tissues; and participating in DNA expression. Proteins are polymers
of amino acids, determined from the DNA through the process of protein expression.
Many of the properties of proteins derive from their spatial shape and electrical
affinities, which are both defined by the specific sequences of constituent amino
acids [4, 5]. Therefore, given the sequence of amino acids specified by the DNA,
the protein folds into specific forms while taking into account the interactions
between the amino acids and external influence of chaperones. It remains an open
problem how to determine the structural properties of proteins from the respective
amino acid sequences, a problem known as protein folding [4, 5]. Except for some
basic motifs, such as alpha-helices and beta-sheets, which are structures that appear
repeatedly in proteins, the prediction of protein shape constitutes an intense research
area. Experimentally, the sequences of amino acids underlying proteins can be
obtained by using sequencing machines capable of reading the nucleotides, which
are subsequently translated into amino acids by considering triples of nucleotides,
the so-called codons, translated according to the genetic code [3–5].
By being inherently oriented toward representing connections and implications,
graphs stand out as one of the most general and interesting data structures that can
be used to represent biological systems. Basically, a graph is a representational
structure composed of nodes, which are connected through directed or undirected
edges. Any structure or phenomenon can be represented to varying degrees of
completeness in terms of graphs, where each node would correspond to an aspect
of the phenomenon and the edges to interactions. Such a potential for representation
and modeling is greatly extended by the many types of graphs, including those with
weighted edges, different types of coexisting nodes or edges, and hypergraphs, to
name only a few. Interestingly, most biological phenomena can be properly represented in terms of graphs, including gene activation, metabolic networks, evolution
(recall that hierarchical structures such as trees are special kinds of graphs), ecological interactions, and so on. However, despite the natural potential of graphs for
representing and studying natural phenomena, their application was timid until the
recent advent of the area of complex networks. One of the possible reasons for that
is that graphs had been often understood as representations of static interactions, in
the sense that the connections between nodes were typically assumed not to change
with time. Thus, the uses of graphs in biology, for instance, were mainly constrained
to representing evolutionary hierarchies (in terms of trees) and metabolic networks.
This situation underwent an important recent change sparked mainly by the
pioneering developments in random networks by Rapoport [6] and Erdős and Rényi
[7], the small-world models of Watts and Strogatz [8], and the scale-free networks of Barabási
[9]. The research of such types of complex graphs became united under the name
of complex networks [10–12]. Now, in addition to the inherent potential of graphs
to nicely represent natural phenomena, important connections were established with
dynamical systems, statistical physics, and critical phenomena, while many possibilities for multidisciplinary research were established between areas such as graph
theory, statistical physics, nonlinear dynamical systems, and complexity theory.
Beyond such promising perspectives, one of the often overlooked reasons why
complex networks have become so important for modern science is that studies in
this area tend to investigate the dynamical evolution of the graphs [10–12], which
can provide key insights about the relationship between the topology and function
of such complex systems. For example, one of the most interesting properties
exhibited by random graphs is the abrupt appearance, as new edges are progressively
added at random, of a giant cluster that dominates the graph structure and connections henceforth. Thus, in addition to being typically large (several studies in complex
networks consider infinitely large graphs), the graphs were now used to model
growing processes. Allied to the inherent vocation of graphs to represent connections,
interactions, and causality, the possibility of modeling dynamical evolution in terms
of complex networks has made this area into one of the most promising scientific
concepts and tools.
The present chapter is aimed at addressing how complex-network research has
been applied to bioinformatics, with special attention given to the characterization
and analysis of amino acid sequences in proteins. The text starts by reviewing the
basic context, concepts, and tools of complex-network research and continues by
presenting some of the main applications of this area in bioinformatics. The remainder of the chapter describes the more specific investigation of amino acid sequences
in terms of complex networks obtained for graphs derived from subsequence strings.

10.2 COMPLEX-NETWORKS CONCEPTS AND TOOLS
10.2.1 BRIEF HISTORIC PERSPECTIVE
The beginnings of complex-network research can be traced back to the pioneering
and outstanding works by Rapoport [6] and Erdős and Rényi [7], who concentrated
attention on the type of networks currently known as random networks. This name
is somewhat misleading in the sense that many other network models are also
random. The essential property of random networks as understood in graph theory,
therefore, is not merely that they are random, but that they follow a particular probabilistic model,
namely the uniform random distribution [13]. In other words, given a set of N nodes,
connections are established by choosing pairs of nodes according to the uniform
probability density. In the case of undirected graphs, the edges are uniformly sampled
out of the N(N–1)/2 possible connections. Consequently, random networks correspond to the maximum entropy hypothesis of connectivity evolution, providing a
suitable null hypothesis against which several real and theoretical models can be
compared and contextualized.
One of the most interesting features of random networks is the fact that the
progressive addition of new edges tends to abruptly form a giant, dominating cluster
(or connected component) in the graph. Such a critical transition is particularly
interesting not only because it represents a sudden change of the network connectivity, but because it provides a nice opportunity for connecting graph theory to
statistical physics. Indeed, the appearance of the giant cluster can be understood as
a percolation of the graph, similar to critical phenomena (phase transitions) underlying the transformation of ice into water. Basically, percolation corresponds to an
abrupt change of some property of the analyzed system as some parameter is
continually varied. This interesting connection between graph theory and statistical
physics has provided unprecedented opportunities for multidisciplinary works and
applications, nicely bridging the gap between areas such as complexity analysis,
which is typical of graph theory, and the study of systems involving large numbers
of elements, typical in statistical physics. In addition to such an exciting perspective,
random networks attracted much interest as possible models of real structures and
phenomena in nature, with special emphasis given to the Internet and the World
Wide Web.
After the fruitful studies of Rapoport and Erdős and Rényi, the study of large
networks (note that the term complex network was not typical at those times) went
through a period of continuing academic investigation followed by few applications,
except for promising investigations in areas such as sociology. Indeed, one of the
next important steps shaping the modern area of complex networks was the investigation of personal interactions in society, of which the 1998 work by Watts and
Strogatz [8] represents the basic reference. Basically, experimental investigations
regarding social contacts led to the result that the average path length between any two
nodes (i.e., persons) is rather small, hence the name small-world networks. The
typical mathematical model of such networks starts with a regular graph, which
subsequently has a percentage of its connections rewired according to uniform
probability. Although such investigations brought many insights to the area, the
small-world property was later verified to be an almost ubiquitous property of
complex networks. The subsequent investigations of the topological properties of
the Internet and WWW performed by Albert and Barabási [9] led to the important
discovery that the statistical distribution of the node degrees (i.e., the number of
connections of a node) in several complex networks tends to follow a power law,
indicating scale-free behavior. Unlike the random model, this property favors the
appearance of nodes concentrating many of the connections, the so-called hubs.
Such underlying structure has several implications, for instance regarding resilience to attack: scale-free networks are
particularly fragile when their hubs are attacked. From then on, the developments in complex-network research boomed, covering several types of natural systems, from epidemics
to the economy. The interested reader is encouraged to check the excellent surveys of
this area [10–12] for complementary information.
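The small-world construction mentioned above (a regular graph with a fraction of its connections rewired uniformly at random) can be sketched as follows. This is only an illustrative simplification in the spirit of the Watts and Strogatz model, not the procedure of [8] itself; the function names `ring_lattice` and `rewire`, the ring-shaped starting graph, and the parameter values are assumptions made for the example.

```python
import random

def ring_lattice(n, k):
    """Regular graph (assumed here to be a ring): each node is linked
    to its k nearest neighbours on one side, giving n*k edges when n > 2k."""
    return {(i, (i + d) % n) for i in range(n) for d in range(1, k + 1)}

def rewire(edges, n, p, rng):
    """With probability p, replace edge (a, b) by (a, c), where c is
    drawn uniformly among nodes not yet connected to a."""
    current = {frozenset(e) for e in edges}
    for a, b in sorted(edges):
        if rng.random() < p:
            candidates = [c for c in range(n)
                          if c != a and frozenset((a, c)) not in current]
            if candidates:
                current.remove(frozenset((a, b)))
                current.add(frozenset((a, rng.choice(candidates))))
    return current

rng = random.Random(1)                 # fixed seed, for repeatability only
regular = ring_lattice(30, 2)
small_world = rewire(regular, 30, 0.1, rng)
```

Rewiring keeps the number of edges fixed, so any change in average path length comes from the new long-range connections rather than from added density.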

10.2.2 BASIC MATHEMATICAL CONCEPTS
This section provides a brief introductory review of basic concepts and measurements
in graph theory, statistics, random graphs, and small-world and scale-free networks.
Readers who are already familiar with such topics can proceed directly to Section
10.3.
10.2.2.1 Graph Theory Basics

Basically, a typical graph [14–17] in complex-network theory [10–12] involves a
collection of N nodes i = 1, 2, …, N that are connected through edges (i,j) that can
have weights w(i,j). Such a data structure is precisely and completely represented by
the respective weight matrix W, where each entry W(j,i) represents the weight of
edge (i,j). Nonexistent edges are represented as null entries in that matrix. The
adjacency matrix K of the graph is a matrix where the value 1 is assigned to an
element (i,j) whenever there is an edge connecting node j to i, and 0 otherwise. The
adjacency matrix can be obtained from the weight matrix by setting each element
larger than or equal to a specific threshold value T to 1, assigning 0 otherwise. Such
adjacency matrices, henceforth represented as K_T, provide an indication of the network structure defined by the weights that are higher than the threshold. Therefore,
the adjacency matrix for high values of T can be understood as the strongest component, or “kernel,” of the weighted graph. Observe that it is also possible to consider
the complementary matrix of K_T with respect to K, which is defined as follows. Each
element (i,j) of such a matrix, henceforth abbreviated as Q_T, receives value 1 iff K_T(i,j)
= 0 and K(i,j) ≠ 0. An undirected graph is characterized by undirected edges, so that
K(j,i) = 1 iff K(i,j) = 1, i.e., K is symmetric. A directed graph, or digraph, is
characterized by directed edges and not necessarily by a symmetric adjacency matrix.
One of the most basic and interesting local features of a graph or network is the
number of connections of a specific node i, which is called the node degree and
often abbreviated as k_i. Observe that a directed graph has two types of such a degree,
the indegree and the outdegree, corresponding to the number of incoming and
outgoing edges, respectively.
Figure 10.1 illustrates the concepts introduced here with respect to an undirected
graph G and a directed graph H, identifying the nodes, edges, and weights. This
figure also shows the respective weight matrices WG and WH and adjacency matrices
AG and AH. The degree of node 1 in G is 2, the outdegree of node 1 in H is 2, and
the indegree of node 1 in H is 1. N is equal to 4 for both graphs.
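The definitions above can be tried out on the two example graphs. The sketch below is a minimal illustration, assuming the weight matrices as read from Figure 10.1 and the convention that entry W(j,i) holds the weight of the edge from i to j; the helper names (`threshold`, `complement`, `indegree`, `outdegree`) are hypothetical, and the "greater than or equal to T" rule is used for thresholding.

```python
# Weight matrices as read from Figure 10.1 (an assumption of this sketch).
# Row j, column i holds the weight of the edge going from node i to node j.
W_G = [[0, 2, 4, 0],
       [2, 0, 0, 0],
       [4, 0, 0, 1],
       [0, 0, 1, 0]]
W_H = [[0, 0, 2, 0],
       [2, 0, 0, 0],
       [3, 0, 0, 1],
       [0, 0, 0, 0]]

def threshold(W, T):
    """K_T: 1 where the weight is >= T, 0 elsewhere."""
    return [[1 if w >= T else 0 for w in row] for row in W]

def complement(K, K_T):
    """Q_T: 1 where an edge exists in K but is absent from K_T."""
    return [[1 if k != 0 and kt == 0 else 0 for k, kt in zip(rk, rkt)]
            for rk, rkt in zip(K, K_T)]

def outdegree(K, i):
    """Edges leaving node i (1-based): nonzero entries in column i."""
    return sum(1 for row in K if row[i - 1] != 0)

def indegree(K, i):
    """Edges entering node i (1-based): nonzero entries in row i."""
    return sum(1 for k in K[i - 1] if k != 0)

K_G = threshold(W_G, 1)        # all edges of G (integer weights >= 1 assumed)
K_H = threshold(W_H, 1)        # all edges of H
K_H2 = threshold(W_H, 2)       # kernel of H keeping only weights >= 2
Q_H2 = complement(K_H, K_H2)   # edges of H dropped by the threshold

print(outdegree(K_G, 1), outdegree(K_H, 1), indegree(K_H, 1))
```

For an undirected graph such as G the matrix is symmetric, so counting along a row or a column gives the same degree.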
A great part of the importance of graphs stems from their generality for representing, in an intuitive and explicit way, virtually any discrete structure while emphasizing the involved entities (nodes) and connections. Indeed, virtually every data
structure (e.g., tree, queue, list) is a particular case of a graph. In addition, graphs
can be used to represent the most general mesh of points used for numeric simulation
of dynamic systems, from the regular orthogonal lattice used in image representation
to the most intricate adaptive triangulations. As such, graphs are poised to provide
one of the keys for connecting not only structure and function, but also several
different biological areas and even the whole of science.

[Figure 10.1: drawings of graphs G and H, with nodes, edges, and weights labeled]

WG =
0 2 4 0
2 0 0 0
4 0 0 1
0 0 1 0

WH =
0 0 2 0
2 0 0 0
3 0 0 1
0 0 0 0

AG =
0 1 1 0
1 0 0 0
1 0 0 0
0 0 0 0

AH =
0 0 1 0
1 0 0 0
1 0 0 0
0 0 0 0

FIGURE 10.1 Basic concepts in graph theory: examples of undirected (G) and directed (H)
graphs, with respective nodes, edges, and weights. The weight matrices of G and H are WG
and WH, and the respective adjacency matrices considering threshold T = 1 are given as AG
and AH.
Several measurements or features have been proposed and used to express
meaningful and useful global properties of the network structure. In similar fashion
to feature selection in the area of pattern recognition (e.g., [13]), the choice of such
features has to take into account the specific problem of interest. For instance, a
problem of communication along the network needs to take into account the distance
between nodes. It should be observed that, in most cases, the selected set of features
is degenerate, in the sense that it is not enough to reproduce the original network
structure. Therefore, great care must be taken when deriving general conclusions
based on incomplete sets of measurements, as is almost always the case. Some of
the more traditional network measurements are reviewed in the following paragraph.
A global measurement usually derived from the node degree is its average
value <k> over the whole network. Observe that, for a digraph, the average indegree
and outdegree are necessarily identical. The average node degree gives a first idea
about the overall connectivity of the network. Additional information about the
network connectivity can be obtained from the average clustering coefficient <C>.
Given one specific node i, the immediately connected nodes are identified, and the
ratio between the number of connections between them and the maximum possible
value of those connections defines the clustering coefficient of node i, i.e., Ci. This
feature tends to express the local connectivity around each node. Another interesting
and frequently used network measurement is the length between any two nodes i
and j, here denoted as L(i,j). This distance may refer either to the minimal sum of
weights along a path from i to j, or to the minimal number of edges on a path between those two
nodes. The present work is restricted to the latter. The respectively derived global
feature is the average length considering all possible pairs of network nodes, henceforth
<L>. This measurement provides an idea not only about the proximity between
nodes, but also about the overall network connectivity, in the sense that low average-distance values tend to indicate a densely connected structure. Another interesting
measurement that has been used to characterize complex networks is the betweenness
centrality. Roughly, the betweenness centrality of a specific network node in an
undirected graph corresponds to the number of shortest paths between any pair of
nodes in the network that cross that node [18].
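The measurements <k>, <C>, and <L> described above can be computed directly from a neighbour list. The following is a minimal sketch for a small undirected graph; the example graph `adj` and the function names are hypothetical, and two conventions are assumed: nodes with fewer than two neighbours get clustering coefficient 0, and <L> is averaged only over pairs that are actually connected.

```python
from collections import deque

def average_degree(adj):
    """<k>: mean number of neighbours per node."""
    return sum(len(nb) for nb in adj.values()) / len(adj)

def clustering_coefficient(adj, i):
    """C_i: existing links among the neighbours of i, divided by the
    maximum possible number of such links."""
    nb = list(adj[i])
    k = len(nb)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbours
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nb[b] in adj[nb[a]])
    return links / (k * (k - 1) / 2)

def path_lengths_from(adj, source):
    """L(source, j) in number of edges, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(adj):
    """<L>: mean of L(i, j) over all ordered pairs of connected nodes."""
    total = pairs = 0
    for s in adj:
        for t, d in path_lengths_from(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical undirected graph with edges 1-2, 1-3, 2-3, 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
avg_k = average_degree(adj)
avg_C = sum(clustering_coefficient(adj, i) for i in adj) / len(adj)
avg_L = average_path_length(adj)
```

Breadth-first search is enough here because the chapter restricts L(i,j) to the number of edges; a weighted distance would call for a shortest-path algorithm such as Dijkstra's instead.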

10.2.2.2 Probabilistic Concepts
Any measurement whose outcome cannot be exactly predicted, such as the weight
of an inhabitant of Chicago, can be represented in terms of a random variable [13,
19]. Such variables can be completely characterized in terms of the respective density
functions, which can be approximated in terms of the respective relative frequency
histogram. Alternatively, a random variable can also be represented in terms of its
(possibly infinitely many) moments, including the mean, variance, and so on. Statistical
density functions of special interest for this chapter include the uniform distribution,
which assigns the same probability to any possible measurement, and the Poisson
distribution, which is characterized in terms of a rate of event occurrence per length,
area, or volume. For instance, we may have that the chance of having a failure in
an electricity transmission cable is equal to one failure per 10,000 km. Therefore,
the chance of observing the event is the same at every point of the considered structure (e.g., the transmission cable), i.e., it is uniform along the considered parameter (e.g., length or time).
Such concepts can be immediately extended to multivariate measurements by
introducing the concept of random vector. For instance, the temperature and pressure
of an inhabitant of Chicago can be represented as the two-dimensional random vector
[T, P]. Such statistical entities are also completely characterized, in statistical terms,
by their respective multivariate densities. Statistical and probabilistic concepts and
techniques are essential for representing and modeling natural phenomena and biological data because of the intrinsic variation of such measurements.
10.2.2.3 Random Graph Models
The first type of complex networks to be systematically investigated was the random
graph [6, 7, 10–12, 20]. To build such a graph, one starts with N unconnected nodes
and progressively adds edges between pairs of nodes chosen according to the uniform
distribution. Although the measurements described in Section 10.2.2.1 are useful for
characterizing the structure of such networks, it is also important to take into account
parameters and measurements governing their dynamical evolution, including the
critical phenomenon of percolation. As more connections are progressively added
to a growing network, there is a definite tendency to form a giant cluster (percolation), which henceforth dominates the growing dynamics. Given a network, a cluster
is understood as the set of nodes (and respective interconnecting edges) such that
one can reach any node while starting from any other node in the cluster, i.e., the
cluster is a connected component of the graph. The giant cluster corresponds to the
cluster with the largest number of nodes at a given step of the network evolution.
For an undirected random network, this phenomenon has been found to take place
when the fraction of existing connections with respect to the maximum possible
number of connections is about 1/N [5].
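The emergence of the giant cluster can be observed numerically. The sketch below, under assumed parameter values (N = 400 and edge counts at 0.25, 1, and 4 times the critical fraction 1/N), samples edges uniformly among the N(N–1)/2 possible pairs and measures the largest connected component with a union-find structure; the function names are hypothetical.

```python
import random
from itertools import combinations

def random_graph(n, n_edges, rng):
    """Sample n_edges distinct undirected edges uniformly among
    the n(n-1)/2 possible pairs of nodes."""
    return rng.sample(list(combinations(range(n), 2)), n_edges)

def largest_cluster_size(n, edges):
    """Number of nodes in the largest connected component (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    sizes = {}
    for v in range(n):
        r = find(v)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values())

rng = random.Random(0)                 # fixed seed, for repeatability only
n = 400
max_edges = n * (n - 1) // 2
results = []
for mult in (0.25, 1.0, 4.0):
    m = round(mult * max_edges / n)    # fraction of connections = mult / N
    size = largest_cluster_size(n, random_graph(n, m, rng))
    results.append((m, size))
    print(f"{m} edges: largest cluster has {size} of {n} nodes")
```

Well below the critical fraction the largest cluster stays small, while well above it a single cluster is expected to hold most of the nodes.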
10.2.2.4 Small-World and Scale-Free Models
The types of complex networks known as small world and scale free were identified
and studied years after Erdős and Rényi investigated random graphs. Small-world
networks [8, 10] are characterized by a short path between any pair of their constituent
nodes. A typical example of such a network is the set of social interactions within a given
society, in the sense that there are just a few (about five or six) relations between
any two persons. Characterized later than small-world models, scale-free networks [10–12] are distinguished by the fact that the statistical distribution of the
respective node degrees follows a power law, i.e., the representation of such a density
in a log-log plot produces a straight line. Such densities, unlike those observed for
other types of networks, imply a substantially higher chance of having nodes of
high degree, which are traditionally called hubs. As reviewed in the next section,
such nodes have been identified as playing an especially important role in biological
networks. Scale-free networks can be produced by using the preferential-attachment
growth strategy [10–12], characterized by the progressive addition of new nodes
with a fixed number of edges that are connected preferentially with nodes of higher
degree, giving rise to the paradigm that has become known as “the rich get richer.”
At the same time, scale-free networks have also been shown to be less resilient to
attacks on their hubs than other types of networks, such as random graphs [10].
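The preferential-attachment growth strategy can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than the exact procedure of the cited surveys: growth starts from a small fully connected seed of m + 1 nodes, each new node brings m edges, and targets are drawn with probability proportional to their current degree; the function name and parameter values are hypothetical.

```python
import random

def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes. Each new node attaches m edges to
    distinct existing nodes, chosen proportionally to their degree."""
    # Seed: m + 1 fully connected nodes (an assumption of this sketch).
    edges = [(a, b) for a in range(m + 1) for b in range(a)]
    # Each node appears in 'ends' once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw of a node.
    ends = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges

rng = random.Random(0)                 # fixed seed, for repeatability only
edges = preferential_attachment(200, 2, rng)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
hubs = sorted(degree, key=degree.get, reverse=True)[:5]
```

Sorting nodes by degree gives a quick look at the hubs; with this growth rule the earliest nodes tend to accumulate the most connections, which is the "rich get richer" effect.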

10.3 COMPLEX-NETWORKS APPROACHES TO
BIOINFORMATICS
Several possibilities of using complex networks and statistical physics in biology
have been described and reviewed by Bose in his interesting and extensive survey
[21]. Special attention is given to relationships between the network’s topology and
functional properties, and the following three situations are covered in considerable
depth:
1. The topology of complex biological networks, such as metabolic and
protein-interaction networks
2. Nonlinear dynamics in gene expression
3. The effect of stochasticity on the network dynamics
While we review in the following some of the most representative works applying
complex-network research to biology, the reader is encouraged to complement and
extend our review by referring to Bose’s survey.
Metabolic reactions, one of the key elements of life, were among the first to be
studied by complex-network approaches. Such networks have their nodes representing the molecular compounds (or substrates), and the edges indicate the metabolic
reactions connecting substrates. Incoming links to a substrate are understood to
correspond to the reactions of which that substrate is a product. The pioneering
investigation by Jeong et al. [22] considered networks that are available for 43
organisms, yielding average node indegree and outdegree in the range from 2.5 to
4, with the respective distribution being understood as scale free with exponents
close to 2.2. The metabolic reactions of E. coli have been studied as undirected
graphs by Wagner and Fell [23], yielding average node degree of 7 and a clustering
coefficient (approximately 0.3) much larger than could be obtained for a random
network. An interesting investigation into whether the duplication of information in
genomes can significantly affect the power law exponents was reported by Chung
et al. [24]. By using probabilistic methods as the means to analyze the evolution of
graphs under duplication mechanisms, those authors were able to show that such
mechanisms can produce networks with low power-law exponents, which are compatible with many biological networks [25].
The decomposition of biochemical networks into hierarchies of subnetworks,
i.e., networks obtained by considering a subset of the nodes of the original graph
and some of the respective edges, has been addressed by Holme and Huss [18].
These authors use the algorithm of Girvan and Newman [26] for tracing subnetworks,
in a form adapted to bipartite representations of biochemical networks. The underlying principle of the algorithm is the fact that vertices between densely connected
areas have high betweenness centrality, such that the removal of vertices with high betweenness leads to
the partition of the whole network into subnetworks that are contained in previous
clusters, thereby producing a hierarchy of subnetworks.
Another extremely important type of biological network, corresponding to
genomic regulatory systems (i.e., the set of processes controlling gene expression),
has also been the subject of increasing attention in complex-network research. This type
of directed network is characterized by having nodes corresponding to components
of the system, with the edges representing the gene-expression regulations [11]. An
important type of network in this category is that obtained from protein-protein
interactions. In this type of network, each node corresponds to a protein, and the
directed edges represent the interactions. A model of regulatory networks has been
described by Kuo and Banzhaf [27]. A pioneering approach in this area is the work
of Jeong et al. [28], which considered protein–protein interaction networks of S.
cerevisiae, containing thousands of edges and nodes. The degree distribution was
interpreted as following scale-free behavior with an approximate exponent of 2.5.
One of the most important conclusions of that investigation was that the removal of
the most-connected proteins (i.e., hubs, the nodes of a complex network receiving
a large number of connections) can have disastrous effects on the proper functioning
of the individual. The issue of protein–protein interaction networks has also been
considered in a number of other works, including Qin et al. [29], Wagner [30],
Pastor-Satorras et al. [31], and in studies of the properties and evolution of such
networks. Another related work, described by Wuchty [32], considered graphs
obtained by assigning a node to every protein domain (or module) and an edge
whenever two such domains are found in the same protein.
The important problem of determining protein function has been addressed from
the perspective of networks of physical interaction by Vazquez et al. [33]. Their
method is based on the minimization of the number of interacting proteins with
different categories, so that the function estimation can be performed on a global
scale while considering the entire connectivity of the protein network. The obtained
results corroborate the validity of using protein-protein interaction networks as a
means of inferring protein function, despite the unavoidable presence of imperfections and the incompleteness of protein networks.
The analysis of gene-expression networks in terms of embedded complex logistic maps (ECLM), a hybrid method blending some concepts from wavelets and
coupled logistic maps, has been reported by Shaw [34]. That study considered 112
genes collected at nine different time instants along 25 days, with each time point
being fitted to an ECLM model with high Pearson correlation coefficient, and the
connections between genes were determined by considering models with high pairwise correlation. The obtained connections were interpreted as following scale-free
behavior in both topology and dynamics.
A work by Bumble et al. [35] suggests that the study of pathways of network
syntheses of genes, metabolism, and proteins should be extended to the investigation
of the causes and treatment of diseases. Their approach involves methods capable
of yielding, for a specific set of candidate reactions, a complete metabolic pathway
network. Interesting results are obtained by investigating qualitative attributes,
including relationships regarding the connectivity between vertices and the strength
of connections, the relationship of interaction energies and chemical potentials with
the coordination number of the lattice models, and how the stability of the networks
is related to their topology.
An interesting approach to analyzing the amino acid sequence of a protein in
terms of successively overlapping strings of length K has been described by Hao
et al. [36]. The strings of amino acids are represented as graphs by associating each
possible subsequence of length K with a graph node, and having the edges represent
the observed successive transitions between subsequences. Their investigation targeted
the reconstruction of the original sequences from the overlapping string networks,
which can be approached by counting the number of Eulerian loops (i.e., cyclic
sequences of connected edges that are followed without repetition). More specifically,
the sequences are reconstructed while starting with the same initial subsequence, using
each of the subsequences the same number of times as observed in the original data,
and respecting a fixed sequence length. It was verified that the reconstruction is
unique for K ≥ 5 for the majority of the considered networks (PDB.SEQ
database [37]).
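The overlapping-string representation can be illustrated with a minimal sketch. It only builds the transition counts between successive K-strings, as described in the paragraph above; it does not count Eulerian loops and is not the implementation of Hao et al. The function name and the input string are chosen here purely for illustration.

```python
from collections import Counter

def kstring_transitions(sequence, k):
    """Count how often each K-string is immediately followed by the next
    overlapping K-string (window shifted by one position)."""
    kstrings = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return Counter(zip(kstrings, kstrings[1:]))

# Illustrative input only; any amino acid string can be used.
transitions = kstring_transitions("MEQWPLLFVVALCI", 3)
for (source, target), count in sorted(transitions.items()):
    print(source, "->", target, count)
```

Each key of `transitions` is one directed edge between two K-strings, and its value is the number of times that transition was observed.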
The present work addresses co-occurrence strings of amino acids (or any other
basic biological element) similar to the scheme described in the previous paragraph,
but here the subsequences do not necessarily overlap, and the number of times a
subsequence is followed by another is represented by the weight of the respective
edge in the associated graph, following the same scheme used for concept association
as described in the literature [38, 39]. More specifically, whenever a subsequence
of amino acids B is followed by another subsequence C, the weight of the edge
connecting the two nodes representing those subsequences is increased by 1.
Therefore, such a weighted, directed graph provides information about the number of
times a specific subsequence is followed by other possible subsequences, which can be
related to the statistical concept of correlation, with the difference that the order
of the data is, unlike in the correlation, taken into account. As such, the obtained
graph can be explored to characterize and model sequences of amino acids according
to varying subsequence sizes. Moreover, by thresholding the weight matrix for
successive threshold values, it is possible to identify subgraphs of the network
corresponding to a strongly connected kernel of subsequences.

FIGURE 10.2 The grouping scheme considered in this work, including two successive
windows of size m and n, with overlap of g elements.

10.4 SEQUENCES OF AMINO ACIDS AS WEIGHTED,
DIRECTED COMPLEX NETWORKS
A protein can be specified in terms of its respective sequence of amino acids,
represented by the string S = A_1 A_2 … A_N, where each element A_i corresponds to
one of the 20 possible amino acids, as indicated in Table 10.1.
It is possible to subsume an amino acid sequence S by grouping subsequences
of amino acids into new numerical codes with higher values, in a way similar to
that described by Hao et al. [36]. The grouping scheme adopted in this work is
illustrated in Figure 10.2, where the first and second groups contain m and n amino
acids, respectively. While it is possible to consider m ≠ n, we henceforth adopt m =
n. The groups are taken with an overlap of g positions, with 0 ≤ g ≤ m.
For each reference position i, we have two numerical codes B and C, obtained
as follows:

$$B = (A_i - 1)\,20^{m-1} + \dots + (A_{i+m-2} - 1)\,20 + A_{i+m-1} \qquad (10.1)$$

and

$$C = (A_{i+m-g} - 1)\,20^{n-1} + \dots + (A_{i+m+n-g-2} - 1)\,20 + A_{i+m+n-g-1} \qquad (10.2)$$

Therefore, we have that 1 ≤ B, C ≤ 20^m.


TABLE 10.1
Amino Acids and Respective Numerical Codes

Abbreviation    Numerical Code
A               1
R               2
D               3
N               4
C               5
E               6
Q               7
G               8
H               9
I               10
L               11
K               12
M               13
F               14
P               15
S               16
T               17
W               18
Y               19
V               20

An example of this coding scheme is given in the following. Let the original
protein sequence in abbreviated amino acids be
S = MEQWPLLFVVALCI
or, in numerical codes,
S = (13)(6)(7)(18)(15)(11)(11)(14)(20)(20)(1)(11)(5)(10)
For m = n = 2 and g = 0, we have:

i     B     C
1     246   138
2     107   355
3     138   291
4     355   211
5     291   214
6     211   280
7     214   400
8     280   381
9     400   11
10    381   205
11    11    90

Similarly, for m = n = 3 and g = 1, we obtain:
i     B      C
1     4907   2755
2     2138   7091
3     2755   5811
4     7091   4214
5     5811   4280
6     4214   5600
7     4280   7981
8     5600   205
9     7981   4090

Observe that the different ranges of i obtained in these two examples are a direct
consequence of the fact that the larger size of the subsequences in the second example
reduces the number of possible subsequence associations.
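The coding scheme of Equation 10.1 and Equation 10.2 can be sketched in a few lines of Python. This is only an illustration under the assumption that reference positions advance one amino acid at a time, as in the first example above; the names `encode_window` and `window_pairs` are introduced here and do not come from the original work.

```python
# Codes 1..20 in the order of Table 10.1.
AMINO_ACIDS = "ARDNCEQGHILKMFPSTWYV"
CODE = {aa: idx + 1 for idx, aa in enumerate(AMINO_ACIDS)}

def encode_window(codes):
    """Base-20 code of one window, as in Equation 10.1 and Equation 10.2:
    every element except the last enters as (A - 1), the last one as A."""
    value = 0
    for a in codes[:-1]:
        value = (value + (a - 1)) * 20
    return value + codes[-1]

def window_pairs(sequence, m, n, g):
    """Return (i, B, C) for every 1-based reference position i at which
    both windows fit inside the sequence."""
    codes = [CODE[aa] for aa in sequence]
    span = m + n - g                      # positions covered by both windows
    rows = []
    for start in range(len(codes) - span + 1):
        b = encode_window(codes[start:start + m])
        c = encode_window(codes[start + m - g:start + m - g + n])
        rows.append((start + 1, b, c))
    return rows

rows = window_pairs("MEQWPLLFVVALCI", m=2, n=2, g=0)
for i, b, c in rows:
    print(i, b, c)
```

For m = n = 2 and g = 0 the printed rows agree with the first example above, starting with (1, 246, 138) and ending with (11, 11, 90).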
Now, having defined the grouping scheme and the resulting sequences B and C,
the graph representing the subsequent (with possible overlap) co-occurrences of
numerical codes in this sequence is obtained as follows:
1. Each code in the sequences B and C is represented as one of the N nodes
of the graph, whose number corresponds to the code produced for the
respective sequence. For instance, the sequence (13)(6) implies a graph
with two nodes identified as 13 and 6 containing a directed edge
from node 13 to node 6. Therefore, for a given m = n, we have a maximum
of 20^m nodes, numbered from 1 to 20^m. Observe, however, that the resulting
network does not necessarily include all possible nodes, allowing a
reduction of the network size.
2. Every time a code B is followed by a code C, the weight of the edge
connecting from node B to C is incremented by 1. In other words, the
weight of the edge uniting two specific sequences B and C is equal to the
number of times those two sequences are found to follow one another, in
that same order, along the analyzed sequence of amino acids.
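The two steps above can be sketched as follows. The example uses the 17-code sequence of Figure 10.3 with m = n = 1 and g = 0, so that each code B is simply followed by the next code C. Storing the graph as a `Counter` keyed by (B, C) pairs is a choice made here for brevity, not necessarily the data structure used in the original work.

```python
from collections import Counter

def edge_weights(code_pairs):
    """Weight of the directed edge (B, C) = number of times B is followed by C.
    Only edges that actually occur are stored."""
    return Counter(code_pairs)

# Sequence used for Figure 10.3, with m = n = 1 and g = 0.
sequence = [13, 6, 7, 18, 15, 11, 11, 14, 20, 20, 1, 11, 5, 10, 15, 11, 14]
weights = edge_weights(zip(sequence, sequence[1:]))
nodes = sorted({code for edge in weights for code in edge})

for (b, c), w in sorted(weights.items()):
    print(f"{b} -> {c}: weight {w}")
```

Only the 11 codes that occur become nodes, which reflects the remark that the network need not include all 20^m possible nodes. The edges 15 -> 11 and 11 -> 14 receive weight 2, and all other edges weight 1.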
Figure 10.3 illustrates the graph obtained from the sequence (13)(6)(7)(18)(15)
(11)(11)(14)(20)(20)(1)(11)(5)(10)(15)(11)(14) considering m = 1, where each node
is represented by the respective code, and the edge weights (shown in italics)
represent the number of successive subsequence (in this case a single amino acid)
transitions.

FIGURE 10.3 The network obtained for m = 1 for the amino acid sequence (13)(6)(7)(18)(15)
(11)(11)(14)(20)(20)(1)(11)(5)(10)(15)(11)(14). The weights of the edges are shown in italics.

In this sense, the obtained graph represents the “unidirectional” correlations
between two subsequent (with possible overlap) subsequences of amino acids in the
analyzed protein. Such a network can be understood as a statistical model of the
original protein for the specific correlation length implied by m and g. As such, it
is possible to obtain simulated sequences of amino acids following such statistical
models by performing Monte-Carlo simulation over the outgoing edges of each node, in
the sense that each outgoing edge is taken with frequency corresponding to its
respective normalized weight (i.e., the normalized weights of the outgoing edges of
each node add up to 1). Therefore, the transition probabilities are proportional to
the respective weights. Observe that the statistically normalized weight matrix of
the network corresponds to a Markov chain, as the sum of any of its columns will be
equal to 1.
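A minimal sketch of such a Monte-Carlo simulation is given below. It draws each next node with probability proportional to the weight of the corresponding outgoing edge, using the sequence of Figure 10.3 with m = 1 as input. The function names, the fixed random seed, and the walk length are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def outgoing(weights):
    """Group edge weights by source node: {source: (targets, weights)}."""
    table = defaultdict(lambda: ([], []))
    for (source, target), w in weights.items():
        table[source][0].append(target)
        table[source][1].append(w)
    return dict(table)

def simulate(weights, start, length, rng):
    """Random walk in which each outgoing edge is chosen with probability
    weight / (sum of the outgoing weights of the current node)."""
    table = outgoing(weights)
    walk = [start]
    while len(walk) < length and walk[-1] in table:
        targets, w = table[walk[-1]]
        walk.append(rng.choices(targets, weights=w, k=1)[0])
    return walk

sequence = [13, 6, 7, 18, 15, 11, 11, 14, 20, 20, 1, 11, 5, 10, 15, 11, 14]
weights = Counter(zip(sequence, sequence[1:]))
walk = simulate(weights, start=13, length=12, rng=random.Random(0))
print(walk)
```

Because `random.choices` normalizes the supplied weights internally, the raw edge counts can be passed directly. The walk stops early only if it reaches a node without outgoing edges, which does not happen for this sequence.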
By thresholding the weight matrix for successive values of T (see Section 12.2.2),
it is possible to obtain a family of graphs that can be understood as follows. The
clusters defined for the highest values of T represent the kernels of the whole
weighted network, corresponding to the subsequence associations that are most
representative and most frequent along the whole protein. As the threshold is
lowered, these kernels are augmented by incorporation of new nodes and merging of
existing clusters. Such a threshold-based evolution of the graph can be related to
the evolutionary history of the protein formation, in the sense that the kernels would
have appeared first and served as organizing structures around which the rest of the
molecule evolved. At the same time, the strongest connections in the obtained
network also reflect the repetition of basic protein motifs, such as alpha helices and
beta sheets.
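The way kernels grow as the threshold is lowered can be sketched as follows. Two assumptions are made here and are not taken from the text: an edge is kept when its weight is at least T, and clusters are taken as connected components with edge direction ignored. The input is again the sequence of Figure 10.3 with m = 1.

```python
from collections import Counter

def threshold(weights, t):
    """Edges whose weight is at least t (assumed convention)."""
    return [edge for edge, w in weights.items() if w >= t]

def clusters(edges):
    """Connected components ignoring edge direction, largest first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=len, reverse=True)

sequence = [13, 6, 7, 18, 15, 11, 11, 14, 20, 20, 1, 11, 5, 10, 15, 11, 14]
weights = Counter(zip(sequence, sequence[1:]))
for t in (2, 1):
    print(t, clusters(threshold(weights, t)))
```

At T = 2 only the edges 15 -> 11 and 11 -> 14 survive, giving a single three-node kernel. Lowering the threshold to T = 1 merges all 11 nodes into one cluster.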

10.5 RESULTS
In the following investigations, we consider proteins from three animal species:
zebra fish, Xenopus (frog), and rat. The gene sequencing data were obtained from
the NIH Gene Collection repository (files dr_mgc_cds_aa.fasta, xl_mgc_cds_aa.fasta,
and rn_mgc_cds_aa.fasta). The raw data consisted of sequences of amino acids for
the 2948, 1977, and 640 proteins (each containing on average 400 amino acids) in
each of those files. The obtained results, which consider m = n = 2 and g = 0, are
presented for each species in the following subsections. The average node degree
was obtained by adding all columns of the adjacency matrix. The clustering
coefficient was obtained by identifying the n nodes connected to each node and
dividing the number of existing edges between those nodes by n(n – 1)/2, i.e., the
maximum number of edges between those nodes. The minimum distances were calculated
by using Dijkstra’s method [14].
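The clustering coefficient computation described above can be sketched as follows. The sketch assumes that neighbors are taken without regard to edge direction and that self-connections are ignored; the toy graph is hypothetical and serves only to exercise the function.

```python
from itertools import combinations

def clustering_coefficient(neighbours, node):
    """Fraction of the n(n - 1)/2 possible edges among the n neighbours of
    `node` that are actually present (0 when n < 2)."""
    around = neighbours[node] - {node}          # ignore self-connections
    n = len(around)
    if n < 2:
        return 0.0
    present = sum(1 for a, b in combinations(around, 2) if b in neighbours[a])
    return present / (n * (n - 1) / 2)

# Hypothetical toy graph: a triangle 1-2-3 plus a pendant node 4 attached to 3.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
neighbours = {}
for a, b in edges:
    neighbours.setdefault(a, set()).add(b)
    neighbours.setdefault(b, set()).add(a)

coefficients = {v: clustering_coefficient(neighbours, v) for v in neighbours}
average = sum(coefficients.values()) / len(coefficients)
print(coefficients, average)
```

Node 3 has three neighbors of which only one pair is connected, so its coefficient is 1/3, while nodes 1 and 2 lie in a closed triangle and obtain 1.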

10.5.1 ZEBRA FISH
The obtained 400 × 400 weight matrix (recall from the previous section that 400 =
20^m = 20^2) had a maximum value of 487, obtained for the transition from SS to SS,
and a minimum value of zero, obtained for 15,274 transitions. The maximum
weight for transitions between different nodes was 170, observed for the transition
from EE to ED. The performed measurements included the average node degree
(Figure 10.4(a)), clustering coefficient (Figure 10.5(a)), average length (Figure
10.6(a)), and maximum cluster size (Figure 10.7(a)) for the series of thresholded
matrices K_T (solid lines) and Q_T (dashed lines) obtained for T = 1, 2, …, 170.
We also calculated the indegree and outdegree densities, which are shown in
Figure 10.8(a) and Figure 10.8(b), respectively, for T = 0. It is clear from this figure
that both densities tend to be similar to one another, presenting a plateau for 6
< log(k) < 8.4 followed by a sharp decrease of node degree. The self-connections
between nodes representing subsequences of two identical amino acids, immediately
obtained from the diagonal of the respective adjacency matrices, are given in Table
10.2.
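Weighted indegrees, weighted outdegrees, and self-connections can all be read directly from the edge weights. The following sketch does so for the small m = 1 network of Figure 10.3 rather than for the 400 × 400 matrices discussed here, whose data are not reproduced; the variable names are chosen for illustration.

```python
from collections import Counter

sequence = [13, 6, 7, 18, 15, 11, 11, 14, 20, 20, 1, 11, 5, 10, 15, 11, 14]
weights = Counter(zip(sequence, sequence[1:]))

out_strength = Counter()   # sum of the weights of edges leaving each node
in_strength = Counter()    # sum of the weights of edges entering each node
for (source, target), w in weights.items():
    out_strength[source] += w
    in_strength[target] += w

# Diagonal entries of the weight matrix: a code followed by itself.
self_connections = {s: w for (s, t), w in weights.items() if s == t}

print(in_strength[11], out_strength[11], self_connections)
```

Node 11 receives and emits a total weight of 4 in this example, and the only self-connections are 11 -> 11 and 20 -> 20.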
The initial kernel was also identified for T = 95, with the obtained digraph shown
in Figure 10.9, where the edge widths correspond to the respective weights. Observe
that although the original graph was thresholded at T to obtain the kernel in Figure
10.9, the graph in that figure incorporates all edges, including those with weight
smaller than T, to provide a more comprehensive visualization of the obtained kernel.
This fully connected (except self-connections, which were not considered in this
case) digraph presents dominance of the E, D, and S amino acids, with strong
connections obtained for the node EE. The maximum weight was 170, obtained for
the transition from node EE to ED.

10.5.2 XENOPUS

The weight matrix had a maximum value of 293, obtained for the transition from
EE to EE, and a minimum value of zero was obtained for 22,787 transitions. The
maximum weight for transition between different nodes was 207, observed for the
transition from GP to LQ. The performed measurements included the average node
degree (Figure 10.4(b)), clustering coefficient (Figure 10.5(b)), average length (Figure 10.6(b)), and maximum cluster size (Figure 10.7(b)) for the series of thresholded
matrices K_T (solid lines) and Q_T (dashed lines) obtained for T = 1, 2, …, 170.
The indegree and outdegree densities are shown in Figure 10.10(a) and Figure
10.10(b), respectively, for T = 0. Both densities again tend to be similar to one
another, presenting a plateau for 6 < log(k) < 8 followed by a sharp decrease of
node degree. The self-connections between nodes representing subsequences of two
identical amino acids are given in Table 10.2. The initial kernel containing nine
nodes was identified for T = 64, with the obtained digraph shown in Figure 10.11,
which is dominated by the P and G amino acids.


FIGURE 10.4 The average node degree as a function of the weight threshold T (solid line
= K_T, dashed line = Q_T) for (a) zebra-fish data, (b) Xenopus, and (c) rat.

10.5.3 RAT
The weight matrix had a maximum value of 98, obtained for the transition from LL
to LL, and a minimum value of zero was obtained for 69,792 transitions. Such a
large number of null transitions is a consequence of the smaller number of proteins
available for this animal in the original data. The maximum weight for transition
between different nodes was 35, observed for the transition from LL to AA. The
performed measurements included the average node degree (Figure 10.4(c)),
clustering coefficient (Figure 10.5(c)), average length (Figure 10.6(c)), and maximum
cluster size (Figure 10.7(c)) for the series of thresholded matrices K_T (solid lines)
and Q_T (dashed lines) obtained for T = 1, 2, …, 35.

The indegree and outdegree densities are shown in Figure 10.12(a) and Figure
10.12(b), respectively, for T = 0. Both of the resulting node degrees were again
similar to one another, presenting a plateau for 4 < log(k) < 6 followed by a moderate
decrease of node degree. The self-connections between nodes representing subsequences of two identical amino acids are given in Table 10.2. The initial kernel was
also identified for T = 22, with the obtained digraph shown in Figure 10.13. The
dominant amino acids were L and A.

10.6 DISCUSSION
Despite the different numbers of proteins and overall amino acid sequence lengths
available for each of the three species, the clustering coefficient, average length, and
maximum cluster size are determined from the respective adjacency matrices (not the
weights), and therefore they are more significant statistically, so that we can attempt
a comparison between such measurements in the case of zebra fish and Xenopus.
It is clear from Figure 10.4 that, as expected, the average node degree ⟨k_T⟩ of
the graph K_T decreases monotonically with the threshold value T, while the opposite
happens for Q_T. The abrupt way in which the average node degree varies for the
thresholded and complementary matrices suggests that a kind of phase transition
(critical phenomenon) takes place as the values of T are increased.
As shown in Figure 10.5, the average clustering coefficient for K_T tends to
decrease steadily with the threshold values, undergoing a relatively abrupt transition
(near T = 20 for zebra fish), while the clustering coefficient of Q_T increases even
more abruptly near T = 10, suggesting a phase transition also for this measurement.
Generally, the local connectivity reaches less than 10% of its maximum value after
just one-third of the considered range of T, which suggests that the network
connectivity is dominated by stronger connections surrounded by much smaller
connection weights.


FIGURE 10.5 Average clustering coefficient as a function of the weight threshold T (solid
line = K_T, dashed line = Q_T) for (a) zebra fish, (b) Xenopus, and (c) rat data.

The average lengths of K_T shown in Figure 10.6 suffer from the typical problem
that such distances tend to fall as a consequence of the disappearance of connections.
In other words, because nonexistent edges are not considered for the average length
calculation, a network containing no connections has null average length, less than
for a fully connected network, for which the average length would be 1 (overlooking
self-connections). In any case, the average length presents a sharp discontinuity
(near T = 80 for zebra fish, 60 for Xenopus, and 20 for rat), possibly indicating that
a large number of edges are cut by thresholds larger than these values. At the same
time, the maximum average lengths in each case are similar and relatively small.
An abrupt increase of the average length is observed for Q_T for small values of T,
indicating that this matrix indeed undergoes an abrupt change in its connectivity for
small threshold values.
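The effect described above, in which removing connections lowers the average length, follows from averaging only over pairs of nodes that are actually connected by a path. The sketch below illustrates this convention on unweighted digraphs. It uses breadth-first search instead of Dijkstra’s method, which gives the same distances when every edge counts as one step; the function name and the toy graphs are assumptions made for illustration.

```python
from collections import deque

def average_length(successors):
    """Mean shortest-path length (in edges) over ordered pairs of distinct
    nodes joined by a directed path; 0 when no such pair exists."""
    total, count = 0, 0
    for source in successors:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in successors.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1            # reachable nodes other than source
    return total / count if count else 0.0

empty = {1: set(), 2: set(), 3: set()}
complete = {v: {u for u in range(4) if u != v} for v in range(4)}
path = {1: {2}, 2: {3}, 3: set()}
print(average_length(empty), average_length(complete), average_length(path))
```

The graph without edges yields 0 and the fully connected graph yields 1, matching the two limiting cases mentioned in the text.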
The graphs in Figure 10.7 show that the maximum cluster size for K_T decreases
steadily for higher threshold values, as expected. The maximum cluster size for Q_T
remained fixed at 400, confirming that the complementary matrix is highly
connected. As indicated in Figure 10.8, Figure 10.10, and Figure 10.12, the node degree
densities tend to present two distinct regions: one plateau portion at the left-hand
side, followed by an abrupt descending portion at the right-hand side of the graph.
While the indegree and outdegree densities also produced similar profiles for
the three species, the respective kernels identified at different threshold levels
(because of the different length of the amino acid sequences) were found to be rather
different, with distinct pairs of amino acids dominating each kernel. While such a
result may be strongly affected by the different amounts of data available for each
of the considered species, it may also suggest different fundamental structures for
the amino acid sequencing in those animals.

10.7 CONCLUDING REMARKS AND FUTURE WORK
This chapter has addressed the promising perspective of using modern complex-network
concepts and tools as a means of characterizing, modeling, and analyzing
biological sequences, with special attention given to amino acid sequences in proteins. After presenting a brief historic perspective of complex-network research and
some of its most representative applications to bioinformatics, the basic concepts of
complex networks and respective topological measurements were presented.
The problem of characterizing proteins in terms of weighted digraphs obtained
from consecutive (with possible overlap) subsequences of amino acids was addressed
next, with respect to a specific protein in zebra fish, Xenopus, and rat. This
investigation included the calculation of the average node degree, average clustering
coefficient, the average length (in number of edges), and the size of the maximum
cluster in the graph for a sequence of threshold values. The obtained curves were
found to provide interesting insights about the structure of the overall protein,
especially regarding the appearance of critical transitions of several of the considered
measurements as T was increased. In addition, kernels were identified for each case,
suggesting an interesting basic organization in the amino acid sequences. Despite
the different sizes of the amino acid sequences, which do imply problems of
statistical meaningfulness, some interesting trends have been identified regarding the
comparison of the measurements obtained for the three different species, especially
the general similarity between the topological properties for each species, even
though completely different kernels and dominant amino acids have been identified
for those cases.
Future extensions of this work include the consideration of other m, n, and g
configurations, the use of additional structural features such as betweenness
centrality as well as the ratios suggested in the literature [40, 41], and the
identification of the hierarchical backbone of the directed network, as suggested in
the literature [39]. It would also be possible to consider the progressive merging of
nodes and connected components into the initial kernel to obtain the hierarchical
structure underlying the growth of the kernel, with possible applications to the
complex problem of protein folding [42]. Finally, it would be interesting to use such
measurements to compare proteins (in terms of amino acids and bases) from the same
or distinct individuals, as well as to infer the phylogenetic evolution of the proteins.
In the case of DNA analysis, the obtained topological measurements can provide a
means for distinguishing between coding and noncoding regions.

FIGURE 10.6 Average length as a function of the weight threshold T (solid line = K_T, dashed
line = Q_T) for (a) zebra fish, (b) Xenopus, and (c) rat data.

TABLE 10.2
Self-Connections of Subsequences Composed of Two Identical Amino Acids

Subsequence   Number of Self-Connections
              Zebra fish   Xenopus   Rat
AA            274          126       27
RR            85           41        10
DD            216          186       11
NN            23           13        2
CC            14           3         3
EE            467          293       49
QQ            216          95        11
GG            310          79        21
HH            67           48        0
II            8            8         2
LL            161          126       98
KK            188          104       29
MM            0            4         0
FF            24           9         0
PP            299          176       71
SS            487          233       61
TT            52           16        6
WW            0            0         0
YY            6            2         0
VV            13           19        3

FIGURE 10.7 Maximum cluster size of K_T as a function of the weight threshold T for (a)
zebra fish, (b) Xenopus, and (c) rat data.

ACKNOWLEDGMENTS
The author is grateful to Fundação de Amparo à Pesquisa do Estado de São Paulo
— FAPESP (proc. 99/12765-2), Conselho Nacional de Desenvolvimento Científico
e Tecnológico — CNPq (proc. 308231/03-1), and the Human Frontier Science
Program for financial support.

FIGURE 10.8 Log-log plot of (a) indegree and (b) outdegree distributions for T = 0, weighted by
the intensity of the edges (zebra-fish data).

FIGURE 10.9 The ten-node kernel obtained for T = 95 for zebra-fish data. The weights are
represented in terms of the edge widths. The maximum and minimum weights are 170 and
zero, the latter corresponding to self-connections, as these have been excluded from the matrix
used to obtain this picture. Nodes shown: ED, DE, DD, EE, AE, EL, SE, EK, LE, SD.