Here 𝑁𝑢𝑚𝐶𝑜𝑛𝑠𝑡𝑟𝑎𝑖𝑛𝑒𝑑𝑃𝑎𝑡ℎ𝑠(𝑒, 𝑖, 𝑗) refers to the number of (global) shortest paths between 𝑖 and 𝑗 which pass through 𝑒, and 𝑁𝑢𝑚𝑆ℎ𝑜𝑟𝑡𝑃𝑎𝑡ℎ𝑠(𝑖, 𝑗) refers to the number of shortest paths between 𝑖 and 𝑗. Note that the value of 𝑁𝑢𝑚𝐶𝑜𝑛𝑠𝑡𝑟𝑎𝑖𝑛𝑒𝑑𝑃𝑎𝑡ℎ𝑠(𝑒, 𝑖, 𝑗) may be 0 if none of the shortest paths between 𝑖 and 𝑗 contain 𝑒. The algorithm ranks the edges in order of their betweenness and deletes the edge with the highest score. The betweenness coefficients are then recomputed, and the process is repeated. The connected components that remain after repeated deletion form the natural clusters. A variety of termination criteria (e.g., fixing the number of connected components) can be used in conjunction with the algorithm.
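The following is a minimal sketch of this betweenness-based divisive procedure, assuming the networkx library is available and using a target number of connected components as one of the possible termination criteria.

import networkx as nx

def betweenness_clustering(G, target_components):
    # Repeatedly remove the edge with the highest betweenness centrality
    # until the graph splits into the desired number of connected components.
    G = G.copy()
    while (nx.number_connected_components(G) < target_components
           and G.number_of_edges() > 0):
        # Recompute edge betweenness after every deletion.
        betweenness = nx.edge_betweenness_centrality(G)
        worst_edge = max(betweenness, key=betweenness.get)
        G.remove_edge(*worst_edge)
    return list(nx.connected_components(G))

The exact betweenness computation shown here is the expensive step discussed next; the sampling-based estimation of [36] can be substituted for it.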
A key issue is the efficient determination of edge-betweenness centrality. The number of paths between any pair of nodes can be exponentially large, so the computation of the betweenness measure would seem to be a key bottleneck. It has been shown in [36] that the network structure index can also be used to estimate edge-betweenness centrality effectively by pairwise node sampling.
2.5 The Spectral Clustering Method
Eigenvector techniques are often used with multi-dimensional data in order to determine the underlying correlation structure in the data. It is natural to ask whether such techniques can also be used for the more general case of graph data. It turns out that this is indeed possible with the use of a method called spectral clustering.
In the spectral clustering method, we make use of the node-node adjacency matrix of the graph. For a graph containing 𝑛 nodes, let us assume that we have an 𝑛 × 𝑛 adjacency matrix, in which the entry (𝑖, 𝑗) corresponds to the weight of the edge between the nodes 𝑖 and 𝑗. This essentially corresponds to the similarity between nodes 𝑖 and 𝑗. This entry is denoted by 𝑤_𝑖𝑗, and the corresponding matrix is denoted by 𝑊. This matrix is assumed to be symmetric, since we are working with undirected graphs. Therefore, we assume that 𝑤_𝑖𝑗 = 𝑤_𝑗𝑖 for any pair (𝑖, 𝑗). All diagonal entries of the matrix 𝑊 are assumed to be 0. As discussed earlier, the aim of any node partitioning algorithm is to minimize (a function of) the weights across the partitions. The spectral clustering method constructs this minimization function in terms of the matrix structure of the adjacency matrix and another matrix which is referred to as the degree matrix.
The degree matrix 𝐷 is simply a diagonal matrix, in which all entries are zero except for the diagonal values. The diagonal entry 𝑑_𝑖𝑖 is equal to the sum of the weights of the incident edges. In other words, the entry 𝑑_𝑖𝑗 is defined as follows:

d_{ij} = \begin{cases} \sum_{k=1}^{n} w_{ik} & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
We formally define the Laplacian Matrix as follows:

Definition 9.2. (Laplacian Matrix) The Laplacian Matrix 𝐿 is defined by subtracting the weighted adjacency matrix from the degree matrix. In other words, we have:

𝐿 = 𝐷 − 𝑊     (9.4)
This matrix encodes the structural behavior of the graph effectively, and its eigenvector behavior can be used in order to determine the important clusters in the underlying graph structure. It can be shown that the Laplacian matrix 𝐿 is positive semi-definite, i.e., for any 𝑛-dimensional row vector 𝑓 = [𝑓_1 . . . 𝑓_𝑛] we have 𝑓 ⋅ 𝐿 ⋅ 𝑓^𝑇 ≥ 0. This can be easily shown by expressing 𝐿 in terms of its constituent entries, which are a function of the corresponding weights 𝑤_𝑖𝑗. Upon expansion, it can be shown that:

f \cdot L \cdot f^T = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (f_i - f_j)^2     (9.5)
We summarize as follows.
Lemma 9.3. The Laplacian matrix 𝐿 is positive semi-definite. Specifically, for any 𝑛-dimensional row vector 𝑓 = [𝑓_1 . . . 𝑓_𝑛], we have:

f \cdot L \cdot f^T = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (f_i - f_j)^2
At this point, let us examine some interpretations of the vector 𝑓 in terms of the underlying graph partitioning. Let us consider the case in which each 𝑓_𝑖 is drawn from the set {0, 1}, and this determines a two-way partition by labeling each node either 0 or 1. The particular partition to which the node 𝑖 belongs is defined by the corresponding label. Note that the expansion of the expression 𝑓 ⋅ 𝐿 ⋅ 𝑓^𝑇 from Lemma 9.3 simply represents the sum of the weights of the edges across the partition defined by 𝑓. Thus, the determination of an appropriate value of 𝑓 for which the function 𝑓 ⋅ 𝐿 ⋅ 𝑓^𝑇 is minimized also provides us with a good node partitioning. Unfortunately, it is not easy to determine the discrete values of 𝑓 which determine this optimum partitioning. Nevertheless, we will see later in this section that even when we restrict 𝑓 to real values, this provides us with the intuition necessary to create an effective partitioning.
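As a quick illustration of this interpretation, the following small example (a sketch using numpy and an arbitrary toy graph, not taken from the chapter) verifies that the quadratic form of Equation 9.5 equals the weight of the edges crossing a binary partition.

import numpy as np

# Toy 4-node weighted graph (symmetric adjacency matrix with zero diagonal).
W = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # Laplacian (Eq. 9.4)

# Binary indicator vector: nodes {0, 1} in one partition, {2, 3} in the other.
f = np.array([0.0, 0.0, 1.0, 1.0])

# f L f^T equals the total weight of edges crossing the partition (1 + 1 = 2).
print(f @ L @ f)   # -> 2.0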
An immediate observation is that the indicator vector 𝑓 = [1 . . . 1] is an eigenvector with a corresponding eigenvalue of 0. We note that 𝑓 = [1 . . . 1] must be an eigenvector, since 𝐿 is positive semi-definite and 𝑓 ⋅ 𝐿 ⋅ 𝑓^𝑇 can be 0 only for eigenvectors with 0 eigenvalues. This observation can be generalized
further in order to determine the number of connected components in the graph.
We make the following observation.
Lemma 9.4. The number of (linearly independent) eigenvectors with zero eigenvalues for the Laplacian matrix 𝐿 is equal to the number of connected components in the underlying graph.
Proof: Without loss of generality, we can order the vertices according to the connected component to which they belong. In this case, the Laplacian matrix takes on the following block form, which is illustrated below for the case of three connected components.

L = \begin{pmatrix} L_1 & 0 & 0 \\ 0 & L_2 & 0 \\ 0 & 0 & L_3 \end{pmatrix}

Each of the blocks 𝐿_1, 𝐿_2 and 𝐿_3 is itself the Laplacian of the corresponding component. Therefore, the indicator vector of each component is an eigenvector with corresponding eigenvalue 0. The result follows. □
We observe that connected components are the most obvious examples of clusters in the graph. Therefore, the determination of eigenvectors corresponding to zero eigenvalues provides us with information about this (relatively rudimentary) set of clusters. Broadly speaking, it may not be possible to glean such clean membership behavior from the other eigenvectors. One of the problems is that, other than this particular rudimentary set of eigenvectors (which correspond to the connected components), the components of the other eigenvectors are drawn from the real domain rather than the discrete {0, 1} domain. Nevertheless, because of the natural interpretation of 𝑓 ⋅ 𝐿 ⋅ 𝑓^𝑇 in terms of the weights of the edges across nodes with very different values of 𝑓_𝑖, it is natural to cluster together nodes for which the values of 𝑓_𝑖 are, on the average, as similar as possible across any particular eigenvector. This provides us with the intuition necessary to define an effective spectral clustering algorithm, which partitions the data set into 𝑘 clusters for any arbitrary value of 𝑘. The algorithm is as follows:
Determine the 𝑘 eigenvectors with the smallest eigenvalues. Note that each eigenvector has as many components as the number of nodes. Let the component of the 𝑗th eigenvector for the 𝑖th node be denoted by 𝑝_𝑖𝑗.

Create a new data set with as many records as the number of nodes. The 𝑖th record in this data set corresponds to the 𝑖th node, and has 𝑘 components. The record for this node is simply the eigenvector components for that node, which are denoted by 𝑝_𝑖1 . . . 𝑝_𝑖𝑘.
Since we would like to cluster nodes with similar eigenvector components, we use any conventional clustering algorithm (e.g., 𝑘-means) in order to create 𝑘 clusters from this data set. Note that the main focus of the approach is to transform the structural clustering problem into a more conventional multi-dimensional clustering problem, which is easy to solve. The particular choice of the multi-dimensional clustering algorithm is orthogonal to the broad spectral approach.
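A minimal sketch of this three-step pipeline is shown below; it assumes that numpy, scipy, and scikit-learn are available, and it uses the unnormalized Laplacian of Equation 9.4 together with 𝑘-means as the conventional clustering step.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    # W: symmetric n x n adjacency/similarity matrix with zero diagonal.
    D = np.diag(W.sum(axis=1))                  # degree matrix
    L = D - W                                   # unnormalized Laplacian (Eq. 9.4)
    # Eigenvectors corresponding to the k smallest eigenvalues of L.
    vals, vecs = eigh(L, subset_by_index=[0, k - 1])
    # Row i of vecs is the k-dimensional record [p_i1 ... p_ik] for node i.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(vecs)
    return labels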
The above algorithm provides a broad framework for the spectral clustering al-
gorithm. The input parameter for the above algorithm is the number of clusters
𝑘. In practice, a number of variations are possible in order to tune the quality
of the clusters which are found. Some examples are as follows:
It is not necessary to use the same number of eigenvectors as the input
parameter for the number of clusters. In general, one should use at least
as many eigenvectors as the number of clusters to be created. However,
the exact number of eigenvectors to be used in order to get the optimum
results may vary with the particular data set. This can be known only
with experimentation.
There are other ways of creating normalized Laplacian matrices which can provide more effective results in some situations. Some classic examples of such Laplacian matrices in terms of the adjacency matrix 𝑊, the degree matrix 𝐷 and the identity matrix 𝐼 are defined as follows:

L_A = I - D^{-1/2} \cdot W \cdot D^{-1/2}
L_B = I - D^{-1} \cdot W
More details on the different methods which can be used for effective spectral
graph clustering may be found in [9].
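A short sketch of how these two normalized variants can be computed is given below (assuming numpy and strictly positive node degrees); either matrix can be substituted for 𝐿 in the pipeline sketched earlier.

import numpy as np

def normalized_laplacians(W):
    # W: symmetric adjacency matrix; every node is assumed to have nonzero degree.
    d = W.sum(axis=1)
    I = np.eye(W.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    D_inv = np.diag(1.0 / d)
    L_A = I - D_inv_sqrt @ W @ D_inv_sqrt   # symmetric normalized Laplacian
    L_B = I - D_inv @ W                     # random-walk normalized Laplacian
    return L_A, L_B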
2.6 Determining Quasi-Cliques
A different way of determining massive graphs in the underlying data is
that of determining quasi-cliques. This technique is different from many other
partitioning algorithms, in that it focuses on definitions which maximize edge
densities within a partition, rather than minimizing edge densities across par-
titions. A clique is a graph in which every pair of nodes are connected by an
edge. A quasi-clique is a relaxation on this concept, and is defined by im-
posing a lower bound on the degree of each vertex in the given set of nodes.
Specifically, we define a 𝛾-quasiclique is as follows:
Definition 9.5. A 𝑘-graph (𝑘 ≥ 1) 𝐺 is a 𝛾-quasiclique if the degree of each
node in the corresponding sub-graph of vertices is at least 𝛾 ⋅ 𝑘.
The value of 𝛾 always lies in the range (0, 1]. We note that by choosing 𝛾 = 1, this definition reverts to that of standard cliques. Choosing lower values of 𝛾 allows for relaxations which are more realistic in the case of real applications. This is because we rarely encounter complete cliques in real applications, and at least some edges within a dense subgraph would typically be missing. A vertex is said to be critical if its degree in the corresponding subgraph is the smallest integer which is at least equal to 𝛾 ⋅ 𝑘.
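As a small illustration of Definition 9.5, the following sketch (assuming the networkx library) checks whether a candidate node set induces a 𝛾-quasiclique.

import networkx as nx

def is_gamma_quasiclique(G, nodes, gamma):
    # Every node of the induced subgraph must have degree at least gamma * k,
    # where k is the number of nodes in the candidate set (Definition 9.5).
    k = len(nodes)
    H = G.subgraph(nodes)
    return all(H.degree(v) >= gamma * k for v in nodes)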
The earliest piece of work on this problem is from [1]. The work in [1] uses a greedy randomized adaptive search algorithm, GRASP, to find a quasi-clique of maximum size. A closely related problem is that of finding frequently occurring cliques in multiple data sets. In other words, when multiple graphs are obtained from different data sets, some dense subgraphs occur frequently together in the different data sets. Such graphs help in determining important dense patterns of behavior in different data sources. Such techniques find applicability in mining important patterns in graphical representations of customers. The techniques are also helpful in mining cross-graph quasi-cliques in gene expression data. A description of the application of the
technique to the problem of gene-expression data may be found in [33]. An
efficient algorithm for determining cross graph quasi-cliques was proposed in
[32]. The main restriction of the work proposed in [32] is that the support
threshold for the algorithms is assumed to be 100%. This restriction has been
relaxed in subsequent work [43]. The work in [43] examines the problem of
mining frequent closed quasi-cliques from a graph database with arbitrary sup-
port thresholds. In [31], a multi-graph version of the quasi-clique problem was explored. However, instead of finding the complete set of quasi-cliques in the graph, the authors proposed an approximation algorithm to cover all the vertices in
the graph with a minimum number of 𝑝-quasi-complete subgraphs. Thus, this
technique is more suited for summarization of the overall graph with a smaller
number of densely connected subgraphs.
2.7 The Case of Massive Graphs
A closely related problem is that of dense subgraph determination in mas-
sive graphs. This problem is frequently encountered in large graph data sets.
For example, the problem of determining large subgraphs of web graphs was
studied in [5, 22]. A min-hash approach was first used in [5] in order to deter-
mine syntactically related clusters. This paper also introduces the advantages
of using a min-hash approach in the context of graph clustering. Subsequently,
the approach was generalized to the case of large dense graphs with the use of
recursive application of the basic min-hash algorithm.
The broad idea in the min-hash approach is to represent the outlinks of a
particular node as sets. Two nodes are considered similar, if they share many
outlinks. Thus, consider a node 𝐴 with an outlink set 𝑆_𝐴 and a node 𝐵 with outlink set 𝑆_𝐵. Then the similarity between the two nodes is defined by the Jaccard coefficient, which is defined as ∣𝑆_𝐴 ∩ 𝑆_𝐵∣ / ∣𝑆_𝐴 ∪ 𝑆_𝐵∣. We note that explicit enumeration of all the edges in order to compute this can be computationally inefficient.
Rather, a min-hash approach is used in order to perform the estimation. This
min-hash approach is as follows. We sort the universe of nodes in a random order. For any set of nodes in this random sorted order, we determine the first node 𝐹𝑖𝑟𝑠𝑡(𝐴) for which an outlink exists from 𝐴 to 𝐹𝑖𝑟𝑠𝑡(𝐴). We also determine the first node 𝐹𝑖𝑟𝑠𝑡(𝐵) for which an outlink exists from 𝐵 to 𝐹𝑖𝑟𝑠𝑡(𝐵). It can be shown that the probability that 𝐹𝑖𝑟𝑠𝑡(𝐴) and 𝐹𝑖𝑟𝑠𝑡(𝐵) are the same node is equal to the Jaccard coefficient, so each permutation provides an unbiased estimate of it. By repeating this process over different permutations over the universe of nodes, it is possible to accurately estimate the Jaccard coefficient. This is done by using a constant number of permutations 𝑐 of the node order. The actual permutations are implemented by associating 𝑐 different randomized hash values with each node. This creates 𝑐 sets of hash values of size 𝑛. The sort-order for any particular set of hash-values defines the corresponding permutation order. For each such permutation, we store the minimum node index of the outlink set. Thus, for each node, there are 𝑐 such minimum indices. This means that, for each node, a
fingerprint of size 𝑐 can be constructed. By comparing the fingerprints of two nodes, the Jaccard coefficient can be estimated. This approach can be further generalized with the use of every 𝑠-element set contained entirely within 𝑆_𝐴 and 𝑆_𝐵. Thus, the above description is the special case in which 𝑠 is set to 1. By using different values of 𝑠 and 𝑐, it is possible to design an algorithm which distinguishes between two sets that are above or below a certain threshold of similarity.
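The following is a minimal sketch of this fingerprinting scheme for the case 𝑠 = 1, assuming integer node identifiers and non-empty outlink sets; the particular hash-function family is an illustrative choice that simulates the random permutations, not the one used in [5, 22].

import random

def minhash_signature(outlinks, c, prime=2_147_483_647, seed=42):
    # outlinks: non-empty iterable of integer node ids that this node links to.
    # c: number of simulated random permutations of the node universe.
    rng = random.Random(seed)
    # Each (a, b) pair defines one pseudo-random permutation of node ids.
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(c)]
    return [min((a * v + b) % prime for v in outlinks) for a, b in coeffs]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of permutations on which the minima agree estimates
    # |S_A intersect S_B| / |S_A union S_B|.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)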
The overall technique in [22] first generates a set of 𝑐 shingles of size 𝑠
for each node. The process of generating the 𝑐 shingles is extremely straight-
forward. Each node is processed independently. We use the min-wise hash
function approach in order to generate subsets of size 𝑠 from the outlinks at
each node. This results in 𝑐 subsets for each node. Thus, for each node, we have a set of 𝑐 shingles. If the graph contains a total of 𝑛 nodes, the total size of this shingle fingerprint is 𝑛 × 𝑐 × 𝑠𝑝, where 𝑠𝑝 is the space required for each shingle. Typically, 𝑠𝑝 will be 𝑂(𝑠), since each shingle contains 𝑠 nodes.
For each distinct shingle thus created, we can create a list of nodes which
contain it. In general, we would like to determine groups of shingles which
contain a large number of common nodes. In order to do so, the method in
[22] performs a second-order shingling in which the meta-shingles are created
from the shingles. Thus, this further compresses the graph in a data structure
of size 𝑐 × 𝑐. This is essentially a constant size data structure. We note that this group of meta-shingles has the property that they contain a large number of common nodes. The dense subgraphs can then be extracted from these meta-shingles. More details on this approach may be found in [22].
The min-hash approach is frequently used for graphs which are extremely
large and cannot be easily processed by conventional quasi-clique mining algo-
rithms. Since the min-hash approach summarizes the massive graph in a small
amount of space, it is particularly useful in leveraging the small space representation for a variety of query-processing techniques. Examples of such applica-
tions include the web graph and social networks. In the case of web graphs, we
desire to determine closely connected clusters of web pages with similar con-
tent. The related problem in social networks is that of finding closely related
communities. The min-hash approach discussed in [5, 22] precisely helps us
achieve this goal, because we can process the summarized min-hash structure
in a variety of ways in order to extract the important communities from the
summarized structure. More details of this approach may be found in [5, 22].
3. Clustering Graphs as Objects
In this section, we will discuss the problem of clustering entire graphs in
a multi-graph database, rather than examining the node clustering problem
within a single graph. Such situations are often encountered in the context of
XML data, since each XML document can be regarded as a structural record,
and it may be necessary to create clusters from a large number of such objects.
We note that XML data is quite similar to graph data in terms of how the data
is organized structurally. The attribute values can be treated as graph labels
and the corresponding semi-structural relationships as the edges. It has been shown in [2, 10, 28, 29] that this structural behavior can be leveraged in order to create effective clusters.
3.1 Extending Classical Algorithms to Structural Data
Since we are examining entire graphs in this version of the clustering problem, the problem simply boils down to that of clustering arbitrary objects, where the objects in this case have structural characteristics. Many of the conventional algorithms discussed in [24] (such as 𝑘-means type partitional algorithms and hierarchical algorithms) can be extended to the case of graph data. The main changes required in order to extend these algorithms are as follows:
Most of the underlying classical algorithms typically use some form of
distance function in order to measure similarity. Therefore, we need
appropriate measures in order to define similarity (or distances) between structural objects.
Many of the classical algorithms (such as 𝑘-means) use representative
objects such as centroids in critical intermediate steps. While this is
straightforward in the case of multi-dimensional objects, it is much more
challenging in the case of graph objects. Therefore, appropriate meth-
ods need to be designed in order to create representative objects. Fur-
thermore, in some cases it may be difficult to create representatives in
terms of single objects. We will see that it is often more robust to use representative summaries of the underlying objects.
There are two main classes of conventional techniques, which have been
extended to the case of structural objects. These techniques are as follows:
Structural Distance-based Approach: This approach computes struc-
tural distances between documents and uses them in order to compute
clusters of documents. One of the earliest works on clustering tree structured data is the XClust algorithm [28], which was designed to cluster XML schemas for the efficient integration of large numbers of Document Type Definitions (DTDs) of XML sources. It adopts the agglom-
erative hierarchical clustering method which starts with clusters of single
DTDs and gradually merges the two most similar clusters into one larger
cluster. The similarity between two DTDs is based on their element sim-
ilarity, which can be computed according to the semantics, structure, and
context information of the elements in the corresponding DTDs. One of
the shortcomings of the XClust algorithm is that it does not make full
use of the structure information of the DTDs, which is quite important
in the context of clustering tree-like structures. The method in [7] com-
putes similarity measures based on the structural edit-distance between
documents. This edit-distance is used in order to compute the distances
between clusters of documents.
S-GRACE is a hierarchical clustering algorithm [29]. In [29], an XML document is converted to a structure graph (or s-graph), and the distance between two XML documents is defined according to the number of common element-subelement relationships, which can capture structural similarity better than the tree edit distance in some cases [29].
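A hedged sketch of this idea is given below: each s-graph is represented as a set of (element, subelement) edges, and documents sharing more such relationships are considered closer; the exact normalization used by S-GRACE in [29] may differ from the illustrative one shown here.

def sgraph_distance(sg_a, sg_b):
    # sg_a, sg_b: sets of (element, subelement) pairs extracted from two documents.
    common = len(sg_a & sg_b)
    return 1.0 - common / max(len(sg_a), len(sg_b), 1)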
Structural Summary Based Approach: In many cases, it is possible
to create summaries from the underlying documents. These summaries
are used for creating groups of documents which are similar to these
summaries. The first summary-based approach for clustering XML doc-
uments was presented in [10]. In [10], the XML documents are modeled
as rooted ordered labeled trees. A framework for clustering XML docu-
ments by using structural summaries of trees is presented. The aim is to
improve algorithmic efficiency without compromising cluster quality.
A second approach for clustering XML documents is presented in [2].
This technique is a partition-based algorithm. The primary idea in this
approach is to use frequent-pattern mining algorithms in order to deter-
mine the summaries of frequent structures in the data. The technique
uses a 𝑘-means type approach in which each cluster center comprises
a set of frequent patterns which are local to the partition for that clus-
ter. The frequent patterns are mined using the documents assigned to a
cluster center in the last iteration. The documents are then further re-
assigned to a cluster center based on the average similarity between the
document and the newly created cluster centers from the local frequent
patterns. In each iteration, the document assignments and the mined frequent patterns are iteratively refined, until the cluster centers and document partitions converge to a final state. It has been shown in [2] that such a structural summary based approach is significantly superior to a similarity function based approach such as the one presented in [7]. The method is also superior to the structural approach in [10] because of its use of more robust representations of the underlying structural summaries.
Since the most recent algorithm is the structural summary method discussed in
[2], we will discuss this in more detail in the next section.
3.2 The XProj Approach
In this section, we will present XProj, which is a summary-based approach
for clustering of XML documents. The pseudo-code for clustering of XML
documents is illustrated in Figure 9.1. The primary approach is to use a sub-
structural modification of a partition based approach in which the clusters of
documents are built around groups of representative sub-structures. Thus, instead of the single representative object used in a partition-based algorithm, we use a sub-structural set representative for the structural clustering algorithm. Initially, the document set 𝒟 is randomly divided into 𝑘 partitions of equal size, and
the sets of sub-structure representatives are generated by mining frequent sub-
structures of size 𝑙 from these partitions. In each iteration, the sub-structural
representatives (of a particular size, and a particular support level) of a given
partition are the frequent structures from that partition. These structural rep-
resentatives are used to partition the document collection and vice-versa. We
note that this can be a potentially expensive operation because of the deter-
mination of frequent substructures; in the next section, we will illustrate an
interesting way to speed it up. In order to actually partition the document col-
lection, we calculate the number of nodes in a document which are covered
by each sub-structural set representative. A larger coverage corresponds to
a greater level of similarity. The aim of this approach is that the algorithm
will determine the most important localized sub-structures over time.
Algorithm XProj(Document Set: 𝒟, Minimum Support: 𝑚𝑖𝑛_𝑠𝑢𝑝,
                Structural Size: 𝑙, NumClusters: 𝑘)
begin
  Initialize representative sets 𝒮_1 . . . 𝒮_𝑘;
  while (convergence criterion = false)
  begin
    Assign each document 𝐷 ∈ 𝒟 to one of the sets in {𝒮_1 . . . 𝒮_𝑘}
      using the coverage-based similarity criterion;
    /* Let the corresponding document partitions be denoted by ℳ_1 . . . ℳ_𝑘; */
    Compute the frequent substructures of size 𝑙 from each set ℳ_𝑖
      using the sequential transformation paradigm;
    if (∣ℳ_𝑖∣ × 𝑚𝑖𝑛_𝑠𝑢𝑝) ≥ 1
      set 𝒮_𝑖 to the frequent substructures of size 𝑙 from ℳ_𝑖;
    /* If (∣ℳ_𝑖∣ × 𝑚𝑖𝑛_𝑠𝑢𝑝) < 1, 𝒮_𝑖 remains unchanged; */
  end;
end

Figure 9.1. The Sub-structural Clustering Algorithm (High Level Description)
This is analogous to the projected clustering approach which determines the most
important localized projections over time. Once the partitions have been com-
puted, we use them to re-compute the representative sets. These re-computed
representative sets are defined as the frequent sub-structures of size 𝑙 from
each partition. Thus, the representative set 𝑆_𝑖 is defined as the substructural set from the partition ℳ_𝑖 which has size 𝑙, and which has absolute support no less than (∣ℳ_𝑖∣ × 𝑚𝑖𝑛_𝑠𝑢𝑝). Thus, the newly defined representative set 𝑆_𝑖 also corresponds to the local structures which are defined from the partition ℳ_𝑖. Note that if the partition ℳ_𝑖 contains too few documents such that (∣ℳ_𝑖∣ × 𝑚𝑖𝑛_𝑠𝑢𝑝) < 1, the representative set 𝑆_𝑖 remains unchanged.
Another interesting observation is that the similarity function between a
document and a given representative set is defined by the number of nodes
in the document which are covered by that set. This makes the similarity func-
tion more sensitive to the underlying projections in the document structures.
This leads to more robust similarity calculations in most circumstances.
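A hypothetical sketch of this coverage-based assignment step is shown below; the representation of documents and substructures as sets of node labels, and the helper names, are illustrative assumptions rather than the exact data structures used by XProj.

def coverage_similarity(doc_nodes, representative_set):
    # doc_nodes: set of node labels in the document's tree (assumed representation).
    # representative_set: collection of frequent substructures, each a set of node labels.
    covered = set()
    for substructure in representative_set:
        covered |= (doc_nodes & substructure)
    return len(covered)   # larger coverage means greater similarity

def assign_document(doc_nodes, representative_sets):
    # Assign the document to the representative set with the largest coverage.
    return max(range(len(representative_sets)),
               key=lambda i: coverage_similarity(doc_nodes, representative_sets[i]))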
In order to ensure termination, we need to design a convergence criterion.
One useful criterion is based on the increase of the average sub-structural
self-similarity over the 𝑘 partitions of documents. Let the partitions of documents with respect to the current iteration be ℳ_1 . . . ℳ_𝑘, and their corresponding frequent sub-structures of size 𝑙 be 𝒮_1 . . . 𝒮_𝑘 respectively. Then, the average sub-structural self-similarity at the end of the current iteration
