Chapter 17
SOFTWARE-BUG LOCALIZATION WITH
GRAPH MINING
Frank Eichinger
Institute for Program Structures and Data Organization (IPD)
Universität Karlsruhe (TH), Germany

Klemens Böhm
Institute for Program Structures and Data Organization (IPD)
Universität Karlsruhe (TH), Germany


Abstract: In the recent past, a number of frequent subgraph mining algorithms
have been proposed. They allow for analyses in domains where data is naturally
graph-structured. However, because of scalability problems when dealing with
large graphs, the application of graph mining has been limited to only a few
domains. In software engineering, debugging is an important issue. Localizing
bugs automatically is particularly challenging, yet worthwhile, as localizing
them manually is expensive. Several approaches have been investigated, some of
which analyze traces of repeated program executions. These traces can be
represented as call graphs. Such graphs describe the invocations of methods
during an execution. This chapter is a survey of graph mining approaches for
bug localization based on the analysis of dynamic call graphs. In particular,
this chapter first introduces the subproblem of reducing the size of call
graphs, before the different approaches to localize bugs based on such reduced
graphs are discussed. Finally, we compare selected techniques experimentally
and provide an outlook on future issues.
Keywords: Software Bug Localization, Program Call Graphs
C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data,
Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_17,
© Springer Science+Business Media, LLC 2010
1. Introduction
Software quality is a huge concern in industry. Almost any software contains
at least some minor bugs after being released. In order to avoid bugs, which
incur significant costs, it is important to find and fix them before the
release. In general, this results in devoting more resources to quality
assurance. Software developers usually try to find and fix bugs by means of
in-depth code reviews, along with testing and classical debugging. Locating
bugs is considered to be the most time-consuming and challenging activity in
this context [6, 20, 24, 26], where the resources available are limited.
Therefore, there is a need for semi-automated techniques guiding the debugging
process [34]. If a developer obtains some hints about where bugs might be
located, debugging becomes more efficient.
Research in the field of software reliability has been extensive, and various
techniques have been developed addressing the identification of defect-prone
parts of software. This interest is not limited to software-engineering
research. In the machine-learning community, automated debugging is considered
to be one of the ten most challenging problems for the coming years [11]. So
far, no bug localization technique is perfect in the sense that it is capable
of discovering every kind of bug. In this chapter, we look at a relatively new
class of bug localization techniques, the analysis of call graphs with
graph-mining techniques. It can be seen as an approach orthogonal to and
complementing existing techniques.
Graph mining, or more specifically frequent subgraph mining, is a relatively
young discipline in data mining. As described in the other chapters of this
book, there are many different techniques as well as numerous applications for
graph mining. Probably the most prominent application is the analysis of
chemical molecules. As the NP-complete problem of subgraph isomorphism [16] is
an inherent part of frequent subgraph mining algorithms, the analysis of
molecules benefits from the relatively small size of most of them. In contrast
to molecular data, software-engineering artifacts are typically mapped to
graphs that are much larger. Consequently, common graph-mining algorithms do
not scale for these graphs. In order to make use of call graphs, which reflect
the invocation structure of specific program executions, it is key to deploy a
suitable call-graph-reduction technique. Such techniques help to alleviate the
scalability problems to some extent and allow graph-mining algorithms to be
used in a number of cases. As we will demonstrate, such approaches work well in
certain cases, but some challenges remain. Besides scalability issues that are
still unsolved, some call-graph-reduction techniques lead to another challenge:
They introduce edge weights representing call frequencies. As graph-mining
research has concentrated on structural and categorical domains, rather than on
quantitative weights, we are not aware of any algorithm specialized in mining
weighted graphs. Though this chapter presents a technique to analyze graphs
with weighted edges, the technique is a composition of established algorithms
rather than a universal weighted graph mining algorithm. Thus, besides mining
large graphs, weighted graph mining is a further challenge for graph-mining
research driven by the field of software engineering.
The remainder of this chapter is structured as follows: Section 2 introduces
some basic principles of call graphs, bugs, graph mining and bug localization
with such graphs. Section 3 gives an overview of related work in software
engineering employing data-analysis techniques. Section 4 discusses different
call-graph-reduction techniques. The different bug-localization approaches are
presented and compared in Section 5, and Section 6 concludes.
2. Basics of Call Graph Based Bug Localization
Subsection 2.1 introduces the concept of dynamic call graphs, Subsection 2.2
presents some classes of bugs, and Subsection 2.3 explains how bug localization
with call graphs works in principle. Subsection 2.4 gives a brief overview of
key aspects of graph and tree mining in the context of this chapter.
2.1 Dynamic Call Graphs
Call graphs are either static or dynamic [17]. A static call graph [1] can be
obtained from the source code. It represents all methods¹ of a program as
nodes and all possible method invocations as edges. Dynamic call graphs are of
importance in this chapter. They represent an execution of a particular program
and reflect the actual invocation structure of the execution. Without any
further treatment, a dynamic call graph is a rooted ordered tree. The
main-method of a program usually is the root, and the methods invoked directly
are its children. Figure 17.1a is an abstract example of such a call graph
where the root Node 𝑎 represents the main-method.
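
The rooted ordered tree described above can be illustrated with a minimal
sketch. Everything in it is an assumption made for illustration: the
`CallTracer` helper, the decorator-based instrumentation, and the toy methods
`a`, `b` and `c` are hypothetical and not taken from the chapter. A real setup
would more likely rely on a tracing or profiling facility of the runtime.

```python
from dataclasses import dataclass, field
from functools import wraps


@dataclass
class CallNode:
    """One method invocation; children are kept in call order."""
    method: str
    children: list = field(default_factory=list)


class CallTracer:
    """Hypothetical helper that records a dynamic call graph as a tree."""

    def __init__(self):
        self.root = None
        self._stack = []  # invocations that are currently active

    def traced(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            node = CallNode(func.__name__)
            if self._stack:
                # The caller is the invocation on top of the stack.
                self._stack[-1].children.append(node)
            else:
                # The first traced invocation becomes the root.
                self.root = node
            self._stack.append(node)
            try:
                return func(*args, **kwargs)
            finally:
                self._stack.pop()
        return wrapper


def count_edges(node):
    """Number of edges (invocations below the root) in the call tree."""
    return sum(1 + count_edges(child) for child in node.children)


tracer = CallTracer()


@tracer.traced
def b():
    return 1


@tracer.traced
def c():
    total = 0
    for _ in range(3):  # every loop iteration adds one more edge c -> b
        total += b()
    return total


@tracer.traced
def a():  # plays the role of the main-method
    b()
    c()


a()
print(tracer.root.method, [child.method for child in tracer.root.children])
print(count_edges(tracer.root))
```

Because every invocation becomes its own node, raising the loop bound in `c`
increases the number of edges by the same amount.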
Unreduced call graphs typically become very large. The reason is that, in
modern software development, every single piece of functionality is typically
encapsulated in a dedicated method. These methods call each other frequently.
Furthermore, iterative programming is very common, and methods calling other
methods occur within loops, executed thousands of times. Therefore, the
execution of even a small program lasting some seconds often results in call
graphs consisting of millions of edges.
¹ In this chapter, we use method interchangeably with function.

The size of call graphs prohibits straightforward mining with state-of-the-art
graph-mining algorithms. Hence, a reduction of the graphs which compresses them
significantly but keeps the essential properties of an individual execution is
necessary. Section 4 describes different reduction techniques.
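
To give a feeling for what such a reduction can look like, here is a minimal
sketch of one simple possibility: every method is represented once, and each
caller-callee edge is weighted with its call frequency. This particular scheme,
the function name `reduce_to_weighted_edges`, and the `(method, children)`
tuple encoding are assumptions for illustration, not necessarily one of the
techniques the chapter covers in Section 4.

```python
from collections import Counter


def reduce_to_weighted_edges(tree):
    """Collapse a call tree into caller -> callee edges with call counts.

    `tree` is a (method_name, [child_trees]) pair (hypothetical encoding).
    """
    weights = Counter()
    stack = [tree]
    while stack:
        caller, children = stack.pop()
        for child in children:
            weights[(caller, child[0])] += 1
            stack.append(child)
    return weights


# Hypothetical execution: a calls b and c once; c calls b 1000 times in a loop.
unreduced = ("a", [("b", []), ("c", [("b", []) for _ in range(1000)])])
reduced = reduce_to_weighted_edges(unreduced)

print(len(reduced))         # number of distinct edges after the reduction
print(reduced[("c", "b")])  # call frequency kept as an edge weight
```

The 1002 edges of the unreduced tree shrink to three weighted edges, at the
price of losing the order of calls and the distinction between individual
invocations.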
2.2 Bugs in Software
In the software-engineering literature, there is a number of different defi-
nitions of bugs, defects, errors, failures, faults and the like. For the purpose
of this chapter, we do not differentiate between them. It is enough to know
that a bug in a program execution manifests itself by producing some other
results than specified or by leading to some unexpected runtime behavior such
as crashes or non-terminating runs. In the following, we introduce some types
of bugs which are particularly interesting in the context of call graph based bug
localization.
Figure 17.1. (a) An unreduced call graph, (b) a call graph with a structure
affecting bug, and (c) a call graph with a frequency affecting bug.
Crashing and non-crashing bugs: Crashing bugs lead to an unexpected
termination of the program. Prominent examples include null pointer exceptions
and divisions by zero. In many cases, e.g., depending on the programming
language, such bugs are not hard to find: A stack trace is usually shown which
gives hints about where the bug occurred. Harder to cope with are non-crashing
bugs, i.e., failures which lead to faulty results without any hint that
something went wrong during the execution. As non-crashing bugs are hard to
find, all approaches to discover bugs with call-graph mining focus on them and
leave aside crashing bugs.
Occasional and non-occasional bugs: Occasional bugs are bugs which occur with
some but not with all input data. Finding occasional bugs is particularly
difficult, as they are harder to reproduce, and more test cases are necessary
for debugging. Furthermore, they occur more frequently, as non-occasional bugs
are usually detected early, whereas occasional bugs might only be found by
means of extensive testing. As all bug-localization techniques presented in
this chapter rely on comparing call graphs of failing and correct program
executions, they deal with occasional bugs only. In other words, besides
examples of failing program executions, there needs to be a certain number of
correct executions.
Structure and call frequency affecting bugs: This distinction is particularly
useful when designing call graph based bug-localization techniques. Structure
affecting bugs are bugs resulting in different shapes of the call graph where
some parts are missing or occur additionally in faulty executions. An example
is presented in Figure 17.1b, where Node 𝑏 called from 𝑎 is missing, compared
to the original graph in Figure 17.1a. In this example, a faulty if-condition
in Node 𝑎 could have caused the bug. In contrast, call frequency affecting bugs
are bugs which lead to a change in the number of calls of a certain subtree in
faulty executions, rather than to completely missing or new substructures. In
the example in Figure 17.1c, a faulty loop condition or a faulty if-condition
inside a loop in Method 𝑐 are typical causes for the increased number of calls
of Method 𝑏.
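
The difference between structure affecting and call frequency affecting bugs
can be made concrete with a small sketch. It compares one correct and one
failing execution, each given as a dictionary from caller-callee edges to call
counts. The encoding, the function `diff_call_graphs`, and the three example
graphs are assumptions that only loosely follow the description of Figure 17.1.
The approaches surveyed in this chapter compare sets of executions rather than
a single pair.

```python
def diff_call_graphs(correct, failing):
    """Compare two call graphs given as {(caller, callee): call_count}.

    Returns the edges present in only one graph (structural difference) and
    the shared edges whose call counts differ (frequency difference).
    """
    structural = correct.keys() ^ failing.keys()
    frequency = {
        edge: (correct[edge], failing[edge])
        for edge in correct.keys() & failing.keys()
        if correct[edge] != failing[edge]
    }
    return structural, frequency


# Hypothetical graphs: a calls b and c; c calls b in a loop.
correct_run = {("a", "b"): 1, ("a", "c"): 1, ("c", "b"): 3}
structure_bug = {("a", "c"): 1, ("c", "b"): 3}               # call a -> b is missing
frequency_bug = {("a", "b"): 1, ("a", "c"): 1, ("c", "b"): 7}  # c calls b too often

print(diff_call_graphs(correct_run, structure_bug))
print(diff_call_graphs(correct_run, frequency_bug))
```

A structure affecting bug shows up as a non-empty first component, a call
frequency affecting bug as a non-empty second component.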
Like probably any bug-localization technique, call graph based bug
localization is certainly not able to find all kinds of software bugs. For
example, it is possible that bugs do not affect the call graph at all. For
instance, if some mathematical expression calculates faulty results, this does
not necessarily affect subsequent method calls, and call graph mining cannot
detect this. Therefore, call graph based bug localization should be seen as a
technique which complements other techniques, such as the ones we will describe
in Section 3. In this chapter we concentrate on deterministic bugs of
single-threaded programs and leave aside bugs which are specific to
non-deterministic or multithreaded settings. However, the techniques described
in the following might locate such bugs as well.
2.3 Bug Localization with Call Graphs
So far, several approaches have been proposed to localize bugs by means
of call-graph mining [9, 13, 14, 25]. We will present them in detail in the
following sections. In a nutshell, the approaches consist of three steps:
1 Deduction of call graphs from program executions, assignment of labels
correct or failing.
2 Reduction of call graphs.
3 Mining of call graphs, analysis of the resulting frequent subgraphs.
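
The three steps can be sketched end to end as follows. This is a deliberately
simplified stand-in, not one of the approaches from the literature: instead of
frequent subgraph mining, it only looks at single edges and scores each edge by
how much more often it appears in failing than in correct executions. The
function names, the tuple encoding of call trees, the scoring formula, and the
example executions are all assumptions made for illustration.

```python
from collections import Counter


def reduce_graph(tree):
    """Step 2 (simplified): keep only the set of distinct caller -> callee edges.

    `tree` is a (method_name, [child_trees]) pair (hypothetical encoding).
    """
    edges = set()
    stack = [tree]
    while stack:
        caller, children = stack.pop()
        for child in children:
            edges.add((caller, child[0]))
            stack.append(child)
    return edges


def rank_edges(executions):
    """Step 3 (stand-in for subgraph mining): rank edges by suspiciousness.

    `executions` is a list of (call_tree, label) pairs from Step 1, where the
    label is "correct" or "failing".
    """
    totals = Counter(label for _, label in executions)
    if not totals["correct"] or not totals["failing"]:
        raise ValueError("need at least one correct and one failing execution")
    support = {"correct": Counter(), "failing": Counter()}
    for tree, label in executions:
        support[label].update(reduce_graph(tree))
    all_edges = set(support["correct"]) | set(support["failing"])
    scores = {
        edge: support["failing"][edge] / totals["failing"]
        - support["correct"][edge] / totals["correct"]
        for edge in all_edges
    }
    # Highest score first; ties are broken by edge name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Step 1 (assumed to have happened already): traced and labeled executions.
ok_run = ("a", [("b", []), ("c", [])])
bad_run = ("a", [("c", []), ("d", [])])
executions = [(ok_run, "correct"), (ok_run, "correct"),
              (bad_run, "failing"), (bad_run, "failing")]

ranking = rank_edges(executions)
for edge, score in ranking:
    print(edge, score)
```

Edges at the top of the ranking occur mainly in failing executions, edges at
the bottom are missing from them; both ends hint at where a developer might
start looking.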
Step 1: Deriving call graphs is relatively simple. They can be obtained by
tracing program executions while testing, which is assumed to be done anyway.
Furthermore, a classification of program executions as correct or failing is
