
Advances in
Machine Learning
Applications in
Software Engineering
Du Zhang
California State University, USA
Jeffrey J.P. Tsai
University of Illinois at Chicago, USA
Hershey • London • Melbourne • Singapore
IDEA GROUP PUBLISHING
Acquisitions Editor: Kristin Klinger
Development Editor: Kristin Roth
Senior Managing Editor: Jennifer Neidig
Managing Editor: Sara Reed
Assistant Managing Editor: Sharon Berger
Copy Editor: Amanda Appicello
Typesetter: Amanda Appicello
Cover Design: Lisa Tosheff
Printed at: Integrated Book Technology
Published in the United States of America by
Idea Group Publishing (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
and in the United Kingdom by
Idea Group Publishing (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Copyright © 2007 by Idea Group Inc. All rights reserved. No part of this book may be reproduced, stored or
distributed in any form or by any means, electronic or mechanical, including photocopying, without written
permission from the publisher.
Product or company names used in this book are for identification purposes only. Inclusion of the names of the
products or companies does not indicate a claim of ownership by IGI of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Advances in machine learning applications in software engineering / Du Zhang and Jeffrey J.P. Tsai, editors.
p. cm.
Summary: “This book provides analysis, characterization and refinement of software engineering data in
terms of machine learning methods. It depicts applications of several machine learning approaches in software
systems development and deployment, and the use of machine learning methods to establish predictive models
for software quality while offering readers suggestions by proposing future work in this emerging research
field”--Provided by publisher.
Includes bibliographical references and index.
ISBN 1-59140-941-1 (hardcover) ISBN 1-59140-942-X (softcover) ISBN 1-59140-943-8 (ebook)
1. Software engineering. 2. Self-adaptive software. 3. Application software. 4. Machine learning. I. Zhang,
Du. II. Tsai, Jeffrey J.P.
QA76.758.A375 2007
005.1 dc22
2006031366
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously-unpublished material. The views expressed in this book are
those of the authors, but not necessarily of the publisher.
Advances in Machine
Learning Applications in
Software Engineering
Table of Contents
Preface vi
Section I: Data Analysis and Refinement
Chapter I
A Two-Stage Zone Regression Method for Global Characterization of a Project
Database 1
J. J. Dolado, University of the Basque Country, Spain
D. Rodríguez, University of Reading, UK
J. Riquelme, University of Seville, Spain
F. Ferrer-Troyano, University of Seville, Spain
J. J. Cuadrado, University of Alcalá de Henares, Spain
Chapter II
Intelligent Analysis of Software Maintenance Data 14
Marek Reformat, University of Alberta, Canada
Petr Musilek, University of Alberta, Canada
Efe Igbide, University of Alberta, Canada
Chapter III
Improving Credibility of Machine Learner Models in Software Engineering 52
Gary D. Boetticher, University of Houston – Clear Lake, USA
Section II: Applications to Software Development
Chapter IV
ILP Applications to Software Engineering 74
Daniele Gunetti, Università degli Studi di Torino, Italy

Chapter V
MMIR: An Advanced Content-Based Image Retrieval System Using a
Hierarchical Learning Framework 103
Min Chen, Florida International University, USA
Shu-Ching Chen, Florida International University, USA
Chapter VI
A Genetic Algorithm-Based QoS Analysis Tool for Reconfigurable
Service-Oriented Systems 121
I-Ling Yen, University of Texas at Dallas, USA
Tong Gao, University of Texas at Dallas, USA
Hui Ma, University of Texas at Dallas, USA
Section III: Predictive Models for Software Quality and Relevancy
Chapter VII
Fuzzy Logic Classifiers and Models in Quantitative Software Engineering 148
Witold Pedrycz, University of Alberta, Canada
Giancarlo Succi, Free University of Bolzano, Italy
Chapter VIII
Modeling Relevance Relations Using Machine Learning Techniques 168
Jelber Sayyad Shirabad, University of Ottawa, Canada
Timothy C. Lethbridge, University of Ottawa, Canada
Stan Matwin, University of Ottawa, Canada
Chapter IX
A Practical Software Quality Classification Model Using Genetic
Programming 208
Yi Liu, Georgia College & State University, USA
Taghi M. Khoshgoftaar, Florida Atlantic University, USA
Chapter X
A Statistical Framework for the Prediction of Fault-Proneness 237

Yan Ma, West Virginia University, USA
Lan Guo, West Virginia University, USA
Bojan Cukic, West Virginia University, USA
Section IV: State-of-the-Practice
Chapter XI
Applying Rule Induction in Software Prediction 265
Bhekisipho Twala, Brunel University, UK
Michelle Cartwright, Brunel University, UK
Martin Shepperd, Brunel University, UK
Chapter XII
Application of Genetic Algorithms in Software Testing 287
Baowen Xu, Southeast University & Jiangsu Institute of Software Quality,
China
Xiaoyuan Xie, Southeast University & Jiangsu Institute of Software Quality,
China
Liang Shi, Southeast University & Jiangsu Institute of Software Quality,
China
Changhai Nie, Southeast University & Jiangsu Institute of Software Quality,
China
Section V: Areas of Future Work

Chapter XIII
Formal Methods for Specifying and Analyzing Complex Software Systems 319
Xudong He, Florida International University, USA
Huiqun Yu, East China University of Science and Technology, China
Yi Deng, Florida International University, USA
Chapter XIV
Practical Considerations in Automatic Code Generation 346

Paul Dietz, Motorola, USA
Aswin van den Berg, Motorola, USA
Kevin Marth, Motorola, USA
Thomas Weigert, Motorola, USA
Frank Weil, Motorola, USA
Chapter XV
DPSSEE: A Distributed Proactive Semantic Software Engineering
Environment 409
Donghua Deng, University of California, Irvine, USA
Phillip C Y. Sheu, University of California, Irvine, USA
Chapter XVI
Adding Context into an Access Control Model for Computer Security Policy 439
Shangping Ren, Illinois Institute of Technology, USA
Jeffrey J.P. Tsai, University of Illinois at Chicago, USA
Ophir Frieder, Illinois Institute of Technology, USA
About the Editors 457
About the Authors 458
Index 467

preface
Machine learning is the study of how to build computer programs that improve their perfor-
mance at some task through experience. The hallmark of machine learning is that it results
in an improved ability to make better decisions. Machine learning algorithms have proven
to be of great practical value in a variety of application domains. Not surprisingly, the field
of software engineering turns out to be a fertile ground where many software development
and maintenance tasks could be formulated as learning problems and approached in terms
of learning algorithms.
To meet the challenge of developing and maintaining large and complex software systems
in a dynamic and changing environment, machine learning methods have been playing an
increasingly important role in many software development and maintenance tasks. The past
two decades have witnessed increasing interest, and some encouraging results and publications,
in applying machine learning to software engineering. As a result, a crosscutting
niche area has emerged. Currently, there are efforts to raise the awareness and profile of this
crosscutting, emerging area, and to systematically study its various issues. It is our intention
to capture, in this book, some of the latest advances in this emerging niche area.
Machine Learning Methods
Machine learning methods fall into the following broad categories: supervised learning,
unsupervised learning, semi-supervised learning, analytical learning, and reinforcement
learning. Supervised learning deals with learning a target function from labeled examples.
Unsupervised learning attempts to learn patterns and associations from a set of objects that
do not have attached class labels. Semi-supervised learning is learning from a combination of
labeled and unlabeled examples. Analytical learning relies on domain theory or background
knowledge, instead of labeled examples, to learn a target function. Reinforcement learning
is concerned with learning a control policy through reinforcement from an environment.
There are a number of important issues in machine learning:
• How is a target function represented and specified (based on the formalism used to
represent a target function, there are different machine learning approaches)? What
are the interpretability, complexity, and properties of a target function? How does it
generalize?
• What is the hypothesis space (the search space)? What are its properties?
• What are the issues in the search process for a target function? What are heuristics
and bias utilized in searching for a target function?
• Is there any background knowledge or domain theory available for the learning pro-
cess?
• What properties do the training data have?
• What are the theoretical underpinnings and practical issues in the learning process?

The following are some frequently-used machine learning methods in the aforementioned
categories.
In concept learning, a target function is represented as a conjunction of constraints on
attributes. The hypothesis space H consists of a lattice of possible conjunctions of attribute
constraints for a given problem domain. A least-commitment search strategy is adopted to
eliminate hypotheses in H that are not consistent with the training set D. This results in
a structure called the version space: the subset of hypotheses that are consistent with the
training data. The algorithm, called candidate elimination, utilizes generalization and
specialization operations to produce the version space with regard to H and D. It relies on
a language (or restriction) bias which states that the target function is contained in H. This is
an eager, supervised learning method. It is not robust to noise in the data and has no
support for accommodating prior knowledge.
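As a concrete sketch of how the specific boundary of a version space is built, the classic Find-S procedure generalizes a conjunctive hypothesis just enough to cover each positive example. (Full candidate elimination also maintains the general boundary; the weather-style data below are invented for illustration.)

```python
def find_s(examples):
    """Find-S: the maximally specific conjunctive hypothesis consistent
    with the positive examples. A hypothesis is a tuple of attribute
    constraints, where "?" matches any value.
    examples: list of (attribute_tuple, is_positive)."""
    positives = [x for x, is_positive in examples if is_positive]
    h = list(positives[0])            # start with the first positive example
    for x in positives[1:]:
        # Minimally generalize h so that it still covers x.
        h = [hi if hi == xi else "?" for hi, xi in zip(h, x)]
    return tuple(h)
```

For example, two positive examples agreeing on "sunny" and "warm" but differing in humidity yield the hypothesis ("sunny", "warm", "?").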
In decision tree learning, a target function is defined as a decision tree. Search in decision
tree learning is often guided by an entropy-based information gain measure that indicates
how much information a test on an attribute yields. Learning algorithms often have a bias
for small trees. Decision tree learning is an eager, supervised, and unstable method, and is
susceptible to noisy data, a cause of overfitting. It cannot accommodate prior knowledge
during the learning process. However, it scales up well to large data sets in several
different ways.
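The entropy-based gain measure can be sketched in a few lines (a minimal illustration; the toy "module defect" data are invented for the example):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Reduction in label entropy from splitting on `attribute`.
    Each example is a dict mapping attribute -> value."""
    partitions = {}
    for ex, y in zip(examples, labels):
        partitions.setdefault(ex[attribute], []).append(y)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Invented toy data: whether a module is defective.
examples = [{"size": "big", "tested": "no"},
            {"size": "big", "tested": "yes"},
            {"size": "small", "tested": "yes"},
            {"size": "small", "tested": "no"}]
labels = ["defect", "clean", "clean", "defect"]
```

Here a test on "tested" separates the classes perfectly (gain 1 bit), while a test on "size" yields nothing (gain 0), so a greedy tree builder would split on "tested" first.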
In neural network learning, given a fixed network structure, learning a target function amounts
to finding weights for the network such that the network outputs are the same as (or within
an acceptable range of) the expected outcomes specified in the training data. A vector of
weights in essence defines a target function, which makes the target function very difficult
for humans to read and interpret. This is an eager, supervised, and unstable learning approach,
and it cannot accommodate prior knowledge. A popular algorithm for feed-forward networks
is backpropagation, which adopts a gradient descent search and sanctions an inductive bias
of smooth interpolation between data points.
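The gradient-descent weight update at the heart of backpropagation can be sketched on a single sigmoid unit; a full network applies the same step layer by layer. (The OR-gate training data and learning-rate settings are illustrative assumptions.)

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_unit(data, epochs=2000, lr=0.5):
    """Gradient-descent training of one sigmoid unit.
    data: list of (input_tuple, target) with target in {0, 1}."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, t in data:
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = y - t  # gradient of cross-entropy loss w.r.t. the pre-activation
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Learn the (linearly separable) OR function.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_unit(or_data)
```

After training, the learned weight vector reproduces OR, but, as the paragraph notes, the numbers in `w` and `b` say little to a human reader about *what* was learned.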
Bayesian learning offers a probabilistic approach to inference, based on the assumption
that the quantities of interest are governed by probability distributions, and that
optimal decisions or classifications can be reached by reasoning about these probabilities
along with observed data. Bayesian learning methods can be divided into two groups based
on the outcome of the learner: those that produce the most probable hypothesis given the
training data, and those that produce the most probable classification of a new instance
given the training data. A target function is thus explicitly represented in the first group, but
implicitly defined in the second. One of the main advantages of Bayesian learning is that it
accommodates prior knowledge (in the form of Bayesian belief networks, prior probabilities
for candidate hypotheses, or a probability distribution over observed data for a possible
hypothesis). The classification of an unseen case is obtained through the combined predictions
of multiple hypotheses. Bayesian learning also scales up well to large data sets. It is an eager,
supervised learning method and does not require search during the learning process. Though
it has no problem with noisy data, Bayesian learning has difficulty with small data sets. It
adopts a bias based on the minimum description length principle.
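A minimal naive Bayes classifier, of the "most probable classification" flavor, can be sketched as follows. (The add-one smoothing scheme and the tiny defect dataset are illustrative assumptions, not prescriptions from the text.)

```python
from collections import Counter, defaultdict

def train_nb(examples, labels):
    """examples: list of dicts mapping attribute -> value."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (class, attribute) -> value counts
    for ex, y in zip(examples, labels):
        for a, v in ex.items():
            cond[(y, a)][v] += 1
    return priors, cond

def predict_nb(priors, cond, ex):
    total = sum(priors.values())
    def score(y):
        s = priors[y] / total  # prior P(y)
        for a, v in ex.items():
            counts = cond[(y, a)]
            # Add-one smoothing; the extra "+ 1" in the denominator
            # crudely reserves probability mass for unseen values.
            s *= (counts[v] + 1) / (sum(counts.values()) + len(counts) + 1)
        return s
    return max(priors, key=score)

# Invented toy data: untested modules tend to be defective.
examples = [{"tested": "no"}, {"tested": "no"},
            {"tested": "yes"}, {"tested": "yes"}]
labels = ["defect", "defect", "clean", "clean"]
priors, cond = train_nb(examples, labels)
```

Each prediction multiplies the class prior by the per-attribute conditional likelihoods and returns the class with the highest posterior score.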
Genetic algorithms and genetic programming are both biologically-inspired learning
methods. A target function is represented as a bit string in genetic algorithms, or as a program
in genetic programming. The search process starts with a population of initial hypotheses.
Through the crossover and mutation operations, members of the current population give rise
to the next generation. During each iteration, hypotheses in the current population are
evaluated with regard to a given measure of fitness, with the fittest hypotheses being selected
as members of the next generation. The search terminates when some hypothesis h has a
fitness value above a given threshold. Thus, the learning process is essentially embodied in
a generate-and-test beam search, and the bias is fitness-driven.
There are generational and steady-state algorithms.
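The generate-and-test loop can be sketched on the classic "max-ones" toy problem; the particular operators and parameters below (elitist selection of the fittest half, one-point crossover, single-bit mutation) are illustrative choices, not prescriptions from the text.

```python
import random

def evolve(fitness, length=10, pop_size=20, generations=60, seed=0):
    """A minimal generational genetic algorithm over bit strings."""
    rng = random.Random(seed)
    pop = [[rng.randrange(2) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # fitness-driven selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            i = rng.randrange(length)            # single-bit mutation
            child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve(sum)  # fitness = number of 1-bits in the string
```

Because the fittest half survives unchanged, the best fitness never decreases from one generation to the next.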
Instance-based learning is a typical lazy learning approach in the sense that generalizing
beyond the training data is deferred until an unseen case needs to be classified. In addition,
a target function is not explicitly defined; instead, the learner returns a target function value
when classifying a given unseen case. The target function value is generated based on a
subset of the training data that is considered local to the unseen example, rather than
on the entire training data. This amounts to approximating a different target function for
each distinct unseen example, which is a significant departure from the eager learning methods,
where a single target function is obtained as a result of the learner generalizing from the
entire training data. The search process is based on statistical reasoning, and consists in
identifying training data that are close to the given unseen case and producing the target
function value based on its neighbors. Popular algorithms include k-nearest neighbors,
case-based reasoning, and locally weighted regression.
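The k-nearest-neighbors idea, answering each query from its local neighborhood rather than from a single global function, fits in a few lines (the 2-D points and labels are invented for the example):

```python
from collections import Counter

def knn_classify(query, data, k=3):
    """data: list of (point, label) pairs; point is a tuple of numbers.
    Classify `query` by a majority vote of its k nearest neighbors."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    neighbors = sorted(data, key=lambda d: sq_dist(d[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two invented clusters of training points.
data = [((0, 0), "low"), ((0, 1), "low"), ((1, 0), "low"),
        ((5, 5), "high"), ((5, 6), "high"), ((6, 5), "high")]
```

Note that no model is built up front; all the work happens at query time, which is exactly what "lazy" means here.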
Because a target function in inductive logic programming (ILP) is defined by a set of
(propositional or first-order) rules, it is highly amenable to human readability and
interpretability. ILP lends itself to the incorporation of background knowledge during the
learning process, and is an eager, supervised learning method. The bias sanctioned by ILP
includes rule accuracy, FOIL-gain, or a preference for shorter clauses. There are a number
of algorithms: SCA, FOIL, PROGOL, and inverted resolution.
Instead of learning a non-linear target function from data in the input space directly, support
vector machines use a kernel function (defined in the form of an inner product of training data)
to transform the training data from the input space into a high-dimensional feature space F
first, and then learn the optimal linear separator (a hyperplane) in F. A decision function,
defined based on the linear separator, can be used to classify unseen cases. Kernel functions
play a pivotal role in support vector machines. The decision function relies only on a subset
of the training data called support vectors.
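The kernel trick described above can be made concrete with the degree-2 polynomial kernel: evaluating K in the input space equals an inner product in an explicit higher-dimensional feature space, so that space never has to be constructed coordinate by coordinate. (This kernel and feature map are standard textbook choices, not taken from the chapter.)

```python
import math

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2 for 2-D inputs, computed in the input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map satisfying K(x, z) = phi(x) . phi(z):
    phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))
```

For any pair of 2-D points, `poly_kernel(x, z)` and `dot(phi(x), phi(z))` agree, which is why the separator in F can be learned while only ever evaluating K on pairs of training points.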
In ensemble learning, a target function is essentially the result of combining, through weighted
or unweighted voting, a set of component or base-level functions called an ensemble. An
ensemble can have better predictive accuracy than its component functions if (1) the individual
functions disagree with each other, (2) the individual functions have a predictive accuracy that
is slightly better than random classification (e.g., error rates below 0.5 for binary
classification), and (3) the individual functions’ errors are at least somewhat uncorrelated.
Ensemble learning can be seen as a learning strategy that addresses inadequacies in the
training data (insufficient information to help select a single best h ∈ H), in the search
algorithm (deploying multiple hypotheses amounts to compensating for a less-than-perfect
search algorithm), and in the representation of H (a weighted combination of individual
functions makes it possible to represent a true function f ∉ H). Ultimately, an ensemble is
less likely to misclassify than a single component function.
Two main issues exist in ensemble learning: ensemble construction and classification
combination. There are bagging, cross-validation, and boosting methods for constructing
ensembles, and weighted and unweighted votes for combining classifications. The AdaBoost
algorithm is one of the best methods for constructing ensembles of decision trees.
There are two approaches to ensemble construction. One is to combine component functions
that are homogeneous (derived using the same learning algorithm and defined in the
same representation formalism, for example, an ensemble of functions derived by the decision
tree method) and weak (slightly better than random guessing). The other is to combine
component functions that are heterogeneous (derived by different learning algorithms and
represented in different formalisms, for example, an ensemble of functions derived by
decision trees, instance-based learning, Bayesian learning, and neural networks) and strong
(each of the component functions performs relatively well in its own right).
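Conditions (2) and (3) above can be checked numerically: for independent classifiers each erring with probability p &lt; 0.5, an unweighted majority vote errs less often than any individual, and the error shrinks as the ensemble grows. A small enumeration (an illustrative sketch, assuming fully independent errors) shows this:

```python
from itertools import product

def majority_error(p, n):
    """Error rate of an unweighted majority vote of n independent
    classifiers, each of which errs with probability p (n odd)."""
    err = 0.0
    for outcome in product([0, 1], repeat=n):  # 1 marks an individual error
        if sum(outcome) > n / 2:               # the majority is wrong
            prob = 1.0
            for o in outcome:
                prob *= p if o else (1 - p)
            err += prob
    return err
```

With p = 0.3, three voters already cut the error to 0.216, and five voters to roughly 0.163; correlated errors would erode this gain, which is why condition (3) matters.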
Multiple instance learning deals with the situation in which each training example may have
several variant instances. If we use a bag to indicate the set of all variant instances for a
training example, then for a Boolean class the label for the bag is positive if there is at least
one variant instance in the bag that has a positive label. A bag has a negative label if all
variant instances in the bag have a negative label. The learning task is to approximate
a target function that can classify every variant instance of an unseen negative example as
negative, and at least one variant instance of an unseen positive example as positive.
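The bag-labeling semantics just described is simple to state in code (a sketch; the threshold-based instance classifier is an arbitrary stand-in):

```python
def bag_label(instance_labels):
    """A bag is positive iff at least one of its instances is positive."""
    return any(instance_labels)

def classify_bag(bag, instance_classifier):
    """Apply an instance-level classifier under the bag semantics."""
    return any(instance_classifier(x) for x in bag)

# Stand-in instance classifier: "positive" means the value exceeds 10.
is_positive = lambda x: x > 10
```

A learned instance classifier thus succeeds on a negative bag only if it rejects every instance, but needs to accept just one instance of a positive bag.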
In unsupervised learning, a learner analyzes a set of objects that do not have class
labels, and discerns the categories to which the objects belong. Given a set of objects as input,
there are two groups of approaches in unsupervised learning: density estimation methods,
which can be used to create statistical models that capture or explain underlying patterns or
interesting structures behind the input, and feature extraction methods, which can be used to
glean statistical features (regularities or irregularities) directly from the input. Unlike
supervised learning, there is no direct measure of success for unsupervised learning. In general,
it is difficult to establish the validity of inferences from the output that unsupervised learning
algorithms produce. The most frequently utilized methods in unsupervised learning include
association rules, cluster analysis, self-organizing maps, and principal component analysis.
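Of the listed methods, cluster analysis is the easiest to sketch. A minimal one-dimensional k-means (illustrative only; real implementations use better initialization) alternates between assigning points to the nearest centroid and recomputing each centroid as its cluster mean:

```python
def kmeans_1d(points, k=2, iters=20):
    """Minimal 1-D k-means; centroids start at evenly spaced sorted points."""
    pts = sorted(points)
    if k == 1:
        centroids = [pts[0]]
    else:
        centroids = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)  # assign to the nearest centroid
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

centroids = kmeans_1d([1.0, 1.2, 0.8, 5.0, 5.3, 4.7])
```

On the invented sample above, the two centroids settle at the means of the two visible groups, 1.0 and 5.0, with no class labels ever supplied.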
Semi-supervised learning relies on a collection of labeled and unlabeled examples. The
learning starts with using the labeled examples to obtain an initial target function, which is
then used to classify the unlabeled examples, thus generating additional labeled examples.
The learning process will be iterated on the augmented training set. Some semi-supervised
learning methods include: expectation-maximization with generative mixture models,
self-training, co-training, transductive support vector machines, and graph-based methods.
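The iterate-and-pseudo-label loop can be sketched with a deliberately simple base learner, a nearest-centroid classifier on one-dimensional data (the data and the margin-based confidence measure are invented for the illustration):

```python
def self_train(labeled, unlabeled):
    """labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x.
    Repeatedly pseudo-label the unlabeled point the current model is
    most confident about, then refit on the augmented set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        # Fit: one centroid per class from the current labeled set.
        c = [0.0, 0.0]
        for y in (0, 1):
            xs = [x for x, yy in labeled if yy == y]
            c[y] = sum(xs) / len(xs)
        # Confidence = margin between the distances to the two centroids.
        best = max(unlabeled, key=lambda x: abs(abs(x - c[0]) - abs(x - c[1])))
        y = 0 if abs(best - c[0]) < abs(best - c[1]) else 1
        labeled.append((best, y))      # augment the training set
        unlabeled.remove(best)
    return labeled
```

Starting from one labeled example per class, the loop spreads labels outward from the most confidently classified unlabeled points first.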
When a learner has some level of control over which part of the input domain it relies on
in generating a target function, this is referred to as active learning. The control the learner
possesses over the input example selection is called selective sampling. Active learning can
be adopted in the following setting in semi-supervised learning: the learner identifies the
most informative unlabeled examples and asks the user to label them. This combination of
active learning and semi-supervised learning results in what is referred to as multi-view
learning.
Analytical learning allows a target function to be generalized from a domain theory (prior
knowledge about the problem domain). The learned function has good readability and
interpretability. In analytical learning, search is performed in the form of deductive
reasoning. The search bias in explanation-based learning, a major analytical learning method,
is a domain theory and a preference for a small set of Horn clauses. One important perspective
on explanation-based learning is that learning can be construed as recompiling or
reformulating the knowledge in the domain theory so as to make it operationally more
efficient when classifying unseen cases. EBL algorithms include Prolog-EBG.
Both inductive learning and analytical learning have their pros and cons. The former requires
plentiful data (and is thus vulnerable to data quality and quantity problems), while the latter
relies on a domain theory (and is hence susceptible to domain theory quality and quantity
problems). Inductive analytical learning is meant to provide a framework in which the benefits
of both approaches can be strengthened and the impact of their drawbacks minimized. It
usually encompasses an inductive learning component and an analytical learning component.
It requires both a training set and a domain theory, and can be an eager and supervised
learning method. The issues of target function representation, search, and bias are largely
determined by the underlying learning components involved.
Reinforcement learning is the most general form of learning. It tackles the issue of how
to learn a sequence of actions, called a control strategy, from indirect and delayed reward
information (reinforcement). It is an eager and unsupervised form of learning. Its search is
carried out through training episodes. Two main approaches exist for reinforcement learning:
model-based and model-free approaches. The best-known model-free algorithm is Q-learning.
In Q-learning, actions with the maximum Q value are preferred.
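The Q-learning update, Q(s,a) &lt;- Q(s,a) + a[r + g max Q(s',a') - Q(s,a)], can be sketched on a toy corridor world (the environment, rewards, and parameters are entirely invented for illustration):

```python
import random

def q_learn(n_states=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a 1-D corridor: states 0..n_states-1, actions
    0 (left) and 1 (right); reward 1 only on entering the last state."""
    Q = [[0.0, 0.0] for _ in range(n_states)]
    rng = random.Random(seed)
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy action selection.
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # The Q-learning update rule.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learn()
```

Even though the reward is delayed until the final step, the discounted backups propagate value leftward, so the greedy policy in every state ends up preferring "right."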
Machine Learning Applications in
Software Engineering
In software engineering, there are three categories of entities: processes, products, and
resources. Processes are collections of software-related activities, such as constructing a
specification, detailed design, or testing. Products refer to artifacts, deliverables, and
documents that result from a process activity, such as a specification document, a design
document, or a segment of code. Resources are entities required by a process activity, such as
personnel, software tools, or hardware. The aforementioned entities have internal and external
attributes. Internal attributes describe an entity itself, whereas external attributes characterize
the behavior of an entity (how the entity relates to its environment). Machine learning methods
have been utilized to develop better software products, to be part of software products, and
to make the software development process more efficient and effective. The following is a
partial list of software engineering areas into which machine learning applications have
found their way:
• Predicting or estimating measurements for either internal or external attributes of
processes, products, or resources. These include: software quality, software size,
software development cost, project or software effort, maintenance task effort, software
resource, correction cost, software reliability, software defect, reusability, software
release timing, productivity, execution times, and testability of program modules.

• Discovering either internal or external properties of processes, products, or resources.
These include: loop invariants, objects in programs, boundary of normal operations,
equivalent mutants, process models, and aspects in aspect-oriented programming.
• Transforming products to accomplish some desirable or improved external attributes.
These include: transforming serial programs into parallel ones, improving software
modularity, and mapping OO applications to heterogeneous distributed environments.
• Synthesizing or generating various products. These include: test data, test resource,
project management rules, software agents, design repair knowledge, design schemas,
data structures, programs/scripts, project management schedule, and information
graphics.
• Reusing products or processes. These include: similarity computing, active browsing,
cost of rework, knowledge representation, locating and adapting software to
specifications, generalizing program abstractions, and clustering of components.
• Enhancing processes. These include: deriving specifications of system goals and
requirements, extracting specifications from software, acquiring knowledge for
specification refinement and augmentation, and acquiring and maintaining
specifications consistent with scenarios.
• Managing products. These include: collecting and managing software development
knowledge, and maintaining software process knowledge.
Organization of the Book
This book includes sixteen chapters that are organized into five sections. The first section
has three chapters (Chapters I-III) that deal with the analysis, characterization, and refinement
of software engineering data in terms of machine learning methods. The second section
includes three chapters (Chapters IV-VI) that present applications of several machine learning
approaches in helping with software systems development and deployment. The third
section contains four chapters (Chapters VII-X) that describe the use of machine learning
methods to establish predictive models for software quality and relevancy. Two chapters
(Chapters XI-XII) in the fourth section offer some state-of-the-practice on the applications
of two machine learning methods. Finally, the four chapters (Chapters XIII-XVI) in the last
section of the book point to areas of future work in this emerging research field.
Chapter I discusses the issue of how to use machine learning methods to refine a large
software project database into a new database that captures and retains the essence of the
original database but contains fewer attributes and instances. This new and smaller database
affords project managers a better chance to gain insight into the database. The proposed data
refinement approach is based on decision tree learning. The authors demonstrate their
approach through four datasets in the International Software Benchmarking Standards
Group database.
Chapter II is concerned with analyzing software maintenance data to shed light on efforts
in defect elimination. Several learning methods (decision tree learning, rule-based learning,
and genetic algorithms and genetic programming) are utilized to address the following two
issues: the number of software components to be examined to remove a single defect, and
the total time needed to remove a defect. The maintenance data from a real-life software
project have been used in the study.
Chapter III takes a closer look at the credibility issue in empirically-based models. Several
experiments have been conducted on five NASA defect datasets using a naïve Bayesian
classifier and decision tree learning. Several observations have been made: the importance
of sampling on non-class attributes, and the insufficiency of ten-fold cross validation in
establishing realistic models. The author introduces several credibility metrics that measure
the difficulty of a dataset. It is argued that adoption of these credibility metrics will lead to
better models and improve their chance of being accepted by software practitioners.
Chapter IV focuses on the applications of inductive logic programming to software
engineering. An integrated framework based on inductive logic programming has been
proposed for the synthesis, maintenance, reuse, testing and debugging of logic programs. In
addition, inductive logic programming has been successfully utilized in genetics, automation
of the scientific process, natural language processing and data mining.
Chapter V demonstrates how multiple instance learning and neural networks are integrated
with a Markov model mediator to address the following challenges in an advanced
content-based image retrieval system: the significant discrepancy between low-level image
features and high-level semantic concepts, and the perception subjectivity problem.
Comparative studies on a large set of real-world images indicate the promising performance
of this approach.
Chapter VI describes the application of genetic algorithms to reconfigurable service-oriented
systems. To accommodate reconfigurability in a service-oriented architecture, QoS analysis
is often required to make appropriate service selections and configurations. To determine
the best selections and configurations, composition analysis techniques are needed
to analyze QoS tradeoffs. The composition analysis framework proposed in this chapter
employs a genetic algorithm for composition decision making. A case study is conducted
on the selection and configuration of Web services.
Chapter VII deals with the issue of software quality models. The authors propose an approach
to defining logic-driven models based on fuzzy multiplexers. The constructs in such models
have a clear and modular topology whose interpretation corresponds to a collection of
straightforward logic expressions. Genetic algorithms and genetic optimization underpin
the design of the logic models. Experiments on a software dataset illustrate how the
logic model allows the number of modifications made to software modules to be obtained
from a collection of software metrics.
Chapter VIII defines a notion called relevance relations among software entities. Relevance
relations map tuples of software entities to values that signify how related they are to each
other. The availability of such relevance relations plays a pivotal role in software development
and maintenance, making it possible to predict whether a change to one software entity (one
file) results in a change to another entity (file). A process has been developed that allows
relevance relations to be learned through decision tree learning. The empirical evaluation,
through applying the process to a large legacy system, indicates that the predictive quality
of the learned models makes them a viable choice for field deployment.
Chapter IX presents a novel software quality classification model that is based on genetic
programming. The proposed model provides not only a classification but also a quality-based
ranking for software modules. In evolving a genetic programming-based software quality
model, three performance criteria have been considered: classification accuracy, module
ranking, and the size of the tree. The model has been subjected to case studies of software
measurement data from two industrial software systems.
Chapter X describes a software quality prediction model that is used to predict fault-prone
modules. The model is based on an ensemble of trees voting on prediction decisions to
improve its classification accuracy. Five NASA defect datasets have been used to assess the
performance of the proposed model. Two strategies have been identified as effective for
prediction accuracy: a proper sampling technique in constructing the tree classifiers, and
threshold adjustment in determining the resulting class.
Chapter XI offers a broad view of the roles rule-based learning plays in software engineering.
It provides some background information, discusses the key issues in rule induction, and
examines how rule induction handles uncertainties in data. The chapter examines the rule
induction applications in the following areas: software effort and cost prediction, software
quality prediction, software defect prediction, software intrusion detection, and software
process modeling.
Chapter XII, on the other hand, provides a state-of-the-practice overview on genetic algo-
rithm applications to software testing. The focus of the chapter is on evolutionary testing,
which is the application of genetic algorithms for test data generation. The central issue in
evolutionary testing is a numeric representation of the test objective from which an appropriate
fitness function can be defined to evaluate the generated test data. The chapter includes
reviews of existing approaches in structural, temporal performance, and specification-based
functional evolutionary testing.
Chapter XIII reviews two well-known formal methods, high-level Petri nets and temporal
logic, for software system specication and analysis. It pays attention to recent advances in
using these formal methods to specify, model and analyze software architectural design. The
chapter opens the opportunity for machine learning methods to be utilized in learning either
the property specications or behavior models at element or composition level in a software
architectural design phase. In addition, learning methods can be applied to the formal analysis
for element correctness, composition correctness, or refinement correctness.
A model-driven software engineering process advocates developing software systems by
creating an executable model of the system design first and then transforming the model
into a production quality implementation. The success of the approach hinges critically
on the availability of code generators that can transform a model to its implementation.
Chapter XIV gives a testimony to the model-driven process. It provides insights, practical
considerations, and lessons learned when developing code generators for applications that
must conform to the constraints imposed by real-world high-performance systems. Since
the model can be construed as the domain theory, analytical learning can be used to help
with the transformation process. There have been machine learning applications in program
transformation tasks.
Chapter XV outlines a distributed proactive semantic software engineering environment.
The proposed environment incorporates logic rules into a software development process to
capture the semantics from various levels of the software life cycle. The chapter discusses
several scenarios in which semantic rules are used for workflow control, design consistency
checking, testing and maintenance. This environment certainly makes it possible to deploy
machine learning methods in the rule generator and in the semantic constraint generator to
learn constraint rules and proactive rules.
Chapter XVI depicts a role-based access control model that is augmented with the context
constraints for computer security policy. There are system contexts and application contexts.
Integrating the contextual information into a role-based access control model allows the
model to be flexible and capable of specifying various complex access policies, and to be
able to provide tight and just-in-time permission activations. Machine learning methods can
be used in deriving context constraints from system or application contextual data.
This book is intended particularly for practicing software engineers, and for researchers and
scientists in either the software engineering or machine learning field. The book can also be
used as a textbook for advanced undergraduate or graduate students in a software
engineering course or a machine learning application course, or as a reference book for
advanced training courses in the field.
Du Zhang
Jeffrey J.P. Tsai
Acknowledgments
We would like to take this opportunity to express our sincere appreciation to all the authors
for their contributions, and to all the reviewers for their support and professionalism. We are
grateful to Kristin Roth, development editor at IGI, for her guidance, help, and encouragement
at each step of this project.
Du Zhang
Jeffrey J.P. Tsai
Section I:
Data Analysis and Refinement
This part has three chapters that deal with analysis, characterization and refinement of
software engineering data in terms of machine learning methods. There are circumstances
in software engineering where data may be plentiful or scarce or anywhere in between. The
questions to be asked are: What is the data quality? What do the data convey to project
managers or to software engineers? Can machine learning methods help glean useful
information from the data?
Chapter I discusses the issue of how to use machine learning methods to refine a large
software project database into a new database which captures and retains the essence of
the original database but contains fewer attributes and instances. This new and
smaller database would afford the project managers a better chance to gain insight into
the database. Chapter II is concerned with analyzing software maintenance data to shed
light on efforts in defect elimination. Several learning methods are utilized to address the
following two issues: the number of software components to be examined to remove a single
defect, and the total time needed to remove a defect. Chapter III takes a closer look at the
credibility issue in empirical-based models. Several experiments have been conducted
on five NASA defect datasets using a naïve Bayesian classifier and decision tree learning,
yielding some interesting observations.
A Two-Stage Zone Regression Method for Global Characterization 1
Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission
of Idea Group Inc. is prohibited.
Chapter I
A Two-Stage Zone
Regression Method for
Global Characterization
of a Project Database
J. J. Dolado, University of the Basque Country, Spain
D. Rodríguez, University of Reading, UK
J. Riquelme, University of Seville, Spain
F. Ferrer-Troyano, University of Seville, Spain
J. J. Cuadrado, University of Alcalá de Henares, Spain
Abstract
One of the problems found in generic project databases, where the data is collected from
different organizations, is the large disparity of its instances. In this chapter, we characterize
the database selecting both attributes and instances so that project managers can have a
better global vision of the data they manage. To achieve that, we first make use of data min-
ing algorithms to create clusters. From each cluster, instances are selected to obtain a final
subset of the database. The result of the process is a smaller database which maintains the
prediction capability and has fewer instances and attributes than the original,
yet allows us to produce better predictions.
2 Dolado, Rodríguez, Riquelme, Ferrer-Troyano & Cuadrado
Introduction
Successful software engineering projects need to estimate and make use of past data since
the inception of the project. In the last decade, several organizations have started to collect
data so that companies without historical datasets can use these generic databases for
estimation. In some cases, project databases are used to compare data from the organization
with other industries, that is, benchmarking. Examples of such organizations collecting data
include the International Software Benchmarking Standards Group (ISBSG, 2005) and the
Software Technology Transfer Finland (STTF, 2004).
One problem faced by project managers when using these datasets is that the large number
of attributes and instances requires careful selection before estimation or benchmarking
in a specific organization. For example, the latest release of the ISBSG (2005) has more
than 50 attributes and 3,000 instances collected from a large variety of organizations. The
project manager has the problem of interpreting and selecting the most adequate instances.
In this chapter, we propose an approach to reduce (characterize) such repositories using
data mining as shown in Figure 1. The number of attributes is reduced mainly using expert
knowledge although the data mining algorithms can help us to identify the most relevant
attributes in relation to the output parameter, that is, the attribute to be estimated
(e.g., work effort). The number of instances or samples in the dataset is reduced by selecting
those that contribute to a better accuracy of the estimates after applying a version of the M5
(Quinlan, 1992) algorithm, called M5P, implemented in the Weka toolkit (Witten & Frank,
1999) to four datasets generated from the ISBSG repository. We compare the outputs before
and after characterizing the database, using two algorithms provided by Weka: multivariate
linear regression (MLR) and least median squares (LMS).
This chapter is organized as follows: the Techniques Applied section presents the data mining
algorithms; The Datasets section describes the datasets used; and the Evaluation of the
Techniques and Characterization of Software Engineering Datasets section discusses the
approach to characterize the database followed by an evaluation of the results. Finally, the
Conclusions section ends the chapter.
Figure 1. Characterizing dataset for producing better estimates
[Figure: a project database with attributes A1 … An and instances a11 … amn is reduced to a dataset with fewer attributes and instances.]
Techniques Applied
Many software engineering problems like cost estimation and forecasting can be viewed
as classification problems. A classifier resembles a function in the sense that it attaches a
value (or a range or a description), named the class, C, to a set of attribute values A1, A2,
…, An, that is, a classification function will assign a class to a set of descriptions based on the
characteristics of the instances for each attribute. For example, as shown in Table 1, given
the attributes size, complexity, and so forth, a classifier can be used to predict the effort.
Table 1. Example of attributes and class in software engineering repository
A1 - Size   …   An - Complexity   C - Effort
a11         …   a1n               c1
…           …   …                 …
am1         …   amn               cm
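The classifier-as-function view of Table 1 can be illustrated with a toy sketch; the rule and classes below are invented purely for illustration, and only the 389.5 function-point threshold echoes the M5P tree shown later in Figure 2.

```python
def classify(size_fp, complexity):
    """Assign a coarse effort class from two attribute values
    (an invented rule, for illustration only)."""
    if size_fp <= 389.5:
        return "low effort"
    return "high effort" if complexity > 5 else "medium effort"

print(classify(150, 3))      # a small project falls in the low-effort class
```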
In this chapter, we have applied data mining, that is, computational techniques and tools
designed to support the extraction, in an automatic way, of information useful for decision
support or exploration of the data source (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).
Since data may not be organized in a way that facilitates the extraction of useful information,
typical data mining processes are composed of the following steps:
• Data Preparation: The data is formatted in a way that tools can manipulate it, merged
from different databases, and so forth.
• Data Mining: It is in this step that the automated extraction of knowledge from
the data is carried out. Examples of such algorithms and some usual representations
include: C4.5 or M5 for decision trees, regression, and so forth.
• Proper Interpretation of the Results: Including the use of visualization tech-
niques.
• Assimilation of the Results.
Within the available data mining algorithms, we have used the M5 and linear regression
classifiers implemented in the Weka toolkit, which have been used to select instances of a software
engineering repository. The next subsections explain these techniques in more detail.
M5 and M5P
The main problem in linear regression is that the attributes must be numeric so that the
model obtained will also be numeric (simple equations in n dimensions). As a solution to
this problem, decision trees have been used in data mining for a long time as a supervised
learning technique (models are learned from data). A decision tree divides the attribute space
into clusters with two main advantages. First, each cluster is clearly defined in the sense that
new instances are easily assigned to a cluster (leaf of the tree). The second benefit is that the
trees are easily understandable by users in general and by project managers in particular.
Each branch of the tree has a condition of the form attribute ≤ value or attribute
> value, which serves to make selections until a leaf is reached. Such conditions are frequently
used by experts in all sciences in decision making.
Decision trees are divided into regression trees, in which each leaf represents the average value
of the instances that are covered by the leaf, and model trees, in which each leaf is a
regression model. Examples of decision trees include a system called CART (Classification
and Regression Trees) developed by Breiman (1984), ID3 (Quinlan, 1986), improved into
C4.5 (Quinlan, 1993), and M5 (Quinlan, 1992), with the difference that in M5 the leaves
represent linear regressions rather than discrete classes.
The M5 algorithm, the most commonly used classifier of this family, builds regression trees
whose leaves are composed of multivariate linear models, and the nodes of the tree are chosen
over the attribute that maximizes the expected error reduction as a function of the standard
deviation of the output parameter. In this work, we have used the M5 algorithm implemented
in the Weka toolkit (Witten & Frank, 1999), called M5P. Figure 2 shows Weka’s output
for the M5P algorithm for one of the datasets that we used for this chapter. In this case,
the M5P algorithm created 17 clusters, from LM1 to LM17. The normalized work effort
(NormWorkEff) is the dependent variable, and a different linear model is applied depending
on the number of Function Points (FP) and productivity (NormPDR). The clusters found
can assign to the dependent variable either a constant or a linear equation (in the majority of
the cases); in this case, each cluster or region is associated with linear equations (Figure 2,
right column). In the example shown in Figure 2, the M5P algorithm created 17 leaves, and
we will use FP and NormPDR to select the appropriate linear model. In this case, the tree
Figure 2. Weka’s M5P output
FP <= 389.5 :
| NormPDR <= 16.5 :
| | NormPDR <= 6.65 : LM1 (305/8.331%)
| | NormPDR > 6.65 : LM2 (318/5.402%)
| NormPDR > 16.5 :
| | NormPDR <= 37.25 :
| | | FP <= 147 : LM3 (124/5.751%)
| | | FP > 147 : LM4 (107/10.878%)
| | NormPDR > 37.25 :
| | | FP <= 161 :
| | | | NormPDR <= 56.9 : LM5 (34/8.323%)
| | | | NormPDR > 56.9 :
| | | | | FP <= 57 : LM6 (9/17.555%)
| | | | | FP > 57 : LM7 (34/27.858%)
| | | FP > 161 : LM8 (37/58.384%)
FP > 389.5 :
| NormPDR <= 8.05 :
| | FP <= 1520.5 :
| | | NormPDR <= 2.65 : LM9 (62/3.559%)
| | | NormPDR > 2.65 :
| | | | FP <= 610 : LM10 (50/6.205%)
| | | | FP > 610 :
| | | | | NormPDR <= 5.7 : LM11 (38/5.155%)
| | | | | NormPDR > 5.7 : LM12 (29/3.989%)
| | FP > 1520.5 : LM13 (47/23.679%)
| NormPDR > 8.05 :
| | NormPDR <= 28.75 :
| | | FP <= 1411.5 :
| | | | FP <= 674 : LM14 (67/11.599%)
| | | | FP > 674 : LM15 (45/26.176%)
| | | FP > 1411.5 : LM16 (44/79.363%)
| | NormPDR > 28.75 : LM17 (40/160.629%)
Where:
LM num: 1
NormWorkEff =
0.6701 * FP
+ 5.8478 * MaxTeamSize
+ 16.1174 * DevType=New_Development, Re-development
+ 16.4605 * DevPlatf=MR
+ 7.5693 * ProjElapTime
- 9.4635 * ProjInactiveTime
+ 0.4684 * TotalDefectsDelivered
+ 12.4199 * NormPDR
+ 461.1827
LM num: 2
…
LM num: 17
NormWorkEff =
22.1179 * FP
- 15457.3164 * VAF
+ 36.5098 * MaxTeamSize
- 634.6502 * DevType=New_Development,Re-development
+ 37.0267 * DevPlatf=MR
+ 1050.5217 * LangType=2GL,3GL,ApG
+ 328.1218 * ProjElapTime
- 90.7468 * ProjInactiveTime
+ 370.2088 * NormPDR
+ 5913.1867
generated is composed of a large number of leaves divided by the same variables at different
levels. The tree can be simplified by adding a restriction on the minimum number of
instances covered by each leaf; for example, requiring at least 100 instances per leaf
generates a simpler but less accurate tree.
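The effect of such a minimum-instances-per-leaf restriction can be sketched with scikit-learn's DecisionTreeRegressor. This is only an analogue of M5P (its leaves hold constant values rather than linear models), run on synthetic stand-in data, not the ISBSG repository.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 2000, size=(500, 2))          # stand-ins for FP, NormPDR
y = 0.7 * X[:, 0] + 12.0 * X[:, 1] + rng.normal(0, 50, 500)

small_leaves = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)
large_leaves = DecisionTreeRegressor(min_samples_leaf=100, random_state=0).fit(X, y)

# Requiring at least 100 instances per leaf produces far fewer leaves,
# at the cost of a worse fit on the training data.
print(small_leaves.get_n_leaves(), large_leaves.get_n_leaves())
print(small_leaves.score(X, y) > large_leaves.score(X, y))
```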
Figure 3 also shows the tree in a graphical way. Each leaf of the tree provides further
information within brackets. For example, for LM1, there are 305 instances, and the approximate
error in that leaf is 8.331%.
Constructing the M5 Decision Tree
Regarding the construction of the tree, M5 needs three steps. The first step generates a
regression tree using the training data. It calculates a linear model (using linear regression)
for each node of the tree generated. The second step tries to simplify the regression tree
generated in the previous step (first post-pruning), deleting from the linear models those
attributes that do not reduce the error. The aim of the third step is to reduce the size of
the tree without reducing the accuracy (second post-pruning). To increase the efficiency, M5
does the last two steps at the same time so that the tree is parsed only once. This reduces
both the number of nodes and simplifies the nodes themselves.
As mentioned previously, M5 first calculates a regression tree that minimizes the variation
of the values in the instances that fall into the leaves of the tree. Afterwards, it generates a
linear model for each of the nodes of the tree. In the next step, it simplifies the linear models
of each node by deleting those attributes that do not reduce the classification error when
they are eliminated. Finally, it simplifies the regression tree by eliminating subtrees under
the intermediate nodes. They are the nodes whose classification error is greater than the
classification error given by the linear model corresponding to those intermediate nodes. In
this way, taking a set of learning instances E and a set of attributes A, a simplified version
of the M5 algorithm will be as follows:
Figure 3. Graphical view of the M5P tree
Proc_M5(E, A)
begin
  R := create-node-tree-regression()
  R := create-tree-regression(E, A, R)
  R := simplify-linear-models(E, R)
  R := simplify-regression-tree(E, R)
  return R
end
The regression tree, R, is created in a divide-and-conquer manner; the three functions (create-tree-regression, simplify-linear-models, and simplify-regression-tree) are called recursively
after the root node of the regression tree is created by create-node-tree-regression.
Once the tree has been built, a linear model for each node is calculated and the leaves of the
tree are pruned if the error decreases. The error for each node is the average of the difference
between the predicted value and the actual value of each instance of the training set
that reaches the node. This difference is calculated in absolute terms. This error is weighted
according to the number of instances that reach that node. This process is repeated until all
the examples are covered by one or more rules.
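This node-error computation can be sketched as follows. The sketch is a simplified reading of the description above (mean absolute difference, weighted by the share of training instances reaching the node), not Weka's exact implementation.

```python
import numpy as np

def node_error(actual, predicted, n_total):
    """Mean absolute error at a node, weighted by the share of the
    n_total training instances that reach it."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mae = np.mean(np.abs(actual - predicted))
    return (len(actual) / n_total) * mae

# Three instances reach this node out of 100 in the training set.
print(node_error([10, 20, 30], [12, 18, 33], n_total=100))
```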
Transformation of Nominal Attributes
Before building the tree, all non-numeric attributes are transformed into binary variables so
that they can be treated as numeric attributes. A variable with k values is transformed into
k-1 binary variables. This transformation is based on an observation by Breiman: once the
values have been sorted, the best split in a node on a variable with k values is one of the k-1
possible splits.
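A generic k-1 binary encoding can be sketched with pandas. Note this is only an illustration of the k-to-(k-1) expansion; M5 additionally orders the k values before binarizing, which a plain dummy encoding with a dropped first column does not do.

```python
import pandas as pd

# LangType takes k = 3 values here; drop_first yields k - 1 = 2 indicators.
projects = pd.DataFrame({"LangType": ["2GL", "3GL", "4GL", "3GL"]})
binary = pd.get_dummies(projects["LangType"], drop_first=True)

print(list(binary.columns))
```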
Missing Values
A quite common problem with real datasets occurs when the value of a splitting attribute
does not exist. Once the attribute is selected as a splitting variable to divide the dataset into
subsets, the value of this attribute must be known. To solve this problem, the attribute whose
value does not exist is replaced by the value of another attribute that is correlated with it. A
simpler solution is to use, as the value of the selected attribute, either the predicted value
or the average value of that attribute over the instances in the set.
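The averaging fallback can be sketched as simple mean imputation; this illustrates the idea only, not M5's exact surrogate-attribute mechanism.

```python
import numpy as np

def impute_mean(column):
    """Fill missing (NaN) entries with the mean of the observed values."""
    column = np.asarray(column, dtype=float)
    observed = column[~np.isnan(column)]
    column[np.isnan(column)] = observed.mean()
    return column

# The missing function-point count becomes the mean of the other three.
fp = impute_mean([389.5, np.nan, 610.0, 1520.5])
print(fp[1])
```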
Heuristics
The split criterion of the branches in the tree in M5 is given by the heuristic used to select
the best attribute in each new branch. For this task, M5 uses the standard deviation as a
measure of the error in each node. First, the error decrease for each attribute used as splitting
point is calculated.
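The standard-deviation-based heuristic can be sketched as the standard deviation reduction (SDR) of a candidate split: the drop in spread of the output parameter achieved by dividing the instances in two.

```python
import numpy as np

def sdr(y, left_mask):
    """Standard deviation reduction achieved by splitting targets y
    into y[left_mask] and y[~left_mask]."""
    y = np.asarray(y, dtype=float)
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    return (y.std()
            - (len(left) / len(y)) * left.std()
            - (len(right) / len(y)) * right.std())

# A split separating low-effort from high-effort projects reduces the
# spread far more than an arbitrary split does.
y = np.array([100, 110, 120, 900, 950, 1000], dtype=float)
good = sdr(y, np.array([True, True, True, False, False, False]))
poor = sdr(y, np.array([True, False, True, False, True, False]))
print(good > poor)
```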
TEAM LinG
A Two-Stage Zone Regression Method for Global Characterization 7
Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission
of Idea Group Inc. is prohibited.
Smoothing
In the final stage, a regularization process is applied to compensate for discontinuities among
adjacent linear models in the leaves of the tree. This process is started once the tree has
been pruned and especially for models based on training sets containing a small number of
instances. This smoothing process usually improves the prediction obtained.
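In Quinlan's description of M5, the smoothing step blends a node's prediction p with the prediction q of the model at its parent as p' = (n*p + k*q) / (n + k), where n is the number of training instances reaching the node and k is a smoothing constant (15 in Quinlan's M5). A minimal sketch:

```python
def smooth(p, q, n, k=15.0):
    """Blend a node's prediction p with its parent's prediction q,
    weighting by the n instances that reach the node."""
    return (n * p + k * q) / (n + k)

# With few instances the parent model dominates; with many, the leaf does.
print(smooth(100.0, 200.0, n=5))     # pulled strongly toward the parent
print(smooth(100.0, 200.0, n=500))   # stays close to the leaf prediction
```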
Linear Regression and Least Median Squares
Linear regression (LR) is the classical linear regression model. It is assumed that there is a
linear relationship between a dependent variable (e.g., effort) and a set of independent
variables, that is, attributes (e.g., size in function points, team size, development platform,
etc.). The aim is to adjust the data to a model so that
y = β0 + β1x1 + β2x2 + … + βkxk + e.
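Fitting such a model by ordinary least squares can be sketched with NumPy; the data here are synthetic, with known coefficients recovered up to noise.

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.1, 100)

# Prepend a column of ones so the intercept beta0 is estimated as well.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.round(beta, 1))   # close to the true coefficients [3, 2, -1]
```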
Least median squares (LMS) is a robust regression technique that includes outlier detection
(Rousseeuw & Leroy, 1987) by minimizing the median of the squared residuals rather than their mean.
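A hedged sketch of the LMS idea for a single predictor follows: fit candidate lines through random pairs of points and keep the one with the smallest median squared residual, so a few outliers cannot drag the line away as they would under ordinary least squares. This is a simplified version of the resampling approach of Rousseeuw and Leroy, not Weka's implementation.

```python
import numpy as np

def lms_line(x, y, trials=500, seed=0):
    """Fit y = b0 + b1*x by minimizing the median of squared residuals."""
    rng = np.random.RandomState(seed)
    best, best_med = (0.0, 0.0), np.inf
    for _ in range(trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        med = np.median((y - (b0 + b1 * x)) ** 2)
        if med < best_med:
            best, best_med = (b0, b1), med
    return best

x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[-3:] = 0.0                      # three gross outliers
b0, b1 = lms_line(x, y)
print(round(b1, 2))               # slope stays near 2 despite the outliers
```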
Goodness of fit of the linear models is usually measured by the correlation, the coefficient
of multiple determination R², and by the mean squared error. However, in the software
engineering domain, the mean magnitude of relative error (MMRE) and prediction at level
l—Pred (l)—are well known techniques for evaluating the goodness of fit in the estimation
methods (see the Evaluation of the Techniques and Characterization of Software Engineering
Datasets section).
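The two measures can be sketched directly from their usual definitions: MMRE averages the relative errors |actual - predicted| / actual, and Pred(l) counts the fraction of estimates whose relative error is at most l.

```python
import numpy as np

def mmre(actual, predicted):
    """Mean magnitude of relative error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted) / actual)

def pred(actual, predicted, l=0.25):
    """Fraction of estimates whose relative error is at most l."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted) / actual <= l)

actual = [100.0, 200.0, 400.0, 1000.0]
predicted = [110.0, 150.0, 420.0, 990.0]
print(mmre(actual, predicted))        # mean relative error of the estimates
print(pred(actual, predicted, 0.25))  # share of estimates within 25%
```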
The Datasets
The International Software Benchmarking Standards Group (ISBSG), a non-profit organiza-
tion, maintains a software project management repository from a variety of organizations.
The ISBSG checks the validity and provides benchmarking information to companies
submitting data to the repository. Furthermore, it seems that the data is collected from large
and successful organizations. In general, such organizations have mature processes and
well-established data collection procedures. In this work, we have used the “ISBSG release
no. 8”, which contains 2,028 projects and more than 55 attributes per project. The attributes
can be classified as follows:
• Project context, such as type of organization, business area, and type of develop-
ment;
• Product characteristics, such as application type and user base;
• Development characteristics, such as development platform, languages, tools, and so
forth;
• Project size data, which comprises different types of function points, such as IFPUG (2001),
COSMIC (2004), and so forth; and
• Qualitative factors such as experience, use of methodologies, and so forth.
Before using the dataset, there are a number of issues to be taken into consideration. An
important attribute is the quality rating given by the ISBSG: its range varies from A (where
the submission satisfies all criteria for seemingly sound data) to D (where the data has some
fundamental shortcomings). According to ISBSG, only projects classified as A or B should
be used for statistical analysis. Also, many attributes in ISBSG are categorical or multi-class
attributes that need to be pre-processed for this work (e.g., the values of the project scope
attribute, which indicates what tasks were included in the project work effort—planning,
specification, design, build, and test—were grouped). Another problem with some attributes is the
large number of missing values. Therefore, in all datasets with the exception of the “reality
dataset”, we have had to do some pre-processing. We selected some attributes and instances
manually. There are quite a large number of variables in the original dataset that we did not
consider relevant or that had too many missing values to be considered in the data mining
process. From the original database, we only considered the IFPUG estimation technique
and those that can be considered very close variations of IFPUG such as NESMA.
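The kind of filtering described can be sketched with pandas. The column names and values below are invented for illustration; the actual ISBSG release uses its own attribute names.

```python
import pandas as pd

# Hypothetical sample: keep only quality ratings A/B and IFPUG-like counts.
raw = pd.DataFrame({
    "DataQualityRating": ["A", "B", "C", "A", "D"],
    "CountApproach":     ["IFPUG", "NESMA", "IFPUG", "COSMIC", "IFPUG"],
})
keep = raw[raw["DataQualityRating"].isin(["A", "B"])
           & raw["CountApproach"].isin(["IFPUG", "NESMA"])]
print(len(keep))   # only the rows passing both filters survive
```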
We have used four datasets selecting different attributes including the one provided in the
“reality tool” by ISBSG. In our study, we have selected NormalisedWorkEffort or Summa-
ryWorkEffort as dependent variables. The normalized work effort is an estimate of the full
development life cycle effort for those projects covering less than a full development life
cycle while the summary work effort is the actual work effort carried out by the project. For
projects covering the full development life cycle and projects where the development life
cycle coverage is not known, these values are the same, that is, work effort reported. When
the variable summary work effort is used, the dataset includes whether each of the life cycle
phases was carried out, such as planning, specification, building, and testing.
DS1: The reality dataset is composed of 709 instances and 6 attributes (DevelopmentType,
DevelopmentPlatform, LanguageType, ProjectElapsedTime, NormalisedWorkEffort,
UnadjustedFunctionPoints). The dependent variable for this dataset is the Normalised-
WorkEffort.
DS2: The dataset DS2 is composed of 1,390 instances and 15 attributes (FP, VAF, Max-
TeamSize, DevType, DevPlatf, LangType, DBMUsed, MethodUsed, ProjElapTime,
ProjInactiveTime, PackageCustomisation, RatioWEProNonPro, TotalDefectsDe-
livered, NormWorkEff, NormPDR). The dependent variable for this dataset is the
NormalisedWorkEffort.
DS3. The dataset DS3 is composed of 1,390 instances and 19 attributes (FP, SummWorkEffort,
MaxTeamSize, DevType, DevPlatf, LangType, DBMUsed, MethodUsed, ProjElapTime,
ProjInactiveTime, PackageCustomisation, Planning, Specification, Build, Test, Impl,
RatioWEProNonPro, TotalDefectsDelivered, ReportedPDRAdj). In this case, we did
consider the software life cycle attributes (Planning, Specification, Build, Impl, Test),