MINISTRY OF NATIONAL DEFENSE
MILITARY TECHNICAL ACADEMY
NGUYEN THI HIEN
A STUDY OF GENERALIZATION
IN GENETIC PROGRAMMING
Specialized in: Applied Mathematics and Computer Science
Code: 60 46 01 10
SUMMARY OF THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
IN MATHEMATICS
Hanoi –2014
THE THESIS IS SUBMITTED TO MILITARY TECHNICAL ACADEMY -
MINISTRY OF NATIONAL DEFENSE
Academic supervisors:
1. Assoc/Prof. Dr. Nguyen Xuan Hoai
2. Prof. Dr. R.I. (Bob) McKay
Reviewer 1: Prof. Dr. Vu Duc Thi
Reviewer 2: Assoc/Prof. Dr. Nguyen Duc Nghia
Reviewer 3: Dr. Nguyen Van Vinh
The thesis was evaluated by the examination board of the academy, established by decision number 1767/QĐ-HV, 1st July 2014, of the Rector of Military Technical Academy, meeting at Military Technical Academy on day month year.
This thesis can be found at:
- Library of Military Technical Academy
- National Library of Vietnam
INTRODUCTION
The genetic programming (GP) paradigm, first proposed by Koza
(1992), can be viewed as the discovery of computer programs which
produce desired outputs for particular inputs. Despite advances
in GP research, when GP is applied to learning tasks, the issue
of its generalization has not received the attention it deserves.
This thesis focuses on the generalization aspect of GP
and proposes mechanisms to improve GP's generalization ability.
1. Motivations
GP has been proposed as a machine learning (ML) method. The
main goal when using GP, or any other ML technique, is not only to
create a program that exactly covers the training examples, but
also to achieve good generalization. Although the issue of
generalization in GP has recently received more attention, the
use of more traditional ML techniques in GP has been
rather limited. It is hoped that adapting ML techniques to GP
will help improve GP's generalization ability.
2. Research Perspectives
The majority of theoretical work in GP has been derived from
experimentation. The approach taken in this thesis is likewise based
on carefully designed experiments and the analysis of experimental
results.
3. Contributions of the Thesis
This thesis makes the following contributions:
1) An early stopping approach for GP, based on an estimate of
the generalization error, together with some new stopping
criteria for GP. The GP training process is stopped early at
the point that promises to yield the solution with the
smallest generalization error.
2) A progressive sampling framework for GP, which divides the GP
learning process into several layers, with the training set size
starting small and gradually growing at each layer.
Training on each layer terminates when certain early stopping
criteria are met.
4. Organization of the Thesis
Chapter 1 first introduces the basic components of GP as well as
some benchmark problem domains. Two important research issues
and a metaphor of GP search are then discussed, followed by the
major issues that arise when GP is considered as an ML system.
The chapter overviews the approaches proposed in the literature
to improve the generalization ability of GP, and then discusses
solution complexity (code bloat) in the particular context of GP
and its relation to GP learning performance. Chapter 2 provides
background on a number of concepts and techniques subsequently
used in the thesis for improving GP learning performance:
progressive sampling, layered learning, and early stopping in ML.
Chapter 3 presents one of the main contributions of the thesis:
it proposes criteria for deciding when to stop GP training so as
to avoid over-fitting. Chapter 4 develops a learning framework for
GP based on layered learning and progressive sampling. A final
section concludes the thesis, summarizing the main results and
proposing directions for future work extending the research
in this thesis.
Chapter 1
BACKGROUNDS
This chapter describes the representation and the specific al-
gorithm components used in the canonical version of GP. The
chapter ends with a comprehensive overview of the literature on
the approaches used to improve GP generalization ability.
1.1. Genetic Programming
The basic algorithm is as follows:
1) Initialise a population of solutions
2) Assign a fitness to each population member
3) While the Stopping criterion is not met
4) Produce new individuals using operators and the existing
population
5) Place new individuals into the population
6) Assign a fitness to each population member, and test for the
Stopping criterion
7) Return the best fitness found
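The loop above can be sketched as a minimal tree-based GP for a toy symbolic-regression task. Everything here (the primitive set, subtree mutation standing in for the full operator suite, population size, and generation limit) is an illustrative assumption, not the thesis's actual system:

```python
import random
import operator

random.seed(1)
FUNCS = {'+': operator.add, '-': operator.sub, '*': operator.mul}
TERMS = ['x', 1.0]

def gen_tree(depth):
    """Random expression tree: a terminal leaf, or [op, left, right]."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    return [random.choice(list(FUNCS)), gen_tree(depth - 1), gen_tree(depth - 1)]

def evaluate(node, x):
    if node == 'x':
        return x
    if isinstance(node, float):
        return node
    op, left, right = node
    return FUNCS[op](evaluate(left, x), evaluate(right, x))

# Fitness cases for the toy target f(x) = x^2 + x.
CASES = [(x / 10.0, (x / 10.0) ** 2 + x / 10.0) for x in range(-10, 11)]

def fitness(tree):
    """Mean absolute error over the fitness cases (lower is better)."""
    return sum(abs(evaluate(tree, x) - y) for x, y in CASES) / len(CASES)

def paths(tree, path=()):
    """Yield the path (child indices) to every node in the tree."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, path + (i,))

def replace_subtree(tree, path, sub):
    if not path:
        return sub
    node = list(tree)
    node[path[0]] = replace_subtree(node[path[0]], path[1:], sub)
    return node

def mutate(tree):
    """Subtree mutation: replace a random node with a fresh random subtree."""
    return replace_subtree(tree, random.choice(list(paths(tree))), gen_tree(2))

def tournament(pop, k=3):
    return min(random.sample(pop, k), key=fitness)

# Steps 1-7: initialise, evaluate, vary until a stopping criterion is met.
pop = [gen_tree(3) for _ in range(60)]
init_err = min(fitness(t) for t in pop)
for gen in range(30):                      # stopping criterion: max generations
    best = min(pop, key=fitness)
    if fitness(best) < 1e-6:               # or: an ideal individual was found
        break
    pop = [best] + [mutate(tournament(pop)) for _ in range(len(pop) - 1)]
final_err = fitness(min(pop, key=fitness))
```

With the elitism used here (the current best is always retained), the best error is non-increasing across generations.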
1.1.1. Representation, Initialization and Operators in GP
1.1.2. Representation
GP programs are expressed as expression trees. The variables
and constants in the program are called terminals in GP, which
are the leaves of the tree. The arithmetic operations are internal
nodes (called functions in the GP literature). The sets of allowed
functions and terminals together form the primitive symbol set of
a GP system.
1.1.3. Initializing the Population
The ramped half-and-half method is the most commonly used tree
initialisation method. It was introduced by Koza and probabilistically
selects between two recursive tree-generating methods:
Grow and Full.
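A minimal sketch of ramped half-and-half, assuming a small illustrative primitive set (the terminal-selection probability in Grow is one common choice, not the only one):

```python
import random

random.seed(0)
FUNCTIONS = ['+', '-', '*']
TERMINALS = ['x', '1']

def full(depth):
    """Full: every branch extends to the maximum depth."""
    if depth == 0:
        return random.choice(TERMINALS)
    return [random.choice(FUNCTIONS), full(depth - 1), full(depth - 1)]

def grow(depth):
    """Grow: branches may terminate early with a terminal."""
    p_term = len(TERMINALS) / (len(TERMINALS) + len(FUNCTIONS))
    if depth == 0 or random.random() < p_term:
        return random.choice(TERMINALS)
    return [random.choice(FUNCTIONS), grow(depth - 1), grow(depth - 1)]

def tree_depth(tree):
    if not isinstance(tree, list):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

def ramped_half_and_half(pop_size, min_depth=2, max_depth=6):
    depths = list(range(min_depth, max_depth + 1))
    pop = []
    for i in range(pop_size):
        depth = depths[i % len(depths)]        # "ramp" over the depth range
        method = full if i % 2 == 0 else grow  # "half-and-half"
        pop.append(method(depth))
    return pop

pop = ramped_half_and_half(100)
```

The ramp spreads initial tree depths over a range, giving a structurally diverse starting population.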
1.1.4. Fitness and Selection
Fitness is the measure used by GP to indicate how well a program
has learned to predict the output(s) from the input(s). Error
fitness functions and squared-error fitness functions are commonly
used in GP. The most commonly employed method for selecting
individuals in GP is tournament selection.
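Tournament selection with a mean-absolute-error fitness can be sketched as follows (the candidate "programs" and data are toy stand-ins, not from the thesis):

```python
import random

random.seed(0)

def mae_fitness(predict, cases):
    """Error fitness: mean absolute error over fitness cases (lower = better)."""
    return sum(abs(predict(x) - y) for x, y in cases) / len(cases)

def tournament_select(population, fitnesses, size=3):
    """Sample `size` contestants uniformly; the lowest-error one wins."""
    contestants = random.sample(range(len(population)), size)
    return min(contestants, key=lambda i: fitnesses[i])

# Toy candidate "programs": linear functions a*x; the target is 2*x.
cases = [(x, 2 * x) for x in range(10)]
population = [lambda x, a=a: a * x for a in range(5)]
fitnesses = [mae_fitness(p, cases) for p in population]
winner_idx = tournament_select(population, fitnesses, size=5)
```

With `size=5` the tournament here covers the whole population, so it deterministically returns the global best; smaller tournament sizes give weaker selection pressure.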
1.1.5. Recombination
• Choose two individuals as parents, based on the reproduction
selection strategy.
• Select a random subtree in each parent; subtrees consisting of
terminals are selected with lower probability than
other subtrees.
• Swap the selected subtrees between the two parents. The
resulting individuals are the children.
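The steps above can be sketched as subtree crossover on list-encoded trees (`[op, left, right]` for internal nodes, strings for leaves). The 10% leaf-selection probability is the common Koza-style bias, assumed here for illustration:

```python
import copy
import random

random.seed(3)

def all_nodes(tree, path=()):
    """Yield (path, subtree) for every node; internal nodes are lists."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_nodes(child, path + (i,))

def pick_point(tree, leaf_prob=0.1):
    """Bias point selection toward internal nodes (terminals chosen rarely)."""
    nodes = list(all_nodes(tree))
    leaves = [n for n in nodes if not isinstance(n[1], list)]
    internals = [n for n in nodes if isinstance(n[1], list)]
    pool = leaves if (not internals or random.random() < leaf_prob) else internals
    return random.choice(pool)

def replace_at(tree, path, subtree):
    """Rebuild the tree with `subtree` grafted at `path` (parents untouched)."""
    if not path:
        return copy.deepcopy(subtree)
    node = list(tree)
    node[path[0]] = replace_at(node[path[0]], path[1:], subtree)
    return node

def crossover(parent_a, parent_b):
    path_a, sub_a = pick_point(parent_a)
    path_b, sub_b = pick_point(parent_b)
    return (replace_at(parent_a, path_a, sub_b),
            replace_at(parent_b, path_b, sub_a))

a = ['+', ['*', 'x', 'x'], '1']
b = ['-', 'x', ['+', '1', 'x']]
child1, child2 = crossover(a, b)
```

The rebuild-along-the-path style leaves the parents unmodified, so they can be reused elsewhere in the population.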
1.1.6. Mutation
An individual is selected for mutation using fitness-proportional
selection. A single function or terminal is selected at random from
among the set of functions and terminals making up the original
individual as the point of mutation. The mutation point, along
with the subtree stemming from it, is then removed
from the tree and replaced with a new, randomly generated
subtree.
1.1.7. Reproduction
The reproduction operator copies an individual from one popula-
tion into the next.
1.1.8. Stopping Criterion
A maximum number of generations usually defines the stopping
criterion in GP. However, achieving an ideal individual can also
stop the GP evolutionary process. In this
thesis, some other criteria are proposed, such as a measure of
generalization loss, a lack of fitness improvement, or a run of
generations with over-fitting.
1.1.9. Some Variants of GP
The GP community has proposed numerous different approaches
to program evolution: Linear GP, Graph Based GP, Grammar
Based GP.
1.2. An Example of Problem Domain
This thesis uses the 10 target functions in Table 1.1, with white
(Gaussian) noise of standard deviation 0.01; that is, each target
function is F(x) = f(x) + ε. Four more real-world data sets, from
the UCI ML repository and StatLib, are also used, as shown in
Table 1.2.
1.3. GP and Machine Learning Issues
This section explores some open questions in GP research from an
ML perspective, which motivate the research in this thesis.
1                F1(x) = x^4 + x^3 + x^2 + x
2                F2(x) = cos(3x)
3                F3(x) = sqrt(x)
4                F4(x) = x1 x2 + sin((x1 − 1)(x2 − 1))
5                F5(x) = x1^4 − x1^3 + x2^2/2 − x2
Friedman1        F6(x) = 10 sin(π x1 x2) + 20(x3 − 0.5)^2 + 10 x4 + 5 x5
Friedman2        F7(x) = sqrt(x1^2 + (x2 x3 − 1/(x2 x4))^2)
Gabor            F8(x) = (π/2) e^(−2(x1^2 + x2^2)) cos[2π(x1 + x2)]
Multi            F9(x) = 0.79 + 1.27 x1 x2 + 1.56 x1 x4 + 3.42 x2 x5 + 2.06 x3 x4 x5
3-D Mexican Hat  F10(x) = sin(sqrt(x1^2 + x2^2)) / sqrt(x1^2 + x2^2)
Table 1.1: Test Functions
Data sets Features Size Source
Concrete Compressive Strength 9 1030 UCI
Pollen 5 3848 StatLib
Chscase.census6 7 400 StatLib
No2 8 500 StatLib
Table 1.2: The real-world data sets
1.3.1. Search Space
GP is a search technique that explores the space of computer
programs, a space that changes with the size of the programs
considered. In GP implementations, the maximum depth of a tree is
not the only parameter that limits the search space; an alternative
is the maximum number of nodes of an individual, or both limits
(depth and size) may be used.
1.3.2. Bloat
In the course of evolution, the average size of the individuals in a
GP population often increases substantially. Typically, this increase
in program size is not accompanied by any corresponding increase
in fitness. The origin of this phenomenon, known as bloat,
has been a subject of research for over a decade.
1.3.3. Generalization and Complexity
Most research on improving the generalization ability of GP
has concentrated on avoiding over-fitting of the training data. One
of the main causes of over-fitting has been identified as the
"complexity" of the hypothesis generated by the learning algorithm.
When GP trees grow to fit, or specialize on, "difficult" learning
cases, they lose the ability to generalize further; this
is entirely consistent with the principle of Occam's razor, which
states that simpler solutions should be preferred. The
relationship between generalization and individual complexity in GP
has often been studied in the context of code bloat: the
extraordinary enlargement of the complexity of solutions without
a corresponding increase in their fitness.
1.4. Generalization in GP
1.4.1. Overview of Studies on GP Generalization Capability
Common to most of this research are the problems of obtaining
generalization in GP and attempts to overcome them.
These approaches can be categorized as follows:
• Using separate training and test sets in order to promote
generalization in supervised learning problems.
• Changing the training instances from one generation to another,
evaluating performance on subsets of training
instances, or combining GP with
ensemble learning techniques.
• Changing the formal implementation of GP: its representation,
genetic operators, and selection strategy.
1.4.2. Problem of Improving Training Efficiency
To derive and evaluate a model that uses GP as the core learning
component, we seek to answer the following questions:
1) Performance: How sensitive is the model to changes in the
learning problem, or in the initial conditions?
2) Complexity (the effective size of the solution): the principle
of Occam's razor states that, among equally good solutions, the
simpler one is preferable.
3) Training time: How long will the learning take, i.e., how fast
or slow is the training phase?
Chapter 2
EARLY STOPPING, PROGRESSIVE SAMPLING,
AND LAYERED LEARNING
Early stopping, progressive sampling, and layered learning are
techniques for enhancing learning performance in ML. They
have been used in combination with a number of ML techniques.
2.1. Over-training, Early Stopping and Regularization
The over-training problem is common in the field of ML and relates
to the learning process of learning machines. Over-training has
been an open topic of discussion, motivating the proposal of
several techniques such as regularization and early stopping.
2.1.1. Over-training
When the number of training samples is infinitely large and they
are unbiased, the parameters of an ML system converge to one of the
local minima of the specified risk function (expected loss). When
the number of training samples is finite, the true risk function
differs from the empirical risk function that is minimized. Thus,
since the training samples are biased, the parameters of a learning
machine may converge to a biased solution. This is known as
over-fitting or over-training in ML.
2.1.2. Early Stopping
During the training phase of a learning machine, the generalization
error may decrease in an early period, reach a minimum,
and then increase as training goes on, while the training error
decreases monotonically. It is therefore considered better to
stop training at an adequate time; the class of techniques based
on this observation is referred to as early stopping.
2.1.3. Regularization Learning
Regularization is important for ML as it relates to over-fitting.
Regularization is based on the idea that one does not only want
to minimize the error between the model and the data but also
the complexity of the model. Methods of regularization balance
degree of fit with model complexity, which leads to simpler models
that still fit the data without over-fitting. This idea is closely
related to Occam’s razor. Note, however, that Occam’s razor does
not claim that the simplest solution is always the best one:
sometimes a more complex solution can explain the data better than
a simpler one. Occam’s razor recognizes the trade-off between
explanatory power and complexity, and says we should not increase
the complexity of the solution unless it is necessary
to do so.
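In a GP setting, this balance can be sketched as a penalized score, where fit and complexity are traded off; the node-count complexity measure and the penalty weight `lam` are illustrative choices, not the thesis's formulation:

```python
def tree_size(tree):
    """Complexity measure: node count of a list-encoded expression tree."""
    if not isinstance(tree, list):
        return 1
    return 1 + sum(tree_size(child) for child in tree[1:])

def regularized_score(error, tree, lam=0.01):
    """Smaller is better: data fit (error) plus a complexity penalty."""
    return error + lam * tree_size(tree)

small = ['+', 'x', '1']                          # 3 nodes
big = ['+', ['*', 'x', ['+', 'x', '1']], '1']    # 7 nodes
score_small = regularized_score(0.5, small)
score_big = regularized_score(0.5, big)
```

With equal training error, the simpler tree wins under the penalized score, which is exactly the Occam-style preference described above.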
2.2. Progressive Sampling
PS is a family of training techniques in traditional machine learning
for dealing with large training data sets. As Provost et al.
(1999) describe PS, a learning algorithm is trained and retrained
with increasingly larger random samples until the accuracy
(generalization) of the learnt model stops improving. For a training
data set of size N, the task of PS is to determine a sampling
schedule S = {n_0, n_1, n_2, ..., n_k}, where n_i is the size of the
sample provided to the learning algorithm at stage i; the schedule
satisfies i < j ⇒ n_i < n_j and n_i ≤ N for all i, j ∈ {0, 1, ..., k}.
A learning machine M is then trained according to the sampling
schedule S. Generally, these factors depend on the learning problem
and algorithm. Provost et al. investigated a number of sampling
schedules, such as static, simple, and geometric schedules; geometric
schedules were formally and experimentally shown to be efficient for
learning algorithms of polynomial time complexity.
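The PS loop described above can be sketched as follows. The "learner" here is a trivial mean-estimator stand-in, and the data, tolerance, and initial size are illustrative assumptions:

```python
import random

random.seed(0)
data = [random.gauss(5.0, 1.0) for _ in range(4096)]

def train_and_score(sample):
    """Stand-in learner: fit the mean, score by MAE on the full data set."""
    model = sum(sample) / len(sample)
    err = sum(abs(x - model) for x in data) / len(data)
    return model, err

def progressive_sample(data, n0=32, tol=1e-3):
    """Retrain on growing samples until accuracy stops improving."""
    n, prev_err = n0, float('inf')
    while True:
        sample = data[:min(n, len(data))]
        model, err = train_and_score(sample)
        if prev_err - err < tol or n >= len(data):   # converged or exhausted
            return model, min(n, len(data))
        prev_err, n = err, n * 2                     # geometric schedule (a = 2)

model, used = progressive_sample(data)
```

The doubling step implements the geometric schedule that Provost et al. found efficient for polynomial-time learners.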
2.3. Layered Learning
The layered learning (LL) paradigm is described formally by Stone
and Veloso (2000). Intended as a means for dealing with problems
for which finding a direct mapping from inputs to outputs is in-
tractable with existing learning algorithms, the essence of the ap-
proach is to decompose a problem into subtasks, each of which is
then associated with a layer in the problem-solving process. LL is
a ML paradigm defined as a set of principles for the construction
of a hierarchical, learned solution to a complex task.
Chapter 3
AN EMPIRICAL STUDY OF EARLY STOPPING
FOR GP
In this chapter we empirically investigate several early stopping
criteria for GP learning process. The first group of stopping cri-
teria are based on the idea of over-fitting detection as proposed in
[5] and subsequently developed in [1]. The second group of criteria
are those proposed in Prechelt’s work (1997) for neural networks.
3.1. Method
Before training (the evolutionary run) commences, the entire data
set is randomly divided into three subsets: training (50%),
validation (25%), and test (25%).
3.1.1. Proposed Classes of Stopping Criteria
To detect over-fitting during a GP run, the amount of over-fitting
is calculated as proposed in Vanneschi's paper (2010). The
details of the method are presented in Algorithm 1 below. In the
algorithm, bvp is the "best validation point": the best validation
fitness (error on the validation set) reached up to the current
generation of a GP run, excluding the generations (usually at the
beginning of the run) where the best individual on the training set
has a better validation fitness than training fitness; tbtp stands
for "training at best validation point", i.e. the training fitness
of the individual whose validation fitness equals bvp;
Training_Fit(g) is a function that returns the best training fitness
in the population at generation g; Val_Fit(g) returns the validation
fitness of the best individual on the training set at generation g.
The first class of stopping criteria for early stopping of
a GP run is satisfied if it has been detected
Algorithm 1: Measuring over-fitting for GP systems.
over_fit(0) = 0; bvp = Val_Fit(0); tbtp = Training_Fit(0);
foreach generation g > 0 do
  if Training_Fit(g) > Val_Fit(g) then
    over_fit(g) = 0;
  else if Val_Fit(g) < bvp then
    over_fit(g) = 0;
    bvp = Val_Fit(g);
    tbtp = Training_Fit(g);
  else
    over_fit(g) = |Training_Fit(g) − Val_Fit(g)| − |tbtp − bvp|;
for over-fitting in m successive generations, where m is a
tunable parameter.

OF_m: stop after generation g when over-fitting has occurred in the
m successive generations up to g:

over_fit(g), over_fit(g − 1), ..., over_fit(g − m) > 0

However, this criterion only registers whether or not over-fitting
occurs, without considering its magnitude. This leads us to the
second class of stopping criteria: stop when the over-fitting
value does not decrease in m successive generations.

VF_m: stop after m successive generations in which the over-fitting
value has not decreased:

over_fit(g) ≥ over_fit(g − 1) ≥ ... ≥ over_fit(g − m) > 0
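The over-fitting bookkeeping of Algorithm 1 and the two criterion classes can be sketched on recorded best-of-generation training/validation errors (the error series below is synthetic, for illustration; fitness = error, lower is better):

```python
def overfit_series(train_fit, val_fit):
    """Per-generation over-fitting values, following Algorithm 1."""
    over = [0.0]
    bvp, tbtp = val_fit[0], train_fit[0]
    for g in range(1, len(train_fit)):
        if train_fit[g] > val_fit[g]:      # validation better than training
            over.append(0.0)
        elif val_fit[g] < bvp:             # new best validation point
            over.append(0.0)
            bvp, tbtp = val_fit[g], train_fit[g]
        else:
            over.append(abs(train_fit[g] - val_fit[g]) - abs(tbtp - bvp))
    return over

def of_m(over, g, m):
    """First class (OF_m): over-fit positive in generations g-m .. g."""
    return g >= m and all(v > 0 for v in over[g - m:g + 1])

def vf_m(over, g, m):
    """Second class (VF_m): over-fit positive and non-decreasing over g-m .. g."""
    if g < m:
        return False
    w = over[g - m:g + 1]
    return w[0] > 0 and all(w[i] <= w[i + 1] for i in range(len(w) - 1))

# Synthetic run: training error keeps falling, validation turns upward.
train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val = [0.9, 0.7, 0.65, 0.7, 0.8, 0.95]
over = overfit_series(train, val)
```

On this series, over-fitting is zero while validation still improves and grows once validation error turns upward, so OF_m and VF_m both fire at the last generation for m = 2.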
In early generations, the desire for early stopping should be
weaker than in late generations. Therefore, we also propose a
variant of the second stopping criterion, called the true
adaptive stopping criterion, which works as in Algorithm 2
(where d is the number of generations whose over-fit values are
increasing, fg is the generation that started the chain of
non-decreasing over-fit values, lg is the last generation of that
chain, random is a random value in [0, 1], g is the current
generation, and MaxGen is the maximum number of generations).
Algorithm 2: True Adaptive Stopping
proAp = ((d − fg) · g) / ((lg − fg) · MaxGen);
if proAp ≥ random and d < m then
  Stop = True
The three last classes of early stopping criteria are adapted from
those presented in Prechelt's work on training neural networks.
That approach uses the validation set to predict the trend on the
test set, so the validation error serves as an estimate of the
generalization error. The third class of stopping criteria:
stop as soon as the generalization loss exceeds a certain
threshold. The class GL_α is defined as:

GL_α: stop after the first generation g with GL(g) > α

The generalization loss at generation g is measured by Equation 3.1:

GL(g) = 100 · (E_va(g) / E_opt(g) − 1)    (3.1)
The value E_opt(g) is calculated as the lowest validation set error
up to generation g:

E_opt(g) = min_{g′ ≤ g} E_va(g′)    (3.2)
where E_va(g) is the validation fitness value of the best individual
of generation g. However, if the training error is still decreasing
rapidly, generalization losses have a higher chance of being
"repaired"; we therefore assume that over-fitting does not begin
until the training error decreases only slowly. Training progress is
measured over a training strip of length k, a sequence of k
generations numbered g + 1, ..., g + k where g is divisible by k. It
measures how much larger the average training error during the strip
is than the minimum training error during the strip, as formulated
in Equation 3.3:

P_k(g) = 1000 · ( Σ_{g′=g−k+1}^{g} E_tr(g′) / (k · min_{g′=g−k+1}^{g} E_tr(g′)) − 1 )    (3.3)
The fourth class of stopping criteria: use the quotient of
generalization loss and progress.

PQ_α: stop after the first end of a strip, at generation g, with
GL(g) / P_k(g) > α
In the experiments we take strips of length 5 (k = 5). The fifth
class of stopping criteria: stop when the generalization
error increases in s successive strips. That is, stop when the
validation error has increased not only once, but during s
consecutive strips, assuming that such an increase indicates the
beginning of final over-fitting.

UP_s: stop after generation g iff UP_{s−1} stops after generation
g − k and E_va(g) > E_va(g − k)

UP_1: stop after the first end-of-strip generation g with
E_va(g) > E_va(g − k)
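The Prechelt-style criteria above operate on recorded per-generation error histories and can be sketched directly from Equations 3.1-3.3 (variable names follow the text; the error series at the end are synthetic, for illustration):

```python
def gl(e_va, g):
    """Generalization loss, Eq. 3.1, using E_opt from Eq. 3.2."""
    e_opt = min(e_va[:g + 1])
    return 100.0 * (e_va[g] / e_opt - 1.0)

def pk(e_tr, g, k=5):
    """Training progress over the strip ending at generation g, Eq. 3.3."""
    strip = e_tr[g - k + 1:g + 1]
    return 1000.0 * (sum(strip) / (k * min(strip)) - 1.0)

def stop_gl(e_va, g, alpha):
    """Third class, GL_alpha."""
    return gl(e_va, g) > alpha

def stop_pq(e_va, e_tr, g, alpha, k=5):
    """Fourth class, PQ_alpha."""
    return gl(e_va, g) / pk(e_tr, g, k) > alpha

def stop_up(e_va, g, s, k=5):
    """Fifth class, UP_s: validation error rose over s consecutive strips."""
    if g - s * k < 0:
        return False
    return all(e_va[g - i * k] > e_va[g - (i + 1) * k] for i in range(s))

# Synthetic histories: validation turns upward while training keeps falling.
e_va = [1.0, 0.8, 0.7, 0.75, 0.9]
e_tr = [1.0, 0.9, 0.8, 0.7, 0.6]
```

`stop_up` unrolls the recursive UP_s definition into a single check over the last s strip boundaries.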
3.2. Experimental Settings
The parameter settings for the GP systems are shown in Table 3.1.
Note that GPV (GP with early stopping; OF stands for GP with the
first class of stopping criteria, VF for the second, GL for the
third, PQ for the fourth, and UP for the fifth) is just GPM
(standard GP), except that at each generation the stopping
criterion is checked and, if it is satisfied, the run stops at the
current generation. Tar (GP with Tarpeian bloat control) is GPM
with Tarpeian control as proposed in Mahler's work (2003), where
the target-ratio parameter, giving the percentage of
over-average-sized programs targeted at every generation, was set
to 0.1. Otherwise, the three systems have the same settings, as in
Table 3.1; they even use the same initial random seed at the
beginning of each run. All the runs were conducted on a Compaq
Presario CQ3414L computer with an Intel Core i3-550 processor
(4M cache, 3.20 GHz) running the Ubuntu Linux operating system.
Population Size 500
Number of generations 150 (for GPM)
Tournament size 3
Crossover probability 0.9
Mutation probability 0.05
Initial Max depth 6
Max depth 15
Non-terminals +, -, *, / (protected), sin, cos, exp, log (protected)
Standardized fitness mean absolute error
Number of runs 100
Table 3.1: Parameter settings for the GP systems
3.3. Results and Discussions
For each run, we recorded the generalization error (GE, measured
on the test data set) of the best individual of the run, the
size of the best individual, the first generation at which the best
individual of the run was discovered, and the last generation of the
run (for GPV). We tested the significance of the differences in
generalization error (GE), run time, and size of best solution
between GPM and GPV (GP with a given stopping criterion), using
a two-tailed pairwise t-test with confidence level 0.95 (α = 0.05).
For all parameter settings, and for all classes of stopping
criteria, GPV had a much shorter training time than GPM. In
particular, the fourth class of stopping criteria had a
significantly shorter run time than GPM on all tested problems.
There is also a trade-off between GE and running time for GPV: the
better the GEs of the solutions obtained by GPV, the more training
time is needed. Regarding the size of the best individual (SoBI):
for the third class of stopping criteria, the SoBI of GPV is
significantly smaller than that of GPM on almost all functions
(except F_4, where it is similar); the same holds for the
fourth class (except F_7, F_8, and Conc, where the SoBIs are
similar); and the first class, the second class with True Adaptive,
and the fifth class also often obtain solutions of much smaller
size than GPM. Regarding the effectiveness of the stopping time,
the third and fifth classes of stopping criteria differ little from
standard GP. However, the second class, True Adaptive, and the
fourth class all demonstrate the capability to determine the
right stopping time, which explains why the running times of these
criteria are much shorter than standard GP while their solutions'
GE was almost always at least as good as standard GP's. Even though
Tarpeian bloat control implements an explicit bias towards shorter
solutions, as any regularisation technique does, its solution sizes
were not much better than the solutions found by early stopping GP.
Sometimes early stopping GP even found solutions that are
significantly smaller than Tarpeian GP's (for instance, the
solutions found by PQ for target functions F_4 and F_10). Overall,
the results show that early stopping is very competitive with a
regularization technique such as Tarpeian GP.
3.4. Conclusion
The results on 10 synthetic symbolic regression problems and 4
real-world problems show that early stopping using the fourth class
of stopping criteria improves GP learning efficiency by
significantly reducing training time while retaining, or even
slightly improving, the quality of the solutions learned. The
second class of stopping criteria often helps GP obtain solutions
with significantly better generalization errors, but at the cost of
increased training time compared to the fourth class (though still
significantly better than standard GP). This also confirms the
value of Prechelt's second stopping criterion, with different
settings of the parameter α than Prechelt used. The results
somewhat contradict those reported in Tuite's work (2011),
where this stopping criterion was found to be less effective for GE.
We conjecture that this stems from the different settings of
α, and that Tuite et al. might see better results from the second
criterion with increased values of α. The techniques were also
compared to a recent regularization technique, Tarpeian GP, and
were shown to be competitive.
Chapter 4
A PROGRESSIVE SAMPLING FRAMEWORK
FOR GP
We presented a study of a simple GP layered learning system, GPLL,
in [3], using incrementally increasing sample sizes. However, it
introduced a number of ad-hoc parameter settings (initial training
set size, number of learning layers) which were difficult to justify
(in those experiments, we adjusted the parameter settings through
trial runs, thus reducing any computational cost advantage). This
leads to the main theme of the current work: eliminating ad-hoc
parameter settings by deriving them from PS theory. Preliminary
experiments on such a system were presented in [2]; they were
sufficient to suggest value in the approach, which we examine in
more detail in this chapter.
4.1. The Proposed Method
The learning/evolutionary process is divided into l layers. Learning
starts as in standard GP, but only a subset of the training
examples is used. The next layer commences when the stopping
criteria of the current layer are met, using the final population of
the previous layer as its starting population. The training
sample is then augmented with new samples drawn under the same
distribution (usually uniform random) from the problem samples.
This process is repeated until predetermined stopping criteria are
met. The main extensions (briefly outlined in [2]) are derived from
progressive sampling.
4.1.1. The Sampling Schedule
The sampling schedule uses geometric sampling:

S_g = {2^i · n_0 : i ∈ {0, 1, ..., ⌊log_2(N/n_0)⌋}} ∪ {N}    (4.1)

In this, GPLL differs from PS in that it always continues the
learning process until the data set is exhausted.
4.1.2. The Initial Sample Set Size
It starts by estimating the optimal sample size (the SOSS), based
on the idea that the initial training set should resemble the full
data set as far as possible. Resemblance is measured by the sampling
quality Q, the inverse of the Kullback-Leibler (KL) divergence J
between the subset and the original data set. In detail, we find
the SOSS for a data set D of size N by generating n samples S_i
of sizes spanning the range [1, N] and computing the corresponding
quality Q_i for each S_i. This process is detailed in Algorithm 3
of Gu et al. (2001).
Algorithm 3: Pseudocode for calculation of the SOSS
Input: a mother data set D of size N; n sample sizes {S_i | i = 1..n}
Output: n pairs (S_i, Q_i)
foreach instance j in D (j ∈ [1, N]) do
  update the corresponding statistics for D;
  foreach sample i do
    r ← UniformRand(0, 1);
    if r < S_i / N then
      update the corresponding statistics for sample i;
foreach sample i do
  calculate its Q_i and output (S_i, Q_i);
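Algorithm 3 keeps its statistics abstract; a concrete toy instantiation might measure sampling quality as Q = 1/(1 + J), with J the KL divergence between binned value distributions. Both the binning and the 1/(1 + J) form are illustrative assumptions here, not the exact formulation of Gu et al.:

```python
import math
import random

random.seed(0)

def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Binned empirical distribution with Laplace smoothing."""
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / (hi - lo) * bins), bins - 1)] += 1
    total = len(values)
    return [(c + 1) / (total + bins) for c in counts]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

data = [random.random() for _ in range(4096)]
p_full = histogram(data)
qualities = []
for size in [32, 128, 512, 4096]:
    sample = random.sample(data, size)
    j = kl_divergence(histogram(sample), p_full)
    qualities.append((size, 1.0 / (1.0 + j)))
```

A sample of the full size reproduces the full histogram exactly, so its quality is 1; small samples resemble the mother set less and score lower.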
We then estimate the SOSS from the points (S_i, Q_i), as Gu et al.
did.
4.1.3. The Number of Learning Layers
Having chosen a = 2, it follows that the number l of learning
layers is given by:

l = ⌊log_2(N/n_0)⌋ + 1    (4.2)
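Equations 4.1 and 4.2 can be computed directly; the values below (n_0 = 50, N = 1000) are illustrative:

```python
import math

def geometric_schedule(n0, N):
    """Eq. 4.1: sizes 2^i * n0 up to N, with the full set appended."""
    top = int(math.log2(N / n0))
    sizes = [n0 * 2 ** i for i in range(top + 1)]
    if sizes[-1] != N:
        sizes.append(N)   # GPLL always finishes on the full data set
    return sizes

def num_layers(n0, N):
    """Eq. 4.2 with a = 2."""
    return int(math.log2(N / n0)) + 1

schedule = geometric_schedule(50, 1000)
```

When N is not a power-of-two multiple of n_0, the appended full data set adds one final stage beyond the 2^i sizes.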
4.1.4. The Stopping Criterion for Each Layer
The second class of stopping criteria is preferred here, as chapter
3 showed that it often helped GP obtain significantly better
solutions (in terms of generalization error).
4.2. Experimental Settings
The primary interest of this work is the impact of layered learning
on the learning efficiency of GPLL. We tested this on (almost)
the same set of problems as in chapter 3. We determined the SOSS
for each problem as described in Section 4.1.2. The number of
layers was determined from the SOSS, following Section 4.1.3:
for each problem, we built incrementally-sized training sets using
the geometric sampling schedule. We also constructed an independent
validation set for the stopping criterion, and a separate test
set to measure the generalization accuracy of the end-of-run
solution. The tuning parameter in GPLL is the parameter m used
in the layer-stopping criterion, slightly adjusted from its use in
the previous chapter. We ran GPLL with settings of m ∈ {3, 6, 9}
(instead of 3, 9, 18), resulting in systems denoted GPLL_3, GPLL_6,
and GPLL_9. The evolutionary parameter settings are as in
Table 3.1. With one exception, GPLL uses the same algorithm and
settings – even the same initial random seeds – as standard GP.
The only difference is that in the former, the training set (fitness
cases) increases at each learning layer. Since we also examined run
times, it is important to note that all runs (100 for each system on
each problem) were conducted on a Compaq Presario CQ3414L
computer with Intel Core i3-550 Processor (4M Cache, 3.20GHz)
running a Ubuntu Linux operating system. No other jobs were
running at the same time.
4.3. Results and Discussions
For each setting, we recorded the generalization error (GE, i.e.
test set error) of the best individual of the run, its size, the
total running time, and the first hitting time (the generation where
the best individual was found). In summary, GPLL_3 was significantly
worse than GP (except on F_1); the performance degradations ranged
from 5% (Poll) to 81% (F_7). Thus for m = 3, the stopping criterion
forced GPLL to move on to the next layer prematurely, limiting its
exploration. GPLL_6 was closer to GP: at least as good on 10 of the
14 problems, and on the remainder its disadvantage ranged from about
3% to 28%. GPLL_9 was at least as good as GP on all problems, and
better on F_1, F_2, and Cens. Changing the GPLL parameter m allowed
tuning of the trade-off between GE and run time. For 10 of the 14
problems, GPLL_3 was at least 5 times faster than GP, but suffered a
performance penalty. GPLL_6 had performance close to that of GP, but
still retained a factor of 2-3 advantage in run time. GPLL_9
achieved comparable or superior learning outcomes to GP, while still
retaining a 2:1 run-time advantage. Thus GPLL provides the best of
both worlds: if one requires performance equivalent or superior to
GP, with a time-cost advantage of 2:1, a setting of m = 9 can
achieve it; if slightly worse performance than GP is acceptable,
one can gain a time advantage of 5:1 or better by setting m = 3;
and outcomes in between can also be chosen. On the bulk of the 14
problems (13 for GPLL_3, 12 for GPLL_6, and 8 for GPLL_9), GPLL
found significantly smaller solutions than GP.
In the early layers, the training data set is sparse; if it
represents the original data set poorly (i.e. the KL distance is
large), it could drive the GP process into over-fitting the wrong
data. For GPLL_6 and GPLL_9, the stopping criterion was more
tolerant, and consequently more solutions were found in early layers
(especially when the KL distance between the training set and the
full data set was small). This suggests that the run time of GPLL
could be further improved if we could find stopping criteria that
allow GPLL to terminate even before all layers have been consumed.
GPLL incorporates three important components beyond those of
standard GP: layered learning, incremental sampling, and early
stopping. Are all these components necessary for the performance of
GPLL? To test this, we conducted experiments on the 14 test
problems, omitting one or more of these components. Specifically,
we ran GP with the following settings:
• GPM: GP using the SOSS for training rather than the original
data set
• GPMSt_9: GPM with the stopping criterion of GPLL_9
• GPSt_9: GP with the original data set and the stopping criterion
of GPLL_9
In detail, GPLL_9 out-performed GPM, GPMSt_9, and GPSt_9 on 9, 10,
and 6 of the 14 problems respectively, and was never worse.
4.4. Conclusion
This chapter studied in detail the modified GPLL system first
proposed in [2], itself an extension of the original GPLL of [3]
with ideas from the progressive sampling described in chapter 2.
This modified GPLL has shown, at one extreme of tuning, very large
improvements in search time with a small cost in generalization
capability; and at the other, small improvements in generalization
while retaining a substantial advantage (over standard GP) in
search time. At either extreme, it generates less complex (and
therefore more comprehensible) solutions. The system has only one
tuning parameter beyond those of GP, namely the parameter m, which
determines the eagerness of GPLL to move on to the next layer.
CONCLUSIONS AND FUTURE WORK
Learning is an essential part of Artificial Intelligence (AI)
research, so any contribution to research in learning has a direct
influence on advancements in other sub-fields of AI and must be
investigated in depth. With this broad goal in mind, this thesis
explored the following three major issues:
1) The thesis argued the need for a beneficial interaction between
machine learning research and evolutionary methods, in order to
improve the approaches and methods currently employed in
evolutionary research. Particular emphasis is given to one
sub-area of evolutionary learning research: genetic programming
(GP).
2) Generalization performance, as one of the performance measures
of a learner, was shown to be of crucial importance in improving
the practices and approaches of evolutionary learning, and of GP
in particular. Several new methods aimed at promoting the
generalization performance of the solutions obtained by GP (in
terms of generalization error, training time, and solution
complexity) were developed, and their use was supported by
experimental evidence.
3) A class of learning problems was presented as a challenge for
both conventional learning methods and evolutionary methods.
The thesis developed various new methods that achieved
successful generalization performance on the selected examples
of learning problems. Most of the results obtained using these
methods were better than those of standard GP.
The research in this thesis has considered GP as a machine learning
system and aimed to improve its performance. The main contributions
of the thesis are summarized as follows.
1. Contributions
In addition to a summary of the literature regarding the
generalization ability of GP, the following original work and
results have been reported in the thesis:
• A method to improve the performance of GP based on early
stopping: Early stopping had been used with other machine
learning techniques, but had not previously been applied to GP.
The experimental results showed that the early stopping technique
has significant advantages in reducing the training time and the
complexity of the solutions found by GP, without degrading the
quality of the solutions. Three new stopping criteria for GP were
proposed, and three others were adopted from the neural network
literature. These results appear in publications [1, 5].
• A framework to improve the performance of GP
based on PS: Although GPLL does not always help GP to