
A Comparison of Neural Network Architectures for
Handwritten Digit Recognition

Eman M. El-Sheikh¹, Bradley A. Swain¹, and Mohamed A. Khabou²

¹ Department of Computer Science, University of West Florida, Pensacola, FL, USA
² Department of Electrical & Computer Engineering, University of West Florida, Pensacola, FL, USA


Abstract - In this paper, we describe the development
and use of an artificial neural network architecture for
recognizing handwritten digit data. The feed-forward
neural network, which was implemented in Java, was
designed to be modular and parameterized so that
various parameters and settings of the network could be
easily modified. The network was used to recognize a
large set of data representing handwritten digits.
Various configurations of the network were run on the
data set to determine the best settings for this type of
data. A correct classification rate of 93.77% was
achieved on a testing set of 3500 images not used in
the training phase. We present a summary and
analysis of the results.
Keywords: Machine learning, neural network,
handwritten digit recognition, classification, Java
application.

1 Introduction
The focus of this work is the development, use,
and analysis of an artificial neural network architecture
for the recognition of handwritten digit data. Neural
networks have a long track record of successful use for
pattern recognition problems, including character
recognition. Recognition of handwritten digits was
selected for this study because of its many potential
uses, including mail sorting, automated check reading,
and data entry for hand-held devices.
In this paper, we describe the development of a
software neural network architecture. The system,
which was implemented using the Java programming
language, was designed to be modular and
parameterized to allow easy modification of the
network’s size, structure, and parameters. The overall
motivation for our study is the analysis of various
machine learning techniques for real-world
applications with the purpose of determining the
appropriateness and usefulness of each technique.
Specifically, for the study described in this paper, we
focused on the use of neural network learning
techniques for handwritten digit recognition. Our
objective was two-fold: to test the neural network
implementation on a large-scale, real-world data set
using a variety of network parameters and sizes, and to
determine the best network structure and settings for
the handwritten digit data set. The results provide
evidence for the use of neural network learning
techniques for this type of application and data set, and
insight about the appropriate design, use, and
parameterization of the network to achieve acceptable
results.
In the next section, we provide a brief summary of the
literature relevant to neural network techniques and
applications. Section 3 describes the design,
implementation, and use of the neural network. Section
4 describes the methods used for our tests, including a
description of the data set, experiments we conducted,
and results of our study. The last section includes the
conclusions derived and lessons learned from
analyzing the results, as well as plans for future work.

2 Literature Review
The use of neural networks in pattern recognition
problems dates back to the late 1950s, when the first
perceptron models were introduced by Rosenblatt. In
the 1960s and 1970s, progress in neural networks
slowed due to limited computing capabilities and
unfounded perceptions concerning the limitations of
neural networks. The enhancement of computing
capabilities in the 1980s and the emphatic support by
Hopfield, Rumelhart, and other researchers accelerated
the development of new neural network models and
theories. Today, numerous neural network models and
algorithms are available, including back-propagation
learning, competitive learning, Kohonen learning,
Hopfield networks, Boltzmann machines, shared-weight
neural networks, and self-organizing neural networks
[4].
Neural networks have been used in a variety of
applications ranging from stock price prediction to
automatic target detection and recognition. They are
especially useful as pattern classifiers and/or
recognizers because of their robustness, fault tolerance,
and universal function approximation capability. In
theory, a multi-layer feed-forward neural network with
enough nodes and hidden layers can approximate any
bounded function and its derivative to any arbitrary
accuracy given a “large enough” training set [5].
Results published in [1] state that a network with N
inputs, one hidden layer of H units, and a total of W
weights will require on the order of W/ε training
patterns to yield an error of less than ε on the test set.
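
As a rough illustration of this bound (our own example, not taken from [1]), the following Java fragment computes the weight count W for a hypothetical fully connected network with assumed layer sizes, and the implied training-set size for a 10% target error:

    // Illustrative sketch: Baum-Haussler estimate W/epsilon for a
    // hypothetical fully connected network. Layer sizes are assumed.
    public class BoundSketch {
        public static void main(String[] args) {
            int n = 60, h = 25, outputs = 10;      // assumed layer sizes
            // Weights for full connectivity, plus one bias per
            // non-input neuron (biases counted as weights here).
            int w = n * h + h * outputs + h + outputs;
            double epsilon = 0.10;                 // target error of 10%
            System.out.printf("W = %d -> about %.0f training patterns%n",
                    w, w / epsilon);
        }
    }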

3 Neural Network Architecture
The neural network application used for this
paper was written using the Sun Java 2 Runtime
Environment (v. 1.5.0), mainly on Mac OS X 10.5.1,
but has been successfully used on several Linux
distributions as well. Theoretically, the program should
work on any platform for which a Java virtual machine
exists, as it uses no platform-specific features or calls.
The design of the application attempted to make as few
assumptions as possible. All attributes, input, and
output values are stored as real numbers, allowing for
both continuous and discrete attributes. Missing
attribute values, however, are not handled by the
implementation. The application uses a simple feed-
forward network trained with back-propagation; it uses
the sigmoid activation function by default but is
designed to allow substitution of a different activation
function. The system
was designed to be modular and easily re-configured.
Configurable attributes of the network include the
number of input neurons, the number of hidden layers,
the number of neurons in each hidden layer, the
number of output neurons, the learning rate (α), the
threshold at which the value of the sigmoid function
will cause the neurons to fire, and the number of
epochs for which to train the network, all of which are
passed into the application by way of command-line
parameters. The layers are fully connected: every node
in a layer is connected to every node in the following
layer, and the initial connection weights are set to
random values in the [-1, 1] range.
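
The following is a minimal sketch of one such fully connected layer, consistent with the design just described (sigmoid activation, full connectivity, initial weights random in [-1, 1]); the class and member names, and the inclusion of explicit bias terms, are our own assumptions rather than the authors' actual code:

    import java.util.Random;

    // Sketch of a fully connected layer: sigmoid activation and initial
    // weights drawn uniformly from [-1, 1]. Bias terms are an assumption;
    // all names are illustrative, not the authors' implementation.
    class Layer {
        final double[][] weights;   // [output neuron][input neuron]
        final double[] bias;

        Layer(int inputs, int outputs, Random rng) {
            weights = new double[outputs][inputs];
            bias = new double[outputs];
            for (int j = 0; j < outputs; j++) {
                bias[j] = 2 * rng.nextDouble() - 1;   // uniform in [-1, 1]
                for (int i = 0; i < inputs; i++)
                    weights[j][i] = 2 * rng.nextDouble() - 1;
            }
        }

        static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

        // Forward pass: weighted sum of the inputs passed through the sigmoid.
        double[] forward(double[] input) {
            double[] out = new double[weights.length];
            for (int j = 0; j < weights.length; j++) {
                double sum = bias[j];
                for (int i = 0; i < input.length; i++)
                    sum += weights[j][i] * input[i];
                out[j] = sigmoid(sum);
            }
            return out;
        }
    }

A full network would chain such layers and adjust the weights with back-propagation; the firing threshold described above would be applied by comparing each sigmoid output to the configured threshold value.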
4 Methods and Results
In this section, we describe the data set that we
used for the study, the configurations of the neural
network architecture that we used for our experiments,
and the results of our test runs.
4.1 Data

The problem of handwritten character recognition
has been studied intensively for over thirty years [7].
This research has been driven by the great number of
potential applications including the recognition of
handwritten addresses for mail sorting purposes.
Unlike the problem of machine-printed character
recognition, the difficulties in handwritten character
recognition come from the fact that people write
differently, as can be seen in Figure 1.
Numerous methods have been proposed to capture the
distinctive features of handwritten characters [2, 3, 6,
8]. These approaches can be classified into two
categories: global analysis and structural analysis.
Examples of the first category include template
matching, moment invariants, and mathematical
transforms (Fourier, Walsh, Hadamard, etc.). In the
second category, the main goal is to capture
the essential shape features of the characters mainly
from their skeletons or contours. Such features include
loops, endpoints, junctions, arcs, concavities,
convexities, and strokes.
The main challenges in creating a feature-based
classifier are: (1) what type of features to use, (2) how
many features to use, (3) how to select the “best”
features, and (4) how to define criteria for selecting the
“best” features.
The use of feature vectors as input to handwritten digit
classifiers has many advantages over the use of the
two-dimensional digit images. The most important
advantage is the reduction of the dimension of the
input space. The smallest “reasonable” dimensions of a
digit image are 16×16, which corresponds to an input
vector of dimension 256. Such input dimensions would
produce a huge neural network if multiple hidden
layers are used. The use of features in this specific case
would reduce the dimension of the input space
significantly.



Figure 1. Representative samples from the data set.


In our experiments we used a data set that consisted of
6000 unconstrained binary images of handwritten
digits (600 images from each digit class) from actual
USPS mail pieces. The data was collected at the
Environmental Research Institute of Michigan (ERIM)
and was used by many researchers to test their
handwriting recognition systems [3]. All images were
moment-normalized to a size of 24×18 as described in
[2]. A subset of this data set (2000 images) was used
for training our system and a different subset (3500
images) was used for testing. The results we report in
this paper were achieved with the testing data set.
Three sets of features were used to train and test the
neural networks. Each set contains 60 features, which
have values in the [0, 1] range and represent the degree
of correlation between a feature’s template and the
input image. A value of 0 indicates that a feature
template did not match the digit image; a value of 1
indicates a complete match; and values in between
these two extremes indicate a partial match.
These 60 “best” features were selected using three
measures: an information measure, an orthogonality
measure, and a combination of the two. The
information measure is based on Shannon’s entropy. It
is a statistical measurement of the capability of a
feature to separate digit classes. If the information
measure of a feature is high, it indicates that the feature
is able to separate at least some of the classes well
since the conditional probabilities of the classes given
that the feature is either present or not present are
significantly different from each other. The
orthogonality measure is based on the known
mathematical property that the best basis for
representing an n-dimensional space consists of n
mutually orthogonal vectors (i.e., vectors whose
pairwise dot products equal 0). The orthogonality
measure yields features that do not respond similarly to
an input image, i.e. the response of one feature would
be high when the other is low on some classes and
vice-versa on all the other classes. Thus the features
selected based on the orthogonality measure provide
discrimination capability since they behave differently
on each class. The third measure we used is a
combination of the information and orthogonality
measures. More details about the measures and
selection process can be found in [3].
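
As a hedged sketch of how such measures might be computed (the actual definitions are in [3]; this is one plausible instantiation, with illustrative names), an entropy-based information score and a dot-product orthogonality check could look as follows:

    // Sketch only: one plausible form of the two measures; see [3] for
    // the actual definitions used in the feature selection process.
    public class FeatureMeasures {
        // Shannon entropy H(p) = -sum p_i log2 p_i of a class distribution.
        static double entropy(double[] p) {
            double h = 0;
            for (double pi : p)
                if (pi > 0) h -= pi * (Math.log(pi) / Math.log(2));
            return h;
        }

        // Information score: prior class entropy minus the expected entropy
        // after conditioning on whether the feature is present or absent.
        // A high value means the feature separates some classes well.
        static double informationMeasure(double[] prior,
                                         double[] givenPresent,
                                         double[] givenAbsent,
                                         double pPresent) {
            return entropy(prior)
                    - pPresent * entropy(givenPresent)
                    - (1 - pPresent) * entropy(givenAbsent);
        }

        // Dot product of two feature response vectors; values near 0
        // indicate near-orthogonal (dissimilar) features.
        static double dot(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += a[i] * b[i];
            return s;
        }
    }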
4.2 Experiments
We ran the networks on the handwritten digit

data, training on the first 2000 digits in each file and
testing on the final 3500 digits. For a comparison
between architectures, we ran tests using two hidden
layers with 25 and 20 nodes, two hidden layers with 25
and 15 nodes, one hidden layer with 25 nodes, and one
hidden layer with 15 nodes, respectively. The other
parameters to the network were fixed for all runs of all
architectures. After trial runs with different values, a
learning rate of 0.05 was used, as it was small enough
to avoid saturating the network, but large enough to
provide ample learning. The number of training epochs
was set at 500, as experimentation showed this to
provide the best testing results, compared to 200 epochs
(under-trained) and 750 or 1000 epochs (over-trained).
Finally, for all tests the firing threshold for the
activation function of the hidden neurons was set at
0.9, which experimentally gave the best results.
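
The paper does not give the application's actual command-line syntax, but given the parameters listed in Section 3, an invocation for the best-performing configuration might look like the following (the program name and flag names are hypothetical):

    java NeuralNet --inputs 60 --hidden 25,20 --outputs 10 \
         --alpha 0.05 --threshold 0.9 --epochs 500 \
         --train digits.train --test digits.test

Here 60 input neurons match the 60-feature vectors, 10 output neurons match the ten digit classes, and the remaining values are the settings reported above; the file arguments are placeholders.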


Table 1. Summary of performance results for each architecture configuration and each feature data set.

Hidden      Info       Time      Orthogonal  Time      Combination  Time
Layer(s)    Features   (mm:ss)   Features    (mm:ss)   Features     (mm:ss)
25/20       90.63%     04:18     93.77%      04:20     93.74%       04:15
25/15       90.83%     03:56     93.34%      03:59     93.34%       03:59
25          90.54%     03:18     93.57%      03:19     93.43%       03:20
15          89.91%     02:08     92.86%      02:08     92.83%       02:08

4.3 Results
The results for the experiments are summarized in
Table 1. The table shows the percentage of handwritten
digit samples correctly classified for different
configurations of the architecture. For each row, the
first column shows the architecture of the network
used. The first two runs both used two hidden layers;
the last two used only one hidden layer. The remaining
columns consist of data set and time pairs, with the
percentage of the test data correctly classified and the
amount of time that the network took to train and test,
respectively. Each architecture and data set
combination was run 5 times, and the best result is
presented for each tested configuration of the
architecture.
Based on previous work done on the data sets, the two
hidden layer architecture with 25 nodes in the first
layer and 15 nodes in the second layer was expected to
perform best. For the information features data set, this
held true, but for the other data sets it actually
performed worse than both the 25/20 two-hidden-layer
architecture and the single-hidden-layer architecture
with 25 nodes. On those data sets, the 25/20
architecture outperformed the other network
architectures, with only a minor increase in time
required over the 25/15 architecture. With a slightly
reduced performance, however, the single hidden layer
with 25 nodes performs nearly as well as the 25/20
architecture, but reduces the running time by as much
as a minute.

5 Conclusions
Overall, the implemented system was able to
successfully classify the test data using each of three
feature sets. The single-hidden-layer network with 15
nodes performed better on the orthogonal and
combination features data sets, achieving prediction
accuracies of 92.86% and 92.83%, respectively. The
information features set did not perform as well
because some of its features were almost identical and
hence provided no additional information to the
network. Adding
more nodes to the single hidden layer improved the
accuracy slightly. With 25 nodes in the hidden layer,
the accuracy improved to 90.54% for the information
features, 93.57% for the orthogonal features, and
93.43% for the combination features data set. This
slight improvement in accuracy comes at the price of an
increase in time requirements, since the running time
increased by over a minute. It is interesting to note that
expanding the architecture to include two hidden layers
with 25 nodes in the first layer and 15 nodes in the
second layer increases the running time but does not
improve the classification accuracy for the orthogonal
and combination features data set, and only has a slight
benefit for the information features data set, improving
the accuracy from 90.54% to 90.83%. Increasing the
number of nodes in the second hidden layer from 15 to
20 provides the best classification accuracy for the
orthogonal features data set and the combined features
data set, namely 93.77% and 93.74% respectively, but
does not improve the prediction rate for the
information features data.
The orthogonal features data set and the combined
features data set provided better results for all
configurations of the network than the information
features data set. This was somewhat expected since
the selection process can theoretically yield very
similar information features which do not provide extra
class discrimination to the neural network. This
possibility is not present in the orthogonality and
combined features since the feature selection process
using these two measures inherently discourages
similar features. Both the orthogonal and combination
features data sets provided fairly comparable results, in
terms of both accuracy and running time. The single
hidden layer with 25 nodes performs nearly as well as
the two hidden layer architecture with 25 and 20 nodes
on all three data sets, but reduces the running time by
as much as a minute.
We would like to continue testing the developed
system with other large data sets for handwritten digit
and character recognition. We are also interested in
exploring the combination of different machine
learning techniques for handwritten digit recognition
and similar character recognition problems. More
specifically, we are interested in developing a system
that integrates a decision tree learning technique with a
neural network architecture and testing it with the three
data sets used in this study to compare the performance
of such a system with the results of using a decision
tree or neural network separately. We anticipate that
such a technique that fuses both machine learning
techniques will yield better results than using each
technique individually, and will allow the recognition
of larger, more complex data sets.

6 References
[1] E. Baum and D. Haussler, “What Size Net Gives
Valid Generalization?” Neural Computation, vol. 1,
no. 1, pp. 151-160, 1989.
[2] P. D. Gader, B. Forester, M. Ganzberger, A.
Gillies, B. Mitchell, M. Whalen, and T. Yocum,
“Recognition of Handwritten Digits Using Template
and Model Matching,” Pattern Recognition, vol. 24,
pp. 421-432, 1991.
[3] P. D. Gader and M. A. Khabou, “Automatic
Feature Generation for Handwritten Digit
Recognition,” IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. 18, no. 12, pp. 1256-1262,
1996.
[4] S. Haykin, Neural Networks: A Comprehensive
Foundation, Macmillan Publishing Co., 1994.
[5] K. Hornik, M. Stinchcombe, and H. White,
“Universal Approximation of an Unknown Mapping
and Its Derivatives Using Multilayer Feedforward
Networks,” Neural Networks, vol. 3, pp. 551-560,
1990.
[6] C.Y. Suen, “Distinctive Features in the Automatic
Recognition of Handprinted Characters,” Signal
Processing, vol. 4, pp. 193-207, 1982.
[7] C.Y. Suen, “Character Recognition by Computer
and Application,” in Handbook of Pattern Recognition
and Image Processing, Academic Press, pp. 569-586,
1986.
[8] C.Y. Suen, C. Nadal, R. Legault, T.A. Mai, and
L. Lam, “Computer Recognition of Unconstrained
Handwritten Numerals,” Proceedings of the IEEE,
vol. 80, no. 7, pp. 1162-1180, 1992.
