MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

THEODOR CONSTANTINESCU

OPTICAL CHARACTER RECOGNITION USING NEURAL NETWORKS

MASTER OF SCIENCE THESIS
MAJOR: INFORMATION PROCESSING AND COMMUNICATION

SCIENTIFIC SUPERVISOR: NGUYÊN LINH GIANG

HANOI - 2009
Contents
I. Introduction
II. Pattern recognition
III. Optical character recognition (OCR)
IV. Neural networks
V. The program
VI. Conclusions
I. Introduction
The difficulty of the dialogue between man and machine comes, on the one hand, from
the flexibility and variety of the modes of interaction that we are able to use: gesture,
speech, writing, etc., and, on the other hand, from the rigidity of those classically offered
by computer systems. Part of current research in IT is therefore the design of applications
best suited to the different forms of communication commonly used by man. The goal is
to provide computer systems with features for handling the information that humans
themselves manipulate every day.
In general the information to process is very rich. It can be text, tables, images, words,
sounds, writing, and gestures. In this paper I treat the case of writing, to be more precise,
printed character recognition. Depending on the application and on the personal context,
the way this information is represented and transmitted is very variable. Just consider, for
example, the variety of writing styles that exists between different languages and even
within the same language. Moreover, because of the sensitivity of the sensors and of the
media used for acquisition and transmission, the information to be processed is often
different from the original. It is therefore affected by inaccuracies that are either intrinsic
to the phenomena it describes or related to the way it is transmitted.
Their treatment requires the implementation of complex analysis and decision systems.
This complexity is a major limiting factor in the dissemination of such informational
means. This remains true despite the growth of computing power and the improvement of
processing systems, since research is at the same time directed towards the resolution of
more and more difficult tasks and towards the integration of these applications into
cheaper, and therefore lower-capacity, mobile systems.
Optical character recognition represents the process through which a program
converts the image of a character (usually acquired by a scanner machine) into the code
associated to that character, thus enabling the computer to “understand” the character,
which heretofore was just a cluster of pixels. It turns the image of the character (or of a
string of characters – text) into selectable strings of text that you can copy, as you would
any other computer generated document. In its modern form, it is a form of artificial
intelligence pattern recognition.
OCR is the most effective method available for transferring information from a
classical medium (usually paper) to an electronic one. The alternative would be a human
reading the characters in the image and typing them into a text editor, which is obviously
a stupid, Neanderthal approach when we possess computers with enough power to do
this mind-numbing task. The only thing we need is the right OCR software.
Before OCR can be used, the source material must be scanned using an optical
scanner (and sometimes a specialized circuit board in the PC) to read in the page as a
bitmap (a pattern of dots). Software to recognize the images is also required. The OCR
software then processes these scans to differentiate between images and text and
determine what letters are represented in the light and dark areas.
The approach in older OCR programs was crude brute force: it was simply to compare
the characters to be recognized with the sample characters stored in a database. Imagine
the number of comparisons, considering how many different fonts exist. Modern OCR
software uses complex neural-network-based systems to obtain better results, much more
exact identification, actually close to 100%.
Today's OCR engines add the multiple algorithms of neural network technology
to analyze the stroke edge, the line of discontinuity between the text characters, and the
background. Allowing for irregularities of printed ink on paper, each algorithm averages
the light and dark along the side of a stroke, matches it to known characters and makes a
best guess as to which character it is. The OCR software then averages or polls the results
from all the algorithms to obtain a single reading.
Advances have made OCR more reliable; expect a minimum of 90% accuracy for
average-quality documents. Despite vendor claims of one-button scanning, achieving
99% or greater accuracy takes clean copy and practice setting scanner parameters and
requires you to "train" the OCR software with your documents.
The first step toward better recognition begins with the scanner. The quality of its
charge-coupled device light arrays will affect OCR results. The more tightly packed these
arrays, the finer the image and the more distinct colors the scanner can detect.
Smudges or background color can fool the recognition software. Adjusting the scan's
resolution can help refine the image and improve the recognition rate, but there are tradeoffs.
For example, in an image scanned at 24-bit color with 1,200 dots per inch (dpi),
each of the 1,200 pixels has 24 bits' worth of color information. This scan will take
longer than a lower-resolution scan and produce a larger file, but OCR accuracy will
likely be high. A scan at 72 dpi will be faster and produce a smaller file—good for
posting an image of the text to the Web—but the lower resolution will likely degrade
OCR accuracy. Most scanners are optimized for 300 dpi, but scanning at a higher number
of dots per inch will increase accuracy for type under 6 points in size.
Bilevel (black and white only) scans are the rule for text documents. Bilevel scans
are faster and produce smaller files, because unlike 24-bit color scans, they require only
one bit per pixel. Some scanners can also let you determine how subtle to make the color
differentiation.
The accurate recognition of Latin-based typewritten text is now considered
largely a solved problem. Typical accuracy rates exceed 99%, although certain
applications demanding even higher accuracy require human review for errors. Other
areas - including recognition of cursive handwriting, and printed text in other scripts
(especially those with a very large number of characters) - are still the subject of active
research.
Today, OCR software can recognize a wide variety of fonts, but handwriting and
script fonts that mimic handwriting are still problematic. Developers are taking different
approaches to improve script and handwriting recognition. OCR software from
ExperVision Inc. first identifies the font and then runs its character-recognition
algorithms.
Which method will be more effective depends on the image being scanned. A
bilevel scan of a shopworn page may yield more legible text. But if the image to be
scanned has text in a range of colors, as in a brochure, text in lighter colors may drop out.
On-line systems for recognizing hand-printed text on the fly have become well known as
commercial products in recent years. Among these are the input devices for personal
digital assistants such as those running Palm OS. The algorithms used in these devices
take advantage of the fact that the order, speed, and direction of individual line segments
at input are known. Also, the user can be retrained to use only specific letter shapes.
These methods cannot be used in software that scans paper documents, so accurate
recognition of hand-printed documents is still largely an open problem.
Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved,
but that accuracy rate still translates to dozens of errors per page, making the technology
useful only in very limited applications.
Whereas commercial and even open source OCR software performs well for, let's say,
ordinary images, a particularly difficult problem for computers and humans alike is that
of old religious registers of baptisms and marriages, which contain mainly names, where
the pages can be damaged by weather, water or fire, and the names can be obsolete or
written in older spellings.
Character recognition has been an active area of research in computer science since the
late 1950s. Initially it was thought to be an easy problem, but it turned out to be a much
more interesting one. It may take many more decades before computers can read any
document with the same precision as human beings.
All the commercial software is quite complex. My aim was to create a simple and
reliable program to perform the same tasks.
II. Pattern recognition
Pattern recognition is a major area of computing in which research is particularly
active. A very large number of applications may require a recognition module in
processing systems designed to automate certain tasks for humans. Among these,
handwriting recognition systems are a difficult issue to handle, as they concentrate on
their own much of the difficulty encountered in pattern recognition. In this chapter I give
a general presentation of the main pattern recognition techniques.
Pattern recognition is the set of methods and techniques with which we can achieve a
classification within a set of objects, processes or phenomena. This is accomplished by
comparison with models. A set of models (prototypes), one for each class, is stored in the
memory of the computer. The new, unknown input (not yet classified) is compared in
turn with each prototype and assigned to one of the classes based on a selection criterion:
if the unknown form best matches prototype "i", then it will belong to class "i". The
difficulties that arise are related to the selection of a representative model, one which best
characterizes a form class, as well as to the definition of an appropriate selection
criterion, able to classify each unknown form unambiguously.
Pattern recognition techniques can be divided into two main groups: generative and
discriminative. There have been long-standing debates on generative vs. discriminative
methods. Discriminative methods aim to minimize a utility function (e.g. classification
error) and do not need to model, represent, or "understand" the pattern explicitly. For
example, nowadays we have very effective discriminative methods that can detect
99.99% of faces in real images with few false alarms, and such detectors do not "know"
explicitly that a face has two eyes. Discriminative methods often need large training data,
say 100,000 labeled examples, and can hardly be generalized. We should use them if we
know for sure that recognition is all we need in an application, i.e. we don't expect to
generalize the algorithm to a much broader scope or to other utility functions. In
comparison, generative methods try to build models of the underlying patterns, and can
be learned, adapted, and generalized with small data.
BAYESIAN INFERENCE
The logical approach for calculating or revising the probability of a hypothesis is
called Bayesian inference. This is governed by the classic rules of probability
combination, from which Bayes' theorem derives. In the Bayesian perspective,
probability is not interpreted as the limit of a frequency, but rather as the numerical
translation of a state of knowledge (the degree of confidence in a hypothesis).
Bayesian inference is based on the handling of probabilistic statements, and it is
particularly useful in problems of induction. Bayesian methods differ from standard
methods by the systematic application of formal rules for the transformation of
probabilities. Before proceeding to the description of these rules, let's review the notation
used.
The rules of probability
There are only two rules for combining probabilities, and on them the theory of Bayesian
analysis is built. These rules are the addition and multiplication rules.
The addition rule:
P(A or B) = P(A) + P(B) - P(A and B)
The multiplication rule:
P(A and B) = P(A | B) P(B) = P(B | A) P(A)
Bayes' theorem can be derived simply by taking advantage of the symmetry of the
multiplication rule:
P(A | B) = P(B | A) P(A) / P(B)
This means that if one knows the consequences of a cause, observing the effects allows
one to trace back to the causes.
Evidence notation
In practice, when a probability is very close to 0 or 1, elements considered in themselves
as very improbable would have to be observed in order to see the probability change.
Evidence is defined as the log-odds of a hypothesis; for clarity purposes, we often work
in decibels (dB), with the following equivalence:
Ev(H) = 10 log10 [ P(H) / (1 - P(H)) ]
An evidence of -40 dB corresponds to a probability of 10^-4, etc. Ev stands for weight of
evidence.
Comparison with classical statistics
The difference between the Bayesian inference and classical statistics is that:
• Bayesian methods use impersonal methods to update personal probability, known as
subjective (probability is always subjective, when analyzing its fundamentals),
• statistical methods use personal methods in order to treat impersonal frequencies.
The Bayesian and exact conditional approaches to the analysis of binary data are
very different, both in philosophy and implementation. Bayesian inference is based on the
posterior distributions of quantities of interest such as probabilities or parameters of
logistic models. Exact conditional inference is based on the discrete distributions of
estimators or test statistics, conditional on certain other statistics taking their observed
values.
The Bayesians thus choose to model their expectations at the beginning of the
process (nevertheless revising this first assumption in light of the subsequent
observations), while classical statisticians fix an arbitrary method and assumption a priori
and do not treat the data until after that. Because they do not require a fixed prior
hypothesis, Bayesian methods have paved the way for automatic data mining: there is
indeed no more need for prior human intuition to generate hypotheses before we can start
working.
When should we use one or the other? The two approaches are complementary: the
classical statistical approach is generally better when information is abundant and cheap
to collect, the Bayesian approach when it is scarce and/or costly to collect. When data are
abundant, the results are asymptotically the same for both methods, the Bayesian
calculation being simply more costly. In contrast, the Bayesian approach can handle cases
where classical statistics would not have enough data to apply the limit theorems.
Actually, Altham in 1969 discovered a remarkable result relating the two forms of
inference for the analysis of a 2 x 2 contingency table. This result is hard to generalise to
more complex examples.
The Bayesian psi-test (which is used to determine the plausibility of a distribution
compared to the observations) asymptotically converges to the χ² of classical statistics as
the number of observations becomes large. The seemingly arbitrary choice of a Euclidean
distance in the χ² is thus perfectly justified a posteriori by the Bayesian reasoning.
Example: From which bowl is the cookie?
To illustrate, suppose there are two full bowls of cookies. Bowl #1 has 10
chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks
a bowl at random, and then picks a cookie at random. We may assume there is no reason
to believe Fred treats one bowl differently from another, likewise for the cookies. The
cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than a half, since there are more
plain cookies in bowl #1. The precise answer is given by Bayes's theorem. Let H1
correspond to bowl #1, and H2 to bowl #2. It is given that the bowls are identical from
Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal
to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls,
we know that P(E | H1) = 30 / 40 = 0.75 and P(E | H2) = 20 / 40 = 0.5. Bayes's formula
then yields
P(H1 | E) = P(E | H1) P(H1) / [ P(E | H1) P(H1) + P(E | H2) P(H2) ]
          = (0.75 x 0.5) / (0.75 x 0.5 + 0.5 x 0.5) = 0.6
Before we observed the cookie, the probability we assigned for Fred having chosen bowl
#1 was the prior probability, P(H1), which was 0.5. After observing the cookie, we must
revise the probability to P(H1 | E), which is 0.6.
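To make the calculation concrete, here is a minimal sketch of it in Python (the function
name and the numbers are simply those of the example above):

# Posterior probability for the cookie example, via Bayes' theorem.
def posterior(prior_h1, prior_h2, like_e_h1, like_e_h2):
    """Return P(H1 | E) from the priors and the likelihoods of E."""
    evidence = like_e_h1 * prior_h1 + like_e_h2 * prior_h2   # P(E)
    return like_e_h1 * prior_h1 / evidence

# Bowl #1: 10 chocolate chip + 30 plain; bowl #2: 20 + 20.
print(posterior(0.5, 0.5, 30 / 40, 20 / 40))   # -> 0.6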
HIDDEN MARKOV MODEL
Hidden Markov models are a promising approach in different application areas
where one intends to deal with quantified data that can be partially wrong, for example
the recognition of images (characters, fingerprints), the search for patterns and sequences
in genes, etc.
The data production model
A hidden Markov chain is an automaton with M states, numbered m = 1, ..., M. We
denote by s_t the state of the automaton at the moment t. The probability of transition
from a state m to a state n is given; we call it a(m,n):
a(m,n) = P(s_t = n | s_t-1 = m)
We have
Σ_n a(m,n) = 1 for every state m.
We also have d(m), the probability that the automaton is in state m at the initial moment:
d(m) = P(s_0 = m)
We obviously have
Σ_m d(m) = 1.
When the automaton passes through the state m it emits a piece of information y_t that
can take N values. The probability that the automaton emits the signal n when it is in this
state m will be noted b(m,n):
b(m,n) = P(y_t = n | s_t = m)
and we have
Σ_n b(m,n) = 1 for every state m.
The word "hidden" used to characterize the model reflects the fact that the emission
from a given state is random. This random nature of the measurements, added to the
properties of Markov processes, yields the flexibility and strength of this approach. A
variant of this approach has found renewed interest in the field of error-correcting codes
in digital transmission, where it is used in turbo codes.
Probability of transition and emission of data in the hidden Markov model
The important property of a Markov process is that the evolution of the automaton
after a given moment depends only on the state it is in at that moment and on the inputs
that are then applied, and not on what it went through before arriving at that state. In
other words, the future does not depend on the manner in which the automaton arrived in
its current state. The M states, the N possible values of the measurements, and the
probabilities a(m,m'), b(m,n) and d(m) characterize the model. We have to address three
problems:
1 - Recognition.
We have observed Y = [y_0, ..., y_t, ..., y_T]. [a(m,m'), b(m,n), d(m)] is given. What is
the most likely state sequence S = [s_0, ..., s_t, ..., s_T] that created it?
2 - Probability of observing a sequence.
We have observed a sequence of measurements Y = [y_0, ..., y_t, ..., y_T]. What is the
probability that the automaton characterized by the parameters [a(m,m'), b(m,n), d(m)]
has produced this sequence?
3 - Learning.
We have observed Y = [y_0, ..., y_t, ..., y_T]. How do we calculate (or rather update) the
model's parameters [a(m,m'), b(m,n), d(m)] in order to maximize the probability of the
observation?
The following algorithm aims to find the sequence of states most likely to have
produced the measured sequence Y = [y_0, ..., y_t, ..., y_T]. At moment t we calculate
recursively, for each state m',
r_t(m') = max P(s_0, ..., s_t-1, s_t = m', y_0, ..., y_t)
the maximum being calculated over all possible state sequences S = [s_0, ..., s_t-1].
Initialization: at the moment t = 0,
r_0(m) = d(m) b(m, y_0)
Recurrence: let's assume that at the moment t-1 we have calculated r_t-1(m) for each
state. We then have
r_t(m') = max_m [ r_t-1(m) a(m,m') ] b(m', y_t)
The state m most likely occupied at the moment t-1, from which the automaton has
evolved into the state m' at the moment t, is the state for which r_t-1(m) a(m,m') b(m', y_t)
is maximum. For each state m' we thus calculate r_t(m'); each of these states has a
predecessor q_t(m'). This predecessor can be used to recover the state sequence most
likely to have created the measurements [y_0, ..., y_t, ..., y_T]. End of the algorithm: the
state retained at the final moment T is the one for which r_T(m) is maximum, and r_T(m)
is the probability that the measured sequence was emitted by the automaton along this
best path. We can then recover the sequence of states by taking the predecessor q_T(m)
of the retained state and proceeding recursively, in the same way, back to s_0.
Choosing a path in the lattice between successive moments; reconstruction of the path
corresponding to the optimal sequence generated by the automaton.
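As an illustration, here is a minimal sketch of this recursion in Python (the model
parameters a, b, d and the observation sequence are hypothetical placeholders):

# Search for the most likely state sequence, as described above (Viterbi-style).
# a[m][n]: transition probabilities, b[m][y]: emission probabilities, d[m]: initial probabilities.
def most_likely_states(Y, a, b, d):
    M = len(d)
    r = [d[m] * b[m][Y[0]] for m in range(M)]            # r_0(m)
    predecessors = []                                    # q_t(m') for t = 1..T
    for y in Y[1:]:
        q, r_new = [], []
        for n in range(M):                               # n plays the role of m'
            best_m = max(range(M), key=lambda m: r[m] * a[m][n])
            q.append(best_m)
            r_new.append(r[best_m] * a[best_m][n] * b[n][y])
        predecessors.append(q)
        r = r_new
    # Backtrack from the state with maximal r_T(m).
    state = max(range(M), key=lambda m: r[m])
    path = [state]
    for q in reversed(predecessors):
        state = q[state]
        path.append(state)
    return list(reversed(path))

# Hypothetical 2-state, 2-symbol model:
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]
d = [0.5, 0.5]
print(most_likely_states([0, 0, 1, 1], a, b, d))         # -> [0, 0, 1, 1]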
One can thus calculate, for each of the Markov models representing for example a word
to recognize, the probability that the studied measured sequence was generated by this
automaton, and then compare the results.
The probability that a state sequence S = [s_0, ..., s_t, ..., s_T] has generated Y is
obtained by using the property of Markovian sources:
P(Y | S) = b(s_0, y_0) b(s_1, y_1) ... b(s_T, y_T)
Also
P(S) = d(s_0) a(s_0, s_1) a(s_1, s_2) ... a(s_T-1, s_T)
Therefore
P(Y, S) = P(Y | S) P(S)
and the probability of having emitted the sequence Y is given by the summation over all
M^T possible sequences S:
P(Y) = Σ_S P(Y | S) P(S)
a formula that is useless for practical purposes because it requires on the order of
2 T M^T operations.
Summarizing:
1. A Markov system has M states, called s_1, s_2, ..., s_M.
2. There are discrete timesteps, t = 0, t = 1, ...
3. On the t-th timestep the system is in exactly one of the available states; let's call it q_t
(q_t ∈ {s_1, s_2, ..., s_M}).
4. Between each timestep, the next state is chosen randomly.
5. The current state determines the probability distribution for the next state.
Applications
• speech recognition;
• automated processing of natural language (automatic translation, text labeling,
reconstruction of noisy text, ...);
• handwriting recognition;
• bio-informatics.
A set of measurements with which an object is described or characterized is called a
form vector (or, in short, a form). Each feature can be viewed as a variable in an
m-dimensional space, called the form space, in which each feature is assigned a
dimension.
Form vector: x = [x_1 x_2 ... x_m]^T, where x_i is feature i.
Each form appears as a point in the form space. This space, noted H_x, can be described
by the matrix of values x(i, j):
H_x = [ x(i, j) ; i = 1, 2, ..., N ; j = 1, 2, ..., m ] = [ x_k^T ; k = 1, 2, ..., N ]
where N is the number of forms.
Form classification can be understood as a partitioning of the form space into mutually
exclusive domains, each domain belonging to a class:
H_x = h_1 ∪ h_2 ∪ ... ∪ h_n ∪ F
where F is the set of points that constitute the boundaries between classes.
Mathematically, this kind of classification can be defined through a discriminant function
D_j(x) associated with the form class h_j (j = 1, 2, ..., n), with the property that if the
form represented by the vector x belongs to h_i, then the value of D_i(x) must be the
largest, i.e. for all x belonging to h_i we'll have:
D_i(x) > D_j(x) ; i, j = 1, 2, ..., n ; i ≠ j
Boundaries between classes, called "decision limits", can be expressed by:
F : D_i(x) - D_j(x) = 0 ; i, j = 1, 2, ..., n ; i ≠ j
Discriminant functions (or classifiers) can be divided into two categories: parametric and
non-parametric. If the training set of forms cannot be described by statistical
measurements, then non-parametric discriminant functions are used, the most common
being decision-theory classifiers. Parametric classifiers are based on estimating the
statistical parameters of the training set of forms, these estimates being then used to
establish the discriminant functions.
DECISION-THEORY CLASSIFIERS (DISTRIBUTION-FREE)
Linear classifier
In a two-dimensional space x = [x_1, x_2]^T the classifier is a linear function (a plane
perpendicular to the forms plane): w_1 x_1 + w_2 x_2 + w_3 = 0.
In an m-dimensional space the function is a hyperplane:
w_1 x_1 + w_2 x_2 + ... + w_m x_m + w_m+1 = 0
In this case, the discriminant function is:
D(x) = w_1 x_1 + w_2 x_2 + ... + w_m x_m + w_m+1
Minimum distance classifier
It is based on the evaluation of the distances between the form to be classified and a set
of reference vectors from the form space. If we assume that n reference vectors are
known, noted R_1, R_2, ..., R_n, with R_j associated with the class h_j, then the
minimum distance classifier will assign the form x to the class h_i if the distance between
it and the associated reference vector is minimum: x belongs to h_i if
d(x, R_i) = |x - R_i| = min_j |x - R_j|.
Reference vectors are considered to represent the centers of the classes, and we calculate
them as the mean of the forms of each class:
R_j = (1 / M_j) Σ x, the sum being taken over the forms x belonging to h_j,
where M_j is the number of forms in class h_j.
The squared distance between the form x and the vector R_i of the class h_i is:
d^2(x, R_i) = |x - R_i|^2 = (x - R_i)^T (x - R_i) = x^T x - x^T R_i - R_i^T x + R_i^T R_i
Since the term x^T x is the same for all classes, it can be removed, and we can change the
sign too, without changing the decision surface, obtaining the discriminant function for a
minimum distance classifier:
D_i(x) = x^T R_i + R_i^T x - R_i^T R_i , i = 1, 2, ..., n
Note that D_i(x) is linear and that the distance is minimum when D_i(x) is maximum.
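A minimal sketch of this classifier in Python (the class labels and the data are
hypothetical; the reference vectors are the class means, as above):

# Minimum distance (nearest class centre) classifier.
def class_centres(training_forms):
    """training_forms: dict mapping class label -> list of feature vectors."""
    centres = {}
    for label, forms in training_forms.items():
        m = len(forms[0])
        centres[label] = [sum(f[j] for f in forms) / len(forms) for j in range(m)]
    return centres

def classify(x, centres):
    """Assign x to the class whose centre R_i minimizes |x - R_i|."""
    def sq_dist(r):
        return sum((xi - ri) ** 2 for xi, ri in zip(x, r))
    return min(centres, key=lambda label: sq_dist(centres[label]))

# Hypothetical two-class example in a two-dimensional form space:
training = {"h1": [[0.0, 0.1], [0.2, 0.0]], "h2": [[1.0, 1.1], [0.9, 1.0]]}
print(classify([0.8, 0.9], class_centres(training)))   # -> h2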
Nonlinear (polynomial) classifier
This classifier has the following expression:
D(x) = w_1 f_1(x) + w_2 f_2(x) + ... + w_L f_L(x) + w_L+1
where each term f_i(x) is a product of powers of the features,
f_i(x) = x_k1^n1 x_k2^n2 ... x_kr^nr
with k_1, k_2, ..., k_r = 1, ..., m and n_1, n_2, ..., n_r = 0 or 1.
For r = 2 we obtain the square (quadratic) discriminant function, whose terms are of the
form f_i(x) = x_k1^n1 x_k2^n2, with k_1, k_2 = 1, ..., m and n_1, n_2 = 0 or 1.
Discriminant functions of linear and non-linear type are represented below, in the case of
a two-dimensional space (the form space is a plane).
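To illustrate how a polynomial classifier can be reduced to a linear one, here is a minimal
sketch (hypothetical, for r = 2) that expands a form vector into its quadratic terms so that
a linear discriminant can then be applied to the expanded vector:

# Quadratic feature expansion: products x_k1 * x_k2 with exponents in {0, 1}.
from itertools import combinations_with_replacement

def quadratic_expansion(x):
    """Return [1, x_1, ..., x_m, x_1*x_1, x_1*x_2, ..., x_m*x_m]."""
    expanded = [1.0] + list(x)                     # degree 0 and degree 1 terms
    expanded += [x[i] * x[j]
                 for i, j in combinations_with_replacement(range(len(x)), 2)]
    return expanded

def linear_discriminant(z, w):
    """D(z) = w_1 z_1 + ... + w_L z_L + w_L+1."""
    return sum(wi * zi for wi, zi in zip(w[:-1], z)) + w[-1]

x = [0.5, -1.0]
z = quadratic_expansion(x)                         # [1, 0.5, -1.0, 0.25, -0.5, 1.0]
print(linear_discriminant(z, [0.1] * len(z) + [0.2]))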
LINEAR CLASSIFICATION
For an observation vector x in R^n, the output of the classifier is given by:
y = f( w^T x + w_0 )
where w is a vector of weights, w_0 is the bias, and f is a function that converts the scalar
product of the two vectors into the desired output. The weight vector w is learned from a
labeled learning set. The function f is often a simple threshold function, such as the sign
function or the Heaviside function, or a more complex function such as the hyperbolic
tangent or the sigmoid function. A more complex decision function could yield the
probability that a sample belongs to a certain class.
For a two-class discrimination problem, the operation performed by a linear classifier can
be seen as the separation of a high-dimensional space by a hyperplane: all points on one
side of the hyperplane are classified as 1, the others as -1. This hyperplane is called the
separating hyperplane.
Linear classifiers are often used in situations where low complexity is desired, since they
are the simplest and therefore fastest classifiers, especially when the observation vector x
is sparse. However, decision tree methods can be faster still.
Linear classifiers often get good results when N, the number of dimensions of the
observation space, is large, as in text search, where each element of x is the number of
occurrences of a word in a document.
Generative vs. discriminative models
There are two main types of method for estimating the parameter vector w of a linear
classifier.
The first is to model the conditional probabilities. These are called generative models.
Examples of algorithms of this type are:
• linear discriminant analysis (or Fisher's linear discriminant, FLD), which assumes a
discriminant model based on a probability distribution function of Gaussian type;
• the naive Bayesian classifier, which assumes a conditional probability distribution of
binomial type.
The second approach groups the discriminative models; it first seeks to maximize the
classification quality. In a second step, a cost function drives the adaptation of the final
classification model (minimizing the errors). Some examples of training classifiers by the
discriminative method (see the sketch after this list):
• Perceptron. An algorithm that seeks to correct all errors encountered when processing
the training sets (and thus improve the learning and the model created from the training
sets).
• Support Vector Machine. An algorithm that maximizes the margin of the classifier's
separating hyperplane, using the training sets for learning.
It is generally accepted that models trained by a discriminative method (SVM, logistic
regression) are more accurate than generative ones trained with conditional probabilities
(naive Bayesian or linear discriminant classifiers). Generative classifiers are considered
more suitable for classification processes with a lot of missing data (e.g. text
classification with little learning data).
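A minimal sketch of the perceptron idea in Python (the data, learning rate and number of
passes are hypothetical): whenever a training sample is misclassified, the weights and the
bias are corrected in the direction of that sample.

# Perceptron: error-correcting training of a linear classifier y = sign(w.x + w0).
def train_perceptron(samples, labels, passes=10, rate=0.1):
    """samples: list of feature vectors, labels: +1 / -1."""
    w = [0.0] * len(samples[0])
    w0 = 0.0
    for _ in range(passes):
        for x, label in zip(samples, labels):
            prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) + w0 > 0 else -1
            if prediction != label:                            # correct only on errors
                w = [wi + rate * label * xi for wi, xi in zip(w, x)]
                w0 += rate * label
    return w, w0

# Hypothetical linearly separable data in the plane:
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]]
y = [1, 1, -1, -1]
w, w0 = train_perceptron(X, y)
print(w, w0)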
LINEAR DISCRIMINANT ANALYSIS
Linear discriminant analysis is one of the techniques of predictive discriminant analysis.
It is about explaining and predicting the membership of an individual in a class (group)
from predefined characteristics measured using predictive variables. The variable to
predict is necessarily categorical (discrete). Linear discriminant analysis can be compared
to the supervised methods developed in machine learning and to logistic regression in
statistics.
We have a sample of n observations distributed across K groups of sizes n_1, ..., n_K.
Let Y be the variable to predict; it takes its values in {y_1, ..., y_K}. We have J predictive
variables X = (X_1, ..., X_J).
We note μ_k the centers of gravity of the conditional point clouds and Σ_k their
variance-covariance matrices.
The aim is to produce an assignment rule that can predict, for a given observation ω, its
associated value of Y from the values taken by X.
The Bayesian rule consists of producing an estimate of the a posteriori probability of
assignment
P(Y = y_k | X = x) = P(Y = y_k) f_k(x) / Σ_i P(Y = y_i) f_i(x)
where P(Y = y_k) is the a priori probability of belonging to class y_k and f_k(x)
represents the density function of X conditional on the class Y = y_k.
The allocation rule for an individual ω to be classified then becomes: assign ω to the
class y_k* for which the a posteriori probability P(Y = y_k | X = x(ω)) is maximal.
The whole issue of discriminant analysis is then to provide an estimate of the conditional
density f_k(x). There are two main approaches to estimating this distribution:
• The non-parametric approach makes no assumption about the distribution and proposes
a procedure for local estimation of the probabilities in the vicinity of the observation ω to
be classified. The best-known procedures are Parzen kernels and the method of nearest
neighbors. The main challenge is to define the neighborhood adequately.
In statistics, kernel density estimation (or the Parzen-Rosenblatt method) is a
non-parametric method for estimating the probability density of a random variable. It is
based on a sample of a statistical population and estimates the density at any point of the
support. In this sense, it cleverly generalizes the method of estimation by histogram.
If x_1, x_2, ..., x_N ~ f is a sample of a random variable, then the non-parametric kernel
density estimator is
f_h(x) = (1 / (N h)) Σ_i=1..N K( (x - x_i) / h )
where K is a kernel and h a window (bandwidth), which governs the degree of smoothing
of the estimate. Often, K is chosen as the density function of a standard Gaussian (zero
expectation and unit variance):
K(u) = (1 / sqrt(2π)) exp(-u^2 / 2)
The idea behind the Parzen method is a generalization of the method of histogram
estimation. In the latter, the density at a point x is estimated by the proportion of the
observations x_1, x_2, ..., x_N that lie in the vicinity of x. To do this, we draw a box
centered at x whose width is governed by a smoothing parameter h; we then count the
number of observations that fall into this box. This estimate, which depends on the
smoothing parameter h, has good statistical properties but is non-continuous.
The kernel method recovers continuity: for this, it replaces the box centered at x and of
width h by a bell curve centered at x. The closer an observation is to the support point x,
the higher the numerical value the bell curve gives it. In contrast, observations too far
from x are assigned a negligible numerical value. The estimator is formed by the sum (or
rather the average) of the bell curves. As shown in the picture below, it is clearly
continuous.
Six Gaussian bell curves (red) and their sum (blue). The kernel estimator of the density
f(x) is the average of the six bell curves. The variance of the normals is set to 0.5. The
more observations there are in the neighborhood of a point, the higher the estimated
density.
It can be shown that under weak assumptions, there is no nonparametric estimator
which converges faster than the kernel estimator.
The practical use of this method requires two things:
• the kernel K (usually the density of a statistical law);
• the smoothing parameter h.
While the choice of kernel is known to have little influence on the estimator, this is not
the case for the smoothing parameter. A parameter that is too low causes artificial details
to appear on the plot of the estimator. For a value of h that is too large, on the contrary,
the majority of the features are erased. The choice of h is therefore a central issue in the
estimation of density.
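A minimal sketch of this estimator in Python, using the Gaussian kernel given above (the
sample and the bandwidth h are hypothetical):

import math

# Kernel density estimate f_h(x) = (1 / (N h)) * sum_i K((x - x_i) / h).
def gaussian_kernel(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def kde(x, sample, h):
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

sample = [1.0, 1.2, 2.5, 2.7, 3.1]      # hypothetical observations
for x in (1.0, 2.0, 3.0):
    print(x, round(kde(x, sample, h=0.5), 3))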
• The second approach makes assumptions about the distribution of the conditional point
clouds; we speak in this case of parametric discriminant analysis. The most commonly
used hypothesis is undoubtedly the multinormal law, which takes its values in R^J.
In the case of the multidimensional normal distribution, the distribution of the conditional
point clouds is
f_k(x) = 1 / ( (2π)^(J/2) |Σ_k|^(1/2) ) exp( -(1/2) (x - μ_k)^T Σ_k^-1 (x - μ_k) )
where |Σ_k| is the determinant of the variance-covariance matrix conditional on the class
y_k.
Since the objective is to determine the maximum of the a posteriori probability of
assignment, we can ignore everything that does not depend on k. Passing to the
logarithm, we obtain the discriminant score, which is proportional to
d_k(x) = -(1/2) ln|Σ_k| - (1/2) (x - μ_k)^T Σ_k^-1 (x - μ_k) + ln P(Y = y_k)
The assignment rule becomes: assign ω to the class y_k* for which the discriminant score
d_k(x(ω)) is maximal.
If we fully develop the discriminant score, we obtain quadratic discriminant analysis.
Widely used in research because it behaves very well in terms of performance compared
to other methods, it is less widespread in practice. Indeed, the expression of the
discriminant score is rather complex, and it is difficult to discern clearly the direction of
causality between the predictive variables and the class. It is especially difficult to
distinguish the truly determinant variables in the classification, so that the interpretation
of the results is quite uncertain.
A second hypothesis allows us to simplify the calculations further: the assumption that
the variability is roughly the same in all groups, i.e. that the variance-covariance matrices
are identical from one group to another. Geometrically, this means that the point clouds
have the same shape (and volume) in the representation space.
The estimated variance-covariance matrix is in this case the intra-class (pooled
within-group) variance-covariance matrix, calculated with the following expression:
S = (1 / (n - K)) Σ_k (n_k - 1) S_k
where S_k is the estimated variance-covariance matrix of group k. Again, we can drop
from the discriminant score everything that no longer depends on k; it then becomes
d_k(x) = x^T S^-1 μ_k - (1/2) μ_k^T S^-1 μ_k + ln P(Y = y_k)
By developing the expression of the discriminant score after introducing this
homoscedasticity assumption, we see that it is expressed linearly in relation to the
predictive variables. We therefore have as many classification functions as there are
classes of the variable to predict; they are linear combinations of the following form:
d_k(x) = c_k,0 + c_k,1 x_1 + c_k,2 x_2 + ... + c_k,J x_J
It is possible, by studying the value and sign of the coefficients, to determine the sense of
causality in the ranking.
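A minimal sketch of these linear classification functions in Python, using numpy (the
group means, the pooled covariance matrix and the priors are hypothetical, assumed to
have been estimated as described above):

import numpy as np

# Linear discriminant scores d_k(x) = x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + ln P(Y = y_k).
def lda_scores(x, means, pooled_cov, priors):
    inv_S = np.linalg.inv(pooled_cov)
    return [x @ inv_S @ mu - 0.5 * mu @ inv_S @ mu + np.log(prior)
            for mu, prior in zip(means, priors)]

# Hypothetical two-class problem with two predictive variables:
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
pooled_cov = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = [0.5, 0.5]
x = np.array([1.8, 1.5])
print(int(np.argmax(lda_scores(x, means, pooled_cov, priors))))   # -> 1 (second class)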
Multinormality and this second assumption may seem too restrictive, limiting the scope
of linear discriminant analysis in practice.
A key concept in statistics is robustness. Even if its assumptions are not fully satisfied, a
method may still yield usable results. This is the case for linear discriminant analysis. The
most important thing is to see it as a linear separator: if the point clouds are linearly
separable in the representation space, it can operate properly.
Compared to other techniques, discriminant analysis provides comparable performance.
It may nevertheless be affected when the second assumption (equal covariance matrices)
is strongly violated.
SUPPORT VECTOR MACHINE
Support Vector Machines (SVM) are a set of supervised learning techniques for solving
problems of discrimination and regression. SVMs were developed in the 1990s from
theoretical considerations by Vladimir Vapnik on the construction of a statistical theory
of learning: the Vapnik-Chervonenkis theory. SVMs were quickly adopted for their
ability to work with large amounts of data, their low number of hyperparameters, the fact
that they are theoretically well founded, and their good practical results.
SVMs have been applied to many fields (bio-informatics, information retrieval, computer
vision, finance). Reported results indicate that the performance of support vector
machines is similar or even superior to that of a neural network.
Separators with large margins are based on two key ideas: the concept of
maximum margin and the concept of kernel function. These two concepts existed for
several years before they were pooled to construct the SVM.
The first key idea is the notion of maximum margin. The margin is the distance
between the border of separation and the closest samples. These samples are called
support vectors. In SVM, the boundary of separation is chosen as that which maximizes
the margin. This is justified by the theory of Vapnik-Chervonenkis (or statistical theory
of learning), which shows that the boundary of maximum margin of separation has the
smallest capacity. The problem is finding the optimal separating boundary, from a
learning set. This is done by formulating the problem as a quadratic optimization problem
for which there are known algorithms.
In order to address cases where the data are not linearly separable, the second key idea of
SVM is to transform the representation space of the data into a space of larger (possibly
infinite) dimension, in which it is likely that a linear separator exists. This is done through
a kernel function, which must meet certain conditions and which has the advantage of not
requiring explicit knowledge of the transformation applied to change the space (such as
from Cartesian to radial coordinates). Kernel functions allow us to replace a scalar
product in a large space, which is expensive, with a simple evaluation of a function. This
technique is known as the kernel trick.
SVMs can be used to solve problems of discrimination, i.e. deciding to which class a
sample belongs, or of regression, i.e. predicting the numerical value of a variable. In both
cases, the resolution goes through a function h which, for an input vector x, produces a
corresponding output y: y = h(x).
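A minimal sketch of such a decision function in Python, using a Gaussian (RBF) kernel;
the support vectors, their coefficients and the bias are hypothetical placeholders, assumed
to come from a previously solved training problem:

import math

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel K(u, v) = exp(-gamma * |u - v|^2)."""
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def svm_decision(x, support_vectors, coeffs, bias):
    """h(x) = sign( sum_i alpha_i y_i K(x_i, x) + b )."""
    value = sum(c * rbf_kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + bias
    return 1 if value > 0 else -1

# Hypothetical support vectors with coefficients alpha_i * y_i:
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
coeffs = [-0.8, 0.8]
print(svm_decision([1.8, 1.7], support_vectors, coeffs, bias=0.0))   # -> 1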
NEURAL NETWORKS
An artificial neural network is a computational model whose design is loosely based on
the functioning of real neurons (human or not). Neural networks are usually optimized by
methods of statistical learning, so they belong both to the family of statistical applications
and to the family of artificial intelligence methods, which they enrich by allowing
decisions that rely more on perception than on formal logic.
Neural networks are built on a biological paradigm, that of the formal neuron (in the
same way that genetic algorithms are built on natural selection). Such biological
metaphors have become common with the ideas of cybernetics.
Neural networks, as systems capable of learning, implement the principle of induction,
i.e. learning by experience. From individual situations they infer an integrated decision
system, whose generality depends on the number of learning cases encountered and on
their complexity relative to the complexity of the problem to solve. By contrast, symbolic
systems capable of learning, if they also implement induction, do so on the basis of
algorithmic logic, through a complex set of deductive rules (e.g. PROLOG).
With their ability for classification and generalization, neural networks are generally used
in problems of a statistical nature, such as the automatic classification of postal codes or
character recognition. The neural network does not always produce a rule that is usable
by a human: the network is often a black box that provides an answer when a pattern is
presented to it, but it does not provide a justification that is easy to interpret.
Neural networks are actually used, for example:
• for classification, e.g. the classification of animal species from pictures;
• for pattern recognition, e.g. optical character recognition (OCR), used in particular by
banks to verify the amounts on cheques and by the post office to sort mail according to
postal code;
• for the approximation of an unknown function;
• for accelerated modeling of a function that is known but very complicated to compute;
• for stock market estimation: attempts to predict the evolution of stock prices. This type
of prediction is highly disputed, because it is not clear that the price of a share has a
recurring character (the market largely anticipates the predictable increases and
decreases, which makes it difficult to predict any variation reliably);
• for modeling learning and improving teaching techniques.
Limits
• Artificial neural networks require real examples for learning (what we call the learning
base). The more complex the problem and the less structured its topology, the more
numerous these cases must be. For example, we can build a neural character-reading
system by using the images of a large number of characters written by hand by many
people. Each character can be presented as a raw image, with a topology in two spatial
dimensions, or as a series of connected segments. The chosen topology, the complexity
of the phenomenon being modeled, and the number of examples must be consistent with
one another. In practice this is not always easy, because the examples may be limited in
quantity or too expensive to collect in sufficient numbers.
The neuron calculates the sum of its inputs multiplied by the corresponding weights (and
adds the bias, if any), and then this value is passed through the transfer function to
produce its output.
A neural network is generally composed of a succession of layers, each of which takes
as inputs the outputs of the previous one. Each layer i is composed of N_i neurons, taking
their inputs from the N_i-1 neurons of the previous layer. A synaptic weight is associated
with each synapse, so that the N_i-1 outputs are multiplied by these weights and then
summed by the neurons of level i. Beyond this simple structure, the neural network may
also contain loops that radically change its possibilities, but also its complexity. In the
same way that loops can turn combinational logic into sequential logic, the loops in a
neural network transform a device for recognizing inputs into a complex machine capable
of all sorts of behaviors.
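A minimal sketch of this layered computation in Python (the weights, biases and the
choice of a sigmoid transfer function are hypothetical placeholders):

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# One neuron: weighted sum of its inputs plus a bias, passed through the transfer function.
def neuron(inputs, weights, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

# One layer: each neuron of level i takes the outputs of the previous layer as inputs.
def layer(inputs, weight_matrix, biases):
    return [neuron(inputs, w, b) for w, b in zip(weight_matrix, biases)]

# Hypothetical 2-3-1 network:
x = [0.5, -1.0]
hidden = layer(x, [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.8]], [0.0, 0.1, -0.1])
output = layer(hidden, [[0.3, -0.6, 0.9]], [0.05])
print(output)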
A neural network is made up of a very large number of small identical processing units
called artificial neurons. The first implementations were electronic (the Rosenblatt
perceptron); today they are most often simulated on a computer, for reasons of cost and
convenience.
Neurobiologists know that each biological neuron is connected sometimes to
thousands of others, and they transmit information by sending waves of depolarization.
More specifically, the neuron receives input signals from the other neurons by synapses,
and outputs information through its axon. In roughly similar manner, the artificial
neurons are connected together by weighted connections. With their size and speed, such
networks can handle questions of perception or automatic classification very well.
• MLP-type networks (Multi-Layer Perceptron) compute a linear combination of the
inputs, i.e. the aggregation function returns the inner product between the vector of inputs
and the vector of synaptic weights.
• RBF-type networks (Radial Basis Function) compute a distance between the inputs, i.e.
the aggregation function returns the Euclidean norm of the vector resulting from the
difference between the input vectors.
The activation function (or transfer function) is used to introduce non-linearity into the
functioning of the neuron.
Propagation of information
After this calculation is made, the neuron propagates its new internal state forward
through its axon. In a simple model, the neuron function is simply a threshold function: it
is 1 if the weighted sum exceeds a certain threshold, 0 otherwise. In a richer model, the
neuron operates with real numbers (often in the interval [0,1] or [-1,1]). It is said that the
neural network goes from one state to another when all of its neurons recalculate their
internal state in parallel, according to their inputs.
Learning
The concept of learning here is not the one modeled by deductive logic, which starts
from what is already known and derives new knowledge from it. This is the opposite
approach: from limited observations, it draws plausible generalizations; it is an induction
process.
The concept of learning covers two aspects:
• memory: the ability to assimilate possibly many examples in a compact form;
• generalization: the ability, having learned from examples, to treat distinct examples that
have not been encountered yet but are similar to the ones in the training set.
These two points are partly in opposition. If we favor one of them, we develop a system
that does not handle the other very effectively. In the case of statistical learning systems,
used to optimize conventional statistical models, neural networks and Markov automata,
it is generalization that we should focus on.
Learning can be supervised or not. Learning is supervised when the network is forced to
converge towards a specific final state at the same time as a pattern is presented to it. In
contrast, in unsupervised learning, the network is allowed to converge to any state when
a pattern is presented to it.
Algorithm
The vast majority of neural networks undergo a training phase that consists of changing
the synaptic weights according to a set of data presented as input to the network. The
purpose of this training is to enable the neural network to learn from examples. If the
training is done correctly, the network is able to provide output responses very close to
the original values of the training set. But the whole point of neural networks lies in their
ability to generalize beyond the training set.
Overtraining
Often, the examples in the learning base include noisy or approximate values. If the
network is required to respond almost perfectly to these examples, we can get a network
that is biased by incorrect values. To avoid this, there is a simple solution: just divide the
examples into two subsets. The first is used for learning and the second for validation. As
long as the error obtained on the second set decreases, we can continue learning;
otherwise, we stop.
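A minimal sketch of this early-stopping rule in Python (train_one_pass and
validation_error are hypothetical placeholders for the actual training and evaluation
routines, supplied by the caller):

# Early stopping: keep training while the error on the validation subset decreases.
def train_with_early_stopping(network, train_set, validation_set,
                              train_one_pass, validation_error, max_passes=1000):
    best_error = validation_error(network, validation_set)
    for _ in range(max_passes):
        train_one_pass(network, train_set)            # one pass over the learning subset
        error = validation_error(network, validation_set)
        if error >= best_error:                       # error no longer decreases: stop
            break
        best_error = error
    return network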
Backpropagation
Backpropagation consists of transmitting backward the error that a neuron "commits"
to its synapses and to the neurons connected to them. For neural networks, we typically
use backpropagation of the error gradient, which corrects the mistakes according to the
importance of the elements that participated in creating them: synaptic weights that
contribute to a significant error will be modified more significantly than weights that led
to a small error.
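As an illustration, here is a minimal sketch in Python of a gradient step for a single
sigmoid neuron (the data and the learning rate are hypothetical); the weight corrections
are proportional to each input's contribution to the error, which is the idea that
backpropagation extends across several layers:

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# One gradient step for a single sigmoid neuron trained on one (x, target) pair.
def gradient_step(w, bias, x, target, rate=0.5):
    out = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)
    delta = (out - target) * out * (1.0 - out)             # error times sigmoid derivative
    w = [wi - rate * delta * xi for wi, xi in zip(w, x)]   # correction per input's contribution
    bias -= rate * delta
    return w, bias

w, b = [0.2, -0.1], 0.0
for _ in range(100):
    w, b = gradient_step(w, b, [1.0, 0.5], target=1.0)
print(sigmoid(w[0] * 1.0 + w[1] * 0.5 + b))                # approaches 1.0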
Together, the weights of the synaptic connections determine the functioning of the neural
network. Patterns are presented to a subset of the neural network: the input layer. When a
pattern is applied to the network, it seeks to reach a stable state. When this is reached, the
activation values of the output neurons constitute the result. Neurons that belong neither
to the input layer nor to the output layer are called hidden neurons.
The types of neural networks differ in several parameters:
• the topology of the connections between neurons;
• the aggregation function used (weighted sum, pseudo-Euclidean distance, ...);
• the activation function used (sigmoid, step, linear function, Gaussian, ...);
• the learning algorithm (gradient backpropagation, cascade correlation, ...).
Many other parameters may be implemented as part of the learning of these neural
networks, for example:
• the method of weight decay, which helps avoid side effects and neutralize over-learning.
Since there is a whole chapter dedicated to neural networks, this presentation is general.
All important details are given in chapter IV.
Examples of pattern recognition
Binary features: Optical character recognition
For optical character recognition we have an image (e.g. a bitmap) containing printed (or
handwritten) text. The image may come from scanning a page of a printed paper. We
assume that each character has first been segmented (using specific image processing
techniques), so that we have a set of binary objects. The image corresponding to each
character can be codified by an m x n matrix of values belonging to {0,1}, where 0
encodes the absence of a black pixel and 1 its presence. Below are examples of the letters
a, b, c, and d represented in binary form:
a:
011100
100010
100010
100010
100010
011111

b:
100000
100000
111110
100001
100001
111110

c:
011110
100001
100000
100000
100001
011110

d:
000001
000001
011111
100001
100001
011111
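A minimal sketch in Python of how such binary matrices can be compared with stored
prototypes (the "distance" used here is simply the number of differing pixels; the
prototype dictionary is a hypothetical placeholder):

# Classify a binary character matrix by comparison with stored prototypes.
def pixel_distance(a, b):
    """Number of positions where two equally sized binary matrices differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))

def classify_character(matrix, prototypes):
    return min(prototypes, key=lambda label: pixel_distance(matrix, prototypes[label]))

# Hypothetical 2 x 3 prototypes for two classes:
prototypes = {
    "I": [[0, 1, 0], [0, 1, 0]],
    "-": [[0, 0, 0], [1, 1, 1]],
}
unknown = [[0, 1, 0], [0, 1, 1]]
print(classify_character(unknown, prototypes))   # -> I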
Quantitative features: classification of electron microscopy images
The left image comes from electron microscopy of the serum of hepatitis B patients and
highlights 3 types of particles: small spherical particles with a diameter of 22 nm, tubular
forms with a thickness of 22 nm and a length of 20-250 nm, and Dane particles of
circular form with a diameter of 42 nm. The hepatitis B virus is considered to be the viral
particle called the Dane particle. Using specific image processing segmentation
techniques, the right-hand image was obtained, where the particles are represented as
binary objects.
In order to detect the presence of the Dane particles, we represent the forms (particles)
using 2 features associated with the objects' geometry: area and circularity. The area of
an object is given by the number of pixels in the image that compose the object.
Circularity is defined as the ratio of the area to the squared perimeter, an indicator of the
shape of the object: C = 4πA / P^2. Circularity is 1 for a circle and takes a subunitary
value for any other geometrical figure. This allows discrimination between the circular
and tubular shapes, while the area makes the difference between the two circular forms
(diameters of 22 and 42 nm respectively).
Each form (particle) will be described by the pair (A, C), so the form space will be R^2.
Calculating the values of the two features for each particle in the image, we represent
each form as a point in the 2D plane, with the features associated with the two axes.
Similar representations can be made with 3 features (in 3D space).
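A minimal sketch in Python of this feature computation and of a threshold-based decision
(the area and perimeter values and the decision thresholds are hypothetical):

import math

def circularity(area, perimeter):
    """C = 4 * pi * A / P^2: equals 1 for a circle, less for other shapes."""
    return 4 * math.pi * area / perimeter ** 2

def particle_type(area, perimeter, circ_threshold=0.8, area_threshold=900):
    """Classify a segmented particle from its (A, C) pair (thresholds are illustrative)."""
    c = circularity(area, perimeter)
    if c < circ_threshold:
        return "tubular form"
    return "Dane particle" if area > area_threshold else "small spherical particle"

# Hypothetical measurements in pixels:
print(particle_type(area=1385, perimeter=132))   # disc of radius ~21 px -> Dane particle
print(particle_type(area=380, perimeter=69))     # disc of radius ~11 px -> small spherical particle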