Proceedings of the 8th Student Scientific Research Conference, University of Da Nang, 2012
OPTICAL CHARACTER RECOGNITION
FOR VIETNAMESE SCANNED TEXT
Authors: Tran Anh Viet, Le Minh Hoang Hac, Le Tuan Bao Ngoc, Le Anh Duy
Class: 08ECE, Electronic and Communication Engineering Department, Da Nang University of
Technology
Advisors: Ph.D. Pham Van Tuan, M.E. Hoang Le Uyen Thuc
Electronic and Communication Engineering Department, Da Nang University of Technology
Abstract
Optical Character Recognition (OCR) is a technology that enables people to digitize scanned
images by converting them into editable text on a computer, which speeds up the transfer of
data into the computer from many kinds of source documents. It is also useful for
handwriting recognition and for making digital images searchable by text. In this paper, we propose
an OCR system capable of recognizing Vietnamese characters in typed text using two
recognition methods: template matching and an artificial neural network. Each method has its own
advantages and weaknesses, which are shown in this paper so that
readers can decide which method to use in a specific situation of OCR for Vietnamese
typed text.
1. Introduction
In recent years, OCR has become a popular industry around the world for a variety
of languages, and Vietnamese is no exception. However, in comparison with other
languages, Vietnamese OCR technology is still young and needs improvement to become
more efficient and more widely applicable. With this motivation, our group decided to
research OCR in order to find a simpler but still efficient alternative for the Vietnamese language.
The process of performing OCR on printed Vietnamese script is discussed in detail throughout
this paper.
2. Procedure
A general approach to any OCR problem [2] contains 7 steps, as shown in figure 1.
Images in a standard format such as BMP or JPEG are fed into our
system. The scanned images go through pre-processing, segmentation,
feature extraction, classification/recognition and post-processing, in that order, and then appear at the output
as text (figure 1). We discuss this procedure step by step in the following sections.
Figure 1: General structure of an OCR process (scanned image input, pre-processing, segmentation, feature extraction, classification and recognition, post-processing, recognized text)
2.1 Pre-processing
The first step in the pre-processing stage is to convert the color or gray-scale image into a binary
image. To determine a threshold value for binarizing the image, we apply the following
formula [4]:
$$T = T[x, y, p(x, y), f(x, y)] \qquad (1)$$
where T is the threshold function, f(x, y) is the gray-scale level of point (x, y) and
p(x, y) denotes some local property of this point. A thresholded image g(x, y) is defined as:
$$g(x, y) = \begin{cases} 1, & f(x, y) > T \\ 0, & f(x, y) \le T \end{cases} \qquad (2)$$
The second step is to remove background noise using a median filter [4].
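The two pre-processing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the threshold is taken as a single global constant (equation (1) allows it to vary with position), the filter window is 3 x 3, and the names `binarize` and `median_filter` are hypothetical.

```python
def binarize(gray, threshold):
    """Equation (2): 1 where the gray level exceeds the threshold, else 0.
    Assumption: one global threshold for the whole image."""
    return [[1 if value > threshold else 0 for value in row] for row in gray]


def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighbourhood.
    Border pixels use only the neighbours that lie inside the image."""
    half = size // 2
    rows, cols = len(image), len(image[0])
    filtered = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = sorted(
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            )
            # Upper median, so the result stays an integer pixel value.
            out_row.append(window[len(window) // 2])
        filtered.append(out_row)
    return filtered


# A dark 5 x 5 patch with one isolated bright speck (noise) in the middle.
gray = [[20] * 5 for _ in range(5)]
gray[2][2] = 200
binary = binarize(gray, threshold=128)
cleaned = median_filter(binary)
```

In this toy input the speck survives binarization but is removed by the median filter, because it is outvoted by its eight neighbours.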
2.2 Segmentation
The segmentation process consists of the following steps, in order:
Split the original image into individual lines using the horizontal projection profile of the
image [6].
Lines are the cells (groups of consecutive rows) whose horizontal projection profile value is greater than some
minimum value (0 in this case).
Split each word in a line into characters using bounding boxes and the vertical
projection profile [6].
Letters are the cells (groups of consecutive columns) whose vertical projection profile value is greater than some
minimum value (0 in this case).
After obtaining the images of single letters, crop the unnecessary regions and
resize each character image to an appropriate size, as shown in figures 5, 6a and 6b.
Figure 2: Eliminating Gaussian noise with a median filter
Figure 3: Horizontal projection profile diagram of an image
Figure 4: Vertical projection profile diagram for every single character
Figure 5: Cropping the empty region from a letter’s image
Figure 6a: Resizing the height to 16; the width depends on the image ratio
Figure 6b: Resizing both height and width to 16
Finally, map the pixels of each letter matrix into a cell array for the recognition process (figure 7).
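The projection-profile steps above can be sketched as follows. This is a minimal illustration, assuming a binary image stored as a list of rows in which ink pixels are 1 and background pixels are 0; the helper names `profile`, `runs` and `segment` are hypothetical, and word-level grouping is skipped.

```python
def profile(binary, axis):
    """Projection profile: ink count per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in binary]
    return [sum(column) for column in zip(*binary)]


def runs(values, minimum=0):
    """(start, end) pairs of consecutive entries greater than minimum; end is exclusive."""
    spans, start = [], None
    for index, value in enumerate(values):
        if value > minimum and start is None:
            start = index
        elif value <= minimum and start is not None:
            spans.append((start, index))
            start = None
    if start is not None:
        spans.append((start, len(values)))
    return spans


def segment(binary):
    """Split a page into lines, then each line into tightly cropped letters."""
    letters = []
    for top, bottom in runs(profile(binary, 0)):
        line = binary[top:bottom]
        for left, right in runs(profile(line, 1)):
            glyph = [row[left:right] for row in line]
            # Crop the empty rows above and below the letter.
            inked = runs(profile(glyph, 0))
            letters.append(glyph[inked[0][0]:inked[-1][1]])
    return letters


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
]
letters = segment(page)
```

Cropping from the first to the last inked row keeps any blank rows inside a glyph, so a mark that sits above its base letter within the same column range stays in the same letter image.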
2.3 Feature extraction
Feature extraction reduces the matrix size of a letter’s image, which greatly helps the training and
recognition processes. In our project,
we employ Hu’s seven moment invariants [5], which are
invariant to image transformations including scaling, translation and rotation. The computation of
Hu’s moments follows figure 8 [5]:
The letter’s image is first converted to binary format. The function that computes the
regular moments has the form [m] = moment(fig, p, q), where fig is the input binary image
and p, q are the predefined moment orders. With these parameters available, we do the
summation according to the definition of the regular moments [5]:
$$m_{pq} = \sum_{x} \sum_{y} x^{p} y^{q} f(x, y) \qquad (3)$$
where f(x, y) is the value of pixel (x, y) of fig. The centroid of the binary image is computed according to $\bar{x} = m_{10} / m_{00}$ and
$\bar{y} = m_{01} / m_{00}$. Based on the centroid of the image, and similarly to the regular moments, the
central moments function has the form [µ] = central_moment(fig, p, q). It is
computed according to the definition [5]:
$$\mu_{pq} = \sum_{x} \sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y) \qquad (4)$$
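As a rough illustration of the moment computation, the sketch below implements the regular and central moments and, from them, the first two of Hu's seven invariants using their standard definitions. It assumes that x is the column index and y the row index of the binary image, and it normalizes the central moments as eta_pq = mu_pq / mu_00^(1 + (p + q) / 2). The function `hu_first_two` is a hypothetical helper; the remaining five invariants are built from the same normalized moments.

```python
def moment(fig, p, q):
    """Regular moment m_pq; x is the column index and y the row index (assumption)."""
    return sum((x ** p) * (y ** q) * value
               for y, row in enumerate(fig)
               for x, value in enumerate(row))


def central_moment(fig, p, q):
    """Central moment mu_pq, taken about the centroid (x_bar, y_bar)."""
    m00 = moment(fig, 0, 0)
    x_bar = moment(fig, 1, 0) / m00
    y_bar = moment(fig, 0, 1) / m00
    return sum(((x - x_bar) ** p) * ((y - y_bar) ** q) * value
               for y, row in enumerate(fig)
               for x, value in enumerate(row))


def hu_first_two(fig):
    """First two Hu invariants from the normalized central moments eta_pq."""
    mu00 = central_moment(fig, 0, 0)

    def eta(p, q):
        return central_moment(fig, p, q) / mu00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2


# An L-shaped blob and the same blob shifted by 2 rows and 3 columns.
shape = [[0] * 6 for _ in range(6)]
shifted = [[0] * 6 for _ in range(6)]
for y, x in [(1, 1), (2, 1), (3, 1), (3, 2)]:
    shape[y][x] = 1
    shifted[y + 2][x + 3] = 1

features = hu_first_two(shape)
features_shifted = hu_first_two(shifted)
```

Because the central moments are taken about the centroid, the two feature pairs agree up to floating-point rounding, which is the translation invariance mentioned in the text.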
Figure 7: Matrix mapping for a letter
Figure 8: Block diagram of computing Hu’s seven moment invariants
2.4 Training & Recognition
We used two main methods for the character recognition engine: Artificial Neural
Network (ANN) and Template Matching (TM).
2.4.1 Template Matching Method
The template matching method uses a simple algorithm to measure the difference
between the character samples and the character prototypes in the library. To work
well, TM must have an extensive library of character image prototypes that covers
the many possible conditions of the input images. The core algorithm of TM is the Euclidean
distance (ED) [5], which is computed between the input image and every prototype in the
library. The Euclidean distance is calculated as follows:
$$ED = \sqrt{\sum_{i} \sum_{j} \left( x(i, j) - y(i, j) \right)^{2}}$$
where x(i, j) and y(i, j) are the pixels of the input image and the prototype,
respectively. The prototype with the smallest distance gives the recognized character.
As mentioned above, the TM recognition engine produces a considerable number of
errors if the number of prototypes is small. Since the TM engine is rather simple, it does not
need feature extraction for a character image; the computation takes the pixels of the
image as its input instead.
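A minimal sketch of this nearest-prototype matching is given below. It assumes that the input image and all prototypes are binary matrices of the same size; the function names and the tiny 3 x 3 library are hypothetical and serve only to illustrate the idea.

```python
import math


def euclidean_distance(image, prototype):
    """Euclidean distance between two equally sized pixel matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(image, prototype)
                         for a, b in zip(row_a, row_b)))


def recognize(image, library):
    """Return the label of the closest prototype.
    library is a list of (label, prototype) pairs; a label may appear several times."""
    label, _ = min(library, key=lambda entry: euclidean_distance(image, entry[1]))
    return label


# Toy prototype library (assumption: real prototypes would be full-size letter images).
library = [
    ("o", [[1, 1, 1], [1, 0, 1], [1, 1, 1]]),
    ("l", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
]
# A vertical bar with one noisy pixel in the bottom-right corner.
sample = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
result = recognize(sample, library)
```

Storing several prototypes per character simply means adding more pairs with the same label, which is why the size of the library matters for this method.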
2.4.2 Artificial Neural Network Method (ANN)
The general process of the ANN recognition method is shown in figure 9 [2].
Figure 10 shows the training neural network, with 256 input cells, 14
output cells and 150 hidden neurons.
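To make the network dimensions concrete, the sketch below runs one forward pass through a 256-150-14 network. It is only an illustration under several assumptions that the text does not confirm: the figure of 150 is read as the number of neurons in a single hidden layer, the activation is a sigmoid, and the weights are random because training is omitted. All names are hypothetical.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """One layer: each neuron holds n_in weights followed by one bias (small random values)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward_layer(layer, inputs):
    """Sigmoid of the weighted sum plus bias, for every neuron in the layer."""
    outputs = []
    for neuron in layer:
        # zip stops after the last input, so the trailing bias is added separately.
        z = neuron[-1] + sum(w * x for w, x in zip(neuron, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


def forward(network, pixels):
    """Propagate the 256 pixel values through all layers."""
    activations = pixels
    for layer in network:
        activations = forward_layer(layer, activations)
    return activations


rng = random.Random(0)
# Assumed topology: 256 inputs -> 150 hidden neurons -> 14 outputs.
network = [init_layer(256, 150, rng), init_layer(150, 14, rng)]
pixels = [rng.randint(0, 1) for _ in range(256)]
outputs = forward(network, pixels)
# Threshold the 14 outputs to obtain a binary code.
bits = [1 if value >= 0.5 else 0 for value in outputs]
```

With untrained weights the resulting bits are meaningless; the point is only that a 256-pixel input produces 14 outputs that can be thresholded into a binary character code.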
3. Experiments with TM and ANN
3.1 Template matching
With 1600 prototypes, we are able to recognize one page of printed text in the Arial
font with an accuracy of 82% to 90%. The confidence level is much higher with cleanly
printed texts and drops significantly with poorly printed ones.
Table 1: Overall results for template-matching technique
Font | Accuracy | Prototypes/character | Text condition
Arial | 82%-90% | 8 | Good
Arial | 82%-90% | 6 | Good
Arial | 82%-90% | 3 | Good
Figure 9: Training and recognition using an artificial neural network. First, the image is analyzed for characters and the symbols are converted to pixel matrices. Second, the corresponding desired output character is retrieved and converted to Unicode; the matrix is then reshaped and fed to the network, and the network is computed.
Figure 10: Training network
3.2 Artificial neural network
Input of network: 256 binary values, corresponding to the 256 pixels of the resized image.
Output of network: 14 binary values, corresponding to the binary Unicode code of the character.
Training data: 114 lowercase Vietnamese character images in the “Arial” font. We use
six collections of these images with different sizes (heights of 200, 180, 160, 140, 120 and 100
pixels) plus 6 space-character images. In total, we have 114 × 6 + 6 = 690 noise-free samples for each
font.
Target data: the 114 lowercase Vietnamese characters in the “Arial” font and the space character.
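The input and target encodings described above can be sketched as follows. This is an illustration only: nearest-neighbour resizing to 16 x 16 is an assumption (the paper does not name the resizing method), the 14 target bits are taken to be the Unicode code point written most significant bit first, and all function names are hypothetical.

```python
def resize_nearest(glyph, size=16):
    """Nearest-neighbour resize of a binary glyph to size x size (assumed method)."""
    rows, cols = len(glyph), len(glyph[0])
    return [[glyph[r * rows // size][c * cols // size] for c in range(size)]
            for r in range(size)]


def to_input_vector(glyph, size=16):
    """Flatten the resized glyph row by row into size * size binary inputs."""
    return [pixel for row in resize_nearest(glyph, size) for pixel in row]


def char_to_target(char, bits=14):
    """Code point of the character as a list of bits, most significant bit first."""
    code = ord(char)
    if code >= 1 << bits:
        raise ValueError("code point does not fit in the requested number of bits")
    return [(code >> shift) & 1 for shift in range(bits - 1, -1, -1)]


def target_to_char(target):
    """Inverse of char_to_target: rebuild the character from its bits."""
    code = 0
    for bit in target:
        code = (code << 1) | bit
    return chr(code)


inputs = to_input_vector([[1, 0], [0, 1]])   # 256 values
target = char_to_target("\u1ebf")            # U+1EBF, the letter e with circumflex and acute
decoded = target_to_char(target)
```

Precomposed Vietnamese letters lie below U+1F00, so their code points fit in 14 bits with room to spare.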
Table 2: Results for neural network method
Character image size (height in pixels) | Result
76 | 94.7%
67 | 92.11%
58 | 92.11%
49 | 85.6%
Recognition of actual images by the ANN:
o v d đ c e ắ ằ ấ ầ ề ế l 9 w
Khi hoàn thành vào năm 2017, đây sẽ là tòa tháp cao nhất thế giới
hello word ŷ viefnmesecharacferrecogition
4. Conclusion
For both the template matching (TM) and artificial neural network (ANN) methods,
the recognition accuracy for individual letters is fairly high (82%-95%). TM seems to be
better than ANN at recognizing script in good condition. However, TM depends strongly
on font style and text condition, while ANN is capable of dealing with different fonts and
various test conditions. Based on these results and in our view, future work should
focus more on ANN for the recognition of printed Vietnamese script because of its ability to adapt to
multiple text styles and conditions. To enhance the ANN’s performance, besides better
segmentation, using multiple feature extraction methods can be a good way to go.
References
[1] Twan van Laarhoven (Eng) (11/2010), Text Recognition in Printed Historical
Documents, Concordia University, Montreal, Canada.
[2] Raghuraj Singh, C.S. Yadav, Prabhat Verma, Vibhash Yadav (Eng) (6/2010), Optical
Character Recognition (OCR) for Printed Devnagari Script using Artificial Neural Network.
[3] K.-L. Du, PhD, M.N.S. Swamy, PhD, D.Sc, (Eng), (2006), Neural Networks in a
Softcomputing Framework, Concordia University, Montreal, Canada.
[4] Rafael C. Gonzalez, Richard E. Woods, (Eng), (1992), Digital Image Processing, second
edition.
[5] Qing Chen, Ottawa, Canada, (Eng), (2003), Evaluation of OCR Algorithms for Images
with Different Spatial Resolutions and Noises, University of Ottawa.
[6] YI LU, (Eng), (5/1994), Machine-printed segmentation, Department of Electrical and
Computer Engineering, The University of Michigan Dearborn, Dearborn,