COMPARING CONVOLUTIONAL NEURAL NETWORKS IN
VIETNAMESE SCENE TEXT RECOGNITION
Le Ngoc Thuy*
Abstract: Scene text recognition is a challenging task for the research community, especially for scripts with diacritical marks such as Vietnamese. In this paper, two different convolutional network architectures for recognising Vietnamese text in natural scenes are presented. Experiments are conducted to compare the performance of the two networks in reading Vietnamese restaurant signs. Experimental results show that the deeper network outperforms the other in both recognition accuracy and computational time.
Keywords: Scene text recognition, Optical character recognition, Convolutional neural networks.

1. INTRODUCTION
Reading text in natural scene images refers to the problem of converting image
regions into strings. Scene text recognition is a crucial issue in many useful
applications including: automatic sign translation, text detection system for the
blind, intelligent driving assistance, and content-based image/video retrieval. Hence, scene text recognition has received increasing interest from the research and industry communities in recent years.
Although scene text recognition seems similar to optical character recognition (OCR), reading text in natural scene images is much more challenging. One of the leading commercial OCR engines, ABBYY FineReader, claims that it is capable of transforming scanned documents, such as graphics and images, into text with an accuracy of 99.8%. However, its character recognition accuracy drops to as low as 21% in scene text applications [1]. The difficulty of scene text recognition results from the following three facts. Firstly, the appearances of characters often vary drastically in fonts, colors and sizes, even within the same image. Secondly, the text in captured images is affected by various factors, such as blur, distortion, non-uniform illumination, occlusion and complex backgrounds. Lastly, other objects in the captured image make the problem more challenging.
Numerous studies have dealt with scene text detection and recognition during the last two decades, but most of the existing methods and benchmarks have focused on English text. There have been only a few efforts addressing scene text detection and recognition for language scripts with diacritics [2]. The results of the ICDAR 2013 Robust Reading Competition showed that the participating methods usually failed to detect the dot of the letters "i" and "j" [3]. Therefore, if the current methods in scene text detection and recognition were applied directly to other languages, most of them would likely miss the tiny atoms of language scripts with diacritics such as Vietnamese, Thai and Arabic (Figure 1). For instance, commercial OCR software works well with scanned English documents but still makes significant errors when transforming scanned Vietnamese documents into text, mainly on letters with diacritics. Moreover, some Vietnamese words may consist of one
letter with two diacritics above or below it. This distinctive characteristic makes the recognition of Vietnamese script more challenging than that of most other scripts.
As numerous researchers have devoted themselves to detecting and recognising scene text, many papers have provided comprehensive surveys on these problems [4-11]. The most comprehensive survey [4] addresses more than 200 papers, which are classified into two groups. The first group includes stepwise methodologies, which address the problem of reading scene text in four separate steps: localization, verification, segmentation, and recognition. The advantages of stepwise methodologies are computational efficiency and the capability of processing oriented text. However, their disadvantages are the complexity of integrating different techniques from all four steps and the difficulty of optimizing parameters for all steps at the same time. The other group includes integrated methodologies, which identify specific words in images using character and language models. While the integrated methodologies have a clear advantage in optimizing parameters for the whole solution, they are often computationally expensive and limited to a small lexicon of words.

Figure 1. The same sentence in different languages: English, Arabic, Slovakian,
Vietnamese, Urdu, Japanese and Thai.
Another valuable survey [5] gives an overview of recent advances in scene text detection and recognition for static images by referring to around 100 papers. Y. Zhu et al. [5] classify related work on scene text detection into three types of methods: texture-based methods, component-based methods and hybrid methods. The paper not only analyses the strengths and weaknesses of the compared methods but also gives a useful discussion of state-of-the-art algorithms and future trends in scene text detection and recognition.
The above papers emphasize the strong performance of deep learning methods in scene text detection and recognition. They also suggest that further improvement in detection and recognition accuracy can be achieved if a deep learning framework is employed and combined with language knowledge. Among the studies using deep learning and big data, Google PhotoOCR [12] is a remarkably successful work which won the ICDAR Robust Reading Competition in 2013. It takes advantage of substantial progress in deep learning and large-scale language modeling. Its deep neural network (DNN) character classifier is trained on two million examples, while its language model is built using a corpus of more than a trillion tokens. Many other methods using DNNs have achieved the top scores
in ICDAR Robust Reading Competitions.
To the best of our knowledge, there has not been any study of word-level scene text recognition for Vietnamese. Hence, this paper explores this area by comparing the performance of two neural networks in recognising Vietnamese words on restaurant signs. The concept of convolutional neural networks (CNNs) is introduced in the next section. Then, two network architectures with different complexity levels are presented. Section 3 discusses experimental results when using the presented networks for Vietnamese text recognition.
2. SCENE TEXT RECOGNITION USING CNNs
2.1. Background theory
Convolutional neural networks are specific feed-forward multilayer neural networks which combine the following three architectural ideas: (i) local receptive fields, used to detect elementary visual features in images, such as oriented edges, end points or corners; (ii) shared weights, to extract the same set of elementary features from the whole input image and to reduce the computational cost; (iii) sub-sampling operations, to reduce the computational cost and the sensitivity to affine transformations such as shifts and rotations [3].
A convolutional neural network consists of many layers, including the input layer, the output layer, and hidden layers. The hidden layers of convolutional networks include convolutional layers and pooling layers. Each unit in a convolutional layer is locally connected to a set of units located in a small neighborhood of the previous layer. The outputs of convolutional layers are called feature maps because they help to extract the visual features in images. The output features at one layer may be used to build higher-order features in the next layers.
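To make these ideas concrete, the following minimal NumPy sketch (an illustration, not code from this work) slides one shared 5x5 kernel over the local receptive fields of a toy 32x32 image and then sub-samples the resulting feature map by max pooling:

import numpy as np

def conv2d_valid(image, kernel):
    # Slide one shared kernel over the image: every output unit sees only a
    # small local receptive field, and all units reuse the same weights.
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def max_pool(fmap, size=2, stride=2):
    # Sub-sample the feature map to cut the computational cost and reduce
    # the sensitivity to small shifts of the input.
    rows = (fmap.shape[0] - size) // stride + 1
    cols = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

image = np.random.rand(32, 32)      # a toy grayscale input
kernel = np.random.randn(5, 5)      # one shared set of 5x5 weights
feature_map = max_pool(conv2d_valid(image, kernel))
print(feature_map.shape)            # (14, 14): the 28x28 convolution output pooled 2x2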
Unfortunately, no algorithm is able to automatically determine the optimal architecture of a CNN for a given classification task. The architecture of the network, such as the number of layers, the number of units in each layer, and the network parameters, must be determined through experiments. This section presents the two convolutional network architectures which are used for the experiments on Vietnamese scene text recognition in Section 3.
2.2. Network architecture 1
The first proposed network consists of three convolution layers as shown in
Figure 2. The input of network is the coloured image with the size of 32x32x3. The
first convolutional layer has 32 feature maps corresponding to 32 convolutional
filters. The size of each convolutional filter in the first layer is 5x5x3. The second
and third convolutional layers have 32 and 64 feature maps, respectively. The
outputs of convolutional layers are sub-sampled using the max pooling function
and normalised by the rectifier linear unit ReLU. The receptive field of pooling
layers is a 3x3 matrix with the stride of 2. The last two layers are fully connected
to combine the features learned from the previous convolution and pooling layers.
The number of filters in the last layer is the number of classes to be recognised.
This architecture has totally 12,399,306 connections while having only 145,578
parameters thanks to the weight sharing characteristic.
Figure 2. The first convolutional network architecture: input image 32x32x3, conv. layer 32x32x32, pooling layer 16x16x32, conv. layer 16x16x32, pooling layer 8x8x32, conv. layer 8x8x64, pooling layer 4x4x64, fully connected layers 1x1x64 and 1x1x10.
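A minimal PyTorch sketch of this architecture is given below. It is an illustrative reconstruction rather than the author's code: the paper only specifies the 5x5x3 filters of the first layer, so 5x5 kernels with padding 2 are assumed for all three convolutional layers, and the 3x3, stride-2 pooling uses padding 1 so that the feature-map sizes of Figure 2 (32 -> 16 -> 8 -> 4) are reproduced. Under these assumptions the sketch reproduces the 145,578 parameters quoted above.

import torch
import torch.nn as nn

net1 = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2),        # 32x32x32 feature maps
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # 16x16x32
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=5, padding=2),       # 16x16x32
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # 8x8x32
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5, padding=2),       # 8x8x64
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # 4x4x64
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(4 * 4 * 64, 64),                         # first fully connected layer
    nn.Linear(64, 10),                                 # one output per class
)

print(sum(p.numel() for p in net1.parameters()))    # 145578 learnable parameters
print(net1(torch.randn(1, 3, 32, 32)).shape)        # torch.Size([1, 10])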
2.3. Network architecture 2
The second network architecture is simpler than the first one. It consists of only one convolution layer and one pooling layer (Figure 3). To get more information from the input data, a larger input image size is used (64x64x3). The convolutional layer is created using 400 kernel filters of size 8x8x3. The output of the convolutional layer is sub-sampled using the average pooling function and normalised by the sigmoid function. The receptive field of the pooling layer is a 3x3 matrix with a stride of 3, so that the sub-sampled areas are non-overlapping. This architecture has a total of 250,822,800 connections while having only 77,200 parameters.
Figure 3. The second convolutional network architecture: input image 64x64x3, conv. layer 57x57x400, pooling layer 19x19x400.
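A corresponding sketch of the second architecture, under the same caveat that it is an illustrative reconstruction rather than the author's code, reproduces the layer sizes of Figure 3 and the stated 77,200 parameters:

import torch
import torch.nn as nn

net2 = nn.Sequential(
    nn.Conv2d(3, 400, kernel_size=8),       # 64 - 8 + 1 = 57 -> 57x57x400 feature maps
    nn.AvgPool2d(kernel_size=3, stride=3),  # non-overlapping 3x3 windows -> 19x19x400
    nn.Sigmoid(),
)

print(sum(p.numel() for p in net2.parameters()))   # 77200 learnable parameters
print(net2(torch.randn(1, 3, 64, 64)).shape)       # torch.Size([1, 400, 19, 19])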
3. EXPERIMENTS AND RESULTS
3.1. Training dataset
Since no labeled dataset of Vietnamese scene text could be found on the internet, a dataset of Vietnamese restaurant signs was built by collecting images from the internet and by capturing shop signs on the street (Figure 4). The collected dataset consisted of 1,301 images containing 464 instances of the word "bún" (rice noodle), 409 of "phở", and 428 of "cơm" (rice). This dataset was split into two subsets: two thirds of the images were used for training the network, and the rest were used for validation.
Figure 4. Images of dataset.
Convolutional neural networks often require a large amount of data so that the networks can learn the features of objects by themselves. Hence, images of other objects were added to the training dataset. The final training set consists of about 3,000 resized images of 10 object classes.
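As an illustration of this kind of split, the sketch below assumes the images are stored in one folder per class (the torchvision ImageFolder layout); the folder name and transforms are hypothetical, while the two-thirds/one-third split follows the description above.

import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((32, 32)),        # 64x64 would be used for architecture 2
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("restaurant_signs/", transform=transform)  # hypothetical path

n_train = 2 * len(dataset) // 3        # two thirds for training, the rest for validation
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
print(len(train_set), len(val_set), dataset.classes)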
3.2. Experimental results

Our experiments utilised the softmax classifier, a well-known multiclass classification method, for recognising text. The output of the above neural networks was used as the input of the softmax classifier. It should be noted that the input of the neural networks in our experiments was produced directly from the original captured images. Hence, the networks did not need a pre-processing step to crop words from the original images, as some other networks do.
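A hedged sketch of this classification step follows; the feature extractor, batch and hyper-parameters are illustrative stand-ins rather than details from the paper, but the structure (CNN output flattened into a linear softmax layer trained with cross-entropy) matches the description above.

import torch
import torch.nn as nn

num_classes = 10
model = nn.Sequential(
    nn.Conv2d(3, 400, kernel_size=8),         # a network-2-style feature extractor
    nn.AvgPool2d(kernel_size=3, stride=3),
    nn.Sigmoid(),
    nn.Flatten(),                             # 19*19*400 = 144,400 features
    nn.Linear(19 * 19 * 400, num_classes),    # softmax classifier weights
)
criterion = nn.CrossEntropyLoss()             # applies softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 64, 64)            # a toy batch of sign crops
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)       # forward pass and loss
optimizer.zero_grad()
loss.backward()                               # backpropagation
optimizer.step()                              # one parameter update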
The accuracy of recognising each word (noodle, phở, rice) and the average accuracy for Vietnamese words are shown in Table 1. Although the input images of network 2 had four times the resolution of those of network 1, the accuracy of network 1 in recognising words was higher than that of network 2. This was thanks to the deeper architecture of network 1.
Table 1. The recognition accuracy.

                      Network 1    Network 2
  Noodle              81.3%        70.9%
  Pho                 89.7%        70.8%
  Rice                89.1%        67.4%
  All classes         84.98%       79.6%
  Vietnamese words    86.7%        69.7%

Figures 5 and 6 show some randomly selected images which were recognised correctly and incorrectly. The recognition results were promising because the networks could correctly recognise blurred words in images with non-uniform illumination and complex backgrounds.
Figure 5. Correctly recognised words.
Another remark in comparing these two networks concerns the computational complexity. Although the number of parameters in network 1 is about double that in network 2, the number of connections in network 2 is about twenty times greater than that in network 1. Hence, the second network needed much more time to calculate the forward propagation through the network. This fact makes the first network faster in the recognition task.
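A back-of-the-envelope check of these figures is sketched below. Each convolutional output unit is connected to kernel_height x kernel_width x input_channels inputs plus one bias; the 5x5 kernels assumed for the later layers of network 1 are not stated in the paper, but under that assumption the computation reproduces both quoted connection counts.

def conv_connections(out_h, out_w, out_c, k, in_c):
    # each of the out_h * out_w * out_c output units connects to a
    # k x k x in_c receptive field plus one bias
    return out_h * out_w * out_c * (k * k * in_c + 1)

net1_connections = (
    conv_connections(32, 32, 32, 5, 3)        # first convolutional layer
    + conv_connections(16, 16, 32, 5, 32)     # second convolutional layer (5x5 assumed)
    + conv_connections(8, 8, 64, 5, 32)       # third convolutional layer (5x5 assumed)
    + 4 * 4 * 64 * 64 + 64                    # first fully connected layer
    + 64 * 10 + 10                            # output layer
)
net2_connections = conv_connections(57, 57, 400, 8, 3)

print(net1_connections, net2_connections)            # 12399306 250822800
print(round(net2_connections / net1_connections))    # about 20 times more connections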

Figure 6. Incorrectly recognised words.

4. CONCLUSIONS
Two convolutional neural networks for Vietnamese scene text recognition have been compared. The results pointed out that the deeper network showed better performance in both recognition accuracy and computational time.
The current results were obtained by using raw image pixels as the input of the CNNs. To achieve higher accuracy, further investigation should focus on using specific image features as the input of the CNNs. The performance of the above CNNs on Vietnamese scene text recognition could also be slightly improved with a larger labeled dataset.
REFERENCES
[1]. Wang K., Babenko B., Belongie S., “End-to-End Scene Text Recognition”,
IEEE International Conference on Computer Vision (ICCV), Barcelona,
Spain, 2011.
[2]. Le N. T., “Các giải thuật phát hiện chữ viết đối với các ngôn ngữ có dấu”,
Journal of Military Science and Technology, vol. 46 (2016), pp. 163-169.
[3]. Karatzas D., Shafait F., Uchida S., Iwamura M., Bigorda L., Mestre S., Mas J., Mota D., Almaz J., Heras L., "ICDAR 2013 robust reading competition", Proceedings of the ICDAR (2013).
[4]. Q. Ye and D. Doermann, “Text detection and recognition in imagery: A
survey”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7 (2014), pp.
1480-1500.
[5]. Y. Zhu, C. Yao and X. Bai, "Scene text detection and recognition: Recent advances and future trends", Frontiers of Computer Science, Vol. 10, Issue 1 (2015), pp. 19-36.
[6]. Chongmu Chen, Da-Han Wang, Hanzi Wang, "Scene Character and Text Recognition: The State-of-the-Art", in Image and Graphics, Lecture Notes in Computer Science, Vol. 9219 (2015), pp. 310-320.
[7]. Karanje Uma B. and Rahul Dagade, "Survey on Text Detection, Segmentation and Recognition from Natural Scene Images", International Journal of Computer Applications, Vol. 108, No. 13 (2014).
[8]. Patil Priyanka, and S. I. Nipanikar, “A Survey on Scene Text Detection and
Text Recognition”, International Journal of Advanced Research in Computer
and Communication Engineering, Vol. 5, Issue 3 (2016), pp. 887-889.
[9]. Cun-Zhao Shi, Song Gao, Meng-Tao Liu, Cheng-Zuo Qi, "Stroke Detector and Structure Based Models for Character Recognition: A Comparative Study", IEEE Transactions on Image Processing, Vol. 24, Issue 12 (2015), pp. 4952-4964.
[10]. Kaur Tajinder and Nirvair Neeru, "Text Detection and Extraction from Natural Scene: A Survey", International Journal of Advance Research in Computer Science and Management Studies, Vol. 3, Issue 3 (2015), pp. 331-336.

[11]. N. Sharma , U. Pal and M. Blumenstein, “Recent advances in video based
document processing: A review”, Proc. DAS (2012), pp. 63-68.
[12]. A. Bissacco, M. Cummins, Y. Netzer, H. Neven, "PhotoOCR: Reading Text in Uncontrolled Conditions", IEEE International Conference on Computer Vision (2013), pp. 785-792.
TÓM TẮT
COMPARING CONVOLUTIONAL NEURAL NETWORKS IN
VIETNAMESE SCENE TEXT RECOGNITION
Scene text recognition is a challenging task for researchers, especially the recognition of scripts with diacritics such as Vietnamese. This paper introduces two convolutional neural network architectures for recognising Vietnamese text in natural scenes. The author conducted experiments to compare the effectiveness of these two networks in reading Vietnamese restaurant signs. The experimental results show that the network with the deeper architecture achieves better recognition accuracy and computational time.
Keywords: Scene text recognition, Optical character recognition, Convolutional neural networks.

Received date, 13th Jul., 2017
Revised manuscript, 27th Aug., 2017
Published, 1st Nov., 2017

Author affiliation:
Posts and Telecommunications Institute of Technology;
*Email:
