Tải bản đầy đủ (.pdf) (13 trang)

Ứng dụng tiền xử lý ảnh và hậu xử lý trong quá trình nhận dạng chữ quang học: Nghiên cứu áp dụng cho danh thiếp tiếng Việt

Bạn đang xem bản rút gọn của tài liệu. Xem và tải ngay bản đầy đủ của tài liệu tại đây (816.02 KB, 13 trang )

KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018

APPLYING IMAGE PRE-PROCESSING AND POST-PROCESSING
TO OCR: A CASE STUDY FOR VIETNAMESE BUSINESS CARDS
Thai Duy Quya*, Vo Phương Binha, Tran Nhat Quanga, Phan Thi Thanh Ngaa
a

The Faculty of Information Technology, Dalat University, Lamdong, Vietnam
*Corresponding author: Email:

Abstract
This paper presents a proposal image pre-processing and Vietnamese post-processing
algorithms efficiently adopt the Tesseract open source Optical Character Recognition (OCR)
library. We built a mobile application (Android) and applied the result for Vietnamese
business cards. The experimental results show that the proposed method implemented as an
Android application achieved more accuracy than the original OCR library.
Keywords: Android; OCR; Image pre-processing; Post-processing; Vietnamese Business
Card.

90


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018

ỨNG DỤNG TIỀN XỬ LÝ ẢNH VÀ HẬU XỬ LÝ TRONG QUÁ
TRÌNH NHẬN DẠNG CHỮ QUANG HỌC:
NGHIÊN CỨU ÁP DỤNG CHO DANH THIẾP TIẾNG VIỆT
Thái Duy Quýa*, Võ Phương Bìnha, Trần Nhật Quanga, Phan Thị Thanh Ngaa
a

Khoa Công nghệ Thông tin, Trường Đại học Đà Lạt, Lâm Đồng, Việt Nam


*Tác giả liên hệ: Email:

Tóm tắt
Bài báo trình bày đề xuất phương pháp tiền xử lý ảnh và hậu xử lý tiếng Việt áp dụng cho
quá trình nhận dạng ký tự quang học bằng thư viện mã nguồn mở Tesseract. Chúng tôi xây
dựng một ứng dụng trên hệ điều hành Android và áp dụng kết quả nghiên cứu cho các danh
thiếp tiếng Việt. Kết quả cho thấy phương pháp đề xuất khi thực thi cho kết quả chính xác
hơn các ứng dụng hiện hành.
Từ khố: Android; Danh thiếp tiếng Việt; Hậu xử lý; Nhận dạng ký tự quang học; Tiền xử
lý ảnh.

91


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018

1.

INTRODUCTION

In daily work, we usually receive business cards from our friends or partners. The
business cards regularly have some information, such as name, address, phone number,
etc. In the contact list of a smartphone, the user can also store the same contact
information as a business card. Therefore, our goal is to build an application to extract
the text of the business card and save the contact information into a smart phone. The
Android application can directly input an image of the contact information using the
phone’s camera. Noise in the business card image is then eliminated. The image is then
provided to the Optical Character Recognition (OCR) engine to extract the necessary
information and to save it to the contact list. To improve the efficiency of the extraction
process, we developed improved algorithms for image pre-processing and postprocessing. Our application is implemented on an Android device and tested with

Vietnamese business cards. The OCR engine used in this paper is the Tesseract open
source library.
2.

RELATED WORK

OCR systems have been under development in research and industry since the
1950s using knowledge-based and statistical pattern recognition techniques to transform
scanned or photographed images of text into machine-editable text files (Eason, Noble,
& Sneddon, 1955). Shalin, Chopra, Ghadge, and Onkar (2014) developed an early OCR
system. Techniques of pre-processing images, used as an initial step in character
recognition systems, were presented, of which the feature extraction step of optical
character recognition is the most important. In order to improve the accuracy of image
recognition, Mande and Hansheng (2015) and Matteo, Ratko, Matija, and Tihomir (2017)
have proposed an efficient method to remove background noise and enhance low-quality
images, respectively. In addition, Nirmala and Nagabhushan (2009) proposed an
approach which can handle document images with varying backgrounds of multiple
colors. Bhaskar, Lavassar, and Green (2015); Pal, Rajani, Poojary, and Prasad (2017);
and Yorozu, Hirano, Oka, and Tagawa (1987) presented a tutorial to improve the accuracy
of the OCR method when converting printed words into digital text.
Although there are many applications of OCR which were high accurate for the
English language (Badla, 2014; Chang, & Steven, 2009; Kulkarni, Jadhav, Kalpe, &
Kurkut, 2014; Palan, Bhatt, Mehta, Shavdia, & Kambli, 2014; Phan, Nguyen, Nguyen,
Thai, & Vo, 2017; & Trần, 2013), OCR systems for non-English languages may have
several problems. Vietnamese is a language with tones and single syllables (Phan & et
al., 2017). We were not successful in finding any relevant studies that have a 100%
recognition rate for Vietnamese, but some applications have been implemented, such as
in Trần (2013). Among commercial versions, another popular application is CamCard,
but it does not offer much support for Vietnamese language business cards. An
application available for Vietnamese language in Google Store is Business Card Reader

Free, but the experimental accuracy is not high.

92


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018

3.

OCR AND TESSERACT

OCR is the technical process which converts scanned images, typewritten, or
printed text into machine encoded text. OCR has been in development for almost 80 years,
as the first patent for an OCR machine was filed in 1929 by a German named Gustav
Tauschek and an American patent was filed subsequently in 1935. OCR has many
applications, including use in the postal service, language translation, and digital libraries.
Currently, OCR is even in the hands of the general public in the form of mobile
applications. The OCR system input images include text which cannot be edited. The
output of the OCR process is editable text from the input images. The OCR process is
illustrated in Fig. 1.

Figure 1. OCR process
There are a few stages within the OCR process used to convert an image to text.
To simplify these steps, we use an open source software called Tesseract as the kernel for
our project. Tesseract was first built in 1985 by Hewlett Packard. The project later
changed hands and was further developed by the University of Nevada-Las Vegas from
1996 to 2006 (Matteo & et al., 2017). From 2007, Google has sponsored this project under
the Apache 2.0 license as open source software. Today, Tesseract is considered the most
accurate free OCR engine in existence and is one of the most widely used in the world.
Tesseract now provides support for 139 languages (Mande & Hansheng, 2015). The

Tesseract OCR process can be represented by the flow chart in Figure 2, in this system,
there are eight stages, as follows (Bhaskar & et al., 2017):


A Gray-scale or color image is provided as input: The input data should
ideally be a “flat” image from a flatbed scanner or a near parallel image
capture.



Adaptive threshold: Performs the reduction of a gray-scale image to a binary
image using Otsu’s method (Bhaskar & et al., 2017). The algorithm assumes
that in an image there are foreground (black) pixels and background (white)
pixels. It then calculates the optimal threshold that separates the two pixel
classes so that the variance between the two is minimal;



Connected-component labeling: Through the binary image, Tesseract will
identify the foreground pixels and then mark the potential characters;
93


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018



Line finding algorithm: Lines of text are found by analyzing the image space
adjacent to potential characters.




Baseline fitting algorithm: Finding baselines for each of the lines. After each
line of text is found, Tesseract examines the lines of text to find the
approximate text height across the line.



Fixed pitch detection: The other step of setting up character detection is
finding the approximate character width. This allows the correct incremental
extraction of characters as Tesseract progresses down a line;



Non-fixed pitch spacing delimiting: Characters that are not of uniform width,
or not of a width that agrees with the surrounding neighbourhood, are
reclassified to be processed in an alternate manner;



Word recognition: After finding all of the possible character “blobs” in the
document, Tesseract performs word recognition on a word-by-word, line-byline basis. Words are then passed through a contextual and syntactical
analyzer, which ensures accurate recognition.

Figure 2. Tesseract flow chart
94


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018


4.

PROPOSED METHOD

4.1.

Pre-processing

The Tesseract engine is the kernel of the OCR system in our project. To improve
the accuracy of the process, we use some pre-processing techniques for the input images.
The first technique is to fix a frame after taking pictures with a camera and converting to
gray-scale images. After that, we used the methods proposed by Mande and Hansheng
(2015); Matteo and et al. (2017); and Shivananda and Nagabhushan (2009).
When the user finishes taking the images, the program automatically identifies the
frame for the picture, which is the outline of the business card. It can change the size and
shape of the frame as suitable for recognizing text. This action not only helps increase the
accuracy of the captured image, but also removes unnecessary parts of the business card.
Figure 3 shows an example of the frame selection for a photographed business card. We
used the OpenCV open source library, which is an efficient tool for image processing.
OpenCV tool can also convert a color picture to a gray-scale picture, which is very
convenient in the next step of our OCR process.

Figure 3. A frame after taking a picture
On the other hand, the images can be processed before input to Tesseract.
Therefore, we have applied some methods proposed by previous authors. First, the
original colored image is converted into a gray-scale image using the formula proposed
by Li, Jia-bing, and Shan-shan (2010) shown in Equation (1)
Y = 0.2999R + 0.587G + 0.114B

(1)


where R, G, and B are the normalized red, green, and blue pixel values,
respectively.
Second, we applied the methods proposed by Badla (2014) to convert the color
images to gray-scale by two techniques: Luminosity and DPI Enhancement. Both of these
techniques used the OpenCV library to perform the conversion. Luminosity is a method
for converting an image into gray-scale while preserving some of the color intensities
(Badla, 2014). The algorithm code below describes the image luminosity process:

95


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018
// Get buffered image from input file; iterate all the pixels in the image with width=w and height=h
for int w=0 to w=width
{
for int h=0 to h=height
{
// call BufferedImage.getRGB() saves the color of the pixel
// call Color(int) to grab the RGB value in pixel
Color= new color();
// now use red, green, and black components to calculator average.
int luminosity = (int)(0.2126 * red + 0.7152 *green + 0.0722 *blue;
// now create new values
Color lum = new ColorLum
Image.set(lum)
// set the pixel in the new formed object
}
}


To get the best results out of the image, we need to fix the DPI as 300 DPI is the
minimum acceptable for Tesseract (Badla, 2014). The algorithm for DPI enhancement is
as follows:
start edge extract (low, high){
// define edge
Edge edge;
// form image matrix
Int imgx[3][3]={}
Int imgy[3][3]={}
Img height;
Img width;
//Get diff in dpi on X edge
// get diff in dpi on y edge
diffx= height* width;
diffy=r_Height*r_Width;
img magnitude= sizeof(int)* r_Height*r_Width);
memset(diffx, 0, sizeof(int)* r_Height*r_Width);
memset(diffy, 0, sizeof(int)* r_Height*r_Width);
memset(mag, 0, sizeof(int)* r_Height*r_Width);
// this computes the angles
// and magnitude in input img
For ( int y=0 to y=height)
For (int x=0 to x=width)
Result_xside +=pixel*x[dy][dx];
Result_yside=pixel*y[dy][dx];
// return recreated image
result=new Image(edge, r_Height, r_Width)
return result;
}


Finally, we use the methods proposed by Mande and Hansheng (2015) and Matteo
& et al. (2017) with low-quality or background images. Tesseract requires a minimum
text size for reasonable accuracy. If the x-height of images is below 20px, the accuracy
drops off. The first pre-processing method proposed of Matteo and et al. (2017) is image
resizing so that the image height is 100px. Resizing is only applied if the height of the
original image is below 100px. The second pre-processing method of Matteo and et al.
(2017) is an image sharpening method. The main reason for using it is to enhance the
contrast between edges, i.e. to enhance contrast between text and background. The image
sharpening is achieved using unsharp masking, represented by Equation (2).
g(i,j) = f(i,j) - fsmooth(i, j)

(2)

96


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018

A smoothed image fsmooth is subtracted from the original image f. The third
proposed method of Matteo and et al. (2017) is image blurring to reduce high frequency
information and remove noise from the images, which can possibly cause a lower OCR
accuracy rate. This method is achieved by applying a low-pass filter to the analyzed image
f such that each pixel is replaced by the average of all the values in the local neighborhood
of size 9x9 pixels, as in Equation (3).

(3)
Mande and Hansheng (2015) proposed some methods in cases where the image
has a background. The methods are based on a color model in RGB space (Figure 4). We
applied this method using the parameter of brightness distortion (αi) and chromaticity
(CD ) to enhance a document image and make it easier to remove background. The

brightness distortion αi is obtained by Equation (4).
i

(i) = (pi - iEi)2

(4)

Where αi represents the pixel’s brightness. To minimize the object function (4), αi
must be 1 if the brightness of the given pixel in the current image is the same as in the
reference image. Similarly, αi < 1 means the pixel is dimmer than the expected brightness;
and αi >1 means it is brighter. When αi are determined, the value of CDi can be solved by
Equation (5):
CDi = || pi - iEi||

(5)

Figure 4. Color model in RGB space. Ei represents the expected color of pixel pi in
the current image. The difference between pi and Ei is decomposed into brightness
i and chromaticity (CDi)
Source: Mande and Hansheng (2015).

4.2.

Post-Processing

OCR (including Tesseract) is used for many applications these days. In this
project, we only researched and applied OCR to business cards. Therefore, we were only
concerned with four items: i) Name or organization; ii) Telephone number; iii) Email;
and iv) Address of organization. Actually, there are two techniques for extracting textual
97



KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018

information from images: i) Regular expression (can own defined rules) or ii) Machine
learning statistics (Trần, 2013). In this study, we used regular expression, or methods
dependent on Vietnamese language rules, to obtain the necessary information.
The editable text received from the OCR process includes multiple lines. The
information on the business card usually is short and the first letters indicate the contents
of the line. Overall, the telephone number and email address use regular expressions,
whereas name and address are based on Vietnamese language conventions. For email
address and phone number, we used the regular expression provided by Kipalog (2018).
The regular expression for the email address is Expression (6):
/[A-Z0-9._%+-]+@[A-Z0-9-]+.+.[A-Z]{2,4}/igm

(6)

Similarly, the phone number is expressied as Expression (7):
(\\(\\d+\\)+[\\s-.]*)*(\\d+[\\s-.]*)+

(7)

In addition, when the algorithm scans a phone number, it also categorizes the
number as a mobile number or a home number. On most business cards, the phone
numbers are usually a sequence of numbers, or are separated by special characters such
as white spaces, dots, dashes…. Thus, in the algorithm we included some special
exceptions to improve the post-processing.
With the Vietnamese name, the algorithm will check whether the line contains the
family name or not. The family name is stored and a comparison is made to determine if
the information stream contains a family name. If not, the algorithm will get all the words

in the line and save them as the organization name. With address, the algorithm will check
if the input stream contains headings with such Vietnamese phrases as “Đc:” or “Đ ịa chỉ:”
or English phrases such as "Add:", “Address:”, or these words in uppercase format. If it
exists, this line is the address, otherwise the algorithm checks to find the name of the
provinces in Vietnam, which are stored in a list similar to family name.
4.3.

Proposed model

Figure 5 shows the basic steps involved in recognition in our project. Images taken
by a phone’s camera of a business card are pre-processed (see in 4.1) and then inputted
to the Tesseract engine. After receiving text results, we use Vietnamese language
conventions for names and addresses to extract information from the card (postprocessing, see in 4.2) and then save the information to the list of contacts in the Android
device.

98


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018

Figure 5. Proposed structural model
5.

RESULTS

We have successfully implemented an application called Vietnamese Card Scan
(VnCS) on the Android OS. The experiment was deployed on the Samsung Galaxy Tab
E tablet with Android 4.4.4. The size of the APK file is 26.5MB. The program runs on
the Android OS shown in Figure 6. The test data include 250 Vietnamese business cards
of three types, as presented in Table 1.


Figure 6. ScanVnCard program in Samsung Galaxy Tab E
99


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018

Table 1. Business card collected data
Type

Features

Quantum

No. 1

Distinctive background and text, no wallpaper

135

No. 2

Distinctive background and letters, with wallpaper

75

No. 3

Have the same color, logo, picture or characters that are difficult to identify


40

Four types of information are extracted, as follows: i) Name or organization; ii)
Phone numbers; iii) Email; and iv) Address. The results with the accuracy of each
extraction type are shown in Table 2. Figure 7 presents an original Vietnamese business
card, after pre-processing, and editable text after OCR processing.
Table 2. Results for four types of information extracted from business cards
No. 1(%)

No. 2(%)

No. 3(%)

Name or organization

90

70

60

Phone numbers

90

80

70

Email


80

60

50

Address

70

60

60

(a)

(b)

(c)

(d)
Figure 7. An example for our OCR process in Vietnamese business card
Note: a) Original business card; b) Pre-processing; c) Editable text; and d) Saving to contact list.

100


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018


6.

CONCLUSIONS

This paper provides a detailed discussion about a mobile image to text recognition
system implemented through an Android application for Vietnamese business cards. The
image is taken with a camera and pre-processed by various techniques. The image is then
processed with an OCR technique to produce editable text on screen. Finally, the
necessary information is extracted by post-processing and saved to the contact list. The
results show that the proposed method achieves more efficiency and accuracy than the
original software. In the future, we will improve the program to run faster and deploy on
many operating systems.
REFERENCES
Badla, S. (2014). Improving the efficiency of Tesserct OCR engine. Retrieved from
/>om/&httpsredir=1&article=1416&context=etd_projects.
Bhaskar, S., Lavassar, N., & Green, S. (2015). Implementing optical character
recognition on the Android operating system for business cards. Retrieved from
/>sinessCardRecognition.pdf.
Chang, L. Z., & Steven, Z. Z. (2009). Robust pre-processing techniques for OCR
applications on mobile devices. Paper presented at The International Conference
on Mobile Technology, Application & Systems, France.
Eason, G., Noble, B., & Sneddon, I. N. (1955). On certain integrals of Lipschitz-Hankel
type involving products of Bessel functions. Phil. Trans. Roy. Soc., A247, 529551.
Kipalog. (2018). 30 đoạn biểu thức chính quy mà lập trình viên Web nên biết. Được truy
lục từ ttps://kipalog.com/posts/30-doan-bieu-thuc-chinh-quy-ma-lap-trinh-vienweb-nen-biet.
Koistinen, M., Kettunen, K., & Kervinen, J. (2017). How to improve optical character
recognition of historical Finnish newspapers using open source Tesseract OCR
engine. Paper presented at The Language & Technology Conference: Human
Language Technologies as a Challenge for Computer Science and Linguistics,
Poland.

Kulkarni, S. S., Jadhav, V., Kalpe, A., & Kurkut, V. (2014). Android card reader
application using OCR. International Journal of Advanced Research in Computer
and Communication Engineering, 3, 5238-5239.
Li, J., Jia-Bing, H. D., & Shan-shan, Z. (2010). A novel algorithm for color space
conversion model from CMYK to LAB. Journal of Multimedia, 5(2), 159-166.
Mande, S., & Hansheng, L. (2015). Improving OCR performance with background image
elimination. Paper presented at The International Conference on Fuzzy Systems
and Knowledge Discovery, China.
101


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018

Matteo, B., Ratko, G., Matija, P., & Tihomir, M. (2017). Improving optical character
recognition performance for low quality images. Paper presented at The
International Symposium ELMAR, Croatia.
Pal, I., Rajani, M., Poojary, A., & Prasad, P. (2017). Implementation of image to text
conversion using Android app. International Journal of Advanced Research in
Electrical, Electronics and Instrumentation Engineering, 6, 2291-2297.
Palan, D. R., Bhatt, G. B., Mehta, K. J., Shavdia, K. J., & Kambli, M. (2014). OCR on
Android-travelmate. International Journal of Advanced Research in Computer
and Communication Engineering, 3, 5810-5812.
Phan, T. T. N., Nguyen, T. H. T., Nguyen, V. P., Thai, D. Q., & Vo, P. B. (2017).
Vietnamese text extraction from book covers. Dalat University Journal of
Science, 7(2), 142-152.
Shalin, Chopra, A., Ghadge, A. A., & Onkar, A. P. (2014). Optical character recognition.
International Journal of Advanced Research in Computer and Communication
Engineering, 3, 214-219.
Shivananda, N., & Nagabhushan, P. (2009). Separation of foreground text from complex
background in color document images. Paper presented at The Seventh

International Conference on Advances in Pattern Recognition, India.
Trần, Đ. H., (2013). Ứng dụng nhận dạng danh thiếp tiếng việt và cập nhật thông tin danh
bạ trên Android. Được truy lục từ />2558917-ung-dung-nhan-dang-danh-thiep-tieng-viet-va-cap-nhat-thong-tin-danh
-ba-tren-android-full-soure-code.htm
Yorozu, Y., Hirano, M., Oka, K., & Tagawa, Y. (1987). Electron spectroscopy studies on
magneto-optical media and plastic substrate interface. IEEE Transl. J. Magn., 2,
740-741.
Zhou, S. Z., Gilani, S. O., & Winkler, S. (2016). Open source OCR framework using
mobile devices. SPIE-IS&T, 6821, 1-6.

102



×