
Automatic Text Extraction Using DWT and Neural Network
Po-Yueh Chen (陳伯岳), Chung-Wei Liang (梁忠瑋)

Department of Computer Science and Information Engineering,
Chaoyang University of Technology
(168 Gifeng E. Rd., Wufeng, Taichung County, Taiwan, R.O.C.)
Tel: (04) 23323000 ext. 4420
Email:


Abstract (translated from the original Chinese)

This paper proposes a method for extracting text regions from images using the discrete wavelet transform and a neural network. The discrete wavelet transform decomposes the original image into four sub-bands. The high-frequency sub-bands of true text regions differ from those of non-text regions, so this difference is used to compute three feature values that serve as the inputs of a neural network. A back-propagation neural network is then trained on the candidate text regions. Because the network output for text regions differs from that for non-text regions, a threshold can be applied to decide whether a region is text. Finally, a dilation operation on the detected regions yields the correct text regions.

Keywords: text extraction, discrete wavelet transform, neural network

Abstract

In this paper, we present a new text extraction method based on the discrete wavelet transform (DWT) and a neural network. The method extracts features of candidate text regions using the DWT: the intensity characteristic of each detail component sub-band differs from that of the others, and we exploit this difference to compute the features. A neural network is then trained on these features with the back-propagation (BP) algorithm. The final network output for real text regions differs from that for non-text regions; hence, an appropriate threshold value, together with some dilation operators, yields the real text regions.

Keywords: text extraction, DWT, neural network

1. Introduction
Text extraction plays an important role in the analysis of static images and video sequences. Text provides important information about images or video sequences and hence can be used for browsing and retrieval in a large video database. However, text extraction poses a number of problems because the properties of text vary, as do text sizes and fonts. Furthermore, text may appear against a cluttered background. These factors make text extraction a challenging task.
Many papers on extracting text from static images or video sequences have been published in recent years. These methods operate on either uncompressed or compressed images. Text extraction from uncompressed images can be classified as either component-based or texture-based.
In component-based text extraction methods, text regions are detected by analyzing robust edges or homogeneous color/grayscale components that belong to characters. For example, Cai et al. [1] detect text edges in video sequences using a color edge detector and then apply a low threshold to filter out definite non-edge points. Real text edges are detected using an edge-strength-smoothing operator and an edge-clustering-power operator; finally, a string-oriented coarse-to-fine detection method extracts the real text regions. Datong Chen et al. [2] detect vertical and horizontal edges in an image and dilate the two kinds of edges with different dilation operators. A logical AND is performed on the dilated vertical and horizontal edges to obtain candidate text regions, and real text regions are then identified using a support vector machine.
Text regions usually exhibit special texture features because they consist of character components. These components contrast with the background and show a periodic horizontal intensity variation due to the horizontal alignment of characters. As a result, text can be extracted according to these special texture features. Williams et al. [3] segmented and classified text in newspapers by generic texture analysis, applying small masks to obtain local textural characteristics.
All the text extraction methods described above operate on uncompressed images. Today, however, most digital videos and static images are stored in compressed form. For example, the JPEG2000 image compression standard applies DWT coding to decompose the original image, while the earlier JPEG standard employs the DCT (Discrete Cosine Transform). Zhong et al. [4] extract captions from compressed videos (MPEG video and JPEG images) based on the DCT, which can detect edges in different directions in a candidate image; edge regions containing text are then obtained with a threshold. Chun et al. [5] extract text regions in video sequences using the Fast Fourier Transform (FFT) and a neural network.
In this paper, we propose an efficient method that extracts text regions from video sequences or images using the Discrete Wavelet Transform (DWT) and a neural network. First, the DWT extracts edge features of the original image. Then the text regions are obtained using a neural network trained with those features. The proposed extraction method is described in detail in Section 2. In Section 3, experimental results are presented; the sample images are complex images and video frames containing both text and picture regions, so as to demonstrate the efficiency of the proposed method. Finally, we conclude in Section 4.

2. Proposed Method
In this section, we present a method to extract text from static images or video sequences using the DWT and a neural network. The DWT decomposes an original image into four sub-bands: one average component sub-band and three detail component sub-bands. Each detail component sub-band contains different feature information about the real text regions. These features are fed to the back-propagation (BP) algorithm to train a neural network, which eventually extracts the text regions.
In a colored image, the color components may differ within a text region; however, color information does not help extract text from images. If the input image is a gray-level image, it is processed directly, starting with the discrete wavelet transform. If the input image is colored, the RGB components are combined into an intensity image Y as follows:

Y = 0.299R + 0.587G + 0.114B        (1)

Image Y is then processed with the discrete wavelet transform and the rest of the extraction algorithm. If the input image is already stored in DWT-compressed form, the DWT operation can be omitted. The flow chart of the proposed algorithm is shown in Figure 1. We choose the Haar DWT because it is the simplest of all wavelets [6]. The working principle of the Haar DWT is discussed in detail in the next subsection.
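Equation (1) is the standard RGB-to-luminance conversion; a minimal sketch is given below (the function name is ours, not from the paper):

```python
import numpy as np

def to_intensity(rgb):
    """Convert an RGB image (H x W x 3, uint8) to a gray-level intensity
    image Y using equation (1): Y = 0.299R + 0.587G + 0.114B."""
    rgb = rgb.astype(np.float64)
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

# Example: a pure red pixel maps to intensity 0.299 * 255 = 76.245
pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)
y = to_intensity(pixel)
```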


[Flow chart: Haar DWT → feature extraction → candidate text extraction using a neural network → text region extraction results]

Figure 1. Flow chart of the proposed algorithm

[2-D DWT sub-band layout: LL (top-left), LH (top-right), HL (bottom-left), HH (bottom-right)]

Figure 2. The result of 2-D DWT decomposition

2.1 Haar Discrete Wavelet Transform

The discrete wavelet transform is a very useful tool for signal analysis and image processing, especially for multi-resolution representation [7]. It decomposes signals into different components in the frequency domain. The one-dimensional discrete wavelet transform (1-D DWT) decomposes an input sequence into two components (the average component and the detail component) with a low-pass filter and a high-pass filter [8]. The two-dimensional discrete wavelet transform (2-D DWT) decomposes an input image into four sub-bands: one average component (LL) and three detail components (LH, HL, HH), as shown in Figure 2. From these three detail components, we can obtain various edge features of the original image.
A  B  C  D
E  F  G  H
I  J  K  L
M  N  O  P

(a)

(A+B) (C+D) (A-B) (C-D)
(E+F) (G+H) (E-F) (G-H)
(I+J) (K+L) (I-J) (K-L)
(M+N) (O+P) (M-N) (O-P)

(b)

(A+B)+(E+F)  (C+D)+(G+H)  (A-B)+(E-F)  (C-D)+(G-H)
(I+J)+(M+N)  (K+L)+(O+P)  (I-J)+(M-N)  (K-L)+(O-P)
(A+B)-(E+F)  (C+D)-(G+H)  (A-B)-(E-F)  (C-D)-(G-H)
(I+J)-(M+N)  (K+L)-(O+P)  (I-J)-(M-N)  (K-L)-(O-P)

(c)

Figure 3. (a) The original image, (b) the row operation of the 2-D Haar DWT, (c) the column operation of the 2-D Haar DWT




Figure 4. Original gray-level image

We demonstrate the operation of the 2-D Haar DWT with the example shown in Figure 3. Figure 3(a) is a 4×4 gray-level sample image; only additions and subtractions are involved in the computation. The 2-D DWT is achieved by two ordered 1-D DWT operations (row and column). First, we perform the row operation to obtain the result shown in Figure 3(b). This result is then transformed by the column operation, and the final 2-D Haar DWT result is shown in Figure 3(c). The 2-D Haar DWT decomposes a gray-level image into one average component sub-band and three detail component sub-bands. From these three detail components, we can obtain important features of candidate text regions.
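Under the sub-band layout of Figure 2 (LL top-left, LH top-right, HL bottom-left, HH bottom-right) and the sum/difference scheme of Figure 3, a one-level 2-D Haar DWT can be sketched as follows; the function name and the unnormalized sums/differences are ours, matching the worked example rather than a normalized wavelet library:

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar DWT via two ordered 1-D passes (rows, then
    columns), using only additions and subtractions as in Figure 3.
    Sub-band naming follows the Figure 2 layout; no normalization is
    applied, matching the example."""
    img = img.astype(np.float64)
    # Row operation: pairwise sums (left half) and differences (right half).
    s = img[:, 0::2] + img[:, 1::2]
    d = img[:, 0::2] - img[:, 1::2]
    # Column operation on each half.
    ll = s[0::2, :] + s[1::2, :]   # sum of row-sums (average component)
    hl = s[0::2, :] - s[1::2, :]   # difference of row-sums
    lh = d[0::2, :] + d[1::2, :]   # sum of row-differences
    hh = d[0::2, :] - d[1::2, :]   # difference of row-differences
    return ll, lh, hl, hh

# A 2x2 block with A=1, B=2, E=3, F=4 gives, per Figure 3:
# LL = (1+2)+(3+4) = 10, LH = (1-2)+(3-4) = -2,
# HL = (1+2)-(3+4) = -4, HH = (1-2)-(3-4) = 0.
ll, lh, hl, hh = haar_dwt2(np.array([[1, 2], [3, 4]]))
```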
As a practical example, a gray-level original image is shown in Figure 4, and the corresponding DWT sub-bands are shown in Figure 5. We can extract features of candidate text regions from the detail component sub-bands in Figure 5. In the next subsection, a neural network is employed to learn the features of candidate text regions obtained from these detail component sub-bands. Finally, the well-trained neural network is ready to extract the real text regions.


Figure 5. 2-D Haar discrete wavelet transform image

[Network diagram: three input nodes (features from LH, HL, HH), three hidden nodes, one output node]

Figure 6. Proposed architecture of the neural network

2.2 Neural Network
In this subsection, text extraction from static images or video sequences is accomplished using the back-propagation (BP) algorithm on a neural network. The training of the neural network is based on the features obtained from the DWT detail component sub-bands. As shown in Figure 6, the proposed neural network architecture is simpler than previously proposed architectures [9]. It consists of three input nodes, three hidden nodes, and one output node. The features expressed in equations (2), (3), and (4) are computed for every pixel position in the detail component sub-bands.
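The 3-3-1 architecture of Figure 6 amounts to the forward pass sketched below; the sigmoid activation and the random initial weights are our illustrative assumptions (in the paper, the weights are learned with back-propagation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 3-3-1 network as in Figure 6. The weights here are randomly
# initialized for illustration; BP training would adjust them.
W1 = rng.normal(size=(3, 3))   # input layer  -> hidden layer
b1 = np.zeros(3)
W2 = rng.normal(size=(3,))     # hidden layer -> output node
b2 = 0.0

def forward(features):
    """Forward pass: three per-pixel features in, one value in (0, 1) out."""
    h = sigmoid(features @ W1 + b1)
    return sigmoid(h @ W2 + b2)

out = forward(np.array([0.1, 0.2, 0.3]))
```

Because the output unit is a sigmoid, the network output always lies strictly between zero and one, which is what makes the later thresholding step possible.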

feature1(i, j) = |LH(i, j)^2 - HL(i, j)^2| / 255        (2)

feature2(i, j) = |LH(i, j)^2 - HH(i, j)^2| / 255        (3)

feature3(i, j) = |HL(i, j)^2 - HH(i, j)^2| / 255        (4)
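A per-pixel implementation of features (2)-(4) might look as follows; the squared sub-band values and the division by 255 follow our reconstruction of the garbled equations, and the function name is ours:

```python
import numpy as np

def text_features(lh, hl, hh):
    """Per-pixel features of equations (2)-(4): absolute differences of
    squared detail sub-band values, normalized by 255."""
    f1 = np.abs(lh**2 - hl**2) / 255.0
    f2 = np.abs(lh**2 - hh**2) / 255.0
    f3 = np.abs(hl**2 - hh**2) / 255.0
    # One 3-component feature vector per pixel position (i, j).
    return np.stack([f1, f2, f3], axis=-1)

# For LH = 2, HL = 1, HH = 0 at a pixel, the features are
# 3/255, 4/255, and 1/255 respectively.
f = text_features(np.array([[2.0]]), np.array([[1.0]]), np.array([[0.0]]))
```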


Figure 7. The extracted text region


The sample images chosen for the experiments include some pure text samples and some samples containing non-text regions. Owing to the text characteristics of an image, the intensity of the detail component sub-bands is quite different from one sub-band to another. We employ this intensity difference to compute three features of candidate text regions. These features are used as the input of a neural network trained with the back-propagation algorithm. After the neural network is well trained, new input data produce an output value between zero and one. The output values of real text regions differ markedly from those of non-text regions; therefore, we can apply an appropriate threshold to remove the non-text regions. Finally, the remaining real text regions are processed with some dilation operations, and the result is shown in Figure 7.
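The final thresholding and dilation step can be sketched as follows; the threshold value, iteration count, and 3×3 cross structuring element are our illustrative choices, not values reported in the paper:

```python
import numpy as np

def extract_text_mask(net_output, threshold=0.5, iterations=2):
    """Threshold the per-pixel network output and apply a simple binary
    dilation (3x3 cross) to merge nearby text responses into regions."""
    mask = net_output > threshold
    for _ in range(iterations):
        grown = mask.copy()
        grown[1:, :] |= mask[:-1, :]   # spread each response downward
        grown[:-1, :] |= mask[1:, :]   # upward
        grown[:, 1:] |= mask[:, :-1]   # rightward
        grown[:, :-1] |= mask[:, 1:]   # leftward
        mask = grown
    return mask

# A single strong response grows into a cross after one dilation pass.
net = np.zeros((5, 5))
net[2, 2] = 0.9
mask = extract_text_mask(net, threshold=0.5, iterations=1)
```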

3. Experimental Results
Experiments were performed on static images and video sequences. The frame size is 1024×768, in BMP or MPEG format. We convert colored frames to gray-level before applying the proposed method. In Figure 8, the results of the proposed algorithm are illustrated step by step. The original images shown in Figure 8(a) are decomposed into one average component sub-band and three detail component sub-bands, as shown in Figure 8(b). These detail component sub-bands contain the key features of text regions. According to these features, the text regions are obtained using a neural network. The final results are shown in Figure 8(c).

4. Conclusion
This paper presents a method for extracting text regions from static images or video sequences using the DWT and a neural network. The DWT provides features of text regions for training a neural network with the back-propagation (BP) algorithm. We employ the proposed method to extract text regions from some complicated images. From the experimental results, we find that the proposed scheme is an efficient yet simple way to extract text regions from images or video sequences.
References

[1] Min Cai, Jiqiang Song, M. R. Lyu, "A new approach for video text detection," IEEE International Conference on Image Processing, Vol. 1, 22-25 September 2002, pp. I-117 - I-120.

[2] Datong Chen, H. Bourlard, J.-P. Thiran, "Text identification in complex background using SVM," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 2, 8-14 Dec. 2001, pp. II-621 - II-626.

[3] P. S. Williams, M. D. Alder, "Generic texture analysis applied to newspaper segmentation," IEEE International Conference on Neural Networks, Vol. 3, 3-6 June 1996, pp. 1664-1669.

[4] Yu Zhong, Hongjiang Zhang, A. K. Jain, "Automatic caption localization in compressed video," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 4, April 2000, pp. 385-392.

[5] Byung Tae Chun, Younglae Bae, Tai-Yun Kim, "Automatic text extraction in digital videos using FFT and neural network," IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '99), Vol. 2, 22-25 Aug. 1999, pp. 1112-1115.

[6] K. Gröchenig, W. R. Madych, "Multiresolution analysis, Haar bases, and self-similar tilings of R^n," IEEE Transactions on Information Theory, Vol. 38, No. 2, Mar. 1992.

[7] S. G. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, July 1989, pp. 674-693.

[8] Tinku Acharya, Po-Yueh Chen, "VLSI implementation of a DWT architecture," Proceedings of the 1998 IEEE International Symposium on Circuits and Systems (ISCAS '98), Vol. 2, 1998, pp. 272-275.

[9] Keechul Jung, "Neural network-based text location in color images," Pattern Recognition Letters, Vol. 22, Issue 14, December 2001, pp. 1503-1515.

(a) The original images
(b) 2-D DWT sub-bands
(c) The extracted text regions
Figure 8. Two samples of the experimental results
