Graduation project, Information Technology, Artificial Intelligence (9 points)

THIS THESIS IS APPROVED BY:
Instructor: Le Thi My Hanh, Ph.D

Date

Suggestions/Comments:
…………………………………………………………………………………………


SUMMARY

Topic title: MathOCR: Solving math problems using machine learning
Student name: Bui Dang Quang Dung
Student ID: 103160193

Class: 16TCLC3

Mathematics is one of the most important fields of human knowledge. It is studied,
developed, and widely applied in real life, where it helps solve many problems. All of us
have studied math for a long time. Many people love this subject, but many others
have difficulty solving math problems. Nowadays, with the vigorous development of
science and technology, Artificial Intelligence (AI) in particular has produced many
outstanding achievements and can take over many human tasks. This thesis introduces an
application that helps solve math problems by applying several machine learning
algorithms.


DA NANG UNIVERSITY
UNIVERSITY OF SCIENCE AND TECHNOLOGY
FACULTY OF INFORMATION TECHNOLOGY

THE SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

GRADUATION PROJECT REQUIREMENTS

Student Name: BUI DANG QUANG DUNG Student ID: 103160193
Class: 16TCLC3 Faculty: Information Technology Major: Information Technology
1. Topic title: MathOCR: Solving math problems using machine learning
2. Project topic: ☐ has signed intellectual property agreement for the final result
3. Initial figure and data:
Data is collected from many resources.
4. Content of the explanations and calculations:
The content contains five parts:
• Machine Learning, Computer Vision, Natural Language Processing and their
applications
• Models and details in MathOCR
• Introduction to the MathOCR application
• MathOCR experiments
  ▪ Data preprocessing process
  ▪ Data generation process for the models in MathOCR
  ▪ Experimental results of the models
• Conclusion
5. Drawings, charts
6. Instructor name: Le Thi My Hanh PhD, Information Technology Faculty, University
of Danang - University of Science and Technology

7. Date of assignment: / /2020.
8. Date of completion: / /2020.

Da Nang, December 2020

Head of Division………………….

Instructor


MathOCR: Solving math problems using machine learning

PREFACE

I would like to express my sincere thanks to Le Thi My Hanh, PhD, for giving me
many ideas, solutions, and much of the knowledge needed to complete this project.
I am also grateful to the teachers and students of the Faculty of Information
Technology, University of Danang - University of Science and Technology, for helping
me over the past four years of study and for passing on the knowledge and valuable
experience that made this project possible.
Finally, I would like to give special thanks to my family, who supported and
motivated me, both financially and spiritually, throughout this project.
Although I tried my best, mistakes and shortcomings are unavoidable. I hope to
receive valuable comments and recommendations from the teachers so that I can
improve my thesis.


Da Nang, December 11th 2020
Student

Bui Dang Quang Dung


ASSURANCE

I understand the University’s policy on plagiarism and guarantee that:
1. The contents of this thesis project were produced by myself under the
guidance of Le Thi My Hanh, PhD.
2. All the references used in this thesis are cited clearly and faithfully, with the
author’s name, the title of the work, and the time and place of publication.
3. The contents of this project are my own work and have not been copied from other
sources or previously submitted for award or assessment.

Student Performed

Bui Dang Quang Dung



TABLE OF CONTENTS

SUMMARY
GRADUATION PROJECT REQUIREMENTS
PREFACE
ASSURANCE
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
INTRODUCTION
Reason for doing this thesis
Scope and Objective
Overview
CHAPTER 1: MACHINE LEARNING, COMPUTER VISION, NATURAL LANGUAGE AND THEIR APPLICATION
1.1. Introduction
1.2. Machine Learning
1.2.1. What is Machine Learning
1.2.2. Supervised Learning
1.2.3. Unsupervised Learning
1.2.4. Reinforcement Learning
1.3. Computer Vision
1.3.1. What is Computer Vision
1.3.2. Computer Vision tasks
1.3.3. Applications of Computer Vision
1.4. Natural Language Processing
1.4.1. What is Natural Language Processing
1.4.2. Natural Language Processing tasks and their applications
CHAPTER 2: MODELS, DETAILS IN MATHOCR
2.1. Introduction
2.2. The Vietnamese recognition model with the Transformer
2.2.1. Introduction
2.2.2. Backbone
2.2.3. Encoder
2.2.4. Decoder
2.2.5. Multi-Head Attention
2.2.6. Position-wise Feed-Forward Networks
2.2.7. Positional Encoding
2.3. The Image to LaTeX model
2.3.1. Introduction
2.3.2. Model Architecture
2.3.3. Encoder
2.3.3.1. Convolution
2.3.3.2. Positional Encoding
2.3.4. Decoder
2.3.4.1. Token Embedding
2.3.4.2. LSTM network
2.4. The YOLOv4 Model
2.4.1. Introduction
2.4.2. YOLOv4 Architecture
2.4.2.1. Backbone
2.4.2.2. Neck
2.4.2.3. Head
2.4.2.4. Bag of Freebies
2.4.2.5. Bag of Specials
2.5. Metric Evaluation
2.5.1. BLEU
2.5.2. mAP
2.5.2.1. Precision and Recall
2.5.2.2. IoU
2.5.2.3. AP
CHAPTER 3. INTRODUCE MATHOCR APPLICATION
3.1. Introduction
3.2. Front-end
3.3. Server
3.4. Features Specification
3.4.1. Document Scanner
3.4.2. Math Formula Recognition
3.4.3. Vietnamese Text Recognition
3.4.4. Solving Math Equations
CHAPTER 4. MATHOCR EXPERIMENTS
4.1. Introduction
4.2. Data Preprocessing Process
4.3. VietnameseOCR Model Experiments
4.3.1. Data Source
4.3.2. Training parameters
4.3.3. Experimental Results
4.4. Im2LaTeX Model
4.4.1. Data Source
4.4.2. Training parameters
4.4.3. Experimental Results
4.5. YOLOv4 Model
4.5.1. Data Source
4.5.2. Training parameters
4.5.3. Experimental Results
CHAPTER 5. CONCLUSION
5.1. Achieved results
5.2. Limitations
5.3. Development
REFERENCES


LIST OF FIGURES

Figure 1.1. Supervised Learning
Figure 1.2. Unsupervised Learning
Figure 1.3. Reinforcement Learning
Figure 1.4. Subfields of Computer Vision
Figure 2.1. The Transformer Architecture
Figure 2.2. (left) Scaled Dot-Product Attention, (right) Multi-Head Attention, which
consists of several attention layers running in parallel
Figure 2.3. Object detector
Figure 2.4. DenseNet CSP
Figure 2.5. Modified PAN
Figure 2.6. Modified SAM
Figure 2.7. IoU
Figure 2.8. The calculated Precision and Recall results
Figure 2.9. The results after being smoothed
Figure 2.10. Calculate max of precision at each level
Figure 2.11. Normalize by VOC format
Figure 3.1. The React Native Framework
Figure 3.2. Python, Flask and PyTorch
Figure 3.3. Document Scanner Screen
Figure 3.4. Math Formula Recognition Screen
Figure 3.5. Vietnamese Text Recognition Screen
Figure 3.6. Solving Math Equations Screen
Figure 4.1. Data preprocessing process
Figure 4.2. Data normalization
Figure 4.3. Requests and Beautiful Soup libraries
Figure 4.4. Process of generating data from existing text
Figure 4.5. Process of generating data from PDF files
Figure 4.6. Results of Vietnamese text recognition
Figure 4.7. Good prediction results
Figure 4.7. Unexpected prediction results
Figure 4.8. The statistics bounding box of objects
Figure 4.9. Experiment result
Figure 4.10. Good prediction results
Figure 4.11. The predicted results are average
Figure 4.12. Unexpected prediction results


LIST OF TABLES

Table 2.1. The VGG19 model Architecture
Table 2.2. The CNN model configurations
Table 2.3. DarkNet-53 Architecture
Table 2.4. The prediction apple result


LIST OF ACRONYMS

AI      Artificial Intelligence
ML      Machine Learning
NLP     Natural Language Processing
CNN     Convolutional Neural Network
LSTM    Long Short-Term Memory
CSP     Cross-Stage-Partial
PAN     Path Aggregation Network
SAM     Spatial Attention Module
BoF     Bag of Freebies
BoS     Bag of Specials
BLEU    Bilingual Evaluation Understudy
AP      Average Precision
OCR     Optical Character Recognition
mAP     mean Average Precision
IoU     Intersection over Union
PDF     Portable Document Format


INTRODUCTION

Reason for doing this thesis

Mathematics is one of the essential fields and a foundation of the other sciences.
It is studied, developed, and applied widely in real life, where it helps solve many
problems. All of us have been taught math for a long time. Learning math teaches us how
to solve problems and develops analytical and reasoning skills that help us think more
sharply and quickly. For these reasons, math is a favourite subject for many people. At
the same time, many others lose motivation when studying it; they become hesitant and
struggle to absorb what the subject has to offer. One of the main reasons is that
difficult math problems are not easy to solve, so this discouragement is understandable.
With the vigorous development of science and technology, Artificial Intelligence
has recently made many outstanding achievements that help people save time and work
more productively. Applying these achievements to human problems is therefore
valuable.
For the above reasons, this thesis introduces an application that helps solve some
math problems. It can serve as a reference for learners who are unsure how to approach
mathematics. The application also recognizes Vietnamese text, which makes it easy to
save content and reuse it in different file formats.
Scope and Objective
• This thesis applies OCR and object detection technologies.
• The main recognition targets are Vietnamese text and math formulas.
• The solver handles math equations.
Overview
This thesis is divided into five parts:
• Machine Learning, Computer Vision, Natural Language Processing and their
applications
• Description of the machine learning models used in the thesis
• Introduction to the MathOCR application
• Experimental results of the ML models
• Conclusion

Student: Bui Dang Quang Dung
Instructor: Le Thi My Hanh Ph.D


CHAPTER 1: MACHINE LEARNING, COMPUTER VISION,
NATURAL LANGUAGE AND THEIR APPLICATION

1.1. Introduction
This thesis's primary focus is OCR, which is an application of Machine
Learning, Computer Vision, and Natural Language Processing. This chapter gives a
general view of these fields of research and their applications.
1.2. Machine Learning
1.2.1. What is Machine Learning
In recent years, there have been many achievements in Machine Learning. It has
become one of the most important fields and contributes to human development in the
Fourth Industrial Revolution.
Machine learning (ML) is the study of computer algorithms that improve
automatically through experience. It is seen as a subset of artificial intelligence. Machine
learning algorithms build a model based on sample data, known as "training data", to
make predictions or decisions without being explicitly programmed to do so [1].
There are three types of Machine Learning techniques:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
1.2.2. Supervised Learning
Most practical Machine Learning uses supervised learning. It is defined by its use of
labelled datasets to train algorithms that classify data or predict outcomes accurately. As
input data is fed into the model, the model adjusts its weights until it has been fitted
appropriately.
Supervised learning can be separated into two types of problems in data mining,
classification and regression [2]:
• Classification uses an algorithm to assign test data to specific categories
accurately. It recognizes particular entities within the dataset and attempts to
draw conclusions about how those entities should be labelled or defined. Common
classification algorithms include linear classifiers, support vector machines (SVM),
decision trees, k-nearest neighbours, and random forests.

• Regression is used to understand the relationship between dependent and
independent variables. It is commonly used to make projections, such as
sales revenue for a given business. Linear regression and polynomial regression
are popular regression algorithms; logistic regression is usually listed alongside
them, although it is typically used for classification.
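
The two problem types above can be illustrated with a minimal Python sketch. It is not part of MathOCR: the data, the labels, and the function names `fit_line`, `predict`, and `nearest_neighbour` are hypothetical and chosen only for illustration. The first part fits a regression line by ordinary least squares; the second classifies a point with a 1-nearest-neighbour rule.

```python
def fit_line(xs, ys):
    """Regression: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


def nearest_neighbour(train, query):
    """Classification: return the label of the training point closest to `query`."""
    _, label = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    return label


# Hypothetical labelled data: advertising spend (input) and sales revenue (label).
spend = [1.0, 2.0, 3.0, 4.0]
revenue = [3.0, 5.0, 7.0, 9.0]  # constructed so that revenue = 2 * spend + 1
model = fit_line(spend, revenue)

# Hypothetical labelled points for the classification case.
train = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((5.0, 5.0), "large"),
    ((4.8, 5.2), "large"),
]
```

In both cases the labels (`revenue`, `"small"`/`"large"`) are what make the learning supervised: the model is fitted to known answers and then applied to unseen inputs, for example `predict(model, 5.0)` or `nearest_neighbour(train, (5.1, 4.9))`.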

Figure 1.1. Supervised Learning
1.2.3. Unsupervised Learning
Unsupervised learning (unsupervised machine learning) uses machine learning
algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden
patterns or data groupings without the need for human intervention.

Figure 1.2. Unsupervised Learning

Unsupervised learning models are utilized for three main tasks: clustering,
association, and dimensionality reduction [3]:
• Clustering is a data mining technique that groups unlabeled data based on their
similarities or differences. Clustering algorithms are used to process raw,
unclassified data objects into groups represented by structures or patterns in the
information. Clustering algorithms can be categorized into a few types,
specifically exclusive, overlapping, hierarchical, and probabilistic.
• Association rule learning is a rule-based method for finding relationships between
variables in a dataset. These methods are frequently used for market basket
analysis, allowing companies to better understand relationships between various
products.
• Dimensionality reduction: While more data generally yields more accurate results,
it can also harm the performance of machine learning algorithms (e.g., through
overfitting) and make datasets hard to visualize. Dimensionality reduction is a
technique used when the number of features, or dimensions, in a dataset is too
high. It reduces the number of data inputs to a manageable size while preserving
the dataset's integrity as much as possible. It is commonly used in the data
preprocessing stage.
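
As an illustration of exclusive clustering from the list above, the following is a minimal k-means sketch in plain Python. It is an assumption-laden example rather than anything used in MathOCR: the points, the choice of two clusters, and the initial centroids are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Exclusive (hard) clustering: each point belongs to exactly one cluster."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, final_clusters = kmeans(points, [points[0], points[3]])
```

No labels are supplied at any point; the grouping comes only from distances between the points. The result depends on the initial centroids, which is why they are passed in explicitly here.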

1.2.4. Reinforcement Learning
Reinforcement learning is an area of Machine Learning concerned with taking suitable
actions to maximize reward in a particular situation. It is employed by various software
systems and machines to find the best possible behaviour or path to take in a specific
situation.
Reinforcement learning differs from supervised learning. In supervised learning,
the training data comes with an answer key, so the model is trained on the correct
answers. In reinforcement learning there is no answer key; the reinforcement agent
decides what to do to perform the given task. In the absence of a training dataset, it
must learn from its own experience.
There are two types of reinforcement:
• Positive reinforcement means giving something to the subject when it performs
the desired action, so that it associates the action with the reward and does it more
often. The reward is a reinforcing stimulus.

• Negative reinforcement is defined as the strengthening of a behaviour because a
negative condition is stopped or avoided.
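
The idea of learning from experience rather than from an answer key can be sketched with an epsilon-greedy agent on a simple multi-armed bandit. This is a generic textbook-style illustration, not a component of MathOCR; the reward values, the `epsilon` setting, and the function name `run_bandit` are hypothetical.

```python
import random


def run_bandit(rewards, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit with a fixed (hypothetical) reward per action."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # the agent's current value estimate per action
    counts = [0] * len(rewards)       # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))      # explore: try a random action
        else:
            action = estimates.index(max(estimates))  # exploit: best estimate so far
        reward = rewards[action]                      # feedback from the environment
        counts[action] += 1
        # Incremental mean: move the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


# Hypothetical environment: action 1 pays the most, but the agent is never told so.
estimates, counts = run_bandit([0.2, 1.0, 0.5])
best_action = estimates.index(max(estimates))
```

The agent is never shown the correct action. It only observes the reward that follows each choice, and actions that are rewarded are selected more often afterwards, which corresponds to the positive reinforcement described above.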

Figure 1.3. Reinforcement Learning
1.3. Computer Vision
1.3.1 What is Computer Vision
Computer vision is the field of study surrounding how computers see and
understand digital images and videos. Computer vision spans all tasks performed by
biological vision systems, including "seeing" or sensing a visual stimulus, understanding
what is being seen, and extracting complex information into a form that can be used in
other processes. This interdisciplinary field simulates and automates these elements of
human vision systems using sensors, computers, and machine learning algorithms.
Computer vision is the theory underlying artificial intelligence systems' ability to see
and understand their surrounding environment.
1.3.2. Computer Vision tasks
Computer vision is widely researched and applied. Its tasks include:
• Object Detection is the ability to correctly detect or identify objects in a given
image along with their spatial position, in the form of rectangular boxes (known
as bounding boxes) that enclose each object. For example, a detector may mark
laptops, glasses, notebooks, coffee, and an iPhone, each inside its own bounding
box.


• Image Classification is another popular computer vision task. It means
identifying which class an object belongs to. For example, an image may contain
objects belonging to various classes such as trees, huts, and giraffes.
• Image Captioning is looking at an image and describing what is happening in it.
A captioned image carries annotations or labels that describe the scene in the
picture.
• Image Segmentation: Identifying parts of the image and understanding what
object they belong to. Segmentation lays the basis for performing object detection
and classification.

Figure 1.4. Subfields of Computer Vision
1.3.3. Applications of Computer Vision
Computer vision has many applications because its theory covers any area where a
computer needs to see its surroundings in some form. Examples of computer vision
applications include [4]:
• Autonomous Vehicles: Self-driving cars need to gather information about
their surroundings to decide how to behave.

• Facial Recognition: Businesses and personal electronics use facial
recognition technology to “see” who is trying to access something. It has
become a powerful security tool.


• Image Search and Object Recognition: Many applications use computer vision
to identify objects within images, search through catalogues of images,
and extract information from images.
• Robotics: Most robotic machines, often in manufacturing, need to see their
surroundings to perform the task at hand. In manufacturing, machines may be
used to inspect assembly tolerances by “looking at” them.
1.4. Natural Language Processing
1.4.1. What is Natural Language Processing
Natural language processing (NLP) is a branch of artificial intelligence that helps
computers understand, interpret, and manipulate human language. NLP draws from
many disciplines, including computer science and computational linguistics, in its
pursuit to fill the gap between human communication and computer understanding.
1.4.2. Natural Language Processing tasks and their applications
Many NLP tasks have been developed and researched:
• Text Classification Tasks
  ▪ Representation: bag of words (does not preserve word order)
  ▪ Goal: predict tags, categories, sentiment
  ▪ Application: filtering spam emails, classifying documents based on dominant
  content
• Word Sequence Tasks
  ▪ Representation: sequences (preserves word order)
  ▪ Goal: language modeling (predicting the next or previous words), text generation
  ▪ Application: translation, chatbots, sequence tagging (predicting POS tags for
  each word in a sequence), named entity recognition
• Text Meaning Tasks
  ▪ Representation: word vectors, the mapping of words to n-dimensional numeric
  vectors (also known as embeddings)
  ▪ Goal: how do we represent meaning?
  ▪ Application: finding similar words (similar vectors), sentence embeddings (as
  opposed to word embeddings), topic modeling, search, question answering


Figure 1.5. Named entity recognition (NER)
• Sequence to Sequence Tasks
▪ Many NLP tasks can be framed as sequence-to-sequence problems
▪ Examples are machine translation, summarization, simplification, Q&A
systems
▪ Such systems are characterized by encoders and decoders, which work
together: the encoder finds a hidden representation of the text and the
decoder uses that hidden representation
• Dialog Systems
▪ There are two main categories of dialog systems, distinguished by their scope of use
▪ Goal-oriented dialog systems focus on being useful in a particular, restricted
domain; they are more precise but less generalizable
▪ Conversational dialog systems are concerned with being helpful or
entertaining in a much more general context; they are less precise but
generalize better
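To make the difference between the bag-of-words and sequence representations listed above concrete, here is a minimal Python sketch. The helper names `bag_of_words` and `word_sequence` and the two sample sentences are illustrative assumptions, not part of the project code.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs; word order is discarded."""
    return Counter(text.lower().split())


def word_sequence(text):
    """Keep the words as an ordered list; word order is preserved."""
    return text.lower().split()


a = "the dog bit the man"
b = "the man bit the dog"

# Same words with the same counts, so the two bags are identical.
print(bag_of_words(a) == bag_of_words(b))    # True
# The order differs, so the two sequences are different.
print(word_sequence(a) == word_sequence(b))  # False
```

The two sentences have opposite meanings but identical bags of words, which is why tasks that depend on word order use sequence representations instead.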


CHAPTER 2: MODELS, DETAILS IN MATHOCR

2.1. Introduction
In this chapter, I describe in detail the machine learning models used in the
project:
• The Vietnamese text recognition model: VietnameseOCR
• The mathematical formula recognition model: Im2LateX
• The Vietnamese text and math formula detection model: YoLoV4
The description focuses on the architecture of the models, their arithmetic
operations, and the two metrics used to evaluate model quality: BLEU and mAP.
2.2. The Vietnamese recognition model with the Transformer
2.2.1. Introduction
The Transformer was introduced in the paper "Attention Is All You Need"
[5]. The paper describes the Transformer as a sequence-to-sequence
(Seq2Seq) architecture. A Seq2Seq model is a neural network that transforms a given
sequence of elements, such as the words in a sentence, into another sequence.
Another technique that helps improve the accuracy of Seq2Seq models is attention.
The attention mechanism looks at an input sequence and decides, at each step, which
other parts of the sequence are important. This sounds abstract, but humans use a
similar mechanism. For example, our eyes have a field of view of 120 degrees both
vertically and horizontally, yet we pay attention to only a small part of the image
to extract information. This mechanism lets us make decisions without spending much
energy while still obtaining reliable results.
The Transformer transforms one sequence into another with the help of two parts,
the Encoder and the Decoder. The Encoder learns a vector representation of a
sentence, in the hope that this vector carries the complete information of the
sentence. The Decoder converts that vector into the output sequence.


2.2.2. Backbone
The input to the Vietnamese word recognition model is an image
containing Vietnamese text. The model needs to extract the features of this image and
convert the image's information into a vector. Many pre-trained models are available
that were trained on the well-known ImageNet dataset, including ResNet, VGG, and Xception.
For OCR tasks, models with a plain (vanilla) architecture work well and are widely
used in practice. Among them are the models of the VGG family (VGG16, VGG19, ...)
[6]. In the Vietnamese word recognition model, the backbone is VGG19.
Its parameters are customized to suit the task. The customized VGG19
architecture is shown below:
Table 2.1. The VGG19 model architecture (k: kernel size, s: stride, p: padding)
Type          Filters     k        s        p
Input         RGB image
Convolution   64          (3, 3)   (1, 1)   (1, 1)
Convolution   64          (3, 3)   (1, 1)   (1, 1)
Max-Pooling   -           (2, 2)   (2, 2)   (0, 0)
Convolution   128         (3, 3)   (1, 1)   (1, 1)
Convolution   128         (3, 3)   (1, 1)   (1, 1)
Max-Pooling   -           (2, 2)   (2, 2)   (0, 0)
Convolution   256         (3, 3)   (1, 1)   (1, 1)
Convolution   256         (3, 3)   (1, 1)   (1, 1)
Max-Pooling   -           (2, 1)   (2, 1)   (0, 0)
Convolution   512         (3, 3)   (1, 1)   (1, 1)
Convolution   512         (3, 3)   (1, 1)   (1, 1)
Convolution   512         (3, 3)   (1, 1)   (1, 1)
Convolution   512         (3, 3)   (1, 1)   (1, 1)
Max-Pooling   -           (2, 1)   (2, 1)   (0, 0)
Convolution   512         (3, 3)   (1, 1)   (1, 1)
Convolution   512         (3, 3)   (1, 1)   (1, 1)
Convolution   512         (3, 3)   (1, 1)   (1, 1)
Convolution   512         (3, 3)   (1, 1)   (1, 1)
Max-Pooling   -           (1, 1)   (1, 1)   (0, 0)
Convolution   256         (1, 1)   (1, 1)   (1, 1)

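As a rough check of how the settings in Table 2.1 shape the backbone output, the sketch below applies the standard output-size formula floor((n + 2p - k) / s) + 1 to every layer up to the last pooling layer. It assumes that each pair in the table is ordered (height, width), uses a hypothetical 32 x 128 input image, and skips the final 1 x 1 convolution; none of these assumptions is stated in the text, and the function names are illustrative.

```python
def out_size(size, k, s, p):
    """Output length along one axis: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1


# (kernel, stride, padding) per layer, copied from Table 2.1.
# Assumption: every pair is ordered (height, width).
CONV3 = ((3, 3), (1, 1), (1, 1))
LAYERS = (
    [CONV3] * 2 + [((2, 2), (2, 2), (0, 0))]
    + [CONV3] * 2 + [((2, 2), (2, 2), (0, 0))]
    + [CONV3] * 2 + [((2, 1), (2, 1), (0, 0))]
    + [CONV3] * 4 + [((2, 1), (2, 1), (0, 0))]
    + [CONV3] * 4 + [((1, 1), (1, 1), (0, 0))]
)


def feature_map_size(height, width, layers=LAYERS):
    """Propagate an input size through the listed layers."""
    for k, s, p in layers:
        height = out_size(height, k[0], s[0], p[0])
        width = out_size(width, k[1], s[1], p[1])
    return height, width


# Hypothetical 32 x 128 input image.
print(feature_map_size(32, 128))  # (2, 32)
```

Under these assumptions the height shrinks by a factor of 16 while the width shrinks only by a factor of 4, so the feature map keeps many horizontal positions along the line of text.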

2.2.3. Encoder

Figure 2.1. The Transformer Architecture [5]
The Encoder consists of N layers. Each layer consists of two sub-layers: Multi-Head Attention and a feed-forward network. A residual connection is applied around each
sub-layer, followed by layer normalization. The output of each sub-layer is therefore
$\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. The
output is a vector of size 512.
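The residual-plus-normalization step can be sketched in a few lines of plain Python. This is a simplified illustration, not the project's implementation: it works on a single vector, omits the learned gain and bias that layer normalization normally carries, and the names `layer_norm` and `residual_sublayer` as well as the toy sub-layer are assumptions.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one input vector."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


# Toy sub-layer standing in for attention or the feed-forward network.
out = residual_sublayer([1.0, 2.0, 3.0, 4.0], lambda x: [2 * v for v in x])
print(out)
```

Because the input is added back before normalization, the sub-layer only has to learn a correction to its input, which is the usual motivation for residual connections.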
2.2.4. Decoder
In general, the Decoder has a similar architecture to the Encoder and also has N
layers. In addition to the two sub-layers found in the Encoder, each decoder layer adds a third sub-layer, which
performs Multi-Head Attention over the encoder stack's output. The self-attention sub-layer in the decoder stack is masked to prevent positions from attending to subsequent positions.
This masking, combined with the fact that the output embeddings are offset by one
position, ensures that the predictions for position $i$ can depend only on the known
outputs at positions less than $i$.
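The masking described above can be illustrated with a small sketch. It assumes the common approach of setting the scores of forbidden positions to negative infinity before the softmax; the function names and the all-zero toy scores are illustrative, not taken from the project.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; masked-out entries get weight 0."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    top = max(masked)  # finite, because the diagonal is never masked
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
# Equal scores everywhere, so each row spreads its weight over allowed positions.
weights = [masked_softmax([0.0, 0.0, 0.0, 0.0], row) for row in mask]
for row in weights:
    print(row)
```

Each row gives zero weight to later positions, so the representation at a position never uses information from tokens that follow it in the decoder input.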
2.2.5. Multi-Head Attention

Multi-Head Attention is a form of self-attention. So that the model can pay attention to
many different patterns, the authors use several self-attention layers in parallel. Each of these
attention layers is introduced by the authors under the name Scaled Dot-Product Attention.
Each word is mapped into the embedding space to form a vector. Three vectors,
Q (query), K (key), and V (value), are then created by multiplying the word's vector
by the corresponding matrices.
Where:
• Query vector: holds the information of the word being searched for and
compared.
• Key vector: represents the information of the words to be compared
with the query word above.
• Value vector: represents the content and meaning of the words.

Figure 2.2. (left) Scaled Dot-Product Attention, (right) Multi-Head Attention consists
of several attention layers running in parallel. [5]
Assume the input consists of queries and keys of dimension $d_k$ and values of
dimension $d_v$. The query matrix is multiplied by the transpose of the key matrix to
compare each query with each key and learn their correlation, and each product is divided by $\sqrt{d_k}$. The
scores are then normalized to the range [0, 1] with the softmax function: the closer a weight is to
1, the more similar the query is to the key, and vice versa. Finally, the weights are multiplied by the
value matrix, giving $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$.
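The computation just described can be written out as a short plain-Python sketch. It uses lists of lists instead of a tensor library and covers a single attention head without masking; the function names and the toy Q, K, V values are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V, one output row per query."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(scaled_dot_product_attention(Q, K, V))
```

In this toy input the query matches the first key more closely than the second, so the output lies closer to the first value vector than to the second.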
