VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
DEPT. OF SOFTWARE ENGINEERING
Do Thi Thanh Nha - Le Thanh Luan
FINAL THESIS
Developing
Vietnamese-Sign-Language To Text
Translation System
SOFTWARE ENGINEERING MAJOR
HO CHI MINH CITY, JULY 2023
Instructor: Dr. Nguyen Trinh Dong
GRADUATION & THESIS EVALUATION
COMMITTEE INFORMATION
The Graduation & Thesis Evaluation Committee was established according to
Decision No. . . . . . . . . . dated . . . . . . . . . by the Rector of the University of
Information Technology.
Chairman: ...............
Secretary: ...............
Member: ...............
Member: ...............
ACKNOWLEDGEMENT
We would like to express our deepest gratitude and appreciation to all those
who have supported us throughout the journey of completing this thesis.
First and foremost, we are profoundly grateful to the esteemed members of the
thesis committee, particularly those from the Faculty of Software Engineering at
the University of Information Technology - VNUHCM. Their valuable insights,
constructive criticism, and scholarly contributions have significantly enhanced the
academic rigor of this thesis. We are indebted to their expertise and rigorous
evaluation.
We would also like to express our heartfelt appreciation to our family and
friends for their unconditional love, encouragement, and belief in our abilities.
Their unwavering support, understanding, and patience have been the driving
force behind our academic journey.
Furthermore, we would like to express our deep appreciation to Professor Nguyen
Trinh Dong for his invaluable guidance, expertise, and mentorship throughout
the course of this thesis. His unwavering support, constructive feedback, and
scholarly insights have played a crucial role in shaping the academic rigor and
overall quality of this work.
To everyone who has contributed to this thesis in various ways, whether directly or indirectly, we extend our heartfelt appreciation. Your support has been
invaluable in the successful completion of this academic endeavor.
Thank you very much. We wish you all the best.
Ho Chi Minh City, July 2023
Students
Le Thanh Luan
Do Thi Thanh Nha
THE OBSERVATIONS
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
Members & Participation percentage
No.  Full name         Student ID  Contribution percentage
1    Le Thanh Luan     19520702    50%
2    Do Thi Thanh Nha  18529116    50%
Contents
1 Introduction
  1.1 Problem statement
  1.2 Approach
  1.3 Results
2 Foundational knowledge
  2.1 Vietnamese Sign Language (VSL)
  2.2 Survey of Existing Sign Language Translation Technologies
    2.2.1 Sign language recognition using sensor gloves
    2.2.2 A Cost Effective Design and Implementation of Arduino Based Sign Language Interpreter
    2.2.3 Neural Sign Language Translation
    2.2.4 Deep Learning for Vietnamese sign language recognition in video sequence
  2.3 Machine Learning Techniques and Algorithms
    2.3.1 Machine learning
    2.3.2 Deep Learning
    2.3.3 Model Architectures for Sign Language Recognition
    2.3.4 Deep Neural Network (DNN)
    2.3.5 Convolutional Neural Networks (CNN)
    2.3.6 Recurrent Neural Network (RNN)
    2.3.7 Long Short-Term Memory (LSTM)
  2.4 Software Background
    2.4.1 Python
    2.4.2 Javascript
    2.4.3 Tensorflow
    2.4.4 Keras
    2.4.5 scikit-learn
    2.4.6 Jupyter Notebook
    2.4.7 Mediapipe
    2.4.8 Expo-React Native
    2.4.9 SQLServer
    2.4.10 Strapi
    2.4.11 Websocket knowledge
    2.4.12 Visual Studio Code
3 Data Collection and Preprocessing
  3.1 Reason for Dataset Creation: Scarcity of Existing Datasets and Unresponsive Researchers
  3.2 Gathering and Preparing the Dataset for VSL Recognition Model
  3.3 Data Preprocessing Steps
  3.4 Matrix Formation - Input for Machine Learning Model
  3.5 File Organization
4 Model Training
  4.1 Model Architecture
  4.2 Training Process
  4.3 Model Evaluation
5 Mobile Application Development
  5.1 Designing the Mobile Application
    5.1.1 Use cases
    5.1.2 Database Diagram
    5.1.3 The streaming system for processing videos
  5.2 Implementation and Results
    5.2.1 Integrating the Model into the Mobile App with Python
    5.2.2 Results
  5.3 User Testing and Feedback: Improving the User Experience
6 Feedback and Discussion
  6.1 User Testing Results and Feedback on the Mobile Application
  6.2 Comparison of our Mobile Application with Existing Sign Language E-learning Apps
    6.2.1 ASL Bloom
    6.2.2 Lingvano
  6.3 Other Assistive Technologies
7 Conclusion and Future Work
  7.1 Summary of Contributions and Accomplishments
  7.2 Future Directions for the Project: Expanding the dictionary
List of Figures
2.1 VSL alphabet (Source: Circular 17/2020/TT-BGDĐT)
2.2 VSL shares similarities with global sign language dictionaries
2.3 An example of sensor glove used for detecting movement sequences. Source: Cornell University ECE 4760 Sign language glove prototype
2.4 LSTM architecture
2.5 Structure of a simple neural network with an input layer, an output layer, and two hidden layers.
2.6 Starting the convolutional operation.
2.7 Step two in the convolutional operation
2.8 Finish the convolution operation when the kernel goes through the 5*5 matrix.
2.9 Example of the convolutional matrix
2.10 X matrix when adding outer padding of zeros.
2.11 Convolutional operation when stride=1, padding=1
2.12 Convolutional operation stride=2, padding=1
2.13 Illustration of convolutional operation on a color image with k=3
2.14 Tensor X, W 3 dimensions are written as 3 matrices.
2.15 Max pooling layer with size=(3,3), stride=1, padding=0
2.16 Example of pooling layer
2.17 The structure of a recurrent neural network
2.18 The flowchart of RNN-T algorithm. Used in Reliable Multi-Object Tracking Model Using Deep Learning and Energy Efficient Wireless Multimedia Sensor Networks [1]
2.19 Python Language Syntax
2.20 Javascript Language Syntax
2.21 Sklearn Metrics - Confusion matrix
2.22 Jupyter Notebook IDE
2.23 Mediapipe Hand Landmarks
2.24 Mediapipe Holistic Landmarks
2.25 Mediapipe Face Mesh Landmarks
2.26 Expo can run cross-platform
2.27 Strapi.io
2.28 Visual Studio Code Editor
3.1 1st dataset of Deep Learning for Vietnamese Sign Language Recognition in Video Sequence. Source: [11]
3.2 Some actual footage of the dataset used in Deep Learning for Vietnamese Sign Language Recognition in Video Sequence. Source: [11]
3.3 Letter D
3.4 Letter L
3.5 Letter V
3.6 Letter Y
3.7 Folder tree of the sign 'a'
4.1 Mediapipe Face Mesh Landmarks
5.1 The database diagram designed for SignItOut
5.2 The diagram illustrates the relationships among the elements within the API layer.
5.3 The diagram illustrates the relationships among the elements within the Engine layer.
5.4 Run Strapi with command prompt
5.5 Strapi UI run on localhost
5.6 Login Screen
5.7 Home screen
5.8 Course details Screen
5.9 Plain Text Lesson
5.10 Quiz Lesson
5.11 Video Lesson
6.1 The logo of ASL Bloom, one of the most popular sign language E-Learning applications with more than 100,000 users
6.2 Lingvano, the ASL learning application which uses artificial intelligence for giving feedback about users' signing accuracy.
List of Tables
3.1 Description of VSL Dataset with Alphabet signs
4.1 Model Architecture
4.2 Model Parameters
4.3 Confusion Matrix Results
5.1 Use Case Table
5.2 Use Case UC001: User Login
5.3 Use Case UC002: Browse Courses
5.4 Use Case UC003: Enroll in Course
5.5 Use Case UC004: View Course Details
5.6 Use Case UC005: Learn course
5.7 Use Case UC006: Take Quiz
5.8 Use Case UC007: Track Lesson Progress
5.9 Use Case UC008: User Logout
5.10 Use Case UC009: Use continuous signing detection
Chapter 1
Introduction
This chapter provides a comprehensive introduction to the problem addressed
in this thesis, presenting its significance and potential implications. Through a
thorough review of existing literature, the chapter identifies research gaps and
establishes a clear research objective. The chosen methodology, including research
design, data collection procedures, and analysis techniques, is outlined to provide
a logical framework for the study. Additionally, an overview of the research results
is presented, summarizing the data collected, analysis conducted, and key findings
derived from the study.
1.1 Problem statement
Vietnamese Sign Language (VSL) is the primary means of communication for
the deaf community in Vietnam. It is a unique visual language with its own grammar,
vocabulary, and syntax. VSL is not only a means of communication but also a
critical part of Deaf culture and identity. However, VSL has been marginalized
for a long time, and the deaf community faces significant challenges in their daily
lives due to the lack of recognition and understanding of VSL by the wider society.
Within the realm of the Fourth Industrial Revolution, scientists worldwide are
actively working towards resolving the pervasive issue of machine translation.
However, the field of sign language translation remains largely underserved, leaving minimal resources available to comprehend the language used by deaf and
mute individuals. For instance, in online meetings, individuals who are unable to
speak solely rely on textual communication to convey their thoughts. Regrettably,
this reliance on chat messages poses obstacles to maintaining focus during discussions of serious matters, resulting in delays that hinder the flow of the meeting.
When contemplating artificial intelligence-integrated products in Vietnam, the
predominant associations typically revolve around chatbots, recommendation systems, and autonomous vehicles - all of which find utility in business settings,
aiding individuals in their professional endeavors. Conversely, in advanced nations such as Germany or the United States, artificial intelligence extends beyond
work-related applications to encompass aiding disabled individuals in their daily
activities. This concept, known as science for humanity, has evolved in parallel
with science for business since the inception of the artificial intelligence industry. While Vietnam is witnessing a progressive surge in science for business, with
new research being published daily, achieving technological parity with developed
countries necessitates a heightened emphasis on science for humanity. This thesis
addresses this need by delving into the domain of sign language translation.
This thesis provides a valuable opportunity to apply the knowledge and skills
acquired during our four-year period at the University of Information Technology - VNUHCM. Throughout our academic journey, we have cultivated a strong
foundation in diverse disciplines, encompassing programming, computer science,
machine learning, computer vision, and natural language processing. This wealth
of expertise can now be harnessed to address the challenge of sign language translation. By leveraging our acquired knowledge, we can design a sophisticated system
that adeptly captures the intricacies of sign language through computer vision
techniques, employs natural language processing methodologies, and harnesses
advanced machine learning algorithms. Furthermore, our understanding of software engineering allows us to extend our efforts beyond the sign language translation system alone by incorporating the development of educational software for
sign language. This integration of academic learning and practical implementation holds tremendous potential for developing innovative solutions in the realm
of sign language translation.
This project delves into a profoundly relevant and widely discussed topic of our
time: artificial intelligence. By undertaking this endeavor, we not only contribute
to the growing body of knowledge and advancements in this field but also position ourselves for future opportunities to engage with state-of-the-art systems. The
process of designing and developing a sign language translation system presents us
with a unique advantage - the firsthand experience of working with large-scale systems. This invaluable experience equips us with the skills and insights necessary
to thrive in the dynamic and fast-paced environments of prominent companies.
Additionally, the use of English throughout the entire research and development
process augments our ability to collaborate on international projects. This exposure to a broader scope of work further expands our horizons and enhances our
prospects for engaging in diverse and impactful initiatives on a global scale.
Our project holds a crucial position in the dynamic world of technology, where
constant progress and innovation are the norm. It serves as both a reference
point and an inspiring source for students and researchers seeking to explore
the frontiers of possibility. Our primary objective is to push the boundaries of
what can be achieved, and we envision our project as a guiding force, providing
invaluable insights and lessons that will drive future endeavors in the development
of state-of-the-art systems. By openly sharing our methodologies, discoveries, and
breakthroughs, particularly in relation to the comprehensive data set we have
meticulously constructed for sign language translation, we foster a collaborative
environment that nurtures the growth of technological excellence and propels the
field forward.
1.2 Approach
This thesis represents a meticulous and comprehensive endeavor, drawing upon
a strong foundational knowledge base that will be thoroughly expounded upon in
Chapter 2. The development process unfolds through several distinct stages, each
of which assumes a pivotal role in shaping the trajectory of our research.
To commence, we embark on a thorough survey of existing American Sign Language (ASL) translation technologies and mobile applications. This investigation
serves to identify best practices, potential challenges, and valuable insights to inform our own approach. Subsequently, we undertake a comprehensive study of
Vietnamese Sign Language (VSL), examining its unique features and
intricacies. This exploration provides us with a profound understanding of the
language and forms a critical basis for our subsequent endeavors.
The next phase of our development process centers around the collection and
preprocessing of a robust dataset consisting of VSL videos. This dataset serves
as the foundation for training and testing our sign language recognition models,
allowing us to refine their accuracy and effectiveness. Through extensive research
and experimentation, we endeavor to develop a sophisticated deep-learning model
that optimizes the learning process and enhances the recognition capabilities of
the system.
With the core models in place, we proceed to develop the sign language translation system, integrating it with a streamlined and efficient streaming architecture.
This integration ensures seamless real-time translation capabilities, bolstering the
system’s practicality and usability. To further enhance the user experience and
validate the effectiveness of the developed system, extensive user testing is conducted. Through this process, we gain valuable insights into user feedback, enabling iterative improvements and refining the system’s performance.
In the final stages of our development process, we focus on integrating the sign
language translation system into a mobile application. This integration facilitates
wider accessibility and usage, enabling individuals to easily access and benefit
from the translation capabilities on their mobile devices. Through rigorous user
testing and evaluation, we aim to optimize the user experience, ensure the system’s
effectiveness, and demonstrate the practical application of the translation model
within a software context.
In summary, this thesis follows a multifaceted development journey comprising a comprehensive survey, an in-depth study of VSL, dataset collection,
preprocessing, model development, system integration, and user testing. Through
these stages, we aim to contribute to the advancement of sign language translation
technology, improve user experiences, and exemplify the practical application of
our research findings.
1.3 Results
The outcomes of this thesis encompass a multitude of valuable contributions to
the field of Vietnamese sign language. These achievements include the construction of a meticulously curated Vietnamese sign language dataset, which serves as
a valuable resource for further research and development. Additionally, we have
successfully developed a robust and accurate model for sign language recognition and classification, enabling precise interpretation and understanding of sign
language gestures.
Building upon these foundations, we have also created a real-time sign language
translation system, incorporating cutting-edge technologies and algorithms. This
system facilitates seamless and instantaneous translation between sign language
and natural language, fostering effective communication and bridging the gap
between the deaf and hearing communities.
Furthermore, we have developed a mobile application dedicated to sign language education. This application serves as a comprehensive learning platform,
providing resources, tutorials, and interactive exercises that help individuals
acquire the foundations of Vietnamese sign language. Through this application,
we strive to promote inclusivity, accessibility, and equal opportunities for individuals with hearing impairments. Moreover, beyond its primary focus on sign
language education for the deaf community, this application also serves a vital
secondary purpose. It functions as a platform that extends the opportunity for
individuals who are not deaf to delve deeper into the world of sign language,
fostering a greater understanding and appreciation of the deaf community.
Collectively, these achievements mark a significant advancement in the domain of Vietnamese sign language. The dataset, sign language recognition model,
real-time translation system, and educational mobile application together enrich communication, education, and inclusivity within the
deaf community.
Chapter 2
Foundational knowledge
This chapter serves as a comprehensive introduction to the key concepts and
foundations that underpin our thesis. We delve into essential areas such as the core
principles of Vietnamese Sign Language, the fundamentals of machine learning,
algorithms, and pertinent technologies. By establishing this groundwork, we lay a
solid foundation for the subsequent exploration and development of our research.
Furthermore, we acknowledge the significant contributions of related research that
we have consulted and drawn upon to enrich and inform our thesis, highlighting
the valuable insights and knowledge from the wider academic community that
have shaped our work.
2.1 Vietnamese Sign Language (VSL)
Vietnamese Sign Language (VSL) holds a significant place as the primary mode
of communication for the deaf community in Vietnam. It is a visual and gestural
language that utilizes hand movements, facial expressions, and body postures to
convey meaning and express thoughts and emotions. Over the course of history,
sign language in Vietnam has evolved and diversified, catering not only to everyday activities but also to the demands of professional settings. This dynamic
evolution has led to the development of a comprehensive sign language system
that encompasses a wide range of expressions and vocabulary, enabling effective
communication in various contexts.
Fig. 2.1: VSL alphabet (Source: Circular 17/2020/TT-BGDĐT)
Common misconceptions regarding sign language often arise among individuals
who are not familiar with its intricacies. One prevalent misunderstanding pertains
to sign language being perceived as a universal or international language, wherein
deaf communities worldwide utilize a shared sign language system. However, this
assumption is inaccurate since sign language vocabularies differ across cultures.
Even within a single country, variations in vocabulary can be observed between different regions. Research conducted by the National Center for Special Education
in Vietnam has revealed that sign language in the Southern, Central, and Northern regions of the country exhibits a similarity of only 50%. Moreover, the study
has identified approximately 200 words that are commonly understood within
the deaf community in Vietnam. These findings shed light on the rich diversity
and unique linguistic characteristics of sign language, reinforcing the importance
of recognizing and appreciating the cultural and regional nuances that shape its
vocabulary and expression.
Though its vocabulary shares some similarities with global sign language dictionaries, Vietnamese Sign Language possesses a distinct grammar, vocabulary, and syntax that set it apart from spoken Vietnamese. Notably,
while spoken Vietnamese typically follows a subject-verb-complement order, Vietnamese Sign Language adopts a different pattern. In
sign language, the order is subject-complement-verb, offering a unique linguistic
framework for conveying meaning. Additionally, the placement of numbers diverges between the two languages. In spoken Vietnamese, numbers tend to precede
the subject, while in Vietnamese Sign Language, numbers are typically positioned
after the subject. These variations highlight the intricacies and idiosyncrasies that
exist within the linguistic systems, emphasizing the need for a comprehensive understanding of Vietnamese Sign Language as a distinct and separate mode of
communication.
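The word-order difference described above can be illustrated with a small sketch. It is only a toy example: the `Clause` type and both functions are hypothetical names introduced here, the English glosses stand in for real signs, and the sketch covers only the subject-verb-complement versus subject-complement-verb contrast, not VSL grammar as a whole.

```python
from typing import List, NamedTuple


class Clause(NamedTuple):
    """A simple clause split into its three parts (hypothetical helper type)."""
    subject: str
    verb: str
    complement: str


def spoken_order(clause: Clause) -> List[str]:
    """Spoken Vietnamese order as described in the text: subject-verb-complement."""
    return [clause.subject, clause.verb, clause.complement]


def sign_order(clause: Clause) -> List[str]:
    """VSL order as described in the text: subject-complement-verb."""
    return [clause.subject, clause.complement, clause.verb]


# English glosses are placeholders, not actual VSL signs.
clause = Clause(subject="I", verb="eat", complement="rice")
print(spoken_order(clause))  # ['I', 'eat', 'rice']
print(sign_order(clause))    # ['I', 'rice', 'eat']
```

A translation system that outputs natural text has to account for this kind of reordering, since emitting recognized signs in signing order would not yield a natural spoken-language sentence.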
Fig. 2.2: VSL shares similarities with global sign language dictionaries
Despite its importance, Vietnamese Sign Language has faced challenges and
marginalization in society. Limited recognition and understanding of sign language by the wider community have created barriers for the deaf community in
various aspects of life, including education, employment, and social interactions.
Efforts are being made to promote awareness and inclusivity, advocating for the
recognition of Vietnamese Sign Language as an official language and ensuring
accessibility for the deaf community.
Technological advancements, including the development of sign language recognition and translation systems, hold promise for improving communication and
accessibility for the deaf community. These innovations aim to bridge the communication gap between deaf and hearing individuals, facilitating effective interactions and fostering equal opportunities.
2.2 Survey of Existing Sign Language Translation Technologies
Sign language plays a crucial role in facilitating communication for individuals
in the deaf and hard-of-hearing community. In Vietnam, Vietnamese Sign Language (VSL) is the predominant sign language, while other countries have their
own unique sign languages such as American Sign Language (ASL) for the United
States and British Sign Language (BSL) for the United Kingdom. Over the years,
there has been significant progress in the development of technologies for ASL
translation to facilitate better communication between the deaf community and
the hearing world. In this section, we will provide a detailed survey of existing
sign language translation technologies and mobile apps.
2.2.1 Sign language recognition using sensor gloves
One of the early methodologies employed for sign language recognition involved
the use of sensor gloves, which are equipped with specific sensors to capture
hand movements and utilize machine learning algorithms for recognition purposes.
Notably, a significant contribution in this domain is the research paper titled
"Sign language recognition using sensor gloves" authored by Syed Atif Mehdi
et al., published in "Proceedings of the 9th International Conference on Neural
Information Processing, 2002 (ICONIP’02)".
The paper explores the feasibility of recognizing sign language gestures by utilizing sensor gloves, leveraging the idea of employing these gloves in gaming or
applications involving custom gestures. The outcome of this research effort is the
development of a project named "Talking Hands". This project features a sensor
glove capable of capturing American Sign Language signs performed by a user
and subsequently translating them into English sentences. Artificial neural networks are employed to recognize the sensor values obtained from the sensor glove,
which are then categorized into 24 alphabetic characters in English along with
two punctuation symbols.
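To make the classification step concrete, the following is a minimal sketch of how a small feed-forward neural network could map one glove reading to one of the 26 output classes (24 letters plus two punctuation symbols). It is not the network from the cited paper: the number of sensors, the hidden-layer size, the activation function, and the random, untrained weights are all assumptions chosen for illustration.

```python
import math
import random

NUM_SENSORS = 7    # assumption: the sensor count is not stated in the text
NUM_HIDDEN = 12    # assumption: illustrative hidden-layer size
NUM_CLASSES = 26   # 24 letters + 2 punctuation symbols, as described above


def init_layer(n_in, n_out, rng):
    """Create (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Compute the weighted sum plus bias for every unit in the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sensor_values, hidden, output):
    """Forward pass: sensor values -> hidden layer (tanh) -> class probabilities."""
    activations = [math.tanh(v) for v in dense(sensor_values, *hidden)]
    probs = softmax(dense(activations, *output))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


rng = random.Random(0)
hidden = init_layer(NUM_SENSORS, NUM_HIDDEN, rng)
output = init_layer(NUM_HIDDEN, NUM_CLASSES, rng)

reading = [0.1, 0.8, 0.7, 0.2, 0.9, 0.0, 0.3]  # made-up normalised glove reading
label, probs = classify(reading, hidden, output)
print(label, len(probs))
```

Because the weights are untrained, the predicted class index carries no meaning here. In a real system the weights would be learned from labelled glove recordings, and the index would be mapped to a letter or punctuation symbol.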
The system achieves an accuracy rate of up to 88%. The authors acknowledged
that this accuracy could potentially be even higher if the dataset used were sourced
from individuals with a proficient understanding of sign language, rather than
relying on samples from individuals who were not well-versed in sign language.
Fig. 2.3: An example of sensor glove used for detecting movement sequences.
Source: Cornell University ECE 4760 Sign language glove prototype
However, this approach has limitations. The project cannot effectively handle
characters that involve dynamic gestures or require the use of both hands. More
broadly, although glove-based recognition is practical and widely applicable, it
requires a specialized hardware setup, offers limited accuracy, is costly, and has
difficulty detecting facial expressions and body language, which are integral
components of sign language communication.
2.2.2 A Cost Effective Design and Implementation of Arduino Based Sign Language Interpreter
In response to the high cost of developing sign language recognition systems using sensor gloves, researchers have sought ways
to reduce the overall cost of the device. One such study is the research paper
"A Cost Effective Design and Implementation of Arduino Based Sign Language
Interpreter" by Anirbit Sengupta et al., published in "2019 Devices for
Integrated Circuit (DevIC)".
This research explores the use of cloth-based gloves with Bluetooth connectivity. Each glove is equipped with one accelerometer and five flex sensors,
placed along the length of each finger, including the thumb. The
flex sensors capture finger posture, since their resistance changes with the extent of their curvature.
These resistance values are combined with the accelerometer readings, which measure the tilt of the
hand relative to the ground, to recognize intricate hand gestures. The collected
data is then processed by a microcontroller module and can be transmitted to
a smartphone over Bluetooth.
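The sensing step described above can be illustrated with a minimal Python sketch that converts a raw flex-sensor reading into a bend estimate and an accelerometer sample into tilt angles. The voltage-divider wiring, the 10-bit ADC, the resistor value, and the calibration resistances are assumptions chosen for illustration; none of them are taken from the cited paper.

```python
import math

ADC_MAX = 1023        # assumed 10-bit ADC resolution
V_SUPPLY = 5.0        # assumed supply voltage (volts)
R_FIXED = 10_000.0    # assumed fixed divider resistor (ohms)
R_FLAT = 25_000.0     # hypothetical sensor resistance when straight (ohms)
R_BENT = 100_000.0    # hypothetical sensor resistance at a 90-degree bend (ohms)


def flex_resistance(adc_value):
    """Recover the flex-sensor resistance from a voltage-divider reading.

    Assumed wiring: supply -> flex sensor -> ADC pin -> fixed resistor -> ground.
    """
    if not 0 < adc_value <= ADC_MAX:
        raise ValueError("ADC reading must be in (0, ADC_MAX]")
    v_out = V_SUPPLY * adc_value / ADC_MAX
    return R_FIXED * (V_SUPPLY - v_out) / v_out


def bend_angle(resistance):
    """Linearly map resistance to a bend angle in degrees, clamped to [0, 90]."""
    fraction = (resistance - R_FLAT) / (R_BENT - R_FLAT)
    return 90.0 * min(1.0, max(0.0, fraction))


def tilt_angles(ax, ay, az):
    """Estimate pitch and roll (degrees) from a static accelerometer sample in g."""
    pitch = math.degrees(math.atan2(ax, math.hypot(ay, az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll
```

In a glove of this kind, five such bend estimates and the two tilt angles would together form the feature vector for one hand pose. The linear resistance-to-angle mapping is a simplification; a real sensor would need per-finger calibration.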
The research findings indicate an accuracy rate of approximately 86.67%, with
some biases observed in recognizing letters such as A, B, F, H, I, J, U, W, Y,
and Z. The authors also mentioned that certain letters like M, N, O, R, S, T, V,
and X cannot be effectively demonstrated as they share gestural similarities with
other letters. The glove can recognize the user’s hand gestures and convert them
into text and voice with the assistance of a smartphone application.
Efforts to reduce the cost of glove-based sign language translation systems come with trade-offs, as the reduced cost
may affect the overall performance or effectiveness of the device. Despite these
limitations, such research endeavors pave the way for exploring more accessible
and affordable alternatives for sign language recognition, contributing to the ongoing development and evolution of technology for the deaf community.
2.2.3 Neural Sign Language Translation
Another approach applies computer vision techniques, using cameras
to capture the signer’s hand and body movements and recognize them as specific
signs. This approach involves three main steps: hand segmentation, feature extraction, and recognition. Hand segmentation involves separating the signer’s hand
from the background. Feature extraction involves extracting relevant features from
the segmented hand image. Recognition involves classifying the extracted features
to recognize the sign. One prominent study using this approach is the
research paper "Neural Sign Language Translation"[3].
This paper introduces the problem of Sign Language Translation (SLT), which
aims to generate spoken language translations from sign language videos, taking
into account the grammatical and linguistic structures of sign language. The authors propose a framework based on Neural Machine Translation (NMT) to learn
the spatial representations, language model, and mapping between sign and spoken language. They utilize 2D convolutional neural networks (CNNs) to learn spatial embeddings for sign videos and word embeddings for spoken language words.
The sign videos are tokenized using either frame-level or gloss-level tokenization,
while the spoken language sentences are tokenized at the word level.
In addition, an attention-based encoder-decoder network is employed to generate the target spoken language sentences, with attention mechanisms capturing the alignment between sign videos and spoken language sentences. The authors also introduce the RWTH-PHOENIX-Weather 2014T dataset, which provides continuous sign language videos with gloss annotations and spoken language
translations. Experimental results on this dataset demonstrate the effectiveness
of their approach. The paper concludes by discussing the findings and the future
directions for SLT research.
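To illustrate how an attention mechanism aligns a decoding step with the encoded video frames, the following is a minimal Python sketch of dot-product attention. Dot-product scoring is one common choice and is used here as an assumption; the sketch does not reproduce the paper's exact scoring function, and the function names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return attention weights over encoder states and the resulting context vector.

    decoder_state: list of floats (current decoder hidden state)
    encoder_states: list of equally sized lists (one per encoded frame or gloss)
    """
    # Alignment score: dot product between the decoder state and each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context
```

At each output word, the decoder would consume the context vector alongside its own state, so the weights indicate which parts of the sign video contribute most to that word.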
2.2.4 Deep Learning for Vietnamese sign language recognition in video sequence
The paper "Deep Learning for Vietnamese sign language recognition in video
sequence"[?] by Nguyen Thien Bao et al. is among the works most closely related to the sign language recognition and translation approach
of this thesis. This paper presents a detailed investigation into the automatic
recognition of Vietnamese Sign Language (VSL) using various feature extraction approaches and deep learning techniques. The authors address the specific
challenges associated with VSL recognition in video sequences, including camera
orientation, hand position, inter-hand relation, and other factors that make the
task complex.
The proposed approach comprises two main types of feature extraction: spatial features and scene-based features. Spatial features involve the utilization of
local descriptors, namely Local Binary Pattern (LBP), Local Phase Quantization
(LPQ), and Histogram of Oriented Gradients (HOG). These techniques aim to
capture essential information about hand gestures by analyzing texture, intensity, and gradient patterns within specific regions of interest. On the other hand,
scene-based features employ the GIST descriptor, which focuses on the dominant
spatial structure of a scene, taking into account perceptual dimensions such as
naturalness, openness, roughness, expansion, and ruggedness.
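As an illustration of the local descriptors mentioned above, the following minimal Python sketch computes the basic 3x3 Local Binary Pattern (LBP) code for each interior pixel of a grayscale image and collects the codes into a histogram. The neighbor ordering and the greater-or-equal comparison are conventions assumed here; the cited paper's exact LBP variant and parameters are not known from the text.

```python
# Neighbor offsets (row, column), clockwise starting at the top-left pixel.
# The ordering is a convention assumed for this sketch.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Return the 8-bit LBP code of an interior pixel of a 2D grayscale image."""
    center = image[row][col]
    code = 0
    for bit, (d_row, d_col) in enumerate(NEIGHBORS):
        # A neighbor at least as bright as the center sets the corresponding bit.
        if image[row + d_row][col + d_col] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels (256 bins)."""
    histogram = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            histogram[lbp_code(image, row, col)] += 1
    return histogram
```

The resulting 256-bin histogram is the kind of texture feature vector that could be passed to a classifier such as an SVM.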
For the recognition stage, the authors explore traditional classification methods,
with Support Vector Machine (SVM) being the chosen approach. SVM classifiers
are trained using the extracted spatial and scene-based features, and the recognition performance is evaluated. Additionally, a deep learning-based approach called
Deep Vietnamese Sign Language (DVSL) is introduced. In this approach, Convolutional Neural Network (CNN) features are extracted from a pre-trained VGG16
model. These features are then fed into Long Short-Term Memory (LSTM) models, which learn to predict sign language based on image sequences.
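The sequence-modeling step can be sketched in plain Python as a single LSTM layer that consumes one feature vector per frame and returns its final hidden state. This is an illustrative sketch only: the tiny feature vectors stand in for VGG16 features, the weights are random rather than trained, and the helper names (`init_params`, `lstm_step`, `encode_sequence`) are hypothetical rather than taken from the DVSL implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights for the four gates; each matrix acts on [x, h]."""
    rng = random.Random(seed)
    params = {}
    for gate in ("i", "f", "o", "g"):
        params["W_" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
            for _ in range(hidden_dim)
        ]
        params["b_" + gate] = [0.0] * hidden_dim
    return params


def lstm_step(x, h, c, params):
    """One LSTM time step: returns the new hidden state and cell state."""
    z = list(x) + list(h)  # concatenate the input with the previous hidden state
    gates = {}
    for gate in ("i", "f", "o", "g"):
        # The candidate gate "g" uses tanh; input, forget, and output gates use sigmoid.
        activation = math.tanh if gate == "g" else sigmoid
        gates[gate] = [
            activation(sum(w * v for w, v in zip(row, z)) + b)
            for row, b in zip(params["W_" + gate], params["b_" + gate])
        ]
    c_new = [f * c_prev + i * g
             for f, c_prev, i, g in zip(gates["f"], c, gates["i"], gates["g"])]
    h_new = [o * math.tanh(cn) for o, cn in zip(gates["o"], c_new)]
    return h_new, c_new


def encode_sequence(frame_features, params, hidden_dim):
    """Run the LSTM over all frames and return the final hidden state."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in frame_features:
        h, c = lstm_step(x, h, c, params)
    return h
```

In a full recognizer, the final hidden state would typically be passed to a softmax layer over the sign vocabulary, and all weights would be learned from labeled videos.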
Two VSL datasets are collected and used for experimentation. The first
dataset focuses on family-related topics and contains words with minimal changes
between frames. The second dataset involves more complex gestures, including
relative positions and orientations of body parts. To augment the datasets and
enhance the training process, data augmentation techniques are applied. These
techniques include rotation transformations and the addition of salt-and-pepper
noise, which provide variations in hand movement and position.
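The two augmentation techniques named above can be sketched in plain Python as follows. The noise amount, the nearest-neighbor resampling, and the fill value for pixels rotated in from outside the frame are assumptions made for illustration, not settings reported in the cited paper.

```python
import math
import random


def salt_and_pepper(image, amount, rng):
    """Return a copy of a grayscale image with a fraction of pixels set to 0 or 255.

    amount: probability that a pixel is corrupted (split evenly between pepper and salt)
    rng: a random.Random instance, passed in so results are reproducible
    """
    noisy = [row[:] for row in image]
    for r in range(len(noisy)):
        for c in range(len(noisy[0])):
            u = rng.random()
            if u < amount / 2:
                noisy[r][c] = 0        # pepper
            elif u < amount:
                noisy[r][c] = 255      # salt
    return noisy


def rotate(image, degrees, fill=0):
    """Rotate a grayscale image about its center using nearest-neighbor sampling."""
    height, width = len(image), len(image[0])
    center_y, center_x = (height - 1) / 2, (width - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    output = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Inverse mapping: locate the source pixel for each output pixel.
            dy, dx = r - center_y, c - center_x
            src_x = cos_t * dx + sin_t * dy + center_x
            src_y = -sin_t * dx + cos_t * dy + center_y
            src_r, src_c = round(src_y), round(src_x)
            if 0 <= src_r < height and 0 <= src_c < width:
                output[r][c] = image[src_r][src_c]
    return output
```

Applying small rotations and light noise to every training frame yields additional samples that keep the sign label unchanged while varying hand position and image quality.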
Experimental results demonstrate the effectiveness of the proposed approaches.
The SVM-based models achieve an accuracy of 88.5% on the VSL-WRF-01-EXT
dataset, while the DVSL model achieves an even higher accuracy of 95.83% on
the same dataset. These results indicate the promising performance of both the
traditional SVM-based approach and the deep learning-based DVSL approach in
recognizing VSL.
However, it should be noted that the performance of the DVSL model is relatively lower on the VSL-WRF-02-EXT dataset, which contains more complex
gestures involving the relationship between body parts. This highlights a potential
limitation of the deep learning-based approach and suggests the need for further
investigation to improve its performance in handling such complex gestures.
2.3 Machine Learning Techniques and Algorithms
Machine learning techniques and algorithms are at the core of modern artificial
intelligence systems, enabling computers to learn from data and enhance their
performance autonomously. These methods encompass various approaches that
seek to extract valuable patterns and insights from intricate datasets. In recent
years, machine learning has made significant strides in the field of sign language
translation, particularly through computer vision algorithms and deep learning
architectures. These advancements have paved the way for the creation of advanced systems that facilitate communication between sign-language users and
non-sign-language speakers. This section explores the
machine learning algorithms, concepts, architectures, and techniques employed in
this thesis, focusing specifically on their application to sign language translation.
2.3.1 Machine learning
Machine learning is a subset of artificial intelligence (AI) that focuses on developing algorithms and models that enable computers to learn from data and
make predictions or decisions without being explicitly programmed. It empowers machines to automatically improve their performance through experience and
exposure to relevant information. Machine learning algorithms can process and
analyze vast amounts of data to identify patterns, relationships, and insights that
humans may not easily discern.
At its core, machine learning is driven by the principle of training models on
labeled data to recognize patterns and generalize from examples. These models can
then be applied to new, unseen data to make predictions or classify information
accurately. The training process involves adjusting the model’s parameters to
minimize the difference between predicted outputs and the known correct outputs
in the training data, effectively learning from the data patterns.
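The training process described above can be made concrete with a minimal Python sketch: a one-variable linear model whose two parameters are adjusted by gradient descent to minimize the mean squared difference between predicted and known outputs. The model, learning rate, and toy data are assumptions chosen purely for illustration.

```python
def train_linear(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Labeled training data generated from y = 2x + 1.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = [1.0, 3.0, 5.0, 7.0, 9.0]
weight, bias = train_linear(inputs, targets)

# Apply the trained model to an unseen input.
prediction = weight * 10.0 + bias
```

Deep networks follow the same principle with many more parameters, with the gradients obtained through backpropagation.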
Machine learning techniques can be broadly categorized into three main types:
supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, models are trained on labeled data where the desired outputs are
known, enabling them to make accurate predictions or classifications. Unsupervised learning, on the other hand, involves training models on unlabeled data to
discover patterns or groupings within the data. Reinforcement learning involves
training models to interact with an environment and learn optimal actions through
trial and error, guided by a reward system.