<b> Student: Trần Mạch Tuấn Kiệt </b>
<i><b> Student ID: 19010214 Cohort: K13 </b></i>
<i><b> Major: Control Engineering and Automation Program: Full-time undergraduate </b></i>
<b> Supervisor: Dr. Lê Minh Huy </b>
<b>Hanoi – 2024 </b>
<b>SOCIALIST REPUBLIC OF VIETNAM</b>
<b>Independence - Freedom - Happiness </b>
- Faculty of Electrical and Electronic Engineering
Author of the thesis: Trần Mạch Tuấn Kiệt
Major: Control Engineering and Automation
The graduation thesis was defended on April 3, 2024
Topic: RESEARCH AND DEVELOPMENT OF A CAMERA-BASED GESTURE RECOGNITION SYSTEM USING TINY MACHINE LEARNING (TINYML) RUNNING ON A MICROCONTROLLER
Supervisor: Dr. Lê Minh Huy
Following the Committee's comments and under the guidance of the supervisor, the author has carefully incorporated the Committee's feedback and revised and supplemented the thesis in accordance with the Committee's conclusions. The revisions are detailed as follows:
<b>1. Revisions made by the author following the Committee's comments </b>

| Content | Old page | Revised to | New page |
| --- | --- | --- | --- |
| (…) condensed related work | | | |
| Model training configuration parameters | 2 | Summary table of the model training configuration parameters | 31 |
| Definitions of the evaluation metrics | | Definitions of the evaluation metrics | |
This project introduces a new approach that uses camera-based hand gesture recognition on microcontrollers for control purposes. The proposed method employs a lightweight machine learning (TinyML) model designed specifically for deployment on resource-constrained platforms such as microcontrollers. The proposal is compared with previous methods, yielding competitive accuracy at a significantly smaller model size. The model is then deployed on an ARM Cortex-M7 microprocessor and applied to control a magnetic scanner in a Non-Destructive Testing (NDT) system.
The project also synthesizes and applies knowledge of communication and control of automation devices. In addition, it requires control programming in both LabVIEW and Python to optimize the operation of the system.
In the course of completing the project, the author has accumulated solid knowledge, experience, and expertise in Deep Learning and TinyML, and has successfully applied knowledge of device-to-device communication, automatic system control, and related topics.
My name is: Trần Mạch Tuấn Kiệt
Major: Control Engineering and Automation
I have carried out the graduation thesis titled: Research and development of a camera-based gesture recognition system using tiny machine learning (TinyML) running on a microcontroller.
I declare that this is my own research, conducted under the supervision of Dr. Lê Minh Huy.
The research content and results in this thesis are truthful and have not been published by other authors in any form. If any form of dishonesty is discovered, I will take full responsibility before the law.
<b>SUPERVISOR </b>
To complete this thesis, I would like to express my deep gratitude to everyone who supported and helped me during its preparation. The process faced numerous obstacles and difficulties because of the novelty of the project. Nevertheless, I received a great deal of help, in knowledge, material support, and encouragement, which allowed me to complete the graduation thesis with the best possible result.
First, I would like to sincerely thank Dr. Lê Minh Huy, who directly supervised this thesis, provided dedicated guidance, and helped me resolve the difficult problems that arose during its completion.
Next, I would like to send my sincere thanks to all the lecturers of the Faculty of Electrical and Electronic Engineering, Phenikaa University, and especially to ICSLab (A4-705, Phenikaa University), for providing the most favorable conditions for me to complete this graduation thesis.
I look forward to receiving feedback from the lecturers so that I can address the remaining shortcomings and continue to improve.
Sincerely, thank you!
Hanoi, day … month … 2024 Student
This project introduces a novel approach utilizing camera-based hand gesture recognition for control purposes. The proposed methodology employs lightweight machine learning (TinyML) models tailored for deployment on resource-constrained platforms such as microcontrollers. The proposal is compared with previous methods and achieves competitive accuracy at a significantly smaller model size. Subsequently, the model is implemented on an ARM Cortex-M7 microprocessor and applied to govern a magnetic scanner within a Non-Destructive Testing (NDT) experimental system.
This project applies knowledge of communication and control of automation devices. Furthermore, it integrates control programming techniques in both LabVIEW and Python to optimize the system's operation. In the course of the project, the author gained substantial knowledge, experience, and expertise in Deep Learning and TinyML, and successfully applied knowledge of device-to-device communication, automatic system control, and related topics.
CHAPTER III. EXPERIMENTAL RESULTS & BENCHMARK ... 31
3.1. Experimental setup & dataset ... 31
CHAPTER IV. BUILDING A CONTROL SYSTEM USING HAND
4.4.1. Deploy model into OpenMV Cam H7 ... 51
4.4.2. Control programming on LabVIEW ... 54
<b>Figure 1.2 Workflow of the entire research for hand gesture recognition </b>
<b>(HGR). Part 1 involves the process of building a Deep Learning model. After </b>
obtaining the model and successfully deploying it on OpenMV, part 2 describes the process of building a control system to integrate the HGR model into real applications. ... 7
Figure 2.1 Block diagram of proposed method for hand gesture recognition. ... 8
Figure 2.2 Data augmentation processing. ... 10
Figure 2.3 Some samples after data augmentation with Random Zoom, Random Rotation, Random Brightness ... 11
<b>Figure 2.4 Basic Convolutional Neural Network. A typical deep learning model can take an input and transform it into corresponding outputs. Raw image data can </b>
be fed into the model without going through any traditional data extraction techniques. [30] ... 13
<b>Figure 2.5 Comparison between systems with TinyML (a) and without </b>
<b>TinyML (b). TinyML is embedded directly on the microprocessors on the sensor </b>
to process data before seeking help from edge AI or cloud AI [31]. ... 13
<b>Figure 2.6 Kernel production operation. The process of multiplying the kernel </b>
with each submatrix taken from the input respectively along the length and width of the input ... 15
<b>Figure 2.7 Forward propagation and Backward propagation of convolution </b>
<b>operation. Both processes follow chain rules. ... 16</b>
Figure 2.8 Operation of a standard convolutional layer (a) replaced by depthwise separable convolution with two separate layers: a depthwise layer (b) and a pointwise layer (c) [9] ... 19
Figure 2.9 Squeeze and Excitation block. [24] ... 22
<b>Figure 2.10 Proposed model architecture as base model for hand gesture </b>
<b>recognition. We use the first four stages as feature extractor and final stage as </b>
lightweight classifier. ... 27
<b>Figure 2.11 Detailed structure of each SE block used in micro-bottleneck. </b>
Block a) is SE Conv Block to help reduce spatial dimension before entering block
b) SE Residual Block to better extract features. ... 27
Figure 3.1 Some samples from the ASL dataset ... 32
Figure 3.2 Some samples from the OUHANDS dataset ... 33
Figure 3.3 Experimental setup ... 33
<b>Figure 3.4 Learning curve diagram when training the model on the ASL dataset with expansion factor t = 3.0.</b> The learning curves show the decrease in the loss function and the increase in accuracy. The model achieves its best results after more than 50 epochs, with nearly 100% accuracy on both the training and validation sets. ... 35
Figure 3.5 Accuracy and number of parameters when using two different expansion factors, t = 0.25 and t = 3.0 ... 36
Figure 3.6 Confusion matrix of proposed method with a) t = 0.25 and b) t = 3.0 when deploying transfer learning in OUHANDS dataset ... 38
Figure 3.7 Comparing size (a) before and after implementing the quantization technique for each of the following models: proposed model with micro-bottleneck, Lightweight CNN [34], CrossFeat [33] and ExtriDeNet [16] ... 39
Figure 4.1 System design block diagram ... 42
Figure 4.2 The control system in practice ... 43
Figure 4.3 Motion Control PMC-1HS-USB ... 43
Figure 4.4 Driver MD5-HD14 ... 44
Figure 4.5 Step motor A8K-M566 ... 45
Figure 4.6 OpenMV Cam H7 ... 46
Figure 4.7 OpenMV IDE ... 48
Figure 4.8 LabVIEW front panel, which contains many types of objects placed inside decoration blocks ... 49
Figure 4.9 LabVIEW block diagram. The input values, output values, and buttons shown on the front panel are represented here ... 49
Figure 4.10 Simple block diagram of connecting to OpenMV through LabVIEW [40] ... 50
Figure 4.11 LabVIEW & OpenMV connection UI [40] ... 50
Figure 4.12 User interface for deploying the model to OpenMV ... 52
Figure 4.13 OpenMV control algorithm flowchart ... 53
Figure 4.14 Result is shown in OpenMV interface ... 53
Figure 4.15 Algorithm flowchart of using LabVIEW ... 54
Figure 4.16 The 9 gestures used. "1 – Left" means that the class of the gesture below is 1 and LabVIEW interprets it as the command to go left ... 55
Figure 4.17 LabVIEW’s user interface ... 57
Figure 4.18 When no mode is defined, no action is taken even though OpenMV captures a "left" gesture. ... 57
Figure 4.19 Change parameters... 58
Figure 4.20 Change mode ... 58
Figure 4.21 Motors work in continuous mode ... 59
Figure 4.22 Motors work in use parameters mode ... 59
Table 2.1 Model configuration ... 25
Table 2.2 Details of block’s configuration in micro-bottleneck ... 28
Table 2.3 Details of learning process hyperparameters ... 31
Table 3.1 Summary of the ablation evaluation of the proposed model, comparing its accuracy and model size with some other modifications ... 36
Table 3.2 Summary of the results of the proposed model and previous models in HGR systems ... 41
Table 4.1 PMC-1HS-USB Specifications ... 44
Table 4.2 MD5-HD14 Specifications ... 45
Table 4.3 A8K-M566 Specifications ... 46
Table 4.4 OpenMV H7 Specifications ... 46
<b>1.1. Introduction </b>
With the development and increasing popularity of computers in daily life, bridging the communication gap between humans and machines has become a vital area of research. While conventional input methods like keyboards and mice remain prevalent, they do not reflect the most natural way humans interact with one another. Furthermore, as technologies such as virtual reality, remote control, and augmented reality gain traction, traditional input devices often prove inadequate. Considering these constraints, hand gesture recognition (HGR) emerges as a promising alternative, offering a more intuitive and natural means of interaction between humans and technology.
Hand gesture recognition, a pivotal component of human-computer interaction, has emerged as a transformative technology that enables users to communicate with computers through intuitive hand movements, eliminating the need for traditional input devices such as keyboards and mice and offering a more natural, immersive interaction experience. HGR aims to develop methods to detect and recognize human hand gestures and translate them into commands [1]. To achieve this objective, several techniques are employed to collect data for recognition, broadly categorized into two main approaches [2]. The first approach is sensor-based, wherein sensors attached directly to the user's hand or arm capture various data types such as shape, position, movement, or trajectory. Commonly used devices today include gloves fitted with accelerometers or position sensors, among others; other systems employ electromyography (EMG) to detect muscle electrical impulses and decode these signals into specific finger movements. More advanced approaches are vision-based, using one or more types of cameras to capture motion or gesture appearance. While this method is simple to set up, it must handle varying conditions such as diverse lighting and complex backgrounds [3]. However, by leveraging computer vision techniques and deep learning models, hand gesture recognition
systems can interpret and classify intricate hand poses and motions in real time, enabling seamless interaction with digital interfaces and environments. This approach has gradually become the main direction in recent years due to the development of vision-based applications and the trend toward touchless control in many fields such as gaming, electronic device control, and virtual reality applications [4]. In this study we consider only vision-based methods. Hand gesture recognition currently has a multitude of applications across various domains, spanning industrial environments to daily life. Potential fields of application encompass gaming, offering enhanced user interaction experiences, and extend to touchless control systems, which streamline device interaction without physical contact. In the field of accessibility, HGR holds promise for facilitating communication through sign language translation, thereby helping hearing-impaired people integrate more easily. Moreover, within professional environments, HGR can be used to control robotic systems or to assist in communicating with people with disabilities during medical tasks [5]. HGR also helps people with disabilities, the elderly, and children access computers faster, more conveniently, and more enjoyably. In this era of computing, where devices are increasingly integrated into our daily lives, the ability to interact with technology effortlessly and intuitively has become essential. Hand gesture recognition stands at the forefront of this revolution, promising to redefine the way we interact with computers, devices, and the digital world at large.
<b>1.2. Related work </b>
Over decades of study in vision-based hand gesture recognition, early efforts achieved good outcomes by leveraging handcrafted image characteristics. For instance, color and edge details are two of the most frequently used attributes for identifying and discerning specific gestures. Color detection principally targets isolating the skin tones of the hand from the environment. Edge finding facilitates extraction of the hand region, as the figure of interest, from the surroundings based on discontinuities in pixel intensities. Method [6] processed webcam images acquired through a low-cost camera in a multistep process. First, Jayashree et al. applied a gray-threshold technique together with median and Gaussian filters to prune noise and transform the original RGB frames into denoised binary images. Following this preprocessing stage, a Sobel edge detection algorithm was used to extract the region of interest. Finally, using Euclidean distance as a feature-matching measure, the authors quantified the similarity between the centroids and areas of the extracted edges in the test versus training set. The proposed method was evaluated on a dataset consisting of the American Sign Language (ASL) alphabet, which contained 26 static hand gestures corresponding to each letter from A to Z. On this test set, the method achieved a recognition rate of 90.19%. However, following the success of state-of-the-art deep learning models in image-related tasks, the domain of image processing began to take advantage of deep learning [7] [8] [9] [10] [11]. Rather than completely eliminating traditional vision techniques, a hybrid approach used a Dual-Channel Convolutional Neural Network (DC-CNN), fusing the hand gesture images and hand edge images after preprocessing with Canny edge detection [12]. The outputs are classified with a SoftMax classifier, and each of the two CNN channels has its own weights. The proposed system's recognition rate is 98.02%. However, the performance of these methods is limited by how well the handcrafted features represent the characteristics of the hand. Researchers recognized the superior representational power of learned features over hand-engineered extraction and progressively transitioned toward end-to-end learning drawn directly from pixels. Hussain et al. [13] used two CNNs in parallel, each a state-of-the-art architecture, Inception V3 and EfficientNet B0, that had achieved notable performance on various image-related tasks.
Both models were trained on the same RGB images of recorded hand gestures and were evaluated on the ASL dataset, yielding accuracies of 90% and 99%, respectively. Improvements in sensor technologies bring new approaches that leverage depth image data captured by devices such as Kinect and Intel RealSense. The method in [14] uses two VGG19 networks with the same architecture but different input types. Specifically, VGG19-v1 was fed the RGB images to extract skin-color maps, while VGG19-v2 took the depth images as input to learn depth-based information. By combining the two streams of information, the authors achieved classification accuracy as high as 94.8% on the ASL dataset. More advanced deep learning models aim to combine multi-scale or multi-level features to enhance network learnability. Method [15] uses an innovative two-stage approach. First, it applies a lightweight encoder-decoder network to transform the RGB images into segmented images. This lightweight structure is based on dilated residual blocks and uses atrous spatial pyramid pooling as a multi-scale extractor. After obtaining the desired segmentation, a double-channel CNN is fed the input RGB images and the corresponding segmented images to learn the necessary features separately. This method achieved 91.17% on the OUHANDS dataset with a model size of only 1.8 MB. To accomplish this, the model takes advantage of depthwise separable convolution (DSC) layers in building the encoder-decoder and CNN structures. More recent methods recast the problem from classification to detection to achieve better results while still maintaining a compact structure. Works such as [16] and [17] have improved the YOLO architecture by replacing the original backbone and neck components with lightweight modules. In particular, the approach described in [16], which uses ShuffleNetV2 as the backbone in YOLOv3, achieved impressive results on two challenging datasets with complex backgrounds, reaching 99.5% on the senz3D dataset and 99.6% on the Microsoft Kinect dataset. Significantly, the model size was only 8.9 MB, compared to the 123.5 MB of the original YOLOv3 network. Although many studies have been carried out, they focus heavily on improving model accuracy and pay little attention to computational cost.
This poses challenges for executing such models on low-cost, resource-constrained hardware such as microcontrollers, whose limited memory capacity and computational speed result in significant inference-time delays.
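The classical pipeline of [6] described above (gray thresholding, median filtering, Sobel edge extraction, then Euclidean-distance matching of centroid and area features) can be sketched in plain NumPy. This is an illustrative reconstruction, not the authors' code; the function names, the threshold value, and the 3x3 filter sizes are our own assumptions.

```python
import numpy as np

def binarize(gray, thresh=128.0):
    # Gray-threshold step: separate the (bright) hand from the background.
    return (gray > thresh).astype(np.float64)

def median3(img):
    # 3x3 median filter to prune salt-and-pepper noise (borders left as-is).
    out = img.copy()
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = np.median(img[i - 1:i + 2, j - 1:j + 2])
    return out

def sobel_magnitude(img):
    # Gradient magnitude from the two 3x3 Sobel kernels (valid region only).
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

def shape_features(mask):
    # Centroid (row, col) and area of the foreground, used for matching.
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean(), mask.sum()])

def feature_distance(f_test, f_train):
    # Euclidean distance between test and training feature vectors.
    return float(np.linalg.norm(f_test - f_train))
```

A production version would use OpenCV (`cv2.medianBlur`, `cv2.Sobel`) for speed; the explicit loops are kept here only for clarity.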
This surge in adoption opens endless possibilities for various applications, such as smart manufacturing, personalized healthcare, precision agriculture, automated retail, and UAV applications, among others. The appeal of these low-cost and energy-efficient microcontrollers lies in their potential to facilitate a new frontier in technology known as tiny machine learning (TinyML) [19]. By deploying deep learning models directly on these compact devices, data analytics can be performed near the sensor, leading to a substantial expansion in the realm of Artificial Intelligence (AI) applications. Utilizing deep learning models on microcontrollers allows for localized intelligent tasks, leading to improved performance, privacy, security, and energy efficiency. The integration of deep learning on such tiny platforms presents an exciting opportunity to revolutionize the field of AI and further democratize its capabilities. Nevertheless, integrating deep learning models with microcontrollers poses significant challenges due to the constrained resources available on these devices [20]. Limited memory capacity restricts the size of the model that can be deployed, while processing-power limitations affect the speed and efficiency of model execution. Additionally, the limited battery life of microcontrollers necessitates energy-efficient algorithms to ensure prolonged operation without rapidly draining the power source. Overcoming these challenges is crucial to fully leverage the potential of deep learning on microcontrollers and enable the deployment of intelligent applications in resource-constrained environments.
The widespread adoption of microcontrollers offers an opportunity to deploy HGR systems on these hardware platforms, thereby enhancing flexibility and expanding the application scope of such systems. Furthermore, leveraging resource-constrained devices enables improved optimization of energy consumption, construction, and operational costs.
<b>1.4. Problems and Research methods </b>
Considering the above limitations, this research proposes a micro convolutional neural network (micro-CNN) architecture based on TinyML to identify the morphology of hand gestures. From this premise, we have developed a comprehensive HGR system tailored for controlling a non-destructive testing (NDT) system to conduct basic experiments. This marks the first step in investigating the practicality of the method in real-world environments. Our focus lies in constructing a lightweight deep learning model adept at recognizing gestures from RGB images captured by conventional cameras. By achieving accurate results from a single input frame, our model significantly improves inference-time efficiency. The qualified model is integrated into an ARM Cortex-M7 processor used on the OpenMV H7 platform. Additionally, we established a framework facilitating motor control through LabVIEW software. Figure 1.2 illustrates the workflow of this study. Furthermore, our approach is evaluated and compared with state-of-the-art models presented in prior research.
<b>1.5. Article structure </b>
The remaining sections of this document are organized as follows: Chapter II describes our proposed method in detail, encompassing data preprocessing, augmentation, and the proposed model architecture. Chapter III provides a comprehensive evaluation of the results, including in-depth analysis and comparison against preceding methods; it also describes benchmarking of the model's performance on a variety of microcontrollers from the STM32 family. The application of this proposal to build a practical system is presented in Chapter IV, which also details the process of building a framework integrated into the NDT system. Further experimental results and a demo are discussed in Chapter V. The conclusions are summarized in the final Chapter VI.
<i><b>Figure 1.2 Workflow of the entire research for hand gesture recognition (HGR). </b></i>
<i>Part 1 involves the process of building a Deep Learning model. After obtaining the model and successfully deploying it on OpenMV, part 2 describes the process </i>
<i>of building a control system to integrate the HGR model into real applications. </i>
The main purpose of this research is to develop a compact HGR system tailored for microcontrollers with constrained resources. Considering the limitations mentioned above, our focus lies in proposing a lightweight CNN architecture that satisfies the requirements of model size, inference time, and computational cost. Additionally, we explored optimization techniques aimed at compressing model size before implementation on these microcontrollers. This outlines the major contributions of this research. The architecture is based on MobileNetV2 [21] and MobileNetV3 [22], with 2D depthwise separable convolution (DSC), bottleneck, and Squeeze-and-Excitation blocks [23]. Our proposal serves as a pixel-based feature extractor that captures spatial features in the image. After obtaining high-level features using our architecture as the backbone, we integrate a classifier or detector as the top module, depending on the intended use. The proposed method exhibits competitive results against state-of-the-art models when evaluated on two different datasets, the American Sign Language (ASL) [24] and OUHANDS [25] datasets. Figure 2.1 illustrates the block diagram of the process involved in building the proposed method. Before performing model training, we use
data augmentation techniques to enhance training-data diversity and then normalize the entire dataset to ensure the pixel values of images are within a consistent range. We train the model using the classification module with the ASL dataset, whose substantial volume facilitates enhanced feature learning of hand gestures. Additionally, we employ transfer learning by fine-tuning the obtained pre-trained model on the OUHANDS dataset, utilizing a detector module as the top layers. This approach enables the model to benefit from prior learning and adapt more effectively to the nuances of hand gesture recognition tasks. After training the model, a quantization algorithm is applied using the TFLite framework. This algorithm reduces the floating-point TensorFlow model to 8-bit precision, making it compatible with embedded hardware that supports only 8-bit computations. A final evaluation is performed on several STM32 microcontrollers to investigate inference time. Upon successful completion of the assessments, the validated model is integrated onto the OpenMV platform, enabling its deployment to address real-world challenges.
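The TFLite quantization step mentioned above maps floating-point tensors to 8-bit integers with an affine scheme, q = round(x/scale) + zero_point. The converter handles this automatically; the sketch below reproduces only the per-tensor arithmetic so the size/precision trade-off is visible (TFLite additionally uses per-axis scales for weights, which this simplified sketch omits):

```python
import numpy as np

QMIN, QMAX = -128, 127  # int8 range used by TFLite full-integer models

def quantize_params(x):
    # Per-tensor affine parameters covering the observed range; the range
    # must include 0 so that zero-padding stays exact after quantization.
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (hi - lo) / (QMAX - QMIN)
    scale = scale if scale > 0 else 1e-8  # guard for constant tensors
    zero_point = int(round(QMIN - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    # float32 -> int8: q = clip(round(x / scale) + zero_point).
    q = np.round(x / scale) + zero_point
    return np.clip(q, QMIN, QMAX).astype(np.int8)

def dequantize(q, scale, zero_point):
    # int8 -> approximate float: x ~ scale * (q - zero_point).
    return scale * (q.astype(np.float64) - zero_point)
```

The round-trip error is bounded by roughly one quantization step (the scale), which is why accuracy usually drops only slightly after int8 conversion.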
<i>Figure 2.1 Block diagram of proposed method for hand gesture recognition.</i>
<b>2.1. Dataset preparation </b>
2.1.1. Data augmentation
Data augmentation is a critical technique in training deep learning models, crucial for addressing limited datasets and enhancing model generalization. By artificially expanding the dataset through transformations such as rotation, flipping, and adding noise, data augmentation exposes the model to a broader range of input variations, mitigating overfitting and improving robustness. This is particularly important in scenarios where the amount of original training data is limited or when the model needs to be invariant to certain transformations. Additionally, data augmentation helps models adapt to diverse environmental conditions and input variations encountered in real-world settings, enhancing their reliability and performance [26]. Integrating data augmentation into the training pipeline is essential for developing highly effective and generalized models across various domains and applications.
Image augmentation leverages fundamental principles of image processing to enhance the training dataset for deep learning models. An object may be photographed from different angles and distances, appear in various sizes, or be partially obscured. Scale invariance, where models should recognize objects regardless of size, and viewpoint invariance, which allows models to recognize objects from different angles, are the key principles behind geometric transformations. By simulating real-world variations during training, geometric transformations like rotation, scaling, and translation help the model not only recognize the object regardless of its orientation or scale but also understand the underlying structure of the object. This adaptability is crucial for the model to accurately identify objects across diverse visual scenarios and is especially beneficial for tasks such as object detection and classification, where perspective and scale have a significant effect [27]. Color invariance plays a pivotal role in enhancing model performance by allowing the model to focus on structural and texture features rather than color, which can be highly variable due to lighting conditions or camera settings. This principle is particularly important for tasks like object
recognition, where the shape and texture of an object are more defining characteristics. Implementing color space transformations, such as adjusting brightness, contrast, and saturation, improves model accuracy and reliability in practical applications [28].
In the field of HGR, the accurate recognition of distinct gestures depends heavily on the hand's shape and structural characteristics. Given the variability inherent in real-life scenarios, where gestures may be captured from diverse angles and positions within the frame, geometric transformations are essential to simulate these conditions within the dataset. These transformations, including translation, rotation, and random zoom, serve to augment the dataset, enabling the model to learn robust representations of hand gestures. Moreover, to reduce the model's sensitivity to variations in lighting conditions during usage, a random brightness adjustment was implemented, introducing a controlled brightness offset ranging from -0.1 to 0.1 and thereby enhancing the model's resilience to environmental factors and contributing to its overall robustness and generalization capabilities. Because the purpose is to help the model learn the necessary features, the augmentation techniques are applied only to the training set and not to the test set, as Figure 2.2 describes. Examples of images after the augmentation process are shown in Figure 2.3.
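The training-time transforms above (random zoom, translation, and a brightness offset in [-0.1, 0.1]) can be sketched in NumPy as below. This is a simplified stand-in for the actual augmentation pipeline, which would typically use Keras preprocessing layers; the shift range and zoom fraction are illustrative assumptions, and only the brightness limit comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_brightness(img, limit=0.1):
    # Add a brightness offset drawn from [-limit, limit] (limit from the text).
    delta = rng.uniform(-limit, limit)
    return np.clip(img + delta, 0.0, 1.0)

def random_translate(img, max_shift=4):
    # Shift the gesture within the frame, padding exposed borders with zeros.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    out[max(dy, 0):min(h, h + dy), max(dx, 0):min(w, w + dx)] = \
        img[max(-dy, 0):min(h, h - dy), max(-dx, 0):min(w, w - dx)]
    return out

def random_zoom(img, max_frac=0.2):
    # Crop a central region and resize back with nearest-neighbour sampling.
    h, w = img.shape[:2]
    f = rng.uniform(0.0, max_frac)
    ch, cw = int(h * f / 2), int(w * f / 2)
    crop = img[ch:h - ch, cw:w - cw]
    yi = np.linspace(0, crop.shape[0] - 1, h).astype(int)
    xi = np.linspace(0, crop.shape[1] - 1, w).astype(int)
    return crop[np.ix_(yi, xi)]

def augment(img):
    # Compose the three transforms, applied to training images only.
    return random_brightness(random_translate(random_zoom(img)))
```

Each call draws fresh random parameters, so repeated passes over the training set see different variants of the same image.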
<i>Figure 2.2 Data augmentation processing. </i>
<i>Figure 2.3 Some samples after data augmentation with Random Zoom, Random Rotation, Random Brightness </i>
2.1.2. Normalization
Normalizing an image refers to the process of adjusting the pixel values to conform to a standardized scale or distribution. This typically involves scaling the pixel values to have a mean of zero and a standard deviation of one, or normalizing them to a specific range, such as [0, 1] or [-1, 1]. This process aids convergence during optimization, as it minimizes issues related to features at different scales, thereby facilitating smoother convergence and preventing oscillation. Updating the weight matrix involves adding the gradient error vector, multiplied by a learning rate, via backpropagation; these adjustments are made continuously throughout training. If an input is not normalized before being fed into model training, a single learning rate can produce very different correction magnitudes across dimensions, potentially overcompensating in one dimension while undercompensating in another. Furthermore, normalization acts as a form of regularization by preventing models from being overly sensitive to certain features, thus helping to prevent overfitting and enhancing generalization performance.
Given an image 𝐼 represented as a matrix of pixel values, where 𝐼<sub>𝑖𝑗</sub> denotes the pixel value at row 𝑖 and column 𝑗, Min-Max scaling rescales each pixel value to the range [0, 1] as Equation 2.1 below:
𝐼<sub>𝑖𝑗</sub><sup>′</sup> = (𝐼<sub>𝑖𝑗</sub> − min(𝐼)) / (max(𝐼) − min(𝐼)) (2.1)
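Equation 2.1 can be applied directly to a pixel array; the minimal NumPy sketch below assumes an 8-bit grayscale input (values 0–255):

```python
import numpy as np

def min_max_normalize(image):
    """Rescale pixel values to [0, 1] per Equation 2.1: (I - min(I)) / (max(I) - min(I))."""
    i_min, i_max = image.min(), image.max()
    return (image - i_min) / (i_max - i_min)

pixels = np.array([[0, 64], [128, 255]], dtype=np.float32)
normalized = min_max_normalize(pixels)
# the smallest pixel maps to 0.0, the largest to 1.0, the rest in between
```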
<b>2.2. Proposed architecture </b>
2.2.1. Theoretical basis
2.2.1.1. Machine Learning, Deep Learning & TinyML
Machine Learning (ML) represents a transformative paradigm within the field of artificial intelligence, revolutionizing the way computers learn from data and make decisions. At its core, ML encompasses a diverse set of algorithms and techniques that enable computers to learn patterns and relationships from data without being explicitly programmed. These algorithms iteratively improve their performance over time, allowing the development of predictive models and systems that can make decisions and perform tasks autonomously. Deep Learning (DL), a subset of ML, has emerged as a particularly powerful approach, leveraging artificial neural networks with multiple layers of abstraction to extract intricate features and representations from raw data. Unlike traditional ML methods, DL excels at handling large-scale, high-dimensional datasets, enabling the development of sophisticated models capable of learning complex patterns and relationships directly from raw input data. These raw inputs are passed through a series of learnable convolutional and fully connected layers that update their weights, without any hand-crafted feature-extraction step, as shown in Figure 2.4.
These traditional Deep Learning models focus heavily on robustness but rarely emphasize optimizing computational cost, which makes them infeasible to deploy on edge devices such as mobile phones or microcontrollers. With the current popularity of IoT devices, microcontrollers play an important role in operating many systems around the world. To address the limitations of deep learning, TinyML has emerged as a promising field focusing on techniques that help deploy models to devices with limited hardware. Lightweight models are embedded directly on these small devices, such as sensors and actuators, to perform specific tasks. Furthermore, the adoption of TinyML techniques brings significant advantages in data-transmission efficiency. Rather than transmitting all raw data to the central processing unit, terminals can transmit condensed, processed information, reducing the burden on network bandwidth and enhancing overall system scalability. Besides, by processing data locally before transmission, the privacy of raw data can be effectively protected. Figure 2.5 depicts a comparison between systems with and without TinyML. Using TinyML, data is processed locally before seeking help from edge AI or cloud AI [29].
<i><b>Figure II.4 Basic Convolutional Neural Network.</b> A typical deep learning model takes an input and transforms it into corresponding outputs. Raw image data can be fed into the model without going through any traditional feature-extraction techniques [30].</i>
<i><b>Figure II.5 Comparison between systems with TinyML (a) and without TinyML (b).</b> TinyML is embedded directly on the microprocessors on the sensor to process data before seeking help from edge AI or cloud AI [29].</i>
2.2.1.2. Standard convolutional layer
Kernel convolution is a fundamental operation utilized not only in Convolutional Neural Networks (CNNs) but also in various other Computer Vision algorithms. It involves applying a small matrix of numbers, referred to as a learnable kernel or filter, over an image and transforming it based on the values within the filter. These kernels, typically small in spatial dimensionality, extend across the entirety of the input depth. During the traversal of data through a convolutional layer, each filter convolves across the spatial dimensions of the input, generating a 2D activation map. This process enables the network to use kernels that trigger upon detecting specific features at distinct spatial positions within the input. These 2D activation maps, also known as feature maps, can be calculated according to the following Equation 2.2:
𝐹<sub>𝑚,𝑛</sub> = (𝑓 ∗ 𝑘)<sub>𝑚,𝑛</sub> = ∑<sub>𝑗</sub> ∑<sub>ℎ</sub> 𝑘<sub>(𝑗,ℎ)</sub> 𝑓<sub>(𝑚−𝑗,𝑛−ℎ)</sub> (2.2)
The resulting feature map is denoted by 𝐹 ∈ ℝ<small>𝑚×𝑛</small> and the input image by 𝑓, where 𝑗 and ℎ index the rows and columns of the kernel. The learnable kernel 𝑘, of size 𝑘 × 𝑘, slides in turn along the height and width of the input, extracts a submatrix of the same size at each position, multiplies the elements position by position, and sums them all to form one element of the result matrix. The whole process is described in Figure 2.6.
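A naive sketch of the feature-map computation in Equation 2.2 (stride 1, no padding, and without the kernel flip, i.e., the cross-correlation form most CNN frameworks actually compute) might look like:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and sum the elementwise products at each
    position, producing one element of the feature map per position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1   # "valid" output size
    out = np.zeros((oh, ow))
    for m in range(oh):
        for n in range(ow):
            out[m, n] = np.sum(image[m:m + kh, n:n + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])   # toy vertical-edge filter
fmap = conv2d_valid(img, edge_kernel)   # shape (3, 3)
```

Real frameworks replace these Python loops with highly optimized vectorized kernels, but the arithmetic per output element is exactly this multiply-and-sum.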
For inputs with multiple channels, such as RGB images, the kernels must have the same number of channels as the input. This principle, known as convolution over depth, is what allows convolution to work with color images and to use multiple filters in the same layer. When multiple filters are used for a single input, the convolution operation is performed separately for each filter, and the resulting feature maps are concatenated to form the output. For an input 𝑋 ∈ ℝ<small>ℎ×ℎ×𝑐</small>, implementing convolution with kernel size 𝑘 × 𝑘 × 𝑐 using 𝑛<sub>𝑐</sub> filters, we obtain the result as Equation 2.3:
[ℎ, ℎ, 𝑐] ∗ [𝑘, 𝑘, 𝑐] ∗ 𝑛<sub>𝑐</sub> = [(ℎ + 2𝑝 − 𝑘)/𝑠 + 1, (ℎ + 2𝑝 − 𝑘)/𝑠 + 1, 𝑛<sub>𝑐</sub>] (2.3)
where 𝑝 and 𝑠 denote the padding and stride used in this convolution operation, respectively.
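The spatial-size rule in Equation 2.3 is easy to check numerically; the helper below uses integer division, and the layer settings (96×96 input, 3×3 kernel, padding 1, stride 2) are illustrative values, not parameters of the proposed model:

```python
def conv_output_size(h, k, p, s):
    """Spatial output size of a convolution per Equation 2.3: (h + 2p - k)/s + 1."""
    return (h + 2 * p - k) // s + 1

# a 96x96 input with a 3x3 kernel, padding 1, stride 2 halves the resolution
size = conv_output_size(96, k=3, p=1, s=2)
```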
The forward propagation process includes the calculation of an intermediate matrix 𝑍 and the application of a non-linear activation function 𝛿. The model learns by adjusting its parameters, the weights 𝑊 and bias 𝑏, to produce the appropriate output. This process is described in Equation 2.4:
𝑍<small>[𝑙]</small> = 𝑊<small>[𝑙]</small> ∗ 𝐴<small>[𝑙−1]</small> + 𝑏<small>[𝑙]</small>, 𝐴<small>[𝑙]</small> = 𝛿(𝑍<small>[𝑙]</small>) (2.4)
where 𝐴<small>[𝑙−1]</small> is the activation map obtained from layer 𝑙 − 1 and used as input to layer 𝑙.
<i><b>Figure II.6 Kernel convolution operation.</b> The process of multiplying the kernel with each submatrix taken from the input, in turn along the height and width of the input.</i>
Backpropagation, short for "backward propagation of errors," is a fundamental process used in training CNN models specifically and deep neural networks in general. It optimizes the model’s parameters by iteratively adjusting them to minimize the difference, called the error, between the predicted output and the actual target output. In the backward pass, the error signal is propagated backward through the network, layer by layer, using the chain rule of calculus. At each layer, the algorithm computes the gradient of the loss function with respect to the parameters of that layer. This gradient indicates the direction and magnitude of the adjustment needed to minimize the error. For instance, for a kernel 𝑘 ∈ ℝ<small>𝑘×𝑘</small> with weight matrix 𝜔 ∈ ℝ<small>𝑘×𝑘</small>, the update process is described in Figure 2.7 and Equation 2.5:
𝜔<sub>𝑖</sub> = 𝜔<sub>𝑖</sub> − 𝛼 × ∂𝐿/∂𝜔<sub>𝑖</sub> (2.5.1)
∂𝐿/∂𝜔<sub>𝑖</sub> = ∑<sub>𝑗</sub> (∂𝐿/∂𝑧<sub>𝑗</sub>) × (∂𝑧<sub>𝑗</sub>/∂𝜔<sub>𝑖</sub>) (2.5.2)
∂𝐿/∂𝑍 = (∂𝐿/∂𝑦) × 𝛿′(𝑍) (2.5.3)
Where 𝛼 is learning rate, 𝑦 is predict result, and 𝛿′ is the derivative of the activation function 𝛿.
<i><b>Figure II.7 Forward propagation and backward propagation of the convolution operation.</b> Both processes follow the chain rule.</i>
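The update rule of Equation 2.5 is easiest to see on a toy one-weight model rather than a full convolution; the sketch below is a hypothetical single linear unit with a squared-error loss, applying the same gradient-descent step:

```python
# Toy sketch of the update rule w_i <- w_i - alpha * dL/dw_i (Equation 2.5)
# for one linear unit z = w * x with squared-error loss L = (z - y)^2 / 2.
# The chain rule gives dL/dw = dL/dz * dz/dw = (z - y) * x.

def sgd_step(w, x, y, alpha):
    z = w * x                  # forward pass
    grad = (z - y) * x         # backward pass via the chain rule
    return w - alpha * grad    # gradient-descent update

w = 0.0
for _ in range(100):           # repeated updates shrink the error
    w = sgd_step(w, x=2.0, y=4.0, alpha=0.1)
# w converges toward 2.0, the weight that maps x=2 to the target y=4
```

In a convolutional layer the same step is applied to every kernel weight, with the gradient summed over all the spatial positions the weight touched during the forward pass.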
2.2.1.3. Depthwise separable convolution
First introduced in the MobileNet paper [9], depthwise separable convolution (DSC) has become an essential component of many lightweight model architectures [10] [31]. The standard convolutional layer filters each input channel and combines all filters in a single step. In contrast, DSC splits this process into two distinct layers that operate consecutively. First, a depthwise convolutional layer applies an individual filter to each input channel. Then a convolutional layer with a 1×1 kernel, called pointwise convolution, combines the input channels to generate fresh feature maps. The process is illustrated in Figure 2.8.
Given an input 𝐼 of size 𝐻 × 𝑊 × 𝐶, the standard convolutional layer uses a kernel 𝐾 of size 𝐾 × 𝐾 × 𝐶 × 𝑁<sub>𝑐</sub> to produce a feature map 𝑂 of size 𝐻 × 𝑊 × 𝑁<sub>𝑐</sub>. Assuming a stride of one and padding such that the output has the same spatial size as the input, the output feature map can be computed as Equation 2.6:
𝑂<sub>𝑥,𝑦,𝑛</sub> = ∑<sub>𝑖,𝑗,𝑐</sub> 𝐾<sub>𝑖,𝑗,𝑐,𝑛</sub> × 𝐼<sub>𝑥+𝑖,𝑦+𝑗,𝑐</sub> (2.6)
where 𝐻 and 𝑊 are the spatial height and width of the input feature map, respectively, and 𝐶 is the number of input channels. For the output feature map, 𝑁<sub>𝑐</sub> is the depth of the output, which also corresponds to the number of filters used in this convolutional layer, and 𝐾 × 𝐾 are the horizontal and vertical dimensions of a square kernel. From there we can calculate the computational cost by multiplying these parameters together, as Equation 2.7 below:
𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾 × 𝑁<sub>𝑐</sub> (2.7)
Meanwhile, with the same input 𝐼 as above, the DSC layer is divided into two separate tasks. First, the depthwise convolution layer uses 𝐶 filters with kernels 𝐾̂ ∈ ℝ<sup>𝐾×𝐾</sup>, one per input channel. The 𝑐-th filter is applied to the 𝑐-th channel of the input to produce the 𝑐-th channel of the filtered output feature map 𝑂̂, as in Equation 2.8:
𝑂̂<sub>𝑥,𝑦,𝑐</sub> = ∑<sub>𝑖,𝑗</sub> 𝐾̂<sub>𝑖,𝑗,𝑐</sub> × 𝐼<sub>𝑥+𝑖,𝑦+𝑗,𝑐</sub> (2.8)
Each filter works separately on its channel to create a new feature. This makes the computational cost 𝑁<sub>𝑐</sub> times smaller than that of the standard convolution; it can be calculated as Equation 2.9:
𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾 (2.9)
After that, the pointwise operator is applied to the feature maps obtained by the depthwise layer, combining across all filters to generate the final output features. Therefore, the computational cost of the entire DSC is given by Equation 2.10:
𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾 + 𝐶 × 𝑁<sub>𝑐</sub> × 𝐻 × 𝑊 (2.10)
Comparing the two types of layers, we obtain the reduction in computational cost as Equation 2.11:
(𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾 + 𝐶 × 𝑁<sub>𝑐</sub> × 𝐻 × 𝑊) / (𝐻 × 𝑊 × 𝐶 × 𝐾 × 𝐾 × 𝑁<sub>𝑐</sub>) = 1/𝑁<sub>𝑐</sub> + 1/𝐾<sup>2</sup> (2.11)
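The cost formulas in Equations 2.7, 2.10, and 2.11 can be verified numerically; the sketch below uses an illustrative 112×112×32 feature map with 3×3 kernels and 64 output channels (example values, not a layer of the proposed model):

```python
def standard_conv_cost(h, w, c, k, n_c):
    """Multiply-accumulate count of a standard convolution (Equation 2.7)."""
    return h * w * c * k * k * n_c

def dsc_cost(h, w, c, k, n_c):
    """Depthwise cost plus pointwise cost (Equation 2.10)."""
    return h * w * c * k * k + c * n_c * h * w

std = standard_conv_cost(112, 112, 32, 3, 64)
dsc = dsc_cost(112, 112, 32, 3, 64)
reduction = dsc / std   # equals 1/n_c + 1/k^2 per Equation 2.11
```

With these values the DSC layer needs roughly 1/64 + 1/9 ≈ 12.7% of the multiply-accumulates of the standard layer, which is the reduction that makes MobileNet-style architectures attractive for microcontrollers.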
2.2.1.4. Pooling layer
The pooling layer, also known as a down-sampling layer, is an essential component of convolutional neural networks (CNNs) used in Deep Learning. It is responsible for reducing the spatial dimensions of the input data, in terms of width and height, while retaining the most important information.
Pooling layers divide the input data into small regions, called pooling windows or receptive fields, and perform an aggregation, such as taking the maximum or average value of each window. This aggregation reduces the size of the feature maps, resulting in a compressed representation of the input data.
The process of pooling layers involves the following three steps:
1. Divide the input data into non-overlapping regions or windows.
2. Apply an aggregation function, such as max pooling or average pooling, to each window to obtain a single value.
3. Combine the values obtained from each window to create a down-sampled representation of the input data.
For a feature map of dimensions 𝑛<sub>ℎ</sub> × 𝑛<sub>𝑤</sub> × 𝑛<sub>𝑐</sub>, the output obtained after a pooling layer has dimensions ⌊(𝑛<sub>ℎ</sub> − 𝑓)/𝑠⌋ + 1 × ⌊(𝑛<sub>𝑤</sub> − 𝑓)/𝑠⌋ + 1 × 𝑛<sub>𝑐</sub>, where 𝑛<sub>ℎ</sub> and 𝑛<sub>𝑤</sub> are the height and width of the feature map, 𝑛<sub>𝑐</sub> is its number of channels, 𝑓 is the filter size, and 𝑠 is the stride used in this pooling layer.
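The pooling steps and the output-size rule above can be sketched for max pooling as follows (the 2×2 window and stride 2 are illustrative defaults):

```python
import numpy as np

def max_pool2d(fmap, f=2, s=2):
    """Max pooling: take the maximum of each f x f window with stride s."""
    h, w = fmap.shape
    oh, ow = (h - f) // s + 1, (w - f) // s + 1   # output-size rule
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = fmap[i * s:i * s + f, j * s:j * s + f].max()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [3., 2., 1., 0.],
                 [1., 2., 3., 4.]])
pooled = max_pool2d(fmap)   # each 2x2 window is summarized by its maximum
```

Swapping `.max()` for `.mean()` or `.min()` yields average or min pooling; reducing each whole channel to one value gives the global variants described below.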
<i>Figure II.8 Operation of a standard convolutional layer (a) replaced by depthwise separable convolution with two separate layers: a depthwise layer (b) and a pointwise layer (c) [9]</i>
There are several types of pooling layers: max pooling, min pooling, average pooling, and global pooling. The features in a region are summarized by the maximum/minimum/average value of that region. Max/min/average pooling smooths the harsh edges of a picture and is used when such edges are not important.
With global pooling, each channel in the feature map is reduced to just one value. The value depends on the type of global pooling, which can be any one of the previously explained types.
2.2.1.5. Batch Normalization
Batch normalization is a technique used to improve the performance and stability of a deep learning network by normalizing layer activations: subtracting the batch mean and then dividing by the batch standard deviation.
During training, the activations of a layer are normalized for each mini-batch of data using Equation 2.12:
- Batch mean: 𝜇<sub>𝐵</sub> = (1/𝑚) ∑<sup>𝑚</sup><sub>𝑖=1</sub> 𝑥<sub>𝑖</sub> (2.12.1)
- Batch variance: 𝜎<sub>𝐵</sub><sup>2</sup> = (1/𝑚) ∑<sup>𝑚</sup><sub>𝑖=1</sub> (𝑥<sub>𝑖</sub> − 𝜇<sub>𝐵</sub>)<sup>2</sup> (2.12.2)
- Normalized activations: 𝑥̂<sub>𝑖</sub> = (𝑥<sub>𝑖</sub> − 𝜇<sub>𝐵</sub>) / √(𝜎<sub>𝐵</sub><sup>2</sup> + 𝜖) (2.12.3)
- Scale and shift: 𝑦<sub>𝑖</sub> = 𝛾𝑥̂<sub>𝑖</sub> + 𝛽 (2.12.4)
where 𝑚 is the mini-batch size, 𝜖 is a small constant for numerical stability, and 𝛾 and 𝛽 are learnable scale and shift parameters. At inference, the normalization is folded into a single linear transform using the population statistics, as Equation 2.13:
𝑦 = (𝛾 / √(Var[𝑥] + 𝜖)) × 𝑥 + (𝛽 − 𝛾E[𝑥] / √(Var[𝑥] + 𝜖)) (2.13)
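A minimal NumPy sketch of the training-time computation in Equation 2.12, with 𝛾 and 𝛽 fixed at their default values (1 and 0) rather than learned:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization over a mini-batch (Equation 2.12):
    normalize each feature to zero mean / unit variance, then scale and shift."""
    mu = x.mean(axis=0)                       # batch mean (2.12.1)
    var = x.var(axis=0)                       # batch variance (2.12.2)
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalized activations (2.12.3)
    return gamma * x_hat + beta               # scale and shift (2.12.4)

batch = np.array([[1.0, 10.0],
                  [3.0, 30.0],
                  [5.0, 50.0]])   # 3 samples, 2 features, very different scales
normalized = batch_norm(batch)
# each column now has (approximately) zero mean and unit variance
```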
2.2.1.6. Activation
<b>H-swish</b>, also known as the hard swish activation function, is based on Swish but replaces the computationally expensive sigmoid with a piecewise linear analogue. H-swish takes one input tensor and produces an output tensor where the hard version of the swish function is applied element-wise. It is defined as Equation 2.14:
hswish(𝑥<sub>𝑖</sub>) = 𝑥<sub>𝑖</sub> × ReLU6(𝑥<sub>𝑖</sub> + 3) / 6 (2.14)
where 𝑥<sub>𝑖</sub> is the 𝑖-th slice in the given dimension of the input Tensor.
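H-swish from Equation 2.14 can be sketched directly in NumPy; note that only clipping and multiplication are needed, with no sigmoid evaluation:

```python
import numpy as np

def relu6(x):
    """ReLU6: clip activations to the range [0, 6]."""
    return np.minimum(np.maximum(x, 0.0), 6.0)

def hswish(x):
    """Hard swish (Equation 2.14): x * ReLU6(x + 3) / 6, applied element-wise."""
    return x * relu6(x + 3.0) / 6.0

x = np.array([-4.0, -3.0, 0.0, 3.0, 6.0])
y = hswish(x)
# hswish is 0 for x <= -3 and matches the identity once ReLU6 saturates:
# here y == [0, 0, 0, 3, 6]
```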
<b>SoftMax</b> is an activation function that transforms the raw outputs of the neural network into a vector of probabilities, essentially a probability distribution over the input classes. The SoftMax function is given by Equation 2.15:
softmax(𝑧<sub>𝑖</sub>) = 𝑒<sup>𝑧<sub>𝑖</sub></sup> / ∑<sub>𝑗</sub> 𝑒<sup>𝑧<sub>𝑗</sub></sup> (2.15)
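A sketch of Equation 2.15; subtracting max(𝑧) before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """SoftMax (Equation 2.15): exponentiate and normalize so outputs sum to 1."""
    e = np.exp(z - np.max(z))   # shift by max(z) to avoid overflow
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw network outputs for 3 classes
probs = softmax(logits)              # a probability distribution
# probs sums to 1 and the largest logit receives the largest probability
```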
2.2.1.7. Squeeze and Excitation
The Squeeze-and-Excitation Block is an architectural block that allows a network to dynamically undertake channel-wise feature recalibration, hence increasing its representational power [23]. The detailed process of SE block is shown in Figure 2.9.
Squeeze-and-Excitation Networks (SENets) introduce a building block for CNNs that improves channel interdependence at almost no computational cost. To achieve adaptive weighting, the SE block operates in three phases: the squeeze phase, the excitation phase, and the scale-and-combine phase.
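The three phases can be sketched as below; the two weight matrices stand in for the learned fully connected layers of the excitation step and are random placeholders here, not parameters of any trained SE block:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(fmap, w1, w2):
    """Squeeze-and-Excitation sketch for an (H, W, C) feature map:
    squeeze each channel to one value by global average pooling, excite
    through two FC layers (ReLU then sigmoid), and rescale the channels."""
    squeezed = fmap.mean(axis=(0, 1))           # squeeze: (C,) channel descriptor
    hidden = np.maximum(squeezed @ w1, 0.0)     # excitation: reduction FC + ReLU
    weights = sigmoid(hidden @ w2)              # per-channel weights in (0, 1)
    return fmap * weights                       # scale: reweight each channel

rng = np.random.default_rng(0)
fmap = rng.random((8, 8, 4))                    # dummy 8x8 map with 4 channels
out = se_block(fmap, rng.random((4, 2)), rng.random((2, 4)))
```

Because the sigmoid outputs lie in (0, 1), the block can only attenuate channels relative to the input, which is what lets the network emphasize informative channels at negligible extra cost.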