
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING
——————– * ———————

BACHELOR THESIS

Study and Improve Few-shot Learning
Techniques in Computer Vision Application
Major: Computer Engineering

Council: Computer Engineering 1
Supervisor: Dr. Le Thanh Sach
Dr. Nguyen Ho Man Rang
Reviewer: Dr. Nguyen Duc Dung
—o0o—
Student: Nguyen Duc Khoi (1752302)

HO CHI MINH CITY, 8/2021


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY: COMPUTER SCIENCE & ENGINEERING
DEPARTMENT: COMPUTER SCIENCE

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

GRADUATION THESIS ASSIGNMENT
Note: Students must attach this sheet to the first page of the thesis report.



Student ID: 1752302    Full name: NGUYEN DUC KHOI    Major: Computer Engineering
1. Thesis title:

EN: A study on few-shot learning for computer vision applications
VN: Research and improvement of learning techniques with few labeled samples for applications in computer vision
2. Tasks (requirements on content and initial data):
• Study deep learning and do a literature review of few-shot learning;
• Propose a learning technique for training deep models (in computer vision) with popular datasets on the Internet;
• Apply few-shot learning to an application in computer vision, from training and tuning to deploying the trained model on embedded systems supported by NVIDIA's technologies.
3. Thesis assignment date: 01/01/2021
4. Completion date: 01/08/2021
5. Supervisors:

Supervision assignment:

1) Le Thanh Sach

Co-supervisor __________________

2) Nguyen Ho Man Rang

Co-supervisor __________________


The content and requirements of the thesis have been approved by the Department.
Date ..... 2021
HEAD OF DEPARTMENT

MAIN SUPERVISOR

(Sign and write full name)

(Sign and write full name)

Le Thanh Sach
FOR THE FACULTY AND DEPARTMENT:
Reviewer (preliminary grading): ________________________
Unit: _______________________________________
Defense date: ___________________________________
Final grade: _________________________________
Thesis archive location: _____________________________


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
---------------------------August 9, 2021

THESIS DEFENSE GRADING SHEET
(For the supervisor)
1. Student: NGUYEN DUC KHOI    Student ID: 1752302    Major: Computer Engineering
2. Topic:
EN: A study on few-shot learning for computer vision applications
VN: Research and improvement of learning techniques with few labeled samples for applications in computer vision
3. Supervisor: Dr. Le Thanh Sach
4. Report overview:
Number of pages:
Number of chapters:
Number of tables:
Number of figures:
Number of references:
Computational software:
Artifacts (products):
5. Drawings overview:
- Number of drawings:
A1:
A2:
Other sizes:
- Hand-drawn:
Computer-drawn:
6. Main strengths of the thesis:
• The author masters the different techniques required for designing deep learning models, and for training, tuning, and deploying models to GPU cards with NVIDIA's technologies.
• The thesis consists of a science task and an engineering task related to deep learning, as follows:
(a) Science: improve a selected few-shot learning technique for computer vision. The author has proposed an idea based on episodic training and dense convolution. The proposed idea has been evaluated on popular datasets reserved for the research field and gains some improvements. The research results have been submitted to an international conference and await the reviewers' conclusions.
(b) Engineering: apply few-shot learning to train a model for a selected computer vision task and then deploy the trained model to an embedded GPU system. To this end, the author selected the application "drowsiness detection". He utilized few-shot learning to train YOLOv5 and then deployed the trained model to the NVIDIA Jetson TX2 successfully. The demo application can run and detect drowsiness live.
7. Main shortcomings of the thesis:
• The publication is not yet available at the time of the defense, as planned.
8. Recommendation: Eligible to defend ☑    Supplement before defending ☐    Not eligible to defend ☐
9. Three questions the student must answer before the Committee:
10. Overall assessment (in words: excellent, good, average): 10 (ten)

Signature (write full name)


Le Thanh Sach


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE & ENGINEERING

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
---------------------------August 1, 2021

THESIS DEFENSE GRADING SHEET

(For the supervisor/reviewer)
1. Student: Nguyễn Đức Khôi
Student ID: 1752302
Major: Computer Engineering
2. Topic: Research and Apply Few-shot Learning Techniques in Drowsiness Detection
3. Supervisor/reviewer: Nguyễn Đức Dũng
4. Report overview:
Number of pages:
Number of chapters:
Number of tables:
Number of figures:
Number of references:
Computational software:
Artifacts (products):
5. Drawings overview:
- Number of drawings:
A1:
A2:
Other sizes:
- Hand-drawn:
Computer-drawn:
6. Main strengths of the thesis:
The thesis focuses on detecting drowsiness from the human face using deep learning approaches. The team proposed using a ResNet block instead of the normal convolutional block in the YOLOv5 network to improve detection accuracy. The team also deployed this model to an embedded system (Jetson TX2) for real-time performance. The results show some improvement in detection accuracy.
7. Main shortcomings of the thesis:
The replacement of the ResNet block in the network has been utilized for a while, which makes this contribution a bit weak. The drowsiness detection problem, however, can be solved better by other vision techniques, which can be very fast and real-time. The choice of the current approach is rather biased and needs to be reconsidered in the future. The few-shot learning scheme is irrelevant to the main topic under discussion.
8. Recommendation: Eligible to defend ☐
Supplement before defending ☐
Not eligible to defend ☐
9. Three questions the student must answer before the Committee:
a. Why don't you use other vision algorithms to detect drowsiness, even if they would give much better performance compared to YOLO?
b. Explain why few-shot learning matters. The discussion needs to be improved.
c.
10. Overall assessment (in words: excellent, good, average): Excellent

Score: 9/10
Signature (write full name)

Nguyễn Đức Dũng


Declaration
We hereby declare that this thesis titled 'Research and Apply Few-shot Learning Techniques in Computer Vision Application' and the work presented in it are our own. We
confirm that:
• This work was done wholly or mainly while in candidature for a degree at this University.
• Where any part of this thesis has previously been submitted for a degree or any other
qualification at this University or any other institution, this has been clearly stated.
• Where we have quoted from the work of others, the source is always given. With the
exception of such quotations, this thesis is entirely our own work.
• We have acknowledged all main sources of help.
• Where the thesis is based on work done by ourselves jointly with others, we have
made clear exactly what was done by others and what we have contributed ourselves.



Acknowledgments
First and foremost, I am tremendously grateful for my advisers Dr. Sach Le Thanh and Dr. Rang
Nguyen Ho Man for their continuous support and guidance throughout my project, and for providing me
the freedom to work on a variety of problems. Second, I take this opportunity to express gratitude to all
of the Faculty of Computer Science and Engineering members for their help and support. I also thank
my parents for their unceasing encouragement, support, and attention.


Abstract
Artificial intelligence for driving is receiving more attention. Drowsiness detection is one of
the smaller tasks to improve the driving experience. A drowsiness detector can detect and warn
the drivers when they fall asleep and prevent accidents caused by drivers’ drowsiness. A simple
approach is to consider the drowsiness detection problem as an object detection problem. In
this thesis, we adopt a powerful object detector called YOLOv5. It is one of the most popular frameworks for object detection that was released to the public. In our experiments, the
YOLOv5 framework can achieve excellent detection performance with abundant supervised
data. In terms of speed performance, we deploy the trained model to the Jetson TX2 using
TensorRT, which significantly outperforms the released Pytorch implementation. In practice,
we are not always able to access an abundant amount of labeled data. The limited number of
training examples can lead to severely deficient performance, as shown in our experiments. We
propose to pretrain the model with other datasets to improve the overall performance without introducing any computational inference cost. We introduce a pretraining method from few-shot learning that achieves state-of-the-art results on widely used few-shot learning benchmarks, and use it to pretrain the model. We conduct extensive experiments with several pretraining methods to analyze their transfer performance to object detection tasks.


Contents

1 Introduction
    1.1 Motivation
    1.2 The Scope of the Thesis
    1.3 Organization of the Thesis

2 Foundations
    2.1 Probabilities and Statistic Basics
        2.1.1 Random Variables
        2.1.2 Probability Distributions
        2.1.3 Discrete Random Variables - Probability Mass Function
        2.1.4 Continuous Random Variables - Probability Density Function
        2.1.5 Marginal Probability
        2.1.6 Conditional Probability
        2.1.7 Expectation and Variance
        2.1.8 Sample
        2.1.9 Confidence Intervals
    2.2 Machine Learning Basics
        2.2.1 Supervised Learning
        2.2.2 Unsupervised Learning
        2.2.3 Semi-supervised Learning
    2.3 Few-shot Learning
    2.4 Object Detection

3 Related work
    3.1 Few-shot Learning
        3.1.1 Meta-Learning
        3.1.2 Metrics-Learning
        3.1.3 Boosting few-shot visual learning with self-supervision
    3.2 Object Detection
        3.2.1 Two-stage Detectors
        3.2.2 One-stage Detectors

4 Methods
    4.1 Problem formulation
    4.2 Bag of freebies
    4.3 A Strong Baseline for Few-Shot Learning
        4.3.1 Joint Training of Episodic and Standard Supervised Strategies
        4.3.2 Revisiting Pooling Layer
    4.4 YOLOv5
        4.4.1 YOLOv5 architecture
        4.4.2 ResNet-50-YOLOv5

5 Experiments
    5.1 Datasets
    5.2 Results of Training ResNet-50-YOLOv5 from Scratch with Abundant Annotations
        5.2.1 Implementation Details
        5.2.2 Quantitative Results
        5.2.3 Qualitative Results
    5.3 Performance of Deploying ResNet-50-YOLOv5 with TensorRT
        5.3.1 Comparison between TensorRT and Pytorch
        5.3.2 Effect of image resolution on performance
    5.4 Results of Baseline on Few-shot Benchmarks
        5.4.1 Implementation Details
        5.4.2 Results
    5.5 Results of Training ResNet-50-YOLOv5 with Limited Annotations
        5.5.1 Results of Training ResNet-50-YOLOv5 from Scratch with Limited Annotations
        5.5.2 Results of Training Pretrained ResNet-50-YOLOv5 with Limited Annotations

6 Conclusion

7 Appendix
    7.1 Network architecture terminology
    7.2 Jetson TX2
        7.2.1 Jetson TX2 Developer Kit
        7.2.2 JetPack SDK
    7.3 Tensor RT
        7.3.1 Developing and Deploying with Tensor RT


List of Tables

2.1 Confusion matrix
2.2 Quantities in Confusion Matrix of Testing for Coronavirus
5.1 Evaluation of training ResNet-50-YOLOv5 from scratch
5.2 Performance of deploying trained ResNet-50-YOLOv5 into Jetson TX2
5.3 Comparison to prior work on CIFAR-FS and FC100
5.4 Comparison with previous works on mini-ImageNet
5.5 Evaluation of training ResNet-50-YOLOv5 from scratch
5.6 Comparison between Our Baseline and Standard Supervised Training on mini-ImageNet benchmark
5.7 Performance of mini-ImageNet-pretrained ResNet-50-YOLOv5
5.8 Performance of ImageNet-pretrained ResNet-50-YOLOv5

List of Figures

1.1 Giraffe
2.1 The graph of the standard normal distribution
2.2 Mnist digit dataset
3.1 MAML algorithms
3.2 LEO algorithms
3.3 Meta-SGD algorithms
3.4 [9]
3.5 Relation Network
3.6 Attention-based weight generator
3.7 R-CNN
3.8 Fast R-CNN
3.9 Faster R-CNN
3.10 You only look once model
3.11 Single Shot MultiBox Detector
4.1 Problem formulation
4.2 Overall development of our system
4.3 a) Different kernel sizes of pooling layers applied to feature maps. b) Adapted pooling layer
4.4 YOLOv5 Architecture
4.5 resnet50xYOLOv5 architecture
5.1 Dataset samples
5.2 mini-ImageNet sample images
5.3 Qualitative results of training ResNet-50-YOLOv5 from scratch with abundant annotations
5.4 Performance on different image sizes
5.5 Evaluation on different kernel sizes of the last pooling layer
5.6 Precision of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.7 Recall of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.8 mAP@0.5 of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.9 mAP@0.5:0.95 of mini-ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.10 Precision of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.11 Recall of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.12 mAP@0.5 of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
5.13 mAP@0.5:0.95 of ImageNet-pretrained ResNet-50-YOLOv5 on Validation Set
7.1 Jetson TX2 Developer Kit components
7.2 NVIDIA SDK Manager



Chapter 1
Introduction
1.1 Motivation
Drowsiness detection is usually applied in modern vehicles to enhance driving safety. Detecting drowsiness in drivers can prevent potential accidents. With the advent of deep learning, many visual tasks such as image classification [13], object detection [28], and semantic segmentation [22] have achieved great performance. As a result, treating drowsiness detection as a visual object detection problem opens up a wide range of powerful solutions from the deep learning approach. However, deep learning methods typically require a large set of labeled data for training and a relatively long processing time for an embedded application.
Humans are very good at grasping new concepts and adapting to unforeseen circumstances. For instance, humans can recognize a giraffe from just a single picture. This ability is a hallmark of human intelligence. The secret behind this ability is that humans can leverage prior experience to reinforce new concepts. In contrast, a traditional classification model learns an object from scratch and requires massive amounts of labeled training data. For example, ResNet [13] was trained on the 1.28 million training images of ImageNet [32] to achieve human-level classification accuracy.

Figure 1.1: Giraffe. Humans can recognize a giraffe after seeing one a single time. In contrast, an object recognition system like ResNet [13] has to be trained on far more examples to achieve the human level of accuracy.
In some areas, labeled data is just too expensive. For instance, the process of collecting medical data is very complicated. It consumes time and resources, and may even require consent from the patients. This causes practical difficulties for traditional machine learning systems in such situations. However, the available labeled or unlabeled data from other distributions is enormous. The question is: can we transfer knowledge from available data to new tasks? Can we train a model with available data such that a few annotations from new tasks suffice to produce a well-performing model?

Modern deep learning works are usually implemented by general-purpose deep learning
frameworks, e.g., Pytorch, Tensorflow. These libraries provide great flexibility to construct
loss functions at training time, build and modify the models, etc. A deep learning model is
usually trained on data-center GPUs with great processing capacity. However, at the deployment stage, inference is sometimes required to run on cheaper devices with smaller capacities. Running inference with the same general-purpose deep learning libraries is relatively slow for some applications. Typically, drowsiness detection requires low latency to respond immediately to drowsiness. Hence, there is a need to optimize the trained deep learning model
for embedded devices. The most popular approach for NVIDIA embedded devices is to use
NVIDIA TensorRT.

1.2 The Scope of the Thesis
We consider drowsiness detection as an object detection problem. More specifically, we
consider the problem of detecting drowsiness from the human face. The problem consists of
localizing the human face on camera and classifying the drowsy expression from it. We analyze the difficulty of limited data in training a modern object detector. To this end, we review the recent few-shot learning and object detection literature. We analyze how those few-shot techniques can be applied when transferred to object detection tasks. We then demonstrate the performance of several state-of-the-art pretraining techniques in transferring to object detection tasks. We propose a new object detection network called ResNet-50-YOLOv5, which adopts ResNet-50 as the backbone in the YOLOv5 architecture. The combined network is directly compatible with many available unsupervised and few-shot learning methods. Finally, we deploy the proposed model to the Jetson TX2 using the leading framework for inference speed, i.e., TensorRT.

1.3 Organization of the Thesis
• In chapter 2, we briefly describe few-shot learning and object detection settings. We also
provide some basic math and machine learning concepts that might be helpful for the reader
to understand the rest of the text.
• Chapter 3 summarizes prior research works in these areas in a consistent way.
• In chapter 4, we briefly provide object detection settings. We introduce our few-shot learning method that achieves state-of-the-art in popular benchmarks, and we show how to apply
it to improve a particular object detector. We describe one of the most powerful models for

object detection, i.e., YOLOv5. Finally, we provide a detailed description of the network
architecture of the proposed ResNet-50-YOLOv5.
• In chapter 5, we first describe specifically the drowsiness detection problem, which includes dataset, performance metrics, etc. We demonstrate the results of a naive approach
to drowsiness detection with full annotations using ResNet-50-YOLOv5. We also report
the overall performance of the model when being deployed on Jetson TX2. We then investigate the performance of training ResNet-50-YOLOv5 on limited numbers of training
data. Finally, we conduct extensive experiments on pretraining the backbone of ResNet-50-YOLOv5.


• Finally, in chapter 6 we conclude by summarizing what we have done so far and discussing the pros and cons of the work.


Chapter 2
Foundations
In the first part of this section, we provide basic math and machine learning concepts that are useful for further discussion in the field. We describe essential subfields of machine learning: supervised and unsupervised learning. Each of them has been studied extensively in the recent few-shot learning literature and still has room for improvement.
In the second part, we provide a formal definition of few-shot object recognition and object detection, and their terminology.

2.1 Probabilities and Statistic Basics
2.1.1 Random Variables

A random variable is a variable that can take on different values, each occurring with some probability. If the set of such values is discrete, the random variable is said to be discrete; otherwise it is continuous.

For example, the outcome of tossing an unbiased coin can be modeled with a random variable x ∈ {0, 1}, where 0 indicates the outcome "head" and 1 indicates the outcome "tail"; each case occurs with probability 1/2.

2.1.2 Probability Distributions

2.1.3 Discrete Random Variables - Probability Mass Function

When describing a probability distribution associated with some discrete random variable, we use a probability mass function. A probability mass function takes a value of the specified random variable as input and outputs the corresponding probability.
A typical probability mass function P of a random variable x must satisfy these properties:
• The domain of P must be the set of all possible values of x.
• ∀x, 0 ≤ P(x) ≤ 1.
• ∑_x P(x) = 1.

2.1.4 Continuous Random Variables - Probability Density Function

In the case of describing the probability distribution of a continuous random variable, we use a probability density function.
A typical probability density function p of a random variable x must satisfy these properties:
• The domain of p must be the set of all possible values of x.
• ∀x, p(x) ≥ 0.
• ∫ p(x) dx = 1.
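These properties are easy to check numerically. The snippet below is a minimal sketch using NumPy; the fair-die PMF and the grid width are our own illustrative choices, not examples from the text.

```python
import numpy as np

# Discrete case: the PMF of a fair six-sided die.
pmf = np.full(6, 1 / 6)                      # P(x) for each of the six faces
print(np.all((pmf >= 0) & (pmf <= 1)))       # True: 0 <= P(x) <= 1 holds
print(pmf.sum())                             # sums to one (up to float error)

# Continuous case: the standard normal density.
# A Riemann sum over a wide grid approximates the integral of p(x).
x = np.linspace(-8.0, 8.0, 100_001)
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
area = (pdf * (x[1] - x[0])).sum()
print(area)                                  # ~1.0: the density integrates to one
```

The same check applies to any candidate PMF or density before it is used in a model.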

2.1.5 Marginal Probability

A probability distribution can be defined on a set of more than one variable. The probability distribution over a subset of them is called the marginal probability distribution. For instance, let P(x, y) be the probability distribution defined over two random variables x and y. Then the marginal probability distribution P(x) is

P(x) = ∑_y P(x, y).   (2.1)

The formula (2.1) is known as the sum rule. For the case of continuous random variables:

p(x) = ∫ p(x, y) dy.   (2.2)


2.1.6 Conditional Probability

The conditional probability that y occurs given x is denoted p(y|x). The conditional probability can be derived as follows:

p(y|x) = p(x, y) / p(x).   (2.3)
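The sum rule (2.1) and the conditional probability (2.3) can be illustrated on a small joint table. The joint distribution below is our own made-up example over two binary variables, not one from the text.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x in {0, 1},
# columns index y in {0, 1}; entries sum to 1.
P_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

# Sum rule (2.1): marginalize y out by summing along the y-axis.
P_x = P_xy.sum(axis=1)
print(P_x)                                   # [0.4 0.6]

# Conditional probability (2.3): p(y | x) = p(x, y) / p(x).
P_y_given_x = P_xy / P_x[:, None]
print(P_y_given_x.sum(axis=1))               # each row sums to 1: a valid
                                             # distribution over y for every x
```

Note that each row of P_y_given_x is itself a probability distribution, which is exactly what conditioning produces.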

2.1.7 Expectation and Variance

The expectation of some function f(x) with respect to a probability mass function P(x) is defined as

E_{x∼P}[f(x)] = ∑_x P(x) f(x).   (2.4)

For the case of a probability density function p(x):

E_{x∼p}[f(x)] = ∫ p(x) f(x) dx.   (2.5)

The variance of f(x) is

var(f(x)) = E[(f(x) − E[f(x)])²].   (2.6)
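Definitions (2.4) and (2.6) translate directly into code. As a sketch, we take a fair die with f(x) = x (our own illustrative choice):

```python
import numpy as np

# Expectation and variance under a PMF, following (2.4) and (2.6).
x = np.arange(1, 7)                          # die faces 1..6
P = np.full(6, 1 / 6)                        # fair-die PMF

E_f = np.sum(P * x)                          # equation (2.4)
var_f = np.sum(P * (x - E_f) ** 2)           # equation (2.6), expanded under P
print(E_f)                                   # 3.5
print(var_f)                                 # 2.9166...  (= 35/12)
```

For a density, the sums would be replaced by integrals as in (2.5).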

2.1.8 Sample

Statistical inference is concerned with making decisions about a population based on the information in a random sample drawn from that population.
Random sample. The random variables X1, X2, ..., Xn are a random sample of size n if the Xi's are independent random variables and every Xi has the same probability distribution.
Statistic. A statistic is any function of the observations in a random sample.

Some important statistics are the sample mean X̄, the sample variance S², and the sample standard deviation S. Because the observations vary as samples are randomly drawn, a statistic will also vary. As a result, a statistic is a random variable associated with some probability distribution. The probability distribution of a statistic is called a sampling distribution.

X̄ = (X1 + X2 + ··· + Xn) / n.   (2.7)
Central Limit Theorem. If X1, X2, ..., Xn is a random sample of size n taken from a population (either finite or infinite) with mean µ and finite variance σ², and if X̄ is the sample mean, then the limiting form of the distribution of

Z = (X̄ − µ) / (σ/√n)   (2.8)

as n → ∞ is the standard normal distribution.

Figure 2.1: The graph of the standard normal distribution.
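The theorem can be observed empirically: standardized means of a non-normal population look standard normal. The sketch below uses a uniform population and sample sizes of our own choosing.

```python
import numpy as np

# Empirical illustration of the Central Limit Theorem.
rng = np.random.default_rng(0)
n, reps = 50, 20_000
mu, sigma = 0.5, np.sqrt(1 / 12)             # mean and std of Uniform(0, 1)

# Draw `reps` samples of size n and standardize each sample mean as in (2.8).
samples = rng.uniform(0.0, 1.0, size=(reps, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))
print(z.mean(), z.std())                     # close to 0 and 1, as for N(0, 1)
```

A histogram of z would closely match the standard normal curve in Figure 2.1.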
The t distribution. Let X1, X2, ..., Xn be a random sample from a normal distribution with unknown mean µ and unknown variance σ². The random variable

T = (X̄ − µ) / (S/√n)   (2.9)

has a t distribution with n − 1 degrees of freedom.

2.1.9 Confidence Intervals

Obviously, the sample mean X̄ is a point estimator of the unknown population mean µ. However, we have no clue about how well the sample mean estimates the population mean. A confidence interval gives us a quantitative way to reason about this.
Confidence interval on the population mean, variance known. If x̄ is the sample mean of a random sample of size n from a normal population with known variance σ², a 100(1 − α)% confidence interval on µ is given by

x̄ − z_{α/2} σ/√n ≤ µ ≤ x̄ + z_{α/2} σ/√n,   (2.10)

where z_{α/2} is the upper 100α/2 percentage point of the standard normal distribution.
Confidence interval on the population mean, variance unknown. If x̄ and s are the mean and standard deviation of a random sample from a normal distribution with unknown variance σ², a 100(1 − α)% confidence interval on µ is given by

x̄ − t_{α/2, n−1} s/√n ≤ µ ≤ x̄ + t_{α/2, n−1} s/√n,   (2.11)

where t_{α/2, n−1} is the upper 100α/2 percentage point of the t distribution with n − 1 degrees of freedom.
It is important not to confuse the meaning of a 100(1 − α)% confidence interval with the probability of the population mean lying within a particular interval. If we repeatedly generate random samples and compute the confidence interval for each random sample, approximately 100(1 − α)% of the intervals will contain the population mean.
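This repeated-sampling interpretation can be checked by simulation. The sketch below builds the t-based interval (2.11) many times; the population parameters and sample size are our own illustrative choices, and it assumes SciPy is available for the t percentage point.

```python
import numpy as np
from scipy import stats

# Coverage check for the 95% t-interval (2.11).
rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 20, 2_000
t_crit = stats.t.ppf(0.975, df=n - 1)        # upper 2.5% point, n-1 dof

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    half = t_crit * sample.std(ddof=1) / np.sqrt(n)   # interval half-width
    covered += (sample.mean() - half <= mu <= sample.mean() + half)

print(covered / reps)                        # approximately 0.95
```

The observed fraction of intervals containing µ hovers around 1 − α, matching the interpretation above.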

2.2 Machine Learning Basics
Many works described in this text are deep learning techniques, which are a particular type of machine learning. Understanding basic machine learning concepts is crucial for discussing deep learning as well as few-shot learning algorithms.
A machine learning algorithm is an algorithm that is able to learn from data. A particular task in machine learning can be defined by two sets, a training set and a test set, corresponding to two stages. At the first stage, or training time, the training set is given to the model. The model aims to learn from the training set so that at the second stage, or test time, it performs well on the test set. A performance metric is used to evaluate how good the model is on a particular dataset. Sometimes we encounter the term 'validation set' in the literature. This set is used to evaluate the model's performance before it is actually tested on the test set. Feedback from the validation set also helps tune the hyperparameters of a model and avoid overfitting.

2.2.1 Supervised Learning

In supervised learning problems, the training set is a collection of pairs of an input data point and its associated target or label. Supervised learning aims to learn a function that accurately predicts the targets for novel data points.
Formally, given a training dataset D = {(x_i, y_i), i = 1, 2, ...}, the task is to learn a model that produces a prediction y_predict for an unseen data point x* as accurately as possible. The accuracy of a model is measured by a loss function L(y_predict, y_true) of the prediction y_predict and a 'ground truth' target y_true associated with x*.
In some cases, the predictions are based on a conditional distribution p(y|x, D). The task is then to model the conditional distribution. This is referred to as probabilistic supervised learning.
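As a minimal sketch of this setting, consider fitting a linear model to a toy training set D and measuring a squared-error loss on an unseen point. The data-generating line y = 2x + 1 and the least-squares learner are our own illustrative choices, not part of the text.

```python
import numpy as np

# A toy supervised learning task: D = {(x_i, y_i)} sampled around y = 2x + 1.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, 50)
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 0.1, 50)

# Learn the model: least-squares fit of y ≈ w*x + b.
A = np.stack([x_train, np.ones_like(x_train)], axis=1)
(w, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Loss L(y_predict, y_true) on an unseen data point x*.
x_star, y_true = 0.3, 2.0 * 0.3 + 1.0
y_predict = w * x_star + b
loss = (y_predict - y_true) ** 2
print(w, b, loss)                            # w close to 2, b close to 1, small loss
```

The learned parameters recover the underlying relationship, so the loss on the unseen point is small; this is the goal stated formally above.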
Classification
Classification is one of the most widespread problems of supervised learning. The task is to classify some input data point into one of the given categories. The set of categories is a discrete set. Formally, each data point is associated with a target drawn from the set {1, 2, ..., k} of k categories. The model is asked to predict the category of the given input. From a probabilistic perspective, the model produces a probability distribution over categories given the training set and the specified data point. An example of classification is to recognize the digit in an image of a handwritten number [16].
Figure 2.2: MNIST digit dataset [16].

Accuracy. Accuracy is one of the ways to evaluate the performance of a classification model.
Accuracy is the number of correct predictions divided by the total number of predictions. For
instance, let y_predict = [1, 1, 1, 0] be the predictions for four samples whose labels are y_true =
[0, 0, 1, 1]. Then the accuracy is 25% since there is one correct prediction. The accuracy gives
a good view of the model's performance. However, it gives no insight into how the underlying
model performs in predicting each class.
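The computation can be sketched in a few lines of Python (the helper name is ours):

```python
def accuracy(y_predict, y_true):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == t for p, t in zip(y_predict, y_true))
    return correct / len(y_true)

# The example from the text: one of four predictions is correct.
print(accuracy([1, 1, 1, 0], [0, 0, 1, 1]))  # 0.25
```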
Confusion Matrix. The confusion matrix gives more detail about a model's performance. More
specifically, it records how each class is predicted by the model. For example, y_predict =
[1, 1, 1, 0] and y_true = [0, 0, 1, 1] give the following confusion matrix:

                 Class 0   Class 1
Predicted as 0      0         1
Predicted as 1      2         1

Table 2.1: Confusion matrix for y_predict = [1, 1, 1, 0] and y_true = [0, 0, 1, 1].

The accuracy can be derived as the sum of the numbers on the diagonal of the confusion
matrix divided by the sum of all the numbers in the matrix. Other metrics can also be derived
from the confusion matrix. Typically, one has more interest in one class than in the others. In such a
case, we define that class as the positive class, whereas all the other classes are defined as negative.
For example, for a coronavirus test, if it predicts that a person has coronavirus, then the test is
'positive'. Otherwise, it is 'negative'. We have the following quantities:

- The test predicts that a person has coronavirus and they do (True Positive)
- The test predicts that a person has coronavirus but they do not (False Positive)
- The test predicts that a person does not have coronavirus and they do not (True Negative)
- The test predicts that a person does not have coronavirus but they do (False Negative)
Table 2.2 shows these quantities on a confusion matrix. Precision and Recall are two metrics
used to evaluate this test. More specifically, precision is defined as TP / (TP + FP) and recall
is defined as TP / (TP + FN). Both precision and recall range from 0 to 1. It is crucial
not to miss any infected person, so we need the test to have high recall. On the other hand,
wrongly classifying a person as infected wastes medical resources, so we also want the
precision to be high. A good test has both high precision and high recall, which is not always
achievable in practice. There is often a trade-off between precision and recall, depending
on the underlying task that we are solving.



                             Has coronavirus       Does not have coronavirus
The person tested positive   True Positive (TP)    False Positive (FP)
The person tested negative   False Negative (FN)   True Negative (TN)

Table 2.2: Quantities in the confusion matrix of a coronavirus test.
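These quantities and the derived metrics can be sketched as follows (the helper name is ours); the values match the example of Table 2.1 with class 1 treated as positive:

```python
def precision_recall(y_predict, y_true, positive=1):
    """Compute precision and recall, treating `positive` as the positive class."""
    tp = sum(p == positive and t == positive for p, t in zip(y_predict, y_true))
    fp = sum(p == positive and t != positive for p, t in zip(y_predict, y_true))
    fn = sum(p != positive and t == positive for p, t in zip(y_predict, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# For y_predict = [1, 1, 1, 0] and y_true = [0, 0, 1, 1]: TP = 1, FP = 2, FN = 1.
p, r = precision_recall([1, 1, 1, 0], [0, 0, 1, 1])
print(p, r)  # precision = 1/3, recall = 1/2
```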
Regression
This class of tasks is quite similar to classification, except that the output is now a real value.
The model has to approximate the corresponding target given an input data
point. Formally, the learning algorithm is asked to produce a function f : R^n -> R, where n is
the dimension of the input x. From a probabilistic perspective, the learning algorithm might output a
probability density function over y.
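As a minimal illustration (our example, not from the text), a one-dimensional regression function can be fit with the closed-form least-squares solution:

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = w * x + b in one dimension."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var            # slope minimizing the squared error
    b = mean_y - w * mean_x  # intercept
    return w, b

w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data on the line y = 2x + 1
print(w, b)  # 2.0 1.0
```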

2.2.2 Unsupervised Learning


In unsupervised learning, the data points are provided without corresponding targets. The
model aims to extract compact information from the data. By extracting information, we mean
learning from the distribution of the given data. Examples of unsupervised learning tasks are
density estimation, anomaly detection, clustering, and generating new data points from the underlying
distribution. Given a dataset D = {x_i, i = 1, 2, ...}, probabilistic unsupervised learning tends to
model the distribution p(x) over data points.

2.2.3 Semi-supervised Learning

In machine learning, semi-supervised learning is a family of algorithms in which the model
learns from labeled and unlabeled data simultaneously. More formally, the training data is then
D = {(x_i, y_i), i = 1, 2, ..., k} U {x_j, j = k + 1, k + 2, ..., k + l}.

2.3 Few-shot Learning
Few-shot Learning Problem Formulation. In few-shot image classification, we are given two
disjoint sets: a base set D_b = {(x_i, y_i), i = 1, ..., T_b} and a support set
D_n^s = {(x_j, y_j), j = 1, ..., T_n}, where D_b contains
a large number of labeled samples from N_b base classes and D_n^s contains N novel classes, each
having K samples (so T_n = N * K). Given D_b and D_n^s, the aim of few-shot learning is to adapt the model to D_n^s
after being well trained on D_b. Another set D_n^q of query samples drawn from the novel classes is
used to evaluate the generalization ability of a few-shot learning model. Such a configuration is
called an N-way K-shot task. D_n^q is called the query set.
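Such a task can be sketched as sampling an episode from a labeled pool (function and variable names here are ours; a real pipeline would draw images rather than toy strings):

```python
import random

def sample_episode(data_by_class, n_way, k_shot, q_query, seed=0):
    """Sample the support and query sets of an N-way K-shot task.

    data_by_class maps each class name to its list of samples; the support
    and query samples of a class are drawn without replacement, so they
    are disjoint.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)  # pick N novel classes
    support, query = [], []
    for label, cls in enumerate(classes):
        samples = rng.sample(data_by_class[cls], k_shot + q_query)
        support += [(x, label) for x in samples[:k_shot]]  # K shots per class
        query += [(x, label) for x in samples[k_shot:]]    # Q queries per class
    return support, query

# A toy pool: 5 classes with 20 samples each; draw a 2-way 5-shot task
# with 3 query samples per class.
pool = {f"class{c}": [f"c{c}_img{i}" for i in range(20)] for c in range(5)}
support, query = sample_episode(pool, n_way=2, k_shot=5, q_query=3)
print(len(support), len(query))  # 10 6
```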

Few-shot learning was proposed to tackle the problem of data scarcity, and meta-learning
is the most promising approach. Most meta-learning algorithms can be classified into black-box
adaptation, optimization-based inference, and non-parametric methods. We briefly describe
meta-learning and some terminology.
Meta-learning. Machine learning algorithms deal with tasks, e.g., classification, regression,
anomaly detection, sampling from a distribution. A task usually consists of a training set, a test
set, and a performance measure. For example, the object recognition task's training set and test
set are sets of images and labels. The two sets have no common images. Labels of the test set


are used for evaluating machine learning algorithms. The performance measure is the accuracy of predicting
categories for the test set. Let us denote the training set and the test set as D^tr and D^ts, respectively.
In this text, we mostly consider supervised tasks where D^tr = {(x_i, y_i), i = 1, 2, ..., k} and
D^ts = {(x_i, y_i), i = 1, 2, ..., l}.
Meta-learning does not treat tasks independently. Instead, it accumulates experience from prior tasks to quickly adapt to a new one. This process mimics human behavior, as we do not learn tasks from scratch. Formally, given meta-training data D_meta-train =
{(D_i^tr, D_i^ts), i = 1, 2, ..., k}, the meta-learning model learns to perform well on meta-test data
D_meta-test = {(D_i^tr, D_i^ts), i = 1, 2, ..., l}. There is an analogy between meta-learning and standard supervised learning: supervised learning learns to predict the target y given the data point
x, whereas meta-learning learns to perform well on D^ts given D^tr.
In the paradigm of meta-learning, a component called the "learner" learns the new tasks, and
another component called the "meta-learner" trains the learner. As a result, there are two optimization
processes corresponding to the two components. Some literature refers to the optimization of the meta-learner as the outer loop and the optimization of the learner as the inner loop or adaptation.

2.4 Object Detection
In some contexts, we want to correctly classify images into predefined classes and also want
to precisely localize the regions that contain one or multiple objects within the considered image.
The recent evolution of deep learning has started an active research wave on this topic, which is
now well defined as object detection. Object detection can be further divided into two sub-topics, namely general object detection and its applications. General object detection tends
to explore new general approaches, algorithms, or techniques toward improving a model's
localizing and classifying ability. Research into applications spans a large number of
areas, such as pedestrian detection, car detection, face detection, etc. Object detection, along
with image classification, is one of the most critical problems in modern visual understanding.
Their applications are related to many downstream tasks, including autonomous driving, object
tracking, medical imaging, etc.
Object detection consists of two subtasks, namely object localization and object classification. Given an image of objects, the object detection model aims to predict bounding boxes
that localize the underlying objects in the image and to associate them with class predictions.
In object detection, the model is usually trained using data with bounding boxes and class
labels. For a single task, there is typically a specific set of underlying object categories. The
bounding boxes of the objects within an image are defined by their offsets relative to the size of
the image. The class label for an object is encoded by a numerical index.
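As an illustration, a pixel-space bounding box can be normalized by the image size; the corner-based (xmin, ymin, xmax, ymax) encoding below is one common convention, assumed here for concreteness:

```python
def encode_box(box, img_w, img_h):
    """Normalize (xmin, ymin, xmax, ymax) pixel coordinates to [0, 1]."""
    xmin, ymin, xmax, ymax = box
    return (xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h)

def decode_box(rel_box, img_w, img_h):
    """Recover pixel coordinates from normalized offsets."""
    xmin, ymin, xmax, ymax = rel_box
    return (xmin * img_w, ymin * img_h, xmax * img_w, ymax * img_h)

# A box in a 160 x 120 image, expressed relative to the image size.
rel = encode_box((40, 30, 120, 90), img_w=160, img_h=120)
print(rel)  # (0.25, 0.25, 0.75, 0.75)
```

Storing boxes in relative coordinates makes the annotation independent of image resizing.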



Chapter 3
Related work
3.1 Few-shot Learning
Most few-shot learning algorithms can be divided into the categories of meta-learning and
metric learning; we describe these categories and review some representative works from each
category.

3.1.1 Meta-Learning

In the recent few-shot learning literature, gradient-based meta-learning methods are referred to as meta-learning, whereas metric learning is classified as a non-parametric meta-learning technique. For more detail on meta-learning, see [8].
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
In [7], the authors proposed to learn a set of model parameters such that a few
gradient descent iterations produce good performance on a new task. Intuitively, the
model is trained to an initial state from which, for each new task, good performance can be obtained by
fine-tuning on a small amount of data.

Figure 3.1: MAML algorithm (figure from [7]). During meta-training, a large number of
batches, each consisting of multiple tasks sampled from p(T), is generated. For each task T_i in the
batch, the set θ_i of task-specific parameters is derived by fine-tuning the model parameters θ
on the training set of that task. The task-specific parameters are then evaluated on the task's test set.
Finally, the loss accumulated over the batch is used to update the model parameters θ.



In effect, for a parametric model f_θ, MAML aims to learn a single set of parameters θ*
that is 'close' to all the task-specific parameters. Starting from such a set of parameters, the model
is able to quickly adapt to novel tasks within a few optimization steps.
The work is compatible with a range of models trained with gradient descent and is applicable
to many applications. However, models that learn to adapt with MAML tend to be unstable, and
it is hard to achieve good performance [2].
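The idea can be sketched on a toy family of one-parameter regression tasks (our first-order simplification; the full MAML of [7] backpropagates through the inner loop and works on deep networks):

```python
import random

def maml_sketch(task_params, inner_lr=0.1, outer_lr=0.05, steps=500, seed=0):
    """First-order MAML on a toy task family.

    Each task a has loss L(theta) = (theta - a)^2, so the gradient is
    2 * (theta - a). The inner loop takes one adaptation step; the outer
    loop updates the initialization using the adapted parameters' gradient.
    """
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        a = rng.choice(task_params)                   # sample a task
        theta_i = theta - inner_lr * 2 * (theta - a)  # inner loop (adaptation)
        theta -= outer_lr * 2 * (theta_i - a)         # outer loop (first-order)
    return theta

theta = maml_sketch([-1.0, 0.0, 1.0])       # meta-train on three tasks
adapted = theta - 0.1 * 2 * (theta - 1.0)   # one adaptation step on a novel task (a = 1)
print(abs(adapted - 1.0) < abs(theta - 1.0))  # True: adaptation moves toward the optimum
```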
Meta-learning with latent embedding optimization
In MAML, the learner space and the meta-learner space are the same. As a result, both the inner
loop and the outer loop have to be performed in a high-dimensional parameter space, which
is inefficient. In Meta-learning with latent embedding optimization [33], the authors relax this
limitation by introducing a latent space.
The authors argue that it is beneficial to relax the assumption that there exists a single optimal
θ* and to replace such a θ with a generative model in a low-dimensional space; more specifically,
a task-dependent conditional probability distribution over θ. The adaptation steps are now
performed in the low-dimensional latent space, and task-specific parameters are sampled from
the generative model.

Figure 3.2: LEO algorithm (figure from [33]). Instead of maintaining high-dimensional parameters θ that capture information from the task distribution as in MAML, LEO proposes a
low-dimensional generative model conditioned on the underlying task via an encoding technique.
The adaptation steps are performed in this latent space. Finally, the task-specific parameters are sampled from the generative model.
Although LEO relaxes one difficulty of MAML and other
parametric meta-learning algorithms, namely the high-dimensional optimization space, it
is still unstable and hard to train to good performance. In the LEO work, the authors
had to tune the model hyperparameters carefully.
Meta-SGD: Learning to Learn Quickly for Few-Shot Learning
Meta-SGD [18] is closely related to MAML but has 'higher capacity'. Meta-SGD not
only learns the initial parameters of the model but also learns the direction and learning rate
of the adaptation procedure. In the beginning, the model parameters and the learning rate of the adaptation
steps are randomly initialized. The inner loop is mostly the same as in MAML. However, in the
outer loop, the parameters as well as the learning rate are updated according to the loss of the
batch.
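On a toy family of one-parameter regression tasks with loss (theta - a)^2 (our first-order simplification, not the full vector form of [18]), jointly meta-learning the initialization and the adaptation rate can be sketched as:

```python
import random

def meta_sgd_sketch(task_params, meta_lr=0.02, steps=2000, seed=0):
    """Jointly meta-learn an initialization theta and an adaptation rate alpha.

    Toy task family: task a has loss (theta - a)^2 and gradient 2 * (theta - a).
    The inner loop is one step theta_i = theta - alpha * g_tr; in the outer
    loop both theta and alpha are updated from the test-loss gradient
    (first-order, using d(theta_i)/d(alpha) = -g_tr).
    """
    rng = random.Random(seed)
    theta, alpha = 0.0, 0.05                # both are meta-learned
    for _ in range(steps):
        a = rng.choice(task_params)
        g_tr = 2 * (theta - a)              # gradient on the task's training data
        theta_i = theta - alpha * g_tr      # adaptation with the learned rate
        g_ts = 2 * (theta_i - a)            # gradient of the task's test loss
        theta -= meta_lr * g_ts             # update the initialization
        alpha -= meta_lr * g_ts * (-g_tr)   # update the learning rate
    return theta, alpha

theta, alpha = meta_sgd_sketch([-1.0, 1.0])
print(alpha > 0.05)  # True: the learned rate grows so one step adapts further
```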