A traffic sign recognition system with convolutional neural network


Control Engineering & Electronics

A TRAFFIC SIGN RECOGNITION SYSTEM WITH
CONVOLUTIONAL NEURAL NETWORK
Luong Cong Duan1,*, Nguyen Hong Kiem2, Nguyen Ngoc Minh1
Abstract: In this research, we apply a Convolutional Neural Network (CNN) [1][2]
to the task of traffic sign recognition. This work is the foundation for our
continued research on self-driving vehicles. A Convolutional Neural Network is a
multi-stage architecture that can learn features automatically. We used the
TensorFlow library and Python as the main tools to test our approach. In our tests,
the architecture reached 92.6% accuracy on the test data.
Keywords: Traffic Sign Recognition, Convolutional Neural Network, CNN, Self-Driving.

1. INTRODUCTION
Our long-term goal is self-driving vehicles, and traffic sign identification is one of our
first studies toward it. Traffic sign identification can be applied in many areas of traffic,
such as notifying drivers of changes in road information, warning about violations while
driving, and automated driving. Traffic signs usually differ clearly from one another, but
the number of sign types is quite large. In addition, the quality of sign images is strongly
affected by viewing angle, lighting, occlusion, color fading, speed of movement, and so on.
In this paper, our aim is to build a test recognizer that ignores conditions that are too
difficult; these are left for further research. We used a basic dataset called German
Traffic Sign [3], the dataset used in the GTSRB (German Traffic Sign Recognition Benchmark)
competition. It provides more than 50,000 sample images in 43 different classes, such as
speed limits, dangerous curves, and slippery roads. The best result in the competition,
99.46% of signs correctly classified, was achieved by the IDSIA team using a committee of
CNNs [3].
Traditional methods for traffic sign recognition generally consist of two tasks: detection
and classification. Detection is first handled with computationally inexpensive,
hand-crafted algorithms. Classification is subsequently performed on the detected
candidates with more expensive, but more accurate, algorithms. Hand-crafted features, also
called shallow features, are not discriminative enough as databases become larger and
larger, and generic deep features should push recognition performance even further.
Classification has been approached with a number of popular methods such as Neural
Networks [4] and Support Vector Machines [5]. In one such approach, global sign shapes are
first detected with various heuristics and color thresholding, then the detected windows
are classified using a different multi-layer neural net for each type of outer shape. These
neural nets take 32x32 inputs and have at most 30, 15 and 10 hidden units for each of their
3 layers. While using a similar input size, the networks used in the present work have
orders of magnitude more parameters.
Current popular algorithms mainly use convolutional neural networks to perform both
feature extraction and classification [6]. Experiments have shown that CNNs have many
advantages in recognition problems, and a variety of CNN variants have been proposed for
GTSRB. Pierre Sermanet and Yann LeCun [7] fed both the high-level and low-level features
extracted by different convolutional layers to the fully connected layers. This method
combined global invariant features with local detailed ones and reached a record accuracy
of 99.17%.

118 L. C. Duan, N. H. Kiem, N. N. Minh, “A traffic sign recognition system… neural network.”

Science and Technology Research

Based on this information, we chose the CNN as the basic method for the traffic sign
recognition task. A CNN is a biologically inspired, multilayer feed-forward architecture
that can learn multiple stages of invariant features using a combination of supervised and
unsupervised learning. Each stage is composed of a (convolutional) filter bank layer, a
non-linear transform layer, and a spatial feature pooling layer. The spatial pooling layers
lower the spatial resolution of the representation, thereby making the representation
robust to small shifts and geometric distortions, similarly to “complex cells” in standard
models of the visual cortex [8]. CNNs are generally composed of one to three stages, capped
by a classifier composed of one or two additional layers.
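The three layers that make up one stage can be illustrated with a minimal sketch in plain Python. The single-channel 5x5 input and the single 2x2 edge filter are assumptions chosen for illustration only; they are not the filters learned by the network in this paper.

```python
# One CNN stage: filter bank -> non-linear transform -> spatial pooling.
# The tiny input and the hand-written kernel are illustrative assumptions.

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(feature_map):
    """Non-linear transform: negative responses are clipped to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def cnn_stage(image, kernel, pool_size=2):
    return max_pool(relu(conv2d_valid(image, kernel)), pool_size)

# A vertical bright bar; the kernel responds to dark-to-bright vertical edges.
image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
vertical_edge = [[-1, 1], [-1, 1]]
stage_output = cnn_stage(image, vertical_edge)
print(stage_output)
```

The 5x5 input becomes a 4x4 feature map after the convolution and a 2x2 map after pooling, which is the reduction in spatial resolution described above.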

Figure 1. Typical CNN architecture (Wikipedia).
After building the architecture, we use an optimization method to minimize the loss
function. One of the most popular methods is gradient descent [1][9]. Gradient descent
minimizes an objective function $J(\theta)$, parameterized by a model's parameters
$\theta \in \mathbb{R}^d$, by updating the parameters in the opposite direction of the
gradient of the objective function, $\nabla_\theta J(\theta)$, with respect to the
parameters. The learning rate $\eta$ determines the size of the steps we take to reach a
(local) minimum. In other words, we follow the direction of the slope of the surface
created by the objective function downhill until we reach a valley.
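The update described here, parameters moved against the gradient by a step scaled by the learning rate, can be sketched in a few lines. The quadratic objective below is a toy assumption standing in for the network's loss, and the learning rate of 0.1 is likewise illustrative.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly apply theta <- theta - learning_rate * grad(theta)."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta

# Toy objective (assumption): J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2,
# whose single minimum ("valley") is at theta = (3, -1).
def grad_J(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

theta_min = gradient_descent(grad_J, [0.0, 0.0])
print(theta_min)
```

With this objective each step shrinks the distance to the minimum by a constant factor, so the iterate ends very close to (3, -1). A learning rate that is too large would make the iterate overshoot and diverge instead.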
Currently, many libraries and programming languages support building and training machine
learning models. Drawing on its machine learning background, Google created an open-source
library called TensorFlow. It has a flexible architecture that allows users to deploy
computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a
single API [10]. We decided to use this library for our project.

Figure 2. Gradient descent on a series of level sets.

2. NETWORK ARCHITECTURE

The architecture used in the present work departs from traditional CNNs [5] by the use of
connections that skip layers, and by the use of pooling layers with different subsampling
ratios for the connections that skip layers and for those that do not.

Tạp chí Nghiên cứu KH&CN quân sự, Số 53, 02 - 2018


After a number of test runs, we have provisionally selected the following four-stage
architecture:
No.  Name               Description
1    Input data         [batch, 32, 32, 3] YUV data
2    1st stage          Input = input data
                        Conv1 + ReLU: kernel size = 5, layer width = 108
                          Y channel connects to 100 kernels
                          U and V channels connect to 8 kernels
                        Max pooling: kernel size = 2
                        Output = “conv1”
3    2nd stage          Input = “conv1”
                        Conv2 + ReLU: kernel size = 3, layer width = 200
                        Max pooling: kernel size = 2
                        Output = “conv2”
4    3rd stage          Combine “conv1(flatten)” with “conv2(flatten)”
                        Input = concatenation of “conv1(flatten)” and “conv2(flatten)”
                        Fully connected layer + ReLU: layer width = 300
                        Output = “fc1”
5    Output, 4th stage  Input = “fc1”
                        Out: layer width = 43
Figure 3. Network architecture.
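The tensor sizes implied by the table can be traced with a short sketch. The table does not state the convolution padding or stride, so stride 1 and "valid" (no) padding are assumptions here, with "same" padding offered as an alternative. The sketch also ignores any extra subsampling on the skip connection, which the table does not specify.

```python
def conv_out(size, kernel, padding="valid"):
    """Spatial size after a stride-1 convolution (stride 1 is an assumption)."""
    return size - kernel + 1 if padding == "valid" else size

def pool_out(size, kernel):
    """Spatial size after non-overlapping max pooling."""
    return size // kernel

def trace_shapes(input_size=32, padding="valid"):
    s1 = pool_out(conv_out(input_size, 5, padding), 2)  # 1st stage, width 108
    s2 = pool_out(conv_out(s1, 3, padding), 2)          # 2nd stage, width 200
    conv1_flat = s1 * s1 * 108                          # skip connection to 3rd stage
    conv2_flat = s2 * s2 * 200
    return {
        "conv1": (s1, s1, 108),
        "conv2": (s2, s2, 200),
        "concat": conv1_flat + conv2_flat,              # input of the 3rd stage
        "fc1": 300,
        "out": 43,
    }

shapes = trace_shapes()
fc1_weights = shapes["concat"] * shapes["fc1"]
print(shapes, fc1_weights)
```

Under these assumptions the concatenated vector feeding the fully connected layer has 28,368 elements, so that layer alone holds about 8.5 million weights. Most of them come from the flattened first-stage output.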

Figure 4. Diagram of the network architecture.
3. EXPERIMENT
A. Data Preparation
The GTSRB dataset has about 50,000 sample images in 43 classes. However, the number of
images per class is uneven. The distribution of the dataset is shown below:

Figure 5. Number of inputs per class before balancing data.


It can be seen that there are large differences between the classes, so we generate
additional data to balance the number of inputs. We used a simple method to increase the
number of images: rotating existing images by a few degrees. The distribution after this
operation is shown below:

Figure 6. Number of inputs per class after balancing data.
The data is more balanced, and each class has at least 500 images. This new dataset
will help to train our network better.
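A minimal sketch of this balancing step is given below. The rotation range of ±10 degrees, the nearest-neighbour resampling, and the helper names `rotate_nearest` and `balance_by_rotation` are all assumptions for illustration; the paper only states that images were rotated by a few degrees until each class had at least 500 images.

```python
import math
import random

def rotate_nearest(image, degrees):
    """Rotate a square grayscale image about its center (nearest neighbour).
    Output pixels whose source falls outside the image are set to 0."""
    n = len(image)
    c = (n - 1) / 2.0
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Inverse mapping: locate the source pixel of each output pixel.
            sx = cos_a * (x - c) + sin_a * (y - c) + c
            sy = -sin_a * (x - c) + cos_a * (y - c) + c
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < n and 0 <= iy < n:
                out[y][x] = image[iy][ix]
    return out

def balance_by_rotation(samples_by_class, min_count, max_degrees=10.0, seed=0):
    """Add rotated copies of random originals until each class has min_count images."""
    rng = random.Random(seed)
    balanced = {}
    for label, images in samples_by_class.items():
        augmented = list(images)
        while len(augmented) < min_count:
            source = rng.choice(images)
            angle = rng.uniform(-max_degrees, max_degrees)
            augmented.append(rotate_nearest(source, angle))
        balanced[label] = augmented
    return balanced

# Toy data (assumption): two classes of 2x2 "images" with uneven counts.
tiny = [[1, 2], [3, 4]]
dataset = {"stop": [tiny] * 3, "yield": [tiny] * 8}
balanced = balance_by_rotation(dataset, min_count=5)
print({label: len(images) for label, images in balanced.items()})
```

Classes that already meet the minimum are left untouched, so only under-represented classes receive synthetic samples.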
Additionally, all images are down-sampled or up-sampled to 32x32 (dataset sample sizes
vary from 15x15 to 250x250) and converted to YUV space. The Y channel is then
preprocessed with global and local contrast normalization, while the U and V channels are
left unchanged.
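The color conversion and the global part of the contrast normalization can be sketched as follows. The BT.601 conversion weights and the zero-mean, unit-variance form of global normalization are assumptions, since the paper does not give its exact formulas. Local contrast normalization is omitted from the sketch.

```python
import math

def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (components in [0, 1]) to YUV with BT.601 weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.14713 * r - 0.28886 * g + 0.436 * b
    v = 0.615 * r - 0.51499 * g - 0.10001 * b
    return y, u, v

def global_contrast_normalize(channel, eps=1e-8):
    """Shift a 2-D channel to zero mean and scale it to unit standard deviation."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [[(v - mean) / (std + eps) for v in row] for row in channel]

def preprocess(rgb_image):
    """Return (Y, U, V) planes; only the Y plane is contrast-normalized."""
    yuv = [[rgb_to_yuv(*px) for px in row] for row in rgb_image]
    y = [[px[0] for px in row] for row in yuv]
    u = [[px[1] for px in row] for row in yuv]
    v = [[px[2] for px in row] for row in yuv]
    return global_contrast_normalize(y), u, v

# Toy 2x2 image (assumption): white, black, red and blue pixels.
image = [
    [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
]
y_norm, u_plane, v_plane = preprocess(image)
print(y_norm)
```

Normalizing only the luminance plane removes brightness and contrast differences between images while keeping the color information in U and V as it was recorded.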
B. Network optimization
After preparing the input data, we trained with gradient descent optimization on a simple
dataset in order to tune our network. We used 200 training epochs to test and calibrate it.
During training, we changed the order of “Batch Normalization” and “Max Pooling” to
compare the differences in training speed (BP means “Conv → Batch Normalization → Max
Pooling” and PB means “Conv → Max Pooling → Batch Normalization”). The results of the two
arrangements are as follows:

Figure 7. Comparison between BP and PB.
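The two orderings can be written out on a toy example to make the difference concrete. This sketch uses a single-channel batch of 1-D feature maps and a batch normalization without learned scale and shift, both of which are simplifying assumptions. It does not reproduce the training-speed comparison itself.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize all values of a single-channel batch to zero mean, unit variance
    (learned scale and shift parameters are omitted in this sketch)."""
    values = [v for sample in batch for v in sample]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in sample] for sample in batch]

def max_pool_1d(batch, size=2):
    """Non-overlapping max pooling along each 1-D feature map."""
    return [
        [max(sample[i:i + size]) for i in range(0, len(sample) - size + 1, size)]
        for sample in batch
    ]

def bp_block(conv_output):
    """BP: Conv -> Batch Normalization -> Max Pooling."""
    return max_pool_1d(batch_norm(conv_output))

def pb_block(conv_output):
    """PB: Conv -> Max Pooling -> Batch Normalization."""
    return batch_norm(max_pool_1d(conv_output))

def batch_mean(batch):
    values = [v for sample in batch for v in sample]
    return sum(values) / len(values)

# Toy convolution output (assumption): two samples with four activations each.
conv_output = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
bp = bp_block(conv_output)
pb = pb_block(conv_output)
print(batch_mean(bp), batch_mean(pb))
```

In PB the statistics are computed on the pooled values, so the block output is exactly zero-mean and the normalization touches fewer values. In BP the pooling step selects the larger normalized values, so the block output has a positive mean.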

The chart clearly shows that the PB architecture performs better than the BP architecture,
so in this paper we use PB to design our architecture. Next, we compared networks with
different numbers of fully connected layers. We had assumed that a network with one more
fully connected layer would perform better, but the reality was the opposite.

Figure 8. Comparison of the number of fully connected layers.
With our data, the network with one fully connected layer performs better than the
networks with none or two. This suggests that a more complex architecture does not
necessarily give better results; a suitable architecture has to be found by testing. After
optimizing the network, we selected the architecture described in Section 2.
C. Training and Results
After choosing the architecture and parameters, we trained the network with the dataset
prepared above. The program was trained on 39,209 labeled samples and tested on 12,630
unlabeled samples. The final result is as follows:
>> Time to trainning: 4673.0710661411285s
>> Validation accuracy: 0.9854
>> Test accuracy: 0.9260
>> Time to process a picture: 0.253s

Figure 9. Loss and Accuracy of training process.
The results show that, after training, our architecture reaches an accuracy of 98.54% on
the validation data and 92.6% on the test data. The training process ran for nearly 40,000
steps, but the graph shows that from about the 10,000th step the loss and accuracy of the
network change very slowly; this is the phase in which the coefficients are fine-tuned.
Occasionally, the loss increases and the accuracy decreases very quickly, and then both
return to their previous range. Because of this anomaly, the programmer should monitor
these values during training and make sure they are stable before training stops, in order
to obtain the best training result.
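One simple way to automate such a check is to require that the most recent loss values stay within a narrow band before training is allowed to stop. The window length, the 5% tolerance, and the loss values below are illustrative assumptions, not measurements from this experiment.

```python
def is_stable(history, window=5, tolerance=0.05):
    """Return True when the last `window` values differ by at most `tolerance`
    relative to their mean, i.e. no spike occurred recently."""
    if len(history) < window:
        return False
    recent = history[-window:]
    center = sum(recent) / window
    spread = max(recent) - min(recent)
    return spread <= tolerance * max(abs(center), 1e-12)

# Hypothetical loss curves: one stopped right after a spike, one after settling.
loss_with_spike = [0.90, 0.40, 0.21, 0.20, 0.20, 0.85, 0.20]
loss_settled = [0.90, 0.40, 0.200, 0.201, 0.199, 0.200, 0.200]

print(is_stable(loss_with_spike), is_stable(loss_settled))
```

A training loop could call such a function after each evaluation step and keep running for a few more steps whenever it returns False.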
In this paper, we conducted the experiments on a machine without a GPU. The results show
that the processing time for each image is about 0.253 s (3.95 fps), which is a good
starting point for our next research. Because GPUs support parallel computing, the current
processing speed could be improved toward real-time processing.
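The frame rate follows directly from the measured time per image, and the same arithmetic gives the speed-up needed for a chosen real-time target. The 25 fps target below is an assumption for illustration; the paper does not define a specific real-time threshold.

```python
seconds_per_image = 0.253                      # CPU time per image reported above
frames_per_second = 1.0 / seconds_per_image    # about 3.95 fps

def required_speedup(target_fps, current_fps):
    """Factor by which processing must get faster to reach target_fps."""
    return target_fps / current_fps

# Assumed real-time target of 25 fps (not stated in the paper).
speedup_for_25_fps = required_speedup(25.0, frames_per_second)
print(round(frames_per_second, 2), round(speedup_for_25_fps, 2))
```

Under that assumed target, processing would have to become roughly six times faster than the CPU-only measurement.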
4. SUMMARY
In this paper, we proposed a simple architecture for traffic sign recognition. We ran
trials that changed the order of the processing steps to find the best arrangement. With
the same number of elements, the arrangement of those elements is very important for a
CNN. In addition, more complexity is not always better: for each type of data, the network
has to be adjusted to find the most appropriate architecture. Although the designed
architecture is simple, it gives good results. It has the following advantages: it is
simple, easy to implement in both high-level and low-level languages, uses few system
resources, and has a high processing speed.

The accuracy of our architecture is 92.6%. This result is not especially high, but the
architecture is much simpler than other architectures. It can be used on low-end computers
such as embedded computers or FPGAs. Before doing so, however, we will add filtering and
image processing tools as a pre-processing step to improve input quality.
In the next phase of research, we will reimplement our architecture in C/C++ to optimize
it for speed, continue to optimize the architecture, and address the next problems, such
as sensor problems, case handling, and automatic control, in order to build a model of a
self-driving vehicle.
Finally, after solving these component problems, we will deploy the system on embedded
computers and FPGAs to run a test device and evaluate its performance.
REFERENCES
[1]. I. Goodfellow, Y. Bengio and A. Courville, “Deep Learning”, MIT Press, 2016.
[2]. J. Wu, “Introduction to Convolutional Neural Networks”, LAMDA Group, National Key
Lab for Novel Software Technology, May 2017.
[3]. J. Torresen, J. W. Bakke and L. Sekanina, “Efficient recognition of speed limit
signs”, Proceedings of the 7th International IEEE Conference on Intelligent Transportation
Systems (IEEE Cat. No.04TH8749), 2004, pp. 652-656.
[4]. A. de la Escalera, L. Moreno, M. Salichs and J. Armingol, “Road traffic sign
detection and classification”, IEEE Transactions on Industrial Electronics, pp. 848-859,
1997.
[5]. R. Girshick, J. Donahue, T. Darrell and J. Malik, “Region-Based Convolutional
Networks for Accurate Object Detection and Segmentation”, IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142-158, Jan. 2016.
[6]. P. Sermanet and Y. LeCun, “Traffic sign recognition with multi-scale convolutional
networks”, The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE,
2011.
[7]. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to
document recognition”, Proceedings of the IEEE, 86(11):2278-2324, November 1998.


[8]. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton and
G. Hullender, “Learning to Rank using Gradient Descent”, Proceedings of the 22nd
International Conference on Machine Learning (ICML '05), pp. 89-96, August 2005.
[9].
ABSTRACT
TRAFFIC SIGN RECOGNITION WITH A CONVOLUTIONAL NEURAL NETWORK
In this study, we use a convolutional neural network (CNN) to build a traffic sign
recognition program. This is the foundation for our subsequent research on self-driving
vehicles. A convolutional network is a neural network with a multi-layer architecture
that additionally applies convolution operations between layers. This network can
automatically learn the features of objects. After building the network architecture, we
used the TensorFlow library and the Python programming language as the main tools for
testing. The test results show that, although the network architecture is simple and
consists of only 4 layers, it achieves an accuracy of 92.6%.
Keywords: CNN, Traffic sign recognition, Convolutional network, Self-driving vehicles.

Received date, 11th November, 2017
Revised manuscript, 10th December, 2017
Published, 26th February, 2018
Author affiliations:
1 Post and Telecommunication Institute of Technology, Km10, Nguyen Trai, Ha Đong, Ha Noi;
2 Telecommunication University, No.11 Mai Xuan Thuong, Nha Trang, Khanh Hoa.
* Corresponding author:
