
2016 Eighth International Conference on Knowledge and Systems Engineering (KSE)

Standardization procedure for automatic
environmental data: a case study in Hanoi, Vietnam
Linh Nguyen Duc, Man Duc Chuc, Bui Quang Hung, Nguyen Thi Nhat Thanh
Center of Multidisciplinary Integrated Technology for Field Monitoring,
University of Engineering and Technology, Vietnam National University.
Hanoi, Vietnam

Such data can inform appropriate solutions to limit the severe decline
in air quality in Vietnam at present.
In Vietnam, there are two systems of environmental monitoring stations,
both managed by the Ministry of Natural Resources and Environment [2].
Most of the stations are automated. They measure meteorological and air
pollution indicators hourly. Measured data is stored in local memory and
transferred to the main center daily or weekly. The data contains many
abnormal values and gaps caused by operational problems such as sensor
faults and station maintenance. Furthermore, the data has not undergone
any correction or recovery process, which creates obstacles for
researchers who use it.
Currently, the authorities mainly use traditional statistical tools,
e.g. Microsoft Excel, which increases processing time, especially when
the data volume is large. Additionally, detecting abnormal data or
filling in missing data manually is very time-consuming and costly. Thus
an automatic tool is needed to help the authorities and researchers work
with the data.
The current problems in the data measured at the ground stations are
described below:
- Inconsistent data: the data is not stored in a common standardized
output. It is stored in different structures using different units of
measurement, column names, date and time formats, etc. This causes many
difficulties when analyzing the data.
- Noisy data: occurs in several cases such as equipment failure,
transmission errors and unidentified errors.
- Missing data: data goes missing in situations such as unexpected
failure of the monitoring modules, power failure or relocation of the
measuring devices.
In this paper, we address the second and third problems with the
proposed standardization procedure.

Abstract - In Vietnam, environmental data collected from ground-based
stations may contain abnormal or missing values due to problems during
operation, e.g. sensor faults. This paper proposes a standardization
procedure that tries to detect unusual values and fill in missing data.
Experiments were conducted on PM10 data. Two datasets measured in
01/2011 and 01/2012 at the Nguyen Van Cu station in Hanoi, Vietnam, are
used for the experiments. In the abnormal detection process, unusual
data can be reported to the data analysts at ground stations for
judgment. In the missing filling process, the first dataset is used as
the training dataset to construct regression models for predicting
missing data, and the second dataset is used as the testing data. In the
worst case, supposing 100% of PM10 is missing, Root Mean Square Error
(RMSE) and Mean Absolute Percentage Error (MAPE) are 51 μg/m3 and 45%
respectively. The correlation coefficient (R) between the original and
predicted PM10 data is 0.56. In addition, different scenarios for the
percentage of missing data in the whole testing dataset are considered.
Experimental results show that the missing filling process works best on
datasets that contain 10% to 30% missing data. In this case, RMSE ranges
from 15 to 25 μg/m3 and MAPE varies from 5 to 13%.
Keywords - environmental data, abnormal detection, missing filling,
PM10.

I. INTRODUCTION
Environmental monitoring data is a dataset obtained by measuring one or
more indicators of the physical properties and the chemical and
biological components of the environment, according to a preset plan
covering time, space, methods and measurement process, in order to
provide reliable and accurate field information.
Ground-based environmental data can be used in various real-life
applications such as air pollution modeling and healthcare studies [11].
For example, the healthcare sector can use the data to analyze and
assess the impact of physical, chemical and biological factors on
dermatological, respiratory or epidemic diseases [12]. The data can also
help managers in the decision-making process.
978-1-4673-8929-7/16/$31.00 ©2016 IEEE




The two datasets cover 01/01/2011 to 31/01/2011 and 01/01/2012 to
31/01/2012 (Table 1). The total number of records in each dataset is 744.

The procedure supports the synthesis, cleaning and missing-data filling
of the data, saving time and effort for managers and researchers who
work with it.

Table 1. Statistics on data structure, volume in 01/2011 and
01/2012

II. DATA

As mentioned before, Vietnam has two automatic air monitoring networks,
both managed by the Ministry of Natural Resources and Environment. The
first is the network for monitoring meteorological and environmental
parameters (10 stations); the second is the network of national
environmental monitoring stations (7 stations). The stations measure
data hourly. The air pollution parameters measured at all of the
stations include carbon monoxide (CO), nitric oxide (NO), nitrogen
dioxide (NO2), sulfur dioxide (SO2), ozone (O3) and PM. In addition, the
stations measure meteorological information such as wind speed, wind
direction, temperature, relative humidity, barometric pressure,
radiation and inner temperature.
In Hanoi, there are three air monitoring stations: one at Phao Dai Lang,
Dong Da, and the other two at 556 Nguyen Van Cu and the Ho Chi Minh
Mausoleum. In this study, we used data from the Nguyen Van Cu station.
The station was launched in 2009 and is regularly maintained. It is
located in the Centre for Environmental Monitoring (CEM), has the most
stable operation, and its data can be considered representative of the
Hanoi area.
Particulate matter (PM) consists of solid and liquid particles suspended
in the atmosphere. PM includes both organic and inorganic particles such
as dust, pollen, soot, smoke and liquid droplets. These particles vary
greatly in size, composition and origin. PM can be divided into three
categories based on particle diameter: PM10, PM2.5 and PM1. The dust
monitoring data includes PM10, PM2.5 and PM1; PM10 is the main focus of
this study. The PM10 data was collected from 01/01/2011 to 31/01/2011
and from 01/01/2012 to 31/01/2012 at the Nguyen Van Cu station, Hanoi.
This is close to the time when the station was set up, so data quality
is assured. In the period 2013-2016, the monitoring modules were not
well maintained, resulting in more errors and missing data.

Monitoring time | Number of .xls files | Indicators
01/2011 | 31 | Wind speed, wind direction, temperature, relative humidity, barometer, radiation, inner temperature, NO, NO2, SO2, CO, O3, PM10, PM2.5 and PM1
01/2012 | 31 | Same 15 indicators as 01/2011

B. Missing status
The first dataset, collected in 01/2011, has a low missing rate: about
2% for the PM indicators and 0% for the other indicators. The second
dataset, collected in 01/2012, has a higher missing rate: 23% for SO2
and 37.4% for O3. However, the PM indicators in this dataset were fully
recorded (Table 2).
Table 2. Number of missing records per indicator in the two datasets.

Indicators | 01/2011 | 01/2012
SO2 | 0 | 170/744
O3 | 0 | 278/744
PM10 | 15/744 | 0
PM2.5 | 15/744 | 0
PM1 | 15/744 | 0
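
The missing-record counts reported in Table 2 can be reproduced with a simple tally. The sketch below is illustrative only: it assumes each hourly record is a dictionary in which a missing measurement is stored as None, and the function names and sample values are hypothetical rather than taken from the authors' tool.

```python
def missing_counts(records, indicators):
    """Return {indicator: (number of missing records, total records)}."""
    total = len(records)
    return {
        name: (sum(1 for r in records if r.get(name) is None), total)
        for name in indicators
    }


def missing_rate_percent(missing, total):
    """Missing rate as a percentage of all records."""
    return 100.0 * missing / total


# Illustrative hourly records (not real station data); None marks a gap.
records = [
    {"SO2": 12.0, "O3": 30.1, "PM10": 140.0},
    {"SO2": None, "O3": 28.4, "PM10": 150.2},
    {"SO2": 11.5, "O3": None, "PM10": None},
    {"SO2": None, "O3": None, "PM10": 95.7},
]
counts = missing_counts(records, ["SO2", "O3", "PM10"])

# Rates for the counts reported in Table 2 (170/744 for SO2, 278/744 for O3).
so2_rate = missing_rate_percent(170, 744)
o3_rate = missing_rate_percent(278, 744)
```

Applied to the counts in Table 2, the same formula gives the rates quoted in the text (about 23% for SO2 and 37.4% for O3).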

According to these statistics, the first dataset (01/2011) is used as
the training dataset because its PM10 data is nearly complete and the
other indicators are highly complete. The second dataset (01/2012) is
used as the test dataset.
A. Structure and volume
The two datasets collected from the Nguyen Van Cu station consist of 15
indicators: wind speed, wind direction, temperature, relative humidity,
barometer, radiation, inner temperature, NO, NO2, SO2, CO, O3, PM10,
PM2.5 and PM1. The data is stored in Microsoft Excel format (.xls), with
each day saved in a separate file. Thus the two datasets contain 62
files corresponding to 62 days.

III. METHODOLOGY
Based on the characteristics of the data, we propose a standardization
procedure for automatic environmental data (Fig 1), as described below:
1. Data collection: collect data from the stations, then build a common
   dataset with a conventional structure. The aim is to create a dataset
   with a standard data structure that simplifies managing and analyzing
   the data. If the dataset structure is not correct, collect the data
   again; once it is correct, go to the Data overview step.
2. Data overview (based on statistics): use statistical methods to
   extract the statistical characteristics and trends of the data, and
   prescreen it against reality.


   The data overview step only gives an overview of the data and a sense
   of whether it is noisy or missing; it helps us assess the quality of
   the existing data. If the dataset is of good quality, proceed to the
   next step. If not, re-examine the data source in the first step.
3. Noise detecting: remove data based on a reliable data range or use
   correlation analysis methods. This detects the days with abnormal
   observations and suggests unusual data to the analysts, who decide
   what to do with it. If the detected days turn out not to be noise,
   re-evaluate the noise detecting method; otherwise go to the next
   step.
4. Fill in missing data: use correlation analysis between the target
   indicator and other indicators to build linear regression models. The
   models are used to predict values for records where the target
   indicator is missing. If the filled dataset is acceptable, finish the
   process; otherwise re-evaluate the missing filling method.

This traffic appears during peak hours every day. The average PM10 value
for each hour, calculated from each month's data, agrees with the
general trend (Fig 2). Similar evaluation was applied to other
indicators such as NO2, SO2 and CO. The results showed that the two
datasets are reliable and follow the general trends reported in the
literature. This allows the following steps to be conducted.

IV. EXPERIMENTS AND RESULTS
A. Data Collection and Data Overview

Based on the data and basic statistical indicators, we can draw some
conclusions on the PM10 data from the two datasets, as shown in Table 3.
Table 3. Statistical indicators of PM10 calculated on the two datasets (ug/m3).

Month | Mean | Median | Mode | Q1 | Q3
01/2011 | 141.37 | 129.68 | 40.91 | 56.07 | 210.41
01/2012 | 87.18 | 75.39 | 97.22 | 49.61 | 113.61
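
The indicators in Table 3 are standard descriptive statistics. As a minimal sketch, they can be computed with Python's standard `statistics` module; the sample values are illustrative, and the quartile method used by the authors is not stated in the paper, so the "inclusive" method here is an assumption.

```python
import statistics


def summarize(values):
    """Return mean, median, mode, Q1 and Q3 for a list of PM10 readings."""
    # quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "q1": q1,
        "q3": q3,
    }


# Illustrative PM10 readings in ug/m3 (not real station data).
pm10 = [40.0, 55.0, 55.0, 90.0, 130.0, 210.0, 250.0]
summary = summarize(pm10)
```

Missing readings would need to be dropped before calling `summarize`, since the functions above expect numeric values only.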


Overall, the average PM10 concentrations range from 85 to 140 ug/m3.
This is close to the limit of 150 ug/m3 for PM10 set by the Vietnamese
air quality standard QCVN 05:2013/BTNMT. In general, the statistical
indicators of PM10 in the second dataset mostly have lower values than
those of the first dataset (the mode is the exception).
A previous study conducted in Hanoi showed that the averages of the
monitored indicators are often higher in winter and lower in summer [1].
The maximum PM10 value is often observed in the period from October to
January, with average PM10 values ranging from 100 to 150 ug/m3. This is
consistent with the statistics above.
The same study also showed the evolution of air pollution levels in
05/2003 and 09/2003 in Hanoi [1]. In those periods, the air pollution
level tended to rise during the peak hours of 7:00-9:00 and 18:00-20:00.
Furthermore, the highest peaks in the morning were often similar to
those in the evening. This is because of the high volume of vehicles.
Fig 1. Proposed data processing framework (flowchart: Start, Data
collection, Data overview, Noise detecting, Fill in missing, Finish;
each stage is followed by an Evaluation step with "Problems" and "Next"
branches).



Correlation analysis is another way to detect abnormal data. We propose
to detect potentially abnormal data by analyzing the correlation between
daily data and monthly average data. First, the average value for each
hour of the day is calculated from all data observed at that hour during
the month. Thus, for each month, 24 average PM10 values corresponding to
the 24 hours of a day are constructed. These values are considered to
represent the daily trend of PM10 for the month. Correlation analysis is
then conducted between the PM10 data measured on a particular day of the
month and the average PM10 values. If the correlation coefficient is
low, the day's data is considered noisy or abnormal. Specifically, a
coefficient in the range [-0.3; 0.3] is used to flag potentially
abnormal data for further analysis and evaluation. This range was chosen
because it represents negligible correlation according to Mukaka [14].
In addition, evaluating abnormal data requires professional experience
in meteorology and the environment, plus further assessment of the
pollution sources, such as traffic, industrial zones and the surrounding
area at the time of measurement, and of the status of the measurement
equipment at that time. Applying the range [-0.3; 0.3] to the training
dataset yields 8 days with a low correlation coefficient, as listed in
Table 5.
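
The correlation-based check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes every day has a complete set of 24 hourly PM10 values, uses the Pearson coefficient, and all function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a flat series as negligibly correlated
    return sxy / math.sqrt(sxx * syy)


def hourly_profile(days):
    """Average each hour of the day over all days of the month (24 values)."""
    return [sum(day[h] for day in days.values()) / len(days) for h in range(24)]


def flag_low_correlation_days(days, low=-0.3, high=0.3):
    """Return {date: r} for days whose correlation with the profile is in [low, high]."""
    profile = hourly_profile(days)
    flagged = {}
    for date, values in days.items():
        r = pearson(values, profile)
        if low <= r <= high:
            flagged[date] = r
    return flagged


# Synthetic month: five days share a rush-hour pattern, one day does not.
pattern = [150.0 if h in (7, 8, 18, 19) else 100.0 for h in range(24)]
days = {
    f"0{i + 1}/01/2011": [v * scale for v in pattern]
    for i, scale in enumerate([1.0, 1.1, 0.9, 1.2, 0.8])
}
days["06/01/2011"] = [120.0 if h % 2 == 0 else 80.0 for h in range(24)]
flagged = flag_low_correlation_days(days)
```

In this synthetic example only the last day is flagged, because its alternating values do not follow the rush-hour pattern. Flagged days are candidates for review by an analyst, not automatic deletions.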
Fig 2. PM10 daily trend in 01/2011 and 01/2012.
Table 5. Dates with a low correlation coefficient between the daily data
and the monthly average in the training dataset

B. Noise detecting
Noise detecting aims to detect potentially abnormal data on a daily
basis. It can be based on constructing a reliable data range, on
correlation analysis, or on a combination of both methods.
A confidence interval can be used to determine a reliable range of
values, which is then used to remove noisy data. This method requires
analysts with long experience of working with observational data in
order to construct a good data range. Based on research and
environmental reports [2, 3, 4, 5, 6, 7], we propose a reliable range
for PM10 of [0-400] ug/m3. Applying the proposed range to the training
dataset yields 4 potentially abnormal records, as listed in Table 4.
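
The range check itself is a one-line filter. The sketch below assumes records are (timestamp, value) pairs with None for missing values; the function name is hypothetical, and only the two values marked "Table 4" come from the paper.

```python
def flag_out_of_range(records, low=0.0, high=400.0):
    """Return the (timestamp, value) pairs whose value lies outside [low, high]."""
    return [
        (ts, value)
        for ts, value in records
        if value is not None and not (low <= value <= high)
    ]


records = [
    ("12/01/2011 09:00", 231.5),    # illustrative in-range value, not from the paper
    ("12/01/2011 10:00", 490.0),    # reported in Table 4
    ("17/01/2011 08:00", 420.656),  # reported in Table 4
    ("17/01/2011 09:00", None),     # illustrative gap; handled by the filling step
]
suspects = flag_out_of_range(records)
```

Missing values are skipped here on purpose, since they are handled by the missing filling step rather than by noise detection.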

Table 4. Records with PM10 values outside the confidence interval in
01/2011

Datetime | Observed PM10 value (ug/m3)
12/01/2011 10:00 | 490
17/01/2011 08:00 | 420.656
17/01/2011 17:00 | 462.044
17/01/2011 18:00 | 425.139

Date of observation | Correlation coefficient
03/01/2011 | -0.2829
04/01/2011 | 0.2108
09/01/2011 | -0.0953
11/01/2011 | 0.1110
13/01/2011 | 0.1502
17/01/2011 | 0.2299
19/01/2011 | -0.2411
23/01/2011 | -0.0405

C. Filling missing
Previous studies show that some environmental indicators are
significantly correlated [9, 13]. This means that missing PM10 data can
be recovered from suitable environmental parameters by constructing
linear regression models.
In this study, we build linear regression models using the training
dataset. The models are used to predict missing PM10 values in the
testing dataset. Table 6 shows the correlations between PM10 and the
other environmental indicators, derived from the training dataset.
Table 6. Correlation between PM10 and other environmental indicators in
the training dataset.

Indicator | Correlation with PM10
WindSpd | 0.04982
WindDir | 0.03815
Temp | 0.08365
RH | 0.34409
Barometer | 0.03855
Radiation | -0.0124
InnerTemp | 0.02089
NO | 0.23985
NO2 | 0.59005
SO2 | 0.53962
CO | 0.44486
O3 | 0.09338

Choosing which models to use requires understanding of the data to be
recovered. Table 9 lists the suggested models according to the status of
the data.
Table 9. Cases of missing data and suggested linear regression models.

Table 6 shows that three indicators are highly correlated with PM10:
NO2, SO2 and CO. Seven linear regression models are constructed to
predict PM10 from these three parameters. As described before, 15
records in the training dataset have missing PM10 values. To ensure
completeness of the data used for building the regression models, these
records are removed, resulting in a training dataset of 725 records.
After building the seven linear regression models, we validated the
predicted PM10 values of each model against the actual PM10 values,
assuming that 100% of the PM10 data is missing. R2, RMSE and MAPE are
used to quantitatively assess the performance of each model (Table 7).
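
To make the fitting and scoring steps concrete, the sketch below fits a one-predictor least-squares model (PM10 from NO2) and computes R2, RMSE and MAPE. It is a simplified illustration under stated assumptions: the paper's models use up to three predictors, the data here is synthetic, and all function names are hypothetical.

```python
import math


def fit_simple_linear(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error; undefined if any actual value is 0."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic training pairs (not real station data).
no2 = [20.0, 30.0, 40.0, 50.0, 60.0]
pm10 = [70.0, 100.0, 115.0, 150.0, 165.0]
intercept, slope = fit_simple_linear(no2, pm10)
predicted = [intercept + slope * v for v in no2]
scores = (r_squared(pm10, predicted), rmse(pm10, predicted), mape(pm10, predicted))
```

A multi-predictor model such as {SO2, NO2, CO} would need a multiple regression solver, but it would be scored with the same three functions.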
Table 7. Validation results of the 7 linear regression models on the
training dataset (100% of PM10 missing).

Parameters for model | R2 | RMSE* (ug/m3) | MAPE* (%)
SO2 | 0.3 | 75.6 | 80
NO2 | 0.35 | 72.6 | 74.7
CO | 0.2 | 80.5 | 87.5
SO2, NO2 | 0.43 | 67.9 | 68.9
SO2, CO | 0.4 | 69.5 | 71.8
NO2, CO | 0.35 | 72.6 | 74.7
SO2, NO2, CO | 0.43 | 67.6 | 68.8
* Predicted PM10 and actual PM10

Based on the results, a priority is set for each model when applying it
to real-life problems, as shown in Table 8.
Table 8. Models ordered by priority.

Parameters for model | Linear regression equation | Priority
SO2, NO2, CO | Y = -8.98 + 2.02*SO2 + 1.35*NO2 + 0.011*CO | 1
SO2, NO2 | Y = 0.79 + 1.87*SO2 + 1.80*NO2 | 2
SO2, CO | Y = -1.95 + 2.59*SO2 + 0.028*CO | 3
NO2, CO | Y = 20.5 + 2.51*NO2 - 0.0004*CO | 4
NO2 | Y = 20.2 + 2.5*NO2 | 5
SO2 | Y = 52.9 + 3.01*SO2 | 6
CO | Y = 42.5 + 0.04*CO | 7

Record status | Linear regression model number
Records with SO2, NO2 and CO all available | 1
Records missing SO2 | 4
Records missing NO2 | 3
Records missing CO | 2
Records missing SO2, NO2 | 7
Records missing SO2, CO | 5
Records missing NO2, CO | 6
Records missing all of SO2, NO2, CO | Cannot predict

Next, we validated the models on the testing dataset. This dataset has
no missing PM10, CO or NO2 values, but SO2 is missing in 170 of 744
records, which makes it a good basis for assessment. Assuming 100% of
the PM10 data is missing from the testing dataset, the correlation
coefficient between the predicted and actual PM10 values is 0.56, and
RMSE and MAPE are 51 ug/m3 and 45% respectively (Table 10). This result
is acceptable because it ensures data completeness and the R, RMSE and
MAPE values are at a medium level. The MAPE in this case is smaller than
the MAPE values in Table 7 because two linear regression models were
applied to predict PM10, so the error rate is smaller than when only one
model is used.
Table 10. Results after filling missing PM10 in the testing dataset,
assuming that 100% of the PM10 data is missing in 01/2012.

Number of records {NO2, SO2, CO} | Number of records {NO2, CO} | Correlation coefficient* | RMSE* (ug/m3) | MAPE* (%)
574 | 170 | 0.56 | 51.4 | 45.3
* Predicted PM10 and actual PM10
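
The rule of picking a regression model according to which predictors a record contains can be sketched as a lookup over the equations of Table 8. This is an illustrative sketch, not the authors' code: records are assumed to be dictionaries with None for missing values, the function names are hypothetical, and the negative sign of the CO term in model 4 is an assumption about the source.

```python
# Intercepts and coefficients transcribed from Table 8, keyed by model number.
MODELS = {
    1: (-8.98, {"SO2": 2.02, "NO2": 1.35, "CO": 0.011}),
    2: (0.79, {"SO2": 1.87, "NO2": 1.80}),
    3: (-1.95, {"SO2": 2.59, "CO": 0.028}),
    4: (20.5, {"NO2": 2.51, "CO": -0.0004}),  # sign of the CO term is assumed
    5: (20.2, {"NO2": 2.5}),
    6: (52.9, {"SO2": 3.01}),
    7: (42.5, {"CO": 0.04}),
}


def choose_model(record):
    """Return the highest-priority model whose predictors are all present, else None."""
    for number in sorted(MODELS):
        _, coefficients = MODELS[number]
        if all(record.get(name) is not None for name in coefficients):
            return number
    return None


def predict_pm10(record):
    """Predict PM10 with the chosen model, or return None if no predictor is available."""
    number = choose_model(record)
    if number is None:
        return None
    intercept, coefficients = MODELS[number]
    return intercept + sum(c * record[name] for name, c in coefficients.items())


# Illustrative record with SO2 missing, as in 170 records of the testing dataset.
record = {"SO2": None, "NO2": 40.0, "CO": 1000.0}
model_number = choose_model(record)
estimate = predict_pm10(record)
```

Scanning the models in priority order gives the same model for each missing-data case as the record-status mapping in this section. For the testing dataset, that means model 1 for the 574 complete records and model 4 for the 170 records without SO2.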
In practice, NO2, SO2 and CO are not always available; they can be
missing just like PM10, so a decision is needed on which model to use.

A test was also performed to evaluate the impact of the missing data
rate. Different PM10 missing rates are assumed: 10%, 20%, 30%, 40% and
50%. For each missing rate, 10 datasets are randomly generated. The
average results of the 10 assessments performed on these datasets are
reported in Table 11.
Table 11. PM10 missing filling results for different missing rates in
the testing dataset.

Missing percent | 10% | 20% | 30% | 40% | 50%
Total records | 744 | 744 | 744 | 744 | 744
Correlation coefficient* | 0.94 | 0.91 | 0.86 | 0.78 | 0.75
RMSE* (ug/m3) | 15.75 | 20.77 | 24.92 | 34.29 | 36.8
MAPE* (%) | 4.93 | 9.10 | 13.46 | 18.04 | 23.06
* Predicted PM10 and actual PM10
In general, the results differ significantly across missing rates.
Specifically, when 10% of PM10 is missing, R, RMSE and MAPE are 0.94,
15.75 ug/m3 and 4.9%, the best recovery result in the test. With 20% to
30% of the data missing, MAPE and RMSE range from 9 to 13% and from 20
to 25 ug/m3 respectively. When the missing rate is higher than 30%, MAPE
increases from 18% to 23%. The worst case is the 50% missing rate, with
an RMSE of 36.8 ug/m3 and R = 0.75. These results suggest that the
missing filling process is best performed when the missing rate is 30%
or less. Overall, the results show the potential of applying the method
to real-life problems.
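
The missing-rate experiment can be sketched as repeated random masking. The sketch rests on an assumption that the paper does not state explicitly: the metrics are computed over the whole series after filling, so records that were never masked contribute zero error. The function name, seed and data are hypothetical.

```python
import math
import random


def evaluate_missing_rate(actual, predicted, rate, trials=10, seed=0):
    """Average RMSE and MAPE over `trials` random maskings at the given rate."""
    rng = random.Random(seed)
    n = len(actual)
    k = round(rate * n)  # number of records treated as missing
    rmse_sum = mape_sum = 0.0
    for _ in range(trials):
        masked = set(rng.sample(range(n), k))
        # Masked records take the model prediction; the rest keep the observation.
        filled = [predicted[i] if i in masked else actual[i] for i in range(n)]
        rmse_sum += math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, filled)) / n)
        mape_sum += 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, filled)) / n
    return rmse_sum / trials, mape_sum / trials


# Synthetic series: every prediction is off by 10 ug/m3 (illustrative only).
actual = [100.0] * 20
predicted = [110.0] * 20
rmse_10, mape_10 = evaluate_missing_rate(actual, predicted, rate=0.10)
rmse_50, mape_50 = evaluate_missing_rate(actual, predicted, rate=0.50)
```

Under this assumption both errors grow with the missing rate even though the model is unchanged, which is the qualitative pattern of Table 11. Scoring only the masked records would be an equally valid design and would isolate the model error from the missing rate.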


V. CONCLUSION
In this paper, we propose a framework for automatic environmental data,
from data collection to building a dataset with a standardized structure
and acceptable quality. The proposed workflow consists of data
collection, data overview, noise detecting, missing filling and
evaluation. Different techniques at each stage are also introduced and
experimentally evaluated.
Although the framework is an overall process, analysts can customize
every step of it. The framework still has some unresolved issues,
including the historical knowledge needed for noise removal and the
completeness of the other environmental indicators used to estimate
missing PM10 values. In future work, we plan to use data from
meteorology or weather stations in the same area to employ more
environmental indicators and improve the overall quality of the
workflow.

ACKNOWLEDGMENT

REFERENCES
[1] Pham Duy Hien. Current status and laws of changes of air quality in
Hanoi, 03/2006.
[2] The Ministry of Natural Resources and Environment Vietnam. National
environmental report in 2013.
[3] The Ministry of Natural Resources and Environment Vietnam. National
environmental report in 2010.
[4] Ngo Tho Hung. Urban Air Quality Modelling and Management in Hanoi,
Vietnam. PhD Thesis, Aarhus University, 2010.
[5] Clean Air Initiative for Asian Cities (CAI-Asia) Center. Viet Nam:
Air Quality Profile, 2010 Edition.
[6] Cao Dung Hai, Nguyen Thi Kim Oanh. Effects of local, regional
meteorology and emission sources on mass and compositions of particulate
matter in Hanoi. Atmospheric Environment, Volume 78, October 2013, Pages
105–112.
[7] Nguyen Tran Huong Giang, Nguyen Thi Kim Oanh. Roadside levels and
traffic emission rates of PM2.5 and BTEX in Ho Chi Minh City, Vietnam.
Atmospheric Environment, Volume 94, September 2014, Pages 806–816.
[8] Dang Manh Doan, Tran Thi Dieu Hang, Phan Ban Mai. The situation of
air pollution in Hanoi and recommendations to reduce pollution.
Institute of Meteorology, Hydrology and Environment, 2007.
[9] Jung-Moon Yoo, Yu-Ri Lee, Dongchul Kim, Myeong-Jae Jeong, William R.
Stockwell, Prasun K. Kundu, Soo-Min Oh, Dong-Bin Shin, Suk-Jo Lee. New
indices for wet scavenging of air pollutants (O3, CO, NO2, SO2, and
PM10) by summertime rain. Atmospheric Environment, Volume 82, January
2014, Pages 226–237.
[10] Ping Wang, Junji Cao, Xuexi Tie, Gehui Wang, Guohui Li, Tafeng Hu,
Yaoting Wu, Yunsheng Xu, Gongdi Xu, Youzhi Zhao, Wenci Ding, Huikun Liu,
Rujin Huang, Changlin Zhan. Impact of Meteorological Parameters and
Gaseous Pollutants on PM2.5 and PM10 Mass Concentrations during 2010 in
Xi’an, China. Aerosol and Air Quality Research, 15: 1844–1854, 2015.
[11] Gharehchahi E, Mahvi AH, Amini H, Nabizadeh R, Akhlaghi AA,
Shamsipour M, et al. Health impact assessment of air pollution in
Shiraz, Iran: a two-part study. J Environ Health Sci Eng. 2013; 11: 1–8.
[12] Brauer M, Amann M, Burnett RT, Cohen A, Dentener F, Ezzati M, et
al. Exposure assessment for estimation of the global burden of disease
attributable to outdoor air pollution. Environ Sci Technol. 2012; 46:
652–660.
[13] Dragan M. Marković, Dragan A. Marković, Anka Jovanović, Lazar
Lazić, Zoran Mijić. Determination of O3, NO2, SO2, CO and PM10 measured
in Belgrade urban area. Environmental Monitoring and Assessment, October
2008, Volume 145, Issue 1, pp 349–359.
[14] M. M. Mukaka. A Guide to Appropriate Use of Correlation Coefficient
in Medical Research. Malawi Medical Journal, Vol. 24, No. 3, 2012, pp.
69–71.


