Trịnh Tấn Đạt
Faculty of Information Technology, Saigon University
Outline
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, …
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
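The three kinds of dirtiness above can be checked mechanically. The following is a minimal sketch, not a complete cleaning routine: the records, the field names, the reference date, and the reading of “03/07/1997” as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "occupation": "", "salary": 52000, "age": 42, "birthday": date(1997, 7, 3)},
    {"id": 2, "occupation": "engineer", "salary": -10, "age": 30, "birthday": date(1994, 1, 15)},
    {"id": 3, "occupation": "teacher", "salary": 41000, "age": 35, "birthday": date(1989, 5, 20)},
]

def age_on(birthday, today):
    """Age in whole years on the reference date."""
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday

def find_issues(record, today):
    """Return a list of data-quality problems found in one record."""
    issues = []
    if not record["occupation"].strip():
        issues.append("incomplete: occupation is empty")
    if record["salary"] < 0:
        issues.append("noisy: salary is negative")
    if age_on(record["birthday"], today) != record["age"]:
        issues.append("inconsistent: age does not match birthday")
    return issues

REFERENCE_DATE = date(2024, 6, 1)  # assumed analysis date
report = {r["id"]: find_issues(r, REFERENCE_DATE) for r in records}
for record_id, issues in report.items():
    print(record_id, issues)
```

Each check corresponds to one bullet: an empty attribute value (incomplete), an impossible value (noisy), and two attributes that contradict each other (inconsistent).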
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected and when it is
analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprises the majority of the
work of building a data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Data type
Numeric: the most commonly used data type; the stored content is numeric
Characters and strings: strings are arrays of characters
Boolean: for binary data with true and false values
Time series data: data with time- or sequence-related properties
Sequential data: the data items themselves have a sequential relationship
Time series data: each data item changes over time
Spatial data: for data that include spatially related attributes
For example, Google Map, Integrated Circuit Design Layout, Wafer Exposure
Layout, Global Positioning System (GPS), etc.
Text data: for paragraph description, including patent reports, diagnostic
reports, etc.
Structured data: library bibliographic data, credit card data
Semi-structured data: email, extensible markup language (XML)
Unstructured data: social media data, e.g., messages on Facebook
Multimedia data: pictures, audio, video, etc.; volumes are large compared with
other types of data, so data compression is needed for storage
Data scale
“A proxy attribute is a variable that is used to represent or stand in for
another variable or attribute that is difficult to measure directly. A
proxy attribute is typically used in situations where it is not possible or
practical to measure the actual attribute of interest. For example, in a
study of income, the amount of money a person earns per year may be
difficult to determine accurately. In such a case, a proxy attribute, such
as education level or occupation, may be used instead.” ChatGPT
Each variable has a corresponding attribute and a scale used to quantify and
measure its level
natural quantitative scale
qualitative scale
When it is hard to find a directly measurable attribute for a variable, a proxy
attribute can be used as the measurement instead
Common scales: nominal scale, categorical scale, ordinal scale, interval scale,
ratio scale, and absolute scale
Six common scales
nominal scale: only used as codes, where the values have no meaning for
mathematical operations
categorical scale: data are grouped according to their characteristics, and each
category is marked with a numeric code to indicate the category to which it belongs
ordinal scale: expresses the ranking and ordering of the data without
establishing the degree of variation between them
interval scale: also called distance scale; describes numerical differences
between different numbers in a meaningful way
ratio scale: different numbers can be compared to each other by ratio
absolute scale: the numbers measured have absolute meaning
Data inspection
Goal: Inspect the obtained data from different viewpoints to find errors in
advance, and then correct or remove some of them after discussion with domain
experts
Data are categorized into quantitative and qualitative aspects
Quantitative data
Data inspection: number of samples, number of variables or features, and different data
values
Sample size: too few samples may make the results unreliable, while too many samples may
affect statistical significance (with very large samples even tiny effects test as significant)
Number of variables: too many may lead to long computation times
Qualitative data
Inspect central tendency (mean, median, etc.) and variability
Inspect missing data, data noise, etc. in different graphs
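The inspection steps above can be sketched in a few lines of standard-library Python. The table, the variable names, and the use of `None` to mark a missing value are assumptions made for this illustration.

```python
import statistics

# Hypothetical table: each row is one sample; None marks a missing value.
rows = [
    {"temperature": 21.5, "pressure": 1.01},
    {"temperature": 22.1, "pressure": None},
    {"temperature": 20.9, "pressure": 0.99},
    {"temperature": None, "pressure": 1.02},
    {"temperature": 21.8, "pressure": 1.00},
]

def inspect(rows):
    """Count samples and variables, then summarize each variable."""
    variables = sorted({name for row in rows for name in row})
    summary = {
        "n_samples": len(rows),
        "n_variables": len(variables),
        "per_variable": {},
    }
    for name in variables:
        present = [row[name] for row in rows if row.get(name) is not None]
        summary["per_variable"][name] = {
            "missing": len(rows) - len(present),    # data omissions
            "mean": statistics.mean(present),       # central tendency
            "median": statistics.median(present),
            "stdev": statistics.stdev(present),     # variability
        }
    return summary

summary = inspect(rows)
print(summary["n_samples"], summary["n_variables"])
for name, stats in summary["per_variable"].items():
    print(name, stats)
```

A summary like this is only a first pass; which values to correct or remove is still a decision to make with domain experts.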
Data discovery and visualization
Statistical table: a table made according to specific rules after organizing the data
Statistical chart: graphical representation of various characteristics of statistical data
in different graphic styles
Data Type:
Frequency: histogram, bar plot, pie chart
Distribution: box plot, Q-Q plot
Trends: trend chart
Relationships: scatter plot
Different data categories have different statistical charts
Categorical data: bar chart and pie chart applicable
Continuous data: histogram applicable
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical
results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Forms of Data Preprocessing
Descriptive data summarization
Mining Data Descriptive Characteristics
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median: A holistic measure
Middle value if odd number of values, or average of the middle two values
otherwise
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula (for unimodal, moderately skewed data):
mean − mode ≈ 3 × (mean − median)
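The measures on this slide can be computed with Python's standard `statistics` module. This is a minimal sketch: the data values and the helper name `weighted_mean` are chosen here for illustration and do not come from the slides.

```python
import statistics

# Illustrative data set (assumed values, already sorted for readability).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    total_weight = sum(ws)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(xs, ws)) / total_weight

mean = statistics.mean(values)        # sum of the values divided by n
median = statistics.median(values)    # average of the two middle values (n is even)
modes = statistics.multimode(values)  # every value tied for the highest frequency

print(mean, median, modes)
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```

With equal weights the weighted mean reduces to the ordinary mean. This data set has two modes, so it is bimodal, and the empirical mean/median/mode relation is not expected to hold for it.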
Symmetric vs. Skewed Data
Median, mean and mode of symmetric,
positively and negatively skewed data
Data Mining: Concepts and Techniques
Four moments of distribution: Mean, Variance, Skewness, and
Kurtosis
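As a rough illustration of the four moments, the sketch below uses the plain population (moment-based) definitions: skewness as m3 / m2^1.5 and kurtosis as m4 / m2^2, where mk is the k-th central moment. These definitions are an assumption made here; the slides do not state which variant they use, and statistical libraries often report bias-corrected values or excess kurtosis (kurtosis minus 3) instead.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean)^k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def four_moments(values):
    """Return (mean, variance, skewness, kurtosis), population definitions."""
    mean = sum(values) / len(values)
    m2 = central_moment(values, 2)
    m3 = central_moment(values, 3)
    m4 = central_moment(values, 4)
    skewness = m3 / m2 ** 1.5   # 0 for symmetric data
    kurtosis = m4 / m2 ** 2     # 3 for a normal distribution
    return mean, m2, skewness, kurtosis

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]

print(four_moments(symmetric))
print(four_moments(right_skewed))
```

Symmetric data give a skewness of zero, while a long right tail gives a positive value.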
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, the median is marked, whiskers are added, and
outliers are plotted individually
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
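Variance: (algebraic, scalable computation)
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$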
Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)
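The two forms of the variance (the deviation form and the sum-of-squares form) can be checked against each other numerically. The sketch below is illustrative only: the data values and function names are assumptions, and the one-pass form can lose precision through cancellation when the mean is large relative to the spread.

```python
import math
import statistics

# Illustrative data set (assumed values).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def sample_variance_two_pass(xs):
    """s^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    """s^2 = [sum(x_i^2) - (sum(x_i))^2 / n] / (n - 1); needs only running sums."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)

def population_variance(xs):
    """sigma^2 = sum(x_i^2) / N - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu

s2 = sample_variance_two_pass(data)
print(s2, sample_variance_one_pass(data), statistics.variance(data))
print(population_variance(data), statistics.pvariance(data))
print(math.sqrt(s2), statistics.stdev(data))  # standard deviation = sqrt(variance)
```

The one-pass form is what makes the computation scalable: the sums of x and x² can be accumulated in a single scan or merged across partitions.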
Example
Properties of Normal Distribution Curve
The normal (distribution) curve
From μ–σ to μ+σ: contains about 68% of the measurements (μ:
mean, σ: standard deviation)
From μ–2σ to μ+2σ: contains about 95% of it
From μ–3σ to μ+3σ: contains about 99.7% of it
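These percentages follow from the normal cumulative distribution function: the probability of falling within k standard deviations of the mean is erf(k / √2). A small check, with the function name `normal_coverage` chosen here for illustration:

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4f}")
```

The result does not depend on the particular values of μ and σ, which is why the rule applies to any normal distribution.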
Boxplot Analysis
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the
height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and
Maximum
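The five-number summary and the usual 1.5 × IQR outlier rule can be sketched with `statistics.quantiles`. The data values are assumed for illustration, and the quartiles depend on the interpolation convention: `method="inclusive"` is one common choice, and other tools may return slightly different Q1 and Q3.

```python
import statistics

# Illustrative data set (assumed values).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, median, q3, max(xs)

def iqr_outliers(xs, factor=1.5):
    """Values more than factor * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    return [x for x in xs if x < lower_fence or x > upper_fence]

summary = five_number_summary(data)
outliers = iqr_outliers(data)
print(summary)
print(outliers)
```

Many plotting tools draw the whiskers only to the most extreme values inside the fences and plot the remaining points individually, rather than extending the whiskers to the minimum and maximum.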
Visualization of Data Dispersion: Boxplot Analysis