Data Mining Lecture: Data Preprocessing


Trịnh Tấn Đạt
Faculty of Information Technology, Saigon University

Outline
 Why preprocess the data?
 Descriptive data summarization
 Data cleaning
 Data integration and transformation

 Data reduction
 Discretization and concept hierarchy generation
 Summary


Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of interest, …
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
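
The three kinds of dirtiness listed on this slide can be flagged with a few simple rules. The following is a minimal sketch, assuming each record is a plain dictionary; the field names, the helper `find_issues`, the fixed reference date, and the reading of “03/07/1997” as 3 July 1997 are all illustrative assumptions, not part of the lecture.

```python
from datetime import date

def find_issues(record, today=date(2024, 1, 1)):
    """Return a list of data-quality problems found in one record.

    `today` is a fixed, hypothetical reference date so the result is reproducible.
    """
    issues = []
    # Incomplete: the attribute is absent or blank.
    if not str(record.get("occupation", "")).strip():
        issues.append("incomplete: occupation is missing")
    # Noisy: a salary cannot be negative.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: negative salary")
    # Inconsistent: the stated age disagrees with the age derived from the birthday.
    birthday = record.get("birthday")
    age = record.get("age")
    if birthday is not None and age is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived_age = today.year - birthday.year - (0 if had_birthday else 1)
        if derived_age != age:
            issues.append("inconsistent: age does not match birthday")
    return issues

# Record combining the slide's three examples.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
issues = find_issues(record)
for issue in issues:
    print(issue)
```

Real cleaning rules usually come from domain experts; the checks here only mirror the slide's examples.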


Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was collected and when it is analyzed.
 Human/hardware/software problems

 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission

 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)

 Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
 No quality data, no quality mining results!

 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading statistics.
 Data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse


Multi-Dimensional Measure of Data Quality
 A well-accepted multidimensional view:
 Accuracy
 Completeness
 Consistency
 Timeliness

 Believability
 Value added
 Interpretability
 Accessibility


Data type

 Numeric: the most commonly used data type; the stored content is numeric
 Characters and strings: strings are arrays of characters
 Boolean: for binary data with true and false values
 Time series data: data with time- or sequence-related properties
 Sequential data: the data items themselves have a sequential relationship
 Time series data: each data value changes over time


Data type
 Spatial data: data that include spatially related attributes
 For example, Google Map, Integrated Circuit Design Layout, Wafer Exposure Layout, Global Positioning System (GPS), etc.
 Text data: paragraph descriptions, including patent reports, diagnostic reports, etc.
 Structured data: library bibliographic data, credit card data
 Semi-structured data: email, extensible markup language (XML)
 Unstructured data: social media data such as messages on Facebook
 Multimedia data: pictures, audio, video, etc.; data volumes are massive compared with other types of data, so data compression is needed for storage

Data scale

“A proxy attribute is a variable that is used to represent or stand in for
another variable or attribute that is difficult to measure directly. A
proxy attribute is typically used in situations where it is not possible or
practical to measure the actual attribute of interest. For example, in a
study of income, the amount of money a person earns per year may be
difficult to determine accurately. In such a case, a proxy attribute, such
as education level or occupation, may be used instead.” ChatGPT

 Each variable has a corresponding attribute and a scale used to quantify and measure its level
 natural quantitative scale
 qualitative scale
 When the corresponding attribute of a variable is hard to measure, a proxy attribute can be used as the measurement instead
 Common scales: nominal scale, categorical scale, ordinal scale, interval scale, ratio scale, and absolute scale


Six common scales
 nominal scale: values are used only as codes and have no meaning for mathematical operations
 categorical scale: data are grouped by their characteristics, and each category is marked with a numeric code to indicate the category to which an item belongs
 ordinal scale: expresses the ranking and ordering of the data without establishing the degree of variation between them
 interval scale: also called distance scale; describes numerical differences between values in a meaningful way
 ratio scale: different values can be compared to each other by ratio
 absolute scale: the measured numbers have absolute meaning
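
The practical consequence of these scales is that each one permits different operations. The sketch below is only an illustration under assumed, made-up codings (the blood-type codes, the satisfaction ratings, and the sample temperatures and weights are hypothetical).

```python
from statistics import median, mode

# Nominal: codes only label categories, so counting (the mode) is meaningful,
# but adding or averaging the codes is not.
blood_type_codes = [1, 2, 2, 3, 2]        # hypothetical coding: 1=A, 2=B, 3=O
most_common = mode(blood_type_codes)

# Ordinal: order matters but spacing does not, so the median is meaningful.
satisfaction = [1, 2, 2, 4, 5]            # hypothetical: 1=very low ... 5=very high
middle = median(satisfaction)

# Interval: differences are meaningful, ratios are not (the zero is arbitrary).
celsius = [10.0, 20.0]
diff_c = celsius[1] - celsius[0]          # a 10-degree difference is meaningful
ratio_c = celsius[1] / celsius[0]         # 2.0, yet 20 °C is not "twice as hot"
kelvin = [c + 273.15 for c in celsius]
ratio_k = kelvin[1] / kelvin[0]           # ratio on a scale with a true zero

# Ratio: a true zero makes ratios meaningful.
weights_kg = [40.0, 80.0]
ratio_w = weights_kg[1] / weights_kg[0]   # 80 kg really is twice 40 kg

print(most_common, middle, diff_c, ratio_c, round(ratio_k, 3), ratio_w)
```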

Data inspection
 Goal: inspect the obtained data from different viewpoints to find errors in advance, then correct or remove some of them after discussion with domain experts
 Inspection covers quantitative and qualitative aspects
 Quantitative aspects
 Data inspection: number of samples, number of variables or features, and different data values
 Sample size: too few samples may affect the results, while too many samples may affect statistical significance
 Number of variables: too many variables increase computation time
 Qualitative aspects
 Inspect central tendency (mean, median, etc.) and variability
 Inspect data omissions, data noise, etc. in different graphs
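
A first inspection pass along these lines might look as follows. This is a minimal sketch on a made-up toy dataset; the rows, the variable names, and the use of `None` to mark a missing value are assumptions for illustration.

```python
from statistics import mean, median, pstdev

# Hypothetical toy dataset: each row is one sample, None marks a missing value.
rows = [
    {"age": 25, "income": 3000},
    {"age": 31, "income": None},
    {"age": 47, "income": 5200},
    {"age": None, "income": 4100},
]

# Quantitative aspects: how many samples and variables are there?
n_samples = len(rows)
variables = sorted({name for row in rows for name in row})
n_variables = len(variables)

# Per-variable summary: omissions, distinct values, central tendency, variability.
summary = {}
for name in variables:
    values = [row[name] for row in rows if row.get(name) is not None]
    summary[name] = {
        "missing": n_samples - len(values),
        "distinct": len(set(values)),
        "mean": mean(values),
        "median": median(values),
        "stdev": pstdev(values),
    }

print(n_samples, n_variables)
for name, stats in summary.items():
    print(name, stats)
```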




Data discovery and visualization
 Statistical table: a table built according to specific rules after the data have been organized
 Statistical chart: graphical representation of various characteristics of statistical data in different graphic styles
 Chart types by purpose:
 Frequency: histogram, bar plot, pie chart
 Distribution: box plot, Q-Q plot
 Trends: trend chart
 Relationships: scatter plot
 Different data categories call for different statistical charts
 Categorical data: bar chart applicable
 Continuous data: histogram and pie chart applicable

Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

 Data integration
 Integration of multiple databases, data cubes, or files

 Data transformation
 Normalization and aggregation

 Data reduction
 Obtains reduced representation in volume but produces the same or similar analytical results

 Data discretization
 Part of data reduction but with particular importance, especially for numerical data


Forms of Data Preprocessing


Descriptive data summarization


Mining Data Descriptive Characteristics


 Motivation
 To better understand the data: central tendency, variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.


Measuring the Central Tendency


 Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
 Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Median: A holistic measure
 Middle value if odd number of values, or average of the middle two values otherwise
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
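
These measures can be computed with Python's standard `statistics` module. The sketch below uses illustrative sample values; `weighted_mean` and `empirical_mode` are hypothetical helper names introduced here. The empirical relation is only a rough approximation for moderately skewed unimodal data, so its estimate need not match the observed mode.

```python
from statistics import mean, median, multimode

# Illustrative sample values (an assumption, not data from the lecture).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def empirical_mode(mean_value, median_value):
    """Mode estimated from mean - mode = 3 * (mean - median)."""
    return mean_value - 3 * (mean_value - median_value)

sample_mean = mean(values)        # arithmetic mean
sample_median = median(values)    # average of the two middle values (even count)
sample_modes = multimode(values)  # every most-frequent value; two values => bimodal

print(sample_mean, sample_median, sample_modes)
print(weighted_mean([80, 90], [1, 3]))
print(empirical_mode(sample_mean, sample_median))
```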


Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively and negatively skewed data

Data Mining: Concepts and Techniques


Four moments of distribution: Mean, Variance, Skewness, and Kurtosis
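
The slide body did not survive extraction, so the following is only a sketch of one common convention: population (biased) central moments, with skewness as the standardized third moment and kurtosis as the standardized fourth moment (not excess kurtosis). The helper name `four_moments` is hypothetical, and other texts use sample-corrected variants.

```python
def four_moments(xs):
    """Return (mean, variance, skewness, kurtosis) using population moments."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n   # second central moment (variance)
    m3 = sum((x - mu) ** 3 for x in xs) / n   # third central moment
    m4 = sum((x - mu) ** 4 for x in xs) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5                 # 0 for symmetric data
    kurtosis = m4 / m2 ** 2                   # about 3 for a normal distribution
    return mu, m2, skewness, kurtosis

symmetric = four_moments([1, 2, 3, 4, 5])
right_skewed = four_moments([1, 1, 1, 10])
print(symmetric)
print(right_skewed)
```

A positive skewness indicates a longer right tail and a negative one a longer left tail.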


Measuring the Dispersion of Data


 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, M, Q3, max
 Boxplot: ends of the box are the quartiles, median is marked, whiskers extend from the box, and outliers are plotted individually
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
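
The 1.5 × IQR rule can be sketched with `statistics.quantiles`. The sample values and the helper name `iqr_outliers` are illustrative assumptions. Quartile conventions differ between tools; this sketch uses the `"inclusive"` method, so other software may report slightly different Q1 and Q3.

```python
from statistics import quantiles

def iqr_outliers(xs, k=1.5):
    """Return (q1, q3, outliers) using the k * IQR fences around the quartiles."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return q1, q3, outliers

# Illustrative sample values (an assumption, not data from the lecture).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q3, outliers = iqr_outliers(values)
print(q1, q3, outliers)
```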

 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
 Standard deviation s (or σ) is the square root of variance s² (or σ²)
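
The two forms of each variance formula are algebraically equivalent, which is what makes a single-pass ("scalable") computation possible: only the running sums of x and x² need to be kept. The sketch below checks this on illustrative sample values (an assumption). In floating point the sum-of-squares form can lose precision when the mean is large relative to the spread.

```python
from math import isclose, sqrt
from statistics import pvariance, variance

# Illustrative sample values (an assumption, not data from the lecture).
xs = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
n = len(xs)
x_bar = sum(xs) / n

# Sample variance, definition form: squared deviations from the mean.
s2_definition = sum((x - x_bar) ** 2 for x in xs) / (n - 1)

# Sample variance, sum-of-squares form: needs only sum(x) and sum(x^2).
sum_x = sum(xs)
sum_x2 = sum(x * x for x in xs)
s2_one_pass = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Population variance, treating xs as the whole population (N = n).
sigma2 = sum_x2 / n - x_bar ** 2

s = sqrt(s2_definition)   # sample standard deviation
sigma = sqrt(sigma2)      # population standard deviation
print(s2_definition, s2_one_pass, sigma2, s, sigma)
```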


Example


Properties of Normal Distribution Curve
 The normal (distribution) curve
 From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
 From μ–2σ to μ+2σ: contains about 95% of them
 From μ–3σ to μ+3σ: contains about 99.7% of them
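
These percentages follow from the normal cumulative distribution: the probability of falling within k standard deviations of the mean is erf(k/√2). A minimal check, where the function name `within_k_sigma` is introduced here for illustration:

```python
from math import erf, sqrt

def within_k_sigma(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k) * 100, 2))
```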


Boxplot Analysis
 Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extend to Minimum and Maximum
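
The numbers a boxplot draws can be computed directly. This is a minimal sketch; the helper name `five_number_summary` and the sample data are assumptions, and the `"inclusive"` quartile method is one of several conventions in use.

```python
from statistics import quantiles

def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum) for the given values."""
    data = sorted(xs)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]

minimum, q1, med, q3, maximum = five_number_summary(range(1, 10))
box_height = q3 - q1   # the IQR, i.e. the height of the box
print(minimum, q1, med, q3, maximum, box_height)
```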


Visualization of Data Dispersion: Boxplot Analysis
