
VIETNAM NATIONAL UNIVERSITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

GRADUATION THESIS
OPENK: DATA CLEANSING SYSTEM - A
CLUSTERING-BASED APPROACH FOR
DETECTING DATA ANOMALIES
Council:

Information System

Instructor:

Assoc. Prof. Dr. Dang Tran Khanh

Reviewer:

Dr. Phan Trong Nhan

Student:

Nguyen Dinh Khuong - 1752306

Ho Chi Minh City, August 2021


THESIS ASSIGNMENT
FACULTY: Computer Science & Engineering
MAJOR: Computer Science (KHMT)
STUDENT ID (MSSV): 1752306

1. Thesis title: OPENK: DATA CLEANSING SYSTEM - A CLUSTERING-BASED APPROACH FOR DETECTING DATA ANOMALIES
2. Tasks:
- Study the requirements, analysis, design, and implementation of a data cleansing system running on a web application platform. Research and apply edit-based similarity algorithms, using knowledge and methodologies from Algorithm Design and Analysis, Database Management Systems, Clustering Methods, and Web Development to provide a reasonable, optimized approach to detecting and clustering anomalous data, ready for the next processing steps.
- Read scientific papers and propose a solution to prevent inconsistent and duplicate data based on clustering methods.
- Research related works - other data cleansing systems such as GoogleRefine, BigDansing, NADEEF, ... - thereby making reasonable assessments and comparisons of the advantages and disadvantages of the current system, and afterwards developing further functions, performance, and system optimization.
- Apply K-NN methods (LD, Damerau LD, Hamming), Similarity methods (Jaro, Jaro-Winkler), and Key Collision methods (Fingerprint, N-gram Fingerprint) for detecting and clustering.
- Test and evaluate the proposed system.

3. Date of assignment: 02/02/2021
4. Date of completion: 26/07/2021
5. Supervisor: Assoc. Prof. Dr. Dang Tran Khanh (part supervised: entire thesis)

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

August 10, 2021

THESIS DEFENSE EVALUATION SHEET
(For the supervisor)

1. Student: Nguyễn Đình Khương - Student ID: 1752306 - Major: Computer Science
2. Thesis title: OPENK: DATA CLEANSING SYSTEM - A CLUSTERING-BASED APPROACH FOR DETECTING DATA ANOMALIES
3. Supervisor: Assoc. Prof. Dr. Đặng Trần Khánh
4. Overview of the report:
Number of pages: - Number of chapters: - Number of data tables: - Number of figures: - Number of references: -
Software used: Windows, Python, ...
Artifacts (products): -
5. Overview of drawings:
- Number of drawings: - (A1: - A2: - Other sizes: -)
- Number of hand drawings: - Number of computer drawings: -
6. Main strengths of the thesis:
Developed a cleansing tool for improving (big) data quality in order to achieve high utility in businesses. Moreover, the student finished the following:
- Studying Pandas, Numpy, the JSON Python library, and other relevant programming tools.
- Investigating algorithms for measuring text similarity using different methods.
- Studying data cleansing and validation tools such as OpenRefine and Cerberus.
- Reading scientific papers and proposing a solution to prevent inconsistent and duplicate data based on the clustering method.
- Building a visualization method for users to have a better view of the collected data.
- Building an API-based library for the developer community.
7. Main shortcomings of the thesis:
The thesis presentation can be improved.
8. Recommendation: Approved for defense □ Additional work required before defense □ Not approved for defense □
9. Three questions the student must answer before the Council:
a. Point out a better functionality of OPENK compared with the known existing works/systems?
10. Overall assessment (excellent, good, fair, average): Excellent

Score: 10/10
Signature (full name)

Assoc. Prof. Dr. Đặng Trần Khánh


FACULTY OF COMPUTER SCIENCE AND ENGINEERING

August 3, 2021

THESIS DEFENSE EVALUATION SHEET
(For the reviewer)

1. Student: Nguyễn Đình Khương - Student ID: 1752306 - Major: Computer Science
2. Thesis title: Openk: Data Cleansing System - A Clustering-based Approach for Detecting Data Anomalies
3. Reviewer: Dr. Phan Trọng Nhân
4. Main strengths of the thesis:
- The student has developed a web application that supports users in recognizing data anomalies by a clustering-based approach with some built-in methods.
- The student has employed modern technologies for development, such as Flask, Jinja, Pandas, Numpy, HTML, CSS, and JavaScript, and performed some basic experiments (loading time, error, running time).
- The system can connect to files and cloud-based database management systems.
5. Main shortcomings of the thesis:
- The way of identifying data anomalies based on a clustering approach does not really show the anomalies; for example, it shows two different texts as abnormal.
- The evaluation and comparison are simple and oriented towards time rather than accuracy. In addition, it is not clearly shown how effectively the system helps in anomaly detection.
- The system is inflexible when adding more methods. Moreover, how to choose the parameter values is a problem for the user (e.g., the k parameter, the limit on records loaded from the Azure database).
6. Questions the student must answer before the Council:
a. Would you please show a use-case in which a user can benefit from your system?
b. Any comparison with some related work (e.g., OpenRefine)?
c.
7. Overall assessment: Good

Score: 9/10
Signature (full name)

Dr. Phan Trọng Nhân


Acknowledgements
First and foremost, I would like to thank my supervisor, Assoc. Prof. Dr. Dang Tran Khanh, not only for his academic guidance and assistance, but also for his patience and personal support, for which I am truly grateful.
I guarantee that this research is my own, conducted under the supervision and guidance of Assoc. Prof. Dr. Dang Tran Khanh. The results of my research are legitimate and have not been published in any form prior to this. All materials used within this research were collected by myself from various sources and are appropriately listed in the references section. In addition, within this research, I also used the results of several other authors and organizations; they have all been aptly referenced. In any case of plagiarism, I stand by my actions and am responsible for them. Ho Chi Minh City University of Technology is therefore not responsible for any copyright infringements conducted within my research.


Abstract
At the moment, massive amounts of data are created every second over the internet, and making the most efficient decisions has become a critical goal. Even if we had all of the information, extracting the valuable knowledge from it would be extremely difficult, for the following reasons: data is not always clean, or even correct, since data obtained from many sources may be redundant and some of it may be duplicated. These data must be cleaned before they can be utilized for further processing.
Any inconsistencies or duplication in the datasets should be detected using a detection procedure. Windowing, blocking, and machine learning are among the methods utilized to identify anomalous data. The goal of this thesis is to offer OpenK, a simple yet efficient data cleansing system based on clustering approaches. In this scenario, a cluster comprises all values that are similar under similarity-based assumptions, as detected by several techniques: Nearest Neighbor (Levenshtein Distance, Damerau-Levenshtein Distance, Hamming Distance), Similarity Measurement (Jaro Similarity, Jaro-Winkler Similarity), and Key Collision (Fingerprints, N-gram Fingerprints). The tool is evaluated to measure its efficiency and is compared with another tool for a better assessment. We used the airlines dataset from acamp.com/production/repositories/5737/datasets and, as a special case study, a Real Estate dataset crawled from the web. OpenK also aids the user in loading and viewing data. Besides that, CRUD procedures, pagination, toggling columns ON/OFF, sorting columns, and searching keywords are provided for analyzing and wrangling input data.
Keywords: Data Cleansing, Levenshtein Distance, Jaro-Winkler Similarity, Fingerprints, Anomaly Detection


Contents

1 Introduction ... 10
  1.1 Problem Statement ... 10
  1.2 Objective ... 10
  1.3 Scope ... 10
  1.4 Thesis Structure ... 11

2 Theoretical Background ... 13
  2.1 Related Works ... 13
  2.2 Data Anomalies Detection ... 13
    2.2.1 Conception ... 13
    2.2.2 Existing methods ... 14
  2.3 Clustering Methods ... 15
    2.3.1 Key Collision ... 15
      2.3.1.a Fingerprint ... 16
      2.3.1.b N-gram Fingerprint ... 17
    2.3.2 Nearest neighbors ... 18
      2.3.2.a Hamming distance ... 19
      2.3.2.b Levenshtein distance ... 20
      2.3.2.c Damerau-Levenshtein distance ... 22
      2.3.2.d Jaro Distance - Jaro-Winkler Distance ... 23

3 Methodologies And Design ... 26
  3.1 General Architecture ... 26
    3.1.1 Main components ... 26
    3.1.2 Detecting anomaly execution flow ... 28
    3.1.3 Use-case of clustering data site ... 29
      3.1.3.a Actor determination and following use-case ... 29
      3.1.3.b Use-case diagram and specification ... 30
  3.2 Existing System and Design ... 36

4 System Implementation ... 39
  4.1 Technologies and Framework ... 39
  4.2 Function implementation ... 40

5 System Evaluation ... 50

6 Thesis Denouement ... 54
  6.1 Achievements ... 54
  6.2 Assessment of Thesis Connotation ... 55
  6.3 Future Advancement ... 55

7 APPENDIX A : USER MANUAL ... 59


List of Figures

2.1 Example of applying fingerprint algorithm for name ... 16
2.2 Formula of Hamming distance calculation ... 19
2.3 3-bit binary cube for finding Hamming distance ... 20
2.4 Levenshtein Distance calculation formula ... 21
2.5 Example of Levenshtein distance calculation table ... 22
2.6 Example of Damerau-Levenshtein distance calculation table ... 22
2.7 Formula of Jaro similarity calculation ... 23
2.8 Jaro-Winkler similarity calculation example ... 24
2.9 Comparison of barcode correction using different techniques ... 25
3.1 Overall architecture of OpenK system ... 26
3.2 Data type format converter ... 27
3.3 Data cleansing component illustration ... 27
3.4 Clustering Operations illustration ... 28
3.5 Activity diagram of OpenK system ... 28
3.6 Use case diagram of Data site of OpenK system ... 31
3.7 Use case specification of viewing data ... 32
3.8 Use case specification of paging data ... 32
3.9 Use case specification of searching data keywords ... 32
3.10 Use case specification of sorting data column ... 33
3.11 Use case specification of export data ... 33
3.12 Use case specification of hiding column data ... 34
3.13 Use case specification of managing data cluster ... 34
3.14 Use case specification of cluster data using kNN method ... 35
3.15 Use case specification of cluster data using similarity method ... 35
3.16 Use case specification of cluster data using key collision method ... 36
3.17 Overall architecture of BigDansing ... 37
3.18 Overall architecture of NADEEF ... 38
4.1 Relation diagram of OpenK routing system ... 42
4.2 Flow chart diagram of Upload function implementation ... 43
4.3 Flow chart diagram of Data function implementation ... 44
4.4 Class diagram of clustering method ... 45
4.5 Flow of clustering data with KNN class ... 46
4.6 Flow of clustering data with Similarity class ... 47
4.7 Implementation code of clustering data with Fingerprint algorithm ... 48
4.8 Flow of clustering data with Fingerprint algorithm ... 49
5.1 Time performance for loading & visualizing input dataset of OpenK and OpenRefine ... 51
5.2 Time performance for detecting & clustering input dataset of OpenK and OpenRefine ... 52
5.3 Error percentage for detecting & clustering input dataset of OpenK and OpenRefine ... 52



1 Introduction
“Most of the world will make decisions by either guessing or using their gut.
They will be either lucky or wrong.”

– Suhail Doshi, CEO of Mixpanel

1.1 Problem Statement
The enormous amounts of data available to a company or organization influence its business decisions. Data collected from various sources may be dirty, and this affects the precision of predicted results. Data cleansing offers better data quality, which is a great help for the organization in making sure its data is ready for the analysis phase. For these reasons, cleaning the collected data and detecting its anomalies is of the utmost importance.

1.2 Objective
The main objective of this thesis is to develop a cleansing tool for improving data quality in order to achieve high utility in businesses. The system should also be able to resolve some problems that might occur during the cleaning phase, such as unusual or duplicate data. Moreover, this thesis proposes an API library for the developer community.

1.3 Scope
In the scope of this thesis, we will carry out the following tasks:

• Study data analysis tools such as Pandas, Numpy, the JSON Python library, ...
• Study algorithms for measuring text similarity using different methods, such as the Levenshtein distance, the Damerau-Levenshtein distance, the Jaro algorithm, the Jaro-Winkler distance with similarity-based metrics, the fingerprint method, ...
• Study cleansing and data validation tools such as OpenRefine and Cerberus.
• Propose a solution to prevent inconsistent and duplicate data.
• Build a visualization method for users to have a better view of the collected data.
• Build an API library for the developer community.

1.4 Thesis Structure

• Chapter 1: Introduction - covers (i) the problem statement - the problem this thesis aims to solve; (ii) the objective - the missions to be accomplished; (iii) the scope of the thesis - the list of tasks to carry out; and (iv) the structure of the thesis.
• Chapter 2: Theoretical Background - (i) first mentions the related works of other authors, giving an overview of what has been done previously; (ii) then covers data anomalies detection - its conception and existing methods; (iii) finally, the clustering methods - Key Collision and Nearest Neighbors - are described, both their concepts and how to work with them.
• Chapter 3: Methodologies and Design - (i) presents the system objectives, clearly showing the target of the cleansing system; (ii) the general architecture gives an overall view of the system design; (iii) the execution flow shows how the individual statements, components, and function calls of the program are executed or evaluated; (iv) existing systems and designs are reviewed; and (v) achievements are discussed.
• Chapter 4: System Implementation - covers (i) the technologies and frameworks used, namely Pandas, Numpy, the JSON library for Python, fingerprint libraries, and more; and (ii) the function implementation - the functions, classes, and parameters adopted.
• Chapter 5: System Evaluation - covers (i) the measurements - the criteria used to evaluate the efficiency of the system; and (ii) the resulting efficiency scores, also known as the appraisal.

• Chapter 6: Thesis Denouement - the summarization of the thesis, including (i) the final culmination; (ii) the thesis evaluation, to confirm the contribution of the thesis; and (iii) the future advancement for later development of the cleansing system.
• Chapter 7: References - all the referred results and references to other papers are duly noted.


2 Theoretical Background
2.1 Related Works
For the time being, there are many systems and studies on data cleansing. In [1], BigDansing, a big data cleansing system, was presented to tackle the problem of cleaning big data, detecting anomalies, and generating possible fixes for them. Moreover, NADEEF [2] is an open-source single-node platform supporting both declarative and user-defined quality rules; its research efforts targeted the usability of a data cleansing system, but at the expense of performance and scalability. Google Refine, later known as OpenRefine [3], is a powerful Java-based tool that allows users to load data, understand it, clean it up, reconcile it, and augment it with data coming from the web - all from a web browser and the comfort and privacy of the user's own computer. The OpenRefine documentation notes that clustering in OpenRefine works only at the syntactic level (the character composition of the cell value) and, while very useful to spot errors, typos, and inconsistencies, is by no means enough to perform effective semantically-aware reconciliation - but one cannot deny that it is very effective in treating duplicate and inconsistent data points. Finally, Cerberus [4] provides powerful yet simple and lightweight data validation functionality out of the box and is designed to be easily extensible, allowing for custom validation.
In this thesis, we approach the problem of detecting data anomalies using clustering-based techniques, as OpenRefine did previously.

2.2 Data Anomalies Detection

2.2.1 Conception
In data analysis, anomaly detection (also outlier detection) is the identification
of rare items, events or observations which raise suspicions by differing significantly
from the majority of the data. Typically the anomalous items will translate to some kind
of problem such as bank fraud, a structural defect, medical problems or errors in a text.


Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.
In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern
does not adhere to the common statistical definition of an outlier as a rare object, and
many outlier detection methods (in particular unsupervised methods) will fail on such
data, unless it has been aggregated appropriately. Instead, a cluster analysis algorithm
may be able to detect the micro clusters formed by these patterns.
Three broad categories of anomaly detection techniques exist. Unsupervised
anomaly detection techniques detect anomalies in an unlabeled test data set under the
assumption that the majority of the instances in the data set are normal by looking for
instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as “normal” and “abnormal”
and involves training a classifier (the key difference to many other statistical classification problems is the inherent unbalanced nature of outlier detection). Semi-supervised
anomaly detection techniques construct a model representing normal behavior from a
given normal training data set, and then test the likelihood of a test instance to be generated by the utilized model.

2.2.2 Existing methods

There are heretofore several designs [10] for detecting data anomalies:
1. One-class support vector machines [11].
2. Fuzzy logic-based outlier detection.
3. Deviations from association rules and frequent itemsets.
4. Bayesian networks [12].
5. Hidden Markov models (HMMs) [12].
6. Replicator neural networks [12], autoencoders - variational autoencoders [13].
7. Long short-term memory (LSTM) neural networks [14].


The performance of different methods depends heavily on the data set and parameters, and no method shows a systematic advantage over the others when compared across many data sets and parameters [15].
In this context, we use clustering-based techniques for detecting anomalies: Key Collision and K-Nearest Neighbors.

2.3 Clustering Methods
In OpenK, clustering on a column is a great way to look for inconsistencies in your data and fix them. Clustering uses a variety of comparison methods to find text entries that are similar but not exact, then shares those results with you so that you can merge the cells that should match. Where editing a single cell or text facet at a time can be time-consuming and difficult, clustering is quick and streamlined.
Clustering in OpenK always requires the user to merge the cluster and edit its name: the value the user applies is displayed, becomes the cluster's name, and is applied to all the elements of the cluster.
In order to do its analysis clustering performs a lot of cleaning actions behind the
scenes, but only the merges that you accept affect your data. Understanding the various
behind-the-scenes cleanups can assist you in determining which clustering approach is
the most accurate and successful.

2.3.1 Key Collision
Key Collision methods are based on the idea of creating an alternative representation of a value (a “key”) that contains only the most valuable or meaningful part of the
string or a sequence, a word together different ones based on the fact that their key is the
same (hence the name “key collision”).


2.3.1.a Fingerprint
Fingerprint is the least likely to produce false positives, so it is a good place to start. It does the same kind of data cleaning behind the scenes that you might think to do manually: collapsing whitespace into single spaces, converting all uppercase letters to lowercase, discarding punctuation, removing diacritics (e.g. accents) from characters, and splitting up all strings (words) and sorting them alphabetically (so "Khương, Nguyễn Đình" becomes "dinh khuong nguyen").
The process that generates the key from a string value is the following (note that
the order of these operations is significant):
+ Remove leading and trailing whitespace.
+ Change all characters to their lowercase representation.
+ Remove all punctuation and control characters.
+ Normalize extended Western characters to their ASCII representation (for example, "gödel" becomes "godel").

+ Split the string into whitespace-separated tokens.
+ Sort the tokens and remove duplicates.
+ Join the tokens back together.

Figure 2.1: Example of applying fingerprint algorithm for name
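To make the procedure concrete, the following is a minimal Python sketch of this keying process (an illustration of the steps above, not the exact OpenK implementation):

import re
import unicodedata

def fingerprint(value):
    # 1-2. Trim surrounding whitespace and lowercase.
    key = value.strip().lower()
    # 3. Remove punctuation and control characters (keep word characters and spaces).
    key = re.sub(r"[^\w\s]", "", key)
    # 4. Normalize extended Western characters to their ASCII representation.
    key = unicodedata.normalize("NFKD", key).encode("ascii", "ignore").decode("ascii")
    # 5-7. Split into tokens, de-duplicate, sort, and join back together.
    tokens = sorted(set(key.split()))
    return " ".join(tokens)

# Values that differ only in case, accents, punctuation, or token order collide:
print(fingerprint("Gödel, Kurt"))   # 'godel kurt'
print(fingerprint("kurt GODEL"))    # 'godel kurt'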

There are several factors that need to be considered in this method:

Because whitespace is normalized, characters are lowercased, and punctuation is deleted, the fingerprint does not distinguish between values that differ only in these portions. Since these string characteristics matter least for distinguishing meaning, yet tend to be the most variable parts of the strings, removing them gives a considerable advantage in forming clusters.
Since the tokens of the string are sorted, the order in which they are provided does not matter ("Cruise, Tom" and "Tom Cruise" both produce the same fingerprint and end up in the same cluster).
Normalizing extended Western characters reproduces data-entry mistakes made when entering extended characters with an ASCII-only keyboard. Note that this procedure can also lead to false positives: for example, "gödel" and "godél" would both end up with "godel" as their fingerprint, but they are likely to be different names, so this might work less effectively for data sets where extended characters play a substantial differentiation role.

2.3.1.b N-gram Fingerprint
The N-gram fingerprint allows us to set the n value to anything we want, and it will generate n-grams of size n (after cleaning), alphabetize them, and then reassemble them into a fingerprint. A 1-gram fingerprint, for example, will simply sort all of the letters in the cell into alphabetical order by dividing them into segments of one character. A 2-gram fingerprint will locate all two-character segments, eliminate duplicates, alphabetize them, and reassemble them (for example, "banana" yields the 2-grams "ba an na an na", which after de-duplication and sorting become "anbana").
This can aid in matching cells with typos and extra spaces (for example, matching "lookout" and "look out", which the plain fingerprint method does not match because it separates words). The greater the n value, the fewer and smaller the clusters. With 1-grams, keep a watch on values that collide despite being quite different (for example, "wellington" and "Elgin Town" share the same 1-gram fingerprint).
The n-gram fingerprint method does the following:
+ Change all characters to their lowercase representation.
+ Remove all punctuation, whitespace, and control characters.
+ Normalize extended Western characters to their ASCII representation (for example, "gödel" becomes "godel").
+ Obtain all the string's n-grams.
+ Sort the n-grams and remove duplicates.
+ Join the sorted n-grams back together.
So, for example, the 2-gram fingerprint of "Paris" is "arispari" and the 1-gram fingerprint is "aiprs".
Why is this useful? In practice, using big values for n-grams doesn’t yield any
advantage over the previous fingerprint method, but using 2-grams and 1-grams, while
yielding many false positives, can find clusters that the previous method didn’t find even
with strings that have small differences, with a very small performance price.
For example "Krzysztof", "Kryzysztof", and "Krzystof" have different lengths
and different regular fingerprints, but share the same 1-gram fingerprint because they
use the same letters.
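As an illustration, a minimal Python sketch of such an n-gram keyer (again an assumption about how it can be written, not the exact OpenK code) is:

import re
import unicodedata

def ngram_fingerprint(value, n=2):
    # Lowercase, normalize to ASCII, then strip punctuation, whitespace,
    # and control characters.
    key = value.lower()
    key = unicodedata.normalize("NFKD", key).encode("ascii", "ignore").decode("ascii")
    key = re.sub(r"[\W_]", "", key)
    # Obtain all n-grams, de-duplicate, sort, and join them back together.
    grams = sorted({key[i:i + n] for i in range(len(key) - n + 1)})
    return "".join(grams)

print(ngram_fingerprint("Paris", 2))   # 'arispari'
print(ngram_fingerprint("Paris", 1))   # 'aiprs'
print(ngram_fingerprint("Krzysztof", 1) == ngram_fingerprint("Krzystof", 1))  # True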

2.3.2 Nearest neighbors
Nearest neighbors: while key collision methods are very fast, they tend to be either too strict or too lax, with no way to fine-tune how much difference between strings we are willing to tolerate.
The Nearest Neighbor methods (also known as kNN), on the other hand, provide a parameter (the radius, or k) which represents a distance threshold: any pair of strings closer than this value will be binned together, as the sketch below illustrates.
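A minimal sketch of this radius-based binning, assuming some string distance function dist (such as the Levenshtein distance defined in Section 2.3.2.b), might look like:

def radius_clusters(values, dist, radius=2):
    # Each value joins the first cluster whose representative lies within
    # the given radius; otherwise it starts a new cluster of its own.
    clusters = []                          # list of (representative, members) pairs
    for v in values:
        for rep, members in clusters:
            if dist(rep, v) <= radius:
                members.append(v)
                break
        else:
            clusters.append((v, [v]))
    return [members for _, members in clusters]

Note that comparing every value against every cluster representative is quadratic in the worst case; practical implementations prune the candidate pairs first (e.g., by blocking).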


2.3.2.a Hamming distance
Definition
The Hamming distance between two strings of equal length is, in information theory, the number of positions at which the corresponding symbols differ. In other words, it measures the minimum number of substitutions that could have changed one string into the other, or the minimum number of errors. In a more general context, the Hamming distance is one of several string metrics for quantifying the edit distance between two equal-length sequences. It is named after Richard Hamming, an American mathematician.

Figure 2.2: Formula of Hamming distance calculation

Hamming Distance application
Coding theory, especially block codes, in which the equal-length strings are vectors over a finite field, is a key application.
The minimum Hamming distance is used to define several basic coding-theory concepts, for example error-detecting and error-correcting codes. Specifically, a code C is said to be k-error-detecting if and only if the minimum Hamming distance between any two of its codewords is at least k + 1. Consider the code consisting of the two codewords "000" and "111". Because the Hamming distance between these two words is 3, the code is k = 2 error-detecting: an error can be identified as long as at most two bits are flipped. When three bits are inverted, "000" becomes "111" and the error is undetectable.
As a result, a code with minimum Hamming distance d between its codewords can detect up to d − 1 errors and correct up to ⌊(d − 1)/2⌋ errors. The latter number is also known as the code's packing radius or error-correcting capability.
Complexity and example of Hamming Distance
Complexity of Hamming distance is:
+ Worst-case performance : O(n)
+ Best-case performance : O(1)
+ Average performance : O(n)
+ Worst-case space complexity : O(n)

Below is an example of Hamming distance between bits in 3-bit binary cube:

Figure 2.3: 3-bit binary cube for finding Hamming distance
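A direct Python sketch of the definition (provided here for illustration) is:

def hamming_distance(a, b):
    # Defined only for strings of equal length.
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    # Count the positions at which the corresponding symbols differ.
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("000", "111"))          # 3, as in the codeword example above
print(hamming_distance("karolin", "kathrin"))  # 3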

2.3.2.b Levenshtein distance
Definition
Levenshtein distance is an edit distance proposed by the Russian scientist Vladimir Levenshtein in 1965. Taking the character as the editing unit, it is the minimum number of operations (insert, delete, replace) needed to turn one string into another, and it is often used in the similarity calculation of strings. Given two strings S and T with lengths m and n respectively, construct a matrix Lev[n + 1, m + 1] and iteratively calculate the value of each cell Lev(i, j) using the formula below:


Calculation formula
The Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is given by:

Figure 2.4: Levenshtein Distance calculation formula

where the tail of some string x is the string of all but the first character of x, and x[n] is the nth character of the string x, counting from 0.
Note that the first element in the minimum corresponds to deletion (from a to b), the second to insertion, and the third to replacement.
Complexity and example of Levenshtein distance
The standard dynamic-programming algorithm computes the Levenshtein distance of two strings of length n in O(n²) time, and it has been shown that the distance cannot be computed in strongly subquadratic time unless the Strong Exponential Time Hypothesis is false.
Below is an example of the Levenshtein distance between the two words "Saturday" and "Sunday":


Figure 2.5: Example of Levenshtein distance calculation table.
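For illustration, a straightforward Python implementation of the dynamic-programming table above (a sketch, not the exact OpenK code) is:

def levenshtein(a, b):
    m, n = len(a), len(b)
    # lev[i][j] = distance between the first i characters of a and the first j of b.
    lev = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        lev[i][0] = i                      # i deletions
    for j in range(n + 1):
        lev[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            lev[i][j] = min(lev[i - 1][j] + 1,         # deletion
                            lev[i][j - 1] + 1,         # insertion
                            lev[i - 1][j - 1] + cost)  # replacement
    return lev[m][n]

print(levenshtein("Saturday", "Sunday"))   # 3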

2.3.2.c Damerau-Levenshtein distance

Definition
The Damerau-Levenshtein distance extends the Levenshtein distance by also allowing the transposition of two adjacent characters as a single edit operation, in addition to insertions, deletions, and replacements.
Below is an example of the Damerau-Levenshtein distance between the two words "Saturday" and "Sunday":

Figure 2.6: Example of Damerau-Levenshtein distance calculation table.
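A Python sketch of the optimal-string-alignment variant (Levenshtein plus adjacent transpositions; given here for illustration) is:

def damerau_levenshtein(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + cost)      # replacement
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(damerau_levenshtein("Saturday", "Sunday"))  # 3
print(damerau_levenshtein("ca", "ac"))            # 1 (a single transposition)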


2.3.2.d Jaro Distance - Jaro-Winkler Distance
The Jaro–Winkler distance is a string metric used in computer science and statistics to measure the edit distance between two sequences. It is a variant of the Jaro distance metric (Matthew A. Jaro, 1989), introduced by William E. Winkler in 1990.
The Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length l.
The shorter the Jaro–Winkler distance between two strings, the more similar they are. The score is normalized so that 0 indicates an exact match and 1 indicates no similarity. Because the metric was described in terms of similarity in the original study, the distance is defined as the inversion of that value (distance = 1 − similarity).

Figure 2.7: Formula of Jaro similarity calculation
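For reference, the Jaro similarity shown in Figure 2.7 is commonly written as:

sim_j = 0 if m = 0, and otherwise
sim_j = (1/3) · ( m/|s1| + m/|s2| + (m − t)/m )

where m is the number of matching characters (two characters are considered matching only if they are equal and not farther apart than ⌊max(|s1|, |s2|)/2⌋ − 1 positions), and t is half the number of matching characters that appear in a different order (transpositions).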

Given two strings s1 and s2, their Jaro–Winkler similarity sim_w is defined as follows (a calculation example is shown in Figure 2.8):


Figure 2.8: Jaro-Winkler similarity calculation example

sim_w = sim_j + l·p·(1 − sim_j), where:
• sim_j is the Jaro similarity of strings s1 and s2;
• l is the length of the common prefix at the start of the strings, up to a maximum of 4 characters;
• p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. p should not exceed 0.25 (i.e. 1/4, with 4 being the maximum length of the prefix considered), otherwise the similarity could become larger than 1. The standard value for this constant in Winkler's work is p = 0.1.
The Jaro–Winkler distance d_w is defined as d_w = 1 − sim_w.
Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality. It also does not satisfy the identity axiom d(x, y) = 0 ⟺ x = y.
