
ĐẠI HỌC QUỐC GIA TP. HCM
TRƯỜNG ĐẠI HỌC BÁCH KHOA
--------------------

NGUYỄN LÊ HOÀNG

PHÂN CỤM CÁC TẬP DỮ LIỆU CÓ KÍCH THƯỚC LỚN
DỰA VÀO LẤY MẪU VÀ NỀN TẢNG SPARK

Ngành : Khoa học Máy tính
Mã số : 8480101

LUẬN VĂN THẠC SĨ

TP. HỒ CHÍ MINH, tháng 01 năm 2020


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF TECHNOLOGY
———oOo———

NGUYEN LE HOANG

CLUSTERING LARGE DATASETS
BASED ON DATA SAMPLING AND SPARK

Computer Science
No.: 8480101

MASTER THESIS


Ho Chi Minh City, January 2020


CÔNG TRÌNH ĐƯỢC HOÀN THÀNH TẠI
TRƯỜNG ĐẠI HỌC BÁCH KHOA
ĐẠI HỌC QUỐC GIA – TP. HỒ CHÍ MINH
Cán bộ hướng dẫn khoa học: PGS. TS. ĐẶNG TRẦN KHÁNH

……………………

Cán bộ đồng hướng dẫn: TS. LÊ HỒNG TRANG

……………………

Cán bộ chấm nhận xét 1: TS. PHAN TRỌNG NHÂN

……………………

Cán bộ chấm nhận xét 2: PGS. TS. HUỲNH TRUNG HIẾU

……………………

Luận văn thạc sĩ được bảo vệ tại Trường Đại học Bách Khoa, ĐHQG TP. HCM
ngày 30 tháng 12 năm 2019
Thành phần Hội đồng đánh giá luận văn thạc sĩ gồm:
1. PGS. TS. NGUYỄN THANH BÌNH
2. TS. NGUYỄN AN KHƯƠNG
3. TS. PHAN TRỌNG NHÂN
4. PGS. TS. HUỲNH TRUNG HIẾU
5. PGS. TS. NGUYỄN TUẤN ĐĂNG

Xác nhận của Chủ tịch Hội đồng đánh giá LV và Trưởng Khoa quản lý chuyên ngành
sau khi luận văn đã được sửa chữa (nếu có)

CHỦ TỊCH HỘI ĐỒNG

PGS. TS. NGUYỄN THANH BÌNH

TRƯỞNG KHOA KH & KTMT


THE RESEARCH WORK FOR THIS THESIS
HAS BEEN CARRIED OUT AT
UNIVERSITY OF TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
Under the supervision of
• Supervisor: ASSOC. PROF. DR. DANG TRAN KHANH
• Co-supervisor: DR. LE HONG TRANG

Examiner Board
• Examiner 1: DR. PHAN TRONG NHAN
• Examiner 2: ASSOC. PROF. DR. HUYNH TRUNG HIEU

This thesis was reviewed and defended at the University of Technology, VNU-HCMC,
on December 30, 2019.
The members of the Thesis Defense Committee were:
1. ASSOC. PROF. DR. NGUYEN THANH BINH
2. DR. NGUYEN AN KHUONG
3. DR. PHAN TRONG NHAN
4. ASSOC. PROF. DR. HUYNH TRUNG HIEU

5. ASSOC. PROF. DR. NGUYEN TUAN DANG
Confirmation from President of Thesis Defense Committee and Dean of Faculty of
Computer Science and Engineering

President of Thesis Defense
Committee

Dean of Faculty of Computer Science
and Engineering

Assoc.Prof.Dr. Nguyen Thanh Binh
(signed)

Assoc.Prof.Dr. Pham Tran Vu
(signed)


ĐẠI HỌC QUỐC GIA TP.HCM
TRƯỜNG ĐẠI HỌC BÁCH KHOA

CỘNG HÒA XÃ HỘI CHỦ NGHĨA VIỆT NAM
Độc lập - Tự do - Hạnh phúc

NHIỆM VỤ LUẬN VĂN THẠC SĨ
Họ tên học viên : NGUYỄN LÊ HOÀNG ............................... MSHV : 1770472 ............
Ngày, tháng, năm sinh : 12/03/1988 .......................................... Nơi sinh : TP. HCM ........
Ngành : Khoa học Máy tính ....................................................... Mã số : 8480101 ..............

I. TÊN ĐỀ TÀI : PHÂN CỤM CÁC TẬP DỮ LIỆU CÓ KÍCH THƯỚC LỚN DỰA
VÀO LẤY MẪU VÀ NỀN TẢNG SPARK

II. NHIỆM VỤ VÀ NỘI DUNG :
- Tìm hiểu và nghiên cứu về các bài toán gom cụm, các phương pháp lấy mẫu, tổng quát hoá dữ liệu và nền tảng Apache Spark cho dữ liệu lớn.
- Dựa vào phương pháp lấy mẫu, chúng tôi đề xuất và chứng minh các giải thuật xây dựng tập coreset để tìm ra tập hợp con phù hợp nhất, vừa có thể giảm chi phí tính toán, vừa có thể được sử dụng như là tập hợp con đại diện cho tập dữ liệu gốc trong các bài toán gom cụm.
- Thử nghiệm và đánh giá các phương pháp đề xuất.

III. NGÀY GIAO NHIỆM VỤ : 04/01/2019
IV. NGÀY HOÀN THÀNH NHIỆM VỤ : 07/12/2019
V. CÁN BỘ HƯỚNG DẪN : PGS. TS. ĐẶNG TRẦN KHÁNH và TS. LÊ HỒNG TRANG

TP. HCM, ngày 07 tháng 12 năm 2019
CÁN BỘ HƯỚNG DẪN

PGS. TS. ĐẶNG TRẦN KHÁNH

TRƯỞNG KHOA KH & KTMT




SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
——————————

VIETNAM NATIONAL UNIVERSITY HCMC
UNIVERSITY OF TECHNOLOGY
——————————

MASTER THESIS OBLIGATIONS
Student: NGUYEN LE HOANG
Student ID: 1770472
Date of Birth: March 12, 1988
Place of Birth: Ho Chi Minh City
Major: Computer Science
Code: 8480101

I. THESIS TITLE: CLUSTERING LARGE DATASETS BASED ON DATA
SAMPLING AND SPARK
II. OBLIGATIONS AND CONTENTS:
• Study and research about clustering problems, data sampling methods, data
generalization and Apache Spark framework for big data.
• Based on Data Sampling, we propose and prove algorithms for coreset constructions in order to find the most suitable subsets that can both be used to
reduce the computational cost and be used as the representative subsets of the
full original datasets in clustering problems.
• Do experiments and evaluate the proposed methods.
III. START DATE: January 4, 2019
IV. END DATE: December 7, 2019
V. SUPERVISORS: ASSOC. PROF. DR. DANG TRAN KHANH

and DR. LE HONG TRANG
Ho Chi Minh City, December 7, 2019

Supervisor

Dean of Faculty of Computer Science
and Engineering

Assoc.Prof.Dr. Dang Tran Khanh
(signed)

Assoc.Prof.Dr. Pham Tran Vu
(signed)



Acknowledgements
I am very grateful to my supervisor, Assoc. Prof. Dr. DANG TRAN KHANH, and my co-supervisor, Dr. LE HONG TRANG, for the guidance, inspiration and constructive suggestions that helped me prepare this graduation thesis.
I would like to thank my family very much, especially my parents, who have always been by my side and supported me in whatever I do.
Ho Chi Minh City, December 7, 2019.



Abstract
With the development of technology, data has become one of the most essential resources of the 21st century. However, the explosion of the Internet has turned these data into very large datasets that are hard to store and process. In this thesis, we propose solutions for clustering large-scale data, a vital problem in machine learning that is widely applied in industry.

To solve this problem, we use data sampling methods based on the concept of coresets: subsets of the data that must be small enough to reduce the computational complexity but must keep all the representative characteristics of the original set. In other words, we can scale a big dataset down to a much smaller one that can be clustered efficiently, while the result obtained on the subset can be taken as a solution for the whole original dataset. In addition, to speed up the processing of large-scale datasets, we apply the open-source framework for big data, Apache Spark.

In the scope of this thesis, we propose and prove two coreset constructions for k-means clustering. We also run experiments and evaluate the proposed algorithms to assess the advantages and disadvantages of each one. The thesis can be divided into four parts as follows:
• Chapter 1 and Chapter 2 give the introduction and an overview of coresets and the related background. These chapters also provide a brief description of Apache Spark, as well as the definitions and theorems used in this thesis.
• In Chapter 3, we propose and prove the first coreset construction, which is based on the Farthest-First-Traversal algorithm and the ProTraS algorithm [58], for k-median and k-means clustering. We also evaluate this method at the end of the chapter.
• In Chapter 4, based on prior work on lightweight coresets [12], we propose and prove the correctness of the second coreset construction, the α-lightweight coreset for k-means clustering, a general form of the lightweight coreset with an adjustable parameter.
• In Chapter 5, we apply the α-lightweight coreset and the data generalization method to solve the overall problem of this thesis, clustering large-scale datasets. We also apply Apache Spark to solve the problem faster. To evaluate correctness, we run experiments on several large-scale benchmark datasets.


Tóm tắt
Với sự phát triển của công nghệ, dữ liệu đã trở thành một trong những yếu tố quan trọng nhất của thế kỷ 21. Tuy nhiên, sự bùng nổ của Internet đã biến đổi những dữ liệu này thành những dữ liệu vô cùng lớn khiến cho việc xử lý và khai thác trở nên cực kỳ khó khăn. Trong đề tài này, chúng tôi sẽ đề xuất giải pháp để giải quyết bài toán gom cụm cho dữ liệu có kích thước lớn, đây được xem là một bài toán rất quan trọng của máy học (machine learning) và cũng là một bài toán được áp dụng rộng rãi trong công nghiệp.

Để giải bài toán, chúng tôi sử dụng phương pháp lấy mẫu được dựa trên khái niệm về tập coreset, được định nghĩa là một tập con thoả mãn hai điều kiện: phải đủ nhỏ để giảm độ phức tạp trong tính toán nhưng phải mang đầy đủ các đặc trưng đại diện của tập gốc. Nói cách khác, chúng ta bây giờ có thể thu nhỏ tập dữ liệu lớn thành một tập nhỏ hơn để có thể phân cụm hiệu quả trong khi kết quả thu được trên tập con cũng được xem là kết quả của cả tập gốc. Bên cạnh đó, để quá trình xử lý trên tập dữ liệu có kích thước lớn nhanh hơn, chúng tôi cũng sử dụng nền tảng xử lý dữ liệu lớn Apache Spark.

Trong phạm vi của luận văn này, chúng tôi đề xuất và chứng minh hai phương pháp để xây dựng tập coreset cho bài toán gom cụm k-means. Chúng tôi cũng thực thi các thử nghiệm và đánh giá các giải thuật được đề xuất để tìm các ưu và khuyết điểm của mỗi phương pháp. Luận văn được chia thành 4 phần chính như sau:
• Chương 1 và chương 2 giới thiệu các khái niệm về tập coreset và các kiến thức liên quan. Trong các chương này, chúng tôi cũng tóm tắt ngắn gọn về Apache Spark và các định lý được sử dụng trong luận văn.
• Trong chương 3, chúng tôi đề xuất và chứng minh phương pháp đầu tiên để xây dựng tập coreset dựa trên giải thuật Farthest-First-Traversal và giải thuật ProTraS [58] cho bài toán gom cụm k-median và k-means. Chúng tôi cũng tiến hành đánh giá giải thuật này ở cuối chương.
• Trong chương 4, dựa trên các công trình về lightweight coreset [12], chúng tôi đề xuất và chứng minh tính đúng đắn của phương pháp xây dựng coreset thứ hai, α-lightweight coreset, cho bài toán gom cụm k-means, đây được xem là một dạng tổng quát và có thể điều chỉnh hệ số của lightweight coreset.
• Trong chương 5, chúng tôi sử dụng phương pháp α-lightweight coreset cùng với phương pháp tổng quát hoá dữ liệu để giải quyết tổng thể bài toán: gom cụm trên tập dữ liệu có kích thước lớn. Chúng tôi cũng sử dụng nền tảng Apache Spark để bài toán được giải quyết nhanh hơn. Để đánh giá độ chính xác, chúng tôi tiến hành thử nghiệm và so sánh kết quả trên các tập mẫu benchmark có kích thước lớn.



Declaration of Authorship
I, NGUYEN LE HOANG, declare that this thesis - Clustering Large Datasets
based on Data Sampling and Spark, and the work presented in this thesis are my
own. I confirm that:
• This work was done wholly or mainly while in candidature for a Master of
Science at University of Technology, VNU-HCMC.
• No part of this thesis has previously been submitted for any degree or any other
qualification at this University or any other institution.
• Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have
made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:




Contents

Acknowledgements
Abstract
List of Figures
List of Tables
List of Algorithms

1  Introduction
   1.1  Overview
   1.2  The Scope of Research
   1.3  Research Contributions
        1.3.1  Scientific Significance
        1.3.2  Practical Significance
   1.4  Organization of Thesis
   1.5  Publications relevant to this Thesis

2  Background and Related Works
   2.1  k-Means and k-Means++ Clustering
        2.1.1  k-Means Clustering
        2.1.2  k-Means++ Clustering
   2.2  Coresets
        2.2.1  Definition
        2.2.2  Some Coreset Constructions
   2.3  Apache Spark
        2.3.1  What is Apache Spark?
        2.3.2  Why Apache Spark?
   2.4  Bounds on Sample Complexity of Learning

3  FFT-based Coresets
   3.1  Farthest-First-Traversal Algorithm
   3.2  FFT-based Coresets for k-Median and k-Means Clustering
   3.3  ProTraS algorithm and limitations
        3.3.1  ProTraS algorithm
        3.3.2  Drawbacks in ProTraS
   3.4  Proposed FFT-based Coreset Construction
        3.4.1  Proposed Algorithm
        3.4.2  Initial Step
        3.4.3  Decrease the Computational Complexity
   3.5  Experiments
        3.5.1  Experiment Setup
        3.5.2  Results and Discussion

4  General Lightweight Coresets
   4.1  Lightweight Coreset
        4.1.1  Definition
        4.1.2  Algorithm
   4.2  The α-Lightweight Coreset
        4.2.1  Definition
        4.2.2  Theorem about the Optimal Solutions
   4.3  Algorithm
   4.4  Analysis

5  Clustering Large Datasets via Coresets and Spark
   5.1  Processing Method
        5.1.1  Data Generalization
        5.1.2  Built-in k-Means clustering in Spark
        5.1.3  Realistic Method
   5.2  Experiments
        5.2.1  Experimental Method
        5.2.2  Experimental Data Sets
        5.2.3  Results

6  Conclusions

References

List of Figures

1.1  Big Data properties
1.2  Machine Learning: Supervised vs Unsupervised
2.1  Spark Logo
2.2  The Components of Spark
3.1  Original-R15 (small cluster) and scaling-R15
3.2  Some data sets for experiments
3.3  ARI in relation to subsample size for datasets D1 - D8
3.4  ARI in relation to subsample size for datasets D9 - D16
5.1  ARI and Runtime of Birch1 in relation to full data
5.2  ARI and Runtime of Birch2 in relation to full data
5.3  ARI and Runtime of Birch3 in relation to full data
5.4  ARI and Runtime of ConfLongDemo in relation to full data
5.5  ARI and Runtime of KDDCupBio in relation to full data

List of Tables

3.1  Data sets for Experiments
3.2  Experimental Results - Adjusted Rand Index Comparison
3.3  Experimental Results - Time Comparison
5.1  Data sets for Experiments
5.2  Experimental Results for dataset Birch1
5.3  Experimental Results for dataset Birch2
5.4  Experimental Results for dataset Birch3
5.5  Experimental Results for dataset ConfLongDemo
5.6  Experimental Results for dataset KDDCup Bio

List of Algorithms

1  k-Means Clustering - Lloyd's Algorithm [42]
2  D² Sampling for k-Means++ [6]
3  Farthest-First-Traversal algorithm
4  ProTraS Algorithm [58]
5  Proposed FFT-based Coreset Construction
6  Lightweight Coreset [12]
7  Proposed α-Lightweight Coreset Construction


Chapter 1

Introduction

1.1  Overview

Over the past few years, the development of technology has led to a rapid increase in the amount of data. Twenty years ago, most data came from the sciences, but the explosion of the Internet has taken us into a new era of data: data now exist everywhere around us. For instance, the growth of smart cities and Internet of Things (IoT) devices such as aerial (remote sensing) platforms, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks creates a large amount of data every second. Besides, the increasing number of mobile and personal devices such as smartphones and tablets also makes the data grow every day, especially through photo and video sharing on popular social networks like Facebook, Twitter and YouTube. Many studies have shown that the amount of data created each year is growing faster than ever before; they estimate that by 2020, every human on the planet will be creating 1.7 megabytes of information each second, and in only a year the accumulated world data will grow to 44 zettabytes¹ [46]. Another study from IDC predicts that the amount of global data captured in 2025 will reach 163 zettabytes, a tenfold increase compared to 2016 [55].

Consequently, researchers now face a new, hard situation: solving problems on data that are big in volume, variety, velocity, veracity and value (Figure 1.1²). Understanding and explaining these data in order to solve real problems is very hard for humans without help from machines. That is why machine learning plays an important role in this decade as well as in the future. By applying machine learning combined with artificial intelligence (AI), scientists can create systems that are able to automatically learn and improve from experience without being explicitly programmed.

Figure 1.1: Big Data properties

¹ One zettabyte is equivalent to one billion gigabytes.
² Image source: />

Depending on the purpose, machine learning is divided into two categories: supervised and unsupervised. Supervised learning is a kind of training model in which the training sets come with provided target labels; the system learns from these training sets and is then used to predict or classify future instances. In contrast, unsupervised machine learning approaches extract information from data sets where such explicit labels are not available. The importance of this field is expected to grow, as it is estimated that 85% of global data in 2025 will be unlabeled [55]. In particular, data clustering, the task of grouping similar objects together into clusters, seems to be a fruitful approach for analyzing such data [13]. Applications are broad and include fields such as computer vision [61], information retrieval [35], computational geometry [36] and recommendation systems [41]. Furthermore, clustering techniques can also be used to learn data representations that are used in downstream prediction tasks such as classification and regression [16]. The machine learning categories are described briefly in Figure 1.2³.

Figure 1.2: Machine Learning: Supervised vs Unsupervised

³ Image source: />

In general, clustering is one of the most popular techniques in machine learning and is widely used in large-scale data analysis. The goal of clustering is to partition a set of objects into groups such that objects in the same group are similar to each other and objects in different groups are dissimilar to each other. Because of its importance and its applications in practice, this technique has been studied extensively and many algorithms have been proposed. For example, BIRCH [68] and CURE [27], which belong to hierarchical clustering (also known as connectivity-based clustering), solve problems based on the idea that objects are more related to nearby objects than to objects farther away. If the problem is closely related to statistics, distribution-based clustering such as the Gaussian Mixture Model (GMM) [66] or DBCLASD [53] can be used. For density-based clustering, in which data lying in a high-density region of the data space are considered to belong to the same cluster [38], we can use Mean-shift [17], DBSCAN [20], the most well-known density-based clustering algorithm, or OPTICS [4], an improvement of DBSCAN. One of the most common approaches to clustering is partition-based: the basic idea is to assign cluster centers to some random objects, and the actual centers are revealed through several iterations until a stopping condition is satisfied. Some common algorithms of this kind are k-means [45], k-medoids [49], CLARA [37] and CLARANS [48]. For more detail, we refer readers to the surveys of clustering algorithms by R. Xu (2005) [65] and by D. Xu (2015) [64].
In fact, there are a lot of clustering algorithms and improvements that can be used in applications, and each one has its own benefits and drawbacks. Choosing a suitable clustering algorithm is an important and difficult problem that users must deal with when they face situations with specific configurations and settings. Several studies, such as [14], [39] and [67], discuss the quality of clusters in particular circumstances. However, in the scope of this thesis we do not cover this issue or the variety of clustering algorithms; instead, we fix one of the most popular clustering algorithms, k-means clustering. We use this algorithm throughout this report and investigate methods that can deal with k-means clustering for large-scale data sets.
Moreover, designing a complete solution that can cluster and analyze large-scale data is still a challenge for data scientists. Many methods have been proposed over the years to deal with machine learning for big data. One of the simplest ways depends on infrastructure and hardware: the more powerful and modern the machines we have, the more complicated problems and the larger amounts of data we can handle. This solution is quite easy but costs a lot of money, and few people can afford it. Another option is finding suitable algorithms to reduce the computational complexity coming from an input that may contain millions or billions of data points. There are several such approaches, for example data compression [69], [1], data deduplication [19] and dimension reduction [25], [60], [51]; for a survey, readers can find more useful information in [54]. Among big data reduction methods, data sampling is one of the popular options closely related to machine learning and data mining. The key idea of data sampling is that instead of solving a problem on the full, large-scale data, we can find the answer on a subset of this data; this result is then used as the baseline for finding the actual solution for the original data set. This leads to a new difficulty: finding a subset that is small enough to effectively reduce the computational complexity but still keeps all the representative characteristics of the original data. This difficulty is the motivation for this research and this thesis.
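
To make this idea concrete, the following is a minimal sketch (not taken from the thesis; the synthetic data, subsample size, parameter values and the use of scikit-learn are illustrative assumptions): cluster a uniform subsample, then evaluate the obtained centers on the full data set.

```python
# Cluster a small uniform subsample, then evaluate the obtained centers on
# the full dataset. All sizes and parameters below are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 5))              # stand-in for a large dataset

# Uniform subsample: the simplest form of data sampling.
idx = rng.choice(len(X), size=2_000, replace=False)
subsample = X[idx]

# Solve k-means on the subsample only (cheap compared to the full data).
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(subsample)

# Use the subsample's centers as a solution for the full data and measure
# the k-means cost: the sum of squared distances to the closest center.
d2 = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
full_cost = d2.min(axis=1).sum()
print(f"k-means cost of the subsample's centers on the full data: {full_cost:.3e}")
```

A coreset construction replaces the uniform choice of the subsample with a more careful, weighted selection, so that the cost of the centers obtained on the subset provably stays close to the cost of clustering the full data.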

1.2  The Scope of Research

In this thesis, we propose a solution for the problem of clustering large datasets. We use the word "large" to indicate data that are "big" in volume only, not in all the characteristics of big data described in the previous section with the 5 V's (Volume, Variety, Value, Velocity and Veracity) (Figure 1.1). However, the Volume, in other words the data size, is one of the most non-trivial difficulties that researchers have to face when solving a big-data-related problem.

Regarding the clustering algorithm, even though there are many methods and lines of investigation, we consider a fixed clustering problem, the prototypical k-means clustering. We select it because k-means is the most well-known clustering algorithm and is widely applied in practice, in industry as well as in scientific research.

While there is a wealth of prior work on clustering small and medium-sized data sets, there are unique challenges in the massive data setting. Traditional algorithms have a super-linear computational complexity in the size of the data set, making them infeasible when there are many data points. In the scope of this thesis, we apply data sampling to deal with the massive data setting. A very basic form of this approach is random, or uniform, sampling. In fact, while uniform sampling is feasible for some problems, there are instances where it performs very poorly due to the naive nature of the sampling strategy. For example, real-world data is often imbalanced and contains clusters of different sizes. As a consequence, a small fraction of data points can be very important and have an enormous impact on the objective function. Such imbalanced data sets are problematic for methods based on uniform sampling since, with high probability, these methods only sample points in large clusters and the information in the small clusters is discarded [13].
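
The following minimal sketch (not from the thesis; the cluster sizes and sample size are invented for illustration) shows how a uniform subsample of imbalanced data almost entirely misses a small cluster.

```python
# Imbalanced data: one huge cluster and one tiny, far-away cluster.
# Cluster sizes and the sample size are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
large = rng.normal(loc=0.0, scale=1.0, size=(100_000, 2))
small = rng.normal(loc=50.0, scale=1.0, size=(100, 2))
data = np.vstack([large, small])

# Uniform subsample of 1,000 points.
sample_idx = rng.choice(len(data), size=1_000, replace=False)
n_from_small = int(np.sum(sample_idx >= len(large)))

# On average only about one of the 1,000 sampled points comes from the small
# cluster, so its (potentially important) contribution to the k-means
# objective is almost entirely lost in the subsample.
print(f"points sampled from the small cluster: {n_from_small} / 1000")
```
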
The idea of finding a relevant subset of the original data to decrease the computational cost led scientists to the concept of a coreset, which was first applied in geometric approximation by Agarwal et al. in 2004 [2], [3]. The problem of coreset constructions for k-median and k-means clustering was then stated and investigated by Har-Peled et al. in [28], [29]. Since then, many coreset construction algorithms have been proposed for a wide variety of clustering problems. In this thesis, building on state-of-the-art coreset methods, we propose two coreset constructions for k-means clustering.

In addition to designing machine learning algorithms for big data problems, data scientists also create and apply frameworks to speed up the processing of big data. Some of the most popular open-source, free-to-use frameworks are Apache Hadoop, Apache Spark, Apache Storm, Apache Samza and Apache Flink. Each one is designed with a different architecture and has its own strengths; for a survey and more details about these frameworks, readers can refer to [34]. In this thesis, along with data sampling via coresets, we apply the built-in k-means clustering algorithm of Apache Spark to shorten the runtime of the whole problem.
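
As an illustration of this choice, a minimal PySpark sketch of calling Spark's built-in k-means might look as follows; the file name, column handling and parameter values are assumptions for demonstration, not the exact setup used later in this thesis.

```python
# A minimal sketch of Spark's built-in k-means from PySpark.
# The input file, columns and parameters are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("coreset-kmeans").getOrCreate()

# Load a numeric dataset (e.g., a coreset exported as CSV) into a DataFrame.
df = spark.read.csv("coreset.csv", header=True, inferSchema=True)

# Spark MLlib expects a single vector column of features.
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
points = assembler.transform(df)

# k-means with Spark's default (k-means||) seeding.
kmeans = KMeans(k=15, seed=1, featuresCol="features")
model = kmeans.fit(points)

for center in model.clusterCenters():
    print(center)

spark.stop()
```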

1.3  Research Contributions

In this thesis, we solve the problem of clustering large data sets by using data sampling methods and the big data framework Apache Spark. Since Spark is an engineering framework maintained by Apache, we do not change its configuration or improve anything belonging to Spark itself. Instead, our research focuses on data sampling methods that find the most relevant subsets, called coresets, of a full data set. Coresets, in other words, can be described as compact subsets such that models trained on them also provide a good fit compared with models trained on the full data set. By using coresets, we can scale a big dataset down to a tiny one in order to reduce the computational cost of a machine learning problem. From this in-depth study of coresets, the thesis makes the following scientific and practical contributions.

1.3.1  Scientific Significance

In this thesis, based on prior works about coresets, we propose new coreset construction algorithms for k-means clustering:
• Based on the farthest-first-traversal algorithm and the ProTraS algorithm by Ros & Guillaume [58], we propose an FFT-based coreset construction. This part is explained and proved in Chapter 3.
• Based on the lightweight coreset of Bachem, Lucic and Krause [12], we propose a general model, the α-lightweight coreset, and then a general lightweight coreset construction that is very fast and practical. This is proved in Chapter 4.


1.3.2  Practical Significance

• Due to its high runtime, our proposed FFT-based coreset construction is hard to use in practice. However, experiments against some state-of-the-art coreset constructions show that it is able to produce one of the best sample coresets among those evaluated.
• Our proposed α-lightweight coreset model is a generalization of the traditional lightweight coreset. It can be used in various practical cases, especially in situations that need to focus on either the multiplicative or the additive error of the samples.

1.4  Organization of Thesis

The remainder of this thesis is organized as follows.
• Chapter 2. This chapter gives an overview of prior works related to this thesis, including the k-means and k-means++ algorithms, the definition of coresets, a short brief on Apache Spark and a theorem about bounds on sample complexity.
• Chapter 3. We introduce the farthest-first-traversal algorithm as well as ProTraS for finding coresets. Then we propose an FFT-based algorithm for coreset construction.
• Chapter 4. This chapter is about the lightweight coreset and our general lightweight coreset model. We also prove the correctness of this model and propose a general algorithm for the α-lightweight coreset.
• Chapter 5. This chapter presents the experiments on clustering large datasets. We use the α-lightweight coreset for the sampling process and k-means++ for clustering on the Apache Spark framework.
• Chapter 6. We give the conclusion of the thesis.

1.5  Publications relevant to this Thesis

• Nguyen Le Hoang, Tran Khanh Dang and Le Hong Trang. A Comparative Study of the Use of Coresets for Clustering Large Datasets. In: Future Data and Security Engineering (FDSE 2019), LNCS 11814, pp. 45-55, 2019.
• Le Hong Trang, Nguyen Le Hoang and Tran Khanh Dang. A Farthest-First-Traversal-based Sampling Algorithm for k-clustering. In: International Conference on Ubiquitous Information Management and Communication (IMCOM 2020), 2020.


Chapter 2

Background and Related Works

In this chapter, we provide a short introduction to the background and prior works related to this thesis:
• k-Means and k-means++ Clustering
• Coresets
• Apache Spark
• Bounds & Pseudo-dimension



2.1  k-Means and k-Means++ Clustering

2.1.1  k-Means Clustering

The k-means clustering problem is one of the oldest and most important questions in machine learning. Given an integer k and a data set X ⊂ R^d, the goal is to choose k centers so as to minimize the total squared distance between each point and its closest center. The problem can be described as follows: let X ⊂ R^d; the k-means clustering problem is to find a set Q ⊂ R^d with |Q| = k such that the function φ_X(Q) is minimized, where

$$\varphi_X(Q) \;=\; \sum_{x \in X} d(x, Q)^2 \;=\; \sum_{x \in X} \min_{q \in Q} \lVert x - q \rVert^2 .$$
x∈X

In 1957, an algorithm now often referred to simply as "k-means" was proposed by S. Lloyd of Bell Labs; it was eventually published in 1982 [42]. Lloyd's algorithm begins with k arbitrary "centers", typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These last two steps are repeated until the process stabilizes [6].
Lloyd's algorithm is described in Algorithm 1.

Algorithm 1 k-Means Clustering - Lloyd's Algorithm [42]
Require: data set X, number of clusters k
Ensure: k separated clusters
 1: Randomly initialize k centers C = {c_j}_{j=1}^{k} ∈ R^{d×k}
 2: while (not converged) do
 3:     for j := 1 → k do
 4:         X_j := ∅
 5:     end for
 6:     for i := 1 → |X| do
 7:         j := arg min_{1 ≤ l ≤ k} d(x_i, c_l)^2
 8:         X_j := X_j ∪ {x_i}
 9:     end for
10:     for j := 1 → k do
11:         c_j := (1 / |X_j|) ∑_{x ∈ X_j} x
12:     end for
13: end while
14: return C = {c_j}_{j=1}^{k}
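
For readers who prefer code to pseudocode, the following is a minimal NumPy sketch of Algorithm 1; the convergence tolerance and iteration cap are illustrative choices that are not part of the original pseudocode.

```python
# A minimal NumPy sketch of Lloyd's algorithm (Algorithm 1).
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    rng = np.random.default_rng(seed)
    # Line 1: k arbitrary centers chosen uniformly at random from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Lines 6-9: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Lines 10-12: recompute each center as the mean of its points
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers no longer move (the convergence test).
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```
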
The algorithm was later refined by Inaba et al. [33], Matousek [47], Vega et al. [63] and others. However, one of the most notable improvements of k-means is k-means++ by Arthur and Vassilvitskii [6]. We give an overview of this algorithm in the next section.


2.1.2  k-Means++ Clustering

In Algorithm 1, the initial set of cluster centers (line 1) is based on random sampling, where k points are selected uniformly at random from the data set. This simple approach is fast and easy to implement. However, there are many natural examples for which the algorithm generates arbitrarily bad clusterings; this happens because of unfortunate placements of the starting centers, and in particular it can hold with high probability even if the centers are chosen uniformly at random from the data points [6].

To overcome this problem, Arthur and Vassilvitskii [6] proposed the algorithm named k-means++, which uses adaptive seeding based on a technique called D² sampling to create its initial seed set before running Lloyd's algorithm to convergence [8]. Given an existing set of centers S, the D² sampling strategy, as the name suggests, samples each point x ∈ X with probability proportional to its squared distance to the already selected centers, i.e.,

$$p(x \mid S) \;=\; \frac{d(x, S)^2}{\sum_{x' \in X} d(x', S)^2} .$$


D² sampling is described in Algorithm 2.

Algorithm 2 D² Sampling for k-Means++ [6]
Require: data set X, number of clusters k
Ensure: initial set S used for k-means
1: Uniformly sample x ∈ X
2: S := {x}
3: for i := 2 → k do
4:     Sample x ∈ X with probability d(x, S)^2 / ∑_{x' ∈ X} d(x', S)^2
5:     S := S ∪ {x}
6: end for
7: return set S

The set S returned by this algorithm replaces line 1 (the random initialization) of the original k-means in Algorithm 1.
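
A minimal NumPy sketch of Algorithm 2 (again an illustration, not the thesis's implementation) could look as follows; the function name and random-generator handling are assumptions.

```python
# A minimal NumPy sketch of D^2 sampling (Algorithm 2) used to seed k-means++.
import numpy as np

def d2_sampling(X, k, seed=None):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Lines 1-2: the first center is sampled uniformly at random.
    S = [X[rng.integers(n)]]
    for _ in range(1, k):
        # d(x, S)^2: squared distance from each point to its closest center.
        d2 = np.min(
            ((X[:, None, :] - np.array(S)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # Line 4: sample a new center proportionally to d(x, S)^2.
        probs = d2 / d2.sum()
        S.append(X[rng.choice(n, p=probs)])
    return np.array(S)

# The returned seed set S replaces the random initialization in line 1 of
# Algorithm 1 before running Lloyd's algorithm to convergence.
```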

