Scalable Data-Parallel Graph Algorithms
from Generation to Management
Sadegh Nobari
(B.Eng. (Hons.), IUST)
(Ph.D., NUS)
A THESIS SUBMITTED
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2012
Declaration
I hereby declare that this thesis is my original
work and it has been written by me in its entirety.
I have duly acknowledged all the sources of information
which have been used in the thesis.
This thesis has also not been submitted for any degree
in any university previously.
Sadegh Nobari
23 July 2012
Acknowledgements
The Ph.D. was a wonderful, extraordinary, once-in-a-lifetime experience . . .
I would like to say thanks
To my parents (Zeynab and Nader) and my only brother (Ghasem),
through their sacrifice
my opportunities were possible
To my advisors,
Professor Stéphane Bressan,
Professors Anastasia Ailamaki, Panagiotis Karras, Panos Kalnis, Nikos Mamoulis and
Yannis Velegrakis
for patiently supporting me
To my committee,
Professors Tan Tiow Seng, Tan Kian-Lee, M. Tamer Özsu and Leong Hon Wai
for gladly suffering my impenetrable prose
and helping me to better communicate
To my friends,
Xuesong Lu, Song Yi, Tang Ruiming, Antoine Veillard, Quoc Trung Tran, Cao Thanh
Tung, Ehsan Kazemi, Siarhei Bykau, Mohammad Oliya, Behzad Nemat Pajouh, Thomas
Heinis, Clemens Lay, Reza Sherkat and . . .
for accompanying me
To my groups,
people in the Database Research and Embedded Systems labs of NUS, DIAS of EPFL,
dbTrento of the University of Trento and Dennis Shasha's group at NYU
for accepting me
To my wife Mozhdeh,
for redefining my senses
.
.
.
Best Wishes,
Dr. Sadegh Nobari
With a quarter-century of life experience
2012
Abstract

J. J. Sylvester, in 1878, in an article on chemistry and algebra in Nature, gave the name "graph" to a mathematical structure that models connections between objects. More than a century later, the versatility of graphs as a data model is demonstrated by the long list of applications in mathematics, science, engineering and the humanities.
Cormen, Leiserson, Rivest, and Stein describe the role of graphs and graph algorithms in computer science as follows in their popular textbook: "Graphs are a pervasive data structure in computer science, and algorithms working with them are fundamental to the field."
Graphs are natural data structures for modern applications. Social network data are typically represented as graphs, the Semantic Web is based on the RDF formalism, which is a graph model, and software models and program dependences in software engineering are represented via graphs. In many cases these are very large and dynamic graphs. The convergence of applications managing large graphs and the availability of cheap parallel processing hardware has caused renewed interest in managing very large graphs over parallel systems.
In this dissertation, we design scalable and practical graph algorithms for a selected
set of large graph generation and management problems. In particular, we provide parallel solutions for graph generation with both random and real-world graph models. Afterward, we propose techniques for processing large graphs in parallel, specifically for
computing the Minimum Spanning Forest and the Shortest Path between vertices.
Chapter 3 focuses on the generation of very large graphs. The naive algorithm for generating Erdős-Rényi graphs does not scale to large graphs. In this chapter we take a systematic approach to the development of the PPreZER algorithm, proposing a series of seven algorithms. The results of our study show that our fine-tuned algorithm, PPreZER, for generating random graph data can be executed on a typical GPU on average 19 times faster than its fastest sequential version on the CPU.
Chapter 4 moves beyond random graphs and considers the generation of real-world graphs. This chapter considers spatial datasets and the generation of graphs by taking the spatial join of the elements in two datasets. We propose an algorithm, called HiDOP, to perform this spatial join operation efficiently. We then design a data-parallel algorithm inspired by the HiDOP algorithm.
Chapters 5 and 6 cover the data management part of the thesis. Two graph algorithms, a.k.a. graph queries, are studied: Minimum Spanning Forest (Chapter 5) and All-Pairs Shortest Path (Chapter 6). In Chapter 5, we propose PMA, a novel data-parallel algorithm inspired by Borůvka's and Prim's MSF algorithms. PMA is experimentally shown to be superior to the state-of-the-art MSF algorithms. Chapter 6 introduces a threshold L to the problem definition of all-pairs shortest path such that only the paths with weight less than L are found; the resulting problem is called L-APSP. This threshold is advantageous when only close connections are of interest, as in large social networks. A large number of APSP algorithms are studied; for each, a counterpart L-APSP algorithm is designed and a parallel version that exploits the GPU is proposed. In total, this dissertation proposes four scalable data-parallel algorithms for graph data processing.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction 1
1.1 Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Parallel processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Graph data generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Generating random graphs . . . . . . . . . . . . . . . . . . . . 6
Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Existing algorithms . . . . . . . . . . . . . . . . . . . . . . . . 7
Proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Generating real-world graphs . . . . . . . . . . . . . . . . . . 8
Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Existing algorithms . . . . . . . . . . . . . . . . . . . . . . . . 8
Proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Graph data management . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5.1 Finding Minimum Spanning Forest . . . . . . . . . . . . . . . 9
Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Existing algorithms . . . . . . . . . . . . . . . . . . . . . . . . 10
Proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.2 Finding Shortest Path . . . . . . . . . . . . . . . . . . . . . . . 11
Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Existing algorithms . . . . . . . . . . . . . . . . . . . . . . . . 13
Proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Parallel processing on Graphics Processing Unit (GPU) 15
2.1 Many and Multi core architectures . . . . . . . . . . . . . . . . . . . . 15
2.2 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 The CUDA and BrookGPU programming frameworks . . . . . . . . . 16
2.4 SIMT: Single Instruction, Multiple Threads . . . . . . . . . . . . . . . 17
2.5 Parallel Thread Execution (PTX) . . . . . . . . . . . . . . . . . . . . . 21
2.6 GPU Memory hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7 GPU Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.8 GPU empirical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.9 Programming the GPU . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.9.1 Parallel Pseudo-Random Number Generator . . . . . . . . . . . 27
2.9.2 Parallel Prefix Sum . . . . . . . . . . . . . . . . . . . . . . . . 29
2.9.3 Parallel Stream Compaction . . . . . . . . . . . . . . . . . . . 30

2.10 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Scalable Random Graph Generation 33
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Baseline algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Sequential algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.1 Skipping Edges . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.2 ZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.3 PreLogZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4 PreZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Parallel algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.1 PER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.2 PZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.3 PPreZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.6.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Overall Comparison . . . . . . . . . . . . . . . . . . . . . . . 51
Speedup Assessment . . . . . . . . . . . . . . . . . . . . . . . 53
Comparison among Parallel algorithms . . . . . . . . . . . . . 54
Parallelism Speedup . . . . . . . . . . . . . . . . . . . . . . . 55
Size Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Performance Tuning . . . . . . . . . . . . . . . . . . . . . . . 57
3.6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Scalable Real-world graph generation 63
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.1 In-Memory Approaches . . . . . . . . . . . . . . . . . . . . . 66

4.2.2 On-disk Approaches . . . . . . . . . . . . . . . . . . . . . . . 67
Both Datasets Indexed . . . . . . . . . . . . . . . . . . . . . . 67
One Dataset Indexed . . . . . . . . . . . . . . . . . . . . . . . 67
Unindexed . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.1 Touch Detection . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.2 Motivation Examples . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.3 Motivation Experiments . . . . . . . . . . . . . . . . . . . . . 73
4.4 HiDOP: Hierarchical Data Oriented Partitioning . . . . . . . . . . . . . 75
4.4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.2 HiDOP Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.3 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.4 Tree Building Phase . . . . . . . . . . . . . . . . . . . . . . . 78
4.4.5 Assignment Phase . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.6 Probing Phase . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.7 Proof of Correctness . . . . . . . . . . . . . . . . . . . . . . . 83
4.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5.1 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5.2 Design Parameters . . . . . . . . . . . . . . . . . . . . . . . . 85
Tree Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Local Join Parameters . . . . . . . . . . . . . . . . . . . . . . 86
Join Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.6 Parallel algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.7 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7.2 Experimental Methodology . . . . . . . . . . . . . . . . . . . 91
4.7.3 Loading the Data . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.7.4 Varying Dataset B . . . . . . . . . . . . . . . . . . . . . . . . 93
Small Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 93

Large Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.7.5 Varying Epsilon . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.7.6 Neuroscience Datasets . . . . . . . . . . . . . . . . . . . . . . 96
4.7.7 Parallel HiDOP experiments . . . . . . . . . . . . . . . . . . . 99
Overall Comparison . . . . . . . . . . . . . . . . . . . . . . . 99
Speedup Assessment . . . . . . . . . . . . . . . . . . . . . . . 100
4.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5 Scalable Parallel Minimum Spanning Forest Computation 103
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2.1 Sequential algorithms . . . . . . . . . . . . . . . . . . . . . . 106
Borůvka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Kruskal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Reverse-Delete . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Prim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2.2 Parallel algorithms . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 DPMST: Borůvka-based Data Parallel MST algorithm . . . . . . . . . . 110
5.3.1 Implementation on GPU . . . . . . . . . . . . . . . . . . . . . 112
5.4 Motivation for scalability . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.5 PMA: Scalable Parallel MSF algorithm . . . . . . . . . . . . . . . . . 114
5.5.1 Partial Prim . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.5.2 Unification step . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5.3 Proof of Correctness . . . . . . . . . . . . . . . . . . . . . . . 117
5.5.4 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . 118
5.6 PMA implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.6.1 Partial Prim implementation . . . . . . . . . . . . . . . . . . . 119

MinPMA algorithm . . . . . . . . . . . . . . . . . . . . . . . . 120
SortPMA algorithm . . . . . . . . . . . . . . . . . . . . . . . . 121
HybridPMA algorithm . . . . . . . . . . . . . . . . . . . . . . 121
5.6.2 Unification implementation . . . . . . . . . . . . . . . . . . . 122
5.6.3 Implementation notes . . . . . . . . . . . . . . . . . . . . . . . 123
5.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.7.1 DPMST performance evaluation . . . . . . . . . . . . . . . . . 124
Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 124
Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 124
5.7.2 PMA performance evaluation . . . . . . . . . . . . . . . . . . 127
Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Maximum subtree size (γ) . . . . . . . . . . . . . . . . . . . . 130
Removing parallel edges . . . . . . . . . . . . . . . . . . . . . 132
Reduction rate . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Performance comparison . . . . . . . . . . . . . . . . . . . . . 133
5.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6 Scalable Parallel All-Pairs Shortest Path Computation 136
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.2 All-Pairs Shortest Path problem . . . . . . . . . . . . . . . . . . . . . 137
6.3 Sequential algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.3.1 Floyd-Warshall . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.3.2 Johnson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.3.3 Repeated Squaring . . . . . . . . . . . . . . . . . . . . . . . . 139
6.3.4 Gaussian elimination . . . . . . . . . . . . . . . . . . . . . . . 142
6.4 Parallel algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.4.1 Gaussian elimination . . . . . . . . . . . . . . . . . . . . . . . 143
6.4.2 Johnson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.4.3 Repeated Squaring . . . . . . . . . . . . . . . . . . . . . . . . 144
6.4.4 Floyd-Warshall . . . . . . . . . . . . . . . . . . . . . . . . . . 145

6.5 Time Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . 148
6.5.1 Single Source Shortest Path (SSSP) . . . . . . . . . . . . . . . 148
6.5.2 Floyd-Warshall algorithm . . . . . . . . . . . . . . . . . . . . 149
6.5.3 Repeated Squaring algorithm . . . . . . . . . . . . . . . . . . . 151
6.5.4 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . 152
6.6 L-Distance Matrix Computation . . . . . . . . . . . . . . . . . . . . . 152
6.6.1 L-Pruned Parallel Floyd-Warshall algorithm . . . . . . . . . . 155
6.6.2 L-Pruned Repeated Squaring algorithm . . . . . . . . . . . . . 155
6.6.3 L-Pruned Single Source Shortest Path . . . . . . . . . . . . . . 155
6.6.4 L-Pruned Gaussian Elimination . . . . . . . . . . . . . . . . . 156
6.7 APSP Performance evaluation . . . . . . . . . . . . . . . . . . . . . . 156
6.7.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.7.2 Experimental Methodology . . . . . . . . . . . . . . . . . . . 157
6.7.3 Distance Matrix Computation . . . . . . . . . . . . . . . . . . 158
6.7.4 L-distance Matrix Computation . . . . . . . . . . . . . . . . . 158
L-Pruned Floyd-Warshall algorithm . . . . . . . . . . . . . . . 160
L-Pruned Repeated Squaring algorithm . . . . . . . . . . . . . 161
L-Pruned Single Source Shortest Path algorithm . . . . . . . . 165
L-Pruned Gaussian Elimination algorithm . . . . . . . . . . . . 166
6.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.9 Operations for privacy . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.9.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.10 L-opacity: Linkage-Aware Graph Anonymization . . . . . . . . . . . . 173
6.10.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . 173
6.10.2 L-Opacification algorithm . . . . . . . . . . . . . . . . . . . . 179
6.10.3 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . 180
Opacity Value Computation . . . . . . . . . . . . . . . . . . . 180
Edge Removal . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Edge Removal and Insertion . . . . . . . . . . . . . . . . . . . 183

6.10.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 184
Description of Data . . . . . . . . . . . . . . . . . . . . . . . . 187
Utility metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Comparison on Distortion . . . . . . . . . . . . . . . . . . . . 188
Comparison on EMD . . . . . . . . . . . . . . . . . . . . . . . 189
6.10.5 Comparison on Clustering Coefficients . . . . . . . . . . . . . 191
Runtime comparison . . . . . . . . . . . . . . . . . . . . . . . 192
6.10.6 Pruning Capacity in Distance Matrix Computation . . . . . . . 193
6.10.7 Data Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.10.8 L-Coherency . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.11 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7 Conclusions 197
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.2 Graph data generation algorithms . . . . . . . . . . . . . . . . . . . . . 198
7.2.1 Random graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.2.2 Real-world graphs . . . . . . . . . . . . . . . . . . . . . . . . 199
7.3 Graph data management algorithms . . . . . . . . . . . . . . . . . . . 200
7.3.1 Minimum Spanning Forest problem . . . . . . . . . . . . . . . 200
7.3.2 All-Pairs Shortest Path problem . . . . . . . . . . . . . . . . . 201
7.4 Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.4.1 Medium term goals . . . . . . . . . . . . . . . . . . . . . . . . 202
Dynamic graphs . . . . . . . . . . . . . . . . . . . . . . . . . 202
Large graph processing . . . . . . . . . . . . . . . . . . . . . . 202
7.4.2 Long term goals . . . . . . . . . . . . . . . . . . . . . . . . . 203
Parallel graph processing . . . . . . . . . . . . . . . . . . . . . 203
Distributed graph processing . . . . . . . . . . . . . . . . . . . 203
References 204
List of Figures
1.1 The seven Königsberg bridges, courtesy of [78]. . . . . . . . . . . . . 2
1.2 Number of research articles until April 2012 using graphs, extracted
from the Scopus database [11]. . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Field of research articles until April 2012 using graphs, extracted from
the Scopus database [11]. . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 CUDA thread organization. . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 A set of SIMT multiprocessors. . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Executing a GPU kernel K times, each consisting of one iteration. . . . 26
2.4 Executing one GPU kernel consisting of K iterations. . . . . . . . . . . 26
2.5 Phases of Parallel Prefix Sum [154]. . . . . . . . . . . . . . . . . . . . 30
2.6 Stream Compaction for 10 elements. . . . . . . . . . . . . . . . . . . . 31
3.1 f(k) for varying probabilities p. . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Running PER algorithm on 10 elements. . . . . . . . . . . . . . . . . . 48
3.3 Generating edge list via skip list in PZER. . . . . . . . . . . . . . . . . 49
3.4 Running times for all algorithms. . . . . . . . . . . . . . . . . . . . . . 52
3.5 Running times for small probability. . . . . . . . . . . . . . . . . . . . 53
3.6 Speedup for all algorithms over ER. . . . . . . . . . . . . . . . . . . . 54
3.7 Speedup for parallel algorithms over their sequential counterparts. . . . 55
3.8 Running times for parallel algorithms. . . . . . . . . . . . . . . . . . . 56
3.9 The times for pseudo-random number generator with skip for PZER and
PPreZER and check for PER. . . . . . . . . . . . . . . . . . . . . . . . 57
3.10 Speedup of the parallel algorithms against themselves for Γ_{v=10K, p=0.1}
graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.11 Runtime of the parallel algorithms for varying thread-blocks for Γ_{v=10K, p=0.1}
graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.12 Runtime for varying graph size, p = 0.001 . . . . . . . . . . . . . . . . 60

3.13 Runtime for varying graph size, p = 0.01 . . . . . . . . . . . . . . . . 61
3.14 Runtime for varying graph size, p = 0.1 . . . . . . . . . . . . . . . . . 62
4.1 The PBSM approach partitions the space in equi-width partitions (a) and
assigns the objects to the partitions (b). . . . . . . . . . . . . . . . . . . 69
4.2 The S3 algorithm [103] partitions the space more finely at lower levels
of the hierarchy. When joining, cell c_I is compared with the corresponding
cell c_O and all cells overlapping it (shaded). . . . . . . . . . . . . . 70
4.3 Schema of a neuron’s morphology modeled with cylinders. . . . . . . . 71
4.4 Two datasets where S3 performs suboptimally. . . . . . . . . . . . . . 73
4.5 Execution time of the spatial join with different approaches. . . . . . . . 74
4.6 The three phases of HiDOP: building the tree, assignment and joining . 77
4.7 The tree data structure of the HiDOP algorithm. . . . . . . . . . . . . . 79
4.8 The smaller the MBRs of the leaf nodes, the more effective filtering is. . 86
4.9 Uniform, Gaussian and clustered data distributions used for the experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.10 All the algorithms on small uniform dataset with increasing the size of
dataset B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.11 Synthetic large uniform datasets varying size of the second dataset when
ε = 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.12 Synthetic large Gaussian datasets varying size of the second dataset when
ε = 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.13 Synthetic large clustered datasets varying size of the second dataset when
ε = 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.14 Comparing the approaches for two different ε on all datasets. . . . . . . 97
4.15 Comparison of all approaches for ε of 5 and 10 on neuroscience datasets. 97
4.16 Execution for increasingly dense spatial neuroscience datasets. . . . . . 98

4.17 Execution of HiDOP and Parallel HiDOP on large synthetic and real
datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.18 Details of running PHiDOP on varying datasets. . . . . . . . . . . . . . 100
4.19 Speedup of Parallel HiDOP against its sequential version. . . . . . . . . 101
5.1 The state transition diagram of Kang and Bader’s algorithm [96]. . . . . 110
5.2 Illustration of the proposed data parallel MST. Given a graph (a), the
algorithm finds the minimum outgoing edge for each component in parallel
(marked in b, c, d). Then it merges the components according to the
resulting edges (e). . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3 The state transition diagram of the PMA algorithm. . . . . . . . . . . . 114
5.4 The graph (a) before and (b) after unifying u and v. . . . . . . . . . . . 123
5.5 Execution of DPMST on DIMACS graphs. . . . . . . . . . . . . . . . 125
5.6 Execution of DPMST on Erdős-Rényi G_{n=20000, 0.1≤p≤0.5} graphs . . . . . 126
5.7 Execution of DPMST on random dense graphs . . . . . . . . . . . . . 126
5.8 Execution time of HybridPMA on Erdős-Rényi graphs, varying average
degree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.9 Experiments on varying average degree for four types of graph, |V| = 1M. 128
5.10 Experiments on varying the number of vertices. . . . . . . . . . . . . . 130
5.11 Execution time of PMA with varying γ. . . . . . . . . . . . . . . . . . 131

5.12 Reduction rate of different algorithms . . . . . . . . . . . . . . . . . . 133
6.1 A thread processing column j in the parallel block Floyd-Warshall algo-
rithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.2 Log plot of running time of APSP algorithms on the Enron graph with
varying number of sampled vertices . . . . . . . . . . . . . . . . . . . 158
6.3 Log plot of running time of APSP algorithms on a synthetic Erdős-Rényi
graph with varying probability of inclusion . . . . . . . . . . . . . . . 159
6.4 Log plot of running time of APSP algorithms on the Wiki graph with
varying number of sampled vertices . . . . . . . . . . . . . . . . . . . 160
6.5 Log plot of running time of APSP algorithms on a synthetic Watts and
Strogatz graph with varying average degree . . . . . . . . . . . . . . . 161
6.6 Log plot of running time of L-APSP algorithm on the Enron graph with
2048 vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.7 Log plot of running time of L-APSP algorithm on a complete graph with
1024 vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.8 Log plot of running time of the L-APSP algorithm on a sparse (inclusion
probability 0.1) Erdős-Rényi graph with 1024 vertices . . . . . . . . . . 163
6.9 Log plot of running time of the L-APSP algorithm on the Wiki graph with
1024 vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.10 Log plot of running time of L-APSP algorithm on a dense (average de-
gree 256) Watts and Strogatz graph with 1024 vertices . . . . . . . . . 164
6.11 Log plot of running time of L-APSP algorithm on a sparse (average

degree 16) Watts and Strogatz graph with 1024 vertices . . . . . . . . . 164
6.12 An Example Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.13 Graph for given 3-SAT problem in Theorem 3 . . . . . . . . . . . . . . 178
6.14 Path length matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.15 GD numbers and Opacity Matrix . . . . . . . . . . . . . . . . . . . . . 181
6.16 Graph edit distance ratio (Distortion) vs. Confidence(θ) . . . . . . . . . 186
6.17 EMD of degree distributions vs. Confidence(θ) . . . . . . . . . . . . . 189
6.18 EMD of Geodesic distributions vs. Confidence(θ) . . . . . . . . . . . 190
6.19 Mean of the differences of Clustering Coefficients vs. Confidence(θ) . . 192
6.20 Runtime comparison of Gnutella network when varying number of nodes 192
6.21 Runtime of different heuristics for graphs of different size and density
vs. Confidence(θ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
6.22 Impact of L-based Pruning . . . . . . . . . . . . . . . . . . . . . . . . 194
List of Tables
4.1 Selectivity of the datasets (×10⁻⁶) . . . . . . . . . . . . . . . . . . . . 92
5.1 Runtime of algorithms on dense small graphs (times in milliseconds) . . 125
5.2 Runtime of different CPU and GPU algorithms on real-world networks. 129
6.1 Description of the original datasets . . . . . . . . . . . . . . . . . . . . 187
6.2 Data set properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.3 L-Coherency of different data sets . . . . . . . . . . . . . . . . . . . . 195
List of Algorithms
1 PLCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2 Original ER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 ER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5 ZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 PreLogZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7 PreZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
8 PER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
9 PZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
10 PPreZER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
11 HiDOP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
12 Tree Building Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
13 Assignment Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
14 Probing Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
15 Data Parallel HiDOP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
16 PJOIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
17 Borůvka's algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
18 Prim’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
19 Data Parallel MST algorithm . . . . . . . . . . . . . . . . . . . . . . . . 112
20 PMA algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
21 Partial Prim algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
22 MinPMA algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
23 SortPMA algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
24 Unifying algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
25 Floyd-Warshall algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 139
26 Single Edge SP extension algorithm . . . . . . . . . . . . . . . . . . . . 141
27 Repeated squaring APSP algorithm . . . . . . . . . . . . . . . . . . . . 141
28 GE-APSP algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
29 Parallel repeated squaring APSP algorithm . . . . . . . . . . . . . . . . 145
30 Parallel Block Floyd-Warshall algorithm . . . . . . . . . . . . . . . . . 146
31 Optimized Parallel Block Floyd-Warshall algorithm . . . . . . . . . . . . 147

32 L-pruned Floyd-Warshall algorithm . . . . . . . . . . . . . . . . . . . . 153
33 Pointer-based L-pruned F-W algorithm . . . . . . . . . . . . . . . . . . 154
34 max LO algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
35 Edge Removal algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 182
36 Edge Removal/Insertion algorithm . . . . . . . . . . . . . . . . . . . . . 185
CHAPTER 1
Introduction
1.1 Graph
J. J. Sylvester, in 1878, in an article on chemistry and algebra in Nature [162], gave the name "graph" to a mathematical structure that models connections between objects. The model had, however, already been used by Leonhard Euler in 1736 for the Königsberg Bridge Problem [59]. Euler resolved this question by proving that there is no walk that crosses each of the seven Königsberg bridges, as illustrated in Figure 1.1, exactly once. A graph G is defined as a pair (V, E), where V is a set of vertices (i.e., nodes) and E is a set of edges between the vertices. The adjacency relation for graph G is E ⊆ {(u, v) | u, v ∈ V}. When the graph is undirected, the adjacency relation defined by the edges is symmetric, so E ⊆ {{u, v} | u, v ∈ V}. A weighted (edge-weighted) graph is a graph in which each edge is assigned a weight w(u, v). A graph can be further generalized to a hypergraph, whose edges, called hyperedges, are non-empty subsets of V; E is then a subset of P(V) \ {∅}, where P(V) is the power set of V. Cormen, Leiserson, Rivest and
Stein, in their popular textbook [51], describe the role of graphs and graph algorithms in
computer science as follows:
"Graphs are a pervasive data structure in computer science, and algorithms
working with them are fundamental to the field." [51]
Figure 1.1: The seven Königsberg bridges, courtesy of [78].
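To make these definitions concrete, the following is a minimal sketch of a weighted graph G = (V, E) stored as a flat edge list, the representation that best suits the data-parallel algorithms discussed later in this dissertation. The sketch is illustrative host-side C++ (valid as CUDA code), not the thesis's actual data structure; all names are assumptions.

```cuda
#include <cstdint>
#include <vector>

// One weighted edge {u, v} with weight w(u, v).
struct Edge {
    uint32_t u, v;  // endpoint vertices, 0 <= u, v < numVertices
    float    w;     // the edge weight w(u, v)
};

// A graph G = (V, E) stored as a flat edge list. V is implicit:
// the vertices are the integers 0 .. numVertices - 1. A flat,
// contiguous array is convenient for data-parallel processing,
// since it can be copied to GPU memory in a single transfer.
struct Graph {
    uint32_t          numVertices;
    std::vector<Edge> edges;
};
```

A hypergraph generalizes this representation by letting each edge hold an arbitrary non-empty set of vertices rather than exactly two.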
Figures 1.2 and 1.3 illustrate the number of research articles using graphs and their fields, respectively¹. The results further demonstrate the versatile representativeness of the graph data structure. For instance, in computer science, the graph is an abstract data type for representing social networks [159, 26], data management [18], and web graphs² [17, 20], as well as flow control and program verification [45]. In mathematics, graphs are used to study knot theory [16] and group theory [35]. Transportation, traffic control and road networks in engineering [177], as well as protein and brain simulation in bioinformatics [29], are also modeled by graphs. In physics, graphs help in understanding the dynamics of physical processes and complicated atomic structures [145]. In the social sciences, graphs have been employed to explore diffusion and to extract communities through the analysis of social networks [170, 153].
¹The data is extracted from the Scopus online library, an indexing database of peer-reviewed articles [11].
²To study its properties, the web is usually modeled as a graph, known as the Web graph [40].
[Figure 1.2 is a bar chart; the yearly article counts grow from 441 in 1950 to 201,757 in 2010.]
Figure 1.2: Number of research articles until April 2012 using graphs, extracted from
the Scopus database [11].
Because of this versatile representativeness of graphs, social networks and the web are among the many phenomena and artifacts that can be modeled as large graphs to be analyzed through graph algorithms. However, given the size of the underlying graphs, fundamental operations such as path finding become challenging [132, 20, 27]. In 2011, the number of internet users reached 2,267,233,742 among 6,930,055,154 people worldwide³, all contributing to the production of an enormous amount of data in graph form. Among these large-scale networks, for instance, Facebook as a social network and LinkedIn as a professional network, as well as web graphs and graphs of emails, have seen explosive growth. In January 2011, LinkedIn contained 101 million users, with a growth rate of 3 million users per month. At the end of December 2011, the social graph of Facebook contained more than 750 million active users, with an average friend count of 130. In April 2012, there were 676,919,707 websites and 3.3 billion email accounts worldwide [14].
1.2 Parallel processing
Graphics Processing Units (GPUs) were fundamentally designed for fast rendering of
images for display. Nevertheless, the introduction of programmable rendering pipelines
[Figure 1.3 is a pie chart: Computer Science 26%, Mathematics 22%, Engineering 21%, Bioinformatics 16%, Physics and Astronomy 8%, Social Sciences 5%, Business 1%, Miscellaneous 1%.]
Figure 1.3: Field of research articles until April 2012 using graphs, extracted from the
Scopus database [11].
let shader programmers develop non-graphical computations for these GPUs. Employing GPUs for general-purpose data processing required knowing how to use textures as a place for data and how to ask the shaders not to generate pixels but to process the data held in the textures. This nonintuitive process evolved with the introduction of parallel programming architectures. The evolution has resulted in broader use of GPUs in various fields, especially in the domain of data processing. Nowadays, these readily available GPUs, known as many-core architectures, are ubiquitous and cheap [117]. GPUs are commonly installed on today's home computers, workstations, consoles, and gaming devices. They can run thousands of concurrent threads [4]. Therefore, designing parallel algorithms for GPUs has become one of the most studied approaches for improving the performance of algorithms [98]. Several pieces of work have exploited the GPU's ubiquity to propose high-performance, general data processing algorithms [72, 73, 108]. However, in contrast to the multi-core architecture of the Central Processing Unit (CPU), GPUs are designed for fine-grained data-parallel algorithms [134]. Therefore, algorithms designed for the so-called many-core GPUs require a different tuning, namely Single Instruction Multiple Threads (SIMT), in comparison to algorithms designed for multi-core CPUs.
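To make the SIMT model concrete, the following is a minimal CUDA sketch, a generic example rather than an algorithm from this thesis: every thread executes the same instruction stream, but its computed global index makes it operate on a different data element.

```cuda
#include <cuda_runtime.h>

// SIMT in miniature: all threads run the same instructions, while the
// computed global index selects a different array element per thread.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)              // guard: the last block may overshoot n
        data[i] *= factor;
}

// Host-side launch: 256 threads per block, enough blocks to cover n;
// d_data is assumed to already reside in GPU (device) memory.
void scaleOnGpu(float* d_data, float factor, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, factor, n);
    cudaDeviceSynchronize();  // wait until the kernel completes
}
```

The bounds guard is characteristic of SIMT code: launches are rounded up to whole thread-blocks, so trailing threads must check that they fall within the data.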
1.3 Contributions
The proliferation of gigantic graph-form data and the increasing demand for processing graph data on the one hand, and the ubiquity of parallel processors on the other, call for the development of scalable and fast graph processing algorithms. Data-parallel algorithms [74] come into play to achieve this objective. In this dissertation, we explore the difficulties of adapting fundamental yet practical graph algorithms with database applications to these massively parallel processors, in order to design scalable graph algorithms. We study both graph data generation problems and graph data management problems [18, 182]. Our contribution in this research is the design of scalable algorithms for the above-mentioned problems. We address both random and real-world graph models. The results of our study demonstrate the usability of GPUs in graph data processing. For instance, through experiments, we show that our fine-tuned algorithm for generating random graph data can be executed on a typical GPU on average 19 times faster than its fastest sequential version on the CPU.
After introducing scalable algorithms for generating graphs, we use the proposed techniques to design scalable algorithms for processing graphs. The path problem is a well-known graph processing problem [30]. Furthermore, computing the Minimum Spanning Forest and computing the shortest paths between vertices are good examples of greedy and dynamic programming algorithms, respectively [51]. Therefore, in this thesis we address these two problems for processing graphs. We empirically analyze the strengths and weaknesses of the previous solutions for each problem and explore the trade-offs that can be made. For each problem we devise a novel solution that scales better than the state-of-the-art solutions. Sections 1.4 and 1.5 describe these problems in greater detail.
1.4 Graph data generation
This thesis first studies algorithms for generating graphs. Graphs may be generated from a random process, i.e., random graphs, or from modeling real-world data, e.g., a rat's brain. The following subsections address each in turn.
1.4.1 Generating random graphs
In random graph generation, two simple, elegant, and general mathematical models are instrumental. The former, denoted G(v, e), chooses a graph uniformly at random from the set of graphs with v vertices and e edges. The latter, denoted G(v, p), chooses a graph uniformly at random from the set of graphs with v vertices where each edge has the same independent probability p of existing. Paul Erdős and Alfréd Rényi proposed the G(v, e) model [58], while E. N. Gilbert proposed, at the same time, the G(v, p) model [68]. Nevertheless, both are commonly referred to as Erdős-Rényi models.
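For concreteness, the following is a minimal sketch of the naive G(v, p) generator for an undirected graph without self-loops: each of the v(v−1)/2 candidate edges is included independently with probability p. This is illustrative host-side C++, not the thesis's implementation; Chapter 3 develops algorithms that avoid its quadratic number of coin flips.

```cuda
#include <random>
#include <utility>
#include <vector>

// Naive G(v, p) generator: one biased coin flip per candidate edge.
// It always performs v(v-1)/2 flips, which is why it does not scale
// to large graphs.
std::vector<std::pair<int, int>> generateGnp(int v, double p, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<std::pair<int, int>> edges;
    for (int a = 0; a < v; ++a)
        for (int b = a + 1; b < v; ++b)   // each unordered pair {a, b} once
            if (coin(rng) < p)
                edges.emplace_back(a, b);
    return edges;
}
```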
Application
The above-mentioned models have been widely utilized in many fields, e.g., communication engineering [53, 64, 118], biology [119, 126] and social network studies [62, 90, 133]. The so-called Erdős-Rényi models are also used for sampling. Namely, G(v, e) is a uniform random sampling of e elements from a set of v, so a sampling process can be effectively simulated using the random generation process as a component; a sketch of this sampling step follows below. These two models can be easily adapted to model directed and undirected graphs, with and without self-loops, as well as bi- and multipartite graphs.
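Since G(v, e) is exactly a uniform random sample of e items from the set of candidate edges, the sampling component can be sketched with a partial Fisher-Yates shuffle. The function name and the choice of Fisher-Yates here are illustrative assumptions, not the thesis's method.

```cuda
#include <numeric>
#include <random>
#include <vector>

// Uniformly sample e distinct elements from {0, ..., v-1} via a
// partial Fisher-Yates shuffle; this is the sampling process that
// the G(v, e) model performs over the candidate edges.
std::vector<int> sampleWithoutReplacement(int v, int e, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<int> items(v);
    std::iota(items.begin(), items.end(), 0);   // 0, 1, ..., v-1
    for (int i = 0; i < e; ++i) {
        std::uniform_int_distribution<int> pick(i, v - 1);
        std::swap(items[i], items[pick(rng)]);  // fix position i
    }
    items.resize(e);   // the first e entries form the uniform sample
    return items;
}
```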