
SPLASH2013 indrajitroy rforbigdata Presto Big Data Analysis Beyond Hadoop


R for Big Data
Indrajit Roy, HP Labs
October 2013

Team: Shivaram, Erik, Kyungyong, Alvin, Rob, Vanish

© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.


A tale of three researchers
(Systems + PL) talk about data mining problems!
[Figure: three researchers, from Systems, Programming languages, and Data science]


A Big Data story
Once upon a time, a customer in distress had….
… 2+ billion rows of financial data (TBs of data)
… wanted to model defaults on mortgage and credit cards
… by running regression analysis

… Alas!
… traditional databases don’t support regression analysis
… custom code can take from hours to days

Moral of the story:
Customers need a platform and a programming model for complex analysis



Big Data has many facets

Volume: >7 TB/day
Variety: >1B user graph, >40B photos
Velocity: >1M customer transactions/hr


Just “Big” is not the issue

Volume: Storage is not a problem. Petabytes can be handled by DBs.
Volume + Variety: Complex analytics at scale.
Volume + Velocity: Event processing at scale. Not today's talk.


Big data, complex algorithms
Machine learning + Graph algorithms = Iterative Linear Algebra Operations
• PageRank (Dominant eigenvector)
• Recommendations (Matrix factorization)
• Anomaly detection (Top-K eigenvalues)
• User Importance (Vertex Centrality)


Example: PageRank using matrices
Simplified algorithm: repeat { p = M*p }
Linear Algebra Operations on Sparse Matrices
Power method: computes the dominant eigenvector

M = Web graph matrix
p = PageRank vector
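The simplified loop above can be sketched in base R. This is a minimal sketch, assuming a small hypothetical 4-page web graph and a column-stochastic M; the graph, the variable names, and the stopping rule are illustrative and not from the talk.

```r
# Hypothetical 4-page web graph: links[i, j] = 1 means page j links to page i.
links <- matrix(c(0, 0, 1, 1,
                  1, 0, 0, 0,
                  1, 1, 0, 1,
                  0, 1, 0, 0), nrow = 4, byrow = TRUE)

# Scale each column to sum to 1 so every page splits its vote evenly.
M <- sweep(links, 2, colSums(links), "/")

# Start from a uniform PageRank vector.
p <- matrix(1 / 4, nrow = 4, ncol = 1)

# repeat { p = M*p }, stopping once p no longer changes noticeably.
for (iter in 1:1000) {
  p_new <- M %*% p
  converged <- max(abs(p_new - p)) < 1e-10
  p <- p_new
  if (converged) break
}

# p now approximates the dominant eigenvector of M (eigenvalue 1).
```

Note that `%*%` is R's matrix product; the `*` in the slide's pseudocode denotes matrix multiplication, whereas `*` in R is element-wise.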



Towards Distributed R

[Figure: Variety (Efficiency) versus Volume (Scalability)*]
• R/Matlab: machine learning, images, graphs
• RDBMS (col. store): SQL, database
• Hadoop: search, sort
• Target: Scale + Complex Analytics

* very simplified view


Large scale analytics frameworks
Data-parallel frameworks – MapReduce/Dryad (2004)
Process each record in parallel
Use case: Computing sufficient statistics, analytics queries

Graph-centric frameworks – Pregel/GraphLab (2010)
Process each vertex in parallel

Use case: Graphical models

Array-based frameworks (our approach*)
Process blocks of array in parallel
Use case: Linear Algebra Operations

*Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, R. Schreiber. EuroSys 2013.


Enter the world of R



R: Arrays for analysis
What is R?
R is a programming language and environment for statistical
computing
• Array-oriented
• Millions of users, thousands of free packages

Traditional applications of R
• Machine Learning
• Graph Algorithms
• Bioinformatics


Why is R popular?
Extremely data driven
• Load data, analyze using functions
• From sum, mean, median to regression, PageRank, and others
• Plot!

Simple data structures: Arrays and Data-frames
CRAN is the ultimate APP store for algorithms
• Almost everything is a package. Even GUI
• Community driven
• Use install.packages() to get your favorite package



Example 1: Arrays
> A <- array(1:4, dim=c(2,2))    # 2x2 matrix
> A
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> mean(A)                        # statistics
[1] 2.5
> A[2,] <- NA                    # indexing and missing values
> A
     [,1] [,2]
[1,]    1    3
[2,]   NA   NA


Example 2: Functions
> foo <- function(x) {
+   print(x)
+   print(y)
+ }
> y <- 10
> foo(1)
[1] 1
[1] 10

x = formal variable
y = free variable

Almost functional.
Superassignment (<<-) has global effect
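The difference between ordinary assignment and superassignment can be seen in a small sketch. The names `counter`, `bump_local`, and `bump_global` are hypothetical and chosen only for illustration.

```r
counter <- 0

# A plain assignment inside a function creates a new local variable.
bump_local <- function() {
  counter <- counter + 1
  counter
}

# Superassignment (<<-) modifies the variable found in an enclosing scope.
bump_global <- function() {
  counter <<- counter + 1
  counter
}

local_result <- bump_local()     # the outer counter is left untouched
after_local <- counter
global_result <- bump_global()   # the outer counter is updated
after_global <- counter
```

This is why the slide calls R "almost functional": functions are free of side effects unless they opt in through `<<-`.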


Example 3: Classes and objects
> setClass("person", representation(name = "character", age = "numeric"))   # class
> setMethod("show", signature("person"), function(object) {                 # method
+   print(object@name)
+ })
> x <- new("person", name = "Gru", age = 40)                                # object
> show(x)
[1] "Gru"

Classes and methods; supports inheritance.
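Inheritance in this class system can be sketched as follows. The `student` class, the `describe` generic, and the slot values are hypothetical additions for illustration; only the `person` class comes from the slide.

```r
library(methods)

setClass("person", representation(name = "character", age = "numeric"))

# "student" inherits the slots of "person" and adds one of its own.
setClass("student", contains = "person",
         representation = representation(school = "character"))

# A generic with a method defined only for the parent class.
setGeneric("describe", function(obj) standardGeneric("describe"))
setMethod("describe", signature("person"), function(obj) {
  paste(obj@name, "is", obj@age)
})

s <- new("student", name = "Margo", age = 11, school = "Hypothetical School")

# No "student" method exists, so dispatch falls back to the "person" method.
description <- describe(s)
```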



But R is …

Not parallel
Not distributed
Limited by dataset size



Enter Distributed R



Challenge 1: R has limited support for parallelism
• R is single-threaded
• Multi-process solutions offered by extensions

• Threads/processes share data through pipes or network
− Time-inefficient (sending copies)
− Space-inefficient (extra copies)
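The cost of sharing by copy can be made concrete with a rough sketch. It assumes that data crossing a pipe or socket is serialized first, as R's socket-based parallel extensions do; the vector size is arbitrary.

```r
# One million doubles: roughly 8 MB of payload.
x <- rnorm(1e6)

# Serializing produces a complete byte-level copy of the vector,
# which is what would travel over the pipe or the network.
payload <- serialize(x, connection = NULL)
payload_bytes <- length(payload)

# The receiving process deserializes it into yet another copy.
y <- unserialize(payload)
```

Every additional worker that needs `x` repeats both steps, which is the time and space overhead listed above.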
[Figure: R processes on Server 1 and Server 2. One R process holds the data; the other three hold copies of the data, made by one local copy and two network copies.]


Challenge 2: R is memory bound
• Data needs to fit in DRAM
• Current research solution:
− Uses custom bigarray objects with limited functionality

− Even simple operations like x+y may not work



Challenge 3: Sparse datasets cause load imbalance

[Figure: Block density (normalized, log scale from 1 to 10000) versus Block ID (1 to 91) for the LiveJournal, ClueWeb-1B, and Netflix datasets]

Computation + communication imbalance!
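Normalized block density can be sketched in base R. The skewed degree distribution below is synthetic and hypothetical; it is not the LiveJournal, ClueWeb-1B, or Netflix data, and the block count is arbitrary.

```r
n <- 1000       # rows in the (hypothetical) sparse matrix
blocks <- 10    # equal-sized row blocks

# Synthetic skew: row i holds about n / i non-zero entries.
nnz_per_row <- ceiling(n / seq_len(n))

# Assign consecutive rows to blocks of n / blocks rows each.
block_id <- ceiling(seq_len(n) / (n / blocks))

# Non-zeros per block, normalized by the sparsest block.
nnz_per_block <- tapply(nnz_per_row, block_id, sum)
density <- nnz_per_block / min(nnz_per_block)
```

Blocks with equal row counts end up holding very different numbers of non-zeros, so a worker assigned a dense block does far more computation and communication than one assigned a sparse block.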


Distributed R for Big Data
Challenges in scaling R
Programming model
Mechanisms
Applications and results




Enhancement #1: Distributed data structures
• Relies on user defined partitioning
• Also support for distributed data-frames

darray
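User-defined partitioning can be illustrated on a single machine. This sketch uses an ordinary list of row blocks as a stand-in for a distributed array; it is plain base R and does not use the Distributed R API. The sizes N and s are hypothetical.

```r
N <- 6   # the array is N x N
s <- 2   # rows per partition, chosen by the user

A <- matrix(seq_len(N * N), nrow = N)

# Map every row to its partition, then cut the array into row blocks.
block_of_row <- ceiling(seq_len(N) / s)
partitions <- lapply(split(seq_len(N), block_of_row),
                     function(rows) A[rows, , drop = FALSE])

# Stacking the partitions back together recovers the original array.
reassembled <- do.call(rbind, unname(partitions))
```

In a distributed setting each element of `partitions` would live on a different worker instead of in one list.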


Enhancement #2: Distributed loop
• Express computations over partitions
• Execute across the cluster

foreach: f(x)
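The idea of a loop over partitions can be sketched sequentially with `lapply`. This is a single-machine stand-in, not the Distributed R `foreach`; the partition contents and the function `f` are hypothetical.

```r
# Hypothetical partitioned vector: three partitions of two elements each.
partitions <- list(c(1, 2), c(3, 4), c(5, 6))

# The per-partition computation f(x).
f <- function(x) x^2

# Apply f to every partition. The iterations are independent of each other,
# so a distributed runtime could execute them on different workers.
results <- lapply(partitions, f)

combined <- unlist(results)
```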


Distributed PageRank
[Figure: M is an N x N array; P, P_old, and Z are N x 1 arrays. Each is split into N/s partitions (P1, P2, ..., PN/s) of s rows.]

# Create distributed arrays
M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..){
  foreach(i, 1:len,
    function(p=splits(P,i), m=splits(M,i),
             x=splits(P_old), z=splits(Z,i)) {
      p <- (m %*% x) + z    # matrix product of one row block with the full vector
      update(p)
    })
  P_old <- P
}


Distributed PageRank
[Figure: M is an N x N array; P, P_old, and Z are N x 1 arrays. Each is split into N/s partitions (P1, P2, ..., PN/s) of s rows.]

M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..){
  # Execute function in a cluster
  foreach(i, 1:len,
    # Pass array partitions
    function(p=splits(P,i), m=splits(M,i),
             x=splits(P_old), z=splits(Z,i)) {
      p <- (m %*% x) + z    # matrix product of one row block with the full vector
      update(p)
    })
  P_old <- P
}

