R for Big Data
Indrajit Roy, HP Labs
October 2013
Team: Shivaram, Erik, Kyungyong, Alvin, Rob, Vanish
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
A tale of three researchers
(Systems + PL) talk about data mining problems!
[Figure: three research areas: Systems, Programming languages, Data science]
A Big Data story
Once upon a time, a customer in distress had…
… 2+ billion rows of financial data (TBs of data)
… wanted to model defaults on mortgage and credit cards
… by running regression analysis
… Alas!
… traditional databases don’t support regression analysis
… custom code can take from hours to days
Moral of the story:
Customers need a platform and a programming model for complex analysis
Big Data has many facets
Volume: >7 TB/day
Variety: >1B user graph, >40B photos
Velocity: >1M customer transactions/hr
Just “Big” is not the issue
Volume: Storage is not a problem. Petabytes can be handled by DBs.
Volume + Variety: Complex analytics at scale.
Volume + Velocity: Event processing at scale. Not today’s talk.
Big data, complex algorithms
PageRank (Dominant eigenvector)
Recommendations (Matrix factorization)
Anomaly detection (Top-K eigenvalues)
User Importance (Vertex Centrality)
Machine learning + Graph algorithms: Iterative Linear Algebra Operations
Example: PageRank using matrices
Simplified algorithm: repeat { p = M*p }
M = Web graph matrix
p = PageRank vector
Power method: computes the dominant eigenvector
Linear Algebra Operations on Sparse Matrices
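The simplified loop above can be sketched in plain, single-machine R. This is only an illustration under stated assumptions: the 3-page graph, the function name power_method, the tolerance, and the iteration cap are all made up here, and the matrix is dense rather than sparse.

```r
# Power method sketch for the simplified update p = M %*% p.
# M is a made-up column-stochastic matrix for a 3-page graph:
# column j holds the probabilities of moving from page j to each page.
M <- matrix(c(0,   0, 1,
              0.5, 0, 0,
              0.5, 1, 0),
            nrow = 3, byrow = TRUE)

power_method <- function(M, tol = 1e-10, max_iter = 1000) {
  p <- rep(1 / nrow(M), nrow(M))          # start from the uniform vector
  for (iter in seq_len(max_iter)) {
    p_new <- as.vector(M %*% p)           # one matrix-vector product per step
    p_new <- p_new / sum(p_new)           # keep the entries summing to 1
    if (sum(abs(p_new - p)) < tol) break  # stop once p no longer changes
    p <- p_new
  }
  p_new
}

p <- power_method(M)   # approximates the dominant eigenvector of M
```

Each iteration is a single matrix-vector product, which is why the talk frames PageRank as a linear algebra operation.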
Towards Distributed R
[Figure*: systems placed by Variety (Efficiency) against Volume (Scalability). R/Matlab: machine learning, images, graphs. RDBMS (col. store): SQL, database. Hadoop: search, sort. Target region: Scale + Complex Analytics.]
* very simplified view
Large scale analytics frameworks
Data-parallel frameworks – MapReduce/Dryad (2004)
Process each record in parallel
Use case: Computing sufficient statistics, analytics queries
Graph-centric frameworks – Pregel/GraphLab (2010)
Process each vertex in parallel
Use case: Graphical models
Array-based frameworks (our approach*)
Process blocks of an array in parallel
Use case: Linear Algebra Operations
*Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, R. Schreiber. EuroSys 2013.
Enter the world of R
R: Arrays for analysis
What is R?
R is a programming language and environment for statistical computing
• Array-oriented
• Millions of users, thousands of free packages
Traditional applications of R: Machine Learning, Graph Algorithms, Bioinformatics
Why is R popular?
Extremely data driven
• Load data, analyze using functions
• From sum, mean, median to regression, PageRank, and others
• Plot!
Simple data structures: Arrays and Data-frames
CRAN is the ultimate app store for algorithms
• Almost everything is a package. Even GUIs
• Community driven
• Use install.packages() to get your favorite package
Example 1: Arrays
> A<-array(1:4, dim=c(2,2))     # 2x2 matrix
> A
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> mean(A)                       # Statistics
[1] 2.5
> A[2,]<-NA                     # Indexing & missing values
> A
     [,1] [,2]
[1,]    1    3
[2,]   NA   NA
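As a small follow-up to the missing-values step, the sketch below shows how NA entries affect a statistic such as mean(). The variable names are chosen here for illustration only.

```r
A <- array(1:4, dim = c(2, 2))
A[2, ] <- NA                          # second row becomes missing

mean_all  <- mean(A)                  # NA: missing values propagate by default
mean_seen <- mean(A, na.rm = TRUE)    # mean of the observed entries (1 and 3)
n_missing <- sum(is.na(A))            # number of missing entries
```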
Example 2: Functions
> foo<-function(x){
+   print(x)      # x = formal variable
+   print(y)      # y = free variable
+ }
> y<-10
> foo(1)
[1] 1
[1] 10
Almost functional. Superassignment (<<-) has global effect.
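To make the superassignment remark concrete, here is a minimal sketch contrasting <- and <<-. All function and variable names are invented for this example. Strictly, <<- assigns in the nearest enclosing environment that already holds the variable, which is the global environment for functions defined at top level.

```r
total <- 0
add_local  <- function(x) { total <- total + x; invisible(NULL) }   # changes a local copy only
add_global <- function(x) { total <<- total + x; invisible(NULL) }  # changes the outer variable

add_local(5)
after_local <- total     # still 0
add_global(5)
after_global <- total    # now 5

# The same operator lets a closure keep state between calls.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # updates `count` in the enclosing function
    count
  }
}
counter <- make_counter()
first  <- counter()
second <- counter()
```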
Example 3: Classes and objects
> setClass("person", representation(name="character", age="numeric"))    # Class and methods; supports inheritance
> setMethod("show", signature("person"), function(object){
+   print(object@name)
+ })
> x<-new("person", name="Gru", age=40)    # Objects
> show(x)
[1] "Gru"
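The slide notes that classes support inheritance. A minimal S4 sketch of that point follows. The subclass "student", the generic describe, and the field values are hypothetical and not part of the talk.

```r
library(methods)

setClass("person", representation(name = "character", age = "numeric"))
# "student" inherits the name and age slots from "person".
setClass("student", contains = "person", representation(school = "character"))

setGeneric("describe", function(object) standardGeneric("describe"))
setMethod("describe", "person", function(object) {
  paste(object@name, "is", object@age)
})

s <- new("student", name = "Agnes", age = 8, school = "Hypothetical Elementary")
description <- describe(s)   # no "student" method exists, so the "person" method is used
```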
But R is …
Not parallel
Not distributed
Limited by dataset size
Enter Distributed R
Challenge 1: R has limited support for parallelism
• R is single-threaded
• Multi-process solutions offered by extensions
• Threads/processes share data through pipes or network
− Time-inefficient (sending copies)
− Space-inefficient (extra copies)
[Figure: R processes on Server 1 and Server 2. One process holds the data and every other process holds its own copy, made locally within a server or over the network between servers.]
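As one illustration of the multi-process route, the sketch below uses the parallel package that ships with R. The slides do not name a specific extension, so this choice is an assumption. Each chunk is serialized and sent to a separate worker process, which is the copying cost the bullets describe.

```r
library(parallel)

x <- as.numeric(1:1000)
chunks <- split(x, rep(1:2, each = 500))     # one chunk per worker

cl <- makeCluster(2)                          # two separate R worker processes
partial_sums <- parLapply(cl, chunks, sum)    # each chunk is copied to a worker
stopCluster(cl)

total <- sum(unlist(partial_sums))            # combine results in the master process
```

The workers never share memory with the master, so data that several workers need is sent once per worker.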
Challenge 2: R is memory bound
• Data needs to fit in DRAM
• Current research solution:
− Uses custom bigarray objects with limited functionality
− Even simple operations like x+y may not work
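A rough back-of-the-envelope sketch of the memory bound: a dense numeric matrix in R takes 8 bytes per entry. The helper name dense_gib and the choice of 10 columns are assumptions for illustration, applied to the 2 billion rows mentioned earlier in the talk.

```r
# Approximate DRAM needed for a dense numeric (double) matrix: 8 bytes per entry.
dense_gib <- function(n_rows, n_cols) {
  n_rows * n_cols * 8 / 2^30
}

# Sanity check against R's own accounting on a small matrix.
small <- matrix(0, nrow = 1000, ncol = 10)
measured_bytes  <- as.numeric(object.size(small))   # data plus a small object header
estimated_bytes <- 1000 * 10 * 8

needed_gib <- dense_gib(2e9, 10)   # 2 billion rows x 10 hypothetical columns
```

Under these assumptions the matrix alone needs roughly 149 GiB, before any temporary copies made during computation.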
Challenge 3: Sparse datasets cause load imbalance
Computation + communication imbalance!
[Figure: Block density (normalized, log scale from 1 to 10000) against Block ID (1 to 91) for LiveJournal, ClueWeb-1B, and Netflix.]
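The imbalance can be reproduced on synthetic data. The sketch below does not use the datasets from the figure. It draws row indices of nonzeros from a made-up skewed distribution, then counts nonzeros per equal-sized row block and normalizes by the sparsest block.

```r
set.seed(1)
N <- 1000                                  # rows in the hypothetical sparse matrix
s <- 100                                   # rows per block, giving N/s = 10 blocks

# Skewed graph assumption: low-numbered rows receive far more nonzeros.
row_weights <- 1 / seq_len(N)
rows <- sample(N, size = 20000, replace = TRUE, prob = row_weights)

block_id <- (rows - 1) %/% s + 1           # block that each nonzero falls into
nnz_per_block <- tabulate(block_id, nbins = N / s)
normalized_density <- nnz_per_block / min(nnz_per_block)
```

Blocks have equal size but very different work, so a worker assigned the densest block does far more computation and communication than the rest.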
Distributed R for Big Data
Challenges in scaling R
Programming model
Mechanisms
Applications and results
Enhancement #1: Distributed data structures
darray
• Relies on user-defined partitioning
• Also supports distributed data-frames
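User-defined partitioning can be pictured with ordinary R objects. The sketch below is not the darray API. It is a single-machine stand-in in which the helper block_rows (a name invented here) splits a matrix and a data frame into row blocks of a chosen size.

```r
# Split row indices 1..N into consecutive blocks of at most s rows.
block_rows <- function(N, s) {
  split(seq_len(N), ceiling(seq_len(N) / s))
}

M <- matrix(1:20, nrow = 10)                           # 10 x 2 matrix
parts <- lapply(block_rows(nrow(M), 4),
                function(idx) M[idx, , drop = FALSE])  # blocks of 4, 4, and 2 rows
block_sizes <- unname(sapply(parts, nrow))

df <- data.frame(id = 1:10, value = (1:10)^2)
df_parts <- lapply(block_rows(nrow(df), 4),
                   function(idx) df[idx, , drop = FALSE])

reassembled <- do.call(rbind, unname(parts))           # stacking the blocks restores M
```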
Enhancement #2: Distributed loop
• Express computations over partitions
• Execute across the cluster
foreach: apply f(x) to each partition
Distributed PageRank
[Figure: P, M, P_old and Z are each split by rows into N/s partitions (P1, P2, …, PN/s) of s rows each.]
M <- darray(dim=c(N,N), blocks=c(s,N))    # Create distributed array
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..){
  foreach(i, 1:len,
    function(p=splits(P,i), m=splits(M,i),
             x=splits(P_old), z=splits(Z,i)) {
      p <- (m %*% x) + z
      update(p)
    })
  P_old <- P
}
Distributed PageRank
foreach(i, 1:len, ...): Execute function in a cluster
splits(P,i), splits(M,i), splits(P_old), splits(Z,i): Pass array partitions
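The data flow of the distributed loop can be imitated on one machine in base R. This is a sketch under assumptions, not Distributed R code: lapply stands in for foreach, row-index blocks stand in for splits, and the 4-page graph, the damping factor d, and the convergence test are made up (the slide leaves the loop condition elided).

```r
N <- 4; s <- 2; d <- 0.85

# Hypothetical 4-page web graph as a column-stochastic matrix.
M <- matrix(c(0,   0,   1, 0.5,
              1/3, 0,   0, 0,
              1/3, 0.5, 0, 0.5,
              1/3, 0.5, 0, 0),
            nrow = N, byrow = TRUE)
Md <- d * M                                       # damped link matrix
Z  <- matrix((1 - d) / N, nrow = N, ncol = 1)     # uniform teleport term

row_blocks <- split(seq_len(N), ceiling(seq_len(N) / s))   # N/s row partitions

P_old <- matrix(1 / N, nrow = N, ncol = 1)
for (iter in 1:1000) {
  # Each partition needs its own rows of M and Z, plus the whole of P_old.
  parts <- lapply(row_blocks, function(idx) {
    Md[idx, , drop = FALSE] %*% P_old + Z[idx, , drop = FALSE]
  })
  P <- do.call(rbind, unname(parts))              # reassemble the updated vector
  if (sum(abs(P - P_old)) < 1e-12) break
  P_old <- P
}
```

Every partition reads all of P_old, which is why the previous vector is passed whole while P, M, and Z are passed as partitions.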