R JVMSummit R in Java FastR an implementation of the R language

R in Java
FastR: an implementation of the R language

Petr Maj, Tomas Kalibera, Jan Vitek, Floréal Morandat, Helena Kotthaus

Purdue University & Oracle Labs
Morandat et al.

… substitute. As the object system is built on those, we will only hint at its definition. The syntax of Core R, shown in Fig. 1, consists of expressions, denoted by $e$, ranging over numeric literals, string literals, symbols, array accesses, blocks, function declarations, function calls, variable assignments, variable super-assignments, array assignments, array super-assignments, and attribute extraction and assignment. Expressions also include values, $u$, and partially reduced function calls, $\nu(\overline{a})$, which are not used in the surface syntax of the language but are needed during evaluation. The parameters of a function declaration, denoted by $f$, can be either variables or variables with a default value, an expression $e$. Symmetrically, arguments of calls, denoted $a$, are expressions which may be named by a symbol. We use the notation $\overline{a}$ to denote the possibly empty sequence $a_1 \ldots a_n$.

Programs compute over a heap, denoted $H$, and a stack, $S$, as shown in Fig. 2. For simplicity, the heap differentiates between three kinds of addresses: frames, $\iota$, promises, $\delta$, and data objects, $\nu$. The notation $H[\iota/F]$ denotes the heap $H$ extended with a mapping from $\iota$ to $F$. The metavariable $\nu_\bot$ denotes $\nu$ extended with the distinguished reference $\bot$ which is used for missing values. Metavariable $\alpha$ ranges over pairs of possibly missing addresses, $\nu_\bot\ \nu'_\bot$. The metavariable $u$ ranges over both promises and data references. Data objects, $\kappa^{\alpha}$, consist of a primitive value $\kappa$ and attributes $\alpha$. Primitive values can be either an array of numerics, $\mathrm{num}[n_1 \ldots n_n]$, an array of strings, $\mathrm{str}[s_1 \ldots s_n]$, an array of references, $\mathrm{gen}[\nu_1 \ldots \nu_n]$, or a function, $\lambda \overline{f}.e, \Gamma$, where $\Gamma$ is the function's environment. A frame, $F$, is a mapping from a symbol to a promise or data reference. An environment, $\Gamma$, is a sequence of frame references. Finally, a stack, $S$, is a sequence of pairs, $e\,\Gamma$, such that $e$ is the current expression and $\Gamma$ is the current environment.

\[
\begin{array}{rcl}
H & ::= & \emptyset \mid H[\iota/F] \mid H[\delta/e\,\Gamma] \mid H[\delta/\nu] \mid H[\nu/\kappa^{\alpha}] \\
\alpha & ::= & \nu_\bot\ \nu_\bot \\
u & ::= & \delta \mid \nu \\
\kappa & ::= & \mathrm{num}[\overline{n}] \mid \mathrm{str}[\overline{s}] \mid \mathrm{gen}[\overline{\nu}] \mid \lambda \overline{f}.e, \Gamma \\
F & ::= & [] \mid F[x/u] \\
\Gamma & ::= & [] \mid \iota * \Gamma \\
S & ::= & [] \mid e\,\Gamma * S
\end{array}
\]
Fig. 2. Data
Evaluating the Design of R
What we do…

• TimeR — an instrumentation-based profiler for GNU-R
• TracR — a trace analysis framework for GNU-R
• CoreR — a formal semantics for a fragment of R
• TestR — a testing framework for the R language
• FastR — a new R virtual machine written in Java
The $\rightarrow$ relation has fourteen rules dealing with expressions, shown in Fig. 5, along with some auxiliary definitions given in Fig. 18 (where s and g denote functions that convert the type of their argument to a string and vector respectively). The first two rules deal with numeric and string literals. They simply allocate a vector of length one of the corresponding type with the specified value in it. By default, attributes for these values are empty. A function declaration, [FUN], allocates a closure in the heap and …

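\[ \frac{\nu\ \text{fresh} \quad \alpha = \bot\,\bot \quad H' = H[\nu/\mathrm{num}[n]^{\alpha}]}{n\ \Gamma;\, H \rightarrow \nu;\, H'}\ [\textsc{Num}] \qquad \frac{\nu\ \text{fresh} \quad \alpha = \bot\,\bot \quad H' = H[\nu/\mathrm{str}[s]^{\alpha}]}{s\ \Gamma;\, H \rightarrow \nu;\, H'}\ [\textsc{Str}] \]

\[ \frac{\nu\ \text{fresh} \quad \alpha = \bot\,\bot \quad H' = H[\nu/\lambda \overline{f}.e, \Gamma^{\alpha}]}{\mathtt{function}(\overline{f})\ e\ \Gamma;\, H \rightarrow \nu;\, H'}\ [\textsc{Fun}] \]

\[ \frac{\Gamma(H, x) = u}{x\ \Gamma;\, H \rightarrow u;\, H}\ [\textsc{Find}] \qquad \frac{H(\delta) = \nu}{\delta\ \Gamma;\, H \rightarrow \nu;\, H}\ [\textsc{GetP}] \]

\[ \frac{\mathrm{cpy}(H, \nu) = H', \nu' \quad \Gamma = \iota * \Gamma' \quad H(\iota) = F \quad F' = F[x/\nu'] \quad H'' = H'[\iota/F']}{x \mathrel{\texttt{<-}} \nu\ \Gamma;\, H \rightarrow \nu;\, H''}\ [\textsc{Ass}] \]

\[ \frac{\mathrm{cpy}(H, \nu) = H', \nu' \quad \Gamma = \iota * \Gamma' \quad \mathrm{assign}(x, \nu', \Gamma', H') = H''}{x \mathrel{\texttt{<<-}} \nu\ \Gamma;\, H \rightarrow \nu;\, H''}\ [\textsc{DAss}] \]

\[ \frac{\Gamma(H, x) = \nu' \quad \mathrm{readn}(\nu, H) = m \quad \mathrm{get}(\nu', m, H) = \nu'', H'}{x[[\nu]]\ \Gamma;\, H \rightarrow \nu'';\, H'}\ [\textsc{Get}] \]

\[ \frac{\begin{array}{c} \mathrm{cpy}(H, \nu') = H', \nu'' \quad \Gamma = \iota * \Gamma' \quad \iota(H', x) = \nu''' \\ \mathrm{readn}(\nu, H') = m \quad \mathrm{set}(\nu''', m, \nu'', H') = H'' \end{array}}{x[[\nu]] \mathrel{\texttt{<-}} \nu'\ \Gamma;\, H \rightarrow \nu';\, H''}\ [\textsc{SetL}] \]

\[ \frac{\begin{array}{c} \mathrm{cpy}(H, \nu') = H', \nu'' \quad \Gamma = \iota * \Gamma' \quad H'(\iota) = F \quad x \notin F \quad \Gamma'(H', x) = \nu''' \\ \mathrm{cpy}(H', \nu''') = H'', \nu'''' \quad F' = F[x/\nu''''] \quad H''' = H''[\iota/F'] \\ \mathrm{readn}(\nu, H) = m \quad \mathrm{set}(\nu'''', m, \nu'', H''') = H'''' \end{array}}{x[[\nu]] \mathrel{\texttt{<-}} \nu'\ \Gamma;\, H \rightarrow \nu';\, H''''}\ [\textsc{SetG}] \]

\[ \frac{H(\nu) = \kappa^{\alpha} \quad \alpha = \nu_\bot\ \nu'_\bot \quad \mathrm{index}(\nu', \nu'_\bot, H) = n \quad \mathrm{get}(\nu_\bot, n, H) = \nu''}{\mathrm{attr}(\nu, \nu')\ \Gamma;\, H \rightarrow \nu'';\, H}\ [\textsc{GetA}] \]

\[ \frac{H(\nu) = \kappa^{\alpha} \quad \alpha = \nu_\bot\ \nu'_\bot \quad \mathrm{index}(\nu', \nu'_\bot, H) = n \quad \mathrm{set}(\nu_\bot, n, \nu'', H) = H'}{\mathrm{attr}(\nu, \nu') \mathrel{\texttt{<-}} \nu''\ \Gamma;\, H \rightarrow \nu'';\, H'}\ [\textsc{ReplA}] \]

\[ \frac{\begin{array}{c} \mathrm{cpy}(H, \nu'') = H', \nu''' \quad H'(\nu) = \kappa^{\nu_\bot \nu'_\bot} \quad \mathrm{index}(\nu', \nu'_\bot, H') = \bot \quad \mathrm{reads}(\nu', H') = s \\ H'(\nu_\bot) = \mathrm{gen}[\overline{\nu}]^{\alpha} \quad H'(\nu'_\bot) = \mathrm{str}[\overline{s}]^{\alpha'} \\ H'' = H'[\nu_\bot/\mathrm{gen}[\overline{\nu}\,\nu''']^{\alpha}][\nu'_\bot/\mathrm{str}[\overline{s}\,s]^{\alpha'}] \end{array}}{\mathrm{attr}(\nu, \nu') \mathrel{\texttt{<-}} \nu''\ \Gamma;\, H \rightarrow \nu'';\, H''}\ [\textsc{SetA}] \]

\[ \frac{\begin{array}{c} \mathrm{cpy}(H, \nu'') = H', \nu^{3} \quad H'(\nu) = \kappa^{\bot \bot} \quad \nu^{4}, \nu^{5}\ \text{fresh} \quad \mathrm{reads}(\nu', H') = s \\ H'' = H'[\nu^{4}/\mathrm{gen}[\nu^{3}]^{\bot \bot}][\nu^{5}/\mathrm{str}[s]^{\bot \bot}] \end{array}}{\mathrm{attr}(\nu, \nu') \mathrel{\texttt{<-}} \nu''\ \Gamma;\, H \rightarrow \nu'';\, H''}\ [\textsc{SetB}] \]

Fig. 5. Reduction relation $\rightarrow$.

\[ \frac{e\ \Gamma;\, H \rightarrow e';\, H'}{C[e]\ \Gamma * S;\, H \Longrightarrow C[e']\ \Gamma * S;\, H'}\ [\textsc{Exp}] \qquad \frac{H(\delta) = e\ \Gamma'}{C[\delta]\ \Gamma * S;\, H \Longrightarrow e\ \Gamma' * C[\delta]\ \Gamma * S;\, H}\ [\textsc{ForceP}] \]

\[ \frac{\mathrm{getfun}(H, \Gamma, x) = \delta}{C[x(\overline{a})]\ \Gamma * S;\, H \Longrightarrow \delta\ \Gamma * C[x(\overline{a})]\ \Gamma * S;\, H}\ [\textsc{ForceF}] \qquad \frac{\mathrm{getfun}(H, \Gamma, x) = \nu}{C[x(\overline{a})]\ \Gamma * S;\, H \Longrightarrow C[\nu(\overline{a})]\ \Gamma * S;\, H}\ [\textsc{GetF}] \]

\[ \frac{H(\nu) = \lambda \overline{f}.e, \Gamma' \quad \mathrm{args}(\overline{f}, \overline{a}, \Gamma, \Gamma', H) = F, \Gamma'', H'}{C[\nu(\overline{a})]\ \Gamma * S;\, H \Longrightarrow e\ \Gamma'' * C[\nu(\overline{a})]\ \Gamma * S;\, H'}\ [\textsc{InvF}] \]

\[ \frac{H' = H[\delta/\nu]}{R[\nu]\ \Gamma' * C[\delta]\ \Gamma * S;\, H \Longrightarrow C[\delta]\ \Gamma * S;\, H'}\ [\textsc{RetP}] \qquad \frac{}{R[\nu]\ \Gamma' * C[\nu'(\overline{a})]\ \Gamma * S;\, H \Longrightarrow C[\nu]\ \Gamma * S;\, H}\ [\textsc{RetF}] \]

Evaluation Contexts:
\[
\begin{array}{rcl}
C & ::= & [] \mid x \mathrel{\texttt{<-}} C \mid x[[C]] \mid x[[e]] \mathrel{\texttt{<-}} C \mid x[[C]] \mathrel{\texttt{<-}} \nu \mid \{C; e\} \mid \{\nu; C\} \\
  & \mid & \mathrm{attr}(C, e) \mid \mathrm{attr}(\nu, C) \mid \mathrm{attr}(e, e) \mathrel{\texttt{<-}} C \mid \mathrm{attr}(C, e) \mathrel{\texttt{<-}} \nu \mid \mathrm{attr}(\nu, C) \mathrel{\texttt{<-}} \nu \\
R & ::= & [] \mid \{\nu; R\}
\end{array}
\]

Fig. 3. Reduction relation $\Longrightarrow$.

Morandat, Hill, Osvald, Vitek. Evaluating the Design of the R Language. ECOOP'12

\[ \frac{\Gamma = \iota * \Gamma' \quad \iota(H, x) = \nu \quad H(\nu) = \lambda \overline{f}.e, \Gamma''}{\mathrm{getfun}(H, \Gamma, x) = \nu}\ [\textsc{GetF1}] \qquad \frac{\Gamma = \iota * \Gamma' \quad \iota(H, x) = \nu \quad H(\nu) \neq \lambda \overline{f}.e, \Gamma''}{\mathrm{getfun}(H, \Gamma, x) = \mathrm{getfun}(H, \Gamma', x)}\ [\textsc{GetF2}] \]

\[ \frac{\Gamma = \iota * \Gamma' \quad \iota(H, x) = \delta \quad H(\delta) = \nu \quad H(\nu) = \lambda \overline{f}.e, \Gamma''}{\mathrm{getfun}(H, \Gamma, x) = \nu}\ [\textsc{GetF3}] \qquad \frac{\Gamma = \iota * \Gamma' \quad \iota(H, x) = \delta \quad H(\delta) = e\ \Gamma''}{\mathrm{getfun}(H, \Gamma, x) = \delta}\ [\textsc{GetF4}] \]

\[ \frac{\Gamma = \iota * \Gamma' \quad \iota(H, x) = \delta \quad H(\delta) = \nu \quad H(\nu) \neq \lambda \overline{f}.e, \Gamma''}{\mathrm{getfun}(H, \Gamma, x) = \mathrm{getfun}(H, \Gamma', x)}\ [\textsc{GetF5}] \]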

Why?

… language for data analysis and graphics
… used in statistics, biology, finance …
… books, conferences, user groups
… 4,338 packages
… 3 million users


Scripting data
read data into variables
make plots
compute summaries
more intricate modeling
develop simple functions to automate analysis
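
The workflow listed on this slide can be sketched as a short script. This is only an illustration: the data set, the column names `dose` and `response`, and the helper `analyse` are made up for the example and do not come from the talk.

```r
# Hypothetical data set, written to a temporary CSV file so the script is self-contained.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(dose = c(1, 2, 3, 4, 5),
                     response = c(2.1, 3.9, 6.2, 7.8, 10.1)),
          path, row.names = FALSE)

# Read data into variables.
d <- read.csv(path)

# Make a plot (on a null PDF device, so no file is created).
pdf(NULL)
plot(d$dose, d$response)
invisible(dev.off())

# Compute summaries.
avg <- mean(d$response)

# More intricate modeling: a simple linear regression.
fit <- lm(response ~ dose, data = d)
slope <- unname(coef(fit)["dose"])

# Develop a simple function to automate the analysis (hypothetical helper).
analyse <- function(file) {
  d <- read.csv(file)
  coef(lm(response ~ dose, data = d))
}
```

Calling `analyse(path)` repeats the read-and-fit steps for any file with the same two columns.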



R history
• 1976  S
  John Chambers @ Bell Labs, then S-Plus (closed-source owned by Tibco)
• 1993  R
  Ross Ihaka and Robert Gentleman started R as a new language at the University of Auckland, NZ
• Today, The R project
  Core team ~ 20 people, released under GPL license. Continued development of language & libraries: namespaces ('11), bytecode ('11), indexing beyond 2GB ('13)


What R is…

• vectorized
• functional
• object-oriented
• lazy
• portable
• interactive

What R isn’t…

• fast
• low-footprint
• concurrent
• distributed
• formally specified
• standardized



Performance

The programming language shootout: C / Python / R
Intel X5460. 3.16GHz, Linux 2.6.34. R 2.12.1, GCC v4.4.5

[Figure: slowdown per benchmark (S-1 to S-12, Avg) for Python and R; log scale, 1 to 500]
Fig. 6. Slowdown of Python and R, normalized to C


The programming language shootout: C / R
Intel X5460. 3.16GHz, Linux 2.6.34. R 2.12.1, GCC v4.4.5

[Figure: C / R comparison per benchmark (S-1 to S-12); log scale, 1 to 10000 and above]
Heap Allocated Memory

[Figure: heap allocated memory per benchmark for C, R User data and R internal; MB to GB, log scale]
Fig. 8. Heap allocated memory (MB log scale). C vs. R.

… consumes significant amounts of memory. Unlike C …, all user data in R must be heap allocated and garbage collected … memory usage in C (calls to malloc) and data allocated … R allocation is split between vectors (which are typically …) … used by the R interpreter for, e.g., arguments to … allocates orders of magnitude more data than … cases, the internal data required is more than the user data. … implemented by a copy-on-write (COW) mechanism. Thus, … are shared and only duplicated if there is actually a need … memory footprint. Even though the COW algorithm is … 37% of arguments are copied. … all numeric data … boxed into a vector. …% of vectors allocated in Bioconductor vignettes con… only a single numeric value.



Time breakdown

[Figure: stacked bars, fraction of execution time (0.0 to 1.0) per Bioconductor vignette, by category: mm, alloc.cons, alloc.list, alloc.vector, duplicate, lookup, match, external, builtin, arith, special]
Legend: Garbage collection, Allocate cons, Allocate vector, Duplicate, Lookup, External, Builtin
Fig. 7. Time breakdown of Bioconductor vignettes.

Bioconductor vignettes
Intel X5460. 3.16GHz, Linux 2.6.34. R 2.12.1, GCC v4.4.5

… Fig. 7 … breakdown of execu… Bioconductor … with ProfileR. … key obser… memory manage… an average of … time. Memory … was further … time spent in … (18.7%), al… (3.6%), vec… duplications … value seman… spent in built-…

How is R used?
• Extract core semantics by testing
- R has no official semantics
- A single reference implementation

• Observational study based on a large corpus
- Many open source programs come with “vignettes”
- Dynamic analysis gives under-approximated behaviors
- Static analysis gives over-approximation


Population

… last group is the base library that is bundled with … datasets.

Name         Bioc.  Shoot.  Misc.  CRAN   Base
# Package    630    11      7      1238   27
# Vignettes  100    11      4
R LOC        1.4M   973     1.3K   2.3M   91K
C LOC        2M     0       0      2.9M   50K

Fig. 5. Purdue R Corpus.

… as it makes them harder to analyze. We retained …

Vectors
x <- c(2,7,9,NA,5)
c(1,2,3) + x[1:3]
x[is.na(x)] <- 0

with(fd,carb*den)

with.default <-
  function(data,exp,...)
    eval(substitute(exp),
         data,
         parent.frame())
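
A minimal worked version of the vector operations on this slide, with the values they produce. The data frame `fd` is not defined in the talk, so a small hypothetical one is assumed here.

```r
x <- c(2, 7, 9, NA, 5)       # a numeric vector with a missing value
s <- c(1, 2, 3) + x[1:3]     # element-wise addition on the first three elements
x[is.na(x)] <- 0             # logical indexing: replace every NA by 0

# Hypothetical data frame standing in for `fd`.
fd <- data.frame(carb = c(1, 2), den = c(10, 20))
prod <- with(fd, carb * den) # evaluates carb * den with fd's columns in scope
```

Here `s` is `3 9 12`, `x` becomes `2 7 9 0 5`, and `prod` is `10 40`.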


Functions
q<-function(x=5)x*x*x

q()
q(2)
q(x=4)

p<-function(x=5,...,y=x+1)
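
A small sketch of how the default and variadic parameters above behave. The function is named `cube` rather than `q` only to avoid masking `base::q`, and the body given to `p` is hypothetical, since the slide shows just its signature.

```r
cube <- function(x = 5) x * x * x
a <- cube()        # no argument: the default x = 5 is used
b <- cube(2)       # positional argument
d <- cube(x = 4)   # named argument

# Hypothetical body: report x, y and how many extra arguments landed in `...`.
p <- function(x = 5, ..., y = x + 1) c(x = x, y = y, extras = length(list(...)))
r1 <- p()            # y's default is evaluated lazily, after x is known
r2 <- p(1, 2, 3)     # 2 and 3 go to `...`; y still defaults to x + 1
r3 <- p(1, y = 10)   # parameters declared after `...` must be passed by name
```

The default `y = x + 1` refers to another parameter, which works because default expressions are only evaluated when `y` is first needed.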


… an existing data structure to operate on, thus they are always side effecting. They account for 22% of all side effects and 12% of all assignments.

f(1, 2)
f(x=1,y=2)
f(y=1,x=2)
f(2,x=1)
f(x=1,2)
c(1,2,3,4)

[Figure: histogram of calls by number of arguments (0 to 255+), split into Position, Keyword and Variadic; log scale, 1 to 1G]
Fig. 12. Histogram of the number of function arguments in Bioconductor. (Log scale)

R symbol lookup is context sensitive. This feature, … neither Lisp nor Scheme …, is exercised in less than … function name lookups. … even though this num… the number of symbols actually checked is 3.6 on … The only symbols for … this feature actually matters in the Bioconductor vignettes are c and file, both popular variable names and built-in …

… The R function … syntax is expressive … expressivity is widely … 99% of the calls, at … arguments are passed, … percentage of calls … 7 arguments is 99.74% (Fig. 12). Functions that are … this average are typically … with positional arguments. … number of parameters increases, users are more likely to … function parameters by … Similarly, variadic parameters … to be called with large …


Promises
assert<-function(C,P)
  if (C) print(P)
assert(x==42, print("Oops"))
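
To make the laziness of `assert` visible, the sketch below replaces the second argument with a hypothetical helper `msg` that records whether it was ever evaluated. The helper and the flag `evaluated` are assumptions added for illustration.

```r
# Hypothetical helper: sets a flag as a side effect when it is evaluated.
evaluated <- FALSE
msg <- function() { evaluated <<- TRUE; "Oops" }

assert <- function(C, P) if (C) print(P)

x <- 0
assert(x == 42, msg())   # C is FALSE, so the promise for P is never forced
never_forced <- !evaluated

x <- 42
assert(x == 42, msg())   # C is TRUE, so P is forced and "Oops" is printed
forced <- evaluated
```

Arguments are passed as unevaluated promises, so the side effect in `msg` happens only on the second call.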


[Figure: histogram of the % of promises evaluated per vignette; x-axis 80 to 100, y-axis 0 to 12]
(a) Promises evaluated (in %)


Forcing promises
x <- F
x[12] <- F
F ; e
{e ; F}


Scoping
Lexical scoping with context sensitive name resolution

c <- 42
c(1,2,3)

c <- 42
d <- c
d(1,2,3)


less than 0.05% context sensitive function name lookups
only symbols that rely on it are c and file
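
The two snippets on this slide can be run side by side. In this sketch the failing call is wrapped in `tryCatch` so the script continues; the variable names `v` and `r` are added for the example.

```r
c <- 42
v <- c(1, 2, 3)   # call position: lookup skips the numeric binding and finds the function c

d <- c            # ordinary variable lookup: d is now the number 42
r <- tryCatch(d(1, 2, 3), error = function(e) "error")  # no function named d exists
```

The first call succeeds because a lookup in function-call position ignores bindings that are not functions, while `d(1,2,3)` fails because `d` only ever held the number.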


Referential transparency

f<-function(b){b[[1]]<-0}

assert(y[[1]]==5)
f(y)
assert(y[[1]]==5)
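
A runnable sketch of the same point: updating an argument inside a function does not affect the caller's value. The vector `y` and the variant `g`, which returns the modified copy, are assumptions added for illustration.

```r
f <- function(b) { b[[1]] <- 0 }      # as on the slide: only the local b is modified
g <- function(b) { b[[1]] <- 0; b }   # hypothetical variant returning the modified copy

y <- c(5, 6, 7)
stopifnot(y[[1]] == 5)
f(y)
stopifnot(y[[1]] == 5)   # the caller's y is unchanged
z <- g(y)                # z holds the updated copy, y still does not change
```

Both assertions pass because the assignment inside `f` works on a copy of the argument.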


Assignment

y <- c(…)
x[42] <- y
x[42] <<- y

\[ \frac{\mathrm{cpy}(H, \nu) = H', \nu' \quad \Gamma = \iota * \Gamma' \quad H(\iota) = F \quad F' = F[x/\nu'] \quad H'' = H'[\iota/F']}{x \mathrel{\texttt{<-}} \nu\ \Gamma;\, H \rightarrow \nu;\, H''}\ [\textsc{Ass}] \]

\[ \frac{\mathrm{cpy}(H, \nu) = H', \nu' \quad \Gamma = \iota * \Gamma' \quad \mathrm{assign}(x, \nu', \Gamma', H') = H''}{x \mathrel{\texttt{<<-}} \nu\ \Gamma;\, H \rightarrow \nu;\, H''}\ [\textsc{DAss}] \]

\[ \frac{\begin{array}{c} \mathrm{cpy}(H, \nu') = H', \nu'' \quad \Gamma = \iota * \Gamma' \quad \iota(H', x) = \nu''' \\ \mathrm{readn}(\nu, H') = m \quad \mathrm{set}(\nu''', m, \nu'', H') = H'' \end{array}}{x[[\nu]] \mathrel{\texttt{<-}} \nu'\ \Gamma;\, H \rightarrow \nu';\, H''}\ [\textsc{SetL}] \]

\[ \frac{\begin{array}{c} \mathrm{cpy}(H, \nu') = H', \nu'' \quad \Gamma = \iota * \Gamma' \quad H'(\iota) = F \quad x \notin F \quad \Gamma'(H', x) = \nu''' \\ \mathrm{cpy}(H', \nu''') = H'', \nu'''' \quad F' = F[x/\nu''''] \quad H''' = H''[\iota/F'] \\ \mathrm{readn}(\nu, H) = m \quad \mathrm{set}(\nu'''', m, \nu'', H''') = H'''' \end{array}}{x[[\nu]] \mathrel{\texttt{<-}} \nu'\ \Gamma;\, H \rightarrow \nu';\, H''''}\ [\textsc{SetG}] \]
