Berkeley RAD Lab:
Research in
Internet-scale Computing Systems
Randy H. Katz
28 March 2007
2
Five Year Mission
•
Observation: Internet systems complex, fragile, manually
managed, evolving rapidly
–
To scale eBay, must build an eBay-sized company
–
To scale YouTube, get acquired by a Google-sized company
•
Mission: Enable a single person to create, evolve, and
operate the next-generation IT service
–
“The Fortune 1 Million” by enabling rapid innovation
•
Approach: Create core technology spanning systems,
networking, and machine learning
•
Focus: Making datacenter easier to manage to enable
one person to Analyze, Deploy, Operate a scalable IT
service
3
Jan 07 Announcements by
Microsoft and Google
•
Microsoft and Google race to build next-gen DCs
–
Microsoft announces a $550 million DC in TX
–
Google confirms plans for a $600 million site in NC
–
Google: two more DCs in SC; may cost another $950
million, about 150,000 computers each
•
Internet DCs are the next computing platform
•
Power availability drives deployment decisions
4
Datacenter is the Computer
•
Google program == Web search, Gmail,…
•
Google computer ==
Warehouse-sized
facilities and workloads
likely more common
Luiz Barroso’s talk at RAD Lab 12/11/06
Sun Project Blackbox
10/17/06
Compose datacenter from 20 ft. containers!
–
Power/cooling for 200 KW
–
External taps for electricity,
network, cold water
–
250 Servers, 7 TB DRAM,
or 1.5 PB disk in 2006
–
20% energy savings
–
1/10th? cost of a building
5
See web2.wsj2.com/ruby_on_rails_11_web_20_on_rocket_fuel.htm
Datacenter Programming
System
•
Ruby on Rails: open source Web framework
optimized for programmer happiness and
sustainable productivity:
–
Convention over configuration
–
Scaffolding: automatic, Web-based, UI to stored data
–
Program the client: write browser-side code in Ruby, compile to
JavaScript
–
“Duck Typing/Mix-Ins”
•
Proven Expressiveness
–
Lines of code Java vs. RoR: 3:1
–
Lines of configuration Java vs. RoR: 10:1
•
More than a fad
–
Java on Rails, Python on Rails, …
6
Datacenter Synthesis + OS
•
Synthesis: change DC via written specification
–
DC Spec Language compiled to logical configuration
•
OS: allocate, monitor, adjust during operation
–
Director using machine learning, Drivers send commands
[Figure: Synth and OS components]
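The split between synthesis and operation can be sketched as follows. This is a minimal illustration only: `TierSpec`, `compile_spec`, and `plan_commands` are hypothetical names, the "spec language" is reduced to a list of tiers, and a plain set difference stands in for the machine-learning Director described on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSpec:
    """One line of a hypothetical written DC spec: a tier and its size."""
    name: str
    replicas: int


def compile_spec(tiers):
    """Compile the spec into a logical configuration: one slot per replica."""
    return {f"{t.name}-{i}": t.name for t in tiers for i in range(t.replicas)}


def plan_commands(current, desired):
    """Commands a driver layer would send to move `current` to `desired`."""
    start = sorted(set(desired) - set(current))
    stop = sorted(set(current) - set(desired))
    return [("start", n) for n in start] + [("stop", n) for n in stop]


desired = compile_spec([TierSpec("web", 2), TierSpec("db", 1)])
commands = plan_commands({"web-0": "web", "cache-0": "cache"}, desired)
```

Changing the datacenter then means editing the spec and recompiling, rather than issuing commands by hand.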
7
“System” Statistical
Machine Learning
•
S²ML Strengths
–
Handle SW churn: Train vs. write the logic
–
Beyond queuing models: Learns how to handle/make
policy between steady states
–
Beyond control theory: Coping with complex cost
functions
–
Discovery: Finding trends, needles in data haystack
–
Exploit cheap processing advances: fast enough to
run online
•
S²ML as an integral component of DC OS
8
Datacenter Monitoring
•
S²ML needs data to analyze
•
DC components come with sensors already
–
CPUs (performance counters)
–
Disks (SMART interface)
•
Add sensors to software
–
Log files
–
DTrace for Solaris, Mac OS
•
Trace 10K++ nodes within and between DCs
–
*Trace: App-oriented path recording framework
–
X-Trace: Cross-layer/-domain including network layer
9
Middleboxes in Today’s DC
•
Middle boxes inserted on
physical path
–
Policy via plumbing
–
Weakest link: 1 point of
failure, bottleneck
–
Expensive to upgrade
and introduce new
functionality
•
Identity-based Routing
Layer: policy not plumbing
to route classified packets
to appropriate middlebox
services
[Figure: high-speed network with load balancer, intrusion detector, and firewall on the physical path]
10
First Milestone:
DC Energy Conservation
•
DCs limited by power
–
For each dollar spent on servers, add $0.48 (2005)/$0.71
(2010) for power/cooling
–
$26B spent to power and cool servers in 2005 grows to
$45B in 2010
•
Attractive application of S²ML
–
Bringing processor resources on/off-line: Dynamic
environment, complex cost function, measurement-driven
decisions
•
Preserve 100% Service Level Agreements
•
Don’t hurt hardware reliability
•
Then conserve energy
•
Conserve energy and improve reliability
–
MTTF: stress of on/off cycle vs. benefits of off-hours
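The priority ordering above (SLA first, then hardware reliability, then energy) can be sketched as a simple selection rule. This is an assumption-laden illustration, not the RAD Lab mechanism: `needed_for_sla` would come from a learned model in the approach described here but is a plain input below, and `max_cycles_per_day` is a hypothetical stand-in for the MTTF concern about on/off stress.

```python
def nodes_to_power_down(active, needed_for_sla, cycles_today, max_cycles_per_day):
    """Pick nodes to switch off, in priority order:

    1. keep enough nodes on to meet the SLA,
    2. skip nodes that have already been power-cycled too often,
    3. among the rest, switch off as many as the surplus allows.
    """
    surplus = len(active) - needed_for_sla
    if surplus <= 0:
        return []  # SLA comes first: nothing to spare
    eligible = [n for n in active if cycles_today.get(n, 0) < max_cycles_per_day]
    # Prefer the least-cycled nodes so on/off stress is spread evenly.
    eligible.sort(key=lambda n: (cycles_today.get(n, 0), n))
    return eligible[:surplus]


active = ["n1", "n2", "n3", "n4"]
cycles = {"n1": 3, "n2": 0, "n3": 1}
chosen = nodes_to_power_down(active, 2, cycles, max_cycles_per_day=3)
```

Energy savings appear only as the last step: a node is switched off only if doing so violates neither of the first two constraints.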
11
DC Networking and Power
•
Within DC racks, network equipment often the “hottest”
components in the hot spot
•
Network opportunities for power reduction
–
Transition to higher speed interconnects (10 Gb/s) at DC scales
and densities
–
High function/high power assists embedded in network element
(e.g., TCAMs)
12
Thermal Image of Typical
Cluster Rack
[Figure labels: Rack, Switch]
M. K. Patterson, A. Pratt, P. Kumar,
“From UPS to Silicon: an end-to-end evaluation of datacenter efficiency”, Intel Corporation
13
DC Networking and Power
•
Selectively power down ports/portions of net elements
•
Enhanced power-awareness in the network stack
–
Power-aware routing and support for system virtualization
•
Support for datacenter “slice” power down and restart
–
Application and power-aware media access/control
•
Dynamic selection of full/half duplex
•
Directional asymmetry to save power,
e.g., 10Gb/s send, 100Mb/s receive
–
Power-awareness in applications and protocols
•
Hard state (proxying), soft state (caching),
protocol/data “streamlining” for power as well as b/w reduction
•
Power implications for topology design
–
Tradeoffs in redundancy/high-availability vs. power consumption
–
VLANs support for power-aware system virtualization
14
Why University Research?
•
Imperative that future technical leaders learn to deal
with scale in modern computing systems
•
Draw on talented but inexperienced people
–
Pick from worldwide talent pool for students & faculty
–
Don’t know what they can’t do
•
Inexpensive: allows focus on speculative ideas
–
Mostly grad student salaries
–
Faculty part time
•
Tech Transfer engine
–
Success = Train students to go forth and replicate
–
Promiscuous publication, including source code
–
Ideal launching point for startups
15
Why a New Funding Model?
•
DARPA is exiting long-term research in experimental
computing systems
•
NSF swamped with proposals, yielding even more
conservative decisions
•
Community emphasis on theoretical vs.
experimental-oriented systems-building research
•
Alternative: turn to Industry for funding
–
Opportunity to shape research agenda
16
New Funding Model
•
30 grad students + 5 undergrads + 6 faculty + 4 staff
•
Foundation Companies: $500K/yr for 5 years
–
Google, Microsoft, Sun Microsystems
–
Prefer founding partner technology in prototypes
–
Many from company attend retreats, advise on directions, head start
on research results
–
Putting IP in the public domain so partners can use it without being sued
•
Large Affiliates $100K/yr: Fujitsu, HP, IBM, Siemens
•
Small Affiliates $50K/yr: Nortel, Oracle
•
State matching programs add $1M/year: MICRO, Discovery
17
Summary
•
“DC is the Computer”
–
OS: ML+VM, Net: Identity-based Routing, FS: Web
Storage
–
Prog Sys: RoR, Libraries: Web Services
–
Development Environment: RAMP (simulator), AWE
(tester), Web 2.0 apps (benchmarks)
–
Debugging Environment: *Trace + X-Trace
•
Milestones
–
DC Energy Conservation + Reliability Enhancement
–
Web 2.0 Apps in RoR
18
Conclusions
•
Develop-Analyze-Deploy-Operate modern systems at
Internet scale
–
Ruby-on-Rails for rapid applications development
–
Declarative datacenter for correct-by-construction system
configuration and operation
–
Resource management by System Statistical Machine Learning
–
Virtual Machines and Network Storage for flexible resource
allocation
–
Power reduction and reliability enhancement by fast power-
down/restart for processing nodes
–
Pervasive monitoring, tracing, simulation, workload generation for
runtime analysis/operation
19
Discussion Points
•
Jointly designed datacenter testbed
–
Mini-DC consisting of clusters, middleboxes, and
network equipment
–
Representative network topology
•
Power-aware networking
–
Evaluation of existing network elements
–
Platform for investigating power reduction schemes in
network elements
•
Mutual information exchange
–
Network storage architecture
–
System Statistical Machine Learning
20
Ruby on Rails = DC PL
•
Reasons to love Ruby on Rails
1. Convention over Configuration
•
Rails framework feature enabled by Ruby language
feature (Meta Object Programming)
2. Scaffolding: automatic, Web based, (pedestrian)
User Interface to stored data
3. Program the client: in v1.1, write browser-side code
in Ruby, then compile to JavaScript
4. “Duck Typing/Mix-Ins”
•
Looks like string, responds like string, it’s a string!
•
Mix-in improvement over multiple inheritance
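Duck typing and mix-ins can be illustrated with a short sketch. It is written in Python rather than Ruby, so it is an analogy and not Rails code: Python mix-ins are ordinary classes combined through inheritance, whereas Ruby mix-ins are modules. All class and function names are hypothetical.

```python
class ShoutMixin:
    """Mix-in: adds behavior to any class that can be converted with str()."""

    def shout(self):
        return str(self).upper() + "!"


class HostName(ShoutMixin):
    """Not a str subclass, but it responds to the string methods used below."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name

    def upper(self):
        return self.name.upper()


def banner(text):
    # Duck typing: no type check; anything that responds to .upper() works.
    return "== " + text.upper() + " =="


plain = banner("rack-7")
ducky = banner(HostName("rack-7"))
shouted = HostName("rack-7").shout()
```

`banner` never asks whether its argument is a string; it only relies on the object responding like one.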
21
DC Monitoring
•
Imagine a world where path information is always
passed along so that user requests can always be
tracked throughout the system
•
Across apps, OS, network components and
layers, different computers on LAN, …
–
Unique request ID
–
Components touched
–
Time of day
–
Parent of this request
22
*Trace: The 1% Solution
•
*Trace Goal: Make Path Based Analysis have low
overhead so it can be always on inside datacenter
–
“Baseline” path info collection with ≤ 1% overhead
–
Selectively add more local detail for specific requests
•
*Trace: an end-to-end path recording framework
–
Capture & timestamp a unique requestID across all system
components
–
“Top level” log contains path traces
–
Local logs contain additional detail,
correlated to path ID
–
Built on X-Trace
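The two-level logging idea can be sketched as below: a cheap top-level entry is always written, and local detail is kept only for requests that were selected for it. This is a simplified illustration with hypothetical names; it does not measure overhead and makes no claim about how *Trace itself is implemented.

```python
from collections import defaultdict

top_level_log = []              # one cheap entry per component touched
local_logs = defaultdict(list)  # per-component detail, keyed by component
detailed_requests = set()       # request IDs selected for extra detail


def trace(request_id, component, timestamp, detail=None):
    """Always write the baseline path entry; write detail only when selected."""
    top_level_log.append((request_id, component, timestamp))
    if detail is not None and request_id in detailed_requests:
        local_logs[component].append((request_id, detail))


def detail_for(request_id):
    """Correlate local logs back to one path by its request ID."""
    return {
        component: [d for rid, d in entries if rid == request_id]
        for component, entries in local_logs.items()
        if any(rid == request_id for rid, _ in entries)
    }


detailed_requests.add("req-2")
trace("req-1", "web", 0.00, detail={"url": "/home"})
trace("req-2", "web", 0.01, detail={"url": "/search"})
trace("req-2", "db", 0.02, detail={"rows": 40})
```

Every request shows up in the top-level log, but only `req-2` pays the cost of local detail.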
23
X-Trace: comprehensive tracing
through Layers, Networks, Apps
•
Trace connectivity of distributed
components
–
Capture causal connections
between requests/responses
•
Cross-layer
–
Include network and middleware
services such as IP and LDAP
•
Cross-domain
–
Multiple datacenters, composed
services, overlays, mash-ups
–
Control to individual
administrative domains
•
“Network path” sensor
–
Put individual
requests/responses, at
different network layers, in the
context of an end-to-end
request
24
Actuator:
Policy-based Routing Layer
•
Assign ID to incoming packets (hash + table lookup)
•
Route based on IDs, not locations (i.e., not IP addr)
–
Sets up logical paths without changing network topology
•
Set of common middle boxes get single ID
–
No single weakest link: robust, scalable throughput
[Figure: Identity-based Routing Layer connecting Firewall (ID_F), Load-Balancer (ID_LB), Intrusion-Detection (ID_ID), and Service (ID_S); packets carry ID pairs such as (ID_F, ID_LB) and (ID_ID, ID_S)]
•
So simple
can be done
in FPGA?
•
More general
than MPLS
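Routing on IDs rather than locations can be sketched as follows. This is one interpretation of "hash + table lookup", under stated assumptions: the policy table, the traffic classes, the instance names, and the use of CRC32 as the flow hash are all illustrative choices, not details from the slides.

```python
import zlib

# Classification table: traffic class -> ordered middlebox service IDs.
# The classes and chains here are illustrative only.
POLICY = {
    "web": ["ID_F", "ID_LB", "ID_S"],
    "admin": ["ID_F", "ID_ID", "ID_S"],
}

# Several interchangeable instances share one service ID.
INSTANCES = {
    "ID_F": ["fw-1", "fw-2"],
    "ID_LB": ["lb-1"],
    "ID_ID": ["ids-1", "ids-2", "ids-3"],
    "ID_S": ["svc-1", "svc-2"],
}


def classify(packet):
    """Table lookup on a packet field (here: destination port)."""
    return "admin" if packet["dst_port"] == 22 else "web"


def flow_hash(packet):
    """Stable hash of the flow so one flow always picks the same instances."""
    key = f'{packet["src"]}:{packet["src_port"]}->{packet["dst_port"]}'
    return zlib.crc32(key.encode())


def route(packet):
    """Return the concrete instances a packet visits, chosen by ID, not by IP."""
    h = flow_hash(packet)
    hops = []
    for service_id in POLICY[classify(packet)]:
        pool = INSTANCES[service_id]
        hops.append(pool[h % len(pool)])
    return hops


web_pkt = {"src": "10.0.0.5", "src_port": 4000, "dst_port": 80}
admin_pkt = {"src": "10.0.0.5", "src_port": 4001, "dst_port": 22}
web_hops = route(web_pkt)
admin_hops = route(admin_pkt)
```

Because each service ID names a pool rather than a box, adding an instance to `INSTANCES` adds capacity without touching the policy or the physical topology.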
25
Other RAD Lab Projects
•
Research Accelerator for MP (RAMP)
= DC simulator
•
Automatic Workload Evaluator (AWE)
= DC tester
•
Web Storage (GFS, Bigtable, Amazon S3)
= DC File System
•
Web Services (MapReduce, Chubby)
= DC Libraries