

Strata



Effective Multi-Tenant Distributed Systems
Challenges and Solutions when Running Complex Environments
Chad Carson and Sean Suchter


Effective Multi-Tenant Distributed Systems
by Chad Carson and Sean Suchter
Copyright © 2017 Pepperdata, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (). For more information,
contact our corporate/institutional sales department: 800-998-9938 or
Editors: Nicole Taché and Debbie Hardin
Production Editor: Nicholas Adams
Copyeditor: Octal Publishing Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
October 2016: First Edition
Revision History for the First Edition
2016-10-10: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Effective Multi-Tenant
Distributed Systems, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all
responsibility for errors or omissions, including without limitation responsibility for damages
resulting from the use of or reliance on this work. Use of the information and instructions contained in
this work is at your own risk. If any code samples or other technology this work contains or describes
is subject to open source licenses or the intellectual property rights of others, it is your responsibility
to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-96183-4
[LSI]


Chapter 1. Introduction to Multi-Tenant
Distributed Systems
The Benefits of Distributed Systems
The past few decades have seen an explosion of computing power. Search engines, social networks,
cloud-based storage and computing, and similar services now make seemingly infinite amounts of
information and computation available to users across the globe.
The tremendous scale of these services would not be possible without distributed systems.
Distributed systems make it possible for many hundreds or thousands of relatively inexpensive
computers to communicate with one another and work together, creating the outward appearance of a
single, high-powered computer. The primary benefit of a distributed system is clear: the ability to
massively scale computing power relatively inexpensively, enabling organizations to scale up their
businesses to a global level in a way that was not possible even a decade ago.

Performance Problems in Distributed Systems
As more and more nodes are added to the distributed system and interact with one another, and as
more and more developers write and run applications on the system, complications arise. Operators
of distributed systems must address an array of challenges that affect the performance of the system as
a whole as well as individual applications’ performance.
These performance challenges are different from those faced when operating a data center of
computers that are running more or less independently, such as a web server farm. In a true
distributed system, applications are split into smaller units of work, which are spread across many nodes and communicate with one another either directly or via shared input/output data.
Additional performance challenges arise with multi-tenant distributed systems, in which different
users, groups, and possibly business units run different applications on the same cluster. (This is in
contrast to a single, large distributed application, such as a search engine, which is quite complex and
has intertask dependencies but is still just one overall application.) These challenges that come with
multitenancy result from the diversity of applications running together on any node as well as the fact
that the applications are written by many different developers instead of one engineering team focused
on ensuring that everything in a single distributed application works well together.

Scheduling
One of the primary challenges in a distributed system is in scheduling jobs and their component
processes. Computing power might be quite large, but it is always finite, and the distributed system must decide which jobs should be scheduled to run where and when, and the relative priority of those
jobs. Even sophisticated distributed-system schedulers have limitations that can lead to
underutilization of cluster hardware, unpredictable job run times, or both. Examples include assuming
the worst-case resource usage to avoid overcommitting, failing to plan for different resource types
across different applications, and overlooking one or more dependencies, thus causing deadlock or
starvation.
The scheduling challenges become more severe on multi-tenant clusters, which add fairness of
resource access among users as a scheduling goal, in addition to (and often in conflict with) the goals
of high overall hardware utilization and predictable run times for high-priority applications. Aside
from the challenge of balancing utilization and fairness, in some extreme cases the scheduler might go
too far in trying to ensure fairness, scheduling just a few tasks from many jobs for many users at once.
This can result in latency for every job on the cluster and cause the cluster to use resources
inefficiently because the system is trying to do too many disparate things at the same time.

Hardware Bottlenecks
Beyond scheduling challenges, there are many ways a distributed system can suffer from hardware bottlenecks and other inefficiencies. For example, a single job can saturate the network or disk I/O,
slowing down every other job. These potential problems are only exacerbated in a multi-tenant
environment—usage of a given hardware resource such as CPU or disk is often less efficient when a
node has many different processes running on it. In addition, operators cannot tune the cluster for a
particular access pattern, because the access patterns are both diverse and constantly changing.
(Again, contrast this situation with a farm of servers, each of which is independently running a single
application, or a large cluster running a single coherently designed and tuned application like a search
engine.)
Distributed systems are also subject to performance problems due to bottlenecks from centralized
services used by every node in the system. One common example is the master node performing job
admission and scheduling; others include the master node for a distributed file system storing data for
the cluster as well as common services like domain name system (DNS) servers.
These potential performance challenges are exacerbated by the fact that a primary design goal for
many modern distributed systems is to enable large numbers of developers, data scientists, and
analysts to use the system simultaneously. This is in stark contrast to earlier distributed systems such
as high-performance computing (HPC) systems in which the only people who could write programs to
run on the cluster had a systems programming background. Today, distributed systems are opening up
enormous computing power to people without a systems background, so they often don’t understand
or even think about system performance. Such a user might easily write a job that accidentally brings
a cluster to its knees, affecting every other job and user.

Lack of Visibility Within Multi-Tenant Distributed Systems


Because multi-tenant distributed systems simultaneously run many applications, each with different
performance characteristics and written by different developers, it can be difficult to determine
what’s going on with the system, whether (and why) there’s a problem, which users and applications
are the cause of any problem, and what to do about such problems.
Traditional cluster monitoring systems are generally limited to tracking metrics at the node level; they
lack visibility into detailed hardware usage by each process. Major blind spots can result—when there's a performance problem, operators are unable to pinpoint exactly which application caused it,
or what to do about it. Similarly, application-level monitoring systems tend to focus on overall
application semantics (overall run times, data volumes, etc.) and do not drill down to performance-level metrics for actual hardware resources on each node that is running a part of the application.
Truly useful monitoring for multi-tenant distributed systems must track hardware usage metrics at a
sufficient level of granularity for each interesting process on each node. Gathering, processing, and
presenting this data for large clusters is a significant challenge, in terms of both systems engineering
(to process and store the data efficiently and in a scalable fashion) and the presentation-level logic
and math (to present it usefully and accurately). Even for limited, node-level metrics, traditional
monitoring systems do not scale well on large clusters of hundreds to thousands of nodes.
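To make the needed granularity concrete, the following sketch (assuming the open source psutil library, which is not discussed elsewhere in this book) samples per-process CPU, memory, and disk I/O on a single Linux node; a production monitoring system would ship these samples to a scalable time-series store rather than printing them.

import time
import psutil

def sample_node():
    """Collect per-process CPU, memory, and I/O usage on one node (Linux)."""
    samples = []
    for proc in psutil.process_iter(["pid", "name", "username"]):
        try:
            cpu_pct = proc.cpu_percent(interval=None)   # percent since last call
            mem_rss = proc.memory_info().rss            # resident set size, bytes
            io = proc.io_counters()                     # cumulative read/write bytes
            samples.append({
                "pid": proc.info["pid"],
                "name": proc.info["name"],
                "user": proc.info["username"],
                "cpu_percent": cpu_pct,
                "rss_bytes": mem_rss,
                "read_bytes": io.read_bytes,
                "write_bytes": io.write_bytes,
            })
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # processes come and go; skip any we cannot read
    return samples

if __name__ == "__main__":
    while True:
        metrics = sample_node()
        # A real system would send these samples to a time-series store.
        print(f"{time.time():.0f} collected {len(metrics)} process samples")
        time.sleep(5)

Even this toy collector hints at the scale problem: thousands of processes per node, sampled every few seconds across thousands of nodes, quickly becomes a large data pipeline in its own right.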

The Impact on Business from Performance Problems
The performance challenges described in this book can easily lead to business impacts such as the
following:
Inconsistent, unpredictable application run times
Batch jobs might run late, interactive applications might respond slowly, and the ingestion and
processing of new incoming data for use by other applications might be delayed.
Underutilized hardware
Job queues can appear full even when the cluster hardware is not running at full capacity. This
inefficiency can result in higher capital and operating expenses; it can also result in significant
delays for new projects due to insufficient hardware, or even the need to build out new datacenter space to add new machines for additional processing power.
Cluster instability
In extreme cases, nodes can become unresponsive or a distributed file system (DFS) might
become overloaded, so applications cannot run or are significantly delayed in accessing data.
Aside from these obvious effects, performance problems also cause businesses to suffer in subtler but
ultimately more significant ways. Organizations might informally “learn” that a multi-tenant cluster is
unpredictable and build implicit or explicit processes to work around the unpredictability, such as the
following:
Limit cluster access to a subset of developers or analysts, out of a concern that poorly written jobs will slow down or even crash the cluster for everyone.
Build separate clusters for different groups or different workloads so that the most important
applications are insulated from others. Doing so increases overall cost due to inefficiency in
resource usage, adds operational overhead and cost, and reduces the ability to share data across
groups.
Set up “development” and “production” clusters, with a committee or other cumbersome process
to approve jobs before they can be run on a production cluster. Adding these hurdles can
dramatically hinder innovation, because they significantly slow the feedback loop of learning from
production data, building and testing a new model or new feature, deploying it to production, and
learning again.1
These responses to unpredictable performance can limit a business’s ability to fully benefit from the
potential of distributed systems. Eliminating performance problems on the cluster can improve
performance of the business overall.

Scope of This Book
In this book, we consider the performance challenges that arise from scheduling inefficiencies,
hardware bottlenecks, and lack of visibility. We examine each problem in detail and present solutions
that organizations use today to overcome these challenges and benefit from the tremendous scale and
efficiency of distributed systems.

Hadoop: An Example Distributed System
This book uses Hadoop as an example of a multi-tenant distributed system. Hadoop serves as an ideal
example of such a system because of its broad adoption across a variety of industries, from healthcare
to finance to transportation. Due to its open source availability and a robust ecosystem of supporting
applications, Hadoop’s adoption is increasing among small and large organizations alike.
Hadoop is also an ideal example because it is used in highly multi-tenant production deployments
(running jobs from many hundreds of developers) and is often used to simultaneously run large batch
jobs, real-time stream processing, interactive analysis, and customer-facing databases. As a result, it
suffers from all of the performance challenges described herein.
Of course, Hadoop is not the only important distributed system; a few other examples include the following:2
Classic HPC clusters using MPI, TORQUE, and Moab
Distributed databases such as Oracle RAC, Teradata, Cassandra, and MongoDB
Render farms used for animation
Simulation systems used for physics and manufacturing


Terminology
Throughout the book, we use the following sets of terms interchangeably:
Application or job
A program submitted by a particular user to be run on a distributed system. (In some systems, this
might be termed a query.)
Container or task
An atomic unit of work that is part of a job. This work is done on a single node, generally running
as a single (sometimes multithreaded) process on the node.
Host, machine, or node
A single computing node, which can be an actual physical computer or a virtual machine.
1 We saw an example of the benefits of having an extremely short feedback loop at Yahoo in 2006–2007, when the sponsored search R&D team was an early user of the very first production Hadoop cluster anywhere. By moving to Hadoop and being able to deploy new click prediction models directly into production, we increased the number of simultaneous experiments by five times or more and reduced the feedback loop time by a similar factor. As a result, our models could improve an order of magnitude faster, and the revenue gains from those improvements similarly compounded that much faster.
2 Various distributed systems are designed to make different tradeoffs among Consistency, Availability, and Partition tolerance. For more information, see Gilbert, Seth, and Nancy Ann Lynch. "Perspectives on the CAP Theorem." Institute of Electrical and Electronics Engineers, 2012.

Chapter 2. Scheduling in Distributed
Systems
Introduction
In distributed computing, a scheduler is responsible for managing incoming container requests and
determining which containers to run next, on which node to run them, and how many containers to run
in parallel on the node. (Container is a general term for individual parts of a job; some systems use
other terms such as task to refer to a container.) Schedulers range in complexity, with the simplest
having a straightforward first-in–first-out (FIFO) policy. Different schedulers place more or less
importance on various (often conflicting) goals, such as the following:
Utilizing cluster resources as fully as possible
Giving each user and group fair access to the cluster
Ensuring that high-priority or latency-sensitive jobs complete on time
Multi-tenant distributed systems generally prioritize fairness among users and groups over optimal
packing and maximal resource usage; without fairness, users would be likely to maximize their own
access to the cluster without regard to others’ needs. Also, different groups and business units would
be inclined to run their own smaller, less efficient cluster to ensure access for their users.
In the context of Hadoop, one of two schedulers is most commonly used: the capacity scheduler and
the fair scheduler. Historically, each scheduler was written as an extension of the simple FIFO
scheduler, and initially each had a different goal, as their names indicate. Over time, the two
schedulers have experienced convergent evolution, with each incorporating improvements from the
other; today, they are mostly different in details. Both schedulers have the concept of multiple queues
of jobs to be scheduled, with admission to each queue determined based on user- or operator-specified policies.
Recent versions of Hadoop1 perform two-level scheduling, in which a centralized scheduler running
on the ResourceManager node assigns cluster resources (containers) to each application, and an
ApplicationMaster running in one of those containers uses the other containers to run individual tasks
for the application. The ApplicationMaster manages the details of the application, including
communication and coordination among tasks. This architecture is much more scalable than Hadoop's original one-level scheduling, in which a single central node (the JobTracker) did the work of both the ResourceManager and every ApplicationMaster.
Many other modern distributed systems like Dryad and Mesos have schedulers that are similar to
Hadoop’s schedulers. For example, Mesos also supports a pluggable scheduler interface much like
Hadoop,2 and it performs two-level scheduling,3 with a central scheduler that registers available resources and assigns them to applications ("frameworks").

Dominant Resource Fairness Scheduling
Historically, most schedulers considered only a single type of hardware resource when deciding
which container to schedule next—both in calculating the free resources on each node and in
calculating how much a given user, group, or queue was already using (e.g., from the point of view of
fairness in usage). In the case of Hadoop, only memory usage was considered.
However, in a multi-tenant distributed system, different jobs and containers generally have widely
different hardware usage profiles—some containers require significant memory, whereas some use
CPU much more heavily (see Figure 2-1). Not considering CPU usage in scheduling meant that the
system might be significantly underutilized, and some users would end up getting more or less than
their true fair share of the cluster. A policy called Dominant Resource Fairness (DRF)4 addresses
these limitations by considering multiple resource types and expressing the usage of each resource in
a common currency (the share of the total allocation of that resource), and then scheduling based on
the resource each container is using most heavily.


Figure 2-1. Per-container physical memory usage versus CPU usage during a representative period of time on a production
cluster. Note that some jobs consume large amounts of memory while using relatively little CPU; others use significant CPU but
relatively little memory.
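As a rough illustration of the DRF calculation itself, the following sketch uses made-up cluster totals and per-task demands; real schedulers layer queues, priorities, and placement constraints on top of this basic bookkeeping.

# Minimal sketch of Dominant Resource Fairness (DRF): each user's usage of
# every resource is expressed as a share of the cluster total, and the
# scheduler next serves the user whose largest (dominant) share is smallest.
CLUSTER = {"memory_gb": 1000, "cpu_cores": 400}

users = {
    # illustrative per-task demand for each user's containers
    "analytics": {"memory_gb": 8, "cpu_cores": 1},   # memory-heavy
    "modeling":  {"memory_gb": 2, "cpu_cores": 4},   # CPU-heavy
}
usage = {u: {"memory_gb": 0, "cpu_cores": 0} for u in users}

def dominant_share(user):
    return max(usage[user][r] / CLUSTER[r] for r in CLUSTER)

def schedule_next():
    """Pick the user with the lowest dominant share and grant one task."""
    user = min(users, key=dominant_share)
    demand = users[user]
    for r in CLUSTER:
        usage[user][r] += demand[r]
    return user

for _ in range(10):
    print(schedule_next(), {u: round(dominant_share(u), 3) for u in usage})

Running the loop shows the two users alternating in a way that keeps their dominant shares (memory for one, CPU for the other) roughly equal, which is exactly the fairness property DRF aims for.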

In Hadoop, operators can configure both the Fair Scheduler and the Capacity Scheduler to consider both memory and CPU (using the DRF framework) when deciding which container to launch next on a given node.5

Aggressive Scheduling for Busy Queues
Often a multi-tenant cluster might be in a state where some but not all queues are full; that is, some
tenants currently don’t have enough work to use their full share of the cluster, but others have more
work than they are guaranteed based on the scheduler’s configured allocation. In such cases, the
scheduler might launch more containers from the busy queues to keep the cluster fully utilized.


Sometimes, after those extra containers are launched, new jobs are submitted to a queue that was
previously empty; based on the scheduler’s policy, containers from those jobs should be scheduled
immediately, but because the scheduler has already opportunistically launched extra containers from
other queues, the cluster is full. In those cases, the scheduler might preempt those extra containers by
killing some of them in order to reflect the desired fairness policy (see Figure 2-2). Preemption is a
common feature in schedulers for multi-tenant distributed systems, including both popular Hadoop
schedulers (capacity and fair).
Because preemption inherently results in lost work, it’s important for the scheduler to strike a good
balance between starting many opportunistic containers to make use of idle resources and avoiding
too much preemption and the waste that it causes. To help reduce the negative impacts of preemption,
the scheduler can slightly delay killing containers (to avoid wasting the work of containers that are
almost complete) and generally chooses to kill containers that have recently launched (again, to avoid
wasted work).

Figure 2-2. When new jobs arrive in Queue A, they might be scheduled if there is sufficient unused cluster capacity, allowing
Queue A to use more than its guaranteed share. If jobs later arrive in Queue B, the scheduler might then preempt some of the
Queue A jobs to provide Queue B its guaranteed share.
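The victim-selection logic described above can be sketched as follows; the ordering rule and the sample numbers are illustrative assumptions, not the exact policy of the capacity or fair scheduler.

import time

def choose_preemption_victims(containers, memory_needed_gb):
    """Pick opportunistic containers to preempt, newest first.

    Each container is a dict with 'start_time' (epoch seconds), 'memory_gb',
    and 'over_guarantee' (True if it was launched beyond its queue's
    guaranteed share).  Only over-guarantee containers are candidates, and
    the newest ones are chosen first so the least work is thrown away.
    """
    candidates = [c for c in containers if c["over_guarantee"]]
    candidates.sort(key=lambda c: c["start_time"], reverse=True)  # newest first

    victims, reclaimed = [], 0.0
    for c in candidates:
        if reclaimed >= memory_needed_gb:
            break
        victims.append(c)
        reclaimed += c["memory_gb"]
    return victims

# Example: Queue B needs 8 GB back from Queue A's opportunistic containers.
now = time.time()
queue_a = [
    {"id": 1, "start_time": now - 3500, "memory_gb": 4, "over_guarantee": True},
    {"id": 2, "start_time": now - 120,  "memory_gb": 4, "over_guarantee": True},
    {"id": 3, "start_time": now - 30,   "memory_gb": 4, "over_guarantee": True},
    {"id": 4, "start_time": now - 4000, "memory_gb": 4, "over_guarantee": False},
]
print([c["id"] for c in choose_preemption_victims(queue_a, 8)])  # -> [3, 2]

Note that the long-running container (id 1) is spared because it has the most accumulated work at stake, while the guaranteed container (id 4) is never a candidate at all.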

A related concept is used by Google’s Borg system,6 which has a concept of priorities and quotas; a
quota represents a set of hardware resource quantities (CPU, memory, disk, etc.) for a period of time,
and higher-priority quota costs more than lower-priority quota. Borg never allocates more production-priority quota than is available on a given cluster; this guarantees production jobs the
resources they need. At any given time, excess resources that are not being used by production jobs
can be used by lower-priority jobs, but those jobs can be killed if the production jobs’ usage later
increases. (This behavior is similar to another kind of distributed system, Amazon Web Services,
which has a concept of guaranteed instances and spot instances; spot instances cost much less than
guaranteed ones but are subject to being killed at any time.)


Special Scheduling Treatment for Small Jobs
Some cluster operators provide special treatment for small or fast jobs; in a sense, this is the opposite
of preemption. One example is LinkedIn’s “fast queue” for Hadoop, which is a small queue that is
used only for jobs that take less than an hour total to run and whose containers each take less than 15
minutes.7 If jobs or containers violate this limit, they are automatically killed. This feature provides
fast response for smaller jobs even when the cluster is bogged down by large batch jobs; it also
encourages developers to optimize their jobs to run faster.
The Hadoop vendor MapR provides somewhat similar functionality with its ExpressLane,8 which
schedules small jobs (as defined by having few containers, each with low memory usage and small
input data sizes) to run on the cluster even when the cluster is busy and has no additional capacity for
normal jobs. This is also an interesting example of using the input data size as a cue to the scheduler
about how fast a container is likely to be.

Workload-Specific Scheduling Considerations
Aside from the general goals of high utilization and fairness across users and queues, schedulers
might take other factors into account when deciding which containers to launch and where to run them.
For example, a key design point of Hadoop is to move computation to the data. (The goal is to not just
get the nodes to work as hard as they can, but also get them to work more efficiently.) The scheduler
tries to accomplish this goal by preferring to place a given container on one of the nodes that have the
container’s input HDFS data stored locally; if that can’t be done within a certain amount of time, it
then tries to place the container on the same rack as a node that has the HDFS data; if that also can’t
be done after waiting a certain amount of time, the container is launched on any node that has available computing resources. Although this approach increases overall system efficiency, it
complicates the scheduling problem.
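A simplified sketch of this locality preference (often called delay scheduling) follows; the wait thresholds and the dictionary-based node and container descriptions are assumptions made for illustration.

def place_container(container, free_nodes, waited_s,
                    node_local_wait_s=3.0, rack_local_wait_s=6.0):
    """Prefer node-local placement, then rack-local, then any free node."""
    for node in free_nodes:
        if node["name"] in container["data_nodes"]:
            return node                    # best: input blocks are on this node
    if waited_s < node_local_wait_s:
        return None                        # keep waiting for a node-local slot
    for node in free_nodes:
        if node["rack"] in container["data_racks"]:
            return node                    # next best: same rack as the data
    if waited_s < rack_local_wait_s:
        return None
    return free_nodes[0] if free_nodes else None  # give up on locality

container = {"data_nodes": {"n7", "n9"}, "data_racks": {"rack2"}}
free = [{"name": "n3", "rack": "rack1"}, {"name": "n12", "rack": "rack2"}]
print(place_container(container, free, waited_s=1.0))   # None: still waiting
print(place_container(container, free, waited_s=4.0))   # rack-local node n12

The waiting step is the source of the added complexity: every deferred placement decision has to be revisited later, and the scheduler must bound how long it is willing to trade idle capacity for better locality.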
An example of a different kind of placement constraint is the support for pods in Kubernetes. A pod is
a group of containers, such as Docker containers, that are scheduled at the same time on the same
node. Pods are frequently used to provide services that act as helper programs for an application.
Unlike the preference for data locality in Hadoop scheduling, the colocation and coscheduling of
containers in a pod is a hard requirement; in many cases the application simply would not work
without the auxiliary services running on the same node.
A weaker constraint than colocation is the concept of gang scheduling, in which an application
requires all of its resources to run concurrently, but they don’t need to run on the same node. An
example is a distributed database like Impala, which needs to have all of its “query fragments”
running in order to serve queries. Although some distributed systems’ schedulers support gang
scheduling natively, Hadoop doesn’t currently support gang scheduling; applications that require
concurrent containers mimic gang scheduling by keeping containers alive but idle until all of the
required containers are running. This workaround clearly wastes resources because these idle
containers hold resources and stop other containers from running. However, even when gang scheduling is done "cleanly" by the scheduler, it can lead to inefficiencies because the scheduler
needs to avoid fully loading the cluster with other containers to ensure that enough space will
eventually be available for the entire gang to be scheduled.
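In its simplest form, a gang-scheduling admission check verifies that every container in the gang can be placed before launching any of them; the greedy best-fit below is only a sketch of that all-or-nothing test, with made-up memory figures.

def place_gang(gang_demands_gb, nodes_free_gb):
    """Place all of a gang's containers at once, or none of them.

    `gang_demands_gb` is a list of per-container memory demands; the return
    value maps container index to node index, or None if the gang must wait.
    """
    free = list(nodes_free_gb)                       # work on a copy
    placement = {}
    # Placing the largest containers first makes the greedy fit less fragile.
    for idx, demand in sorted(enumerate(gang_demands_gb),
                              key=lambda kv: kv[1], reverse=True):
        node = max(range(len(free)), key=lambda n: free[n])
        if free[node] < demand:
            return None                              # whole gang must wait
        free[node] -= demand
        placement[idx] = node
    return placement

print(place_gang([4, 4, 8], [12, 8]))    # every piece fits -> a full placement
print(place_gang([4, 4, 8], [6, 6]))     # the 8 GB piece doesn't fit -> None

The second call illustrates the inefficiency mentioned above: the cluster has 12 GB free in total, but the scheduler must leave it idle (or filled only with preemptible work) until a single node can hold the largest member of the gang.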
As a side note, workflow schedulers such as Oozie are given information about the dependencies
among jobs in a complex workflow that must happen in order; the workflow scheduler then submits
the individual jobs to the distributed system on behalf of the user. A workflow scheduler can take into
account the required inputs and outputs of each stage (including inputs that depend on some off-cluster
process to write new data to the cluster), the time of day the workflow should be started, awareness
of the full directed acyclic graph (DAG) of the entire workflow, and similar constraints. Generally,
the workflow scheduler is distinct from the distributed system’s own scheduler that determines
exactly where and when containers are launched on each node, but there are cases when overall
scheduling can be much more efficient if workflow scheduling and resource scheduling are combined.9

Inefficiencies in Scheduling
Although schedulers have become more sophisticated over time, they continue to suffer from
inefficiencies related to the diversity of workloads running on multi-tenant distributed systems. These
inefficiencies arise from the need to avoid overcommitting memory when doing up-front scheduling, a
limited ability to consider all types of hardware resources, and challenges in considering the
dependencies among all jobs and containers within complicated workflows.

The Need to be Conservative with Memory
Distributed system schedulers generally make scheduling decisions based on conservative
assumptions about the hardware resources—especially memory—required by each container. These
requirements are usually declared by the job author based on the worst-case usage, not the actual
usage. This difference is critical because often different containers from the same job have different
actual resource usage, even if they are running identical code. (This happens, for example, when the
input data for one container is larger or otherwise different from the input data for other containers,
resulting in a need for more processing or more space in memory.)
If a node’s resources are fully scheduled and the node is “unlucky” in the mix of containers it’s
running, the node can be overloaded; if the resource that is overloaded is memory, the node might run
out of memory and crash or start swapping badly. In a large distributed system, some nodes are bound
to be unlucky in this way, so if the scheduler does not use conservative resource usage estimates, the
system will nearly always be in a bad state.
The need to be conservative with memory allocation means that most nodes will be underutilized
most of the time; containers generally do not often use their theoretical maximum memory, and even
when they do, it’s not for the full lifetime of the container (see Figure 2-3). (In some cases, containers
can use even more than their declared maximum. Systems can be more or less stringent about enforcing what the developer declares—some systems kill containers when they exceed their maximum memory, but others do not.10)


Figure 2-3. Actual physical memory usage compared to the container size (the theoretical maximum) for a typical container.
Note that the actual usage changes over time and is much smaller than the reserved amount.

To reduce the waste associated with this underutilization, operators of large multi-tenant distributed
systems often must perform a balancing act, trying to increase cluster utilization without pushing
nodes over the edge. As described in Chapter 4, software like Pepperdata provides a way to increase
utilization for distributed systems such as Hadoop by monitoring actual physical memory usage and
dynamically allowing more or fewer processes to be scheduled on a given node, based on the current
and projected future memory usage on that node.
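The underlying idea (basing admission on measured usage plus projected growth rather than on declared maximums) can be sketched as follows; the growth factor and safety margin are illustrative assumptions, not a description of any particular product.

def schedulable_memory_gb(node_total_gb, containers, safety_margin_gb=4.0):
    """Estimate how much memory can still be scheduled on a node.

    A static scheduler subtracts each container's *declared* maximum; a
    dynamic approach subtracts measured usage plus a projection of near-term
    growth, which is usually much smaller.
    """
    static_free = node_total_gb - sum(c["declared_gb"] for c in containers)

    projected_used = sum(
        # assume each container may grow somewhat beyond its recent peak,
        # but never beyond its declared maximum
        min(c["declared_gb"], c["recent_peak_gb"] * 1.2)
        for c in containers
    )
    dynamic_free = node_total_gb - projected_used - safety_margin_gb
    return max(static_free, 0.0), max(dynamic_free, 0.0)

containers = [
    {"declared_gb": 8, "recent_peak_gb": 2.5},
    {"declared_gb": 8, "recent_peak_gb": 3.0},
    {"declared_gb": 4, "recent_peak_gb": 1.0},
]
print(schedulable_memory_gb(48, containers))
# Static accounting sees 28 GB still schedulable; usage-based accounting
# sees roughly 36 GB on the same node.

The gap between the two numbers is exactly the headroom that conservative, declaration-based scheduling leaves on the table; the risk, of course, is that a sudden spike in actual usage must be handled before the node runs out of physical memory.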

Inability to Effectively Schedule the Use of Other Resources
Similar inefficiencies can occur due to the natural variation over time in the resource usage for a
single container, not just variation across containers. For a given container, memory usage tends to
vary by a factor of two or three over the lifetime of the container, and the variation is generally
smooth. CPU usage varies quite a bit over time, but the maximum usage is generally limited to a
single core. In contrast, disk I/O and network usage frequently vary by orders of magnitude, and they spike very quickly. They are also effectively unlimited in how much of the corresponding resource
they use: one single thread on one machine can easily saturate the full network bandwidth of the node
or use up all available disk I/O operations per second (IOPS) and bandwidth from dozens of disks
(including even disks on multiple machines, when the thread is requesting data stored on another
node). See Figure 2-4 for the usage of various resources for a sample job. The left column shows
overall usage for all map tasks (red, starting earlier) and reduce tasks (green, starting later). The right
column shows a breakdown by individual task. (For this particular job, there is only one reduce task.)
Because CPU, disk, and network usage can change so quickly, it is impossible for any system that
only does up-front scheduling to optimize cluster utilization and provide true fairness in the use of
hardware resources.




Figure 2-4. The variation over time in usage of different hardware resources for a typical MapReduce job. (source:
Pepperdata)

Some schedulers (such as those in Hadoop) characterize computing nodes in a fairly basic way,
allocating containers to a machine based on its total RAM and the number of cores. A more powerful
scheduler would be aware of different hardware profiles (such as CPU speed and the number and
type of hard drives) and match the workload to the right machine. (A somewhat related approach is
budget-driven scheduling for heterogeneous clusters, where each node type might have both different
hardware profiles and different costs.11) Similarly, although modern schedulers use DRF to help
ensure fairness across jobs that have different resource usage characteristics, DRF does not optimize
efficiency; an improved scheduler could use the cluster as a whole more efficiently by ensuring that
each node has a mix of different types of workloads, such as CPU-heavy workloads running alongside
data-intensive workloads that use much more disk I/O and memory. (This multidimensional packing
problem is NP-hard,12 but simple heuristics could help performance significantly.)

Deadlock and Starvation
In some cases, schedulers might choose to start some containers in a job’s DAG even before the
preceding containers (the dependencies) have completed. This is done to reduce the total run time of
the job or spread out resource usage over time.

NOTE
In the interest of concreteness, the discussion in this section uses map and reduce containers, but similar effects can happen
any time a job has some containers that depend on the output of others; the problems are not specific to MapReduce or
Hadoop.

An example is Hadoop’s “slow start” feature, in which reduce containers might be launched before
all of the map containers they depend on have completed. This behavior can help minimize spikes in
network bandwidth usage by spreading out the heavy network traffic of transferring data from mappers to reducers. However, starting a reduce container too early means that it might end up just
sitting on a node waiting for its input data (from map containers) to be generated, which means that
other containers are not able to use the memory the reduce container is holding, thus affecting overall
system utilization.13
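The trade-off reduces to a threshold on how much map output must exist before reducers launch, as in the sketch below; the threshold value is an illustrative assumption, although Hadoop exposes a similar operator-tunable "slow start" setting.

def should_launch_reducers(maps_completed, maps_total,
                           slowstart_fraction=0.8):
    """Launch reducers only after most map output already exists.

    A low fraction spreads shuffle traffic out over time but risks reducers
    sitting idle (holding memory) while waiting for map output; a high
    fraction avoids that idle waste at the cost of a burst of network
    traffic near the end of the map phase.
    """
    return maps_completed / maps_total >= slowstart_fraction

print(should_launch_reducers(10, 100))   # False: too early, reducers would idle
print(should_launch_reducers(85, 100))   # True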
This problem is especially common on very busy clusters with many tenants because often not all map
containers from a job can be scheduled in quick succession; similarly, if a map container fails (for
example, due to node failure), it might take a long time to get rescheduled, especially if other, higher-priority jobs have been submitted after the reducers from this job were scheduled. In extreme cases
this can lead to deadlock, when the cluster is occupied by reduce containers that are unable to
proceed because the containers they depend on cannot be scheduled.14 Even if deadlock does not
occur, the cluster can still be utilized inefficiently, and overall job completion can be unnecessarily slow as measured by wall-clock time, if the scheduler launches just a small number of containers
from each of many users at one time.
A similar scheduling problem is starvation, which can occur on a heavily loaded cluster. For
example, consider a case in which one job has containers that each need a larger amount of memory
than containers from other jobs. When one of the small containers completes on a node, a naive
scheduler will see that the node has a small amount of memory available, but because it can’t fit one
of the large containers there, it will schedule a small container to run. In the extreme case, the larger
containers might never be scheduled. In Hadoop and other systems, the concept of a reservation
allows an application to reserve available space on a node, even if the application can’t immediately
use it.15 (This behavior can help avoid starvation, but it also means that the overall utilization of the
system is lower, because some amount of resources might be reserved but unused at any particular
time.)
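A minimal sketch of the reservation mechanism follows; real implementations track reservations per node and expire them, but even this simplified version shows how a large container eventually gets placed instead of starving.

def try_schedule(node, pending, reservations):
    """Offer `node`'s free space to the queue head, reserving it if it can't fit.

    `node` is {'name': str, 'free_gb': float}; `pending` is a priority-ordered
    list of container requests; `reservations` maps a node name to the request
    currently holding that node's free space.
    """
    reserved_req = reservations.get(node["name"])
    if reserved_req is not None:
        if node["free_gb"] >= reserved_req["memory_gb"]:
            del reservations[node["name"]]
            pending.remove(reserved_req)
            return reserved_req           # the large container finally runs
        return None                       # keep holding this node's free space

    if not pending:
        return None
    head = pending[0]
    if head["memory_gb"] <= node["free_gb"]:
        return pending.pop(0)             # normal placement
    # The head request doesn't fit.  A naive scheduler would skip it and run a
    # smaller request, possibly forever; reserving the node avoids starvation
    # at the cost of leaving some memory idle until enough frees up.
    reservations[node["name"]] = head
    return None

reservations = {}
pending = [{"id": "big", "memory_gb": 16}, {"id": "small", "memory_gb": 2}]
print(try_schedule({"name": "n1", "free_gb": 4}, pending, reservations))   # None: n1 reserved
print(try_schedule({"name": "n1", "free_gb": 20}, pending, reservations))  # the 16 GB container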

Waste Due to Speculative Execution
Operators can configure Hadoop to use speculative execution, in which the scheduler can observe
that a given container seems to be running more slowly than is typical for that kind of container and
start another copy of that container on another node. This behavior is primarily intended to avoid
cases in which a particular node is performing badly (usually due to a hardware problem) and an entire job could be slowed down due to just one straggler container.16
While speculative execution can reduce job completion time due to node problems, it wastes
resources when the container that is duplicated simply had more work to do than other containers and
so naturally ran longer. In practice, experienced operators typically disable speculative execution on
multi-tenant clusters, both because there is generally inherent container variation (not due to hardware
problems) and because the operators are constantly watching for bad hardware, so speculative
execution does not enhance performance.
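The decision to launch a speculative copy amounts to comparing a task's progress rate to that of its peers; the slowness factor and minimum runtime below are illustrative assumptions rather than Hadoop's exact heuristic.

def needs_speculative_copy(task, peers, slowness_factor=1.5, min_runtime_s=60):
    """Return True if `task` is progressing much more slowly than its peers.

    Each task dict has 'progress' (0.0 to 1.0) and 'runtime_s'.  A task whose
    progress rate is well below the average rate of its peers is treated as a
    straggler candidate.
    """
    if task["runtime_s"] < min_runtime_s or task["progress"] >= 1.0:
        return False
    rates = [p["progress"] / p["runtime_s"] for p in peers if p["runtime_s"] > 0]
    if not rates:
        return False
    avg_rate = sum(rates) / len(rates)
    my_rate = task["progress"] / task["runtime_s"]
    return my_rate * slowness_factor < avg_rate

peers = [{"progress": 0.9, "runtime_s": 100}, {"progress": 0.85, "runtime_s": 110}]
print(needs_speculative_copy({"progress": 0.2, "runtime_s": 120}, peers))  # True

The weakness described above is visible in the sketch: a task that is slow simply because it has more data to process looks identical, by this measure, to a task on a failing disk, which is why experienced operators often disable the feature on multi-tenant clusters.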

Summary
Over time, distributed system schedulers have grown in sophistication from a very simple FIFO
algorithm to add the twin goals of fairness across users and increased cluster utilization. Those two
goals must be balanced against each other; on multi-tenant distributed systems, operators often
prioritize fairness. They do so to reduce the level of user-visible scheduling issues as well as to keep
multiple business units satisfied to use shared infrastructure rather than running their own separate
clusters. (In contrast, configuring the scheduler to maximize utilization could save money in the short
term but waste it in the long term, because many small clusters are less efficient than one large one.)
Schedulers have also become more sophisticated by better taking into account multiple hardware
resource requirements (for example, not considering only memory) and effectively treating different
kinds of workloads differently when scheduling decisions are made. However, they still suffer from
limitations, for example being conservative in resource allocation to avoid instability due to overcommitting resources such as memory. That conservatism can keep the cluster stable, but it
results in lower utilization and slower run times than the hardware could actually support. Software
solutions that make real-time, fine-grained decisions about resource usage can provide increased
utilization while maintaining cluster stability and providing more predictable job run times.
1 The new architecture is referred to as Yet Another Resource Negotiator (YARN) or MapReduce v2.
2 See the Apache Mesos documentation.
3 See the Apache Mesos documentation on two-level scheduling.
4 Ghodsi, Ali, et al. "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types." NSDI. Vol. 11. 2011.
5 See the Hadoop Fair Scheduler and Capacity Scheduler documentation.
6 Verma, Abhishek, et al. "Large-scale cluster management at Google with Borg." Proceedings of the Tenth European Conference on Computer Systems. ACM, 2015.
7 See slide 9 of LinkedIn's presentation on its Hadoop "fast queue."
8 See the MapR documentation for ExpressLane.
9 Nurmi, Daniel, et al. "Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction." Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. ACM, 2006.
10 For example, Google's Borg kills containers that try to exceed their declared memory limit. Hadoop by default lets containers go over, but operators can configure it to kill such containers.
11 Wang, Yang and Wei Shi. "Budget-driven scheduling algorithms for batches of MapReduce jobs in heterogeneous clouds." IEEE Transactions on Cloud Computing 2.3 (2014): 306-319.
12 See, for example, Chekuri, Chandra and Sanjeev Khanna. "On multidimensional packing problems." SIAM Journal on Computing 33.4 (2004): 837-851.
13 See the Hadoop MapReduce documentation on the reduce "slow start" setting.
14 See the Hadoop issue tracker for an example.
15 See Sulistio, Anthony, Wolfram Schiffmann, and Rajkumar Buyya. "Advanced reservation-based scheduling of task graphs on clusters." International Conference on High-Performance Computing. Springer Berlin Heidelberg, 2006. For related recent work in Hadoop, see Curino, Carlo et al. "Reservation-based Scheduling: If You're Late Don't Blame Us!" Proceedings of the ACM Symposium on Cloud Computing. ACM, 2014.
16 This is different from the standard use of the term "speculative execution," in which pipelined microprocessors sometimes execute both sides of a conditional branch before knowing which branch will be taken.


Chapter 3. CPU Performance
Considerations

Introduction
Historically, large-scale distributed systems were designed to perform massive amounts of numerical
computation, for example in scientific simulations run on high-performance computing (HPC)
platforms. In most cases, the work done on such systems was extremely compute intensive, so the
CPU was often the primary bottleneck.
Today, distributed systems tend to run applications for which the large scale is driven by the size of
the input data rather than the amount of computation needed—examples include both special-purpose
distributed systems (such as those powering web search among billions of documents) and general-purpose systems such as Hadoop. (However, even in those general systems, there are still some cases
such as iterative algorithms for machine learning where making efficient use of the CPU is critical.)
As a result, the CPU is often not the primary bottleneck limiting a distributed system; nevertheless, it
is important to be aware of the impacts of CPU on overall speed and throughput.
At a high level, the effect of CPU performance on distributed systems is driven by three primary
factors:
The efficiency of the program that’s running, at the level of the code as well as how the work is
broken into pieces and distributed across nodes.
Low-level kernel scheduling and prioritization of the computational work done by the CPU, when
the CPU is not waiting for data.
The amount of time the CPU spends waiting for data from memory, disk, or network.
These factors are important for the performance even of single applications running on a single
machine; they are just as important, and even more complicated, for multi-tenant distributed systems
due to the increased number and diversity of processes running on those systems, and their varied
input data sources.

Algorithm Efficiency
Of course, as with any program, when writing a distributed application, it is important to select a
good algorithm (such as implementing algorithms with N*log(N) complexity instead of N^2, using joins
efficiently, etc.) and to write good code; the best way to spend less time on the CPU is to avoid
computation in the first place. As with any computer program, developers can use standard
performance optimization and profiling tools (open source options include gprof, hprof, VisualVM,



and Perf4J; Dynatrace is one commercial option) to profile and optimize a single instance of a
program running on a particular machine.
For distributed systems, it can be equally important (if not more so) to break down the work into units
effectively. For example, with MapReduce programs, some arrangements of map-shuffle-reduce steps
are more efficient than others. Likewise, whether using MapReduce, Spark, or another distributed
framework, using the right level of parallelism is important. For example, because every map and
reduce task requires a nontrivial amount of setup and teardown work, running too many small tasks
can lead to grossly inefficient overhead—we’ve seen systems with thousands of map tasks that each
require several seconds for setup and teardown but spend less than one second on useful computation.
In the case of Hadoop, open source tools like Dr. Elephant1 (as well as some commercial tools)
provide performance measurement and recommendations to improve the overall flow of jobs,
identifying problems such as a suboptimal breakdown of work into individual units.
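A back-of-the-envelope model of that overhead effect, assuming a fixed per-task setup and teardown cost, shows how quickly efficiency collapses when work is split too finely; the numbers are illustrative.

def job_efficiency(total_work_s, num_tasks, per_task_overhead_s=3.0):
    """Fraction of task time spent on useful work rather than setup/teardown."""
    useful_per_task = total_work_s / num_tasks
    return useful_per_task / (useful_per_task + per_task_overhead_s)

# 1,000 seconds of real computation split into different task counts:
for n in (50, 500, 5000):
    print(n, round(job_efficiency(1000, n), 2))
# 50 tasks   -> about 0.87 of task time is useful work
# 500 tasks  -> about 0.40
# 5000 tasks -> about 0.06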

Kernel Scheduling
The operating system kernel (Linux, for example) decides which threads run where and when,
distributing a fixed amount of CPU resource across threads (and thus ultimately across applications).
Every N (~5) milliseconds, the kernel takes control of a given core and decides which thread’s
instructions will run there for the next N milliseconds. For each candidate thread, the kernel’s
scheduler must consider several factors:
Is the thread ready to do anything at all (versus waiting for I/O)?
If yes, is it ready to do something on this core?
If yes, what is its dynamic priority? This computation takes several factors into account, including
the static priority of the process, how much CPU time the thread has been allocated recently, and
other signals depending on the kernel version.
How does this thread’s dynamic priority compare to that of other threads that could be run now?
The Linux kernel exposes several control knobs to affect the static (a priori) priority of a process;
nice and control groups (cgroups) are the most commonly used. With cgroups, priorities can be set,
and scheduling affected, for a group of processes rather than a single process or thread; conceptually,
cgroups divide the access to CPU across the entire group. This division across groups of processes means that applications running many processes on a node do not receive unfair advantage over
applications with just one or a few processes.
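As a small illustration of the static-priority knob, the sketch below lowers the priority of a CPU-heavy helper process from Python using the Unix nice value; cgroup setup is usually handled by the cluster manager or the init system and is not shown here.

import os
import multiprocessing

def background_work():
    """CPU-heavy helper that should yield to latency-sensitive work."""
    os.nice(10)                     # raise niceness -> lower static priority (Unix only)
    total = 0
    for i in range(50_000_000):     # placeholder computation
        total += i * i
    return total

if __name__ == "__main__":
    # The child runs at a lower static priority, so the kernel's scheduler
    # gives it CPU mostly when higher-priority threads have nothing to run.
    p = multiprocessing.Process(target=background_work)
    p.start()
    p.join()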
In considering the impact of CPU usage, it is helpful to distinguish between latency-sensitive and
latency-insensitive applications:
In a latency-sensitive application, a key consideration is the timing of the CPU cycles assigned to
it. Performance can be defined by the question “How much CPU do I get when I need it?”

