

Strata



Effective Multi-Tenant
Distributed Systems
Challenges and Solutions when Running Complex Environments

Chad Carson and Sean Suchter


Effective Multi-Tenant Distributed Systems
by Chad Carson and Sean Suchter
Copyright © 2017 Pepperdata, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles. For more
information, contact our corporate/institutional sales department:
800-998-9938 or corporate@oreilly.com.

Editors: Nicole Taché and Debbie Hardin
Production Editor: Nicholas Adams
Copyeditor: Octal Publishing Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
October 2016: First Edition



Revision History for the First Edition
2016-10-10: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Effective
Multi-Tenant Distributed Systems, the cover image, and related trade dress
are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure
that the information and instructions contained in this work are accurate, the
publisher and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained
in this work is at your own risk. If any code samples or other technology this
work contains or describes is subject to open source licenses or the
intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-96183-4
[LSI]


Chapter 1. Introduction to Multi-Tenant Distributed Systems


The Benefits of Distributed Systems
The past few decades have seen an explosion of computing power. Search
engines, social networks, cloud-based storage and computing, and similar
services now make seemingly infinite amounts of information and
computation available to users across the globe.
The tremendous scale of these services would not be possible without
distributed systems. Distributed systems make it possible for many hundreds
or thousands of relatively inexpensive computers to communicate with one
another and work together, creating the outward appearance of a single,
high-powered computer. The primary benefit of a distributed system is clear: the
ability to massively scale computing power relatively inexpensively, enabling
organizations to scale up their businesses to a global level in a way that was
not possible even a decade ago.


Performance Problems in Distributed Systems
As more and more nodes are added to the distributed system and interact with
one another, and as more and more developers write and run applications on
the system, complications arise. Operators of distributed systems must
address an array of challenges that affect the performance of the system as a
whole as well as individual applications’ performance.
These performance challenges are different from those faced when operating
a data center of computers that are running more or less independently, such
as a web server farm. In a true distributed system, applications are split into
smaller units of work, which are spread across many nodes and communicate
with one another either directly or via shared input/output data.
Additional performance challenges arise with multi-tenant distributed
systems, in which different users, groups, and possibly business units run
different applications on the same cluster. (This is in contrast to a single,
large distributed application, such as a search engine, which is quite complex
and has intertask dependencies but is still just one overall application.) These
challenges that come with multitenancy result from the diversity of
applications running together on any node as well as the fact that the
applications are written by many different developers instead of one
engineering team focused on ensuring that everything in a single distributed
application works well together.


Scheduling

One of the primary challenges in a distributed system is scheduling jobs
and their component processes. Computing power might be quite large, but it
is always finite, and the distributed system must decide which jobs should be
scheduled to run where and when, and the relative priority of those jobs.
Even sophisticated distributed-system schedulers have limitations that can
lead to underutilization of cluster hardware, unpredictable job run times, or
both. Examples include assuming the worst-case resource usage to avoid
overcommitting, failing to plan for different resource types across different
applications, and overlooking one or more dependencies, thus causing
deadlock or starvation.
The scheduling challenges become more severe on multi-tenant clusters,
which add fairness of resource access among users as a scheduling goal, in
addition to (and often in conflict with) the goals of high overall hardware
utilization and predictable run times for high-priority applications. Aside
from the challenge of balancing utilization and fairness, in some extreme
cases the scheduler might go too far in trying to ensure fairness, scheduling
just a few tasks from many jobs for many users at once. This can result in
latency for every job on the cluster and cause the cluster to use resources
inefficiently because the system is trying to do too many disparate things at
the same time.


Hardware Bottlenecks
Beyond scheduling challenges, there are many ways a distributed system can
suffer from hardware bottlenecks and other inefficiencies. For example, a
single job can saturate the network or disk I/O, slowing down every other job.
These potential problems are only exacerbated in a multi-tenant environment
— usage of a given hardware resource such as CPU or disk is often less
efficient when a node has many different processes running on it. In addition,
operators cannot tune the cluster for a particular access pattern, because the
access patterns are both diverse and constantly changing. (Again, contrast
this situation with a farm of servers, each of which is independently running a
single application, or a large cluster running a single coherently designed and
tuned application like a search engine.)
Distributed systems are also subject to performance problems due to
bottlenecks from centralized services used by every node in the system. One
common example is the master node performing job admission and
scheduling; others include the master node for a distributed file system
storing data for the cluster as well as common services like domain name
system (DNS) servers.
These potential performance challenges are exacerbated by the fact that a
primary design goal for many modern distributed systems is to enable large
numbers of developers, data scientists, and analysts to use the system
simultaneously. This is in stark contrast to earlier distributed systems such as
high-performance computing (HPC) systems in which the only people who
could write programs to run on the cluster had a systems programming
background. Today, distributed systems are opening up enormous computing
power to people without a systems background, so they often don’t
understand or even think about system performance. Such a user might easily
write a job that accidentally brings a cluster to its knees, affecting every other
job and user.


Lack of Visibility Within Multi-Tenant
Distributed Systems
Because multi-tenant distributed systems simultaneously run many
applications, each with different performance characteristics and written by
different developers, it can be difficult to determine what’s going on with the
system, whether (and why) there’s a problem, which users and applications
are the cause of any problem, and what to do about such problems.

Traditional cluster monitoring systems are generally limited to tracking
metrics at the node level; they lack visibility into detailed hardware usage by
each process. Major blind spots can result — when there’s a performance
problem, operators are unable to pinpoint exactly which application caused it,
or what to do about it. Similarly, application-level monitoring systems tend to
focus on overall application semantics (overall run times, data volumes, etc.)
and do not drill down to performance-level metrics for actual hardware
resources on each node that is running a part of the application.
Truly useful monitoring for multi-tenant distributed systems must track
hardware usage metrics at a sufficient level of granularity for each interesting
process on each node. Gathering, processing, and presenting this data for
large clusters is a significant challenge, in terms of both systems engineering
(to process and store the data efficiently and in a scalable fashion) and the
presentation-level logic and math (to present it usefully and accurately). Even
for limited, node-level metrics, traditional monitoring systems do not scale
well on large clusters of hundreds to thousands of nodes.
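
To make that level of granularity concrete, the following sketch (our illustration, not any particular product's collector) samples per-process CPU and memory on a Linux node by reading the /proc filesystem. A real collector would also need to convert clock ticks to seconds, handle short-lived processes, and ship samples off the node efficiently:

import os

def sample_process(pid):
    """Return (cpu_jiffies, rss_bytes) for one process, or None if it exited."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            fields = f.read().split()
        # Caveat: this naive split assumes the process name (field 2) contains
        # no spaces; a production parser must handle the comm field carefully.
        cpu_jiffies = int(fields[13]) + int(fields[14])  # utime + stime
        rss_bytes = int(fields[23]) * os.sysconf("SC_PAGE_SIZE")  # pages to bytes
        return cpu_jiffies, rss_bytes
    except FileNotFoundError:
        return None  # the process exited between listing and sampling

def sample_all_processes():
    """Take one sample of every process currently on this node, keyed by pid."""
    samples = {}
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            s = sample_process(int(entry))
            if s is not None:
                samples[int(entry)] = s
    return samples

Even this toy version hints at the data problem: sampling every process every few seconds across thousands of nodes produces orders of magnitude more data than node-level monitoring, which is precisely the engineering challenge described above.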


The Impact on Business from Performance
Problems
The performance challenges described in this book can easily lead to business
impacts such as the following:
Inconsistent, unpredictable application run times
Batch jobs might run late, interactive applications might respond slowly,
and the ingestion and processing of new incoming data for use by other
applications might be delayed.
Underutilized hardware
Job queues can appear full even when the cluster hardware is not
running at full capacity. This inefficiency can result in higher capital and
operating expenses; it can also result in significant delays for new
projects due to insufficient hardware, or even the need to build out new
data-center space to add new machines for additional processing power.
Cluster instability
In extreme cases, nodes can become unresponsive or a distributed file
system (DFS) might become overloaded, so applications cannot run or
are significantly delayed in accessing data.
Aside from these obvious effects, performance problems also cause
businesses to suffer in subtler but ultimately more significant ways.
Organizations might informally “learn” that a multi-tenant cluster is
unpredictable and build implicit or explicit processes to work around the
unpredictability, such as the following:
Limit cluster access to a subset of developers or analysts, out of a
concern that poorly written jobs will slow down or even crash the cluster
for everyone.
Build separate clusters for different groups or different workloads so
that the most important applications are insulated from others. Doing so
increases overall cost due to inefficiency in resource usage, adds
operational overhead and cost, and reduces the ability to share data
across groups.
Set up “development” and “production” clusters, with a committee or
other cumbersome process to approve jobs before they can be run on a
production cluster. Adding these hurdles can dramatically hinder
innovation, because they significantly slow the feedback loop of
learning from production data, building and testing a new model or new
feature, deploying it to production, and learning again.1
These responses to unpredictable performance can limit a business’s ability
to fully benefit from the potential of distributed systems. Eliminating
performance problems on the cluster can improve performance of the
business overall.


Scope of This Book
In this book, we consider the performance challenges that arise from
scheduling inefficiencies, hardware bottlenecks, and lack of visibility. We
examine each problem in detail and present solutions that organizations use
today to overcome these challenges and benefit from the tremendous scale
and efficiency of distributed systems.


Hadoop: An Example Distributed System
This book uses Hadoop as an example of a multi-tenant distributed system.
Hadoop serves as an ideal example of such a system because of its broad
adoption across a variety of industries, from healthcare to finance to
transportation. Due to its open source availability and a robust ecosystem of
supporting applications, Hadoop’s adoption is increasing among small and
large organizations alike.
Hadoop is also an ideal example because it is used in highly multi-tenant
production deployments (running jobs from many hundreds of developers)
and is often used to simultaneously run large batch jobs, real-time stream
processing, interactive analysis, and customer-facing databases. As a result, it
suffers from all of the performance challenges described herein.
Of course, Hadoop is not the only important distributed system; a few other
examples include the following:2
Classic HPC clusters using MPI, TORQUE, and Moab
Distributed databases such as Oracle RAC, Teradata, Cassandra, and
MongoDB
Render farms used for animation
Simulation systems used for physics and manufacturing



Terminology
Throughout the book, we use the following sets of terms interchangeably:
Application or job
A program submitted by a particular user to be run on a distributed
system. (In some systems, this might be termed a query.)
Container or task
An atomic unit of work that is part of a job. This work is done on a
single node, generally running as a single (sometimes multithreaded)
process on the node.
Host, machine, or node
A single computing node, which can be an actual physical computer or a
virtual machine.
1. We saw an example of the benefits of having an extremely short feedback loop at Yahoo in 2006–
2007, when the sponsored search R&D team was an early user of the very first production Hadoop
cluster anywhere. By moving to Hadoop and being able to deploy new click prediction models
directly into production, we increased the number of simultaneous experiments by five times or
more and reduced the feedback loop time by a similar factor. As a result, our models could improve
an order of magnitude faster, and the revenue gains from those improvements similarly
compounded that much faster.

2. Various distributed systems are designed to make different tradeoffs among Consistency,
Availability, and Partition tolerance. For more information, see Gilbert, Seth, and Nancy Ann
Lynch. “Perspectives on the CAP Theorem.” Institute of Electrical and Electronics Engineers, 2012.


Chapter 2. Scheduling in
Distributed Systems


Introduction
In distributed computing, a scheduler is responsible for managing incoming
container requests and determining which containers to run next, on which
node to run them, and how many containers to run in parallel on the node.
(Container is a general term for individual parts of a job; some systems use
other terms such as task to refer to a container.) Schedulers range in
complexity, with the simplest having a straightforward first-in–first-out
(FIFO) policy. Different schedulers place more or less importance on various
(often conflicting) goals, such as the following:
Utilizing cluster resources as fully as possible
Giving each user and group fair access to the cluster
Ensuring that high-priority or latency-sensitive jobs complete on time
Multi-tenant distributed systems generally prioritize fairness among users and
groups over optimal packing and maximal resource usage; without fairness,
users would be likely to maximize their own access to the cluster without
regard to others’ needs. Also, different groups and business units would be
inclined to run their own smaller, less efficient cluster to ensure access for
their users.
In the context of Hadoop, one of two schedulers is most commonly used: the
capacity scheduler or the fair scheduler. Historically, each scheduler was
written as an extension of the simple FIFO scheduler, and initially each had a
different goal, as their names indicate. Over time, the two schedulers have
experienced convergent evolution, with each incorporating improvements
from the other; today, they are mostly different in details. Both schedulers
have the concept of multiple queues of jobs to be scheduled, with admission
to each queue determined based on user- or operator-specified policies.
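
As an illustration of such policies, the following is a hypothetical allocation file for Hadoop's Fair Scheduler. The queue names and values are invented, but the XML elements are standard Fair Scheduler configuration:

<?xml version="1.0"?>
<!-- fair-scheduler.xml (illustrative; queue names and values are invented) -->
<allocations>
  <queue name="production">
    <weight>3.0</weight>  <!-- three times the share of adhoc when both are busy -->
    <minResources>100000 mb,50 vcores</minResources>
  </queue>
  <queue name="adhoc">
    <weight>1.0</weight>
    <maxRunningApps>20</maxRunningApps>
  </queue>
  <queuePlacementPolicy>
    <rule name="specified"/>              <!-- honor an explicitly named queue -->
    <rule name="default" queue="adhoc"/>  <!-- everything else -->
  </queuePlacementPolicy>
</allocations>

With a file like this in place, jobs that explicitly name the production queue are admitted to it, everything else lands in adhoc, and the weights govern how spare capacity is divided when both queues have waiting work.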
Recent versions of Hadoop1 perform two-level scheduling, in which a
centralized scheduler running on the ResourceManager node assigns cluster
resources (containers) to each application, and an ApplicationMaster running
in one of those containers uses the other containers to run individual tasks for
the application. The ApplicationMaster manages the details of the
application, including communication and coordination among tasks. This
architecture is much more scalable than Hadoop’s original one-level
scheduling, in which a single central node (the JobTracker) did the work of
both the ResourceManager and every ApplicationMaster.
Many other modern distributed systems like Dryad and Mesos have
schedulers that are similar to Hadoop’s schedulers. For example, Mesos also
supports a pluggable scheduler interface much like Hadoop,2 and it performs
two-level scheduling,3 with a central scheduler that registers available
resources and assigns them to applications (“frameworks”).
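
The division of labor in two-level scheduling can be sketched schematically. The toy code below illustrates only the pattern; it is not the actual YARN or Mesos protocol, and all names in it are invented:

# Schematic two-level scheduling: a central scheduler hands out containers,
# and each application's own master decides what to run inside them.

class Container:
    def __init__(self, node):
        self.node = node

    def run(self, task):
        print(f"running {task} on {self.node}")

class AppMaster:
    """Level two: application-specific logic (task order, locality, retries)."""
    def __init__(self, tasks):
        self.pending = list(tasks)

    def receive(self, containers):
        for c in containers:
            if self.pending:
                c.run(self.pending.pop(0))

class CentralScheduler:
    """Level one: grants cluster resources without knowing task details."""
    def __init__(self, free_containers):
        self.free = free_containers

    def grant(self, app_master, wanted):
        granted, self.free = self.free[:wanted], self.free[wanted:]
        app_master.receive(granted)

scheduler = CentralScheduler([Container(f"node{i}") for i in range(4)])
scheduler.grant(AppMaster(["map-0", "map-1", "reduce-0"]), wanted=3)

The point of the split is scalability: the central scheduler stays small and generic, while per-application complexity is pushed out to the many ApplicationMasters.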


Dominant Resource Fairness Scheduling
Historically, most schedulers considered only a single type of hardware
resource when deciding which container to schedule next — both in
calculating the free resources on each node and in calculating how much a
given user, group, or queue was already using (e.g., from the point of view of
fairness in usage). In the case of Hadoop, only memory usage was
considered.
However, in a multi-tenant distributed system, different jobs and containers
generally have widely different hardware usage profiles — some containers
require significant memory, whereas some use CPU much more heavily (see
Figure 2-1). Not considering CPU usage in scheduling meant that the system
might be significantly underutilized, and some users would end up getting
more or less than their true fair share of the cluster. A policy called Dominant
Resource Fairness (DRF)4 addresses these limitations by considering
multiple resource types and expressing the usage of each resource in a
common currency (the share of the total allocation of that resource), and then
scheduling based on the resource each container is using most heavily.


Figure 2-1. Per-container physical memory usage versus CPU usage during a representative period of
time on a production cluster. Note that some jobs consume large amounts of memory while using
relatively little CPU; others use significant CPU but relatively little memory.
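
The core of the policy fits in a few lines. The sketch below is a simplified illustration of DRF as described in the paper, not Hadoop's implementation, and the cluster capacity and per-user usage numbers are hypothetical:

# Simplified DRF: always serve the user whose dominant share is smallest.
CLUSTER = {"memory_gb": 1000, "vcores": 400}  # hypothetical capacity

def dominant_share(usage):
    """A user's dominant share is their largest fractional use of any resource."""
    return max(usage[r] / CLUSTER[r] for r in CLUSTER)

def next_user(usages):
    """Grant the next container to the user with the lowest dominant share."""
    return min(usages, key=lambda user: dominant_share(usages[user]))

usages = {
    "A": {"memory_gb": 300, "vcores": 40},   # dominant resource: memory, 30%
    "B": {"memory_gb": 100, "vcores": 160},  # dominant resource: CPU, 40%
}
assert next_user(usages) == "A"  # 30% < 40%, so A is served next

Here user A holds 30 percent of cluster memory (its dominant resource) and user B holds 40 percent of cluster CPU, so DRF serves A next even though A is using far more memory than B.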

In Hadoop, operators can configure both the Fair Scheduler and the Capacity
Scheduler to consider both memory and CPU (using the DRF framework)
when considering which container to launch next on a given node.5
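
Enabling this is a configuration change. The snippets below show the general shape; treat them as a sketch, because property names and defaults can vary by Hadoop version:

<!-- capacity-scheduler.xml: schedule on CPU as well as memory -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>

<!-- fair-scheduler.xml: use DRF as the default policy for all queues -->
<defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>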


Aggressive Scheduling for Busy Queues
Often a multi-tenant cluster might be in a state where some but not all queues
are full; that is, some tenants currently don’t have enough work to use their
full share of the cluster, but others have more work than they are guaranteed
based on the scheduler’s configured allocation. In such cases, the scheduler
might launch more containers from the busy queues to keep the cluster fully
utilized.
Sometimes, after those extra containers are launched, new jobs are submitted
to a queue that was previously empty; based on the scheduler’s policy,
containers from those jobs should be scheduled immediately, but because the
scheduler has already opportunistically launched extra containers from other
queues, the cluster is full. In those cases, the scheduler might preempt those
extra containers by killing some of them in order to reflect the desired
fairness policy (see Figure 2-2). Preemption is a common feature in
schedulers for multi-tenant distributed systems, including both popular
Hadoop schedulers (capacity and fair).
Because preemption inherently results in lost work, it’s important for the
scheduler to strike a good balance between starting many opportunistic
containers to make use of idle resources and avoiding too much preemption
and the waste that it causes. To help reduce the negative impacts of
preemption, the scheduler can slightly delay killing containers (to avoid
wasting the work of containers that are almost complete) and generally
chooses to kill containers that have recently launched (again, to avoid wasted
work).


Figure 2-2. When new jobs arrive in Queue A, they might be scheduled if there is sufficient unused
cluster capacity, allowing Queue A to use more than its guaranteed share. If jobs later arrive in Queue
B, the scheduler might then preempt some of the Queue A jobs to provide Queue B its guaranteed
share.
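
In Hadoop's Fair Scheduler, this balance is exposed as configuration. The snippet below sketches the relevant knobs with hypothetical values; the timeouts implement the slight delay before killing described above:

<!-- yarn-site.xml: allow the Fair Scheduler to preempt at all -->
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>

<!-- fair-scheduler.xml: wait before killing, so containers that are
     almost done get a chance to finish (timeouts in seconds) -->
<defaultMinSharePreemptionTimeout>60</defaultMinSharePreemptionTimeout>
<defaultFairSharePreemptionTimeout>300</defaultFairSharePreemptionTimeout>
<defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>

Raising the timeouts trades slower enforcement of fairness for less wasted work; lowering them does the opposite.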

A related approach is used by Google’s Borg system,6 which has a concept of
priorities and quotas; a quota represents a set of hardware resource quantities
(CPU, memory, disk, etc.) for a period of time, and higher-priority quota
costs more than lower-priority quota. Borg never allocates more
production-priority quota than is available on a given cluster; this guarantees production
jobs the resources they need. At any given time, excess resources that are not
being used by production jobs can be used by lower-priority jobs, but those
jobs can be killed if the production jobs’ usage later increases. (This behavior
is similar to another kind of distributed system, Amazon Web Services,
which has a concept of guaranteed instances and spot instances; spot
instances cost much less than guaranteed ones but are subject to being killed
at any time.)



Special Scheduling Treatment for Small Jobs
Some cluster operators provide special treatment for small or fast jobs; in a
sense, this is the opposite of preemption. One example is LinkedIn’s “fast
queue” for Hadoop, which is a small queue that is used only for jobs that take
less than an hour total to run and whose containers each take less than 15
minutes.7 If jobs or containers violate this limit, they are automatically killed.
This feature provides fast response for smaller jobs even when the cluster is
bogged down by large batch jobs; it also encourages developers to optimize
their jobs to run faster.
The Hadoop vendor MapR provides somewhat similar functionality with its
ExpressLane,8 which schedules small jobs (as defined by having few
containers, each with low memory usage and small input data sizes) to run on
the cluster even when the cluster is busy and has no additional capacity for
normal jobs. This is also an interesting example of using the input data size as
a cue to the scheduler about how fast a container is likely to be.
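
A minimal version of such enforcement might look like the sketch below. This is our illustration rather than LinkedIn's or MapR's actual code, and the container attributes are hypothetical stand-ins for a real cluster API:

import time

FAST_CONTAINER_LIMIT_SECS = 15 * 60  # per-container limit for the fast queue

def enforce_fast_queue(containers, now=None):
    """Kill any fast-queue container that has exceeded its time limit.

    Each container is assumed to expose .queue, .start_time, and .kill().
    """
    now = time.time() if now is None else now
    for c in containers:
        if c.queue == "fast" and now - c.start_time > FAST_CONTAINER_LIMIT_SECS:
            c.kill()  # violators are killed so the queue stays fast for everyone

In a real deployment, this check would live in the scheduler or a per-node agent rather than in user code.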

