Data Where You Want It

Geo-Distribution of Big Data and Analytics

Ted Dunning and Ellen Friedman


Data Where You Want It
by Ted Dunning and Ellen Friedman
Copyright © 2017 Ted Dunning and Ellen Friedman. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles. For more information, contact our corporate/institutional sales department at 800-998-9938.

Editor: Shannon Cutt
Copyeditor: Holly Bauer Forsyth
Interior Designer: David Futato
Cover Designer: Randy Comer

February 2017: First Edition

Revision History for the First Edition
2017-02-15: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Where You
Want It, the cover image, and related trade dress are trademarks of O’Reilly Media,
Inc.
While the publisher and the authors have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the authors disclaim all responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of or reliance on this
work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your
responsibility to ensure that your use thereof complies with such licenses and/or
rights.

978-1-491-98354-6
[LSI]


Table of Contents

Why Geo-Distribution Matters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    Goals of Modern Geo-Distributed Systems                              3
    Moving Data: Replication and Mirroring                               4
    Clouds and Geo-distribution                                         10
    Global Data Governance                                              13
    Containers for Big Data                                             16
    Use Case: Geo-Replication in Telecom                                20
    It’s Actually Broken Even If It Works Most of the Time              20
    Use Case: Shared Inventory in Ad Tech                               22

Additional Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29




Why Geo-Distribution Matters

“Data where you want it; compute where you need it.”

Thirty years ago, if someone in North America or Europe mentioned they had accepted a position at a Tokyo-based firm, your first question likely would’ve been, “When are you relocating to Tokyo?” Today the question has become, “Are you planning to relocate?” Remote communication combined with frequent flights has left the question open in global business teams.
Just as we now think differently about how people work together, so too a shift is needed in how we build and use global data infrastructure in order to address modern business challenges. We need systems that allow data to live where it should. We should be able to think of data—on-premise or in the cloud—as part of a global system. In short, our data infrastructure should give us data that can be accessed where, when, and by whom we want, and not by anyone else.

The idea of working with data centers that are geographically distant isn’t new. There’s an emerging need among many big data organizations for globally distributed but still cohesive data that meets the challenges of consistency, low latency, ease of administration, and low cost, even at huge scale.

In the past, many people built organizations around a central headquarters plus regional teams that functioned independently, each with its own regional database, possibly reporting back to HQ monthly or weekly. Data was copied and backed up at a remote location for disaster recovery, typically daily, particularly if the data was critical.



But these hierarchical approaches aren’t good enough anymore. With a global view via the Internet, people expect to touch and respond to business from anywhere, at any time. To really get this done, you need data to reside in many places, with low-latency coordination of updates, and still be robust against communication failures on the Internet. Data infrastructure often needs to be shared by many applications. We may need a data object to live in more than one place—that’s geo-distributed data. This includes cloud computing, because cloud is really just one more “place,” as suggested by Figure 1-1.

Figure 1-1. Emerging technologies address the need for data that can be shared and updated globally, at massive scale, with very low latency and high consistency. There’s also a need for fine-grained control over who has access. Here, on-premise data centers are shown as rectangles that share data directly with distant locations or form a hybrid system with public or private cloud.
The challenges posed by the required scale and speed alone are substantial. For example, IoT sensor data systems commonly move data at a rate of tens of gigabits per second, and some exceed a terabit per second.
The need for truly geo-distributed data—both in terms of storage and computation—requires new approaches, and these new approaches are the topic of this report. These approaches have emerged bit by bit rather than all at once, but the accumulated change is now large enough that even experienced practitioners should take a new look.
Previously, systems that needed data to be available globally would have explicitly copied data from place to place instead of using platform-level capabilities to automatically propagate changes. In practice, however, it pays to think of data as a global system in which the same data objects are shared across different locations. This geo-distribution of data combined with appropriate design patterns can make it much simpler to build applications and can result in more reliable, scalable systems that span the globe. Similarly, the management of computation in global systems has historically been very haphazard. This is improving with containers, which allow precise deployment of a precisely known distribution of code.

In this report, we describe the requirements and challenges of such a system as well as examples of specific technologies designed to meet them. We also include a collection of real-world use cases where low-latency geo-distribution of very large-scale data and computation provide a competitive edge.

Goals of Modern Geo-Distributed Systems
As we examine some of the emerging technologies that address the need for geo-distributed systems, keep these goals in mind. Many modern systems need to take into account the following facts:

• Data comes from many sources, in many forms, and from many locations
• Sometimes data has to stay in a particular location—for example, to meet government regulations or to optimize performance
• In many other cases, a global view is required—for example, for reporting or machine learning in which global models are built from much more than just local activity
• Central data often needs to be distributed in order to be shared, as for inventory control, model deployment, accurate predictive maintenance for large-scale industrial systems, or for disaster recovery
• Computation (and the associated data) sometimes needs to be near the edge, such as in industrial control systems, and simultaneously in a central analytics facility that has access to data from across the entire enterprise
• To stay competitive in modern settings, data agility and a microservices approach may be required
With these demands, how do new technologies meet the challenges?

Global View: Data Storage and Computation
One reason for geo-distribution of data is to put data at a remote site as part of a disaster recovery plan, but movement of data between active data centers is also key for efficient use of data in many situations. It’s generally more efficient to access data locally, so storing the same data in multiple places and being able to replicate updates with low latency are valuable capabilities. One key is to be able to specify how and where data should end up without saying exactly which bits to move. We discuss new options for data movement in the next section of this report. Although local access is generally desirable, you should have the choice of accessing data remotely as well, and we discuss some of the ways that can be done more easily in “Global Data Governance” on page 13.

Data movement is the most important issue in geo-distributed systems, but computation is a factor, too. This is especially true where data is produced at very high volume and rate, as with IoT sensor data. Edge computing is becoming increasingly important in such situations. In the free report, Data: Emerging Trends and Technologies, Alistair Croll refers to “…a renewed interest in computing at the edges—Cisco calls it ‘fog computing…’”. An application that has been developed and tested at a central data center needs to be deployed to multiple locations. How can you do this efficiently and with confidence that it will perform as needed at the new locations? We address that topic in “Containers for Big Data” on page 16.

Moving Data: Replication and Mirroring
Remote mirroring and geo-distributed data replication can be done in a variety of ways. Traditional methods for data movement were not designed to provide the large scale, low latency, and low cost that modern systems demand. We’ll examine capabilities needed to do this efficiently, the challenges in doing so, and how some emerging technologies begin to address those challenges.


Remote Mirroring
One of the most basic reasons for moving data across a distance is to mirror data to a cluster at a remote site as part of a disaster recovery plan. Data mirroring is generally a scheduled event. In efficient systems, after the initial mirror is established, subsequent mirrors are incremental. This incremental updating is particularly desirable for large-scale systems because moving fewer bytes decreases the time required and the chance for error.

It’s also useful to guarantee that the mirror copy is a fully consistent image of the source. That is to say that if multiple files are changing in the source unit being mirrored, there needs to be a mechanism to ensure that the mirror copy reflects the exact state of the source at a point in time, rather than different states for different files at different times as the mirroring was done.

An example of how this can be accomplished efficiently is seen in the design of the mirroring process of the MapR Converged Data Platform. Mirroring is done at the level of a volume, which acts like a directory and can contain files, directories, NoSQL database tables, and message streams. First, point-in-time snapshots of the source and destination volumes are made. Blocks in the source that have changed since the last mirror update are transferred to the destination snapshot. Once all changed blocks have been transferred, the destination mirror and the destination snapshot are swapped in a single atomic operation. The use of a snapshot also means that the source volume remains available for reads and writes throughout the mirroring process. The effect is point-in-time updates to the mirror destination.
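To make the sequence of steps concrete, here is a minimal Python sketch of that snapshot-based incremental mirror cycle. It is purely conceptual: a "volume" is modeled as a dictionary of blocks, and the helper functions stand in for whatever the underlying platform actually provides. It illustrates the algorithm described above, not MapR's API.

```python
# Conceptual stand-ins: a "volume" is just a dict of block_id -> bytes here.
def create_snapshot(volume):
    return dict(volume)                      # frozen point-in-time copy

def changed_blocks(snapshot, previous):
    return [b for b, data in snapshot.items() if previous.get(b) != data]

def mirror_update(source_volume, mirror_volume, last_mirrored):
    """One incremental, point-in-time mirror cycle (sketch, not MapR's API)."""
    # 1. Freeze consistent views of both ends; the source stays writable.
    src_snap = create_snapshot(source_volume)
    dst_snap = create_snapshot(mirror_volume)

    # 2. Ship only the blocks that changed since the last completed mirror.
    for block_id in changed_blocks(src_snap, last_mirrored):
        dst_snap[block_id] = src_snap[block_id]

    # 3. Promote the updated snapshot in one atomic step, so readers of the
    #    mirror always see a consistent point-in-time image of the source.
    mirror_volume.clear()
    mirror_volume.update(dst_snap)
    return src_snap                          # becomes "last_mirrored" next time

# Example: an initial mirror followed by an incremental one.
source = {"b1": b"hello", "b2": b"world"}
mirror = {}
baseline = mirror_update(source, mirror, last_mirrored={})
source["b2"] = b"world, updated"
mirror_update(source, mirror, baseline)
print(mirror)   # {'b1': b'hello', 'b2': b'world, updated'}
```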

Remote Replication
Another common requirement with geo-distributed data is to update a remote database with near–real time replication. Traditionally, this was often done by exporting report tables in regional data centers and then copying them to a central location to be imported into a central database. To get data on a more real-time basis, change data capture (CDC) systems were used to set up master-slave replication of databases. This required copying a snapshot of the local table to the remote system, followed by starting a log-shipping system to copy updates to the remote replica. The initial copy could take considerable time, so several rounds of incremental copying might be needed to get an up-to-date clone of the database. Setting these systems up has traditionally been a bit tricky, and they are often difficult to keep running.
Increasingly, however, systems need to respond to changing situations in seconds or milliseconds. Similarly, it isn’t just a matter of moving data to the center; data has to flow outward as well. Likewise, substantial amounts of computation need to happen at the edge, near the source of data.

These new requirements are increasing the complexity of the replication patterns, and that is making it harder to maintain data consistency in these systems. Also, we want to be able to separate concerns, leaving content questions to application developers and data motion questions to administrators, who should not need to know much about exactly what data is being moved. Systems also must be resilient to interruptions caused by network partitions (and maintain consistency as much as possible), so we generally want multi-master replication, where updates can be made to different copies of the data. There’s also the issue of near–real time replication of data objects other than databases, such as message streams. These capabilities are just beginning to be available in some emerging big data technologies, as we discuss later in the report.

Why Multi-Master Geo-Replication Matters
Multi-master, bi-directional data replication across data centers can reduce latency, simplify architectures, and protect against data loss due to network issues, as shown in Figure 1-2. This style of replication allows faster access to data, but replication loops must be avoided.

If a system does not have a built-in way to detect and break loops, manual effort is required to change replication patterns when failures happen in order to maintain consistency. In MapR table replication, for example, updates remember their origin so that update loops can be avoided. Table 1-1 shows some benefits of multi-master updates.



Figure 1-2. Advantages of multi-master geo-replication. In master-slave replication (A), data sources only write to one master and rely on replication to move data to the other location. But a network partition could easily prevent insertion, resulting in data loss. In multi-master replication (B), data is ingested near the source with less chance for loss. Bi-directional replication updates both databases in near–real time.
Table 1-1. Types of geo-distributed replication for different tools

Tool                           Type of data replicated   Multi-master
MySQL                          DB                        X
Oracle GoldenGate              DB                        X
NuoDB                          DB
MapR Converged Data Platform   Files, DB, streams
Cassandra                      DB
PostgreSQL                     DB                        X



Conflict Resolution: The Question of Consistency
Achieving consistency in a distributed system while maintaining performance (low latency) is not easy. In addition, maintaining availability requires some compromises. Consider two approaches taken by big data technologies that offer multi-master replication: Cassandra and MapR Converged Data Platform.

Eventual consistency: Cassandra
Cassandra deals with the tradeoffs between consistency, availability, and performance by allowing replicas of data to be temporarily inconsistent in order to increase availability of the database. A configurable option allows you to specify how many replicas must be written as well as how many must be read. This is true for local clusters or with geo-replication. With the default settings, these numbers are set low to improve performance, at some cost in consistency. It is assumed that in practice, applications will use parameters such that inconsistencies will be detected on read and corrected—providing eventual consistency.

Part of the difficulty with this approach is the partial dependency on the correct configuration for consistency: you have to configure the applications correctly or it may not work. For this reason, the default setting may be considered somewhat dangerous, and if data loss is a particular concern, then non-default settings are preferred (see the Cassandra documentation or testing sites such as Aphyr.com for more in-depth explanations). The goal of Cassandra’s design is to generally prioritize availability over consistency.
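As a concrete illustration, the DataStax Python driver lets an application request stronger consistency per statement. This is a minimal sketch: the contact point, keyspace, and table schema below are assumptions made up for the example, and the consistency level should be chosen to match the guarantees your application actually needs.

```python
from datetime import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical contact point and keyspace, for illustration only.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("events")

# QUORUM requires a majority of replicas to acknowledge the write, trading
# some latency and availability for stronger consistency than the low
# defaults discussed above.
insert = SimpleStatement(
    "INSERT INTO clicks (user_id, ts, url) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, ("user-42", datetime.utcnow(), "/home"))
```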

Strict local consistency: MapR Converged Data Platform
MapR’s Converged Data Platform takes a different tack by making consistency within a data center non-negotiable while allowing table replication to fall behind on network failures. Conflict resolution between delayed updates from different sources is done at the platform level using timestamp resolution for updates—the last write wins. Replication between tables is achieved at the lowest level by shipping logs to replicas, but the origin of each update is recorded so that loops in the replication pattern can be broken.

To achieve strict local consistency, there can be a slight reduction in availability (possibly seconds per year) if machines or local network links fail, which may not be a concern.
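To see why recording the origin of each update breaks replication loops, consider this small Python sketch. It is purely conceptual (the cluster names, update structure, and forwarding function are invented for illustration), but it shows the rule: an update is applied locally using last-write-wins, then forwarded to peers, except back to the cluster it came from or any cluster it has already visited.

```python
from dataclasses import dataclass, field

@dataclass
class Update:
    key: str
    value: str
    timestamp: float          # used for last-write-wins conflict resolution
    origin: str               # cluster where the write first happened
    seen_by: set = field(default_factory=set)

def forward_update(peer, update):
    # Stand-in for the real transport; here we just record the send.
    print(f"replicating {update.key} to {peer}")

def apply_and_replicate(update, local_cluster, peers, table):
    """Apply an update locally, then forward it without creating loops."""
    # Last write wins: only apply if this update is newer than what we hold.
    current = table.get(update.key)
    if current is None or update.timestamp > current.timestamp:
        table[update.key] = update

    # Never send the update back to where it has already been.
    update.seen_by.add(local_cluster)
    for peer in peers:
        if peer != update.origin and peer not in update.seen_by:
            forward_update(peer, update)

# Example: an update that originated in dc1 arrives at dc2 and is forwarded
# only to dc3, never back to dc1.
table = {}
u = Update(key="sku-123", value="qty=7", timestamp=1.0, origin="dc1")
apply_and_replicate(u, local_cluster="dc2", peers=["dc1", "dc3"], table=table)
```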


Beyond Database Replication: Streaming Data
The increasing interest in streaming data from continuous events, such as IoT sensor data or clickstream data from web traffic, underlines the need for easy, reliable geo-distributed replication of message streams. As with real-time replication of database updates, it’s preferable to have the capability for multi-master stream replication across data centers.

Apache Kafka is a popular tool for stream transport. Streaming data is assigned to a topic, which in turn is divided into partitions for load balancing and improved performance. Order of messages is maintained within each topic partition. Kafka is generally run on a cluster separate from the stream processing technologies (such as Apache Spark Streaming, Apache Apex, or Apache Flink) and from the main data persistence. In the Kafka cluster, a single topic partition replica must fit on a single machine, but multiple partitions generally reside on each machine.

Kafka addresses the need to move data between data centers via a program called MirrorMaker that consumes messages from a topic and re-publishes them to a remote Kafka cluster. MirrorMaker can’t replicate a single topic bi-directionally (so no multi-master replication), and message offsets are not maintained between copies.
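The essence of what MirrorMaker does can be sketched in a few lines with the kafka-python client: consume from a topic on the source cluster and re-publish each message to the destination cluster. The broker addresses and topic name here are assumptions for illustration. Note that, as described above, the copied messages receive new offsets on the destination, and nothing in a loop like this prevents replication cycles if you tried to run it in both directions.

```python
from kafka import KafkaConsumer, KafkaProducer

SOURCE_BROKERS = ["dc1-kafka:9092"]   # assumed source cluster
DEST_BROKERS = ["dc2-kafka:9092"]     # assumed destination cluster
TOPIC = "clickstream"                 # assumed topic name

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=SOURCE_BROKERS,
    group_id="mirror-sketch",
    auto_offset_reset="earliest",
)
producer = KafkaProducer(bootstrap_servers=DEST_BROKERS)

for msg in consumer:
    # Re-publish the raw bytes; the destination assigns its own offsets,
    # which is why offsets are not preserved across the copy.
    producer.send(TOPIC, key=msg.key, value=msg.value)
```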
Another technology for streaming data message transport is MapR Streams. The MapR Converged Data Platform supports streams, directories, files, and tables in the same system. Typically, the same cluster is used for both data persistence and stream processing applications (such as Spark, Apex, Flink, etc.). MapR Streams supports the Kafka API, and like Kafka, streaming data is assigned to a topic that has topic partitions. Unlike Kafka, a topic partition in MapR is distributed across the cluster. Furthermore, with MapR, many topics are collected together into an object known as a Stream. Policies such as time-to-live, data access control via ACEs (Access Control Expressions), or geo-replication are applied at the Stream level.

Like MapR direct table replication, geo-replication of MapR Streams is near–real time, bi-directional, and multi-master with loop avoidance. The correct sequence of messages in a topic partition is maintained during geo-replication; message and consumer offsets are preserved. This is true for replication between on-premise clusters in distant data centers as well as between an on-premise cluster and a cloud cluster for hybrid cloud streaming architectures. Other examples of geo-replication of MapR Streams are found in the telecom and ad-tech use cases at the end of this report.

Clouds and Geo-distribution
Cloud computing is one of the major forces driving common adoption of geo-distributed computing. The simple reason is that public clouds allow just about anybody to have multiple data centers. And once you have multiple data centers, the practical choice is to either have good support for geo-distributed data or wind up with data chaos.

The Core Trend to Cloud
The reason public clouds lead to multiple data centers is that it has always been desirable to have multiple data centers (for disaster recovery if for no other reason), and public clouds make it easier to have multiple data centers since you don’t have to go to unfamiliar places to provision the hardware or staff them. Global business presence and widely distributed sources of data make multiple data centers even more attractive since you can have data close to your users, thus making it easier for them to get data as well as provide it. In contrast, having a single centralized data center introduces long latencies and decreases reliability due to the distance that data must traverse.
Not all clouds are public clouds. The point of cloud computing doesn’t require that the cloud be run by an external team. The core idea is really to treat computing as a fungible commodity that can quickly and efficiently be repurposed to different needs using virtualization or container technology. More and more companies are reorganizing their on-premise computing as private clouds to increase resource utilization. There are also specialized clouds available that meet special needs such as heightened levels of security, telecommunications support, or prebuilt healthcare systems.

Cloud Neutrality
As more options become available for cloud computing, both publicly and privately, the concept of cloud neutrality is becoming very important. The idea is that having multiple cloud options only makes a difference if you can change your mind and aren’t completely locked into just one of them. If your applications are cloud-neutral, then you can take advantage of the competition between various public cloud vendors and in-house facilities by using each kind just for what they do best. In addition, having all of your cloud-based computation handled by a single vendor runs the risk of catastrophic failures due to a common platform vulnerability or a coordinated failure. Spreading critical functions across multiple vendors can mitigate this risk.

The common result of all this is that many large companies have one or more on-premise data centers and are increasingly looking to also have significant presence in the clouds provided by one or even multiple vendors.

Cloud Bursting and Load Leveling
Commercial clouds have a very different cost model (by the hour or minute) than the fixed-asset depreciation model of on-premise data centers or private clouds. This difference means that commercial clouds can be very cost effective for loads that have a high peak-to-valley ratio, or that are intermittent with a low duty cycle. On-premise systems are often much more cost effective for relatively constant processing loads with high utilization. Architectures such as the one shown in Figure 1-3 can make use of the arbitrage between these cost models by pushing the variable compute loads into a commercial cloud and retaining constant loads in-house.



Figure 1-3. This diagram shows how a variety of compute models can be used together if geo-distributed data movement is available. An on-premise data center can auto-magically replicate data to a core cloud-based cluster. That core cluster can burst to larger sizes when necessary in response to changing compute needs. Similar techniques can be used to extend data structures to or from a private cloud.
During peak loads, additional cloud resources can be recruited to handle the higher loads (a temporary cloud burst), then these resources can be released as the peak drops. A core cloud cluster typically remains after the burst to provide a durable home for any data required for the next burst.

The residue of work that remains after off-loading variable loads into a commercial cloud is a very steady compute load that makes ideal use of the cost advantages of on-premise or private cloud computing. The overall cost savings can easily be 2:1 or more relative to a pure on-premise or a pure commercial cloud strategy.
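A rough back-of-the-envelope model shows where that kind of saving can come from. The numbers below (cluster sizes, hourly rates, duty cycle) are invented purely for illustration; the point is only that paying burst rates for the peak while keeping the steady base load on fixed-cost hardware can beat either pure strategy.

```python
# Hypothetical workload: a steady base of 50 nodes, plus 150 extra nodes
# that are only needed about 10% of the time.
HOURS_PER_MONTH = 730
base_nodes, peak_extra_nodes, peak_duty_cycle = 50, 150, 0.10

onprem_cost_per_node_hour = 0.30   # assumed amortized fixed-asset cost
cloud_cost_per_node_hour = 0.60    # assumed on-demand hourly rate

# Pure on-premise: must provision for the peak all month long.
pure_onprem = (base_nodes + peak_extra_nodes) * HOURS_PER_MONTH * onprem_cost_per_node_hour

# Pure cloud: pay cloud rates for the base plus the occasional peak.
pure_cloud = (base_nodes * HOURS_PER_MONTH
              + peak_extra_nodes * HOURS_PER_MONTH * peak_duty_cycle) * cloud_cost_per_node_hour

# Hybrid: the steady base stays on-premise; only the bursts go to the cloud.
hybrid = (base_nodes * HOURS_PER_MONTH * onprem_cost_per_node_hour
          + peak_extra_nodes * HOURS_PER_MONTH * peak_duty_cycle * cloud_cost_per_node_hour)

print(f"pure on-premise: ${pure_onprem:,.0f}/month")
print(f"pure cloud:      ${pure_cloud:,.0f}/month")
print(f"hybrid:          ${hybrid:,.0f}/month")
```

With these made-up numbers the hybrid approach comes out roughly 2.5 times cheaper than provisioning on-premise for the peak, and noticeably cheaper than running everything in the cloud.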
Making this hybrid cloud architecture work, however, requires the ability to replicate data between private cloud, on-premise, core cloud, and burst compute systems. Without good platform support, this can be a show stopper for these architectures. Strict cloud neutrality of at least some applications is also very helpful.



Hybrid Cloud Architectures
Unfortunately, while cloud technology makes it easy to have the equivalent of many data centers, it doesn’t help much with the problems of getting data to the right place or with controlling access. Some cloud vendors provide a wide array of services that deal with some of these problems, and there are many databases that solve at least part of the problem. The core problem, however, is the complexity of managing data transfers. Since hybrid clouds necessarily cross cloud provider boundaries, locking applications to a single cloud vendor by making heavy use of that vendor’s services is not a viable option in hybrid cloud architectures.

How geo-distributed data helps with hybrid clouds
Hybrid clouds promise substantial cost and flexibility advantages over either private clouds or conventional on-premise data centers; however, the daunting complexity of making sure that the right data is available to the right service in the right place at the right time can make it infeasible to make use of hybrid cloud architectures—that is, it can make it infeasible if you don’t have a good geo-distributed data platform.

Modern geo-distributed data systems allow the complexity of these data motions to be managed more easily by handling a large number of issues at the platform level that would otherwise require special-purpose application code. Getting this right is crucial because it allows application developers to focus on building applications rather than dealing with an increasingly complex ad hoc solution to data motion.

Global Data Governance
Effective data governance isn’t a product or even a single product feature. It is a complex process and mindset that requires a range of controls and monitoring. When you add geo-distribution of data into the mix, you need to be ready with the right tools. Without them, a difficult problem can become massively harder. This section describes how geo-distribution makes data governance more important than ever before and how you can deal with the problems raised.



Let’s start with the basics. To have good data governance, you need at least the following capabilities:

• A viable plan for disaster recovery
• An auditable history of what data you have, where it came from, and when
• An auditable history of how all derivative data was created, by whom, and when
• Expressive and uniform access controls across all your data
• Tools that allow selective and application-specific encryption and masking of individual data elements
• Tools that help you find data that should have been masked or encrypted but isn’t
• Platform-level controls that let you comply with data sovereignty laws that require that data stays where it started, and anti-corruption regulations that require enough data to leave the country to be able to detect rogue operations
• Global metrics from the entire system—you can’t manage what you don’t measure
In a geo-distributed data world, this all becomes harder, partly due to increased complexity, but also partly due simply to scale. In fact, even analyzing just the audit logs from a global-scale data system requires a small cluster. Some clusters have a trillion files—just getting a list of all of them may take a significant amount of time. Simple solutions that depend on isolated operation and non-scalable implementations are no longer sufficient. Starkly put, governance has to be as global as your data is.

If you go wrong at the architectural and platform level, proper data governance may become completely impossible. For instance, if every application development group requires a separate production cluster due to lack of good multi-tenancy, and if every kind of data persistence (database, file, or stream) requires a separate cluster such that you multiply those clusters by three or more, then a moderate to large enterprise can easily wind up with hundreds or thousands of independently managed clusters all configured differently, with different security models and different governance requirements. That way truly madness lies (or so King Lear might tell us).



If you plan well, however, many of these requirements can be satisfied directly at the platform level. For instance, robust platform-level disaster recovery planning (such as via automated incremental mirroring), full audit logs, uniform access control, control over data movement, and global metrics can all be provided by the data platform itself, and with the right platform, this can all apply uniformly to files, streams, and tables no matter where they are located. In fact, with the right platform, hundreds of applications can co-reside on the same cluster. Moreover, if files, tables, and streams all share the same security model on the same platform, you may only need to manage and govern a few clusters. Reducing the number of clusters by having files, tables, and streams all under one multi-tenant roof can make a huge difference.

Global Namespace
The process of effective data governance is made much easier and less error prone if your geo-distributed system can provide a global namespace. In that case, you would use the same pathname for an object regardless of whether you were getting access from your local cluster or a remote on-premise or cloud cluster—think how convenient that would be.

The global namespace advantage is taken even further in the case of the MapR Converged Data Platform. In that case, files, tables, and streams can reside in the same directory. An application that needs to use a message stream and a NoSQL database, for instance, can access both via a shared path to the directory. This global namespace gives you a unique and universal reference for every data object in your entire system, regardless of where it is located or from where you (or your application) are accessing the data.

Having a global namespace also lets you put related data objects together in a directory. Control over the directory permissions lets you limit access to the files both via the permissions on the files themselves as well as the directory.
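A short sketch of what that looks like from application code. The cluster name, directory layout, and object names below are invented for illustration; the point is that files, a table, and a stream for one project can all be addressed under a single directory path, and the same paths work no matter which cluster the application happens to be running on.

```python
import json

# One directory groups all of a (hypothetical) project's data objects.
PROJECT_DIR = "/mapr/global.example.com/analytics/clickstream"

CONFIG_FILE = f"{PROJECT_DIR}/config.json"   # a plain file
EVENTS_TABLE = f"{PROJECT_DIR}/events"       # a NoSQL table at the same path level
RAW_STREAM = f"{PROJECT_DIR}/raw"            # a message stream alongside it

# Files are reachable through ordinary file I/O because the namespace is
# mounted like a filesystem; tables and streams are handed to their own
# client libraries using the same style of pathname.
with open(CONFIG_FILE) as f:
    config = json.load(f)

print("model version:", config.get("model_version"))
print("table path:   ", EVENTS_TABLE)
print("stream path:  ", RAW_STREAM)
```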

Data Sovereignty and Geo-Distribution
Data sovereignty refers to national regulations that require that personally identifiable or other private data originating within a particular country remain in that country. As such, it may be surprising to think of data sovereignty as an aspect of data governance that is closely related to geo-distributed data, but data sovereignty forces companies to have data presence in multiple countries. Other non-restricted data, such as reports or anonymized data, may be shared between centers in a global company.

As with the rest of distributed governance, having a clean and simple data platform with fine-grained control of data and job placement makes a big difference when it comes to complying with data sovereignty restrictions.

Containers for Big Data
When you start talking about running applications in different locations rather than in one carefully curated environment, a number of problems crop up. The two most obvious problems are the correct distribution of exactly the intended version of the application (down to each bit) and providing all of the difficult-to-see environmental dependencies of that application. Not surprisingly, specialized systems have been developed to address these issues. Among the competing systems, Docker seems to be the emerging consensus.

Container technologies such as Docker address the issues of running applications in controlled environments even when different applications require different environments. The exact bits that make up the applications and their dependencies can be packaged, distributed, and run in a reliable and repeatable fashion. Containers don’t need to run a completely independent copy of the operating system and related components such as the network stack. This is in strong contrast with virtualization, where each virtual machine runs an independent copy of the operating system and everything else.

Key Problem: Stateful Containers
This all works very well with stateless containers—that is, containers that don’t need to maintain large amounts of data themselves. With stateful containers that need to retain a lot of data, however, it is a whole different ballgame. The key problem to solve is that data bandwidth has to scale with the number of containers, without huge costs and without impairing launch time. Most traditional solutions miss the mark on at least one of the requirements of scaling, cost, or launch speed.



For instance, one obvious potential solution is to embed the data in the running containers themselves. This is a problem since the data has to outlive any given container and thus can’t be embedded in the container image itself. Discarding the data on exit doesn’t work either because the data in a stateful container typically needs to outlive the container. By symmetry, data that outlives one container should exist until the replacement container is fully running. That, in turn, means that this new container will need to be filled with that data on launch, which makes scalability difficult. Furthermore, even if you do load the data into the new container, data local to a container can be hard to share. In short, this doesn’t work well.

A second option is to run applications in containers that include some distributed data store as well. This is a substantial problem because the application lifecycle is fundamentally different from the data lifecycle. It takes time, potentially a long time, after the container is launched before it can contribute to data storage tasks, whereas we usually want the application to work immediately, with all participants in the distributed data system working effectively at all times. Restarting many containers with new application code shouldn’t impair access to long-lived data, but collocating infrastructural data storage code with stateful applications will result in exactly that impairment.
Likewise, block-level storage (such as a SAN system) provided outside the containers isn’t the answer because individual containers have to form local file systems on top of the block storage. This takes time and is inflexible and difficult to share between containers. Even worse, data bandwidth doesn’t scale with the number of containers because all of the data is actually hosted on the SAN rather than in the container. Data locality is also a huge problem because the data isn’t local to any containers. This leads to the previous problem of data lifecycle in which data must be loaded and unloaded.

Other conventional data stores like a network attached storage (NAS) system or a conventional database are complete non-starters as large-scale stateful container storage because such systems just can’t scale to meet the demands of large numbers of stateful containers.



How to Make Stateful Containers Work for Big Data
The modern response to the problem of supporting stateful containers is to use a scalable and distributed data platform to which all containers have access. It’s often desirable to collocate the data platform with the same machines that run the containers. Due to the mass of data, the scalable storage layer is often deployed alongside the base operating system that supports containers instead of inside containers, but deploying the data platform inside special-purpose containers is also possible. This results in a system something like what is shown in Figure 1-4.

Figure 1-4. Containers running stateful and stateless applications can be deployed by a container management system like Kubernetes, but can also access a scalable data platform. Applications running in containers can communicate with each other directly or via the data platform. The application state can reside logically in the data platform, but can often be physically collocated with the applications.
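A minimal sketch of that pattern using the Docker SDK for Python: the container image holds only application code, and its state lives on a shared data-platform path that is bind-mounted into the container. The image name and host path are assumptions made up for the example, with the host path standing in for wherever the distributed data platform is mounted on the node.

```python
import docker

client = docker.from_env()

# Hypothetical application image and data-platform mount point on the host.
APP_IMAGE = "example/clickstream-scorer:1.4.2"
PLATFORM_PATH = "/mapr/global.example.com/analytics/clickstream"

container = client.containers.run(
    APP_IMAGE,
    detach=True,
    name="scorer-1",
    # The container sees its durable state at /data; the bytes actually
    # live on the shared, replicated data platform, so the container
    # itself stays effectively stateless.
    volumes={PLATFORM_PATH: {"bind": "/data", "mode": "rw"}},
    environment={"STATE_DIR": "/data"},
)
print(container.id)
```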
Having special-purpose datastores like Apache Cassandra that are
individually scalable provides some of the benefits of this design, but
that design falls short in many respects. One of the key problems is
that a database like Cassandra only provides one level of namespace

(table names) and only provides one kind of persistence, namely
tables. Similarly, databases can’t really provide file access, nor are
they really suitable for streaming data.
Similar problems occur with special-purpose file stores like HDFS. Problems also crop up with HDFS specifically in that the short-circuit read used to increase performance with HDFS is disabled for processes in containers. This can severely limit the read performance of container-based applications, even for data that is located on the same machine as the container doing the read.
Crucial to the idea of having a data platform to support stateful applications is that the data platform should directly allow applications access to high-level persistence objects such as files, tables, and message streams, and organize these objects using some kind of namespace.

Example: Handling State and Containers on MapR
The MapR Converged Data Platform is an example of a system that can support stateful applications in containers as described here. A conventional MapR cluster is created, and then Docker is installed either on the nodes of the MapR cluster or on other nodes if network I/O is sufficiently high. Docker containers based on an image that contains the MapR DB, Streams, and POSIX file access are then used to hold the stateful services. These services can access files, tables, or streams uniformly with the exact same file names from any container without even knowing where the container is located on the network. File location information is available to a coordination system such as Kubernetes or Apache Mesos to allow special containers to be placed close to any desired data resources.

Typically, containers will not be launched directly, but instead will be launched by Kubernetes or Mesos. Whatever framework is used injects any necessary user credentials and cluster address information into the container. Because the containers themselves don’t require any state, they can launch quickly and be productive within seconds.

In addition, the MapR file system can be used to store and deploy container images. Again, the universal namespace helps by allowing any legal user ID to access the images by the same path names regardless of the location of the machine storing the Docker images.
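Here is a sketch of the launch step an orchestrator performs, again using the Docker SDK for Python rather than Kubernetes manifests so everything stays in one language. The image name, environment variable names, and cluster address are invented for illustration; the idea being shown is that credentials and cluster location are injected at launch time, so the container image itself stays stateless and generic.

```python
import docker

client = docker.from_env()

def launch_stateful_service(name, user, ticket, cluster_cldb):
    """Start one service container with its identity injected at launch.

    The image and variable names are hypothetical; in practice an
    orchestrator such as Kubernetes or Mesos performs this step.
    """
    return client.containers.run(
        "example/mapr-client-app:2.0",     # image with app + data-platform client
        detach=True,
        name=name,
        environment={
            "CLUSTER_CLDB": cluster_cldb,  # where to find the cluster
            "SERVICE_USER": user,          # identity the service runs as
            "AUTH_TICKET": ticket,         # injected credential, not baked into the image
        },
    )

svc = launch_stateful_service(
    name="inventory-updater-3",
    user="svc_inventory",
    ticket="<ticket contents injected by the orchestrator>",
    cluster_cldb="cldb.dc1.example.com:7222",
)
print(svc.status)
```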

Summary
Containers make a perfect match for geo-distributed data, particularly because they help build stable deployment platforms for services. Conversely, some geo-distributed data systems such as MapR work well for running large swarms of containers.


