1. 1. Clusters
1. Building Solutions
2. Single vs. Many Clusters
3. Multitenancy
4. Backup & Disaster Recovery
5. Cloud Services
6. Provisioning
7. Summary
2. 2. Compute & Storage
1. Computer architecture for Hadoopers
1. Commodity servers
2. Non-Uniform Memory Access
3. Server CPUs & RAM
4. The Linux Storage Stack
2. Server Form Factors
1. 1U
2. 2U
3. 4U
4. Form Factor Price Comparison
3. Workload Profiles
1. Other Form Factors
4. Cluster Configurations and Node Types
1. Master Nodes
2. Worker Nodes
3. Utility Nodes
4. Edge Nodes
5. Small Cluster Configurations
6. Medium Cluster Configurations
7. Large Cluster Configurations
3. 3. High Availability
1. Planning for Failure
2. What do we mean by High Availability?
1. Lateral or Service HA
2. Vertical or Systemic HA
3. Automatic or Manual Failover
3. How available does it need to be?
1. Service Level Objectives
2. Percentages
3. Percentiles
4. Operating for High Availability
1. Monitoring
2. Playbooks
5. High Availability Building Blocks
1. Quorums
2. Load Balancing
3. Database HA
4. Ancillary Services
6. High Availability of Hadoop Services
1. General considerations
2. ZooKeeper
3. HDFS
4. YARN
5. HBase High Availability
6. KMS
7. Hive
8. Impala
9. Solr
10. Oozie
11. Flume
12. Hue
13. Laying out the Services



Hadoop in the Enterprise:
Architecture
A Guide to Successful Integration
Jan Kunigk, Lars George, Paul Wilkinson, Ian Buss


Hadoop in the Enterprise:
Architecture
by Jan Kunigk , Lars George , Paul Wilkinson , and Ian Buss
Copyright © 2017 Jan Kunigk, Lars George, Ian Buss, and Paul Wilkinson. All
rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc. , 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles (
). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
.
Editor: Nicole Tache
Production Editor: FILL IN PRODUCTION EDITOR
Copyeditor: FILL IN COPYEDITOR
Proofreader: FILL IN PROOFREADER
Indexer: FILL IN INDEXER
Interior Designer: David Futato

Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
September 2017: First Edition


Revision History for the First
Edition
2017-03-22: First Early Release
See for release
details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hadoop
in the Enterprise: Architecture, the cover image, and related trade dress are
trademarks of O’Reilly Media, Inc.
While the publisher and the author(s) have used good faith efforts to ensure that
the information and instructions contained in this work are accurate, the
publisher and the author(s) disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained
in this work is at your own risk. If any code samples or other technology this
work contains or describes is subject to open source licenses or the
intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-96927-4
[FILL IN]


Chapter 1. Clusters
Big Data and Apache Hadoop are by no means trivial in practice, as there are
many moving parts, and each requires its own set of considerations. In fact,
each component in Hadoop, for example HDFS, supplies distributed
processes that have their own peculiarities and a long list of configuration
parameters, all of which may have an impact on your cluster and use-case. Or maybe
not. You need to whittle everything down in painstaking trial-and-error
experiments, or consult whatever documentation you can find. In
addition, new releases of Hadoop—but also of your own data pipelines built on
top of it—require careful retesting and verification that everything still holds
true and works as expected. We will discuss practical solutions to this and
many other issues throughout this book, invoking what the authors have learned
(and are still learning) about implementing Hadoop clusters and Big Data
solutions at enterprises, both large and small.
One thing, though, is obvious: Hadoop is a global player, and the leading
software stack when it comes to Big Data storage and processing. No matter
where you are in the world, you may well struggle with the same basic questions
around Hadoop, its setup, and its subsequent operation. By the time you are
finished reading this book, you should be much more confident in conceiving a
Hadoop-based solution that may be applied to various and exciting new use-cases.
In this chapter, we kick things off with a discussion of cluster environments,
a topic often overlooked: it is assumed that the successful proof-of-concept
cluster delivering the promised answers is also the production
environment running the new solution at scale, automated, reliable, and
maintainable—which is often far from the truth.


Building Solutions
Developing for Hadoop is quite unlike common software development, as you
are mostly concerned with building not a single, monolithic application but
rather a concerted pipeline of distinctive pieces, which in the end
deliver the final result. Often this is insight into the data that was collected,
on which further products are built, such as recommendation or other real-time
decision-making engines. Hadoop itself lacks graphical data
representation tools, though there are some ways to visualize information
during discovery and data analysis, for example, using Apache Zeppelin or
similar tools with charting support built in.
In other words, the main task in building Hadoop-based solutions is to apply
Big Data Engineering principles, which comprise the selection (and,
optionally, creation) of suitable
hardware and software components,
data sources and preparation steps,
processing algorithms,
access and provisioning of resulting data, and
automation of processes for production.
As outlined in Figure 1-1, the Big Data engineer is building a data pipeline,
which might include more traditional software development, for example,
writing an Apache Spark job that uses the supplied MLlib to apply a linear
regression algorithm to the incoming data. But there is much more that needs to
be done to establish the whole chain of events that leads to the final result, or the
wanted insight.


Figure 1-1. Big Data Engineering

A data pipeline comprises, in very generic terms,
the task of ingesting the incoming data, and staging it for processing,
processing the data itself in an automated fashion, triggered by time or
data events, and
delivering the final results (as in, new or enriched datasets) to the
consuming systems.
These tasks are embedded into an environment, one that defines the boundaries
and constraints in which to develop the pipeline (see Figure 1-2). In practice
the structure of this environment is often driven by the choice of Hadoop
distribution, placing an emphasis on the included Apache projects that form the
platform. In recent times, distribution vendors have more often been going their own
way, selecting components that are similar to others but not
interchangeable (for example, choosing Apache Ranger vs. Apache Sentry for
authorization within the cluster). This results in vendor dependency,
whether or not all of the tools are open source.


Figure 1-2. Solutions are part of an environment

The result is that an environment is usually a cluster with a specific Hadoop
distribution (see [Link to Come]), running one or more data pipelines on top of
it, which represent the solution architecture. Each solution is embedded
into further rules and guidelines, for example the broader topic of governance,
which includes backup (see [Link to Come]), metadata and data management,
lineage, security, auditing, and other related tasks. During development, though,
or during rapid prototyping, say for a proof-of-concept project, it is common
that only parts of the pipeline are built. For example, it may suffice to stage the
source data in HDFS, but not devise a fully automated ingest setup. Or the final
provisioning of the results is covered by integration-testing assertions, but not
connected to the actual consuming systems.
No matter what the focus of the development is, in the end a fully planned data
pipeline is a must to be able to deploy the solution in the production
environment. It is common for all of the other environments before that to
reflect the same approach, making the deployment process more predictable.
Figure 1-3 summarizes the full Big Data Engineering flow, where a mixture of
engineers work on each major stage of the solution, including the automated
ingest and processing, as well as final delivery of the results. The solution is
then bundled into a package that also contains metadata, determining how
governance should be applied to the included data and processes.



Figure 1-3. Developing data pipelines

Ideally, the deployment and handling is backed by common development
techniques, such as continuous integration, automating the testing of changes
after they are committed by developers, and of new releases after they have
been sanctioned by the Big Data engineers. The remaining question is, do you
need more than one environment—or, in other words, cluster?


Single vs. Many Clusters
When adding Hadoop to an existing IT landscape, a very common question is,
how many clusters are needed? Especially in the established and common
software development process1 we see sandboxed environments that allow for
separate teams to do their work without interrupting each other. We are now
confronted with two competing issues:
Roll out of new and updated applications and data pipelines, and
roll out of new platform software releases.
The former is about making sure that new business logic performs as expected
while it is developed, tested, and eventually deployed. Then there is the latter,
which is needed when the platform itself changes, for example with a new
Hadoop release. Updating the application code is obviously the easier of the two, as
it relies on (or rather, often implies) all environments running the same
platform version. Rolling out a new platform version requires careful planning
and might interrupt or delay application development, since it may require
the code to be compiled against a newer version of the dependent libraries. So
what do we see in practice? Pretty much everything!
Indeed, we have seen users with a single cluster for everything, all the way to
having a separate cluster for three or four of the development process stages,
including development, testing, quality assurance (QA), user acceptance (UA),
staging/pre-production, and production. The driving factors are mostly cost
versus convenience and features: it requires many more servers to build all of
the mentioned environments, and that might be a prohibitive factor. The
following are typical combinations:
Single Cluster for Everything
Everything on one cluster, no room for errors, and possibly downtime
when platform upgrades are needed.


This is in practice usually not an option. What is obvious, though, is that
there is often an assumption that having a proof-of-concept (PoC) cluster that
worked well is the same as having a production cluster—not to mention
all the other possible environments. A single PoC cluster built from scrap
servers, or on an insufficient number of nodes (two or three servers do
not make a Hadoop cluster; they start at five or more machines), is not
going to suffice. Proper planning and implementation have to go into setting
up Hadoop clusters, where networking is usually the greatest cost factor
and often overlooked.
Two Clusters (Dev/Test/QA/Pre-Production, and Production)
Separates everything else from production and allows testing of new releases
and platform versions before rollout. Difficult to roll back, if
possible at all.
Having two properly planned clusters is the minimal setup to run successful
Big Data solutions, with reduced business impact compared to having
only a single cluster. But you are overloading a single environment and
will have significant congestion between, for example, the development and
testing teams.
Three Clusters (Dev, Test/QA/PreProd, and Prod)
Basic setup with most major roles separated; allows flexibility
between development, staging, and production.
With three clusters the interdependencies are greatly reduced, but not fully
resolved, as there are still situations where the shared resources have to
be scheduled exclusively for one team or another.
Four Clusters (Dev, Test, PreProd/QA, and Prod)
Provides the greatest flexibility, as every team is given its own
resources.
If the goal is to allow the engineering groups to do their work in a timely
manner, you will have to have four cluster environments set up and available at
all times. Everything else, while possible, includes minor to major
compromises and/or restrictions. Figure 1-4 shows all of the various
environments that might be needed for Big Data engineering.


Figure 1-4. Environments needed for Big Data Engineering

A common option to reduce cost is to specify the clusters according to their
task, for example, as shown here:
Development
Could be a local development machine, or a small instance with 4-5 virtual
machines (see “Cloud Services”). Only functional tests are performed.
Test/QA/Pre-Production
Same capacity as the production cluster, but with, for example, fewer nodes,
or fewer cores/memory.
Production
Full-size cluster as required by the use-cases.


One further consideration for determining how many clusters your project
needs is how and what data is provided to each environment. This is mainly a
question of getting production data to the other clusters so that they can perform
their duty. This most likely entails data governance and security decisions, as
PII (personally identifiable information) data might need to be secured and/or
redacted (for example, cross out digits in Social Security Numbers). In regards
to controlling costs, it is also quite often the case that non-production clusters
only receive a fraction of the complete data. This reduces storage and, with it,
processing needs, but also means that only the production environment is
exposed to the full workload, making earlier load tests in the smaller
environments more questionable or at least difficult.
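Redacting PII before data reaches the non-production clusters is typically a small transformation in the provisioning step. The sketch below is a minimal pure-Python example of the "cross out digits in Social Security Numbers" idea; the regular expression and the masking convention are our own illustrative assumptions, not a prescription from any Hadoop tool:

```python
import re

# Matches US SSNs written as 123-45-6789; the exact patterns to redact
# depend on your data governance rules.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def redact_ssn(text):
    # Cross out all but the last four digits, e.g. ***-**-6789.
    return SSN_RE.sub(r"***-**-\3", text)

print(redact_ssn("Customer 123-45-6789 called on 2017-01-15."))
# → Customer ***-**-6789 called on 2017-01-15.
```

In practice this function would run inside the ingest or replication job that feeds the development and test clusters, so that unredacted data never leaves production.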
Note

It is known from Facebook, which uses Hadoop to a great extent, that the live
traffic can be routed to a test cluster and even amplified to simulate any
anticipated seasonal or growth related increase. This implies that the test
environment is at least as powerful as the existing production environment. Of
course, this could also be used to perform a validation (see [Link to Come]) of
a new, and possibly improved, production platform.
The latest trend is to fold together some of those environments and make use of
the multitenancy feature of Hadoop. For example, you could use the “two
cluster” setup above, but shift the pre-production role onto the production
cluster. This helps to utilize the cluster better if there is enough spare capacity
in terms of all major resources, that is disk space, I/O, memory, and CPU. On
the other hand, you are now forced to handle pre-production very carefully so
as not to impact the production workloads.

Finally, a common question is how to extrapolate cluster performance based on
smaller non-production environments. While it is true that Hadoop mostly
scales linearly for its common workloads, there is also some initial cost to
getting true parallelization going. This manifests itself in very small “clusters”
(we have seen three-node clusters installed with the entire Hadoop stack)
often being much more fickle than expected. You may see issues that do not show at
all when you have, say, 10 nodes or more. As for extrapolation of
performance, testing a smaller cluster with a subset of the data will give you
some valuable insight. You should be able to determine from there what to
expect of the production cluster. But since Hadoop is a complex, distributed
system, with many moving parts, scarce resources such as CPU, memory,
network, disk space and general I/O, as well as possibly being shared across
many tenants, you have to once again be very careful evaluating your
predictions. Only if you had equally sized test/QA/pre-production and production
clusters, mimicking the same workloads closely, would you have more
certainty.
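As a back-of-the-envelope model of this extrapolation (our own simplification, not a formula from any Hadoop tool), you can treat a job's runtime as a fixed parallelization overhead plus a data-dependent part that scales with the number of worker nodes; the overhead and per-node rate below are illustrative numbers you would measure on the small cluster:

```python
def estimate_runtime(data_gb, nodes, overhead_s=60.0, gb_per_node_s=0.5):
    # overhead_s: fixed cost of job setup and scheduling (assumed constant);
    # gb_per_node_s: per-node processing rate in GB/s, measured on the
    # small test cluster. Both defaults are illustrative assumptions.
    return overhead_s + data_gb / (nodes * gb_per_node_s)

# Measured run: 1 TB on a 10-node test cluster.
small = estimate_runtime(1000, 10)
# Projection: 100 TB on a 100-node production cluster.
large = estimate_runtime(100_000, 100)
print(round(small), round(large))  # → 260 2060
```

The model deliberately ignores contention for network, disk, and memory across tenants, which is exactly why such projections need the caution described above.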
Overall, these possibilities have to be carefully evaluated, as “going back and
forth” is often not possible after the cluster reaches a certain size, or is tied
into a production pipeline that should not be disturbed. Plan early and with
plenty of due diligence. Plan also for the future: ask yourself how you will
grow the solution as the company starts to adopt Big Data use-cases.


Multitenancy
Having mentioned sharing a single cluster, by means of Hadoop's built-in
multitenancy features, in an attempt to reduce the number of environments
needed, we also have to discuss its caveats. The fact is that Hadoop is a fairly young
software stack, just turning 10 years old in 2016. It is also a fact that the
majority of users have Hadoop loaded with very few use-cases, and where they
have more, those use-cases are of a very similar (if not the same) nature. For
example, it is no problem today to run a Hadoop cluster red-hot with
MapReduce and Spark jobs, using YARN as the only cluster resource manager.
This is a very common setup and used in many large enterprises throughout the
world. In addition, one can enable control groups (cgroups) to further isolate
the CPU and I/O resources of YARN applications from one another. So what is the
problem?
With the growth and adoption of Hadoop in the enterprise, the list of features
that were asked for led to a state where Hadoop is stretching itself to cover
other workloads as well, for example MPP-style query engines, or search engines.
These compete for the resources controlled by YARN, and it may happen that
they collide in the process. Shoehorning long-running processes, commonly
known as services, into a mostly batch-oriented architecture is difficult, to say
the least. Efforts such as Llama2 or the more recent LLAP3 show
how non-managed resources are carved out of the larger resource pool to be
ready for low-latency, ad-hoc requests, which is something different from
scheduled job requirements.
Add to that the fact that HDFS has no accounting features built in, which makes
colocated service handling nearly impossible. For example, HBase uses the
same HDFS resources as Hive, MapReduce, or Spark. Building a charge-back
model on top of missing accounting is futile, leaving you with no choice but to
eventually separate low-latency use-cases from batch or other, higher-latency
interactive ones. The multitenancy features in Hadoop are mainly focused
on authorization of requests, but not on dynamic resource tracking. When
you run a MapReduce job as user foobar that reads from HBase, which in turn
reads from HDFS, it is impossible to limit the I/O for that specific user, as
HDFS only sees hbase causing the traffic.

Some distributions allow the static separation of resources into cgroups at the
process level. For example, you could allocate 40% of I/O and CPU to YARN,
and the rest to HBase. If you only read from HBase using YARN applications,
this separation is useless for the above reasons. If you then further mix this
with Hadoop applications that natively read HDFS but may or may not use
impersonation—a feature that makes the actual user for whom the job is
executed visible to the lower-level systems, such as HDFS—the outcome of
trying to mix workloads is rather unpredictable. While Hadoop is improving
over time, this particular deficiency has not seen much support by the larger
Hadoop contributor groups.
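For reference, static cgroup enforcement of this kind is typically switched on in the NodeManager configuration. A minimal yarn-site.xml sketch (property names as documented for Apache Hadoop 2.x; the 40% CPU cap is an illustrative value, and a matching cgroup setup for the colocated services is assumed to exist outside YARN):

```xml
<!-- yarn-site.xml: route containers through the Linux container
     executor so their CPU usage can be capped via cgroups. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<!-- Cap YARN at 40% of the node's physical CPU, leaving the remainder
     for colocated services such as HBase (illustrative split). -->
<property>
  <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
  <value>40</value>
</property>
```

Note that, as the text explains, such a static split only helps when the workloads really are separate processes; it cannot attribute HBase-generated HDFS traffic back to the YARN application that caused it.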
You are left with the need to possibly partition your cluster to physically
separate specific workloads (see Figure 1-5). This can be done if enough
resources in terms of server hardware are available. If not, you will have to
spend extra budget to provision such a setup. You now also have another
environment to take care of and make part of the larger maintenance process. In
other words, you may be forced to replicate the same split setup in the earlier
environments, such as pre-production, testing, or even development.


Figure 1-5. Workloads may force you to set up separate production clusters


Backup & Disaster Recovery
Once you get started with Hadoop, there comes the point where you ask
yourself: If I want to keep my data safe, what works when you are dealing with
multiple petabytes of data? This is as varied as the question of how many
environments you need, in regards to engineering Big Data solutions. And yet
again we have seen all kinds from “no backup at all” to “cluster to cluster”
replication. For starters, volume is an issue at some point, but so is one of the
other “V”s of Big Data: velocity. If you batch load large chunks of data, you

can handle backup differently from when you receive updates in microbatches, for example using Flume or Kafka landing events separately. Do you
have all the data already and then decide to back it up? Or are you about to get
started with loading data and can devise a backup strategy upfront?
The most common combinations we see are these:
Single Cluster
Yes, this exists. Daredevils you are!
Active to Backup Cluster
A less powerful cluster that stores the data in the same format as the
production cluster.
Active to Standby Cluster
Same-sized clusters; the standby can take over as active whenever needed.
This covers backup and disaster recovery (DR).
Active to Offline Storage
Very rare. Often only the vital data is stored on a filer or cloud-based
storage offering.

Keep in mind that the backup strategy might be orthogonal to the development
environments discussed above; that is, you may have dev, test/QA/preprod, and
prod clusters, and another one just for the backup. Or you could save money (at
the cost of features and flexibility) and reuse, for example, the pre-production
cluster as the standby cluster for backups.
How is data actually copied or backed up? When you have a backup cluster
with the same platform software, then you may be able to use the provided
tools such as distcp combined with Apache Oozie for automation, or use the
proprietary tools that some vendors ship in addition to the platform itself, for
example Cloudera’s BDR. It allows you to schedule regular backups between
clusters. A crucial part of the chosen strategy is to do incremental backups
once the core data is synchronized.
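With the stock tooling, such an incremental backup boils down to repeated distcp runs in update mode, scheduled by Oozie or cron. The sketch below wraps the standard DistCp flags in Python; the NameNode addresses and paths are made-up examples:

```python
import subprocess

def distcp_incremental(src, dst, execute=False):
    # -update copies only files that are missing or differ at the target,
    # making repeated runs effectively incremental; -delete also removes
    # target files no longer present at the source (mirror semantics).
    cmd = ["hadoop", "distcp", "-update", "-delete", src, dst]
    if execute:
        # Requires a Hadoop client configured for both clusters.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)

# Example invocation against hypothetical NameNodes:
print(distcp_incremental("hdfs://prod-nn:8020/data/events",
                         "hdfs://backup-nn:8020/data/events"))
```

An Oozie coordinator (or a cron entry) would then invoke this on a fixed schedule, after the initial full copy of the core data has completed.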
If you stream data into the cluster, you could also consider teeing off the data
and landing it in both clusters, perhaps in combination with Kafka to buffer
data over less stable connections between the two locations. This setup also
allows you to batch updates together and move them efficiently across at the
speed of the shared interconnection. But considering a true backup & disaster
recovery strategy, you will need at least one more environment to hold the same
amount of data, bringing the total now to more than five or six (including the
above best-case environment count and also accounting for low-latency
use-cases, as shown in Figure 1-6).


Figure 1-6. Environments needed including backup & disaster recovery



Cloud Services
Another option is that the non-production clusters are in a hosted environment,
that is, a cloud instance (be it internal or external, see [Link to Come]). That
allows quick set up of these environments as needed or recreating them for
new platform releases. Many Hadoop vendors have some tool on offer that can
help make this really easy. Of course this does require careful planning on
where the data resides, since moving large quantities of data in and out of an
external cloud might be costly. That is where a private cloud is a good choice.
Overall, using virtualized environments helps with two aspects concerning
Hadoop clusters:
Utilization of hardware resources, and
provisioning of clusters.
The former is about what we discussed so far, that is, reducing the number of
physical servers needed. With virtual machines you could run the development
and testing environments (and QA, etc.) on the same nodes. This may save a lot
of Capex type cost (capital expenditure) upfront and turn running these clusters
into an Opex type cost (operational expenditure). Of course, the drawbacks are
as expected, shared hardware may not be as powerful as dedicated hardware,
making certain kinds of tests (for example, extreme performance tests)
impractical. Figure 1-7 shows the environments that could be virtualized.

