Moving Hadoop to the Cloud
Harnessing Cloud Features and Flexibility for Hadoop Clusters

Bill Havanki


Moving Hadoop to the Cloud
by Bill Havanki
Copyright © 2017 Bill Havanki Jr. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles. For more
information, contact our corporate/institutional sales department:
800-998-9938.

Editor: Marie Beaugureau
Production Editor: Colleen Cole
Copyeditor: Kim Cofer
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2017: First Edition


Revision History for the First Edition
2017-07-05: First Release
See the O’Reilly website for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Moving
Hadoop to the Cloud, the cover image, and related trade dress are trademarks
of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that
the information and instructions contained in this work are accurate, the
publisher and the author disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained
in this work is at your own risk. If any code samples or other technology this
work contains or describes is subject to open source licenses or the
intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-95963-3
[LSI]


Foreword
Apache Hadoop as software is a simple framework that allows for distributed
processing of data across many machines. As a technology, Hadoop and the
surrounding ecosystem have changed the way we think about data processing
at scale. No longer does our data need to fit in the memory of a single
machine, nor are we limited by the I/O of a single machine’s disks. These are
powerful tenets.
So too has cloud computing changed our way of thinking. While the notion
of colocating machines in a faraway data center isn’t new, allowing users to
provision machines on-demand is, and it’s changed everything. No longer are
developers or architects limited by the processing power installed in
on-premise data centers, nor do we need to host small web farms under our desks
or in that old storage closet. The pay-as-you-go model has been a boon for ad
hoc testing and proof-of-concept efforts, eliminating time spent in
purchasing, installation, and setup.
Both Hadoop and cloud computing represent major paradigm shifts, not just
in enterprise computing, but affecting many other industries. Much has been
written about how these technologies have been used to make advances in
retail, public sector, manufacturing, energy, and healthcare, just to name a
few. Entire businesses have sprung up as a result, dedicated to the care,
feeding, integration, and optimization of these new systems.
It was inevitable that Hadoop workloads would be run on cloud computing
providers’ infrastructure. The cloud offers incredible flexibility to users, often
complementing on-premise solutions, enabling them to use Hadoop in ways
simply not possible previously.
Ever the conscientious software engineer, author Bill Havanki has a strong
penchant for documenting. He’s able to break down complex concepts and
explain them in simple terms, without making you feel foolish. Bill writes the
kind of documentation that you actually enjoy, the kind you find yourself
reading long after you’ve discovered the solution to your original problem.
Hadoop and cloud computing are powerful and valuable tools, but aren’t
simple technologies by any means. This stuff is hard. Both have a multitude
of configuration options and it’s very easy to become overwhelmed. All
major cloud providers offer similar services like virtual machines, network
attached storage, relational databases, and object storage (all of which can
be utilized by Hadoop), but each provider uses different naming
conventions and has different capabilities and limitations. For example, some
providers require that resource provisioning occurs in a specific order. Some
providers create isolated virtual networks for your machines automatically
while others require manual creation and assignment. It can be confusing.
Whether you’re working with Hadoop for the first time or a veteran installing
on a cloud provider you’ve never used before, knowing about the specifics of
each environment will save you a lot of time and pain.
Cloud computing appeals to a dizzying array of users running a wide variety
of workloads. Most cloud providers’ official documentation isn’t specific to
any particular application (such as Hadoop). Using Hadoop on cloud
infrastructure introduces additional architectural issues that need to be
considered and addressed. It helps to have a guide to demystify the options
specific to Hadoop deployments and to ease you through the setup process on
a variety of cloud providers, step by step, providing tips and best practices
along the way. This book does precisely that, in a way that I wish had been
available when I started working in the cloud computing world.
Whether code or expository prose, Bill’s creations are approachable, sensible,
and easy to consume. With this book and its author, you’re in capable hands
for your first foray into moving Hadoop to the Cloud.
Alex Moundalexis,
May 2017


Preface
It’s late 2015, and I’m staring at a page of mine on my employer’s wiki,
trying to think of an OKR. An OKR is something like a performance
objective, a goal to accomplish paired with a way to measure if it’s been
accomplished. While my management chain defines OKRs for the company
as a whole and major organizations in it, individuals define their own. We
grade ourselves on them, but they do not determine how well we performed
because they are meant to be aspirational, not necessary. If you meet all your
OKRs, they weren’t ambitious enough.
My coworkers had already been impressed with writing that I’d done as part
of my job, both in product documentation and in internal presentations, so
focusing on a writing task made sense. How aspirational could I get? So I set
this down.
“Begin writing a technical book! On something! That is, begin working on
one myself, or assist someone else in writing one.”
Outright ridiculous, I thought, but why not? How’s that for aspirational.
Well, I have an excellent manager who is willing to entertain the ridiculous,
and so she encouraged me to float the idea to someone else in our company
who dealt with things like employees writing books, and he responded.
“Here’s an idea: there is no book out there about Running Hadoop in the
Cloud. Would you have enough material at this point?”
I work on a product that aims to make the use of Hadoop clusters in the cloud
easier, so it was admittedly an extremely good fit. It didn’t take long at all for
this ember of an idea to catch, and the end result is the book you are reading
right now.


Who This Book Is For
Between the twin subjects of Hadoop and the cloud, there is more than
enough to write about. Since there are already plenty of good Hadoop books
out there, this book doesn’t try to duplicate them, and so you should already
be familiar with running Hadoop. The details of configuring Hadoop clusters
are only covered as needed to get clusters up and running. You can apply
your prior Hadoop knowledge with great effectiveness to clusters in the
cloud, and much of what other Hadoop books cover still applies.
It is not assumed, however, that you are familiar with the cloud. Perhaps
you’ve dabbled in it, spun up an instance or two, read some documentation
from a provider. Perhaps you haven’t even tried it at all, or don’t know where
to begin. Readers with next to no knowledge of the cloud will find what they
need to get rolling with their Hadoop clusters. Often, someone is tasked by
their organization with “moving stuff to the cloud,” and neither the tasker nor
the tasked truly understands what that means. If this describes you, this book
is for you.
DevOps engineers, system administrators, and system architects will get the
most out of this book, since it focuses on constructing clusters in a cloud
provider and interfacing with the provider’s services. Software developers
should also benefit from it; even if they do not build clusters themselves, they
should understand how clusters work in the cloud so they know what to ask
for and how to design their jobs.


What You Should Already Know
Besides having a good grasp of Hadoop concepts, you should have a working
knowledge of the Java programming language and the Bash shell, or similar
languages. At least being able to read them should suffice, although the Bash
scripts do not shy away from advanced shell features. Code examples are
constrained to only those languages.
Before working on your clusters, you will need credentials for a cloud
provider. The first two parts of the book do not require a cloud account to
follow along, but the later hands-on parts do. Your organization may already
have an account with a provider, and if so, you can seek your own account
within that to work with. If you are on your own, you can sign up for a free
trial with any of the cloud providers this book covers in detail.


What This Book Leaves Out

As stated previously, this book does not delve into Hadoop details more than
necessary. A seasoned Hadoop administrator may notice that configurations
are not necessarily optimal, and that clusters are not tuned for maximum
efficiency. This information is left out for brevity, so as not to duplicate
content in books that focus only on Hadoop. Many of the principles for
Hadoop maintenance apply to cloud clusters just as well as ordinary ones.
The core Hadoop components of HDFS and YARN are covered here, along
with other important components such as ZooKeeper, Hive, and Spark. This
doesn’t imply at all that other components won’t work in the cloud; there are
simply so many components that, due to space considerations, not all could
be included.
A limited set of popular cloud providers is covered in this book: Amazon
Web Services, Google Cloud Platform, and Microsoft Azure. There are other
cloud providers, both publicly available and deployed privately, but they are
not included. The ones that were chosen are the most popular, and you should
find that their concepts transfer over rather directly to those in other
providers. Even so, each provider does things a little, or a lot, differently
from its peers. When getting you up and running, all of them are covered
equally, but beyond that, only Amazon Web Services is fully considered,
since it is the dominant choice at this time. Brief summaries of equivalent
procedures in the other providers are given to get you started with them.
Overall, between Hadoop and the cloud, there is just so much to write about.
What’s more, cloud providers introduce new services and revamp older
services all the time, and it can be challenging to keep up even when you
work in the cloud every day. This book attempts to stick with the most vital,
core Hadoop components and cloud services to be as relevant as possible in
this fast-changing world. Understanding them will serve you well when
integrating new features into your clusters in the future.



How This Book Works
Part I starts off this book by asking why you would host Hadoop clusters in a
cloud provider, and briefly introduces the providers this book looks at. Part II
describes the common concepts of cloud providers, like instances and virtual
networks. If you are already familiar with a cloud provider or two, you might
skim or skip these parts.
Part III begins the hands-on portion of this book, where you build out a
Hadoop cluster in one of the cloud providers. There is a chapter for the
unique steps needed by each provider, and a common chapter for bringing up
a cluster and seeing it in action. Later parts of the book use this first cluster as
a launching point for more.
If you are interested in making an even more capable cluster, Part IV can help
you. It covers adding high availability and installing Hive and Spark. You
can try any combination of the enhancements, and learn even more about the
ramifications of running in a cloud provider.
Finally, Part V looks at patterns and practices for running cloud clusters well,
from designing for price and security to dealing with maintenance. Those
first starting out in the cloud may not need the guidance in this part, but as
usage ramps up, it becomes much more important.


Which Software Versions This Book Uses
Here are the versions of Hadoop components used in this book. All are
distributed through Apache:
Apache Hadoop 2.7.2
Apache ZooKeeper 3.4.8
Apache Hive 2.1.0
Apache Spark 1.6.3 and 2.0.2
Code examples require:
Java 8

Bash 4
Cloud providers update their services continually, and so determining the
exact “versions” used for them is not possible. Most of the work in the book
was performed during 2016 with the services as they existed at that time.
Since then, service web interfaces may have changed and workflows may
have been altered.


Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file
extensions.
Constant width

Used for program listings, as well as within paragraphs to refer to
program elements such as variable or function names, databases, data
types, environment variables, statements, and keywords.
Constant width bold

Shows commands or other text that should be typed literally by the user.
Constant width italic

Shows text that should be replaced with user-supplied values or by
values determined by context.

TIP
This element signifies a tip or suggestion.

NOTE

This element signifies a general note.

WARNING
This element indicates a warning or caution.


IP Addresses
Many of the examples throughout this book include IP addresses, usually for
cluster nodes. The example IP addresses are drawn from reserved address
ranges as specified in RFC 5737. They should never resolve to an actual IP
address anywhere on the internet or within private networks. Change them as
needed when using the examples in your work.
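As a rough illustration of the reserved ranges mentioned above, the following Java sketch checks whether an IPv4 address falls inside one of the three documentation blocks that RFC 5737 sets aside (192.0.2.0/24, 198.51.100.0/24, and 203.0.113.0/24). The class and method names are hypothetical and are not taken from this book's example code. The sketch handles only dotted-quad IPv4 strings and is meant as a quick way to spot placeholder addresses left in a configuration, not as a general address validator.

```java
// Hypothetical helper: not part of the book's supplemental code.
final class DocumentationAddresses {

    // First three octets of the RFC 5737 documentation blocks
    // (TEST-NET-1, TEST-NET-2, TEST-NET-3). Each block is a /24.
    private static final int[][] TEST_NET_PREFIXES = {
        {192, 0, 2},
        {198, 51, 100},
        {203, 0, 113},
    };

    private DocumentationAddresses() {
    }

    // Returns true if the given dotted-quad IPv4 string lies in one of the
    // RFC 5737 documentation blocks; returns false for anything else,
    // including strings that are not well-formed IPv4 addresses.
    public static boolean isDocumentationAddress(String dottedQuad) {
        // A limit of -1 keeps trailing empty strings, so "192.0.2." is rejected.
        String[] parts = dottedQuad.split("\\.", -1);
        if (parts.length != 4) {
            return false;
        }
        int[] octets = new int[4];
        for (int i = 0; i < 4; i++) {
            try {
                octets[i] = Integer.parseInt(parts[i]);
            } catch (NumberFormatException e) {
                return false;
            }
            if (octets[i] < 0 || octets[i] > 255) {
                return false;
            }
        }
        // A /24 match only needs the first three octets to agree.
        for (int[] prefix : TEST_NET_PREFIXES) {
            if (octets[0] == prefix[0]
                    && octets[1] == prefix[1]
                    && octets[2] == prefix[2]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Report, for each address given on the command line, whether it is
        // a placeholder that should be changed before real use.
        for (String address : args) {
            System.out.println(address + " -> "
                    + (isDocumentationAddress(address)
                            ? "documentation address, replace before use"
                            : "not in an RFC 5737 block"));
        }
    }
}
```

Running the class with a few addresses from a cluster configuration as arguments gives a quick indication of which ones are still example values.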


Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for
download.
This book is here to help you get your job done. In general, if example code
is offered with this book, you may use it in your programs and
documentation. You do not need to contact us for permission unless you’re
reproducing a significant portion of the code. For example, writing a program
that uses several chunks of code from this book does not require permission.
Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting
example code does not require permission. Incorporating a significant
amount of example code from this book into your product’s documentation
does require permission.
We appreciate, but do not require, attribution. An attribution usually includes
the title, author, publisher, and ISBN. For example: “Moving Hadoop to the
Cloud by Bill Havanki (O’Reilly). Copyright 2017 Bill Havanki Jr.,
978-1-491-95963-3.”
If you feel your use of code examples falls outside fair use or the permission
given above, feel free to contact us.


O’Reilly Safari
NOTE
Safari (formerly Safari Books Online) is a membership-based training and
reference platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths,
interactive tutorials, and curated playlists from over 250 publishers, including
O’Reilly Media, Harvard Business Review, Prentice Hall Professional,
Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press,
Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan
Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning,
New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among
others.
For more information, please visit the Safari website.

How to Contact Us
Please address comments and questions concerning this book to the
publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any
additional information.
To comment or ask technical questions about this book, send email to the
publisher.
For more information about our books, courses, conferences, and news, see
our website.
Find us on Facebook, follow us on Twitter, and watch us on YouTube.

Acknowledgments
I’m well aware that barely anyone reads the acknowledgments in a book,
especially a technical one like this. So, for those few of you who are reading
this right now, well, first, I’d like to thank you for your diligence, not to
mention your attention and support in the first place. Truly, thanks for
spending time and/or money on what I’ve written here, and I hope it helps
you.
Thank you to everyone who’s helped to build up the amazing Apache
Hadoop ecosystem, from its founders to its committers to its contributors to
its users, for showing us a new way of computing. Thank you also to
everyone who’s built and maintained the amazing cloud provider services, for
showing us another new way of computing and empowering the rest of us to
use it.
This book would be worse off without its reviewers: Jesse Anderson, Jenny
Kim, Don Miner, Alex Moundalexis, and those who went unnamed or whom
I’ve forgotten. They each applied their expertise, experience, and attention to
detail to their feedback, filling in where I left important information out and
correcting what I got wrong. I also owe thanks to Misha Brukman and the
Google Cloud Platform team for looking over Chapter 7. My editors, Marie
Beaugureau and Colleen Toporek, did a wonderful job of shepherding the
writing process and giving feedback on organization, formatting, writing
flow, and lots of other details. Finally, extra thanks is due to Alex
Moundalexis for writing the foreword.
One of my favorite aphorisms is by Laozi: “A good traveler has no fixed
plans and is not intent on arriving.” I’ve arrived at the destination of
authoring a book, but no one observing my travel, including me, could have
guessed that I’d have gotten here. The road has wound through a career with
a few different employers and with a few more projects, and I was privileged
to walk alongside a truly wonderful collection of coworkers and friends along
the way. I owe them all my gratitude for their company, and their roles in my
journey.


I owe special thanks, of course, to my current employer, Cloudera, for the
opportunity to create this book and the endorsement of the effort. I
specifically want to thank Vinithra Varadharajan, my manager for the past
few years, for her unwavering faith in and promotion of my writing effort;
and also Justin Kestelyn, who got the ball rolling between me, my employer,
and O’Reilly. My teammates past and present on my current project have all
played a part in helping me learn about the cloud and have contributed their
thoughts and opinions, for which I’m grateful: John Adair, Asif Arman,
Cagdas Bayram, Jayita Bhojwani, Michael Cudahy, Xiaohua Guo, David
Han, Joe Heyming, Ying Li, Andrei Savu, Fahd Siddiqui, and Michael
Wilson.
Finally, I must thank my family, including my parents and in-laws for their
encouragement, my daughters Samantha and Lydia, and especially my wife
Kathy.[1] They have been constantly supportive of me during the long effort
it’s taken to write this book, and excited for it to be one of my
accomplishments. I love them all very much.
[1] Te amo et semper amabo.


Part I. Introduction to the Cloud
The purpose of the first part of this book is to orient you. First, the exact
meaning of “the cloud” when it comes to working with Hadoop clusters is
explored, so that it is clear what the benefits and drawbacks are. Then,
overviews of three major public cloud providers are provided, including a
little of their history as well as their approaches to doing business.


Chapter 1. Why Hadoop in the Cloud?
Before embarking on a new technical effort, it’s important to understand what
problems you’re trying to solve with it. Hot new technologies come and go in
the span of a few years, and it should take more than popularity to make one
worth trying. The short span of computing history is littered with ideas and
technologies that were once considered the future of their domains, but just
didn’t work out.
Apache Hadoop is a technology that has survived its initial rush of popularity
by proving itself as an effective and powerful framework for tackling big data
applications. It broke from many of its predecessors in the “computing at
scale” space by being designed to run in a distributed fashion across large
amounts of commodity hardware instead of a few, expensive computers.
Many organizations have come to rely on Hadoop for dealing with the
ever-increasing quantities of data that they gather. Today, it is clear what problems
Hadoop can solve.
Cloud computing, on the other hand, is still a newcomer as of this writing.
The term itself, “cloud,” currently has a somewhat mystical connotation,
often meaning different things to different people. What is the cloud made
of? Where is it? What does it do? Most importantly, why would you use it?


What Is the Cloud?
A definition for what “the cloud” means for this book can be built up from a
few underlying concepts and ideas.
First, a cloud is made up of computing resources, which encompass
everything from computers themselves (or instances in cloud terminology) to
networks to storage and everything in between and around them. All that you
would normally need to put together the equivalent of a server room, or even
a full-blown data center, is in place and ready to be claimed, configured, and
run.
The entity providing these computing resources is called a cloud provider.
The most famous ones are companies like Amazon, Microsoft, and Google,
and this book focuses on the clouds offered by these three. Their clouds can
be called public clouds because they are available to the general public; you
use computing resources that are shared, in secure ways, with many other
people. In contrast, private clouds are run internally by (usually large)
organizations.

NOTE
While private clouds can work much like public ones, they are not explicitly covered in
this book. You will find, though, that the basic concepts are mostly the same across cloud
providers, whether public or private.

The resources that are available to you in the cloud are not just for you to use,
but also to control. This means that you can start and stop instances when you
want, and connect the instances together and to the outside world how you
want. You can use just a small amount of resources or a huge amount, or
anywhere in between. Advanced features from the provider are at your
command for managing storage, performance, availability, and more. The
cloud provider gives you the building blocks, but it is up to you to know how
to arrange them for your needs.


Finally, you are free to use cloud provider resources for whatever you wish,
within some limitations. There are quotas applied to cloud provider accounts,
although these can be negotiated over time. There are also large, hard limits
based on the capacity of the provider itself that you can run into. Beyond
these somewhat “physical” limitations, there are legal and data security
requirements, which can come from your own organization as well as the
cloud provider. In general, as long as you are not abusing the cloud
provider’s offerings, you can do what you want. In this book, that means
installing and running Hadoop clusters.
Having covered some underlying concepts, here is a definition for “the
cloud” that this book builds from:
“The cloud” is a large set of computing resources made available by a cloud
provider for customers to use and control for general purposes.


What Does Hadoop in the Cloud Mean?
Now that the term “cloud” has been defined, it’s easy to understand what the
jargony phrase “Hadoop in the cloud” means: it is running Hadoop clusters
on resources offered by a cloud provider. This practice is normally compared
with running Hadoop clusters on your own hardware, called on-premises
clusters or “on-prem.”
If you are already familiar with running Hadoop clusters on-prem, you will
find that a lot of your knowledge and practices carry over to the cloud. After
all, a cloud instance is supposed to act almost exactly like an ordinary server
you connect to remotely, with root access, and some number of CPU cores,
and some amount of disk space, and so on. Once instances are networked
together properly and made accessible, you can imagine that they are running
in a regular data center, as opposed to a cloud provider’s own data center.
This illusion is intentional, so that working in a cloud provider feels familiar,
and your skills still apply.
That doesn’t mean there’s nothing new to learn, or that the abstraction is
complete. A cloud provider does not do everything for you; there are many
choices and a variety of provider features to understand and consider, so that
you can build not only a functioning system, but a functioning system of
Hadoop clusters. Cloud providers also include features that go beyond what
you can do on-prem, and Hadoop clusters can benefit from those as well.
Mature Hadoop clusters rarely run in isolation. Supporting resources around
them manage data flow in and out and host specialized tools, applications
backed by the clusters, and non-Hadoop servers, among other things. The
supporting cast can also run in the cloud, or else dedicated networking
features can help to bring them close.

