
Edge Resiliency
Managing Volatility Through DNS

Gary Sloper and Mark Wilkins

Beijing • Boston • Farnham • Sebastopol • Tokyo



Edge Resiliency
by Gary Sloper and Mark Wilkins
Copyright © 2018 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles. For more
information, contact our corporate/institutional sales department: 800-998-9938.


Editor: Virginia Wilson
Production Editor: Melanie Yarbrough
Copyeditor: Octal Publishing Services, Inc.
Proofreader: Christina Edwards
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

September 2018: First Edition

Revision History for the First Edition
2018-08-30: First Release


This work is part of a collaboration between O’Reilly and Oracle Dyn. See our statement of editorial independence.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Edge Resiliency,
the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the
publisher’s views. While the publisher and the authors have used good faith efforts
to ensure that the information and instructions contained in this work are accurate,
the publisher and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of or
reliance on this work. Use of the information and instructions contained in this
work is at your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.

978-1-492-04036-1
[LSI]


Table of Contents

1. Edge Resiliency Is Critical to Your Business
   What You Will Learn
   Intended Book Audience

2. Exposing Buried Threats to Your Business Network
   Vulnerability When the Internet Is Your Network Backbone
   Virtualization and Outsourcing of Services
   Vulnerabilities Within Your Own Organization
   Looming Security Threats
   Unpredictable, Uncontrollable Problem Sources
   Conclusion

3. Strategies to Meet the Challenges
   Strategy 1: Consider the End-to-End User Experience
   Strategy 2: Embrace Processing at the Edge as Part of Your Total Design
   Strategy 3: Engage with Your Cloud Provider to Arrive at the Optimal Topology
   Strategy 4: Increase Redundancy and Reliability with Multicloud and Hybrid Cloud Strategies
   Strategy 5: Involve DevOps Staff in All Aspects of Edge Services Planning and Implementation
   Strategy 6: Inject Chaos to Find Weaknesses Before They Affect Customers in Production
   Strategy 7: Use Managed DNS Functionality to Limit Endpoint Exposure and Network Volatility
   Conclusion

4. Managed DNS Services
   Benefits of DNS
   DNS Routing
   When to Consider a Managed DNS
   Conclusion

5. Choosing a Managed DNS Provider
   Evaluation Period
   Business-Critical Availability
   A Focus on Security
   Support
   Easy-to-Use Tools
   Conclusion


CHAPTER 1

Edge Resiliency Is Critical to Your Business

In today’s 24/7 global business environments, resiliency is not only an assumption by your customers, it’s a requirement for your success. Simply defined, IT resilience is an organization’s ability to maintain acceptable service levels, no matter what challenges arise. From CTOs to networking IT staff, the threats and challenges to services that live at the edge of the network—including both the user edge and the site edge—pose the potential for unplanned and certainly unwanted business disruptions.

To understand the implications and solutions of resiliency at the edge, we first need to understand what the term “edge” really means here. In reality, there are multiple edges. The user edge is where the end user sits and first interacts with the internet. The network edge is in front of the content or service that the user is trying to reach (think transit, content delivery network [CDN], domain name system [DNS], and so forth). The site edge is typically at the datacenter or cloud infrastructure where the content or service resides. Your goal is to get control as close to the user edge as possible.

Sources that trigger instability for the myriad internet services that today’s enterprises depend on range from simple misconfigurations, to large-scale natural disasters, to nefarious targeted attacks, as well as business-driven internet routing decisions to meet traffic and sovereignty requirements. The user edge is your customers’ first (and possibly only) interaction with the application or service that they’re trying to access. It’s where impressions are made—or fail.
Traditionally, companies have focused on the user experience as
they interacted in expected, or unexpected, ways across the network.
However, just as important, each edge location can also be a portal
for instability and threats. These can come from unintentional side
effects such as attempts to meet high-traffic requirements, physical
infrastructure challenges (e.g., from a natural disaster), or deliberate
attacks from bad actors.
The simple reality is that if your company relies on cloud-hosted applications, as more and more do these days, internet volatility now has a greater impact on your business than at any time in the past. The large number of medium-to-large enterprises that have been moving into hybrid and multicloud implementations only magnifies the scope and likelihood of an impact.
For the past few years, medium- to large-sized enterprises have been transitioning away from doing everything in-house to using hosting providers to support a sophisticated global presence. This is a natural evolution as organizations scale, so this book will touch on the due diligence they need to perform; the problems they might encounter; and what they can do to optimize their performance and security posture, balance workloads, and steer traffic more efficiently in a hybrid cloud or multicloud environment.
The shift from hosting corporate applications on-premises to using cloud-based service providers is an accepted practice for doing business today. And, like any key corporate resource, this capability needs to be safeguarded and protected. Network resiliency (especially at the user edge) is your insurance policy against internet-based disruptions. Additionally, more organizations have begun to deploy multicloud environments using additional vendors or a private infrastructure to support their businesses. This dynamic will continue to grow as businesses take advantage of diverse, performance-based cloud services.
Granted, when you depend on internet services that are a “black box,” some aspects will be out of your direct control. In those areas, your business must rely somewhat on trust—trust in those who have constructed today’s complex internet, trust in the partners you work with, and trust that the infrastructure you’ve invested in will mostly work reliably and appropriately. However, trust is not a strategy: 24/7 global businesses face new exposures each day. To combat these challenges, businesses must take responsibility for resiliency. In this way, they can gain direct control to insure against the risks. And it all starts by understanding the approaches that you can take to accomplish this goal.

What You Will Learn
In the remaining chapters, we discuss these approaches and offer
insight and strategies for creating resiliency at the edge. The goal is
to stabilize internet volatility, whatever the source. The critical topics
we cover include the following:
• Recognizing volatility sources
• Optimizing performance and balancing workloads amid internet volatility
• Steering traffic more efficiently
• Strengthening your security posture—not just in a traditional datacenter, but also in a hybrid and/or multicloud environment
• Working with DNS infrastructure, managed DNS, and edge services

We discuss common challenges and present clear examples to demonstrate the benefits of using managed DNS infrastructure to strengthen edge resiliency. And we offer assessment criteria for deciding whether to incorporate a managed DNS provider into your resiliency strategy. This will, in turn, provide options and strengthen your ability to manage, challenge, and work around any internet threats, disruption, or volatility.

Intended Book Audience
We wrote this book for IT managers to help them proactively enable a resiliency strategy in the face of planned and unplanned events, from the user edge to the applications and services those users are trying to reach. Our goal is to help you prevent challenges that could have a negative impact on customer satisfaction and business outcomes. Business leaders must be aware of and plan for these challenges before they happen, because today, our customers, our employees, and our reputations are all “living on the edge.”


CHAPTER 2

Exposing Buried Threats to Your Business Network

In today’s always-on, fast-paced, and infinitely connected world, customers take for granted that networks will just work. Terms such as high availability and 99.999% uptime are tossed out as absolutes in sales conversations and customer engagements. Yet the basis for such assertions is uncertain at best and completely unrealistic without a plan for dealing with services at the edge.

In this chapter, we survey the classes of challenges to the networks and applications that your business and your users depend on. Identifying these challenges can help you see where you are exposed—and where you need to focus resources so that your customers aren’t exposed to the effects of their disruption.

We begin at the lowest level—the systems that underlie the data channels you depend on.

Vulnerability When the Internet Is Your
Network Backbone
The internet is built from a set of strategically connected “backbone” networks anchored in localized nodes. Those nodes, in turn, rest on other systems and many smaller networks connecting those devices. The key communication components that allow the internet to function are managed and owned by a combination of telecommunications (telco) companies, ISPs, and leased or purchased fiber implementations that provide connectivity from point to point—all with their own vulnerabilities. Though these core systems are hardened with monitoring and security measures, they are not entirely insulated from internet volatility due to the multitude of interwoven and interconnected parts.

Even in an environment in which you pay for dedicated cloud services, the public transit network is rarely within your control beyond the terms of service. The immediate network resources your business relies on might be totally owned and managed by the cloud provider. Or they might be dependent on or farmed out to a combination of multiple private companies that depend on other vendors. Each link in this critical chain must plan for and manage potential impacts, such as scheduled maintenance, aging equipment, turnover of support staff, and evolving technology.
Even if the network components are managed and sound, that in itself is not enough. Today’s businesses don’t just run directly on a physical box in a datacenter. More and more applications and environments are being virtualized, losing the distinction of how and where exactly they run. In this kind of ephemeral environment, it is more important than ever that we understand the virtual network landscape. This is the subject of our next section.

Virtualization and Outsourcing of Services
Above the core of your network are the systems and applications that run your business. You might still have some dedicated hardware within the territory that you own and operate, but these days it is more common for the systems to be virtualized and running in the hosted cloud. As long as you have a basic “map” to guide you as your applications are deployed, you might feel that you have fewer concerns. But, at the same time, you have less control because you can’t always get to the actual systems themselves given that the cloud provider manages them. This paradoxical “less is more” implementation forms another point of interplay with your edge services that must be considered.
Perceptions in these areas have had to evolve along with the technologies. A few short decades ago, it was common for companies to have large datacenters with on-premises hardware and staff to manage the targeted needs of the business. Problems were usually localized and could be addressed within the infrastructure that the company managed.
Now, the physical datacenters of those days have largely been replaced by cloud-based systems running on hardware (or even other virtual systems) in a remote datacenter that a third party manages. The “edge” is becoming the new “core.” The advantages of this change are evident:
• Businesses can redeploy former large datacenter staff to shift to cloud functions.
• There is less need for specialized physical environments to support hardware.
• There is less space required for a computing environment.
• Businesses can focus on their core strengths.
A corollary of this is outsourcing the key software applications a
business uses. This is commonly known as the “as-a-service” model.
The three variants of this model in primary use today are software
as a service (SaaS), platform as a service (PaaS), and infrastructure
as a service (IaaS).
Such services can be provided for our use over the internet by any of multiple public cloud providers. These examples illustrate that businesses often rely on cloud providers not only to host the applications they produce for customers, but—more and more—also to provide the environments that they use to develop those applications. In effect, businesses have not only exchanged their systems for cloud-based infrastructure, they have also created business-dependent partnerships with the service providers.
Although we all like to think that we are unique and are a primary
focus in these relationships, it is important to keep in mind that
selected partners are frequently managing requests, volatility, secu‐
rity concerns, and other issues from hundreds, or even thousands,
of other customers. Again, aside from the service-level agreement
(SLA) terms you have agreed to, you have very little overall control of the cloud provider.
For all cloud scenarios, selecting partners that have the right level of internet experience is important. The partner might offer tremendous value in terms of services offered, but if it lacks experience dealing with the virtual and physical layers, your business could be left at a significant disadvantage.
The key point to remember here is that no partner or service provider in the cloud will be as much of an expert on your business or customer needs as your own organization will be. And, in the same vein, your organization must be the most committed to ensuring that your customers can use your services even if the cloud services you depend on fail. This starts with approaches such as those that we outline in Chapter 3. The plans you make now for this sort of resiliency become the map that will safely guide your business across any “uncharted territory.”

In addition to constructing our “global view” of the cloud pieces that we rely on, we must also have an “eyes wide open” view of our own internal organization. With the incredible number of moving pieces within companies today, it’s important to identify the potential risks to stability where we can.

Vulnerabilities Within Your Own Organization

Human errors occur in all organizations. They are, for the most part, an accepted part of doing business, and most do not rise to the level of affecting large sets of customers. Enterprises have traditionally held the mindset that human errors affect mainly their internal systems and, as such, are recoverable from backups. However, today’s systems offer more options and functionality than ever before. Correspondingly, they can require more configuration, understanding, and care than ever before. Wherever resources within a business must oversee such complex environments, there is always a risk.
For example, a simple typo in a server configuration could result in
directing incoming traffic to the wrong page. Or incorrectly altering
the schedule for a backup process could affect the availability of the
resource during times of high demand.
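A lightweight guard against this class of error is validating configuration before it is applied. The sketch below is illustrative only (the dict-based config shape and upstream names are invented for this example, not any product’s format): it checks that every route points at a known upstream before the change goes live.

```python
# Hypothetical pre-deployment sanity check for a routing configuration.
# The config shape and upstream names are invented for illustration.

KNOWN_UPSTREAMS = {"app-primary.example.com", "app-standby.example.com"}

def validate_routing_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    problems = []
    for route, target in config.get("routes", {}).items():
        if not route.startswith("/"):
            problems.append("route %r does not start with '/'" % route)
        if target not in KNOWN_UPSTREAMS:
            problems.append("route %r points at unknown upstream %r" % (route, target))
    return problems
```

Run as a gate in a deployment pipeline, a check like this turns the “typo sends traffic to the wrong page” scenario into a failed build rather than a customer-facing incident.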
Enterprises must expect that, at some point, a human error will affect the way customers interact with them. Waiting until this happens and then reacting is a risky approach. The user might remain a customer if there is enough value beyond the inconvenience. Or they might look elsewhere. A better strategy is to arm your services living at the edge with the ability to detect and tolerate such disruptions, and to have intelligent responses in place to steer traffic. They can then automatically work with the cloud services to recover functionality with minimal downtime.
With those internal to our organization, we are usually more concerned about preventing accidental misuse than malevolent intent. However, we must also guard against those intentionally targeting today’s technology with negative intent. What you don’t see (especially in your network) can hurt you.

Small Error, Big Impact
On February 28, 2017, the Amazon Simple Storage Service (Amazon S3) became unavailable in the Northern Virginia (US-EAST-1) Region. The impact was substantial: “During AWS’ four-hour disruption, S&P 500 companies lost $150 million, according to analysis by Cyence, a startup that models the economic impact of cyber risk. US financial services companies lost an estimated $160 million ....”1

This was not a problem with the systems, which have an extremely high level of reliability. In fact, they functioned exactly as intended. Rather, this was caused by simple human error.

“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”2

Looming Security Threats
Today, more than ever, the internet offers a place for bad actors to
hide as they try to manipulate others, illegally obtain data or goods,
or break or deny access to key parts of your business.


1 Jason Del Rey, March 2, 2017. “Amazon’s massive AWS outage was caused by human error”, recode.net.
2 Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region.


The growing security threat from nefarious activities can wreak
havoc with networks and compromise your environment from the
edge inward. These attacks at the edge can come in many forms,
such as the following:
• Distributed Denial of Service (DDoS) attacks. These attacks amplify the traffic directed at a website to the point at which the website’s systems are unable to keep up, and thus fail.
• Malicious bots. These are self-propagating malware programs that infect systems to gather information, open backdoor access, or launch attacks against other systems. They typically also connect back and report to a central server.
• Attempts to hijack routes, IP addresses, or URLs to redirect users to other websites or content.
• Targeted inputs to circumvent checks or provoke error conditions to gain internal access to the system and its data.
• Ransomware attacks. Bad actors trick users into installing software that encrypts data, and then demand payment to decrypt it.
Increasingly, these security threats are automated and come from geographically diverse sources, disrupting the traffic highways of the internet with no human involvement. It is rare these days to go more than a few weeks without hearing about a well-known organization having its data stolen or becoming the victim of a DDoS or other malware attack. Entire businesses have been held hostage by ransomware attacks; some have not survived.

Guarding against these dangers demands more automated and proactive vigilance than at any other time in the history of the internet. Consideration and response planning for the different kinds of attacks are essential before they happen. Security and penetration testing (an authorized, simulated attack on a public website) are no longer optional security checks. Mitigation of an attack forms another part of our edge resilience. Web application firewalls, bot management solutions, network-based DDoS protection, and DNS all combine to help build a more resilient infrastructure. Businesses can no longer afford to ignore who or what is trying to get into their website; they should have a plan in place to address an attack should it occur. Just as a board of directors would want to stay in compliance with IRS regulations for the organization, it should equally comply with best practices for resiliency and security at the edge.
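One of the simplest building blocks in that mitigation stack is per-client rate limiting at the edge. The following toy token-bucket sketch illustrates the generic technique (it is not any vendor’s product): each client may burst up to `capacity` requests, and is then throttled to `rate` requests per second.

```python
import time

class TokenBucket:
    """Toy per-client token bucket: bursts up to `capacity`, refills at `rate` tokens/second."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full so legitimate bursts pass
        self.now = now              # injectable clock, so the logic is testable
        self.last = now()

    def allow(self):
        """Spend one token if available; return False when the client should be throttled."""
        current = self.now()
        self.tokens = min(self.capacity, self.tokens + (current - self.last) * self.rate)
        self.last = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A real edge tier would keep one bucket per source IP (or per session token) and answer throttled requests with an HTTP 429, keeping flood traffic from ever reaching the origin.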


Unpredictable, Uncontrollable Problem
Sources
Beyond the quantifiable challenges, there are unexpected factors that affect internet service, and thus your business, if your application design does not include resilient components designed for failover at a moment’s notice.

One such force that can strike businesses anywhere and at any time is the weather. Each year, we see many examples of how quickly weather can bring about natural disasters with unpredicted impact. Many organizations gain experience in coping with these events through trial and error as they occur. Take, for instance, when Hurricane Sandy hit the New York City area in 2012. That storm and the ensuing floods cost telco companies and their customers significant downtime, not to mention the repair costs and revenue losses. A study found that internet outages in the United States doubled during Hurricane Sandy—up from a daily outage average of 0.3% to 0.43%—and took four days to return to normal levels.
Even telco providers themselves can be a source of internet stability issues. Scheduled or unscheduled maintenance can disrupt internet service if you have not planned for these occurrences. All internet-based services need updates, and your application must be able to deal with these when they occur. What’s more, because many telco providers are operating in a fiercely competitive environment, they are continuously asked to do more with less as their profit margins continue to shrink. On a practical level, this can mean that legacy equipment is not being replaced and updated. Consider, for example, the prevalence of copper-based telco cables. In harsh climates with heavy rain and snow, these in-ground cables break down, which has a negative effect on transmitting data via the internet and can lead to outages or packet-loss issues that are especially insidious and difficult to troubleshoot.

These are just a few examples of the kinds of events that your business cannot control as you navigate today’s decentralized world. But you can control the resiliency of your application; there are many tools available to ensure that your applications remain stable and available regardless of the circumstances. In Chapter 3, we offer some strategies.

Conclusion
Your presence on the web is key to the power and reach of your brand and the success of your business. Disruptions that are not handled gracefully can cause your customers to consider the competition and ultimately jeopardize your bottom line. Regardless of the cause of the disruption, the site edge is where users will likely first feel any disruptive effects and first target their blame. Cultivating deliberate awareness of the challenges facing your network and applications puts you in a better position to take action and safeguard your users against them.



CHAPTER 3

Strategies to Meet the Challenges

The scenarios outlined in Chapter 2 illustrate that although the cloud paradigm has become an always-available, easily accessible endpoint for users, it can also represent a somewhat murky and mysterious platform with inherent, unseen risk for many businesses. It is no longer sufficient to simply deploy applications into a cloud and assume that the end-to-end user experience will be what we expect. Where the cloud is insufficient to cover the risks, we must move more responsibility (and thus more reliability) toward resiliency at the user edge.

Edge services were once the endpoints, gateways, interfaces, and routers located on our in-house networks. Today’s edge services are hosted in the public cloud and must now be more intelligent, strategic, and fault-tolerant than ever before. They must not just allow our users to access applications when required; they must also allow them to stay securely connected and able to complete their transactions.

In this chapter, we look at some strategies around edge services to meet these challenges and provide users with the stable, reliable interactions they expect.

Strategy 1: Consider the End-to-End User
Experience

In any interaction with your website, application, or network, the user’s experience establishes an impression of your business. Take some time to evaluate the scenarios that a user might encounter if your cloud provider encounters problems. How able are you to provide a reasonable experience to the user until the situation can be resolved? Is fallback to a different provider or other endpoint feasible?

Thinking ahead to what kinds of situations your users might encounter in the event of a problem with your cloud environment can provide valuable foresight. And it can be the starting point for updating your edge services to be able to compensate for issues closer to your application.

Strategy 2: Embrace Processing at the Edge as
Part of Your Total Design
As edge services and edge devices become more advanced and powerful, it is an oversight not to consider and take advantage of their functionality as part of an overall edge-to-cloud deployment strategy. For example, in many cases, Internet of Things (IoT) edge devices can run analytics at the edge to produce useful, more compact data rather than having to send it to the cloud for processing. Processing at the edge can reduce and/or complement processing that would normally be routed to the cloud. This can also provide a pathway for processing to continue even if the cloud functionality is disrupted.
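As a concrete illustration of “useful, more compact data,” an IoT gateway might collapse a window of raw sensor readings into a single summary record and ship only that upstream. A minimal sketch (the record fields here are invented for this example):

```python
def summarize_window(readings):
    """Collapse a window of raw sensor readings into one compact record for the cloud."""
    if not readings:
        return {"count": 0}
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": sum(readings) / len(readings),
    }
```

Shipping one summary instead of, say, a thousand raw samples both shrinks bandwidth needs and lets the gateway keep summarizing locally if its cloud connection is disrupted.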

Strategy 3: Engage with Your Cloud Provider to Arrive at the Optimal Topology
Moving to the cloud can mean moving into a world where every service is hosted and managed by the cloud provider. Cloud services are usually designed around a security model with a shared level of responsibility between the cloud provider and the customer. Risk exposure is best minimized by having open conversations at the beginning of a relationship with any internet and managed DNS provider with whom you partner.

Conversations should not shy away from the questions that need to be asked, such as these:
• How can you help us address the challenges we face in getting our applications closer to our users?
• In the service model chosen, where does your responsibility begin and end?
• What do you recommend as the optimum approach or technology for load balancing internet traffic between endpoints?
• How many distinct transit providers are available for bandwidth needs?
• What are your methodologies and procedures for outages and failures?

Such engagements do not require limiting your approach to only what that provider offers. But they will give you a wider context to draw upon in making decisions.


Strategy 4: Increase Redundancy and
Reliability with Multicloud and Hybrid Cloud
Strategies
While focusing on functionality at the user edge, it’s important not to overlook cloud topologies that can assist in ensuring edge-to-endpoint resiliency and reliability. Two of those topologies to consider are multicloud and hybrid cloud. Just as you would add two or more bandwidth/telecommunications (telco) providers or a redundant server, you should deploy a similar methodology in your cloud and hybrid cloud approaches. Why? If your single cloud instance or provider is unavailable, how would you steer traffic to an alternate vendor or site?

The most common approach today is a hybrid cloud strategy, combining and integrating resources on-premises with resources from cloud providers. This approach is typically used to gradually shift from an on-premises model to the cloud, or because some workloads are not cloud native or ready for such a deployment. But strategically having on-premises resources that serve and interact with users, regardless of whether cloud resources are available, can help ensure that your users stay reliably connected to their applications.

A newer approach is taking shape: in the simplest terms, a multicloud strategy centers on using two or more cloud service providers for a single business need. The reasons for this can be to provide options for pricing, protection against a single point of failure, better coverage across multiple availability zones, or higher value propositions in the areas of SaaS, PaaS, or IaaS. Many of these same benefits also serve the overall business needs of resiliency and reliability.
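The steering question above (“how would you steer traffic to an alternate vendor or site?”) ultimately reduces to a small decision: given provider endpoints in preference order, send traffic to the first one that passes a health check. The sketch below is a toy illustration (the endpoint names are invented, and the health probe is injected rather than tied to any vendor’s API):

```python
def pick_provider(providers, is_healthy):
    """Return the first healthy provider endpoint in preference order, or None if all are down.

    `is_healthy` is a callable (e.g., an HTTP health probe) injected by the
    caller, so the steering logic itself stays testable without network access.
    """
    for provider in providers:
        if is_healthy(provider):
            return provider
    return None
```

In practice this decision usually lives in a traffic-management or managed DNS layer, which runs the health checks and answers queries with records for whichever endpoint is currently healthy.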

Strategy 5: Involve DevOps Staff in All Aspects
of Edge Services Planning and
Implementation
DevOps is a set of ideas and recommended practices focused on how to make it easier for development and operational teams to work together on developing and releasing software.1 If you have a DevOps practice within your organization, it’s important to involve that team at all stages as you are designing and planning for resiliency. Developers need to understand what functionality should be implemented at the site and user edge. And operations specialists need to know how they will need to configure the edge services or devices to implement the desired resiliency. Ensuring that there is shared awareness and buy-in upfront will help make sure your strategy can be effective once deployed. Finally, DevOps teams in many organizations have been required to run areas outside of their traditional wheelhouse, namely managing the edge and core infrastructure. Traditionally, this has not been the case. With a newer drive to be nimbler, DevOps teams have found themselves on the front line of responsibility when it comes to ensuring that the cloud edge is secure, fast, and redundant enough to meet performance requirements.

Strategy 6: Inject Chaos to Find Weaknesses Before They Affect Customers in Production
Chaos engineering involves “experimenting on a (distributed) system in order to build confidence in the system’s capability to withstand turbulent conditions in production,” according to the Principles of Chaos Engineering.
Chaos engineering is a proactive discipline that aims to help you build trust and confidence in a system’s reliability under various conditions before your customers experience those conditions. A chaos engineering team works closely, in partnership and collaboratively, with development teams to help them explore and overcome any weaknesses revealed through chaos experimentation and testing.

1 The Phoenix Project and Effective DevOps are a couple of great resources for learning more about DevOps.

Chapter 3: Strategies to Meet the Challenges
There are two main techniques in chaos engineering: game days and automated chaos experiments. Both offer ways of exploring a system’s resilience across everything from infrastructure and platforms to the applications themselves, and even the people, practices, and processes that surround production. Game days are a cheap and often fun exercise that exposes weaknesses in a system by manually causing failures, usually in a staging environment, and then exploring how teams of people, and the tools they use, detect, diagnose, and respond to those situations.
Game days are extremely powerful and often require no specialized tooling, but they are expensive in terms of people’s time, especially if the system is rapidly evolving and confidence in its resilience needs to be maintained continuously. This is where automated chaos experiments come in. The free and open source Chaos Toolkit is one tool in this space; you can use it to create automated chaos experiments and then choreograph them, without manual intervention, across a range of chaos-inducing and system-probing tools.
Automated chaos often begins with chaos experiments, in which an experiment is defined and then executed against the target systems to see whether there is any deviation from normal functioning that indicates a weakness is present. In this “experimental” phase, an automated chaos experiment is deemed valuable and successful when it finds a weakness. A weakness is an opportunity to improve the system, and this is celebrated.
The next phase of an automated chaos experiment’s life is often to become more of a chaos test. In contrast to the first phase, the automated chaos test is now executed continually to prove that prior weaknesses, and even new ones, are not reintroduced over time. A chaos experiment empirically proves that a weakness is present and needs to be overcome, and that learning is celebrated; a chaos test, often executed continuously, is celebrated when it does not detect a weakness. Over time, a suite of automated chaos experiments and tests becomes the baseline for confidence and trust in your production system’s resilience, and it is extended and grown as new weaknesses are encountered or anticipated.
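The experiment life cycle described above can be sketched in a few lines. This is not the Chaos Toolkit API, just a hand-rolled illustration of the shape of an automated chaos experiment: verify the steady state, inject a failure, probe again, and report whether a weakness was found. The toy "service" flag stands in for real probes such as an HTTP health check.

```python
# NOT the Chaos Toolkit API -- a hand-rolled sketch of an automated
# chaos experiment: probe steady state, inject a failure, probe again.
def run_chaos_experiment(steady_state_probe, inject_failure, rollback):
    if not steady_state_probe():
        return "aborted: system not in steady state"
    inject_failure()
    try:
        weakness_found = not steady_state_probe()
    finally:
        rollback()  # always restore the system, even if probes raise
    # In the experimental phase, finding a weakness is a success:
    return "weakness found" if weakness_found else "steady state held"

# Toy system: a flag standing in for a service instance.
service = {"up": True}

result = run_chaos_experiment(
    steady_state_probe=lambda: service["up"],        # e.g., an HTTP 200 check
    inject_failure=lambda: service.update(up=False), # e.g., kill an instance
    rollback=lambda: service.update(up=True),
)
print(result)  # weakness found: nothing masked the instance failure
```

Promoting the same function into a chaos test is a matter of scheduling it to run continuously and alerting only when it stops returning "steady state held".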

Strategy 7: Use Managed DNS Functionality to Limit Endpoint Exposure and Network Volatility
DNS services have matured over the past 30 years from a simple lookup service into powerful cloud-hosted services, allowing you to rely on managed DNS infrastructure to manage internet volatility and routing issues at the edge before they affect your content or services endpoint.
Managed DNS refers to a service from a provider that lets users manage their DNS traffic via specified protocols for cases such as failover, dynamic IP addresses, geographically targeted DNS, and more. To illustrate this approach, consider an application that must remain available across several large cities around the globe; in this example, the network traffic can be heavy, with ever-increasing user requests.
“Data sovereignty comes into play when an organization’s data is stored outside of their country and is subject to the laws of the country in which the data resides. The main concern with data sovereignty is maintaining privacy regulations and keeping foreign countries from being able to subpoena data.”2
In some cases, you might need to create an environment in which certain content is served based on a specific region. When you need to apply region-based security requirements and data sovereignty regulations, you can use DNS to keep users and their data within a defined region rather than routing to all available sites across the world.

The concept of geolocation is easiest to see with a specific example. If your company operates an online retail business in Germany, and the German government states that all users’ personal data must stay within Germany, you can configure DNS to meet that requirement and ensure that all queries originating in that country stay in that region.
You can accomplish this by intelligently routing these requests to endpoints within Germany in such a way as to prevent routing through another country and back again to Germany. This can be important not only to comply with regulations, but also to target the most efficient pathways for a given area’s service providers.

2 R&G Technologies, “Data Sovereignty: What it is and why it matters”.
In this example, you are using DNS-based geolocation load balancing at the user edge to route internet traffic based on policies that you design, both to reduce the risk of internet volatility along the way and to ensure that every user across the globe receives the optimal experience.
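The routing policy in the Germany example reduces to a small lookup: queries from a regulated country resolve only to in-country endpoints, and everyone else falls through to a global pool. The sketch below is illustrative only; the country detection and the endpoint addresses (drawn from the reserved documentation IP ranges) are assumptions, not a real managed DNS provider's API.

```python
# Illustrative geolocation routing policy: pin regulated countries to
# in-region endpoints; all other queries use the global pool.
# Endpoint IPs use reserved documentation ranges (RFC 5737).
REGIONAL_ENDPOINTS = {
    "DE": ["203.0.113.10", "203.0.113.11"],  # hypothetical German sites
}
GLOBAL_ENDPOINTS = ["198.51.100.20", "198.51.100.21"]

def resolve(client_country: str) -> list[str]:
    """Return the DNS answer set for a query from the given country."""
    return REGIONAL_ENDPOINTS.get(client_country, GLOBAL_ENDPOINTS)

print(resolve("DE"))  # ['203.0.113.10', '203.0.113.11'] -- stays in Germany
print(resolve("US"))  # ['198.51.100.20', '198.51.100.21'] -- global pool
```

A managed DNS service performs this lookup at its edge, using the resolver's or client's location, so no query for a German user needs to leave the region before being answered.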

Conclusion
No strategy is a complete solution for the challenges facing your network. The range of threats and the complexity of technology continue to grow at a pace that is daunting for even the most proactive business to keep up with. And it can be impractical to retrofit strategies onto networks that were not designed for resiliency from the beginning.
But even without the opportunity to start over and design a more resilient network, the approaches outlined in this chapter can be valuable as a starting point. They can serve as the genesis for crucial conversations with colleagues or management about the exposures your business is facing or will soon face. Questions that you should be asking yourself include: What kind of experience will our users have if our cloud provider goes down? Would we benefit from a hybrid cloud (or multicloud) strategy versus what we have now? Would adding a managed DNS service buy us anything?
Eventually, another question will likely force its way to the front of your thoughts: What can we do right now to start protecting our networks? Realistically, of all the strategies outlined in this chapter, the most comprehensive answer to this question is the near-term adoption of a managed DNS service. You can adapt a managed DNS service to an existing network and immediately begin shielding the critical entry points from attacks, service degradation, and other points of failure.
Even though a managed DNS service might not be the right answer for every business, its potential value as a solution that you can implement in the near term is worth understanding. This is especially true for anyone who can imagine the potential near-term business disruptions from unseen threats at the edge. In Chapter 4, we further explore what the managed DNS service is and
