


Chaos Engineering Observability

Bringing Chaos Experiments into System Observability

Russ Miles

Beijing · Boston · Farnham · Sebastopol · Tokyo


Chaos Engineering Observability
by Russ Miles
Copyright © 2019 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: 800-998-9938.

Development Editors: Virginia Wilson and Nikki McDonald
Production Editor: Katherine Tozer
Copyeditor: Amanda Kersey
Proofreader: Zachary Corleissen
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

February 2019: First Edition

Revision History for the First Edition
2019-02-19: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Chaos Engineering
Observability, the cover image, and related trade dress are trademarks of O’Reilly
Media, Inc.
The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Humio. Please see our
statement of editorial independence.


978-1-492-05101-5
[LSI]


Table of Contents

Preface ......................................................... vii

1. Observability and Chaos ........................................ 1
     The Value of Observability                                    1
     The Value of Chaos Engineering                                2
     Chaos Engineering Encourages and Contributes to Observability 4
     Summary                                                       5

2. Chaos Experiment Signals ....................................... 7
     Coarse-Grained Signals Through Notifications                  7
     Fine-Grained Signals Through Chaos Controls                  10
     Summary                                                      12

3. Logging Chaos Experiments ..................................... 13
     From Signals to Centralized Logging                          13
     Centralized Chaos Logging in Action                          15
     Summary                                                      16

4. Tracing Chaos Experiments ..................................... 17
     Open Tracing                                                 17
     The Open Tracing Control                                     18
     Summary                                                      19

5. Conclusion .................................................... 21



For Mali, Mum, Dad, Sylvain, Aurore!
For Geeta and everyone at Humio!
Finally, for the free and open source Chaos Toolkit and Open Chaos
community! You’re all awesome!




Preface

If you’re considering running chaos experiments to find system weaknesses, especially in production, then observability will be on your mind. This book is for everyone adopting automated chaos engineering in their teams who wants to ensure that those experiments can be executed as safely as possible by bringing them into the overall system observability picture.
This book introduces the concept of chaos observability: how to run chaos experiments and bring that work into your overall system observability picture. You will see how chaos engineering experiments leverage a system’s observability and contribute to it. This all begins by introducing the key concept of Chaos Experiment Observability Signals, covered in Chapter 2.
Throughout this book, high-level samples are shown using the free and open source Chaos Toolkit. Although only the Chaos Toolkit’s observability capabilities are shown, the hope is that this book will prompt support for observability across other chaos engineering implementations, possibly resulting in a set of open, standard concepts and guidelines for chaos observability.
This book provides code samples of integrations with OpenTracing, visualized using Jaeger, for tracing chaos experiments alongside distributed system traces, and centralized logging using Humio. These samples show concrete implementations of how any system could be integrated with the available chaos experiment observability signals.
For more on chaos engineering, see Chaos Engineering by Ali Basiri, Nora Jones, Aaron Blohowiak, Lorin Hochstein, and Casey Rosenthal (O’Reilly). For an introduction to observability, see Distributed Systems Observability by Cindy Sridharan (O’Reilly).



Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.
This element signifies a general note.

With those caveats and limitations, and ideally a copy of the aforementioned books on hand, I hope you enjoy this book and I wish you:
“Happy (Observable) Chaos Engineering!”




CHAPTER 1

Observability and Chaos

“You see, but you do not observe.”
—Sherlock Holmes, from “A Scandal in Bohemia” by Sir Arthur
Conan Doyle

Observability and chaos engineering are two relatively new disciplines that, for good reason, the mainstream has begun to recognize. The principles of observability turn your systems into inspectable and debuggable crime scenes, and chaos engineering encourages and leverages observability as it seeks to help you pre-emptively discover and overcome system weaknesses.

In this chapter you’re going to learn how chaos engineering not only relies on observability but also, as a good citizen in your systems, needs to participate in your overall system observability picture.

The Value of Observability
Observability is a key characteristic of a successful system, particularly a production system. As systems evolve increasingly rapidly, they become more complex and more susceptible to failure. Observability is the key that helps you take on responsibility for systems where you need to be able to interrogate, inspect, and piece together what happened, when, and—most importantly—why. Observability brings the power of data to explore and fix issues and to improve products. “It’s not about logs, metrics, or traces, but about being data driven during debugging and using the feedback to iterate on and improve the product,” Cindy Sridharan writes.


Observability helps you effectively debug a running system without having to modify the system in any dramatic way. You can think of observability as a superset of system management and monitoring, where management and monitoring have traditionally been great at answering closed questions such as, “Is that server responding?” Observability extends this power to encompass answering open questions such as, “Can I trace the latency of a user interaction in real time?” or, “How successful was a user interaction that was submitted yesterday?”

Great observability signals help you become a “system detective”: someone who is able to shine a light on emergent system behavior and shape the mental models of the operators and engineers evolving the system. You are able to grasp, inspect, and diagnose the conditions of a rapidly changing, complex, and sometimes failing system. It helps you become the Sherlock Holmes of your own systems, able to ask and answer new questions as your system runs.

The Value of Chaos Engineering
Chaos engineering seeks to surface, explore, and test against system weaknesses through careful and controlled chaos experiments. In modern, rapidly evolving systems, parts fail all the time. Chaos engineering is key to discovering how those complex failures may affect the system and then validating over time that those weaknesses have been overcome.

You could learn from system outages when they occur and improve systems that way. It’s called incident-response learning, and the cycle is shown in Figure 1-1.



Figure 1-1. Incident-response learning is prompted by a system outage
Incident-response learning deserves a book of its own. Enabling it
effectively by adopting approaches such as “Blameless Postmortems” is a way to learn from and overcome system weaknesses.
The challenge is that post-mortem learning alone is like learning to
drive by jumping into a car for the first time and figuring out how to
drive as the car speeds its way along the highway at 90 miles an
hour! In other words, dangerous and potentially life-threatening,
and you’d better learn quick. Incident-response learning on its own
is reactive, usually painful, and possibly very expensive.
Chaos engineering takes a different approach. Instead of waiting for a system weakness to cause a discernible outage, chaos engineering encourages you to deliberately cause a failure in a controlled chaos experiment so that you can explore it, as depicted in Figure 1-2.




Figure 1-2. Learning through chaos engineering starts with an automated experiment

Chaos Engineering Encourages and
Contributes to Observability
Chaos engineering and observability are closely connected. To confidently execute a chaos experiment, observability must detect when the system is normal and how it deviates from that steady-state as the experiment’s method is executed.

When you detect a deviation from the steady-state, an unsuspected and unobserved system behavior may have been found by the chaos experiment. At this point the team responsible for the system will be looking to the system’s observability signals to help them unpack the causes of and contributing factors to this deviation.
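To make detecting the steady-state concrete, here is a minimal sketch of how a Chaos Toolkit experiment might declare its steady-state hypothesis as a probe with a tolerance. The service name, health URL, and tolerance value are assumptions for illustration only, not taken from this book:

{
    "version": "1.0.0",
    "title": "The service stays available during instance failure",
    "description": "A minimal, hypothetical steady-state hypothesis.",
    "steady-state-hypothesis": {
        "title": "The service responds with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "service-is-healthy",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health"
                }
            }
        ]
    },
    "method": []
}

The hypothesis is checked before and after the experiment’s method runs, and a deviation from it is what triggers the kind of investigation described above.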
Chaos engineering often encourages, even demands, better system
observability, but this is only part of the picture. A chaos experiment
itself also needs to contribute to that picture, by sending signals like
those shown in Figure 1-3, so that you can see which experiment
was running when the system was exhibiting a set of observable
characteristics.

Figure 1-3. The flow of a chaos experiment’s execution



Summary

Chaos engineering experiments encourage and need to contribute to the observability of a system. In the next chapter you’ll see, with the help of the Chaos Toolkit, the types of observability signals a running chaos experiment can produce.




CHAPTER 2

Chaos Experiment Signals

“Data! Data! Data!” he cried impatiently. “I can’t make bricks
without clay.”
—Sherlock Holmes, from “The Adventure of the Copper Beeches”
by Sir Arthur Conan Doyle

Observability feeds on the signals a system emits: they provide the raw data about the system’s behavior. Observability is limited by the signals that a system puts out.
As a chaos experiment executes, it can emit a number of different signals that are useful to system observability. In this chapter you’re going to see how the Chaos Toolkit supports two possible mechanisms for producing observability signals from automated chaos experiment execution:

Chaos Notifications
    Coarse-grained, flow-level notifications during an experiment’s execution.

Chaos Controls
    Fine-grained listeners, and even influencers, of an experiment’s execution.

Coarse-Grained Signals Through Notifications
Chaos experiment notifications are emitted by the Chaos Toolkit
when an experiment is executed, as shown in Figure 2-1.



Figure 2-1. The notifications available from a chaos experiment’s execution
Chaos experiment notifications are coarse-grained because they are
only triggered at the highest level of an experiment’s execution. The
Slack extension for the Chaos Toolkit uses chaos notifications to
then surface those signals in a specific Slack channel.
Once the Slack extension is installed, add this block to the Chaos
Toolkit’s ~/.chaostoolkit/settings.yml file to turn on the notifications:
notifications:
  type: plugin
  module: chaosslack.notification
  token: xop-1235
  channel: general

The token specified here is a Slack API token. The channel is the Slack channel that you’d like to surface your chaos experiment notifications into.

With the Slack extension installed, you can surface notifications to
your own Slack channels, as depicted in Figure 2-2.
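Because these settings apply to every run made by that Chaos Toolkit installation, no change to the experiments themselves is needed. Running any experiment as usual is enough to trigger the Slack notifications; the experiment filename below is just a placeholder:

chaos run experiment.json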



Figure 2-2. Chaos notifications surfacing in Slack
The Slack extension is implemented in Python, but if you’d rather
not write some Python code to hook into chaos notifications, it’s
also possible to send notifications to an HTTP endpoint using an
entry in your ~/.chaostoolkit/settings.yml:
notifications:
  type: http
  url: <your notification endpoint URL>
  verify_tls: false
  headers:
    Authorization: "Bearer 1234"

Here you’re specifying that you’d like to make an HTTPS call to the
specified API, passing the indicated headers when the notifications
API is triggered.
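If you want to see what these HTTP notifications look like before wiring them into a real system, a few lines of Python can stand in for the receiving endpoint. This is only an illustrative sketch: Flask, the route, and the port are assumptions, and the exact payload fields depend on your Chaos Toolkit version.

# A minimal, hypothetical receiver for Chaos Toolkit HTTP notifications.
# Requires Flask (pip install flask); the route and port are assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/chaos/notifications", methods=["POST"])
def receive_notification():
    # Accept whatever JSON the toolkit posts and simply log it.
    payload = request.get_json(force=True, silent=True) or {}
    print("chaos notification received:", payload)
    return jsonify({"status": "ok"}), 200

if __name__ == "__main__":
    app.run(port=8000)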




Fine-Grained Signals Through Chaos Controls
As well as high-level chaos notifications, the Chaos Toolkit also supports a second, more fine-grained set of signals that can be hooked into to send valuable data to your observability systems. This approach is referred to as a Chaos Toolkit Control.
A Chaos Control can do much more than simply send
data to your observability systems. A control can also
interrupt or manipulate a running experiment.

A Chaos Toolkit Control is implemented in Python and provides the functions needed for observability. The functions available to implement are the following:
# The types used below are provided by the typing module and chaoslib.
from typing import Any, Dict, List

from chaoslib.types import (Activity, Configuration, Experiment,
                            Hypothesis, Journal, Run, Secrets)


def configure_control(config: Configuration, secrets: Secrets):
    # Triggered before an experiment's execution.
    # Useful for initialization code for the control.
    ...


def cleanup_control():
    # Triggered at the end of an experiment's run.
    # Useful for cleanup code for the control.
    ...


def before_experiment_control(context: Experiment,
                              configuration: Configuration = None,
                              secrets: Secrets = None, **kwargs):
    # Triggered before an experiment's execution.
    ...


def after_experiment_control(context: Experiment, state: Journal,
                             configuration: Configuration = None,
                             secrets: Secrets = None, **kwargs):
    # Triggered after an experiment's execution.
    ...


def before_hypothesis_control(context: Hypothesis,
                              configuration: Configuration = None,
                              secrets: Secrets = None, **kwargs):
    # Triggered before a hypothesis is analyzed.
    ...


def after_hypothesis_control(context: Hypothesis, state: Dict[str, Any],
                             configuration: Configuration = None,
                             secrets: Secrets = None, **kwargs):
    # Triggered after a hypothesis is analyzed.
    ...


def before_method_control(context: Experiment,
                          configuration: Configuration = None,
                          secrets: Secrets = None, **kwargs):
    # Triggered before an experiment's method is executed.
    ...


def after_method_control(context: Experiment, state: List[Run],
                         configuration: Configuration = None,
                         secrets: Secrets = None, **kwargs):
    # Triggered after an experiment's method is executed.
    ...


def before_rollback_control(context: Experiment,
                            configuration: Configuration = None,
                            secrets: Secrets = None, **kwargs):
    # Triggered before an experiment's rollbacks block is executed.
    ...


def after_rollback_control(context: Experiment, state: List[Run],
                           configuration: Configuration = None,
                           secrets: Secrets = None, **kwargs):
    # Triggered after an experiment's rollbacks block is executed.
    ...


def before_activity_control(context: Activity,
                            configuration: Configuration = None,
                            secrets: Secrets = None, **kwargs):
    # Triggered before any experiment's activity
    # (probes, actions) is executed.
    ...


def after_activity_control(context: Activity, state: Run,
                           configuration: Configuration = None,
                           secrets: Secrets = None, **kwargs):
    # Triggered after any experiment's activity
    # (probes, actions) is executed.
    ...

Compared to the chaos notifications, a chaos control provides a lot more signals that can be tapped into. Each of those chaos control signals is then triggered during the execution of a chaos experiment, as Figure 2-3 shows.
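To make the Control API more concrete, here is a minimal sketch of a custom control module that simply times an experiment’s run by implementing two of the functions above. The module name, its behavior, and the specific journal fields it reads are assumptions for illustration; this is not an official Chaos Toolkit extension.

# mychaos/timing_control.py -- a minimal, hypothetical Chaos Toolkit control
# that records how long an experiment takes to run.
import time

from chaoslib.types import Configuration, Experiment, Journal, Secrets

_start = None


def before_experiment_control(context: Experiment,
                              configuration: Configuration = None,
                              secrets: Secrets = None, **kwargs):
    # Remember when the experiment started.
    global _start
    _start = time.time()


def after_experiment_control(context: Experiment, state: Journal,
                             configuration: Configuration = None,
                             secrets: Secrets = None, **kwargs):
    # Report the elapsed time along with the experiment's final status.
    elapsed = time.time() - _start if _start else 0.0
    print(f"experiment '{context.get('title')}' finished "
          f"with status '{state.get('status')}' in {elapsed:.1f}s")

Such a control would be enabled by referencing its module (for example, mychaos.timing_control) from a controls block in the experiment, just as the Humio logging control is in Chapter 3.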



Figure 2-3. Chaos Control signals

Summary
In this chapter you saw, using the Chaos Toolkit as a reference implementation, some of the potential observability signals that a chaos experiment can produce. In the next two chapters we’ll dig deeper into how these extension points, in particular the Chaos Toolkit’s Control API, can be built upon to push these signals to destinations useful for observability.



CHAPTER 3

Logging Chaos Experiments


“Come, Watson, come!” he cried. “The game is afoot. Not a word!
Into your clothes and come!”
—Sherlock Holmes, from “The Return of Sherlock Holmes” by Sir
Arthur Conan Doyle

In Chapter 2 you were introduced to the sorts of chaos experiment observability signals that a chaos experiment’s execution may provide. Now it’s time to look at how those signals can be turned into something useful to your own observability picture.

Centralized logging systems are widely recognized as a foundational part of any system’s observability toolkit. By bringing all of the log events of a system together in one place, you are able to interrogate, inspect, correlate, and begin to comprehend what happened and when across your system. Now you’re going to see how you can convert the raw signals from your running chaos experiments into valuable log events and send them to a centralized logging system.

From Signals to Centralized Logging
The Chaos Toolkit open source community has created an imple‐
mentation of a Control (see “Fine-Grained Signals Through Chaos
Controls” on page 10) that bridges from a running chaos experiment
to a centralized logging system.
The following code sample, taken from a full Logging Control shows
how you can implement a Chaos Toolkit Control function to hook
into the lifecycle of a running chaos experiment:


...

def before_experiment_control(context: Experiment, secrets: Secrets):
    # Send the experiment
    if not with_logging.enabled:
        return
    event = {
        "name": "before-experiment",
        "context": context,
    }
    push_to_humio(event=event, secrets=secrets)

...
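The push_to_humio helper is provided by the Humio extension itself. As a rough sketch of the kind of thing such a helper does, the code below posts each event to Humio’s structured ingest API; the endpoint URL, payload shape, and secrets handling here are assumptions for illustration, not the extension’s actual code.

# A rough, hypothetical sketch of a push_to_humio-style helper.
# The real chaoshumio extension's code, endpoint, and payload shape may differ.
import json
from datetime import datetime, timezone

import requests

from chaoslib.types import Secrets


def push_to_humio(event: dict, secrets: Secrets):
    # Secrets may arrive scoped under the "humio" key, as declared
    # in the experiment's "secrets" block.
    humio = (secrets or {}).get("humio", secrets or {})
    token = humio.get("token")
    if not token:
        return  # without an ingest token there is nothing to send

    payload = [{
        "tags": {"provider": "chaostoolkit"},
        "events": [{
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "attributes": event,
        }],
    }]
    requests.post(
        # Assumed Humio structured-ingest endpoint; adjust for your installation.
        "https://cloud.humio.com/api/v1/ingest/humio-structured",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        data=json.dumps(payload, default=str),
        timeout=10,
    )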

With the Humio extension installed, you can now add a Control
configuration block to each experiment that, when you execute it,
will send logging events to your logging system:
{
    "secrets": {
        "humio": {
            "token": {
                "type": "env",
                "key": "HUMIO_INGEST_TOKEN"
            },
            "dataspace": {
                "type": "env",
                "key": "HUMIO_DATASPACE"
            }
        }
    },
    "controls": [
        {
            "name": "humio-logger",
            "provider": {
                "type": "python",
                "module": "chaoshumio.control",
                "secrets": ["humio"]
            }
        }
    ]
}
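With this block added to the experiment, the only remaining step before running it is to supply the two environment variables the secrets block references; the values and filename below are placeholders:

export HUMIO_INGEST_TOKEN="<your Humio ingest token>"
export HUMIO_DATASPACE="<your Humio dataspace name>"
chaos run experiment.json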



Centralized Chaos Logging in Action
Once configured, and the logging extension is installed, you will
now see logging events from your experiment’s arriving in your
Humio dashboard, as shown in Figure 3-1.

Figure 3-1. Chaos experiment execution log messages
Your chaos experiment executions are now a part of your overall
observable system logging. Those events are now ready for manipu‐
lation through querying and exploring (see Figure 3-2) just as you
would conduct normally with other logging events.


