
FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN PROFESSIONAL SOFTWARE DEVELOPMENT

Service Meshes
Managing Complex Communication
within Cloud Native Applications
eMag Issue 63 - Aug 2018



IN THIS ISSUE

Istio and the Future of Service Meshes

Service Mesh: Promise or Peril?

Envoy Service Mesh Case Study: Mitigating Cascading Failure at Lyft

Increasing Security with a Service Mesh: Christian Posta Explores the Capabilities of Istio

How to Adopt a New Technology: Advice from Buoyant on Utilising a Service Mesh

Microservices Communication and Governance Using Service Mesh


A LETTER FROM THE EDITOR

Daniel Bryant
Modern cloud-native applications often focus
on architectural styles such as microservices,
function as a service, eventing, and reactivity.
Cloud-native applications typically run within virtualized environments — whether this involves
sandboxed process isolation, container-based
solutions, or hardware VMs — and applications
and services are dynamically orchestrated. Although this shift to building cloud-native systems provides many benefits, it also introduces
several new challenges, particularly around the
deployment of applications and runtime configuration of networking.
Some of these technological challenges have
been solved with the emergence of de facto
solutions: for example, Docker for container
packaging and Kubernetes for deployment and

runtime orchestration. However, one of the biggest challenges, implementing and managing
dynamic and secure networking, did not initially
get as much traction as other problem spaces.
Innovators like Calico, Weave, and CoreOS provided early container networking solutions, but
it arguably took the release of Buoyant’s Linkerd,
Lyft’s Envoy proxy, and Google’s Istio to really
drive engineering interest in this space.
With one of the first uses of the phrase “service
mesh” in February 2016, Buoyant CEO William
Morgan announced the benefits of technology
providing “Twitter-style Operability for Microservices”. Matt Klein, plumber and loaf balancer
at Lyft, further extended the concept when
announcing the release of the Envoy Proxy in
September 2016: “Envoy runs on every host and
abstracts the network by providing common features (load balancing, circuit breaking, service
discovery, etc.) in a platform-agnostic manner.”
Phil Calçado has written an excellent introduction to and history of the service mesh, and I
have presented at the microXchg 2018 conference about the fundamental benefits and challenges of using a service mesh. This emag aims
to provide you with a guide to service mesh technology: not simply the technical ins and outs, but
also an exploration of why this technology is important, where it
is going in the future, and why you may want to
adopt it.
The emag begins with Jasmine Jaksic from Google, who provides a guide to Istio and the future
of service meshes. She outlines how the Istio
service mesh is split into a data plane, built from
Envoy proxies that controls communication between services, and a control plane that provides
policy enforcement, telemetry collection, and
more. Jaksic concludes by stating that the long-term vision is to make Istio ambient within the
platform.

In the second article, Richard Li discusses the
pros and cons of adopting a service mesh in your technology stack, based on his experience as


CEO at Datawire. If you have a small number of
microservices or a shallow topology, Li says you
should consider delaying adoption of a service
mesh, and instead evaluate alternative strategies
for failure management. If you are deploying a
service mesh, be prepared to invest ongoing effort in integrating the mesh into your software
development lifecycle.
Next, Jose Nino and Daniel Hochman examine
the mitigation of cascading failure at Lyft, where
their Envoy-powered service mesh handles millions of requests per second in production. Today, these failure scenarios are largely a solved
problem within the Lyft infrastructure as every
deployed service gets throughput and concurrency protection automatically via use of the Envoy proxy. You will also learn that over the coming months, Lyft engineers will be working in
conjunction with the team behind the (self-configuring) concurrency-limits library at Netflix in
order to bring a system based on their library
into an Envoy L7 Filter.
In the fourth article, Christian Posta from Red Hat
presents a practical guide to how Istio can facilitate good security practices within a microservice-based system, such as service-to-service encryption and authorization. He argues that these
foundational security features pave the road for
building zero-trust networks in which you assign
trust based on identity as well as context and circumstances, not just because the caller happens
to be on the same internal network.
Next, Thomas Rampelberg explores how to best
adopt a new technology like a service mesh, and
shares his experience from working at Buoyant
with Linkerd. His core advice is that planning for
failure and understanding the risks along the entire journey will actually set you up for success.

The concerns that you collect early on can help
with this planning, and each possible risk can be
addressed early on so that it doesn’t become an
issue.

The emag concludes with a virtual panel hosted
by fellow InfoQ editor Srini Penchikala, and includes a series of great discussions from innovators in the service mesh space such as Matt Klein,
Dan Berg, Priyanka Sharma, Lachlan Evenson,
Varun Talwar, Yuri Shkuro, and Oliver Gould.
The service mesh space is a rapidly emerging
technical and commercial opportunity, and although we expect some aggregation or attrition
of offerings over the coming months and years,
for the moment, there are plenty of options to
choose from (many of which we have covered on
InfoQ):
• Istio and Envoy, which are covered in this
emag;
• Linkerd (and Linkerd 2, which includes Conduit) are also covered here;
• Cilium, API-aware networking and security
powered by the eBPF kernel features;
• HashiCorp Consul Connect, a distributed service mesh to connect, secure, and configure
services across any runtime platform; and
• NGINX (with Istio and nginMesh) or NGINX
Plus with Controller.
We hope this InfoQ emag will help you decide
if your organisation would benefit from using a
service mesh, and if so, that it also guides you on
your service mesh journey. We are always keen
to publish practitioner experience and learning,
and so please do get in touch if you have a service mesh story to share.




CONTRIBUTORS
Daniel Bryant
is leading change within organisations and technology. His
current work includes enabling agility within organisations
by introducing better requirement gathering and planning
techniques, focusing on the relevance of architecture within agile
development, and facilitating continuous integration/delivery.
Daniel’s current technical expertise focuses on ‘DevOps’ tooling,
cloud/container platforms and microservice implementations.

Jasmine Jaksic

works at Google as the lead
technical program manager
on Istio. She has 15 years
of experience building and
supporting various software
products and services. She is a cofounder of Posture Monitor, an
application for posture correction
using a 3-D camera. She is also a
contributing writer for The New
York Times, Wired, and Huffington
Post. Follow her on Twitter: 
@JasmineJaksic.

Richard Li


is the CEO/co-founder of
Datawire, which builds open-source tools for developers on
Kubernetes. Previously, Li was
VP product and strategy at Duo
Security. Prior to Duo, Richard
was VP strategy and corporate
development at Rapid7. Li has a
B.Sc. and M.Eng. from MIT.

Daniel Hochman

is a senior infrastructure
engineer at Lyft. He’s passionate
about scaling innovative
products and processes to
improve quality of life for those
inside and outside the company.
During his time at Lyft, he has
successfully guided the platform
through an explosion of product
and organizational growth.
He wrote one of the highest-throughput microservices
and introduced several
critical storage technologies.
Hochman currently leads
Traffic Networking at Lyft and
is responsible for scaling Lyft’s
networking infrastructure
internally and at the edge.



Srini Penchikala

currently works as a senior software architect
in Austin, Tex. Penchikala has over 22 years of
experience in software architecture, design, and
development. He is also the lead editor for the AI,
ML & Data Engineering community at InfoQ,
which recently published his mini-book Big Data
Processing with Apache Spark. He has published
articles on software architecture, security, risk
management, NoSQL, and big data at websites like
InfoQ, TheServerSide, the O’Reilly Network (OnJava),
DevX’s Java Zone, Java.net, and JavaWorld.

Christian Posta
(@christianposta) is a chief architect of cloud
applications at Red Hat and well known in the
community for his writing (Introducing Istio Service
Mesh, Microservices for Java Developers). He’s also
known as a frequent blogger, speaker, open-source
enthusiast, and committer on various open-source
projects including Istio, Apache ActiveMQ, Fabric8,
etc. Posta has spent time at web-scale companies and
now helps companies create and deploy large-scale,

resilient, distributed architectures — many of which we
now call microservices. He enjoys mentoring, training,
and leading teams to be successful with distributed
systems concepts, microservices, DevOps, and cloud-native application design.

Jose Nino
is the lead for dev tooling and configuration on the
Networking team at Lyft. During the nearly two
years he’s been at Lyft, Nino has been instrumental
in creating systems to scale configuration of Lyft’s
Envoy production environment for increasingly large
deployments and engineering orgs. He has worked as
an open-source Envoy maintainer and has nurtured
Envoy’s growing community. More recently, Nino has
moved on to scaling Lyft’s network load-tolerance
systems. He has spoken about Envoy and related
topics at several venues, most recently at KubeCon
Europe 2018.

Thomas Rampelberg
is a software engineer at Buoyant, which created the
Linkerd service mesh. He has made a career of building
infrastructure software that allows developers and
operators to focus on what is important to them. While
working for Mesosphere, he helped create DC/OS, one
of the first container orchestration platforms, used by
many of the Fortune 500. He has moved to the next
big problem in the space: providing insight into what
is happening between services, improving reliability
between them, and using best practices to secure the
communication channels between them.



Read online on InfoQ

KEY TAKEAWAYS
The microservices architectural style simplifies
the implementation of individual services.
However, connecting, monitoring, and securing
hundreds or even thousands of microservices
is not simple.
A service mesh provides a transparent and
language-independent way to flexibly and
easily automate networking, security, and
observation functions. In essence, it decouples
development and operations for services.
The Istio service mesh is split into 1) a data
plane built from Envoy proxies that intercepts
traffic and controls communication between
services and 2) a control plane that supports
services at runtime by providing policy
enforcement, telemetry collection, and
certificate rotation.

ISTIO AND THE FUTURE
OF SERVICE MESHES
by Jasmine Jaksic
It wouldn’t be a stretch to
say that Istio popularized the
concept of a service mesh.
Before we get into the details
on Istio, let’s briefly dive into

what a service mesh is and
why it’s relevant.

The near-term goal is to launch Istio 1.0,
when the key features will all be in beta
(including support for hybrid environments).
The long-term vision is to make Istio ambient.



We all know the inherent challenges associated with monolithic applications, and the obvious solution is to decompose
them into microservices. While
this simplifies individual services,
connecting, monitoring and securing hundreds or even thousands of microservices is not
simple. Until recently, the solution was to string them together
using custom scripts, libraries,
and dedicated engineers tasked
with managing these distributed
systems. This reduces velocity on
many fronts and increases maintenance costs. This is where a service mesh comes in.
A service mesh provides a transparent and language-independent way to flexibly and easily
automate networking, security,
and telemetry functions. In essence, it decouples development
and operations for services so a
developer can deploy new services as well as make changes to
existing ones without worrying
about how that will impact the

operational properties of their
distributed systems. Similarly, an
operator can seamlessly modify
operational controls across services without redeploying them
or modifying their source code.
This layer of infrastructure between services and their underlying network is what is usually
referred to as a service mesh.
Within Google, we use a distributed platform for building services, powered by proxies that
can handle various internal and
external protocols. These proxies
are supported by a control plane
that provides a layer of abstraction between developers and operators and lets us manage services across multiple languages
and platforms. This architecture
has been battle-tested to handle
high scalability and low latency
and to provide rich features to
every service running at Google.

Illustration: Dan Ciruli, Istio PM
Back in 2016, we decided to build
an open-source project for managing microservices that in a lot
of ways mimicked what we use
within Google, which we decided to call “Istio”. Istio means
“sail” in Greek, and Istio started
as a solution that worked with
Kubernetes, which in Greek
means “helmsman” or “pilot”. It
is important to note that Istio is
agnostic to deployment environment, and was built to help
manage services running across

multiple environments.
Around the same time that we
were starting the Istio project,
IBM released an open-source
project called Amalgam8, a content-based routing solution for
microservices that was based on
NGINX. Realizing the overlap in
our use cases and product vision,
IBM agreed to become our partner in crime and shelve Amalgam8 in favor of building Istio
with us, based on Lyft’s Envoy.

How does Istio work?

Broadly speaking, an Istio service
mesh is split into 1) a data plane
built from Envoy proxies that



intercepts traffic and controls
communication between services and 2) a control plane that
supports services at runtime by
providing policy enforcement,
telemetry collection, and certificate rotation.

Proxy


Envoy is a high-performance,
open-source distributed proxy
developed in C++ at Lyft (where
it handles all production traffic).
Deployed as a sidecar, Envoy intercepts all incoming and outgoing traffic to apply network
policies and integrate with Istio’s
control plane. Istio leverages
many of Envoy’s built-in features such as discovery and load
balancing, traffic splitting, fault
injection, circuit breakers, and
staged rollouts.
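
To make the sidecar deployment model concrete, here is a minimal sketch of one common way the proxy reaches an application: labelling a Kubernetes namespace so that Istio's automatic sidecar injection adds the Envoy container to every pod scheduled there. The namespace name is hypothetical, and istioctl kube-inject offers a manual alternative.

```yaml
# Hypothetical namespace labelled for Istio's automatic sidecar injection:
# every pod created here gets an Envoy proxy container added by the
# injector webhook.
apiVersion: v1
kind: Namespace
metadata:
  name: shop
  labels:
    istio-injection: enabled
```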

Pilot

As an integral part of the control
plane, Pilot manages proxy configuration and distributes service
communication policies to all Envoy instances in an Istio mesh. It
can take high-level rules (like rollout policies), translate them into
low-level Envoy configuration,
and push them to sidecars with
no downtime or redeployment
necessary. While Pilot is agnostic
to the underlying platform, operators can use platform-specific
adapters to push service discovery information to Pilot.

Mixer

Mixer integrates a rich ecosystem of infrastructure back-end
systems into Istio. It does this
through a pluggable set of

adapters using a standard configuration model that allows Istio
to be easily integrated with existing services. Adapters extend
Mixer’s functionality and expose
specialized interfaces for monitoring, logging, tracing, quota
management, and more. Adapters are loaded on demand and
used at runtime based on operator configuration.

Citadel

Citadel (previously known as
Istio Auth) performs certificate
signing and rotation for service-to-service communication
across the mesh, providing mutual authentication as well as
mutual authorization. Envoy uses
Citadel certificates to transparently inject mutual transport-layer security  (TLS) on each call,
thereby securing and encrypting
traffic using automated identity
and credential management. As
is the theme throughout Istio,
authentication and authorization
can be configured with minimal
to no changes of service code
and will seamlessly work across
multiple clusters and platforms.

Why use Istio?

Istio is highly modular and is

used for a variety of use cases.
While it is beyond the scope of
this article to cover every benefit,
let me provide a glimpse of how
it can simplify the day-to-day life
of NetOps, SecOps, and DevOps.

Resilience

Istio can shield applications from
flaky networks and cascading
failures. If you are a network operator, you can systematically
test the resiliency of your application with features like fault
injection to inject delays and
isolate failures. If you want to migrate traffic from one version of
a service to another, you can reduce the risk by doing the rollout
gradually through weight-based
traffic routing. Even better, you
can mirror live traffic to a new
deployment to observe how it
behaves before doing the actual migration. You can use Istio
Gateway to load-balance the incoming and outgoing traffic and


apply route rules like timeouts,
retries, and circuit breakers to reduce and recover from potential
failures.
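
As a rough illustration of these resilience primitives, the sketch below uses the v1alpha3 routing API (as it stood around Istio 0.8/1.0) to inject a delay into a slice of traffic and to bound each call with a timeout and a retry budget; the ratings service and the specific numbers are hypothetical. Because the rule lives in the mesh, it can be applied or removed without redeploying the service.

```yaml
# Hypothetical rule: delay 10% of calls to the ratings service by 5s to test
# resilience, and bound every call with a timeout and retry budget.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings-resilience
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percent: 10
        fixedDelay: 5s
    route:
    - destination:
        host: ratings
    timeout: 10s
    retries:
      attempts: 3
      perTryTimeout: 2s
```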

Security


One of the main Istio use cases
is securing inter-service communications across heterogeneous
deployments. In essence, security operators can now uniformly
and at scale enable traffic encryption, lock down the access to
a service without breaking other
services, ensure mutual identity verification, whitelist services using ACLs, authorize service-to-service communication,
and analyze the security posture
of applications. They can implement these policies across the
scope of a service, namespace,
or mesh. All of these features can
reduce the need for layers of firewalls and simplify the job of a security operator.

Observability

The ability to visualize what’s happening within your infrastructure
is one of the main challenges of a
microservice environment. Until
recently individual services had
to be extensively instrumented
for end-to-end monitoring of
service delivery. And unless you
have a dedicated team willing to
tweak every binary, getting a holistic view of the entire fleet and
troubleshooting bottlenecks can
be cumbersome.
With Istio, out of the box, you get
visualization of key metrics and
the ability to trace the flow of
requests across services. This allows you to do things like enable

autoscaling based on application
metrics. While Istio supports a
whole host of providers like Prometheus, Stackdriver, Zipkin, and
Jaeger, it is back-end agnostic. If
you don’t find what you are looking for, you can always write an adapter and integrate it with Istio.

Where is Istio now?

We at Google continually add new
features to Istio while stabilizing and
improving the existing ones. In true
agile style, each Istio feature proceeds
through its own individual lifecycle
(dev/alpha/beta/stable). Although we
are still tinkering around with some
functionality, there are a bunch of features that are ready for production use
(beta/stable). Check out the latest feature list on the Istio website.
Istio has a rigorous release cadence.
While daily and weekly releases are
available, they are not supported
and may not be reliable. The monthly
snapshots, on the other hand, are relatively safe and are usually packed with
new features. However, if you are looking to use Istio in production, look for
releases that have the tag “LTS” (long-term support). As of this writing, 0.8
is the latest LTS release. You can find
that release and all the other versions
on GitHub.


What next?

It’s been a year since Istio 0.1 was officially launched at GlueCon. While
we have come far, there is still a lot
more ground to cover. The near-term
goal is to launch Istio 1.0, when the
key features will all be in beta (and in
some cases stable). It’s important to
note that this does not mean 100%
of Istio features, just what we consider the most important features based
on feedback from the community. We
are also rigorously working on improving nonfunctional requirements
like performance and scalability for
this launch, as well as improving our
documentation and “getting started”
experiences.
One of the primary  goals for Istio is
support for hybrid environments. As
an example, someday users could run
VMs in GCE, an on-premises Cloud
Foundry cluster, and other services in

another public cloud, and Istio would
provide them with a holistic view of
their entire service fleet, as well as the
means to operate and secure connections between them. There is work in
progress to enable multi-cluster architecture in which you can join multiple Kubernetes clusters into a single
mesh and enable service discovery
across clusters on a flat network; this

work is in alpha in the 0.8 LTS release.
In the not-so-distant future, it will also
support global ingress for cluster-level load balancing, as well as support
for non-flat networks using Gateway
peering.
Another area of focus beyond our
1.0 release is API management capabilities. As an example, we intend to
launch a service broker API that will
help connect service consumers with
service operators by enabling discovery and provisioning of individual services. We will also provide a common
interface for API management features like API business analytics, API
key validation, auth validation (JWT,
OAuth, etc.), transcoding (JSON/REST
to gRPC), routing, and integration with
various API management systems like
Google Endpoints and Apigee.



All these near-term goals pave the
path towards our long-term vision,
which is to make Istio ambient. As
Sven Mawson, our technical lead and
Istio founder puts it, “We want to reach
a future state where Istio is woven into
every environment, with service management available no matter what environment or platform you use.”
While it’s still early, the speed of Istio
development and adoption has been
accelerating. From major cloud providers to independent contributors,
Istio has already become synonymous
with service mesh and has grown to
be an integral part of infrastructure
roadmaps. With every launch, we are
getting closer to the new reality.



Read online on InfoQ

KEY TAKEAWAYS
The three core
strategies for
managing failure
in a microservices
architecture are
proactive testing,

mitigation, and rapid
response.
If you have a
small number of
microservices or a
shallow topology,
consider delaying
adoption of a service
mesh and evaluate
alternative strategies
for failure management.
If you are deploying
a service mesh, be
prepared to invest
ongoing effort in
integrating the mesh
into your software
development lifecycle.


SERVICE MESH:
PROMISE OR PERIL?
by Richard Li
Service meshes such as Istio, Linkerd,
and Cilium are gaining increased
visibility as companies adopt
microservice architectures. The
arguments for a service mesh are
compelling: full-stack observability,

transparent security, systems
resilience, and more. But is a service
mesh really the right solution for your
cloud-native application? This article
examines when a service mesh makes
sense and when it might not.


Microservices, done right, let you
move faster

Time to market is a fundamental competitive advantage. Responding quickly to market forces and customer feedback is crucial
to building a winning business. Microservices are a powerful paradigm for accelerating
your software agility and velocity.
Empowering different software teams to simultaneously work on different parts of an
application decentralizes decision making.
Decentralized decision making has important consequences. First, software teams can
make locally optimal decisions on architecture, release, testing, and so forth, instead of
relying on a globally optimal standard. The
most common example of this type of decision is release: instead of orchestrating a single monolithic application release, each team
has its own release vehicle. The second consequence is that decision making can happen
more quickly, as the number of communication hops between the software teams and
centralized functions such as  operations, architecture, and so forth is reduced.

Microservices aren’t free: They
introduce new failure modes

Adopting microservices has far-reaching implications for your organization, process, and
architecture. In this article, we’ll focus on one

of the key architectural shifts — namely, microservices form a distributed system. In a microservices-based application, business logic
is distributed between multiple services that
communicate with each other via the network. A distributed system has many more
failure modes, as highlighted in the  fallacies
of distributed computing.
Given these failure modes, it’s crucial to have
an architecture and process that prevent
small failures from becoming big failures.
When you’re going fast, failures are inevitable
— e.g., bugs will be introduced as services are
updated, services will crash under load, and
so forth.
As your application grows in complexity, the
need for failure management grows more
acute. When an application consists of a
handful of microservices, failures tend to be
easier to isolate and troubleshoot. As your
application grows to tens or hundreds of



microservices — with different,
geographically distributed teams
— your failure management systems need to scale with your application.

Managing failure

There are three basic strategies
for managing failure: proactive

testing, mitigation, and rapid response:
1. Proactive testing —  Implement processes and systems to test your application
and services so that failure
is identified early and often.
Classic quality assurance
(QA) is included within this
category, and although traditional test teams focused
on pre-release testing, this
frequently now extends to
testing in production.
2. Mitigation — Implement
strategies to reduce the impact of any given failure. For
example, load-balancing between multiple instances of
a service can ensure that if a
single instance fails, the overall service can still respond.
3. Rapid response — Implement processes and systems
to rapidly identify and address a given failure.

Service meshes

When a service fails, there is
an impact on its upstream and
downstream services. The impact of a failed service can be
greatly mitigated by properly
managing the communication
between services. This is where a
service mesh comes in.
A service mesh manages service-level (i.e.,  layer 7 or L7)
communication. Service meshes
provide powerful primitives that

can be used for all three failure


management strategies. Service
meshes implement the following:
1. Dynamic routing can be
used for different release and
testing strategies such as
canary routing, traffic shadowing, or blue/green deployments.
2. Resilience mitigates the impact of failures through strategies such as circuit breaking
and rate limiting.
3. Observability helps improve response time by collecting metrics and adding
context (e.g., tracing data) to
service-to-service communication.
Service meshes add these features in a way that’s largely transparent to application developers.
However, as we’ll see shortly,
there are some nuances to this
notion of transparency.

Will a service mesh
help you build software
faster?

In deciding whether or not a service mesh makes sense for you
and your organization, start by
asking yourself two questions:
how complex is your service topology and how will you integrate a service mesh into your
software development lifecycle
(SDLC)?

Typically, an organization will
start with a single microservice
that connects with an existing
monolithic application. In this situation, the benefits of the service
mesh are somewhat limited. If
the microservice fails, identifying
the failure is straightforward. The
blast radius of a single microservice failure is inherently limited.
You can likely accomplish incremental releases through existing


infrastructure such as  Kubernetes or your API Gateway.
As your service topology grows
in complexity, however, the benefits of a service mesh start to
accumulate. The key constraint
to consider is the depth of your
service-call chain. If you have a
shallow topology, where your
monolith directly calls a dozen
microservices, the benefits of a
service mesh are still fairly limited. As you introduce more service-to-service communication
where service A calls service B,
which calls service C, a service
mesh becomes more important.
As far as integration goes, a service mesh is designed to be
transparent to the actual services
that run on the mesh. One way
to think about a service mesh is
that it’s a richer L7 network. No

code changes are required for a
service to run on a service mesh.
However, deploying a service
mesh does not automatically accelerate your software velocity
and agility. You have to integrate
the service mesh into your development processes. We’ll explore
this in more detail in the next
section.

Failure management

A service mesh provides powerful primitives for failure management, but alternatives to service
meshes exist. In this section, we’ll
walk through each of the failure
management strategies, and
discuss how they apply to your
SDLC.

Proactive testing

Proactive testing for a microservices application should be as
real-world as possible. Given the
complexity of a multiservice application,  contemporary testing
strategies  emphasize testing in


production (or with production
data).
A service mesh enables testing
in production by controlling the

flow of L7 traffic to services. For
example, a service mesh can
route 1% of traffic to v1.1 of a service while the other 99% of traffic
follows the beaten path to v1.0
(called “canary deployment”).
These capabilities are exposed
through declarative routing rules
(e.g., a Linkerd dtab or Istio routing rules).
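
For illustration, here is a minimal sketch of such a declarative rule in Istio's v1alpha3 routing API; the checkout service, subset names, and version labels are hypothetical, and a Linkerd dtab would express the same intent in its own syntax.

```yaml
# Hypothetical canary rule: send 1% of traffic to v1.1 of the checkout
# service while 99% follows the beaten path to v1.0.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
  - checkout
  http:
  - route:
    - destination:
        host: checkout
        subset: v1-0
      weight: 99
    - destination:
        host: checkout
        subset: v1-1
      weight: 1
---
# Subsets map to pod labels identifying each deployed version.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout
  subsets:
  - name: v1-0
    labels:
      version: v1.0
  - name: v1-1
    labels:
      version: v1.1
```
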
A service mesh is not the only
way to proactively test. Complementary strategies include using
your container scheduler such
as Kubernetes to do a rolling update,  an API Gateway that can
canary-deploy, or  chaos engineering.
With all of these strategies, the
question of who manages the
testing workflow becomes apparent. In a service mesh, the
team that manages the mesh
could also centrally manage the
routing rules. However, this likely
won’t scale, as individual service
authors presumably will want
to control when and how they
roll out new versions of their
services. But if service authors
manage the routing rules, how
do you educate them on what
they can and can’t do? How do
you manage conflicting routing
rules?


Mitigation

A service can fail for a variety of
reasons: a code bug, insufficient
resources, hardware failure, etc.
Limiting the blast radius of a
failed service is important so that
your overall application continues operating, albeit in a degraded state.
A service mesh mitigates the impact of a failure by implementing

resilience patterns such as load
balancing, circuit breakers, and
rate limiting on service-to-service communication. For example, a service that is under
heavy load can be rate limited
so that some responses are still
processed, without causing the
entire service to collapse under
load.

Other strategies for mitigating
failure include using smart RPC
libraries (e.g.,  Hystrix) or relying
on your container scheduler. A
container scheduler such as Kubernetes supports health checking,  auto-scaling, and dynamic
routing around services that are
not responding to health checks.
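
As a sketch of that scheduler-level alternative, the fragment below shows a Kubernetes readiness probe, which tells the platform to stop routing traffic to an instance that fails its health check; the service name, image, and endpoint are hypothetical.

```yaml
# Hypothetical Deployment fragment: a readiness probe lets Kubernetes stop
# routing traffic to an instance that fails its health check.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
      - name: payments
        image: example.com/payments:1.4.2   # hypothetical image
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
```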

These mitigation strategies are
most effective when they are appropriately configured for a given service. For example, different
services can handle different volumes of requests, necessitating
different rate limits. How do policies such as rate limits get set?
Netflix has  implemented some
automated configuration algorithms  for setting these values.
Other approaches would be to
expose these capabilities to service authors, who can configure
the services correctly.

Observability

Failures are inevitable. Implementing observability — spanning monitoring, alerting/visualization, distributed tracing, and
logging — is critical to minimizing the response time to a given
failure.

A service mesh automatically collects detailed metrics on service-to-service communication, including data on throughput, latency, and availability. In addition, service meshes can inject the necessary headers to support distributed tracing. Note that these headers still need to be propagated by the service itself.

Other approaches for collecting similar metrics include using monitoring agents, collecting metrics via StatsD, and implementing tracing through libraries (e.g., the Jaeger instrumentation libraries).

An important component of observability is exposing the alerting and visualization to your service authors. Collecting metrics is only the first step, and thinking of how your service authors will create alerts and visualizations that are appropriate for the given service is important to closing the observability loop.

It’s all about the
workflow!

The mechanics of deploying a

service mesh are straightforward. However, as this discussion
hopefully makes clear, the application of a service mesh to your
workflow is more complicated.
The key to successfully adopting
a service mesh is to recognize
that a mesh affects your development processes, and be prepared
to invest in integrating the mesh
into those processes. There is no
single right way to integrate the
mesh into your processes, and
best practices are still emerging.



Read online on InfoQ

KEY TAKEAWAYS
Over the past four years, Lyft has
transitioned from a monolithic
architecture to hundreds of

microservices. As the number
of microservices grew, so did
the number of outages due to
cascading failure or accidental
internal denial of service.
Today, these failure scenarios are
largely a solved problem within
the Lyft infrastructure. Every
service deployed at Lyft gets
throughput and concurrency
protection automatically via use of
the Envoy proxy.
Envoy can be deployed as
middleware or solely at request
ingress, but the most benefit
comes from deploying it at
ingress and egress locally with the
application. Deploying Envoy on
both sides of the request allows
it to act as a smart client and a
reverse proxy for the server.


ENVOY SERVICE-MESH
CASE STUDY: MITIGATING
CASCADING FAILURE AT LYFT
by Jose Nino and Daniel Hochman

Cascading failure is one of the primary

causes of unavailability in high-throughput
distributed systems. Over the past four
years, Lyft has transitioned from a monolithic
architecture to hundreds of microservices.
As the number of microservices grew, so did
the number of outages due to cascading
failure or accidental internal denial of service.



Today, these failure scenarios are
largely a solved problem within the Lyft infrastructure. Every
service deployed at Lyft gets
throughput and concurrency
protection automatically. With
some targeted configuration
changes to our most critical services, there has been a 95% reduction in load-based incidents
that impact the user experience.
Before we examine specific
failure scenarios and the corresponding protection mechanisms, let’s first understand how
network defense is deployed at
Lyft. Envoy is a proxy that originated at Lyft and was later open-sourced and donated to the
Cloud Native Computing Foundation. What separates Envoy
from many other load-balancing
solutions is that it was designed
to be deployed in a mesh configuration. Envoy can be deployed as middleware or solely
at request ingress, but the most
benefit comes from deploying it
at ingress and egress locally with

the application. Deploying Envoy
on both sides of the request allows it to act as a smart client and
a reverse proxy for the server. On
both sides, we have the option to
employ rate limiting and circuit
breaking to protect servers from
overload in a variety of scenarios.

Core concepts
Concurrency and rate limiting
Concurrency and rate limiting
are related, but different concepts; two sides of the same coin.
When thinking of limiting load
in systems, operators traditionally think in terms of requests per
second. The act of limiting the
rate of requests sent to a system
is “rate limiting”. Stress tests are
normally done to determine the
request rate that will overload
the service, then limits are set
somewhere below this point. In

some cases, business logic dictates the rate limit.
On the other side of the coin, we
have concurrency: how many
units are in use simultaneously. These units can be requests,
connections, etc. For example,
instead of thinking in terms of
request rate, we can think about
the number of concurrent inflight requests at a point in time.

When we think about concurrent
requests, we can apply queueing
theory to determine the number
of concurrent requests a service
can handle before a queue starts
to build, request latencies increase, and the service fails due
to resource exhaustion.
Global versus local decisions
Circuit breakers in Envoy are
computed based on local information. Each instance of Envoy tracks its own statistics and
makes its own circuit-breaking
decisions. This model has a few
advantages over a global system.
The first is that the limits can be
computed in memory, without
the expense of a network call to
a central system. The second is
that the limits scale with the size
of the cluster. Third, the limit accounts for differences between
machines, whether they receive
a different query mix or have differences in performance.
Common failure scenarios
Before we introduce the defense mechanisms, it’s helpful
to go over some common failure
modes.
Retry amplification
A dependency begins to fail. If a
service does one retry for all requests to that dependency, the
overall call volume will double.
Resource starvation

Every service is bound by some
constraint, usually CPU, network,
or memory. Concurrent requests

are usually directly correlated
with the amount of resources
consumed.
Recovery from resource
starvation
Even when the cause of the increase in resource consumption
subsides to normal levels, a service may not recover due to resource contention.
Back-end slowdown
A dependency (database or other service) slows, causing the
service to take longer to fulfill a
request.
Burstiness and undersampling
When doing advance capacity
planning or elastically scaling a
service, the usual practice is to
consider the average resources consumed across an entire
cluster. However, callers of the
service may elect to send a large
number of requests simultaneously. This can saturate a single
server momentarily. When collecting metrics, per-minute or
higher data will almost definitely
obscure these bursts.

Present day at Lyft
How do we rate-limit?
Ratelimit is an open-source Go/gRPC service designed to enable
generic rate-limit scenarios for
a variety of applications. Rate
limits are applied to domains.
Examples of a domain are per IP
or the number of connections
per second made to a database.
Ratelimit is in production use at
Lyft, and handles hundreds of
thousands of rate-limit requests
per second. We use Ratelimit at
both the edge proxy and within
the internal service mesh.
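
To make the domain and descriptor concepts concrete, here is an illustrative configuration in the format documented for the open-source Ratelimit project; the domain, descriptors, and limits are hypothetical.

```yaml
# Hypothetical Ratelimit configuration: one domain with two descriptors,
# limiting requests per client IP and connections per second to a database.
domain: edge_proxy
descriptors:
  - key: remote_address
    rate_limit:
      unit: second
      requests_per_unit: 10
  - key: database
    value: users
    rate_limit:
      unit: second
      requests_per_unit: 500
```
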
The open-source service is a reference implementation of the



Envoy rate limit API. Envoy offers
the following integrations:
Network-level rate-limit filter
— Envoy can call the rate-limit
service for every new connection
on the listener where the filter
is installed. The configuration
specifies a specific domain and
descriptor set to rate limit on.
This has the ultimate effect of

rate-limiting the connections per
second that transit the listener.
HTTP-level rate-limit filter —
Envoy can call the rate-limit service for every new request on
the listener where the filter is installed.
At Lyft, we primarily use rate
limiting to defend against load
at the edge of our infrastructure
— for example, the request rate
allowed per user ID. This protects
Lyft’s services from resource starvation due to unexpected or malicious load from external clients.
The networking team at Lyft provides metrics for all configured
rate limits. When a service owner
creates new rate limits to enforce
at the edge, between services, or
to a database, it is immediately
possible to gather data pertaining to the defence mechanism.
The graph above is a snippet
from the Ratelimit service dashboard, which shows three panels:
• Total hits per minute — This
panel shows a time series
with the total hits per rate
limit configured. In this panel,
service owners can see trends
over time.


• Over limits per minute —
This panel shows the metrics

that exceed the configured
limit. This panel provides service owners with quantifiable
data they can use to assess
call patterns to plan for highload events.
• Near limits per minute —
This panel shows when the
metric is hitting 80% of the
limit configured.
How do we manage
concurrency?
One of the main benefits of Envoy is that it enforces concurrency limits via its circuit-breaking
system at the network level as
opposed to having us configure
and implement the pattern in
each application independently.
Envoy supports various types of
fully distributed circuit breaking:
Maximum connections — This
is the maximum number of connections that Envoy will establish
to all hosts in an upstream cluster. In practice, this is generally
used to protect HTTP/1 clusters
since HTTP/2 can multiplex requests over a single connection,
therefore limiting connection
growth during slowdown.
Maximum pending requests
— This is the maximum number
of requests that will be queued
while waiting for an available
connection from the pool. In
practice, this is only applicable

to HTTP/1 clusters since HTTP/2
connection pools never queue
requests. HTTP/2 requests are
multiplexed immediately.


Maximum requests —  This is
the maximum number of requests that can be outstanding
to all hosts in a cluster at any
given time. In practice, this is primarily used with HTTP/2 since
HTTP/1 generally operates with
one request per connection.
Maximum active retries — This
is the maximum number of retries that can be outstanding to
all hosts in a cluster at any given
time. In general, we recommend
aggressively circuit-breaking retries so that retries for sporadic
failures are allowed but the overall retry volume cannot explode
and cause large-scale cascading failure. This setting protects
against retry amplification.
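
As a sketch of how these thresholds look in practice, the fragment below defines a single upstream cluster in an Envoy (v2 API) configuration with all four circuit breakers set; the cluster name, endpoint, and values are hypothetical.

```yaml
# Hypothetical Envoy (v2 API) cluster fragment with all four thresholds set.
static_resources:
  clusters:
  - name: locations
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address: { address: locations.internal, port_value: 80 }
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 100       # cap on upstream connections (HTTP/1)
        max_pending_requests: 50   # cap on requests queued for a connection
        max_requests: 60           # cap on outstanding requests (HTTP/2)
        max_retries: 3             # cap on concurrent retries
```
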
At Lyft, we have focused on two
mechanisms for managing concurrency in the service mesh:
Limiting the number of concurrent connections at the ingress
layer — Given that every service
at Lyft runs an Envoy sidecar to
manage incoming requests into
the service (ingress), we can configure the number of concurrent
connections that the sidecar
makes to the application, thus

limiting ingress concurrency into
the application. We provide reasonable values as defaults, but
encourage service owners to analyze their concurrency patterns
and tighten the settings.
Limiting the number of concurrent requests at the egress layer — Another benefit of running
a sidecar to manage egress traffic from the service is that we can
manage outgoing concurrent requests from the service (egress)


to any other service at Lyft. This
means that the owners of the
Locations service can selectively
configure the levels of concurrency that they want to support
for every other service at Lyft —
e.g., they can decide and configure that the Rides service can
make 100 concurrent requests to
Locations, but the Users service
can only make 50 concurrent requests to Locations.

An interesting consequence of
running concurrency limits on
both the egress and ingress of
every service at Lyft is that it’s
much easier to track down undesired behavior. As mentioned
above, a common failure scenario is burstiness, which may be
hard to diagnose due to metric
resolution. The presence of concurrency limiting on both egress
and ingress makes it easy to pinpoint bursty behavior across the

system by seeing where in the

request path concurrency overflows occur.
As we have mentioned, concurrency is not always an intuitive
concept. To improve this shortcoming, the networking team
provides different visualizations
so that service owners can configure their concurrency limits
and then monitor how those limits affect the system.

Setting the limits

The interactive dashboard above is where service owners can experiment with different limits for the maximum
number of concurrent requests allowed from any service at Lyft to their specific service. In the example above,
the owner of the Locations service can see that a limit of 60 concurrent requests would suffice for the majority
of calling services, with the exception of the Viewport service. Using this dashboard, service owners can visualize
what selective changes in concurrency configuration would look like in the current network topology, and can
make those changes with confidence.
Monitoring the limits
As mentioned above, having Envoy running as a sidecar that handles both ingress and egress traffic from every
service allows service owners to protect their service from ingress concurrency and egress concurrency. The network team automatically creates dashboards like the ones below to help service owners visualize concurrency.

Using the two panels for ingress concurrency above, service owners can watch the number of concurrent connections from their sidecar Envoy into their service (the panel on the left) and see if the concurrency limit is
being hit (the panel on the right).



These two panels for egress concurrency show service owners
the number of concurrent requests from any service
to their service at any point in

time (the panel on the left). Service owners can additionally observe services that are going over
the configured limit (the panel
on the right) and proceed to address the situation with concrete
data.
What are the shortcomings?
Unfortunately, as with any static
value, it’s hard to pick a nominal
limit. This is true of rate limits
but especially true of concurrency limits. There are several important factors to consider. The
concurrency limits are local, and
must account for the maximum
possible concurrency rather than
the average. Engineers are also
not accustomed to thinking locally and primarily think in terms
of rate of requests and not concurrency. With the aid of some visualizations and statistics, service
owners can usually grasp concurrency and pick a nominal value.
In addition to having difficulty
reasoning about the value, service owners must deal with the
constant changes at Lyft. There
are hundreds of deploys per day
throughout the service mesh.
Any change to a service and
its dependencies can change
the resource and load profile.
As soon as a value is chosen, it
will be out of date due to these
changes. For example, almost all
services at Lyft are CPU-bound. If
a CPU-bound service’s direct dependency slows down by 25%,
the service can handle additional concurrency since in-flight

requests that were previously
using the CPU will now sit idle
for some additional time as they
wait for network I/O to complete.
For this reason, a 25%-50% increase over the nominal value is recommended.

Roadmap
Near term
The networking team at Lyft focuses on building accessible and
easy-to-use systems for service
developers to successfully configure, operate, and debug Envoy
and related systems. As such, we
take it in our charter not only to
design, implement, and deploy
the systems showcased above,
but also to provide continued
support for our users. At Lyft, one
of the main tenets of the infrastructure organization is that the
abstractions we provide for service owners should be self-service. This means that we need to
invest heavily in documenting
use cases, providing debug tools,
and offering support channels.
Given how non-intuitive concurrency is, the networking team
is investing in additional documentation and engineering education around this topic in the
short term. In the past, we have

seen success with related systems in the following formats:


1. FAQs — Frequently asked
question lists are extremely
helpful as a first stop for customers. Moreover, it reduces
the support burden of answering questions directly
(e.g., on Slack, via email, or
even in person). It allows you
to easily point someone to a
link; this practice scales better than having people answer the same questions repeatedly. The downfall here
is that these lists can get long
and difficult to parse. This
can be addressed by separating content into categorical
FAQs.
2. Choose your own adventure — The service owner is
the protagonist who gets to
choose the outcome of their
adventure. In the concurrency space described above,
several issues can arise and,
in tandem, several settings
that can be modified to solve
the problems. This means
that this support burden
lends itself extremely well
to a format where the service owner can start with the
problem they are having and
navigate a flowchart to arrive



at the metrics they need to
analyze to derive the correct
settings.
Near-term investments in documentation and engineering
education mitigate one dimension of the current problem with
concurrency: the non-intuitive
nature of the system. However,
they do not address the other dimension: staleness.
Longer term
Concurrency limits are easy to
enforce because Envoy is present at every hop in the network.
However, as we have seen, the
limits are difficult to determine
because it would require service
owners to fully understand all
the constraints of the system.
Moreover, static limits grow rapidly stale in today’s internet-scale
companies, especially those at a
growth stage, due to the evolving and elastic nature of their
network topologies.
Netflix has invested heavily in
this problem, and recently open-sourced a library “to measure or
estimate the concurrency limits
at each point in the network”.
And, more importantly, “as systems scale and hit limits, each
node [in the system] will adjust
and enforce its local view of the
limit.” They have borrowed from
common TCP congestion control

algorithms by equating the system’s concurrency constraints to
a TCP congestion window.
One of Envoy’s design principles
included rich and capable filter
stacks to provide extensibility.
Envoy has both L3/L4 (TCP-level)
and L7 (HTTP-level) filter stacks.
HTTP Filters can be written to
operate on HTTP-level messages.
HTTP filters can stop and continue iteration to subsequent
filters. This rich filter architecture is what allows for complex

scenarios such as health-check
handling, calling a rate-limiting
service, buffering, routing, and
generating statistics for application traffic such as DynamoDB,
etc.
In the coming months Lyft is going to work in conjunction with
the team behind the concurrency-limits library at Netflix to bring
a system based on their library
into an Envoy L7 filter. This means
that at Lyft, or any other company using Envoy, we would move
to an automated system, where
our service engineers would not
have to statically configure concurrency limits. This means, for
instance, that if there is service
slowdown due to unexpected
circumstances, the adaptive limit
system can automatically clamp
down on the detected limit and

prevent failure due to the unforeseen slowdown. An adaptive
system, in general, eliminates the
two problems we have had in the
past: that determining appropriate limits is non-intuitive and
that static limits grow rapidly stale
in an elastic distributed system.

Final thoughts

To learn more about Envoy’s circuit-breaker
implementation,
please see “Circuit breaking” in
the Envoy documentation. As
open source, the Envoy project
is open to code contributions. It
also welcomes new ideas. Feel
free to open an issue with a suggested improvement for circuit
breaking even if the code is not
forthcoming. One example of a
capability that has not yet been
implemented at the time of
writing is circuit breaking based
on system resources. Instead of
approximating the concurrent
request threshold based on the
CPU profile, we can directly circuit-break on the CPU when
dealing with ingress traffic.

While circuit breakers can improve the behavior of a system
under load, it is important not to

forget improvements that can be
made in the system itself. Circuit
breakers should be treated as a
failsafe, not as a primary means
of constraint. Service owners
should use knowledge of circuit
breakers to make improvements
to their own codebase. Limiting
concurrency with bounded pools
is the most common way to solve
concurrency issues. If large numbers of requests are generated
from the same context, callers
can opt instead to use a batch
API. If a batch API does not exist,
it may be in the best interest of
the service receiving the call to
implement one. These patterns
are often a further extension of
the education process. At Lyft,
the networking team operates
in partnership with other teams
to educate and introduce improvements to all services. With
education, system design improvements, and concurrency
management at the network
level, service owners can design,
implement, and operate large
distributed systems while minimizing the risk of cascading failures due to load.



Sponsored article

What Is a Service Mesh?
A service mesh is a configurable infrastructure layer
for a microservices application. It makes communication between service instances flexible, reliable,
and fast. The mesh provides service discovery, load
balancing, encryption, authentication and authorization, support for the circuit breaker pattern, and other
capabilities.
The service mesh is usually implemented by providing a proxy instance, called a sidecar, for each service
instance. Sidecars handle inter‑service communications, monitoring, security‑related concerns – anything that can be abstracted away from the individual
services. This way, developers can handle development, support, and maintenance for the application
code in the services; operations can maintain the service mesh and run the app.
Istio, backed by Google, IBM, and Lyft, is currently the
best‑known service mesh architecture. Kubernetes,
which was originally designed by Google, is currently

the only container orchestration framework supported by Istio.
Service mesh comes with its own terminology for
component services and functions:
• Container orchestration framework. As more
and more containers are added to an application’s
infrastructure, a separate tool for monitoring
and managing the set of containers – a container orchestration framework – becomes essential.
Kubernetes seems to have cornered this market,
with even its main competitors, Docker Swarm
and Mesosphere DC/OS, offering integration with
Kubernetes as an alternative.
• Services vs. service instances. To be precise,
what developers create is not a service, but a service definition or template for service instances.
The app creates service instances from these, and
the instances do the actual work. However, the
term service is often used for both the instance
definitions and the instances themselves.

• Sidecar proxy. A sidecar proxy is a proxy instance
that’s dedicated to a specific service instance. It
communicates with other sidecar proxies and is
managed by the orchestration framework.
• Service discovery. When an instance needs to
interact with a different service, it needs to find –
discover – a healthy, available instance of the other service. The container management framework
keeps a list of instances that are ready to receive
requests.
• Load balancing. In a service mesh, load balancing works from the bottom up. The list of available instances maintained by the service mesh
is stack‑ranked to put the least busy instances –
that’s the load balancing part – at the top.

• Encryption. The service mesh can encrypt and
decrypt requests and responses, removing that
burden from each of the services. The service
mesh can also improve performance by prioritizing the reuse of existing, persistent connections,
reducing the need for the computationally expensive creation of new ones.
• Authentication and authorization. The service
mesh can authorize and authenticate requests
made from both outside and within the app, sending only validated requests to service instances.
• Support for the circuit breaker pattern. The service mesh can support the circuit breaker pattern,
which isolates unhealthy instances, then gradually
brings them back into the healthy instance pool if
warranted.
The part of the service mesh where the work is getting done – service instances, sidecar proxies, and the
interaction between them – is called the data plane
of a service mesh application. (Though it’s not included in the name, the data plane handles processing
too.) But a service mesh application also includes a
monitoring and management layer, called the control
plane.
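
As a conceptual sketch of the data plane just described, the fragment below shows a single Kubernetes pod in which an application container and its dedicated sidecar proxy run side by side; all names, images, and ports are hypothetical.

```yaml
# Hypothetical pod in which the service instance and its sidecar proxy run
# side by side; in practice the proxy container is usually injected
# automatically by the mesh rather than written by hand.
apiVersion: v1
kind: Pod
metadata:
  name: orders
  labels:
    app: orders
spec:
  containers:
  - name: orders                    # the service instance
    image: example.com/orders:2.1.0
    ports:
    - containerPort: 8080
  - name: proxy                     # the dedicated sidecar proxy
    image: example.com/sidecar-proxy:1.0.0
    ports:
    - containerPort: 15001
```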




Read online on InfoQ

KEY TAKEAWAYS
Istio helped make the

service-mesh concept more
concrete and accessible,
and with the recent release
of Istio 1.0, we can expect a
surge in interest.
Istio attempts to solve
some particularly difficult
challenges when running
applications in a cloud
platform: application
networking, reliability, and
observability.
Another challenge Istio
addresses is security.
With Istio, communication
between services in
the mesh is secure and
encrypted by default.
Istio can also help with origin
or end-user JWT identity-token verification.


INCREASING SECURITY
WITH A SERVICE MESH
CHRISTIAN POSTA
EXPLORES ISTIO
by Christian Posta
Istio helped make the service-mesh concept
more concrete and accessible, and with the

recent release of Istio 1.0, we can expect a
surge in interest. Jasmine Jaksic did a great job
introducing Istio and service mesh earlier, and so
I would like to take the opportunity to introduce
a particular area of Istio that will bring lots of
value to developers and operators of cloud
services and applications: security.



Istio use cases

Istio attempts to solve some
particularly difficult challenges
when running applications in a
cloud platform. Specifically, Istio
addresses issues around application networking, reliability, and
observability. In the past, we’ve
tried to use purpose-built application libraries to solve some
of these challenges like circuit
breaking, client-side load balancing, metrics collection, and
others. Doing this across different languages, frameworks, runtimes, etc. creates a prohibitive
operational burden that most
organizations can’t scale.
Additionally, it is hard to get consistency across the implementations found in each language,
never mind upgrade them all in
lock-step when things need to
change or bugs are identified.
A lot of challenges around reliability, observability, and policy

enforcement are very horizontal
concerns and are not business
differentiators. Although they
are not directly differentiators,
neglecting them can cause massive business impact, and so we
need to address them. Istio aims
to solve these concerns.

Importance of network
security

Another one of the horizontal,
difficult-to-get-right concerns for
application teams is security. In
some cases, security is an afterthought, and we try to shoehorn
it into our apps at the last possible moment. Why? Because doing security right is hard. For example, something foundational
like encrypting application traffic
should be commonplace and
straightforward, right? Configuring TLS/HTTPS for our services
should be straightforward, right?
We may have even done it before
on past projects. However, getting it right, in my experience, is
not as easy as it sounds. Do we
have the right certificates? Are
they signed by a CA that the clients will accept? Are we enabling
the right cipher suites? Did we
import that into our truststore/
keystore properly? Wouldn’t it
just be easy to enable the --insecure flag on our TLS/HTTPS
configuration?

Misconfiguring this type of thing
can be extremely dangerous. Istio provides some help here. Istio
deploys sidecar proxies (based
on the Envoy proxy) alongside
each application instance that
handles all of the network traffic for the application. When an

application tries to talk to http://
foo.com, it does so via the sidecar proxy (over the loopback
network interface) and Istio will
redirect that traffic to the other service’s sidecar proxy, which
then proxies the traffic to the
actual upstream.
By having these proxies in the request path, we can do things like
automatically encrypt the transport without the application having to know anything. This way
we get consistently applied traffic encryption without relying on
each application development
team to get it right.
One of the issues with setting
up and sustaining an implementation of TLS and mutual TLS for
our services architecture is certificate management. Istio’s Citadel
component, in the control plane,
handles getting the certificates
and keys onto the application
instances. Citadel can generate
the certificates and keys needed
for each workload to identify itself, as well as rotate certificates
periodically so that any compromised certificates have a short
life. Using these certificates, Istio-enabled clusters have automatic mutual TLS. We can plug in
our own CA provider root certificates as needed as well.
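
As an illustration, the sketch below enables mutual TLS for a single service using the authentication API that shipped around Istio 1.0, together with a DestinationRule telling clients to present the Citadel-issued workload certificate; the service name and namespace are hypothetical.

```yaml
# Hypothetical policy requiring mutual TLS for calls to the payments service,
# using the authentication API that shipped around Istio 1.0.
apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: payments-mtls
  namespace: default
spec:
  targets:
  - name: payments
  peers:
  - mtls: {}
---
# Clients are told to present their Citadel-issued workload certificate.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payments-mtls
  namespace: default
spec:
  host: payments
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
```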

