
WHITE PAPER

How To Resolve Network
Partitions In Seconds With
Lightbend Enterprise Suite
Strategies To Seamlessly Recover
From Split Brain Scenarios


Table Of Contents
Executive Summary And Key Takeaways
The Question Is When, Not If, The Network Will Fail
    Distributed Systems Raise Network Complexity
    Reactive Systems Can Heal Themselves, But Not Network Partitions

The Problem
High-Level Solution
Four Strategies To Resolve Network Partitions
    Strategy 1 - Keep Majority
    Strategy 2 - Static Quorum
    Strategy 3 - Keep Oldest
    Strategy 4 - Keep Referee
    Split Brain Resolution From Lightbend
        Akka SBR
        Cluster Management SBR

The Benefits
    Serve Customers Better
    Eliminate Expensive Downtime
    Immediate Time-To-Value

Summary



Executive Summary And Key Takeaways
In the era of highly distributed, real-time applications, network issues often result in so-called “split brain”
scenarios, where one or more nodes in a cluster suddenly become unresponsive. This results in data
inconsistencies between distributed services that can cause cascading failures and downtime.
While the industry has turned to Reactive systems to solve issues of application- and service-level resilience,
elasticity, and consistency, network partitions occur outside the area of concern addressed by these
architectures.
Because network partitions are inevitable and cannot be truly eradicated, the best solution is to have predetermined strategies, aligned with business requirements, for dealing with this recurring issue
quickly and with minimal disruption to overall system responsiveness.
Lightbend Enterprise Suite, the commercial component of Lightbend Reactive Platform, offers four such
strategies as part of its Split Brain Resolver (SBR) feature. These strategies, called “Keep Majority,” “Static
Quorum,” “Keep Oldest,” and “Keep Referee,” can be seamlessly executed in a matter of seconds if and
when network partitions occur.
Users can take advantage of the SBR feature either in development at the code level or at the production/
operations level.
By avoiding data inconsistencies, failures, and downtime, users of Lightbend Enterprise Suite can better
serve their customers, leading to increased retention, growth, and market share.




The Question Is When, Not If, The Network Will Fail
“The network is reliable.”
One Of The Fallacies Of Distributed Computing
Network issues are unavoidable in today’s complex environments. To put it more colloquially: networks
are flaky. Most users are aware of this, are even accustomed to it, and are willing to tolerate random
unresponsiveness from time to time. However, user tolerance for network flakiness has limits. If a specific website or app repeatedly experiences problems, patience wears thin. As network issues mount, it
becomes increasingly likely that a user will interact with the offending website or app less often--or even
abandon it altogether.
This is not to say that users are the only ones impacted by network issues. In a world full of APIs and
interconnected systems, network problems affecting one system can easily impact other connected or
dependent systems. Users interacting with one of those applications through a front-end will likely notice
the problem within a short amount of time, but in some cases, it might take many hours or even days for
such problems to become apparent. For the sake of convenience, this paper will use user experience to
highlight the pernicious effects of network issues.
Most websites and apps access a database or have some form of data persistence layer. Communication from the application layer to the persistence layer is often over a network. Thus, for the duration of
a network issue or outage, the application is unable to perform its normal duties
and the user experience starts to suffer.
Network problems can also span a variety of locations. They can be widespread across the entire network, occur locally within data centers, or even arise in a single router or on-premise server. A complete
outage is the nightmare scenario, but even small network hiccups can result in lost revenue. For example,
high network traffic often creates very slow response times. In many cases, these slowdowns are actually worse than broken connections because, even with proper monitoring tools, the offending issue is
non-obvious and difficult to diagnose and fix.

Distributed Systems Raise Network Complexity
With distributed systems, various application components (e.g. individual microservices and Fast Data
pipelines) communicate with each other via some form of messaging. One component asks another component for some information. A component may communicate to other components a variety of information, such as a state change of a business entity. A component may delegate work to other components
and wait for a response from the delegates, and so on and so forth.



Figure 1 - Component Messaging

Depending on how well they are designed, distributed applications can either soften or amplify the impact
of network issues. Poorly designed systems crumble when network problems occur. Well-designed
systems recover gracefully when the affected components stop responding or respond very slowly. The
latest breed of distributed systems is designed from the ground up to be fully prepared for inevitable
network issues. These systems typically have well-defined default or compensating actions that activate
when needed, allowing the overall application to continue to function for users even when an application component stops working. This new breed of systems is known as Reactive systems.

Reactive Systems Can Heal Themselves, But Not Network Partitions
Reactive systems are designed to maintain responsiveness at all times, scale elastically to meet
fluctuations in demand, and remain highly resilient against failures thanks to built-in self-healing capabilities.
Lightbend, a leader in the Reactive movement, codified the Reactive principles of responsiveness, resilience, and elasticity, all backed by a message-driven architecture, in the Reactive Manifesto in 2013.
Since then, the topic of Reactive has gone from being a virtually unacknowledged technique for building
applications—used by only fringe projects within a select few corporations—to becoming part of the
overall platform strategy for some of the biggest companies across the world.
Compared to a traditional system, in which small failures can cause a system-wide crash, Reactive systems are designed to isolate the offending application or cluster node and start a new instance somewhere else. However, at the overall network level, which may span the entire globe, there exists
a fundamental problem with network partitions in distributed systems: it is impossible to tell whether an unresponsive node is the result of a partition in the network (known as a “split brain” scenario) or of an
actual machine crash.
Network partitions and node failures are indistinguishable to the observer: a node may see that there is a
problem with another node, but it cannot tell whether that node has crashed and will never be available again, or
whether there is a network issue that might heal after some time. Processes may also become unresponsive for other reasons,
such as overload, CPU starvation, or long garbage collection pauses, leading to further confusion.
As such, even the most well-designed Reactive systems require additional tooling to quickly and decisively tackle large-scale network issues. The next section explores how networking problems, in particular
network partitions, impact Reactive systems.



The Problem
“The network is homogeneous.”
One Of The Fallacies Of Distributed Computing
In Reactive systems, challenges arise when heterogeneous software components, such as collaborating
groups of individual microservices, exchange important messages with each other. Important messages
must be delivered and processed, and any failure to deliver or process such a message will result
in some form of inconsistent state.
When the network fails in a distributed system environment, it effectively creates a partition. In most cases,
the network has failed while all of the systems are still running, but the systems on each side of the outage
can no longer communicate across the partition. It is as if an impenetrable wall has been placed between
them. This is known as a split brain scenario.

Figure 2 - Network Partition

As shown in Figure 2, the network between the left and right groups of nodes is broken. The connections
between nodes on opposite sides of the partition are cut.
To illustrate the impact of network partitions, let’s consider two examples.
In the first one, we’ll look at an order processing system that consists of just two microservices:
order and customer. The order microservice is responsible for creating new orders, and the customer
microservice for reserving customer credit.



When users interact with this system and place an order, the order service creates a new order and sends
an order created message to the customer service. The customer service receives the order created message and reserves the credit. It then sends a customer credit reserved message back to the order service.
The order service receives the message and changes the order state to approved.
Let’s now consider the impact of a network partition on this system.
To begin with, the order service sends the customer service an order created message. The customer
service then receives the message and reserves the credit as it should.

Figure 3 - Send Message Successfully

The customer service then attempts to send a credit reserved message back to the order service. But
suddenly the customer service falls off the network and the message cannot be sent.
Figure 4 - Send Message Fails

The order service never hears back from the customer service, so it resends the order created message.
It receives no response, so it keeps retrying.



Figure 5 - Message Send Retry Loop


While the order service is caught in this retry loop, monitoring detects that the customer service is offline
and efforts begin to bring it back online. When that eventually occurs, the order service successfully
sends the order created message and the customer service receives it. But a naive implementation of the
customer service would then reserve the credit a second time, which is incorrect.

Figure 6 - Message Sent Again

As demonstrated by this example, in the absence of a network partition handling strategy, businesses
must incorporate a robust at-least-once delivery mechanism, paired with idempotent message handling, into the design.
Unfortunately, implementing such a mechanism is not trivial. For example, the common retry loop approach is brittle and has a number of complexities that, if not handled properly, will leave the system
in an inconsistent state. In these circumstances and many others that are beyond the scope of
this paper, the only viable option is to have an effective network partition / split brain resolution strategy.
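
To illustrate the idea, here is a hypothetical sketch of the customer service made idempotent, so that a redelivered order created message is acknowledged without reserving credit a second time. The OrderCreated type and the reserveCredit and sendCreditReserved methods are illustrative stand-ins, not part of any Lightbend API.

```scala
// Hypothetical sketch: deduplicate by order ID so at-least-once delivery
// (and the retries it implies) cannot reserve the same credit twice.
final case class OrderCreated(orderId: String, customerId: String, amount: BigDecimal)

class CustomerService {
  // Order IDs whose credit has already been reserved; a real service would persist this set.
  private var processed = Set.empty[String]

  def handle(msg: OrderCreated): Unit =
    if (processed.contains(msg.orderId)) {
      // A duplicate delivery, e.g. a retry after a partition heals: acknowledge, do not reserve again.
      sendCreditReserved(msg.orderId)
    } else {
      reserveCredit(msg.customerId, msg.amount)
      processed += msg.orderId
      sendCreditReserved(msg.orderId)
    }

  // Stubs standing in for the real credit reservation and reply logic.
  private def reserveCredit(customerId: String, amount: BigDecimal): Unit = ()
  private def sendCreditReserved(orderId: String): Unit = ()
}
```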



In the second illustrative example, we will use an in-person meeting with a group of seven coworkers.
To begin with, all seven members of the meeting are freely communicating back and forth.


Figure 7 - People in Meeting

Suddenly, a wall appears that splits the meeting in two, dividing the co-workers into one group of four
people and another of three. The wall is solid and soundproof, preventing any communication between
the two groups. No one on either side of the wall can ascertain what the other group is planning, making
collaboration impossible.

Figure 8 - Wall Splits the Group

Let’s say that this is a very important meeting and it must continue, regardless of the presence of a giant
impenetrable wall. What should each group do? Both groups could sit idly and wait for the wall to disappear. Or they could try to determine a strategy that would get the entire group back together.
Most responses to this situation would result in some confusion and disruption to the meeting. What if
each group decided to continue the meeting on their own? That could result in decisions being made by
these two split groups based on incomplete information, information that is only known to people on the
other side of the wall.
In either case, a network partition is creating a number of issues, affecting the team’s velocity and ultimately its business agility.



High-Level Solution
Now, imagine that magic walls suddenly appearing in the middle of meetings were common, and even
expected by the group. To overcome this, the co-workers designed a plan of action in advance: when
a wall appears in a meeting, the smaller group leaves their chairs and moves to the side where the larger
group is located.
In our previous example, the wall created one group of four (Group A) and one group of three (Group B).
Group B knows they are in the minority, so they move to the other side of the partition. Group A knows they
are in the majority, so they all stay seated and wait for Group B to arrive on their side of the wall.

Figure 9 - Group B Joins Group A

A key point here is that both groups had to independently arrive at the same conclusion. The majority
stays where they are and the minority group moves to rejoin the majority. This works for an odd number
of people, but what if there is an even split? Say there were eight in the meeting and the split was four and
four. In this situation, tie-breaking rules could help. Perhaps the plan is to go to the side with the highest-ranking employee.
Returning to the first example with the order and customer microservices, we see that a network partition
that cuts off communication between them will interrupt the normal order processing workflow.



Figure 10 - Network Partition Between Services

When that happens, just as with the meeting room example, each side needs to independently detect
and resolve the issue. In other words, the system must be capable of running on each side of the partition, detecting that there is a problem, and deciding which side stays up and which side shuts down. The
winning side should also be capable of restarting all the processes that were running on the losing side.

Figure 11 - Cluster Network Partition

Now consider a more realistic example. Say there are 20 microservices running on a cluster of five nodes
(a node, in this case, could be a physical server or a VM), with four microservices running on each node (see
Figure 11 above). The partition has left three of the nodes on one side and two nodes on the other side.
In order to detect and resolve a network partition in an environment like this, there are a number of things
that must occur:
1. The cluster must be aware of all nodes within it.
2. A service discovery mechanism needs to know what is running and where it is running.



3. Each node must be constantly checking to see if it can talk to the other nodes in the cluster.
4. When a network partition occurs, each of the monitoring components on each node needs to
determine which nodes are still accessible and which ones are not.
In an environment where all of the above is in place, here is what should happen when the network partition occurs: the node-level monitoring detects that the three nodes on one side of the partition can still
communicate with each other, while the other two nodes detect that they can only communicate with each
other.

Figure 12 - Split Brain Recovery

The two nodes on the minority side each shut down the microservices that are currently running on those
nodes. The three nodes on the majority side move the eight processes that were running on the minority
side over and begin to run them, if there is sufficient capacity on the majority side to run them. See Figure
12 above.
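
To make the decision concrete, below is a simplified sketch (not the actual SBR implementation) of the rule each node might apply independently to its local view of the cluster; the Node type, function name, and addresses are illustrative only. The lowest-address tie-break mirrors the Keep Majority behavior described in the next section.

```scala
// Simplified illustration only: every node applies the same rule to its own view of
// the cluster, so the two sides reach complementary decisions without communicating.
object PartitionDecision {
  final case class Node(address: String)

  def mySideSurvives(allNodes: Set[Node], reachable: Set[Node]): Boolean = {
    val mySide = reachable // the nodes this node can still talk to, itself included
    if (mySide.size * 2 > allNodes.size) true        // strict majority: keep running
    else if (mySide.size * 2 < allNodes.size) false  // minority: shut down
    else                                             // even split: the side holding the lowest address wins
      mySide.map(_.address).min == allNodes.map(_.address).min
  }

  def main(args: Array[String]): Unit = {
    // The five-node example above: a three-to-two split.
    val all   = (1 to 5).map(i => Node(s"10.0.0.$i")).toSet
    val sideA = all.filter(_.address <= "10.0.0.3") // three nodes
    val sideB = all -- sideA                        // two nodes
    println(mySideSurvives(all, sideA)) // true:  the three-node side keeps running
    println(mySideSurvives(all, sideB)) // false: the two-node side shuts itself down
  }
}
```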
If there is insufficient capacity to host these additional microservices, it will be necessary to add one
or more nodes to the majority side without administrator intervention, which in turn requires that the
system possess the ability to scale automatically.

In short, based on the environment, its capabilities, and any constraints it is subject to, there can be
multiple strategies to resolve issues caused by network partitions. The next section explores four such
strategies.



Four Strategies To Resolve Network Partitions
In order to detect and resolve a split brain condition, a system needs a reasonable degree of
self-awareness. That is, it needs to know the composition of the runtime environment and what processes are running in that environment. Take a microservices system as an example: it is composed
of multiple servers running a collection of microservices and other associated application components, such as front-end containers.
This self-awareness also needs to include some form of monitoring of the runtime environment that is
capable of detecting network partitions. This process has to occur independently on the nodes on each
side of the partition. Within a reasonably functional self-aware system, all of the necessary capabilities to
detect and resolve a network partition or split brain condition are present.
For the distributed Reactive applications it manages, Lightbend Enterprise Suite, the commercial component of Lightbend Reactive Platform, includes a powerful and easy-to-use Split Brain Resolver (SBR)
feature. Users can take advantage of this feature in two different ways - one that is more suited for development, and one that is more suited for production/operations. Each implementation is discussed in
more detail later.
Once a split brain condition is detected, there are a number of recovery options that can be used to resolve
the problem. The SBR feature includes four effective resolution strategies from which users can select:
1. Keep Majority
2. Static Quorum
3. Keep Oldest
4. Keep Referee
SBR offers four strategies because there is no “one-size-fits-all” solution to the split brain condition. Every
strategy has failure scenarios in which it makes the “right” decision and others in which it makes the “wrong”
one. Users must select the strategy that is best suited to the characteristics of the specific services, nodes,
and clusters running in their system.
Regardless of the strategy chosen, SBR allows users to recover from split brain scenarios in a matter of
seconds. Even for a large 1000-node cluster, SBR allows for recovery in about 65 seconds, based on Lightbend’s test data (see Table 1).



Cluster size    Failure Detection    Stable After    Removal Margin    Total Time To Resolution
(nodes)         (seconds)            (seconds)       (seconds)         (seconds)
   5                 5                    7                7                    19
  10                 5                   10               10                    25
  20                 5                   13               13                    31
  50                 5                   17               17                    39
 100                 5                   20               20                    45
1000                 5                   30               30                    65

Table 1 - Default configuration for total failover latency in clusters

As the table shows, the total time to resolution is the sum of the failure detection time, the stable-after period, and the removal margin; for the 1000-node cluster, 5 + 30 + 30 = 65 seconds.

Strategy 1 - Keep Majority
The Keep Majority strategy was discussed in the previous examples as one approach to resolve split brain
conditions. This approach is simple and fairly intuitive. When a network partition causes a split brain
condition, the nodes on the majority side of the partition continue to run and the nodes on the minority
side shut down. In order to function properly, the decision-making process must happen independently
on each side of the partition. In the event of an even split, a tie-breaking rule is used to decide which side
stays up and which side shuts down. To break ties in Keep Majority, SBR keeps up the side with the lowest
IP address.
When to use Keep Majority: This strategy is appropriate for systems in which the number of
nodes changes dynamically and often. A cluster of nodes that employs some form of auto-scaling
is one example. At any point in time the number of nodes may change, scaling up as traffic increases and scaling down as traffic declines.
When to avoid Keep Majority: This strategy is less effective in scenarios where more than half
of the nodes in a cluster go down at once. If Keep Majority is used here, the remaining nodes will shut
down because they are in the minority. Also, if the cluster splits into more than two partitions, it is possible
that no side holds a majority, in which case all of the nodes will shut down.
Example Use Case: A retail or eCommerce site that scales up during peak traffic.

Strategy 2 - Static Quorum
The Static Quorum strategy is similar to the Keep Majority strategy. The difference is a configuration setting that specifies the minimum number of nodes needed for a majority: a quorum size.



When a network partition occurs, the side with a node count of at least the configured quorum size continues running. For example, in a cluster of nine nodes with a quorum size of five, a partition that splits the
cluster five-to-four leaves the five-node side running.

When to use Static Quorum: This strategy is based on a simple, fixed node count and is therefore easy to
understand. When problems do occur, it is clear how the decision was made about which nodes continued running and which nodes shut down.
When to avoid Static Quorum: Unlike Keep Majority, this strategy is not as flexible. One limitation is that the cluster size must not exceed quorum size * 2 - 1. Another is that successive failures can leave
the cluster unable to act: if failed nodes are not replaced, the remaining nodes may no longer
meet the quorum requirement. For example, consider a cluster of nine nodes with a quorum
size of five. First there is a six-to-three split, which leaves a six-node cluster running. A second
split then occurs within that six-node cluster, with four nodes on one side and two on the other. In this
case neither side has a quorum of five, so the entire cluster shuts down, as illustrated in the sketch below.
Example Use Case: Systems used in Financial Services that must meet consistency requirements.
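
A minimal sketch of the static-quorum rule, using the nine-node scenario above; the function and object names are illustrative and not part of the SBR API.

```scala
// Simplified illustration of the static-quorum rule.
object StaticQuorumExample {
  // The surviving side must contain at least quorumSize reachable nodes; everything else downs itself.
  def sideStaysUp(reachableNodes: Int, quorumSize: Int): Boolean =
    reachableNodes >= quorumSize

  def main(args: Array[String]): Unit = {
    println(sideStaysUp(reachableNodes = 6, quorumSize = 5)) // true:  first split (6 vs 3), the six-node side survives
    println(sideStaysUp(reachableNodes = 4, quorumSize = 5)) // false: second split (4 vs 2), four nodes miss the quorum...
    println(sideStaysUp(reachableNodes = 2, quorumSize = 5)) // false: ...and so do two, so the entire cluster goes down
  }
}
```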

Strategy 3 - Keep Oldest
Keep Oldest is used when cluster singletons are located on the oldest node in the cluster. When a network
partition occurs, the nodes on the side that includes the oldest node continue running.
The cluster singleton approach is used when there can only be one instance of a given application or
resource running in a cluster at any time. A cluster singleton may be an application or resource that has
licensing restrictions or is expensive to start up.
When to use Keep Oldest: This strategy is most useful in systems where it is vital to keep cluster
singletons running with minimal disruption.
When to avoid Keep Oldest: This strategy could result in a significant reduction in the number
of available nodes when network partitions occur. For example, in a nine-node cluster, a seven-to-two split with the oldest node on the minority side would result in two nodes staying up and
seven nodes shutting down.
Example Use Case: A single expensive application or resource that needs to be kept running with as little
interruption as possible.




Strategy 4 - Keep Referee
Keep Referee is a good strategy when shutting down a particular node would cause the entire system to
cease functioning properly. A node that hosts strictly licensed software or a node that has specific firewall
restrictions are potential examples.
When to use Keep Referee: This strategy is best for systems hosting a special node that must
always be running.
When to avoid Keep Referee: In Keep Referee, the referee node is a single point of failure. If this
node goes down, the entire cluster will shut down with it.
Example Use Case: Networks forced to have a single special node due to licensing, firewalls, or other related
constraints.

Split Brain Resolution From Lightbend
Lightbend customers that have access to Lightbend Enterprise Suite as part of their subscription can
take advantage of the SBR feature in two different ways, based on how they’re using Lightbend Enterprise
Suite in their environment.
The first solution is oriented towards development. Lightbend Enterprise Suite includes an implementation of SBR for the Akka actor model toolkit. Users can use this to add split brain resolution functionality to
their applications during development.
The second implementation of SBR is available as part of Lightbend Enterprise Suite’s cluster management system and is oriented towards production/operations use.
Users, including both developers and architects, must understand these implementations in more detail
in order to decide which one to use.

Akka SBR
The Akka toolkit allows for building actor-based Reactive applications that run in a cluster of nodes.
These nodes are Java Virtual Machines (JVMs) that may run on a single host, such as a single VM or physical server, or on a cluster of hosts. An Akka cluster is a distributed system and is therefore vulnerable to
network partitions.
When an Akka cluster experiences a network partition or split brain, SBR goes to work to resolve the
issue--often within seconds. As described in the previous sections, SBR is configured to use one of four
strategies: Keep Majority, Static Quorum, Keep Oldest, or Keep Referee. If the current cluster conditions
are right, the Akka cluster nodes on the winning side of the partition will continue to run, while nodes on
the losing side shut down.
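
As a rough sketch of what enabling SBR at the code/configuration level might look like, the snippet below wires a split-brain-resolver configuration into an ActorSystem. The exact downing-provider class and setting names vary between SBR versions, so treat the keys shown here as assumptions and confirm them against the Lightbend/Akka documentation for your release.

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object SbrEnabledSystem extends App {
  // Assumed configuration shape; verify the provider class and keys for your SBR version.
  val config = ConfigFactory.parseString(
    """
      |akka.actor.provider = cluster
      |# Hand downing decisions to the Split Brain Resolver instead of acting manually.
      |akka.cluster.downing-provider-class = "com.lightbend.akka.sbr.SplitBrainResolverProvider"
      |akka.cluster.split-brain-resolver {
      |  active-strategy = keep-majority   # or static-quorum, keep-oldest, keep-referee
      |  stable-after    = 20s             # how long membership must be stable before SBR acts
      |}
      |""".stripMargin
  ).withFallback(ConfigFactory.load())

  val system = ActorSystem("orders", config)
}
```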



Each node in an Akka cluster runs actors. When nodes leave the cluster, for example when one or more nodes are
shut down by SBR, all of the actors located on the departing nodes are terminated.
Note that the way an actor system responds to the departure of cluster nodes is application-dependent
and is independent of the SBR feature. Persistence and sharding, both of which are parts of the Akka
toolkit, determine that behavior.
These solutions are implemented to work in a clustered environment. In the event of a node leaving the
cluster, the actors that went down with the departing nodes are recreated on other nodes in the cluster.
Both the persistence and sharding solutions are designed to automatically self-heal.
Custom actor systems that are distributed across a cluster also need to react appropriately when nodes leave the cluster. The Akka toolkit includes many features that can be used to build
highly resilient systems. For example, actors can subscribe to cluster membership events so that they are notified when nodes
join and leave the cluster; these notifications can then be used to react appropriately when the topology
of the cluster changes. Users are therefore advised to use one of these solutions in conjunction with SBR.
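
Below is a minimal sketch of such a listener actor built on the standard Akka Cluster event API; what the actor does in response to each event is application-specific, and the logging here is only a placeholder.

```scala
import akka.actor.{Actor, ActorLogging, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Minimal cluster listener: logs membership changes so the application can react
// (e.g. redistribute work) when SBR downs nodes and the topology changes.
class ClusterListener extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[UnreachableMember])

  override def postStop(): Unit =
    cluster.unsubscribe(self)

  def receive: Receive = {
    case MemberUp(member)          => log.info("Member is up: {}", member.address)
    case UnreachableMember(member) => log.info("Member detected as unreachable: {}", member.address)
    case MemberRemoved(member, previousStatus) =>
      log.info("Member removed: {} (was {})", member.address, previousStatus)
    case _: MemberEvent            => // other membership transitions are ignored here
  }
}

object ClusterListener {
  def props: Props = Props(new ClusterListener)
}
```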
For more details about persistence, sharding, and other aspects of the Akka toolkit, please refer to
Akka documentation.

Cluster Management SBR

While Akka clusters provide a platform for running and managing actor systems, Lightbend Enterprise
Suite’s cluster-management functionality delivers a clustered runtime environment for running and managing application systems (a collection of services).

Figure 13 - Akka Cluster Nodes and Actors



As previously discussed, an Akka cluster is composed of JVM nodes, each running an actor system, and actors reside within each
of the cluster nodes (see Figure 13).
Again, as previously discussed, Akka SBR detects and resolves network partitions by identifying which
nodes are on the winning side and which are on the losing side of a partition. Various strategies are then used to restore
actors that were previously resident on the losing side, as shown in Figure 14, where the rectangles represent cluster node JVMs and the circles represent actors.

Figure 14 - Cluster Nodes and Actors with Network Partition

In contrast, a Lightbend application cluster is composed of Linux nodes. Each node hosts running containers. In this environment, SBR also detects and resolves network partitions by identifying the nodes on
the winning and losing sides of a partition. The running containers on the losing side are restarted on the
winning side. See Figure 15 where the rectangles represent Linux nodes, and the circles represent running
containers.
Note that Figures 14 and 15 are conceptually similar. The point is that logically the same SBR processes
are used for both Akka and cluster management. The difference is the level of granularity. Akka is more
fine-grained with actors running in JVMs, while cluster management is more coarse-grained with containers running on Linux nodes.
SBR is built into Lightbend Enterprise Suite’s cluster management feature. When a Lightbend application
cluster experiences a network partition, the resolution process goes to work using the Keep Majority
strategy. The nodes on each side of the partition determine which one is the winning side and which one
is the losing side. The losing side shuts down its services, while the winning side attempts to restart the
services that were running on the losing side. The primary limitation is capacity: does the winning side
have sufficient capacity to run all of the services that were running on the losing side?




Figure 15 - Service Orchestration Cluster and Containers

After the losing side’s cluster nodes have stopped all of the running services, they then enter a recovery
mode. While in this recovery mode, the nodes constantly attempt to reconnect with the nodes on the
winning side. When the network starts working again, the losing side nodes rejoin the cluster of nodes on
the winning side.
Next, let’s discuss the impact on organizations that use Lightbend Enterprise Suite to resolve network
partitions with the SBR feature.



The Benefits
Serve Customers Better
Network partitions are guaranteed to happen, and there is no single solution that prevents them. In today’s
highly competitive, fast-moving environment, users who experience frequent outages, extended downtime, or recurring performance degradation are likely to defect. The Split Brain Resolver from
Lightbend minimizes the impact of network partitions on users and helps businesses serve them better,
improving retention and helping attract new users.

Eliminate Expensive Downtime

Because pre-configured resolution strategies are applied automatically in just seconds, recovering from failed nodes no
longer requires time-consuming manual intervention by operations staff, freeing them to focus on other
operational matters.

Immediate Time-To-Value
As an out-of-the-box feature, SBR delivers immediate value. Further, it is built using Lightbend’s deep Akka
expertise, so enterprises can be assured of utilizing best-in-class network partition resolution strategies.

Summary
In the era of highly distributed, real-time applications, network issues often result in so-called “split brain”
scenarios, where one or more nodes in a cluster suddenly become unresponsive. This causes data inconsistencies between distributed services that can lead to cascading failures and downtime. Lightbend
offers a Split Brain Resolver feature that can help resolve network partitions in just seconds. This feature
is available to Lightbend customers as part of Lightbend Enterprise Suite and can be taken advantage of
either in development at the code level or at the production/operations level. Lightbend customers that
use the SBR feature can avoid expensive downtime and serve their customers better.



Contact us to discuss the best way to add Split Brain Resolver - and other Lightbend Enterprise Suite features - to your Akka, Lagom and Play-based Reactive applications.
CONTACT US

Lightbend (Twitter: @Lightbend) provides the leading Reactive application development platform
for building distributed systems. Based on a message-driven runtime, these distributed systems,
which include microservices and fast data applications, can scale effortlessly on multi-core and cloud
computing architectures. Many of the most admired brands around the globe are transforming their
businesses with our platform, engaging billions of users every day through software that is changing
the world.
Lightbend, Inc. 625 Market Street, 10th Floor, San Francisco, CA 94105 | www.lightbend.com


