Performance Optimizations in a Cloud-Centric World
by Andy Still
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: 800-998-9938.
Editor: Brian Anderson
Copyeditor: Holly Bauer
Proofreader: Nicole Shelby
Cover Designer: Randy Comer
July 2015: First Edition
Revision History for the First Edition
2015-07-19: First Release
2015-09-02: Second Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Performance Optimizations in a
Cloud-Centric World, the cover image, and related trade dress are trademarks of O’Reilly Media,
Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all
responsibility for errors or omissions, including without limitation responsibility for damages
resulting from the use of or reliance on this work. Use of the information and instructions contained in
this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Cover image courtesy of Vera & Jean-Christophe, from flickr. The original image (“Heavy Traffic”)
was in color.
978-1-491-93137-0
[LSI]


For Candance,
who insists that all poor performance on the Internet is my fault


Introduction
Back in the day, it was simple...
Content was served from your server, over your network, and then to client machines that you
controlled. Even when that moved out from a LAN to a WAN, the connectivity came from a single
provider—it was all under your control.
Then came the Internet…
Now content was being served across the public Internet to end-user machines—you lost control of
the location, type of machine, and type of connectivity.
Then came the cloud…
The cloud brought with it a new way of thinking about web system hosting. Hosting shifted from being
a hand-crafted service to a commodity service providing throwaway systems. You moved from being
a hardware owner to being a service consumer.
With this change came the increasing loss of control over your system. Nowadays your application is
often the only element that you control directly, and even that can be dependent on consuming third-party services.
This is not a bad thing, but you need to be aware of the issues that can arise as a result of this shift to
the cloud.
Whether you’ve already moved systems to the cloud or are thinking of doing so, this book will point out some of the risks to your site’s performance created by this loss of control and put forth some methods to identify and then mitigate those risks.
In no way, though, does this book set out to deter you from moving into the cloud. This author has long
been a cloud advocate and works almost exclusively on cloud-based systems.
Terminology
For simplicity, I’ve used the term “website” throughout to refer to any system that distributes data across the Internet,
including browser-based applications, mobile apps, etc.


Chapter 1. Losing Control
So, here we are in the world of the cloud, with ever-expanding elements of our websites being placed
in the hands of others.

Advantages to Giving Up Control
There are many positive aspects to making this move (after all, why else would so many people be
doing it?), so before going into the negatives, let’s remind ourselves of some of the advantages of
cloud-based systems:
Quick and easy access to enterprise-level solutions
For example, building your own geographically distributed SQL Server cluster with real-time failover would take lots of hardware, high-quality connectivity between data centers, a high degree of expertise in databases and networking, and a reasonable amount of time and ongoing maintenance. Services such as Amazon RDS make this achievable within an hour, and at a reasonable hourly rate (see the provisioning sketch after this list).
Flexibility and the ability to experiment and evolve systems easily
The ability to create and throw away systems means that you can make mistakes and learn from
experience what’s the best setup for your system. Rather than spending time and effort doing
capacity estimates to determine the hardware needed, you can just try different sizes, find the best
size, and then change the setup if you reach capacity, or even at different times of day.
Access to data you could never create yourself
Third-party data sources do create risks, but they also enhance the attractiveness of your system

by providing data that you otherwise wouldn’t be able to provide but that your users rely upon—
either because that data is about a third-party system (e.g., Twitter feeds), or because it would not
be economical (e.g., mapping data).
Improved performance and resilience
While they are out of your control, most cloud-based systems have higher levels of resilience built in than you would build into an equivalent system. Likewise, although there are potential issues created by how CDNs route traffic, CDNs will usually offer performance improvements over systems that do not use them.
Cloud-based systems are also built for high performance and throughput and designed to scale out
of the box. Many services will scale automatically and invisibly to you as the consumer, and
others will scale at the click of a button or an API call.


Access to systems run by specialists in the area—not generalists
In house or using a general data center, you may have a small team dedicated to a task—or more
likely, a team of generalists who have a degree of expertise across a range of areas. Bringing in a
range of specialist cloud providers allows you to work with entire companies that are dedicated
to expertise in specific areas, such as security, DNS, or geolocation.
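To make the Amazon RDS example above concrete, here is a minimal provisioning sketch using the boto3 SDK for Python. The instance identifier, credentials, and sizing values are placeholders of my own, not from the original text; MultiAZ=True asks AWS to maintain a synchronously replicated standby in another availability zone with automatic failover.

import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="example-db",       # placeholder name
    Engine="postgres",
    DBInstanceClass="db.m5.large",           # placeholder size
    AllocatedStorage=100,                    # GB
    MasterUsername="admin",
    MasterUserPassword="change-me",          # placeholder credential
    MultiAZ=True,                            # managed cross-AZ standby and failover
)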

Performance Risks
Despite these advantages, it’s important to be aware of the inherent performance risks, especially in
this era where good website performance is key to user satisfaction. The next sections cover
important considerations for performance and outline key performance risks, following the journey
that a user must travel in order to take advantage of your website.

1. The Last Mile
Before any user can access your website, they need to connect from their device to your servers. The
first stage of this connection, between the user’s device and the Internet backbone, is known as the
last mile. For a desktop user, this is usually the connection to their ISP, whether that be by DSL,
cable, or even dial-up. For a mobile user, it’s the connection via their mobile network.

This section of the connection between user and server is the most inefficient and variable, and it will
add latency onto any connection.
To illustrate this, in 2013 the FCC released research that showed that a top-speed fiber connection
would add 18ms latency—and that was the best-case scenario—a DSL connection would add 44ms,
and dial-up was considerably slower. For mobile users, the story was even worse: a 4G connection
had a latency overhead of 600ms on new connections, a 3G connection had a latency of over 2s on
new connections, and even an existing open connection had a latency as high as 500ms.
THE “LAST MILE” OR THE “FIRST MILE”?
Although the last mile is the traditional name of the first stage of the connection between a user and the server, it may be more
appropriate to think of it as the first mile or the on ramp, as the delay is often in establishing the connection in the first instance,
particularly in mobile networks.
Mobile connections have to communicate with the network to validate that a connection is allowed and to define the speed at which
they can connect before any connection can be opened. For 4G networks, this exchange happens with the local cell tower, but for 3G
networks, the exchange takes place with the core network; therefore, 3G networks have much higher latency on newly opened
connections.

This is a high-impact area of the delivery of any website, and it’s the one area where there is
genuinely little to be done about the issue. Nevertheless, it’s important to be aware of the variations
that are possible and actually being experienced, and to ensure that your website’s functionality is not
affected by them.


Performance Risks
Unreliable delivery of content
The variability in connection speed of the last mile means that it’s hard to determine how fast
content will be delivered to users. This presents many of the same challenges that we’ll explore
in the next section—they’re often amplified by the challenges of the last mile.

2. Backbone Connectivity
Traditionally, this is seen as the path that the data from your website takes after it leaves your data

center until it arrives at the end user’s machine. However, in the Internet age, backbone connectivity
can be seen more as the means by which a user reaches your data—you have little control over how
or from where the user is coming to you to request it.
Users are now accessing data from an expanding range of devices, via many different means of
connectivity, and from an ever-widening range of locations.
To understand the performance challenges caused by unknown means of connectivity, you need to
look at three key factors:
Bandwidth
Bandwidth is the amount of traffic that can physically pass through the hardware en route to the
end destination. Bandwidth can usually be increased on demand from your ISP.
Contention
Contention is the amount of other traffic that is sharing your connectivity. This will often vary
greatly depending on the time of day. Like bandwidth, contention is something that can be
minimized on demand from your ISP.
Latency
Latency is based on the distance that the data has to travel to get from end to end and any other
associated delays involved in establishing and maintaining a connection.

Which Is the Biggest Challenge to Performance?
Bandwidth is often discussed as a limiting factor, but in many cases, latency is the killer—bandwidth
can be scaled up, but latency is not as easy to address.
There is a theoretical minimum latency that will exist based on the physical distance between two places. Data in optimally configured fiber connections travels at roughly two-thirds the speed of light in a vacuum; in other words, it takes approximately 1.5× the time that light itself would take. The speed of light is very fast, but there is still a measurable delay when transmitting over long distances. For example, the theoretical fastest round trip between New York and London is 56ms; between New York and Sydney, it’s 160ms.
This means that to serve data to a user in Sydney from your servers in New York, 160ms will pass to establish a connection, and another 160ms will pass before the first byte of data is returned. That means that 320ms is the fastest possible time, even in optimal conditions, in which a single byte of data could be returned. Of course, most requests will involve multiple round trips for data and multiple connections.
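To make the arithmetic concrete, here is a small back-of-the-envelope sketch. The great-circle distances are approximate values supplied for illustration, and the propagation speed uses the roughly two-thirds-of-light-speed figure for fiber discussed above.

# Theoretical minimum round-trip times over optimally configured fiber.
SPEED_OF_LIGHT_KM_S = 300_000
FIBER_SPEED_KM_S = SPEED_OF_LIGHT_KM_S * 2 / 3    # ~200,000 km/s

routes_km = {
    "New York -> London": 5_570,     # approximate great-circle distance
    "New York -> Sydney": 16_000,
}

for route, km in routes_km.items():
    one_way_ms = km / FIBER_SPEED_KM_S * 1000
    print(f"{route}: ~{2 * one_way_ms:.0f}ms round trip")
    # New York -> London: ~56ms; New York -> Sydney: ~160ms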
However, data often doesn’t travel by an optimal route.
The BGP (Border Gateway Protocol) that manages most of the routing on the Internet is designed to
find optimal routes between any two points. Like all other protocols, though, it can be prone to
misconfiguration, which sometimes results in the selection of less-than-optimal routes.
More commonly, such suboptimal routes are chosen due to the peering arrangements of your network
provider. Peering determines which other networks a network will agree to forward traffic to. You
should not assume that the Internet is a non-partisan place where data moves freely from system to
system; the reality is that peering is a commercial arrangement, and companies will choose their
peers based on financial, competitive, and other less-idealistic reasons. The upshot for your system is that the peering arrangements your hosting company has in place (and those of the companies it has arrangements with) can affect your system’s performance, so it is important to be aware of them.
When choosing a data center, you can get information about these arrangements; however, cloud
providers are not so open. Therefore, it’s important to monitor what’s happening to determine the best
cloud provider for your end users.

Performance Risks
The variability of connectivity across the backbone really boils down to a single performance risk,
but it’s a fundamental one that you need to be aware of when building any web-based system.
Unreliable delivery of content
If you cannot control how data is being sent to a user, you cannot control the speed at which it
arrives. This makes it very difficult to determine exactly how a website should be developed. For
example:
Can data be updated in real time?
Can activity be triggered in response to a user activity, e.g., predictive search?
Which functionality should be executed client side and which server side?
Can functionality be consistent across platforms?


3. Servers and Data Center Infrastructure
Traditionally, when hosting in a data center, you can make an informed choice about all aspects of the
hardware and infrastructure you use. You can work with the data center provider to build the
hardware and the network infrastructure to your specific requirements, including the connectivity into
your systems. You can influence or at least be aware of the types of hardware and networking being
used, the peering relationships, the physical location of your hardware, and even its location within the building.
The construction of your platform is a process of building something to last, and once built, it should
remain relatively static, with any changes being non-trivial operations.
The migration of many data centers to virtualized platforms started a process of migration from static
to throwaway platforms. However, it was with the growth of cloud-based Infrastructure as a Service
(IaaS) platforms that systems became completely throwaway. An extension to IaaS is Platform as a
Service (PaaS), where, rather than having any access to the infrastructure at all, you simply pass some
code into the system, a platform is created, and the code deployed upon it is ready to run.
With these systems, all details of the underlying hardware and infrastructure are hidden from view,
and you’re asked to put your trust in the cloud providers to do what is best. This way of working is
practical and can be beneficial; cloud providers are managing infrastructure across many users and
have a constant process of upgrading and improving the underlying technology. The only way they can
coordinate rolling out the new technology is to make it non-optional (and therefore hidden from end
users).

Performance Risks
Loss of control over the data center creates two key performance risks.
Loss of ability to fine-tune hardware/networking
Cloud providers will provide machines based on a set of generic sizes, and they usually keep the
underlying architecture deliberately vague, using measurements such as “compute units” rather
than specifying the exact hardware being used.
Likewise, network connectivity is expressed in generic terms such as small, medium, large, etc., rather than specifying the actual values, so the exact nature of the networking is out of your control.
All of this means that you cannot benchmark your application and then specify the exact hardware
you want your application to run on. You cannot make operating system modifications to suit that
exact hardware, because at any point, your servers may restart on different hardware
configurations.
No guarantee of consistency
Every time you reboot a machine it can potentially (and, in practice, often will) come back up on completely different hardware, so there is no guarantee that you’ll get consistent performance. This is due in part to varying hardware, and also to the potential for noisy neighbors—that is, other tenants sharing the same underlying infrastructure and consequently affecting the performance of your system. In practice, these inconsistencies are much rarer than they used to be.
Some cloud vendors will offer higher-priced alternatives that will guarantee that certain pieces of
hardware will be dedicated for your use.


4. Third-Party SaaS Tools
While you lose control over the hardware and the infrastructure with IaaS, you still have access to the
underlying operating system; however, in the world of the cloud, systems are increasingly dependent
on higher-level Software as a Service (SaaS) systems that deliver functionality rather than a platform
on which you can execute your own functionality.
All access is provided via an API, and you have absolutely no control over how the service is run or
configured.

Examples in this section
For consistency and to illustrate the range of services offered by single providers, all examples of services in this section are
provided by Amazon Web Services (AWS); other providers offer similar ranges of services.

These SaaS systems can provide a wide range of functionality, including database (Amazon RDS or DynamoDB), file storage (Amazon S3), message queuing (Amazon SQS), data analysis (Amazon EMR), email sending (Amazon SES), authentication (AWS Directory Service), data warehousing (Amazon Redshift), and many others.
There are even cloud-based services now that will provide shared code-execution platforms (such as
Amazon Lambda). These services trigger small pieces of code in response to a schedule or an event
(e.g., an image upload or button click) and execute them in an environment outside your control.

Performance Risks
As you start to introduce third-party SaaS services, there are two key performance risks that you must
be aware of.
Complete failure or performance degradation
Although one of the selling points of third-party SaaS systems is that they are built on much more
resilient platforms than you could build and manage on your own, the fact remains that if they do
go down or start to run slowly, there is nothing you can do about it—you are entirely in the hands
of the provider to resolve the issue.
Loss of data
Though the data storage systems are designed to be resilient (and in general, they are), there have
been examples in the past of cloud providers losing data due to hardware failures or software
issues.

5. CDNs and Other Cloud-Based Systems


Many systems now sit behind remote cloud-based services, meaning that any requests made to your
server are routed via these systems before hitting it.

CDNs
The most common examples of these systems are CDNs (content delivery networks). These are systems that sit outside your infrastructure, handling traffic before it hits your servers to provide globally distributed caching of content.
CDNs are part of any best-practice setup for a high-usage website, providing higher-speed distribution of data as well as lowering the overhead on your servers.
The way they work is conceptually simple: when a user makes a request for a resource from your system, the DNS lookup resolves to the point of presence (POP) within the CDN infrastructure that has the least latency and load. The user then makes the request to that server. If the server has a cached copy of the resource the user is requesting, it returns it; if it doesn’t, or if the version it has has expired, then it requests a copy from your server and caches it for future requests.
If the CDN has a cached copy, then the latency for that request is much lower; if not, then the
connection between the CDN and the origin server is optimized so that the longer-distance part of the
request is completed faster than if the request was made directly by the end user.

Other Systems
There are many other examples of systems that can sit in front of yours, including:
DDoS protection
Protects your system from being affected by a DDoS (distributed denial-of-service) attack.
Web application firewall
Provides protection against some standard security exploits, such as cross-site scripting or SQL
injection.
Traffic queuing
Protects your site from being overrun with traffic by queuing excess demand until space becomes
available.
Translation services
Translate content into the local language of the user.
It is not uncommon to find that requests have been routed via multiple cloud-based services between
the user and your server.

Performance Risks
There are a number of performance risks associated with moving your website behind cloud-based services.

Complete failure or performance degradation
As with third-party SaaS tools, if a cloud system you rely on goes down, so will your system. Likewise, if that cloud system starts to run slowly, so will your system.
Such failures could be caused by hardware or infrastructure issues, or by issues associated with software releases (SaaS providers will usually release often and unannounced). They could also be caused by third-party malicious activities such as hacking or DoS attacks—SaaS systems can be high profile and therefore potential targets for such attacks.
Increased overhead
Any additional processing will add time to the overall processing of a request. When adding an additional system in front of your own, you’re not only adding the time taken for that service to execute the functionality it provides, but you’re also adding to the number of network hops the data has to make to complete its journey.
Increased latency
All services will add additional hops onto the route taken by the request. Some services offer
geolocation so that users will be routed to a locally based service, but others do not. It’s not
uncommon to hear of systems where requests are routed back and forth across the Atlantic several
times between the user and the server as they pass through cloud providers offering different
functionality.

6. Third-Party Components
Websites are increasingly dependent on being consumers of data or functionality provided by thirdparty systems.

Client Side
Client-side systems will commonly display data from third parties as part of their core content. This
can include:
Data from third-party advertising systems (e.g., Google AdWords)
Social media content (e.g., Twitter feeds or Facebook “like” counts)
News feeds provided by RSS feeds
Location mapping and directions (e.g., Google Maps)
Unseen third-party calls, such as analytics, affiliate tracking tags, or monitoring tools



Server Side
Server-side content will often retrieve external data and combine it with your data to create a mashup
of multiple data sources. These can include freely available and commercial data sources; for
example, combining your branch locations with mapping data to determine the nearest branch to the
user’s location.

Performance Risks
Dependence on these third-party components can create the following performance risks.
Complete failure or inconsistent performance
If your system depends on third-party data and that third party becomes unavailable, your system
could fail completely. Likewise, poor performance by the third party can have a domino effect on
your system’s performance.
Unexpected results
Third parties can sometimes change the data they return or the way their data feeds work,
resulting in errors when you make requests or when the requests return unexpected data.


Chapter 2. If You Can’t Control It, Monitor It
It’s vitally important for you to understand what’s going on with the elements of your website and
infrastructure that you can’t control—particularly their impact on other areas of your website. A good
monitoring system is essential to enabling the performance optimizations that are recommended in
Chapter 3.
In addition to monitoring, it’s important that you set up appropriate alerting to notify you when issues
may be occurring.
A good monitoring solution needs to gather a full range of data about how your website is performing.
This needs to illustrate not only what is happening across the full end to end—from server to user—
but also across the full range of users. It not only needs to gauge the user experience, but also provide

sufficient data to be able to determine the root cause if the experience is not at the level expected.
Reasonable end-to-end monitoring requires five types of monitoring.

1. RUM and EUM
Ultimately, the most important data answers the question: what is the user seeing? This is the task of
real user monitoring (RUM) and end user monitoring (EUM).
RUM gathers data from all user activity and passes that data back to a central collection server,
allowing analysis of your users’ exact experience. This will flag any unexpected behavior and can
help you drill down to identify the cause of the problem. RUM is also useful for determining whether
there is a pattern to the types of users who are experiencing a particular problem. For example, is it
related to a specific geographic area, type of connection, browser, or device?
EUM is similar, but relies on synthetically generated, regularly repeated tests of specific
functionality. EUM will quickly show if tasks are varying over time and whether key functionality is
still acting as expected.
EUM is valuable in that you can be proactively alerted when problems occur without having to
depend on real users executing a specific function (and hopefully resolve issues before they are
noticed). Also, because you control the way the test is executed, you can remove other variables and
only run a known, repeatable test.
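As an illustration of the EUM idea, here is a minimal synthetic-check sketch in Python. The URL and threshold are assumptions of mine; real EUM products add scripted journeys, multiple test locations, and browser-level timings on top of this basic loop.

import time
import requests

URL = "https://www.example.com/checkout"    # hypothetical key transaction page
THRESHOLD_SECONDS = 2.0

def run_check():
    """Run one known, repeatable test and report whether it met expectations."""
    start = time.monotonic()
    try:
        response = requests.get(URL, timeout=10)
        elapsed = time.monotonic() - start
        return response.status_code == 200 and elapsed < THRESHOLD_SECONDS, elapsed
    except requests.RequestException:
        return False, time.monotonic() - start

if __name__ == "__main__":
    healthy, elapsed = run_check()
    # In practice a failure here would raise an alert rather than print.
    print(f"{'OK' if healthy else 'ALERT'}: {URL} took {elapsed:.2f}s")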
RUM is valuable because it executes any functionality within your system that users are doing without
your having to specify what that functionality is. This means that you see issues that are occurring in
areas that you may not have expected.
A good monitoring solution includes elements of both RUM and EUM.


2. APM
Application performance management (APM) is a monitoring technology that sits on your server and
tracks all activity and reports to a central analysis server. This will collect code-level metrics (e.g.,
method and SQL query execution times) and details of communications with external systems, in
addition to hardware metrics (e.g., memory and CPU usage).
APM systems are very useful for getting a detailed understanding of what your application is doing

under the hood, and they’re a good starting point for root-cause analysis of issues with your system.
Some APM solutions will integrate with RUM and EUM tools to give a full end-to-end breakdown of
a user’s interaction with your system.

3. Network Monitoring (NPM)
While RUM and EUM give you a good understanding of what the end user is experiencing and APM
illustrates what’s going on on your server, network monitoring looks at the areas in between.
In traditional data centers, this would involve operational management tools such as Nagios, or NPM
(network performance monitoring) tools such as Zabbix or SolarWinds to see details of how your
network infrastructure is behaving. (It’s worth noting that these two types of tools are increasingly
overlapping.) However, the network infrastructure is largely hidden from you in cloud environments.
INTERNET PERFORMANCE MONITORING
In cloud environments, it’s good practice to use IPM (Internet performance monitoring) tools instead, to see how data is behaving when traveling over the Internet between your servers and your end users.
The depth of information offered by these tools allows you to understand the efficiency of the routing used by your cloud
provider/third-party provider and to determine whether it is efficient in general or for your specific audience. This is a good way to
determine which is the best cloud provider in terms of reliable network performance.

4. Proprietary System Monitors
Most cloud providers will offer their own tools for monitoring the performance of their systems, such as Amazon’s CloudWatch for AWS services. The depth of information and functionality provided by these systems varies greatly, but they should all be a first port of call for identifying issues with a system.
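As a sketch of what that first port of call might look like, the following pulls the last hour of CPU figures for one EC2 instance from CloudWatch via boto3. The instance ID is a placeholder; other AWS services expose their metrics through the same API under different namespaces.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                  # one datapoint per five minutes
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}%")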

5. Data Aggregators/Dashboard Creation
It can be difficult to stay on top of all of the monitoring tools that are necessary to understand the
diverse elements in your system. Data aggregators and dashboarding systems provide the ability to
gather all these data sources into one central location and display them side by side. There are many examples of these types of tools, from general-purpose visualization tools (e.g., Tableau) and cloud-based services (e.g., Datadog) to enterprise-level platforms (e.g., SOASTA DOC).
The more advanced of these systems will also allow you to correlate multiple datasets onto a single
graph.


Chapter 3. Minimizing Performance Risks
The performance risks described previously can be minimized using the following five strategies:
1. Use a best-of-breed DNS provider
2. Cache content as close to the user as possible
3. Understand the nature of cloud services
4. Apply the same good practice to the cloud as you would to any other system
5. Ensure you can handle any failure

Use a Best-of-Breed DNS Provider
DNS is your first point of contact with an end user; without it, your user will never access your site.
So it is essential that it is reliable, performant, and flexible.
Providers, such as cloud providers or CDNs, often prefer (or require) that they also manage your
DNS, but this can create a single point of failure (SPOF); if a provider experiences problems with its
own system, it may also have issues with its DNS provision, making it difficult to use DNS as a
defense against that failure.
Having an independent DNS provider allows you to have policies that favor different cloud
providers/CDNs in different circumstances, such as location, speed, SLAs, etc. This allows you to
optimize your systems based on the output from the monitoring solutions.
Therefore, it’s good practice to use a managed DNS solution that is provider-independent and to
ensure that it offers the services described in the following sections.

A Low-Latency Network
It is essential that the DNS provider you select operates a low-latency network, allowing fast
resolution of DNS records wherever your users are situated.

As all users will need to resolve your DNS record before accessing your system, a slow resolution
time will add delay onto the first request to your site for all users. If you’re using domain sharding (i.e., serving your content from many subdomains to improve performance), then this delay applies to each of the subdomains you are using. (The actual impact of the overall delay will depend on how well constructed your page is; a well-constructed page will ensure that as many requests as possible are made concurrently.)


Support for DNS-Based Failover
If your hosting provider has a complete outage, then your DNS provider should allow you to switch traffic to another location. Alternatively, if one of the cloud providers you’re sitting behind has an outage, then you need to be able to quickly reroute traffic to bypass that service.
TTL AND DNS
TTL, or time to live, is the element of a DNS record that tells the requester how long the record is valid for. In other words, if the
TTL for your DNS record is set to 24 hours, once a browser has resolved that DNS record, it will continue to use that same value
for the next 24 hours regardless of whether you’ve updated the details.
If the TTL is set too high, then DNS cannot be used as a failover method, as the change will take too long to take effect with any
existing users. Setting a very low TTL, however, adds extra overhead, as DNS lookups have to happen much more regularly, which
adds to the page-load time for a user and increases the stress on the DNS servers.

Most DNS providers will allow you to access an interface to make these changes reactively. If you’re
planning to use DNS to provide this functionality, it is essential that you set the TTL for your domain
name to a suitable value. (The default value for most providers is 24 or 48 hours.) Some DNS
providers do not allow changes to TTL records or have minimum values.
More feature-rich managed DNS providers offer automated failover functionality that will constantly monitor the availability of your systems and update DNS records to point at alternative systems in the event that a failure is identified.
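As a sketch of what a reactive DNS failover looks like in practice, the following uses Amazon Route 53 through boto3 as an assumed example of a managed DNS API. The zone ID, hostname, and standby address are placeholders; the automated offerings described above do the equivalent of this for you when a health check fails.

import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",           # placeholder hosted zone
    ChangeBatch={
        "Comment": "Fail over to standby region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "A",
                "TTL": 60,                        # keep TTL low so the change takes effect quickly
                "ResourceRecords": [{"Value": "203.0.113.10"}],   # standby address
            },
        }],
    },
)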

Support for Geolocation
A simple way to mitigate the impact of latency is to serve content from as close to users as possible.

This can be achieved by caching content close to the user (see “CDNs”); however, it can also be
achieved by hosting multiple systems at different locations around the world.
ANYCAST
Anycast is an addressing methodology that allows a “one-to-nearest” transmission of traffic to a target node, usually using BGP to
simultaneously advertise the same IP address at multiple locations. In practice, this means that traffic to a single IP address can be
routed to multiple locations based on the location of the request.

Managed DNS providers use anycast networks to allow DNS records to resolve to the most geographically relevant system. This opens up some interesting approaches that can be employed to increase your control over your systems.
The obvious implementation of this is to host multiple versions of your system within different
regions. DNS resolution will then route the user to the one that is nearest to them. Many cloud
providers will provide this functionality as part of their service.
However, using a managed DNS service, rather than a cloud provider, to manage geolocation allows you to have more granular control over the situations in which you will use a given cloud provider. For example, if your chosen cloud provider is weak in one area, then you can use a secondary provider in that region.


Likewise, if your IPM data indicates that your cloud provider is routing traffic from certain locations
via inefficient routes, you can elect to use a different provider in that region.

Cache Content as Close to the User as Possible
It’s an old statement, but it’s still as true as ever: the fastest request is the one you don’t make, so it is
best to cache content as close to the user as possible.
Make sure all your static resources have appropriate expires headers on them so the browser will
cache as you expect.
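As an illustration, here is a minimal sketch of setting a long cache lifetime on a static asset at upload time, assuming the asset is stored in Amazon S3 and served directly or via a CDN. The bucket, key, and filename are placeholders; the same Cache-Control idea applies to whatever web server or storage you use, and long lifetimes are safest with content-hashed filenames.

import boto3

s3 = boto3.client("s3")

with open("app.a1b2c3.css", "rb") as body:        # content-hashed filename (placeholder)
    s3.put_object(
        Bucket="example-static-assets",           # placeholder bucket
        Key="css/app.a1b2c3.css",
        Body=body,
        ContentType="text/css",
        CacheControl="public, max-age=31536000",  # allow browsers and CDNs to cache for a year
    )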
If you’re using any client-side data retrieval from APIs, then try to store what you can locally—
JavaScript has access to local storage on the client now, so data can be stored across sessions.
Future W3C standards such as service workers are designed to give developers more control over what is cached on the client beyond the standard browser cache.
SERVICE WORKERS
Service workers are a technology that allows you to install a JavaScript module that is executed as part of any future requests to
your domain. What this means in practice is that you can intercept that request and intelligently decide how to handle it, including
returning content directly from your JavaScript module rather than passing the request on to the server.
Service workers are a published W3C standard but are currently only supported in Chrome.

CDNs
If you can’t cache on the client, then try to cache as close as possible. This leads us on to CDNs,
which we discussed previously in “CDNs”.
CDNs are designed as globally distributed caching and delivery systems. Modern CDNs offer much
wider functionality than this, but this is the core of their function.
The advantages of CDNs are obvious: most of the time, users should be served content from
destinations close to them. CDNs are also typically set up for high-traffic usage, so a good CDN will
address issues of both bandwidth and latency.
Using CDNs for dynamic content
The caching capability of CDNs is only really useful for static content—dynamic content is by its
nature less cacheable—though modern CDNs are doing their best to change that with technology such
as Edge Side Includes (ESI).
Despite this, though, there are still advantages to serving dynamic content via a CDN: it reduces the impact of TCP slow-start. The negative impact of slow-start increases as the latency of the connection increases.
CDNs maintain open HTTP connections to your server, meaning that only rarely do they have to go
through the slow-start process. Using a CDN, therefore, means that even for dynamic content, the slow-start element is only completed for a short round trip between client and CDN, and the communication between CDN and server is carried out using an existing open connection.
TCP SLOW-START
Slow-start is a core part of the TCP standard; it’s there to minimize network congestion and ensure that transmissions are made at a speed appropriate for the available bandwidth. However, a side effect is that newly established connections have much higher latency than they theoretically need to have.
Slow-start, as its name suggests, starts a transfer slowly and then builds up speed as it becomes apparent that the network can
handle it. After the initial connection is made and a handshake completed, the server sends a small number of packets, the client
receives and acknowledges receipt, and the server can then send two packets for every packet successfully acknowledged. This
allows for exponential growth until the capacity of the network is determined.
This means that an initial request to a server will involve more round trips to the server than are actually necessary. For example, a
20k request that could easily be served in one round trip will take four round trips on an initial connection to a server.
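The round-trip count can be sketched in a few lines of Python. The segment size and initial congestion window below are assumptions chosen to reproduce the 20k example above; modern servers typically start with a larger initial window, which reduces the number of rounds.

def round_trips_on_new_connection(payload_bytes, mss=1460, initial_cwnd=2):
    """One round trip for the TCP handshake, then one per congestion window,
    with the window doubling each round during slow-start."""
    rounds, cwnd, delivered = 1, initial_cwnd, 0
    while delivered < payload_bytes:
        rounds += 1
        delivered += cwnd * mss
        cwnd *= 2
    return rounds

print(round_trips_on_new_connection(20_000))                    # ~4 round trips, as in the text
print(round_trips_on_new_connection(20_000, initial_cwnd=10))   # fewer with a larger initial window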

Choose the best CDN
Not all CDNs are created equal, and this is where knowledge of your audience and some of the
topology of the Internet comes in useful.
Most CDN providers publish maps of the locations of their POPs; the number and distribution of them will vary from CDN to CDN. Looking at a selection, you will soon see that there are areas that are well supported and others that are not.
A good understanding of the location of your audience combined with a knowledge of Internet
topology will allow you to identify a CDN provider that will sit beyond any bottlenecks that could
affect your users.
Using multiple CDNs
As discussed previously, it’s possible to use geolocation of DNS to manage multiple cloud
providers. This approach can also be used to take advantage of multiple CDNs.
For example, some CDNs specialize in specific areas (e.g., China), so if you have users in that area,
it might be worth using that CDN.

Understand the Nature of Cloud Services
Although there are risks inherent in taking advantage of cloud providers’ multitude of different
services, they are generally built for high performance and high resiliency and are generally less risky
than trying to create your own, especially when running that software on cloud-based infrastructure.
However, it’s essential to confirm that this is the case for you and that the cloud services are being
used correctly.


Try Before You Buy
Before using any service, you need to put it through its paces and ensure that it is behaving as expected and performing as advertised.
The nature of the cloud makes these kinds of proof-of-concept tests much more viable than with non-cloud offerings. They can be undertaken with minimal upfront costs and long-term commitment and can be thrown away if they fail.
While performing this testing, it’s good to get as many monitoring systems as possible going to ensure
that you’re not just focusing on functional correctness; other metrics such as availability, reachability,
and performance should be considered. For example, the IPM data should be used to determine the
network impact of using this service from different locations.
All tests should include a reasonable amount of load to understand the end-to-end performance of the
system under normal and high traffic.
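As a rough illustration of adding load to such a test, the following fires concurrent requests at the service under evaluation and reports the spread of response times rather than a single average. The URL and concurrency level are assumptions; a real evaluation would normally use a dedicated load-testing tool.

import statistics
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://service-under-test.example.com/api/ping"    # hypothetical endpoint

def timed_request(_):
    start = time.monotonic()
    requests.get(URL, timeout=10)
    return time.monotonic() - start

with ThreadPoolExecutor(max_workers=20) as pool:            # 20 concurrent clients
    timings = sorted(pool.map(timed_request, range(200)))   # 200 requests in total

print(f"median: {statistics.median(timings):.3f}s")
print(f"95th percentile: {timings[int(len(timings) * 0.95)]:.3f}s")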

Optimize Your Systems for the Cloud
It’s easy to use cloud services in a sub-optimal way, because they’re relatively new systems, have a
high velocity of change, and because developers are usually self-taught. Furthermore, developers
often apply on-premise thinking and practices to the cloud, not realizing that cloud systems are built
with a slightly different paradigm in mind.
For example, cloud-based database-as-a-service offerings are better suited to a few larger queries than to many small queries, meaning that any system that is very “chatty” with the database will likely perform considerably worse in the cloud than on premise with a direct database connection.
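A minimal sketch of the difference, using a generic DB-API-style driver and a hypothetical users table: the chatty version pays one network round trip per row, while the batched version pays one round trip in total, which matters far more when the database is a remote managed service than when it sits on the same LAN.

def fetch_chatty(conn, user_ids):
    # One query (and one network round trip) per id.
    rows = []
    with conn.cursor() as cur:
        for user_id in user_ids:
            cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
            rows.append(cur.fetchone())
    return rows

def fetch_batched(conn, user_ids):
    # A single query returning all rows in one round trip (PostgreSQL-style ANY).
    with conn.cursor() as cur:
        cur.execute("SELECT id, name FROM users WHERE id = ANY(%s)", (list(user_ids),))
        return cur.fetchall()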
Monitoring data should be used to confirm that the performance of these services is as expected and
required.

Understand the Configuration Options
Cloud services are usually aimed at delivering complex pieces of functionality in a simple way through a GUI or API. Therefore, you can usually get up and running with them fairly quickly.
However, the out-of-the-box configuration options may not be the most resilient or performant.
You should be proactive in understanding which options are available as well as being reactive to
issues identified by monitoring and testing.

Understand the SLAs
Most cloud providers will provide SLAs; however, it is important to understand the terms of the SLA that they provide and ensure that you have implemented your service correctly to take advantage of it.
For example, Microsoft Azure provides an uptime SLA for cloud services, but only if you’re running
two or more instances.


Apply the Same Good Practice to the Cloud as You Would to Any Other System
The same good practices that you would apply to on-premise solutions should be applied to cloud-based solutions. A standard risk assessment process should be followed.
For example, the cloud-based database as a service systems provide multiple levels of resilience
around data (multiple copies in multiple places) but still involve a SPOF if there’s a system failure
that causes data corruption. Good practice in this case would dictate that a separate backup be taken
and stored remotely—in traditional terms, an “offsite backup.” This backup should ideally be stored
with another cloud provider (or elsewhere).

Ensure You Can Handle Any Failure
When you’re dependent on services that are out of your control, you have to be conscious of two
things:
1. They may stop working at any point
2. You will have no control whatsoever over when they will start working again
Therefore, you have to architect your systems to handle this failure gracefully.

NOTE

Failure is not just complete failure—it also includes poor performance. You should be monitoring third-party services to ensure they’re responding in a timely manner.

Avoid “Death by Retry”
Once a failure state is known, share that knowledge across any elements of your system that depend
on that service and put in place a measured policy for attempting retries. Do not create a death by
retry situation where your system is brought down by constant attempts to connect to an unavailable
system.
A good architectural practice is to route all requests through a central point of connection.
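A minimal sketch of such a central point of connection: retries back off exponentially, and once the retries are exhausted the failure state is shared so other callers stop hammering the dependency for a cooldown period. The class and parameter names are illustrative, not from the original text.

import random
import time

class CircuitOpenError(Exception):
    pass

class RetryingClient:
    def __init__(self, call, max_attempts=4, base_delay=0.5, cooldown=30):
        self.call = call                   # function that talks to the third-party service
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.cooldown = cooldown
        self.open_until = 0.0              # shared "known failure" state

    def request(self, *args, **kwargs):
        if time.monotonic() < self.open_until:
            raise CircuitOpenError("dependency marked as down; not retrying yet")
        for attempt in range(self.max_attempts):
            try:
                return self.call(*args, **kwargs)
            except Exception:
                # Exponential backoff with a little jitter between attempts.
                time.sleep(self.base_delay * (2 ** attempt) + random.uniform(0, 0.1))
        # Give up and stop retrying for a cooldown period.
        self.open_until = time.monotonic() + self.cooldown
        raise CircuitOpenError("dependency unavailable after retries")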

Have a Backup Plan
If the functionality provided by the third-party system is key, then consider having a replacement system in place and automatically failing over to it.
Another option is to capture all the details of the request for processing offline when the system returns. This is valid for systems such as those for payment processing or appointment bookings.
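As an illustration of capturing requests for offline processing, here is a minimal sketch that falls back to a durable queue (Amazon SQS is assumed purely as an example) when the third-party call fails. The queue URL, payload, and booking client are placeholders.

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/deferred-bookings"  # placeholder

def book_appointment(payload, booking_api):
    try:
        return booking_api.create(payload)     # hypothetical third-party client
    except Exception:
        # Capture the request so it can be replayed when the provider recovers.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))
        return {"status": "queued", "detail": "booking will be processed later"}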

