
Calculating Costs for Observability
What Next-generation APM Means for Your Business
March 2019


 

Observability is the only way to proactively manage 
production systems. Complex systems are the top 
challenge facing DevOps teams. Your customers 
depend upon you to deliver high reliability without 
slowing development productivity. You must invest in
shortening outage durations and eliminating wasted 
developer time. 
Practitioners of DevOps and business leaders alike are beginning to understand
that in order to scale and operate a service that drives growth and competitive
edge, you must invest in the right tools and approach. Production system
performance and uptime directly impact the customer experience. As you
continuously integrate and deliver new features, systems become more complex,
and unless that complexity is tightly managed, business risk goes up.
Observability is a critical requirement that enables teams to level up and
manage ever-increasing complexity.
Distributed systems architectures are inherently complex, and the addition of
continuous integration and continuous delivery (CI/CD) raises the stakes.
Visibility and control are central to success, yet as delivery systems become
automated, everything becomes more opaque and therefore harder to proactively
manage. Add to this the abstraction layers of containers or a serverless
infrastructure, and the team feels even further removed from being in control.
As a result, the number of potential causes for any given issue increases, while
pinpointing any single change as the cause becomes much harder.

Debugging in production is a requirement for modern teams, especially teams
who ship frequently. DevOps teams need the best tools to debug issues when
they come up, not just hope they can catch everything in staging. Our customers
tell us that before Honeycomb, they frequently experienced incidents whose
sources were never identified. Teams can no longer rely on simple metrics alone
to provide the level of insight they need to diagnose and resolve issues,
especially at scale.
Observable production systems enable you to move beyond locating gnarly bugs 
or fixing a problematic incident or outage. Designing your systems to include 
observability from the point at which a feature is released allows teams to 
immediately learn how it behaves in production and adjust before a critical 
outage occurs. 

Performance Analysis 
When a new feature is shipped, can you clearly see the impact it has on your 
systems? As load climbs and you have to choose to add capacity or optimize 
code, do you know where to focus in order to make the most impact and keep 
your most important customers happy? 
Intercom used Honeycomb to evaluate performance across all the
dimensions required to understand how different users and types of
usage affected the performance of a given endpoint. They were able
both to identify the portions of the code needing refactoring and to
document concrete examples of how they'd improve performance.

How Intercom sped up their busiest endpoint (by as much as 50%)

Incident Response 
When a user misuses your service, maliciously or otherwise, are you able to 
locate the vulnerability in your codebase and then address the problem before 
others notice? Do your tools have the power to isolate the source of an attack,
or to determine how many users it may be impacting?
When hackers tried to DoS their service, carwow needed the ability to
query at a level of granularity that their traditional APM tools couldn't
manage, so they turned to Honeycomb:

Preventing Bad Actors from Spoiling the Show at carwow 
Visibility into 3rd-party Services 
If your product relies on external API calls and responses, can you identify the 
source of a service slowdown? Do you have the ability to sift through the 
information coming from your database, your cache, your load balancers, and 
your own code quickly and reliably to know if you should be looking to
3rd-party providers to resolve the issue?
Behaviour Interactive (BHVR) had been using a classic APM approach for 
some time to troubleshoot latency issues in their flagship multiplayer 
video game, but were unable to identify the source of a service 
slowdown—was it in the caching, the database, or somewhere in one of 
the numerous external calls? With Honeycomb, they found the issues in 
just minutes.

Gamers Won't Wait: Dead By Daylight Gets Some Sweet Attention

Addressing Technical Debt 
As your organization scales and your product's footprint grows, are you able to 
maintain clear sight-lines across your infrastructure as complexity increases? 
Can you evaluate system performance using distributed tracing views and
better understand the interactions among an increasing number of services? 
While growing as fast as possible to meet their business demands,
carwow used Honeycomb to follow a request through its entire
life-cycle and understand its impact on different subsystems in the
code, leveraging Honeycomb's cross-team collaboration features to solve issues:

Honeycomb Tracing Drives Efficiencies as carwow Scales 

User Happiness and Product Management 
Do you understand how the end user experiences your product? Do you notice 
when they use features in unexpected ways and can you capture that data for 
your product team to investigate? 
Using Honeycomb, Intercom discovered one of their users was trying so
hard to use their product in ways they hadn't anticipated that it was
impacting the overall experience of many others. That discovery
informed future product planning for the feature:

Intercom <3 Honeycomb 


Key capabilities to move the needle on your 
observability practice 
If you experience any of the following, then you must adopt an observability
approach. This will involve cultural and process-centric changes, but for this
document we will focus on the technology tooling that DevOps teams require in
order to fully understand production, debug faster, and spend less time fighting
technical debt.


● Increased frequency of code ships or feature releases

● Increase in volume of users / customers

● More questions for engineers from on-call teams

● Customer complaint issues on the rise

● Pressure to get new features into production faster

Technology tooling for an observability practice requires the following
capabilities. Without these, it is extremely difficult to answer the
questions that matter about your production systems.
Here are some key capabilities and why they matter: 

● Automatic instrumentation of events and traces across 
popular languages 
Most developers don't enjoy instrumenting their code, yet everyone on the team 
needs that telemetry to give context and meaning to ultimately achieve 
observability. 

Does your solution provide drop-in, immediate, automatic instrumentation
to jump-start your observability practice?


● Query performance suitable for rapid iteration and debugging 
in production 
When your team is in a firefight, the last thing you want to wait for is query
results, whether due to ETL delay, slow search performance, or, worse, no access
to the data set at all.

Does your solution provide the ability to slice and dice your data across a
number of dimensions in a frictionless way, as well as the ability to
backtrack and try new theories, and still get to the problem source fast? 
 

● Support for flexible queries over many dimensions 
More and more, the questions that need to be asked to move the business 
forward require the ability to drill down across many aspects of your system data, 
but most tools aggregate data, removing the detail required. 

Does your solution support fast query response across many fields 
containing an arbitrary number of unique values so you can ask the 
important questions? 
 

● Intuitive query interface 
It seems like every new tool involves learning a new query language, along with
the slowdown of the associated learning curve.

Does your solution offer a graphical query interface designed to speed
users to insights, with intelligent, data-driven suggestions for investigation
paths when troubleshooting any issue, large or small?


● Next-generation anomaly detection
When something in the data looks unusual, it can take a lot of false starts to get
at what might be the cause or to determine how big an impact it may have. When
production is impacted, speed matters.

Does your solution allow investigators to simply select outliers and
accelerate their validation process by seeing the most likely suspects right
away? Does your solution have a high signal-to-noise ratio that aids
debuggers rather than slowing them down with red herrings?
 

● Distributed tracing visualizations in context 
Context switching is a known productivity killer. 

Does your solution force you to switch tools and re-orient yourself if you
want to try a different visualization? Will that other tool have the same
data set? Does it give you everything you need to keep investigating and
easily switch to a tracing waterfall diagram with the same data?
 

● Smart sampling to retain key data points 
Relying on metrics means having little or no control over what gets averaged out 
of your data.  

Can you precisely manage the level and application of sampling so you can 
control costs without missing the events that are important to track for your 
business? 
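
As an illustration only (not a description of Honeycomb's implementation), the
sketch below shows one common way to apply head-based dynamic sampling in
Python: keep every error, sample routine successes at a fixed rate, and record
that rate so totals can be reconstructed later. The field names and error
threshold are hypothetical.

    import hashlib

    def sample_decision(event, base_rate=20):
        """Decide whether to keep an event; return (keep, sample_rate)."""
        if event.get("status_code", 200) >= 500:
            return True, 1  # always keep errors at full fidelity
        # Hash the trace ID so the decision is deterministic and all events
        # belonging to a kept trace are kept together.
        digest = hashlib.sha256(event["trace_id"].encode("utf-8")).digest()
        keep = int.from_bytes(digest[:4], "big") % base_rate == 0
        return keep, base_rate

    event = {"trace_id": "abc123", "status_code": 200, "duration_ms": 42}
    keep, rate = sample_decision(event)
    # When keep is True, send the event along with its sample rate so the
    # backend can weight it by `rate` when computing counts and totals.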


● Fine-tunable data retention policies 
Multiple data sources mean multiple legal or business requirements for
retention.

Does your solution allow you to rebalance storage allocation and determine 
how long you want to keep data in order to meet business needs as you 
scale over time? 
 

● Support for open standards such as OpenCensus 
Ease of getting data into your tools is critical, and some organizations want to
adhere to open source standards where available.

 
Does your solution support getting data in via OpenCensus libraries that
collect key metrics and distributed traces from your system services?
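
For illustration, a minimal OpenCensus tracing sketch in Python follows; the
span names and the user_id attribute are hypothetical, and the library's
default exporter is used rather than any particular vendor's backend.

    from opencensus.trace.tracer import Tracer
    from opencensus.trace.samplers import AlwaysOnSampler

    # Record every trace; swap in ProbabilitySampler to sample in production.
    tracer = Tracer(sampler=AlwaysOnSampler())

    def handle_request(user_id):
        with tracer.span(name="handle_request") as span:
            span.add_attribute("user_id", user_id)  # high-cardinality context
            with tracer.span(name="db_query"):
                pass  # downstream work is captured as a child span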
 

● The ability to fully leverage your best talent, past and present 
Collaboration is more than just an idea or a process.

Does your solution encourage each team member to discover past queries
from their teammates and learn by following in their footsteps at every step
of query construction?

Does this seem like a tall order? In many ways, observability is equal parts 
process, culture and tooling. Building your own stack or leveraging a myriad of 
different tools has drawbacks and costs to the business overall. 

If you're thinking of building your own 
observability stack 
With a homegrown solution, you might start small and expand from there. 
Someone on your team provisions a couple of AWS EC2 instances for Logstash, 
the cluster nodes, and Kibana, and you might get some useful data flowing into 
ELK for your team to troubleshoot problems. 
But if you're doing it right, more people will want to use the system, and so you
will need to either limit what data you ingest or scale it up, and that will
dramatically increase your costs year over year.
Consider this breakdown from an IT manager who investigated how much time 
would be spent on each of these tasks in order to deploy and run an ELK stack 
for a year: 

These relatively conservative estimates come out to 344 hours per year, roughly
two months of a full-time engineer's workload.

Doing the same calculation for deploying Honeycomb, a purpose-built technology
for system observability, reduces the time to just 40 hours per year, a single
work-week, and a significantly smaller investment of time and money:

And of course, this doesn't take into account the cost of the hardware you would 
need to run a home-grown solution—whether on-premise or in the cloud. 

What about traces? 
Beyond being costly to run, the Elastic Stack doesn't allow you to see the full
distributed context of the requests that generated log messages. You could of
course ask your team to use a second tool to access tracing event data, but in
doing so they would have to re-orient their analysis and investigation, drilling
into a different UI with a different approach and mindset, which not only derails
and slows an investigation but, more importantly, makes it less accurate and
reliable.
This is undesirable when time-to-understanding (and ultimately, resolution) is 
critical. Honeycomb provides distributed tracing built in, and with Honeycomb 
Beelines, your code is instrumented automatically, including traces. 
“The Beelines include insightful and valuable traces by default, they're 
built by people who know what is useful. The magic that is there makes 
sense.”  
- Alex Newman, Co-founder, hCaptcha 
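
As a rough sketch of what that drop-in instrumentation can look like (assuming
the beeline-python package; the write key, dataset, and traced function are
placeholders):

    import beeline

    # One-time setup; the Beeline auto-instruments supported frameworks and
    # wraps instrumented work in traces.
    beeline.init(
        writekey="YOUR_WRITE_KEY",          # placeholder credential
        dataset="my-production-service",    # placeholder dataset name
        service_name="my-production-service",
    )

    @beeline.traced(name="render_checkout")
    def render_checkout(cart_id):
        # Attach business context to the current span/event.
        beeline.add_context_field("cart_id", cart_id)
        ...

    beeline.close()  # flush buffered events before the process exits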

 
Why not offload all that operational overhead to Honeycomb? We are your first
port of call for debugging, and we can help you find the right data, in context.


If you're thinking of using a vendor to 
achieve observability 
Metrics tools do well at counting discrete gauges and counters such as number
of jobs queued, host-level resource usage, number of requests served by the
system, and so on. Searching is often fast because these systems aggregate or
average the data by a fixed set of buckets (e.g. host, region, or version). But for
resolving and preventing problematic production issues, your team needs the rich
details and context that Honeycomb provides. Although metrics vendors claim to
offer context, they charge a hefty fee to deliver it.

With standard metrics tools, engineers and ops teams typically work with
counters that track numbers as they change over time. Some common things to
measure are the number of HTTP requests a given app is serving, average
latency, and rate of errors. Each one of these counters or gauges will generate
one unique time series on disk.

That's good for a basic start at detecting when problems are potentially
occurring, but you need much more context to find out what is going wrong, and
that context is not present in the data if there is only one global counter.
As a result, engineers try to add the context they need using tags or labels,
which generate new time series that track the counter's value summed across
each unique set of tags or labels, and those time series add up fast.

And the multiplicative effect gets greater as you add more tags and labels. If
you want to know which user was associated with a given request, rather than
just which host or container it came from, the cost of storing all that
additional data goes up exponentially. Likewise, to track percentiles in addition
to averages, you need to create multiples of the metric(s) in question. The time
series quickly become very expensive to store for tools that aren't designed to
handle them, much less read them back quickly enough to help you figure out
where issues lie.
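
To make the multiplication concrete, here is a back-of-the-envelope sketch; the
tag names and cardinalities are invented for illustration.

    # Each unique combination of tag values produces its own time series for a
    # single counter such as a request count (hypothetical numbers throughout).
    hosts, regions, versions = 50, 3, 4
    status_codes = 5

    global_counter = 1                          # 1 series
    infra_tags = hosts * regions * versions     # 600 series
    with_status = infra_tags * status_codes     # 3,000 series

    users = 10_000                              # one high-cardinality tag
    with_user_id = with_status * users          # 30,000,000 series

    print(global_counter, infra_tags, with_status, with_user_id)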

Metrics vendors typically limit the number of tags you can define, or simply
charge you the exponentially increasing cost for each combination of tag values
you send. Features that are built in to Honeycomb's solution, such as unlimited
queryable fields/columns, tracing, and anomaly detection, are typically an extra
cost from most metrics providers. These vendors have to charge more to provide
observability beyond their basic capabilities because they are compensating for
the cost and performance of their older architecture.
To avoid this problem, you must seek an observability tool that is designed to
handle high levels of granularity (data with a lot of distinct possible values)
natively. Honeycomb's storage model solves this completely by shifting to an
event-based model, where raw values are stored and can be dynamically queried,
instead of aggregating everything up front.
Honeycomb has no limit on dimensions: we encourage the exact behavior that
metrics systems forbid. Add fields to your events like user or team ID. Tag them
with specifics about client versions! Add timers and fields that measure the
behavior of 3rd parties. Honeycomb will happily handle it all, and let you easily
query cross-sections of data in one place.
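
As a sketch of what such a wide event might look like (using the libhoney-py
SDK; the write key, dataset, and field values below are placeholders):

    import libhoney

    libhoney.init(writekey="YOUR_WRITE_KEY", dataset="production-api")

    # One wide event per unit of work, carrying as many dimensions as you like.
    ev = libhoney.new_event()
    ev.add({
        "endpoint": "/v1/checkout",
        "duration_ms": 187.4,
        "status_code": 200,
        "user_id": "user-48213",        # high-cardinality fields are welcome
        "team_id": "team-907",
        "client_version": "ios-5.12.1",
        "payment_provider_ms": 96.2,    # timer around a 3rd-party call
    })
    ev.send()

    libhoney.close()  # flush pending events before exit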

What about traces? 
Widespread use of tracing is new enough that it is an afterthought for most 
vendors. Some have acquired early-stage third-party solutions and included them 
in their offerings, but the integrations are sub-optimal.  
As a result, what's out there does not allow you to truly leverage the context that 
exists in your data. Query experience is typically limited to just a few parameters, 
which means you're constrained by what the vendor thinks is interesting or 
useful.  

Honeycomb tracing gives you the most flexibility to dig into what matters to
your business and to define "interesting" based on more than just trace length
or a few tags. In Honeycomb, tracing, events, and metrics are all just views into
the same data, so you can apply the method or visualization that is most
effective at any time and get to the answer faster.
Tracing is not siloed away, requiring engineers to first check the metrics tool,
then the logging tool, to get a specific enough ID to look for the right trace.
Instead, Honeycomb lets you attach the same set of contexts you apply to your 
business across all your debugging efforts. 

Build the observability practice your 
business needs with Honeycomb 
To get the broad set of capabilities and performance you need to achieve an
observable state, you can choose to bear the high cost of building your own
stack, or take the high-cost-for-performance path of last-gen APM/metrics
tools…or you can get an immediate and strong ROI by investing in Honeycomb,
which is purpose-built for production system observability.


About Honeycomb
Honeycomb provides next-gen APM for modern dev teams to better understand and
debug production systems. With Honeycomb, teams achieve system observability
and find unknown problems in a fraction of the time it takes other approaches and
tools. More time is spent innovating, and life on-call doesn't suck. Developers love
it, operators rely on it, and the business can't live without it.
Follow Honeycomb on Twitter and LinkedIn
Visit us at Honeycomb.io