
Compliments of

The Automated
Traffic Handbook
Managing Spiders, Bots, Scrapers,
and Other Non-Human Traffic

Andy Still


The Industry Leading Web Traffic Management System

Understand, Control And Optimise Online Traffic
Guarantee website uptime, protect your business from malicious bots,
ensure excellent customer experience and maximise revenue generated by
web applications.

How It Works
Designed by web performance experts, TrafficDefender is a cloud service
that sits in front of your website or API controlling the flow of traffic
to it. Our highly resilient platform guarantees uptime and protects your
website from malicious bot activity, enabling you to generate maximum
revenue over your site’s busiest periods.

Find out what TrafficDefender can do for your website
Learn more at intechnica.com/trafficdefender

Beijing, Boston, Farnham, Sebastopol, Tokyo



The Automated Traffic Handbook
by Andy Still
Copyright © 2018 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles. For more
information, contact our corporate/institutional sales department: 800-998-9938.


Editor: Virginia Wilson
Production Editor: Nicholas Adams
Copyeditor: Jasmine Kwityn
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
Tech Reviewers: Daniel Huddart, Andy Lole, and Jason Hand

February 2018: First Edition

Revision History for the First Edition
2018-02-02: First Release


The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Automated
Traffic Handbook, the cover image, and related trade dress are trademarks of
O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the author disclaim all responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own risk. If
any code samples or other technology this work contains or describes is subject to
open source licenses or the intellectual property rights of others, it is your
responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Intechnica. See our
statement of editorial independence.

978-1-492-02935-9
[LSI]


Table of Contents

Introduction

Part I. Background

1. What Is Automated Traffic?
   Key Characteristics of Automated Traffic
   Exclusions

2. Misconceptions of Automated Traffic
   Misconception: Bots Are Just Simple Automated Scripts
   Misconception: Bots Are Just a Security Problem
   Misconception: Bot Operators Are Just Individual Hackers
   Misconception: Only the Big Boys Need to Worry About Bots
   Misconception: I Have a WAF, I Don’t Need to Worry About Bot Activity

3. Impact of Automated Traffic
   Company Interests
   Other Users
   System Security
   Infrastructure

Part II. Types of Automated Traffic

4. Malicious Bots
   Application DDoS

5. Data Harvesting
   Search Engine Spiders
   Content Theft
   Price Scraping
   Content/Price Aggregation
   Affiliates
   User Data Harvesting

6. Checkout Abuse
   Scalpers
   Spinners
   Inventory Exhaustion
   Snipers
   Discount Abuse

7. Credit Card Fraud
   Card Validation
   Card Cracking
   Card Fraud

8. User-Generated Content (UGC) Abuse
   Content Spammer

9. Account Takeover
   Credential Stuffing/Credential Cracking
   Account Creation
   Bonus Abuse

10. Ad Fraud
   Background to Internet Advertising
   Banner Fraud
   Click Fraud
   CPA Fraud
   Cookie Stuffing
   Affiliate Fraud
   Arbitrage Fraud

11. Monitors
   Availability
   Performance
   Other

12. Human-Triggered Automated Traffic

Part III. How to Effectively Handle Automated Traffic in Your Business

13. Identifying Automated Traffic
   Indications of an Automated Traffic Problem
   Challenges
   Generation 0: Genesis (robots.txt)
   Generation 1: Simple Blocking (Blacklisting and Whitelisting)
   Generation 2: Early Bot Identification (Symptom Monitoring)
   Generation 3: Improved Bot Identification (Real User Validation)
   Generation 4: Sophisticated Bot Identification (Behavioral Analysis)

14. Managing Automated Traffic
   Blocking
   Validation Requests
   Alternative Servers/Caching
   Alternative Content

Conclusion



Introduction

Web traffic consists of more than just the human users who visit
your site. In fact, recent reports show that human users are becoming
a minority. The rest is made up of an ever-expanding range of traffic
that can be grouped under the heading automated traffic.


Terminology
The terms automated traffic, bot traffic, and non-human traffic are equally
common and are used interchangeably throughout this book.

As long ago as 2014, Incapsula estimated that human traffic accounted
for only 39.5% of all the traffic it saw. This trend is predicted to
continue, with Cisco estimating that automated traffic will grow by 37%
year on year until 2022.
However, this growth is not simply in the quantity of automated traffic
but also in its variety and sophistication. New paradigms for interaction
with the internet, more complex business models and interdependence
between sources of data, the evolution of shopping methods and habits,
the increased sophistication of criminal activity, and the availability
of cloud-based computing capacity are all converging to create an
automated traffic environment that is ever more challenging for a
website owner to control.



It’s Not All Good or Bad
It is simplistic to think of automated traffic as being all goodies and
baddies; the truth is much more nuanced than that. As we’ll discuss,
there are clear areas of good and bad traffic, but there is a gray area
in between where you will need to assess whether the traffic is positive
or negative for your situation.

This growth poses a number of fundamental questions for anyone
with responsibility for maintaining efficient operation or maximum
profitability of a public-facing website:

• How much automated traffic is hitting my website?
• What is this traffic up to?
• How worried should I be about it?
• What can I do about it?
The rest of this book will help you understand how you can provide
answers to these questions.
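
As a rough first step toward the first of these questions, the share of requests that openly declare themselves as bots can be counted from an access log. The sketch below assumes the log has already been reduced to a list of User-Agent strings; the marker list, function name, and sample values are hypothetical and chosen only for illustration. Because bots can present browser-like user agents, the result should be read as a lower bound on automated traffic, not a measurement of it.

```python
# Hypothetical substrings that self-declared bots often include in User-Agent.
BOT_MARKERS = ("bot", "spider", "crawler")


def declared_bot_share(user_agents):
    """Return the fraction of requests whose User-Agent declares a bot."""
    if not user_agents:
        return 0.0
    declared = sum(
        1 for ua in user_agents
        if any(marker in ua.lower() for marker in BOT_MARKERS)
    )
    return declared / len(user_agents)


# Illustrative sample, not real log data.
sample = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "ExampleCrawler/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13)",
]
share = declared_bot_share(sample)
```

Two of the four sample requests declare themselves, so `share` is 0.5 here; any bot posing as a browser would go uncounted.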

Terminology
The challenge of automated traffic applies to anyone
who runs a public-facing web-based system, whether
that is a traditional public website, complex web-based
application, SaaS system, web portal, or web-based
API. For simplicity I will use the generic term website
when referring to any of these systems.
Likewise I will use website owner to refer to the range
of people who will be responsible for identifying and
managing this problem—from security and platform
managers to ecommerce and marketing directors.
I will use the term bot operator to identify the individual
or group that is operating the automated traffic.



PART I


Background

Before going into detail about what automated traffic is doing on
your website and how this can be addressed, it is important that we
have a good shared understanding of what automated traffic encompasses.
The following chapters will give a brief introduction to the core
elements of automated traffic and clarify some of the common
misconceptions that people hold about the nature and complexity of bot
traffic and the bot operators.



CHAPTER 1

What Is Automated Traffic?

There is a range of different definitions of what can be classed as
automated traffic.
For example, Frost & Sullivan describe bot traffic as “computer
programs that are used to perform specific actions in an automated
fashion,” Akamai has defined it as “automated software programs
that interact with websites,” and Wikipedia defines a bot as “a
software application that runs automated tasks (scripts) over the
Internet,” whereas Hubspot says “A bot is a type of automated technology
that’s programmed to execute certain tasks without human intervention.”
For the purposes of this book I will use the following description for
automated traffic, which I feel captures the essential details of what
is meant by the term and removes some of the vagueness of the other
descriptions:
Automated traffic is any set of legitimate requests made to a website
that is made by an automated process rather than triggered by a
direct human action.

History of Automated Traffic
The history of the type of automated traffic I am discussing here
can be traced back to 1988 with the creation of IRC bots such as the
Hunt the Wumpus game and Bill Wisner’s Bartender. It wasn’t until
1994, however, that the first search engine spiders were created by
WebCrawler (later purchased by AOL). GoogleBot followed in 1996.

Key Characteristics of Automated Traffic
For the purposes of this book, I will have a limited definition of
automated traffic; this is not to say that other types of automated
traffic are not a concern, just that they are addressed elsewhere.

Web-based Systems
The automated traffic discussed in this book is targeted at web-based
systems and excludes other types of traffic, such as automated emails.

Layer 7
Automated traffic operates at layer 7 of the OSI Model—in other
words, it operates at the application level, making HTTP/HTTPS
requests to websites and receiving responses in the same format.

Anything that interacts with servers via any other means is classed
as outside the scope of this book.

Legitimate Requests
Automated traffic is defined as traffic that makes legitimate requests
to websites (i.e., requests formulated in the same way as those made
by human users). This means that the automated traffic that is identified
as negative is focused on exploiting weaknesses in the business
logic of systems, not exploiting security weaknesses.
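
To make "formulated in the same way" concrete, the sketch below builds (but never sends) an HTTP request using Python's standard `urllib.request.Request`. The target URL and header values are hypothetical. Nothing in the resulting request is malformed or abusive; on its own, it cannot be told apart from a request sent by a browser with the same headers.

```python
from urllib.request import Request

# Hypothetical target URL and header values, for illustration only.
request = Request(
    "https://www.example.com/products/123",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html",
    },
)

# The request is only constructed and inspected here; it is never sent.
method = request.get_method()
# urllib stores header names capitalized as "User-agent".
user_agent = request.get_header("User-agent")
```

An ordinary GET with ordinary headers is exactly what a script can produce, which is why the intent behind a request has to be judged from context and not from its format.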

Exclusions
The following types of traffic, which could be categorized as automated
traffic, have been excluded from any discussion within this
book. The reason for this exclusion is that they are subjects in their
own right and are well catered for in other literature, with a range of
well-established products and solutions in existence to mitigate the
issues created.
Their exclusion from this work does not imply that they are not
worthy subjects of concern for website owners. They are, in fact,
very real threats that should be handled as part of any website
management strategy.

DDoS (Distributed Denial of Service)

DDoS is a low-level volumetric attack, designed to overwhelm the
server by the quantity of requests being made. There is a wide
range of different attacks that can be made to achieve this objective,
all of which aim to exploit weaknesses in networking protocols. To
mitigate this, there are well-established, dedicated DDoS management
tools and services that can be put in place to minimize risk
from DDoS attacks.
A variation on this, called application DDoS, aims to make large
numbers of requests to certain known pressure points within systems,
with the intention of bringing the system to its knees. This will
be discussed in more depth in Chapter 4.

Security Vulnerability Exploits
These types of exploits involve attempts to make illegitimate
requests to a system with the aim of exploiting weaknesses within
the security of that system, allowing the operator to gain control over
the server or data within the application. Common examples
include SQL injection and cross-site scripting.
Hackers constantly run automated scripts across the
internet looking for sites/servers where these vulnerabilities have
not been mitigated. Well-managed servers and good application
development can protect systems from these exploits, but it is also
good practice to use a web application firewall (WAF) to identify
and block illegitimate requests to further minimize risk from these
or future exploits.
These automated scans and attacks are a real threat and should be
taken seriously by anyone who has responsibility for the security of
a website.




CHAPTER 2

Misconceptions of
Automated Traffic

As we’ve already discussed, the amount of automated traffic is growing
consistently and, as it rises, so too does the sophistication and
complexity of the bot operators. Before discussing the activities of
bot traffic in detail, it is worth addressing some of the common
misconceptions that website owners may have about automated traffic.

Misconception: Bots Are Just Simple
Automated Scripts
While this may have been accurate 15 years ago, the sophistication of
bot traffic has increased massively since then. The technology and
platforms available to bot operators have improved, the defenses they
must evade have become more sophisticated and, most importantly, the
gains to be achieved have grown.
Modern bots are sophisticated systems that will manage distribution
of traffic across large-scale environments or large botnets and via
multiple proxies in order to hide their activity among that of human
users (even executing requests as a part of a human session). Bots
will routinely execute requests from real browsers and execute
JavaScript sent to validate users as humans. Detection mechanisms such
as CAPTCHA can be bypassed, either by using artificial intelligence
or brute-force systems, or by employing farms of human agents to
solve them on demand and pass the solution back to the bot. Bots
are intelligent enough to integrate with these human services seamlessly.
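
The effect of distributing traffic can be sketched with a simple per-address request limit. The threshold, function name, and addresses below are hypothetical (the addresses come from ranges reserved for documentation); the point is only that the same volume of requests looks very different once it is spread across many sources.

```python
from collections import Counter

PER_IP_LIMIT = 100  # hypothetical per-minute threshold


def blocked_ips(source_ips, limit=PER_IP_LIMIT):
    """Return the set of addresses that exceed the per-address limit."""
    counts = Counter(source_ips)
    return {ip for ip, n in counts.items() if n > limit}


# One script sending 1,000 requests from a single address.
single_source = ["203.0.113.10"] * 1000

# The same 1,000 requests spread over 50 addresses, 20 requests each.
distributed = [f"198.51.100.{i}" for i in range(50) for _ in range(20)]

caught_single = blocked_ips(single_source)
caught_distributed = blocked_ips(distributed)
```

The single source is blocked, while the distributed run stays under the limit at every address and passes untouched, even though the total load on the site is identical.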

Botnet
Botnets are networks of compromised computers
(usually infected by viruses or other malware) that can
be accessed remotely and used to execute any processes
defined by the botnet operator. Often this means
they are used to send requests to remote machines
over the internet.
They are more commonly associated with being used
for DDoS attacks but can be used for automated traffic
(e.g., account takeover or card validation attempts).
There is an increasing number of botnets being made
available for hire.

Multiple bot activities can be coordinated into a complete system.
For example, data harvesting will be undertaken to get product
details from a site to identify appropriate products to target, then
checkout abuse will be undertaken to create more valuable advertising
subjects, and finally ad fraud will be undertaken. All of
these activities can be viewed and coordinated from a central control
panel.
Similarly, ticket touts will use spinner bots to hold a ticket and then
trigger another bot to automatically add this ticket to a secondary
ticketing site. When the ticket is sold the original bot will complete
the purchase. A central management system is in place to see the
status of tickets being held/purchased and to handle distribution of
tickets to end purchasers. Additional software is then used to modify
the downloaded tickets to reflect the new purchaser’s details.
These are just some examples of the sophistication seen in bot activity,
and this level is increasing constantly to exploit weaknesses in
systems, business logic, and practices and to stay ahead of the
defense mechanisms that are constantly being improved.



Misconception: Bots Are Just a Security
Problem
The challenge of managing automated traffic is often just dropped at
the door of an information security officer (ISO) and the security
department, if the company has one. For some types of automated
traffic (such as credit card fraud) this makes absolute sense because
it is definitively a security issue and should be handled as such.
However, some other types of automated traffic (such as price
aggregators) are actually business considerations and should be managed
as such by a relevant section of the business.

There are a number of other roles that may be involved in making
decisions about the varying types of, and challenges raised by,
automated traffic. These can include roles such as Head of Platform,
Head of Ecommerce, Head of Ops, and Head of Marketing.
The ideal management solution will provide sufficient information
to allow people in these roles to view details of and make informed
decisions about how to manage the elements of automated traffic
specific to their roles without being dependent on a black box
security-based system.

Misconception: Bot Operators Are Just
Individual Hackers
There are extremely large organizations that operate automated traffic
networks (think Google), and below them there is a group of
organizations that scrape data for legitimate purposes (price
aggregators, etc.). Beyond that, there is sometimes a sense of a
distributed set of lone hackers developing software to perpetrate scams
or to sell to companies to spy on their competitors.
While there is no doubt that such individuals exist, they are far from
the whole picture. The amount of money that can be made with some types
of automated traffic means that the operators are, in reality, often
complex criminal organizations employing technical experts and backed
by human endeavor, both at an organizational, strategic level and at a
lower level to complete manual tasks that are out of the scope of bot
activity (e.g., completing CAPTCHAs).



There is also a growing number of third-party services that are focused
on delivering automated traffic activity on demand. For example, there
is a range of companies that offer price/content scraping services on a
per-use basis and provide all the usual bot evasion techniques as
standard (and they are constantly working to improve the reliability of
those techniques). This means that rather than building a price
scraping bot in house or using a freelancer, your competitors now have
access to a service that is dedicated to evading bot detection in order
to maintain its income. Other third-party offerings such as ticket bots,
sneaker bots, and CAPTCHA farms are all being created to further increase
the sophistication of automated traffic being made available to users
both legitimate and dubious (as well as to end consumers, as is sometimes
the case with sneaker bots).

Misconception: Only the Big Boys Need to
Worry About Bots
There can sometimes be a feeling that there are two types of bots:
• Generic bots that look for common weaknesses across large numbers of
sites, without targeting any one of them
• Targeted bots that focus on specific, high-profile sites
This can lead to a false sense of security for owners of mid-sized
sites, who might feel that, as long as they have some general
security protection in place, the bot operators are never going
to go to the effort of targeting their site.
In reality, this is untrue: smaller sites tend to have fewer defenses, so
are easier targets, and although solutions will need to be adapted to
target a specific site, this is often not as much work as might
be imagined. The frameworks that have been built are sophisticated enough
to allow for easy expansion, and the available resources are such that
a wide range of websites can be targeted.
Small and mid-sized commercial online presences have been shown
to be equally targeted by automated traffic activity.



Misconception: I Have a WAF, I Don’t Need to
Worry About Bot Activity
Web application firewalls (WAFs) are very useful tools that form a
fundamental part of a secure system. They are similar to network
firewalls, but rather than operating at a TCP/IP level, they operate at
the HTTP level to process all incoming requests and match each
request against a set of static rules, blocking requests that fail the
checks. They are, therefore, very effective at stripping out
vulnerability scanning attempts such as SQL injection attacks.
However, WAFs are not well suited for identifying bot traffic, as the
challenge of spotting automated traffic is fundamentally different.
Basically, WAFs scan web traffic looking for illegitimate requests
designed to exploit security weaknesses in web applications, whereas
bot detection systems need to scan web traffic looking for legitimate
requests that are aiming to exploit weaknesses in the business logic
of a web application. Typically this involves making a judgment after
analyzing the series of requests made to look for patterns of behavior
that differ from those of legitimate users (either human or good bot).
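
The difference between the two kinds of check can be sketched as follows. Both functions are deliberately simplistic and entirely hypothetical: the pattern, thresholds, and URL layout are assumptions made for illustration, not the rules of any real WAF or bot detection product. The first inspects each request in isolation, as a static rule does; the second only reaches a verdict by looking at a whole series of requests.

```python
import re

# Hypothetical static rule of the kind applied to each request in isolation.
SQLI_PATTERN = re.compile(r"('|--|\bUNION\b|\bSELECT\b)", re.IGNORECASE)


def waf_allows(path):
    """Static per-request check: reject obvious SQL injection markers."""
    return SQLI_PATTERN.search(path) is None


def looks_like_scraper(paths, min_requests=20, product_ratio=0.9):
    """Series-level check: flag sessions made up almost entirely of product pages."""
    if len(paths) < min_requests:
        return False
    product_hits = sum(1 for p in paths if p.startswith("/product/"))
    return product_hits / len(paths) >= product_ratio


# Illustrative sessions, not real traffic.
human_session = ["/", "/category/shoes", "/product/17", "/basket", "/checkout"]
bot_session = [f"/product/{i}" for i in range(1, 201)]

injection_allowed = waf_allows("/product/1' OR '1'='1")
bot_passes_waf = all(waf_allows(p) for p in bot_session)
bot_flagged = looks_like_scraper(bot_session)
human_flagged = looks_like_scraper(human_session)
```

Every request in the scraping session passes the static rule, because each one is individually legitimate; only the pattern across the session gives it away. A real system would weigh many more signals than a single ratio.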




CHAPTER 3

Impact of Automated Traffic

Before deciding on how to manage the automated traffic that is hitting
your system, it is important that you have effectively assessed
the impact it is having, weighed against the value it is delivering to
you. When considering the impact you need to be sure that you are
not just considering the impact on your servers but also the business
impact. In addition, sufficient investigation must be undertaken to
determine the intent of the bot operator and to understand what
they were actually trying to do when executing the automated
attack.
It’s important to realize that non-human traffic can deliver value
while also having a negative impact on your business. In this case,
you must assess the relative importance of the non-human traffic to
deduce whether the benefits of this traffic outweigh the negative
effects.
When assessing the impact, consider the impact on company interests,
other users, system security, and infrastructure. Let’s now
examine each of these in turn.

Company Interests
Is the automated traffic accessing your site for purposes that would
not be in the interests of your company?



Examples of this include:
• Competitors who are scraping your prices so that they can
adjust their pricing accordingly, putting them at a competitive
advantage.
• Bots stealing your content to use on their sites, saving them the
costs of creating that content or purchasing data feeds.
• Spambots utilizing areas of your site that allow user-generated
content (UGC), such as comments or forums, to publish offensive
content or ads for services you would not want your company
associated with.
• Account takeover bots accessing people’s personal data for use
elsewhere.
• Scalpers purchasing limited availability goods for resale elsewhere,
creating a negative public opinion of your brand.
• Creation of fake accounts in order to take unfair advantage of
special offer terms.
• Skewing of analytics and other metrics that would lead you to
make invalid business decisions.
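
The last point can be illustrated with a small worked example. All figures below are invented for illustration: if bot sessions are counted alongside human sessions, a metric such as conversion rate is diluted, and a decision based on the reported figure would rest on the wrong number.

```python
def conversion_rate(orders, sessions):
    """Orders divided by sessions."""
    return orders / sessions


# Invented figures for illustration only.
human_sessions = 4_000
bot_sessions = 6_000
orders = 120  # assumes all orders were placed by human users

reported = conversion_rate(orders, human_sessions + bot_sessions)
actual = conversion_rate(orders, human_sessions)
```

With these numbers the analytics report a 1.2% conversion rate, while the rate among human visitors is 3%.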

Other Users

Does the non-human traffic negatively affect the experience of
human users?
This could be in terms of the quality of service that you are able to
provide to them—for example, the non-availability of products due
to bots removing inventory from sale, or the variation in price if
dynamic pricing is being influenced by bot activity.
Alternatively, it could be impacting users due to the effects on the
performance of the system, such as the deterioration of site response
as a result of the higher traffic on the system.

System Security
Is the non-human traffic trying to identify or exploit security
weaknesses in your site?
Is this traffic trying to bypass your system defenses in order to gain
access to areas of the system that should not be publicly available,
such as bypassing password-protected areas of the system to gain
access to users’ personal/financial data or to steal credit associated
with their accounts?
As previously discussed, there is a whole range of security exploits
that automated scanning software will regularly be probing your site
for. These are outside the scope of this book, but the impact of
allowing them to hit your site without appropriate management in place
can be catastrophic, including complete loss of control of servers and
compromise of data.

Poor security can make your site a target for some of the other types
of automated traffic attacks described in this book, such as carding
or data theft. A robust approach to security management is essential
to reduce the risk of reputational damage from a wide range of
potential attacks.

Infrastructure
Does the non-human traffic affect your infrastructure?
System performance can be negatively impacted by automated traffic.
For example, servers might reach capacity and therefore struggle
to return content or process requests in an appropriate manner.
Alternatively, it could affect your scalability, meaning you hit limits
such as disk space required for logs, cache, or database storage or
software licence limits much sooner than expected.
All of these can further result in a negative impact on costs. This
could be due to increased bandwidth usage because of the amount
of data being returned to automated processes, additional storage
costs, or additional infrastructure or software licences required to
run the site.
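
One way to put a number on the bandwidth element of this cost is to total the bytes returned to requests already classified as automated. The sketch below assumes that classification exists; the record layout, function name, and figures are hypothetical.

```python
def bot_bandwidth_share(records):
    """Fraction of bytes sent to bots, given a list of (bytes_sent, is_bot) tuples."""
    total = sum(size for size, _ in records)
    if total == 0:
        return 0.0
    bot_bytes = sum(size for size, is_bot in records if is_bot)
    return bot_bytes / total


# Invented records for illustration only.
records = [
    (50_000, False),
    (120_000, True),
    (30_000, False),
    (200_000, True),
]
share = bot_bandwidth_share(records)
```

Here 320,000 of the 400,000 bytes served went to automated requests, a share of 0.8. Multiplying a share like this by the bandwidth bill gives a first estimate of what that traffic costs.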
If you are scaling up your infrastructure to meet high demand from
automated traffic and are not in a flexible cloud environment then
you will be paying for a level of capacity far greater than that needed
to meet the business needs of the platform just to maintain user
experience during bot attacks.
