
13 outbreak detection in networks


CS224W: Analysis of Networks
Jure Leskovec, Stanford University




¡ (1) New problem: Outbreak detection
¡ (2) Develop an approximation algorithm
§ It is a submodular optimization problem!
¡ (3) Speed up greedy hill-climbing
§ Valid for optimizing general submodular functions (i.e., also works for influence maximization)
¡ (4) Prove a new “data dependent” bound on the solution quality
§ Valid for optimizing any submodular function (i.e., also works for influence maximization)

11/7/18

Jure Leskovec, Stanford CS224W: Analysis of Networks,

2




¡ Given a real city water distribution network
¡ And data on how contaminants spread in the network
¡ Detect the contaminant as quickly as possible
¡ Problem posed by the US Environmental Protection Agency


[Figure: information cascades, posts by users/blogs connected by time-ordered hyperlinks]

Which users/news sites should one follow to detect cascades as effectively as possible?


Want to read things before others do.

Detect blue & yellow stories soon but miss the red story.

Detect all stories but late.



¡ Both are instances of the same underlying problem!
¡ Given a dynamic process spreading over a network, we want to select a set of nodes to detect the process effectively
¡ Many other applications:
§ Epidemics
§ Influence propagation
§ Network security




¡ Utility of placing sensors:
§ Water flow dynamics, demands of households, …
¡ For each subset $S \subseteq V$ compute utility $f(S)$

[Figure: two placements of sensors S1 to S4 on the set V of all network junctions, with high-, medium-, and low-impact outbreaks marked. One placement has high sensing “quality” (e.g., f(S) = 0.9), the other low sensing “quality” (e.g., f(S) = 0.01).]

Sensor reduces impact through early detection!


Given:
¡ Graph $G(V, E)$
¡ Data about how outbreaks spread over $G$:
§ For each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$

Water distribution network (physical pipes and junctions)
Simulator of water consumption & flow (built by Mech. Eng. people)
We simulate the contamination spread for every possible location.


Given:
¡ Graph $G(V, E)$
¡ Data about how outbreaks spread over $G$:
§ For each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$

[Figure: cascades spreading over nodes a, b, c]

The network of news media
Trace the information flow and identify influence sets
Collect lots of articles and trace them to obtain data about information flow from a given news site.


Given:
¡ Graph $G(V, E)$
¡ Data on how outbreaks spread over $G$:
§ For each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$
¡ Goal: Select a subset of nodes $S$ that maximizes the expected reward:

$$\max_{S \subseteq V} f(S) = \sum_i P(i)\, f_i(S) \quad \text{subject to: } \mathrm{cost}(S) < B$$

§ $P(i)$: probability of outbreak $i$ occurring.
§ $f_i(S)$: reward for detecting outbreak $i$ using sensors $S$.
§ $P(i)\, f_i(S)$: expected reward for detecting outbreak $i$.
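
The expected-reward objective can be sketched in a few lines. This is a minimal illustration, not the course's code: the names `detection_time`, `expected_reward`, and `detected`, the toy outbreak data, and the 0/1 reward are all assumptions made for the example.

```python
import math

def detection_time(times_i, sensors):
    """Earliest time any sensor in `sensors` sees the outbreak (inf if none does)."""
    return min((times_i.get(s, math.inf) for s in sensors), default=math.inf)

def expected_reward(sensors, times, prob, reward):
    """f(S) = sum_i P(i) * f_i(S), where f_i(S) = reward(i, T(S, i))."""
    return sum(prob[i] * reward(i, detection_time(times[i], sensors))
               for i in times)

# Hypothetical toy data: times[i][u] = T(u, i); missing nodes are never reached.
times = {"o1": {"a": 1, "b": 3}, "o2": {"b": 2, "c": 5}}
prob = {"o1": 0.5, "o2": 0.5}

# Illustrative reward: 1 if the outbreak is detected at all, else 0.
detected = lambda i, t: 0.0 if math.isinf(t) else 1.0

print(expected_reward({"a"}, times, prob, detected))  # 0.5
print(expected_reward({"b"}, times, prob, detected))  # 1.0
```

Node b is reached by both outbreaks, so a single sensor there already detects everything in this toy instance.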


¡ Reward (one of the following three):
§ (1) Minimize time to detection
§ (2) Maximize number of detected propagations
§ (3) Minimize number of infected people
¡ Cost (context dependent):
§ Reading big blogs is more time consuming
§ Placing a sensor in a remote location is expensive

[Figure: outbreak i spreading over a network with numbered nodes, with f(S) marked. Monitoring the blue node saves more people than monitoring the green node.]



¡ Objective functions:
§ 1) Time to detection (DT)
  § How long does it take to detect a contamination?
  § Penalty for detecting at time $t$: $\pi_i(t) = t$
§ 2) Detection likelihood (DL)
  § How many contaminations do we detect?
  § Penalty for detecting at time $t$: $\pi_i(t) = 0$, $\pi_i(\infty) = 1$
  § Note, this is a binary outcome: we either detect or not
§ 3) Population affected (PA)
  § How many people drank contaminated water?
  § Penalty for detecting at time $t$: $\pi_i(t) = $ {# of infected nodes in outbreak $i$ by time $t$}
¡ Observation: In all cases detecting sooner does not hurt!



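We define $f_i(S)$ as penalty reduction:
$$f_i(S) = \pi_i(\infty) - \pi_i(T(S, i))$$
Here $T(S, i) = \min_{s \in S} T(s, i)$ is the time at which the sensor set $S$ first detects outbreak $i$. With no sensors the outbreak is never detected, so $T(\emptyset, i) = \infty$ and $f_i(\emptyset) = 0$.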
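
As a rough sketch, the three penalties (time to detection, detection likelihood, population affected) and the resulting penalty reduction could be written as below. The function names, the toy contamination times, and in particular the finite horizon `T_MAX` (used so that the "never detected" time penalty stays finite) are assumptions for illustration.

```python
import math

T_MAX = 10  # assumed finite horizon so the "never detected" time penalty is finite

def penalty_dt(times_i, t):
    """Time to detection: the penalty is the detection time, capped at T_MAX."""
    return min(t, T_MAX)

def penalty_dl(times_i, t):
    """Detection likelihood: 0 if detected, 1 if never detected."""
    return 1 if math.isinf(t) else 0

def penalty_pa(times_i, t):
    """Population affected: number of nodes contaminated by time t."""
    return sum(1 for tu in times_i.values() if tu <= t)

def penalty_reduction(penalty, times_i, sensors):
    """f_i(S) = pi_i(inf) - pi_i(T(S, i))."""
    t_detect = min((times_i.get(s, math.inf) for s in sensors), default=math.inf)
    return penalty(times_i, math.inf) - penalty(times_i, t_detect)

times_i = {"a": 1, "b": 3, "c": 6}  # hypothetical T(u, i) for one outbreak i
for penalty in (penalty_dt, penalty_dl, penalty_pa):
    print(penalty.__name__, penalty_reduction(penalty, times_i, {"b"}))
```

With a sensor at b, detection happens at time 3: seven time units are saved relative to the horizon, the outbreak counts as detected, and one of the three nodes (c) is spared.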
¡ Observation: Diminishing returns

[Figure: a new sensor s’ is added to two placements]
§ Placement S = {x1, x2}: adding s’ helps a lot
§ Placement S’ = {x1, x2, x3, x4}: adding s’ helps very little


¡ Claim: For all $A \subseteq B \subseteq V$ and sensors $s \in V \setminus B$:
$$f(A \cup \{s\}) - f(A) \ge f(B \cup \{s\}) - f(B)$$
¡ Proof: All our objectives are submodular
§ Fix cascade/outbreak $i$
§ Show $f_i(A) = \pi_i(\infty) - \pi_i(T(A, i))$ is submodular
§ Consider $A \subseteq B \subseteq V$ and sensor $s \in V \setminus B$
§ When does node $s$ detect cascade $i$?
  § We analyze 3 cases based on when $s$ detects outbreak $i$
  § (1) $T(B, i) \le T(A, i) < T(s, i)$: $s$ detects late, nobody benefits:
    $f_i(A \cup \{s\}) = f_i(A)$, also $f_i(B \cup \{s\}) = f_i(B)$, and so
    $$f_i(A \cup \{s\}) - f_i(A) = 0 = f_i(B \cup \{s\}) - f_i(B)$$



¡ Proof (contd.): Remember $A \subseteq B$
§ (2) $T(B, i) \le T(s, i) \le T(A, i)$: $s$ detects after $B$ but before $A$
  $s$ detects sooner than any node in $A$ but no sooner than the earliest node in $B$. So $s$ only helps improve the solution $A$ (but not $B$):
  $$f_i(A \cup \{s\}) - f_i(A) \ge 0 = f_i(B \cup \{s\}) - f_i(B)$$
§ (3) $T(s, i) < T(B, i) \le T(A, i)$: $s$ detects early
  $$f_i(A \cup \{s\}) - f_i(A) = [\pi_i(\infty) - \pi_i(T(s, i))] - f_i(A) \ge [\pi_i(\infty) - \pi_i(T(s, i))] - f_i(B) = f_i(B \cup \{s\}) - f_i(B)$$
  § Inequality is due to non-decreasingness of $f_i(\cdot)$, i.e., $f_i(A) \le f_i(B)$
§ So, $f_i(\cdot)$ is submodular!
¡ So, $f(\cdot)$ is also submodular, since it is a non-negative weighted sum of the $f_i$:
$$f(S) = \sum_i P(i)\, f_i(S)$$
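
The claim can also be sanity-checked by brute force on a small instance. The sketch below assumes a capped time-to-detection penalty (the `horizon` parameter), randomly generated contamination times, and hypothetical helper names; it enumerates every $A \subseteq B$ and every new sensor and compares the two marginal gains. It is a check on one toy instance, not a proof.

```python
import itertools
import math
import random

def f_i(times_i, sensors, horizon=10):
    """Penalty reduction for the time-to-detection penalty, capped at `horizon`."""
    t = min((times_i[s] for s in sensors), default=math.inf)
    return horizon - min(t, horizon)

random.seed(0)
nodes = ["a", "b", "c", "d"]
# Hypothetical outbreaks: random contamination times T(u, i) and probabilities P(i).
outbreaks = [({u: random.randint(1, 9) for u in nodes}, p) for p in (0.2, 0.3, 0.5)]

def f(sensors):
    """f(S) = sum_i P(i) * f_i(S)."""
    return sum(p * f_i(times_i, sensors) for times_i, p in outbreaks)

def subsets(items):
    """All subsets of `items`, as tuples."""
    return itertools.chain.from_iterable(
        itertools.combinations(items, r) for r in range(len(items) + 1))

def is_submodular():
    """Check f(A + s) - f(A) >= f(B + s) - f(B) for all A <= B and s outside B."""
    for B in map(set, subsets(nodes)):
        for A in map(set, subsets(sorted(B))):
            for s in set(nodes) - B:
                gain_A = f(A | {s}) - f(A)
                gain_B = f(B | {s}) - f(B)
                if gain_A < gain_B - 1e-9:  # small tolerance for float rounding
                    return False
    return True

print(is_submodular())
```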




[Figure: hill-climbing reward of candidate sensors a, b, c, d, e. Add sensor with highest marginal gain.]

¡ What do we know about optimizing submodular functions?
§ Hill-climbing (i.e., greedy) is near optimal: it achieves at least $(1 - \frac{1}{e}) \cdot OPT$
¡ But:
§ (1) This only works for unit cost case! (each sensor costs the same)
  § For us each sensor $s$ has cost $c(s)$
§ (2) Hill-climbing algorithm is slow
  § At each iteration we need to re-evaluate marginal gains of all nodes
  § Runtime $O(|V| \cdot K)$ function evaluations for placing $K$ sensors
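
For reference, unit-cost hill-climbing can be sketched as follows. The coverage-style objective and the `detects` data are hypothetical stand-ins for a real sensing objective; the point is the inner loop, which re-evaluates every remaining node in every round.

```python
def greedy(nodes, f, k):
    """Pick k nodes, each time adding the one with the highest marginal gain."""
    selected = set()
    for _ in range(k):
        base = f(selected)
        # Re-evaluate the marginal gain of every remaining node: O(|V|) calls per round.
        best = max(sorted(nodes - selected), key=lambda u: f(selected | {u}) - base)
        selected.add(best)
    return selected

# Hypothetical coverage instance: f(S) = number of outbreaks seen by some sensor in S.
detects = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1}}

def f(S):
    return len(set().union(*(detects[u] for u in S)))

print(sorted(greedy(set(detects), f, 2)))  # ['a', 'c']
```

Sorting the candidates only makes tie-breaking deterministic; it is not part of the algorithm.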



¡ Consider the following algorithm to solve the outbreak detection problem: Hill-climbing that ignores cost
§ Ignore sensor cost $c(s)$
§ Repeatedly select sensor with highest marginal gain
§ Do this until the budget is exhausted
¡ Q: How well does this work?
¡ A: It can fail arbitrarily badly!
§ There exists a problem setting where the hill-climbing solution is arbitrarily far from OPT
§ Next we come up with an example


¡ Bad example when we ignore cost:
§ $n$ sensors, budget $B$
§ $s_1$: reward $r$, cost $B$
§ $s_2 \dots s_n$: reward $r - \varepsilon$, cost $1$
§ Hill-climbing always prefers the more expensive sensor $s_1$ with reward $r$ (and exhausts the budget). It never selects the cheaper sensors with reward $r - \varepsilon$
→ For variable cost it can fail arbitrarily badly!
¡ Idea: What if we optimize benefit-cost ratio? Greedily pick the sensor $s_i$ that maximizes the benefit-to-cost ratio:
$$s_i = \arg\max_{s \in V} \frac{f(A_{i-1} \cup \{s\}) - f(A_{i-1})}{c(s)}$$
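
The bad example can be run with concrete numbers. In this sketch the helper `budgeted_greedy`, the values B = 5, r = 1, ε = 0.25, and the assumption that rewards simply add up across sensors are all illustrative choices; the slide itself does not fix them. A sensor that exactly uses up the budget is treated as affordable, as the example requires.

```python
def budgeted_greedy(cost, f, budget, use_ratio):
    """Greedy under a budget, ranking by marginal gain or by gain / cost."""
    selected, spent = set(), 0.0
    while True:
        # Assumption: a sensor that exactly uses up the budget is still affordable.
        affordable = [s for s in sorted(cost)
                      if s not in selected and spent + cost[s] <= budget]
        if not affordable:
            return selected
        base = f(selected)

        def score(s):
            gain = f(selected | {s}) - base
            return gain / cost[s] if use_ratio else gain

        best = max(affordable, key=score)
        selected.add(best)
        spent += cost[best]

# Illustrative numbers: budget B = 5, r = 1, eps = 0.25, rewards assumed additive.
B, r, eps = 5, 1.0, 0.25
cost = {"s1": B, "s2": 1, "s3": 1, "s4": 1, "s5": 1, "s6": 1}
reward = {s: (r if s == "s1" else r - eps) for s in cost}

def f(S):
    return sum(reward[s] for s in S)

ignore_cost = budgeted_greedy(cost, f, B, use_ratio=False)
by_ratio = budgeted_greedy(cost, f, B, use_ratio=True)
print(sorted(ignore_cost), f(ignore_cost))  # ['s1'] 1.0
print(len(by_ratio), f(by_ratio))           # 5 3.75
```

Ignoring cost spends the whole budget on the single expensive sensor, while ranking by benefit-cost ratio buys the five cheap ones.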


¡ Benefit-cost ratio can also fail arbitrarily badly!
¡ Consider: budget $B$:
§ 2 sensors $s_1$ and $s_2$:
  § Costs: $c(s_1) = \varepsilon$, $c(s_2) = B$
  § Benefit (only 1 cascade): $f(s_1) = 2\varepsilon$, $f(s_2) = B$
§ Then benefit-cost ratio is:
  § $f(s_1)/c(s_1) = 2$ and $f(s_2)/c(s_2) = 1$
§ So, we first select $s_1$ and then cannot afford $s_2$
→ We get reward $2\varepsilon$ instead of $B$! Now send $\varepsilon \to 0$ and we get an arbitrarily bad solution!

This algorithm incentivizes choosing nodes with very low cost, even when slightly more expensive ones can lead to much better global results.


¡ CELF (Cost-Effective Lazy Forward-selection): a two-pass greedy algorithm:
§ Set (solution) $S'$: Use benefit-cost greedy
§ Set (solution) $S''$: Use unit-cost greedy
§ Final solution: $S = \arg\max(f(S'), f(S''))$, i.e., whichever of the two has the higher reward
¡ How far is CELF from the (unknown) optimal solution?
¡ Theorem: CELF is near optimal [Krause&Guestrin, ‘05]
§ CELF achieves a $\frac{1}{2}(1 - 1/e)$ factor approximation!

This is surprising: We have two clearly suboptimal solutions, but taking the best of the two is guaranteed to give a near-optimal solution.
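
A minimal sketch of the two-pass idea, applied to a two-sensor instance where the benefit-cost rule alone fails. The helpers `budgeted_greedy` and `celf_two_pass`, the numbers B = 4 and ε = 0.125, and the single-cascade reward function are illustrative assumptions. The lazy evaluation that gives CELF its speed is not shown here.

```python
def budgeted_greedy(cost, f, budget, use_ratio):
    """Greedy under a budget, ranking by marginal gain or by gain / cost."""
    selected, spent = set(), 0.0
    while True:
        # Assumption: a sensor that exactly uses up the budget is still affordable.
        affordable = [s for s in sorted(cost)
                      if s not in selected and spent + cost[s] <= budget]
        if not affordable:
            return selected
        base = f(selected)

        def score(s):
            gain = f(selected | {s}) - base
            return gain / cost[s] if use_ratio else gain

        best = max(affordable, key=score)
        selected.add(best)
        spent += cost[best]

def celf_two_pass(cost, f, budget):
    """Run both greedy variants and keep whichever solution has the higher reward."""
    s_ratio = budgeted_greedy(cost, f, budget, use_ratio=True)   # S'
    s_unit = budgeted_greedy(cost, f, budget, use_ratio=False)   # S''
    return max(s_ratio, s_unit, key=f)

# Illustrative numbers: B = 4, eps = 0.125.
B, eps = 4, 0.125
cost = {"s1": eps, "s2": B}
reward = {"s1": 2 * eps, "s2": B}

def f(S):
    # Single cascade: the reward of a set is that of its best sensor.
    return max((reward[s] for s in S), default=0)

best = celf_two_pass(cost, f, B)
print(sorted(best), f(best))  # ['s2'] 4
```

Benefit-cost greedy alone returns {s1} with reward 0.25 here; the cost-ignoring pass returns {s2}, and taking the better of the two recovers the reward of 4.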



[Figure: hill-climbing reward of candidate sensors a, b, c, d, e. Add sensor with highest marginal gain.]

¡ What do we know about optimizing submodular functions?
§ Hill-climbing (i.e., greedy) is near optimal (that is, $(1 - \frac{1}{e}) \cdot OPT$)
¡ But:
§ (2) Hill-climbing algorithm is slow!
  § At each iteration we need to re-evaluate marginal gains of all nodes
  § Runtime $O(|V| \cdot K)$ for placing $K$ sensors



¡ In round $i + 1$: So far we picked $S_i = \{s_1, \dots, s_i\}$
§ Now pick $s_{i+1} = \arg\max_u f(S_i \cup \{u\}) - f(S_i)$
§ This is our old friend, the greedy hill-climbing algorithm. It maximizes the “marginal gain” $\delta_i(u) = f(S_i \cup \{u\}) - f(S_i)$
¡ By submodularity property:
$$f(S_i \cup \{u\}) - f(S_i) \ge f(S_j \cup \{u\}) - f(S_j) \quad \text{for } i < j$$
¡ Observation: By submodularity, for every $u$:
$\delta_i(u) \ge \delta_j(u)$ for $i < j$ since $S_i \subset S_j$
Marginal benefits $\delta_i(u)$ only shrink (as $i$ grows)!

[Figure: $\delta_i(u) \ge \delta_j(u)$. Activating node $u$ in step $i$ helps at least as much as activating it at step $j$ ($j > i$).]


¡ Idea:
§ Use $\delta_i$ as upper-bound on $\delta_j$ ($j > i$)
¡ Lazy hill-climbing:
§ Keep an ordered list of marginal benefits $\delta_i$ from previous iteration
§ Re-evaluate $\delta_i$ only for top node
§ Re-order and prune

[Figure: bar chart of marginal gains for nodes a, b, c, d, e, with S1 = {a}]

$$f(S \cup \{u\}) - f(S) \ge f(T \cup \{u\}) - f(T) \quad \text{for } S \subseteq T$$
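
One way to realize lazy hill-climbing is with a priority queue of possibly stale marginal gains. The sketch below is an illustration under assumptions, not the lecture's implementation: the function name `lazy_greedy`, the `detects` coverage data, and the evaluation counter are hypothetical. Each heap entry records the solution size at which its gain was computed, so an entry is fresh exactly when that size equals the current one.

```python
import heapq

def lazy_greedy(nodes, f, k):
    """Lazy hill-climbing: stale marginal gains serve as upper bounds."""
    selected, evaluations = set(), 0
    base = f(selected)
    heap = []
    for u in sorted(nodes):
        # Max-heap via negated gains; the last field is |S| when the gain was computed.
        heap.append((-(f({u}) - base), u, 0))
        evaluations += 1
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, u, size_when_computed = heapq.heappop(heap)
        if size_when_computed == len(selected):
            # Gain is up to date and beats every (upper-bounded) rival: take it.
            selected.add(u)
            base = f(selected)
        else:
            # Stale bound: re-evaluate only this top node and re-insert it.
            gain = f(selected | {u}) - base
            evaluations += 1
            heapq.heappush(heap, (-gain, u, len(selected)))
    return selected, evaluations

# Hypothetical coverage instance: f(S) = number of outbreaks seen by some sensor in S.
detects = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1}}

def f(S):
    return len(set().union(*(detects[u] for u in S)))

selected, evaluations = lazy_greedy(set(detects), f, 2)
print(sorted(selected), evaluations)  # ['a', 'c'] 6
```

Plain hill-climbing would need 4 + 3 = 7 marginal-gain evaluations on this instance; the lazy version never re-evaluates d, because its stale gain of 1 already shows it cannot beat c.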

