CS224W: Analysis of Networks
Jure Leskovec, Stanford University
¡ (1) New problem: Outbreak detection
¡ (2) Develop an approximation algorithm
§ It is a submodular optimization problem!
¡ (3) Speed up greedy hill-climbing
§ Valid for optimizing general submodular functions
(i.e., also works for influence maximization)
¡ (4) Prove a new “data dependent” bound on the solution quality
§ Valid for optimizing any submodular function
(i.e., also works for influence maximization)
11/7/18
Jure Leskovec, Stanford CS224W: Analysis of Networks,
¡ Given a real city water distribution network
¡ And data on how contaminants spread in the network
¡ Detect the contaminant as quickly as possible
¡ Problem posed by the US Environmental Protection Agency
[Figure: information cascades over a network of users/blogs; posts are connected by time-ordered hyperlinks.]
Which users/news sites should one follow to detect cascades as effectively as possible?
Want to read things before others do.
[Figure: two choices of sites to follow. One detects the blue & yellow stories soon but misses the red story. The other detects all stories, but late.]
¡ Both of these are instances of the same underlying problem!
¡ Given a dynamic process spreading over a network, we want to select a set of nodes to detect the process effectively
Many other applications:
§ Epidemics
§ Influence propagation
§ Network security
¡ Utility of placing sensors:
§ Water flow dynamics, demands of households, …
¡ For each subset $S \subseteq V$ compute utility $f(S)$
[Figure: two placements of sensors S1–S4 on the set V of all network junctions, with the contamination and the high-, medium-, and low-impact outbreaks marked. One placement has high sensing “quality” (e.g., f(S) = 0.9): a sensor reduces impact through early detection. The other has low sensing “quality” (e.g., f(S) = 0.01).]
Given:
¡ Graph $G(V, E)$
¡ Data about how outbreaks spread over $G$:
§ For each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$
Water distribution network (physical pipes and junctions).
Simulator of water consumption & flow (built by Mech. Eng. people).
We simulate the contamination spread for every possible location.
Given:
¡ Graph $G(V, E)$
¡ Data about how outbreaks spread over $G$:
§ For each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$
[Figure: the network of news media (nodes a, b, c) next to traces of the information flow, used to identify influence sets.]
Collect lots of articles and trace them to obtain data about information flow from a given news site.
Given:
¡ Graph $G(V, E)$
¡ Data on how outbreaks spread over $G$:
§ For each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$
¡ Goal: Select a subset of nodes $S$ that maximizes the expected reward:
$$\max_{S \subseteq V} f(S) = \sum_i P(i)\, f_i(S)$$
subject to: $cost(S) < B$
§ $P(i)$ … probability of outbreak $i$ occurring.
§ $f_i(S)$ … reward for detecting outbreak $i$ using sensors $S$.
§ $P(i)\, f_i(S)$ … expected reward for detecting outbreak $i$.
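The objective above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the outbreak table `T`, the probabilities `P`, the costs `COST`, and the horizon `T_MAX` are made-up toy values, and the reward used for $f_i$ is "detection time saved", which is one of several possible choices.

```python
# Hypothetical toy data (not from the lecture): T[i][u] is the time at which
# outbreak i reaches node u; nodes missing from T[i] are never reached.
T = {
    "o1": {"a": 1, "b": 3},
    "o2": {"b": 2, "c": 5},
}
P = {"o1": 0.5, "o2": 0.5}        # assumed outbreak probabilities P(i)
COST = {"a": 1, "b": 3, "c": 1}   # assumed per-node sensor costs
T_MAX = 10                        # assumed horizon standing in for "never detected"


def detection_time(S, i):
    """Earliest time any sensor in S sees outbreak i (T_MAX if none does)."""
    return min((T[i][u] for u in S if u in T[i]), default=T_MAX)


def f_i(S, i):
    """Reward for outbreak i, here the detection time saved versus no sensors."""
    return T_MAX - detection_time(S, i)


def f(S):
    """Expected reward f(S) = sum_i P(i) * f_i(S)."""
    return sum(P[i] * f_i(S, i) for i in T)


def feasible(S, budget):
    """Budget constraint cost(S) < B, with cost(S) the sum of node costs."""
    return sum(COST[u] for u in S) < budget
```

With a budget of 3 in this toy instance, node b alone gives the highest single-node reward but is infeasible, while the cheaper pair {a, c} fits the budget.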
¡ Reward (one of the following three):
§ (1) Minimize time to detection
§ (2) Maximize number of detected propagations
§ (3) Minimize number of infected people
¡ Cost (context dependent):
§ Reading big blogs is more time consuming
§ Placing a sensor in a remote location is expensive
[Figure: outbreak i spreading over a network of numbered nodes, annotated with f(S). Monitoring the blue node saves more people than monitoring the green node.]
¡ Objective functions:
§ 1) Time to detection (DT)
§ How long does it take to detect a contamination?
§ Penalty for detecting at time $t$: $\pi_i(t) = t$
§ 2) Detection likelihood (DL)
§ How many contaminations do we detect?
§ Penalty for detecting at time $t$: $\pi_i(t) = 0$, $\pi_i(\infty) = 1$
§ Note, this is a binary outcome: we either detect or not
§ 3) Population affected (PA)
§ How many people drank contaminated water?
§ Penalty for detecting at time $t$: $\pi_i(t)$ = {# of infected nodes in outbreak $i$ by time $t$}
¡ Observation: In all cases detecting sooner does not hurt!
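As a rough illustration of the three penalties, the sketch below writes each one as a function of the detection time. The infection times and the horizon `T_MAX` (used to cap the DT penalty for a missed outbreak) are hypothetical, and "by time t" is read as "at or before t".

```python
import math

# Hypothetical single outbreak: time at which each node becomes contaminated.
infection_time = {"a": 0, "b": 2, "c": 4, "d": 7}
T_MAX = 10  # assumed horizon: DT penalty charged when the outbreak is never detected


def penalty_dt(t):
    """Time to detection: the penalty is the detection time, capped at T_MAX."""
    return min(t, T_MAX)


def penalty_dl(t):
    """Detection likelihood: 0 if detected at any finite time, 1 if never."""
    return 1 if math.isinf(t) else 0


def penalty_pa(t):
    """Population affected: number of nodes infected at or before time t."""
    return sum(1 for ti in infection_time.values() if ti <= t)
```

All three functions are non-decreasing in the detection time, which is the "detecting sooner does not hurt" observation.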
We define $f_i(S)$ as penalty reduction:
$$f_i(S) = \pi_i(\infty) - \pi_i(T(S, i))$$
Here $T(S, i)$ is the time at which the first sensor in $S$ detects outbreak $i$; with no sensors the outbreak is never detected, so the baseline penalty is $\pi_i(\infty)$.
¡ Observation: Diminishing returns
§ Placement S = {x1, x2}: adding the new sensor s’ helps a lot
§ Placement S’ = {x1, x2, x3, x4}: adding s’ helps very little
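The diminishing-returns picture can be reproduced numerically. The sketch below assumes the DT penalty $\pi(t) = t$, a single outbreak, and invented detection times for the sensors; `s_new` stands in for the new sensor s’.

```python
T_MAX = 10  # assumed horizon: penalty when no sensor detects the outbreak

# Hypothetical detection times of one outbreak at each candidate sensor.
t = {"x1": 8, "x2": 7, "x3": 3, "x4": 2, "s_new": 4}


def f_i(S):
    """Penalty reduction with pi(t) = t: pi(T_MAX) - pi(T(S, i))."""
    first_detection = min((t[s] for s in S), default=T_MAX)
    return T_MAX - first_detection


def gain(S, s):
    """Marginal gain of adding sensor s to placement S."""
    return f_i(S | {s}) - f_i(S)


small = {"x1", "x2"}
large = {"x1", "x2", "x3", "x4"}
gain_small = gain(small, "s_new")  # new sensor beats every sensor in the small set
gain_large = gain(large, "s_new")  # x3 and x4 already detect earlier
```

In this instance the new sensor improves the small placement but adds nothing to the large one, because the large placement already contains earlier detectors.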
Claim: For all $A \subseteq B \subseteq V$ and sensors $s \in V \setminus B$:
$$f(A \cup \{s\}) - f(A) \ge f(B \cup \{s\}) - f(B)$$
¡ Proof: All our objectives are submodular
§ Fix cascade/outbreak $i$
§ Show $f_i(A) = \pi_i(\infty) - \pi_i(T(A, i))$ is submodular
§ Consider $A \subseteq B \subseteq V$ and sensor $s \in V \setminus B$
§ When does node $s$ detect cascade $i$?
§ We analyze 3 cases based on when $s$ detects outbreak $i$
§ (1) $T(B, i) \le T(A, i) < T(s, i)$: $s$ detects late, nobody benefits:
$f_i(A \cup \{s\}) = f_i(A)$, also $f_i(B \cup \{s\}) = f_i(B)$, and so
$f_i(A \cup \{s\}) - f_i(A) = 0 = f_i(B \cup \{s\}) - f_i(B)$
¡ Proof (contd.): Remember $A \subseteq B$
§ (2) $T(B, i) \le T(s, i) \le T(A, i)$: $s$ detects after $B$ but before $A$
$s$ detects no later than any node in $A$ but no sooner than the earliest node in $B$.
So $s$ only helps improve the solution $A$ (but not $B$):
$f_i(A \cup \{s\}) - f_i(A) \ge 0 = f_i(B \cup \{s\}) - f_i(B)$
§ (3) $T(s, i) < T(B, i) \le T(A, i)$: $s$ detects early
$f_i(A \cup \{s\}) - f_i(A) = [\pi_i(\infty) - \pi_i(T(s, i))] - f_i(A) \ge [\pi_i(\infty) - \pi_i(T(s, i))] - f_i(B) = f_i(B \cup \{s\}) - f_i(B)$
§ Inequality is due to non-decreasingness of $f_i(\cdot)$, i.e., $f_i(A) \le f_i(B)$
§ So, $f_i(\cdot)$ is submodular!
¡ So, $f(\cdot)$ is also submodular, since it is a non-negative weighted sum of the $f_i$:
$$f(S) = \sum_i P(i)\, f_i(S)$$
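The proof can be sanity-checked by brute force on a tiny instance. The sketch below is not part of the proof: it assumes the DT penalty, random made-up detection times, and equal outbreak probabilities, and simply tests the submodularity inequality for every $A \subseteq B$ and every $s \notin B$. The helper names are hypothetical.

```python
import itertools
import random

random.seed(0)
NODES = ["a", "b", "c", "d", "e"]
T_MAX = 20
# Hypothetical data: 4 outbreaks, each with a random arrival time at every node.
T = [{u: random.randint(1, T_MAX) for u in NODES} for _ in range(4)]
P = [0.25] * 4


def f(S):
    """Expected penalty reduction under the DT penalty pi(t) = t."""
    total = 0.0
    for p, times in zip(P, T):
        first_detection = min((times[u] for u in S), default=T_MAX)
        total += p * (T_MAX - first_detection)
    return total


def subsets(items):
    """All subsets of items, as frozensets."""
    for r in range(len(items) + 1):
        for combo in itertools.combinations(items, r):
            yield frozenset(combo)


def is_submodular(func, nodes):
    """Check func(A+s) - func(A) >= func(B+s) - func(B) for all A <= B, s not in B."""
    for B in subsets(nodes):
        for A in subsets(sorted(B)):
            for s in set(nodes) - B:
                if func(A | {s}) - func(A) < func(B | {s}) - func(B) - 1e-12:
                    return False
    return True
```

As a negative control, a function such as `lambda S: len(S) ** 2` has increasing marginal gains and fails the same check.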
[Figure: hill-climbing. Bar chart of the reward (marginal gain) of candidate sensors a, b, c, d, e; at each step add the sensor with the highest marginal gain.]
¡ What do we know about optimizing submodular functions?
§ Hill-climbing (i.e., greedy) is near optimal: it achieves at least $(1 - \frac{1}{e}) \cdot OPT$
¡ But:
§ (1) This only works for the unit cost case! (each sensor costs the same)
§ For us each sensor $s$ has cost $c(s)$
§ (2) Hill-climbing algorithm is slow
§ At each iteration we need to re-evaluate marginal gains of all nodes
§ Runtime $O(|V| \cdot K)$ for placing $K$ sensors
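A minimal sketch of unit-cost hill-climbing follows. The set function is a made-up coverage objective (which is submodular), not the outbreak objective, and `COVERS`, `coverage`, and `greedy` are illustrative names. The inner `max` re-evaluates every remaining node in every round, which is where the $O(|V| \cdot K)$ function evaluations come from.

```python
def greedy(f, V, k):
    """Unit-cost hill-climbing: add the node with the highest marginal gain, k times."""
    S, order = set(), []
    for _ in range(min(k, len(V))):
        base = f(S)
        # Re-evaluate the marginal gain of every remaining node (the slow part).
        best = max(sorted(V - S), key=lambda u: f(S | {u}) - base)
        S.add(best)
        order.append(best)
    return order


# Hypothetical coverage instance: each node "covers" a set of items.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(S):
    """Number of distinct items covered by the nodes in S."""
    return len(set().union(*(COVERS[u] for u in S)))


picked = greedy(coverage, set(COVERS), 2)
```

Candidates are scanned in sorted order so that ties are broken deterministically.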
¡ Consider the following algorithm to solve the outbreak detection problem: hill-climbing that ignores cost
§ Ignore sensor cost $c(s)$
§ Repeatedly select sensor with highest marginal gain
§ Do this until the budget is exhausted
¡ Q: How well does this work?
¡ A: It can fail arbitrarily badly!
§ There exists a problem setting where the hill-climbing solution is arbitrarily far from OPT
§ Next we come up with an example
¡ Bad example when we ignore cost:
§ $n$ sensors, budget $B$
§ $s_1$: reward $r$, cost $B$
§ $s_2 \dots s_n$: reward $r - \epsilon$, cost $1$
§ Hill-climbing always prefers the more expensive sensor $s_1$ with reward $r$ (and exhausts the budget). It never selects the cheaper sensors with reward $r - \epsilon$
→ For variable cost it can fail arbitrarily badly!
¡ Idea: What if we optimize the benefit-cost ratio?
$$s_i = \arg\max_{s \in V} \frac{f(A_{i-1} \cup \{s\}) - f(A_{i-1})}{c(s)}$$
Greedily pick the sensor $s_i$ that maximizes the benefit-to-cost ratio.
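The bad example for cost-ignoring hill-climbing can be played out concretely. The sketch assumes rewards are additive (each sensor detects its own outbreak, so the marginal gain equals the reward), treats a sensor as affordable when the total cost stays at or below the budget, and uses made-up numbers $B = 10$, $r = 10$, $\epsilon = 1$, $n = 21$.

```python
def greedy_ignoring_cost(reward, cost, budget):
    """Repeatedly take the highest-reward sensor that still fits in the budget."""
    chosen, spent = [], 0
    remaining = set(reward)
    while True:
        affordable = sorted(s for s in remaining if spent + cost[s] <= budget)
        if not affordable:
            return chosen
        best = max(affordable, key=lambda s: reward[s])  # cost plays no role here
        chosen.append(best)
        spent += cost[best]
        remaining.remove(best)


B, r, eps = 10, 10, 1            # assumed budget, reward, and reward gap
reward = {"s1": r}
cost = {"s1": B}
for j in range(2, 22):           # cheap sensors s2 ... s21
    reward[f"s{j}"] = r - eps
    cost[f"s{j}"] = 1

picked = greedy_ignoring_cost(reward, cost, B)
greedy_value = sum(reward[s] for s in picked)
cheap_value = B * (r - eps)      # value of buying B cheap sensors instead
```

Here the greedy placement spends the whole budget on one sensor, and the gap to the all-cheap placement grows with $B$.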
¡ Benefit-cost ratio can also fail arbitrarily badly!
¡ Consider: budget $B$:
§ 2 sensors $s_1$ and $s_2$:
§ Costs: $c(s_1) = \epsilon$, $c(s_2) = B$
§ Benefit (only 1 cascade): $f(s_1) = 2\epsilon$, $f(s_2) = B$
§ Then the benefit-cost ratio is:
§ $f(s_1)/c(s_1) = 2$ and $f(s_2)/c(s_2) = 1$
§ So, we first select $s_1$ and then cannot afford $s_2$
→ We get reward $2\epsilon$ instead of $B$! Now send $\epsilon \to 0$ and we get an arbitrarily bad solution!
This algorithm incentivizes choosing nodes with very low cost, even when slightly more expensive ones can lead to much better global results.
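The same kind of check works for the benefit-cost rule. The sketch below uses the two-sensor instance from the slide with assumed values $B = 10$ and $\epsilon = 0.001$, again treating a sensor as affordable when the total cost stays at or below the budget.

```python
def greedy_benefit_cost(reward, cost, budget):
    """Repeatedly take the affordable sensor with the best reward-to-cost ratio."""
    chosen, spent = [], 0.0
    remaining = set(reward)
    while True:
        affordable = sorted(s for s in remaining if spent + cost[s] <= budget)
        if not affordable:
            return chosen
        best = max(affordable, key=lambda s: reward[s] / cost[s])
        chosen.append(best)
        spent += cost[best]
        remaining.remove(best)


B, eps = 10.0, 0.001                   # assumed budget and epsilon
reward = {"s1": 2 * eps, "s2": B}
cost = {"s1": eps, "s2": B}

picked = greedy_benefit_cost(reward, cost, B)
ratio_value = sum(reward[s] for s in picked)
```

After the cheap sensor is bought, the remaining budget is slightly below $B$, so the valuable sensor no longer fits.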
¡ CELF (Cost-Effective Lazy Forward-selection)
A two-pass greedy algorithm:
§ Set (solution) $S'$: Use benefit-cost greedy
§ Set (solution) $S''$: Use unit-cost greedy
§ Final solution: $S = \arg\max(f(S'), f(S''))$
¡ How far is CELF from the (unknown) optimal solution?
¡ Theorem: CELF is near optimal [Krause&Guestrin, ‘05]
§ CELF achieves a ½(1-1/e) factor approximation!
This is surprising: We have two clearly suboptimal solutions, but taking the best of the two is guaranteed to give a near-optimal solution.
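The two-pass idea can be sketched as follows. This is an illustrative version only: it omits the lazy evaluation that gives CELF its name, treats a sensor as affordable when the total cost stays at or below the budget, and the function names and the additive test objective are assumptions.

```python
def budgeted_greedy(f, cost, budget, use_ratio):
    """Greedy under a budget, scoring by marginal gain or by gain / cost."""
    S, spent = set(), 0.0
    while True:
        best, best_score = None, 0.0
        base = f(S)
        for s in sorted(set(cost) - S):
            if spent + cost[s] > budget:
                continue  # cannot afford this sensor any more
            gain = f(S | {s}) - base
            score = gain / cost[s] if use_ratio else gain
            if score > best_score:
                best, best_score = s, score
        if best is None:
            return S
        S.add(best)
        spent += cost[best]


def two_pass_greedy(f, cost, budget):
    """Run both greedy variants and keep the placement with the higher f."""
    s_ratio = budgeted_greedy(f, cost, budget, use_ratio=True)
    s_unit = budgeted_greedy(f, cost, budget, use_ratio=False)
    return max((s_ratio, s_unit), key=f)


# Hypothetical instance where benefit-cost greedy alone fails.
reward_a = {"s1": 0.002, "s2": 10.0}
cost_a = {"s1": 0.001, "s2": 10.0}
best_a = two_pass_greedy(lambda S: sum(reward_a[s] for s in S), cost_a, 10.0)

# Hypothetical instance where cost-ignoring greedy alone fails.
reward_b = {"s1": 10, **{f"s{j}": 9 for j in range(2, 12)}}
cost_b = {"s1": 10, **{f"s{j}": 1 for j in range(2, 12)}}
f_b = lambda S: sum(reward_b[s] for s in S)
best_b = two_pass_greedy(f_b, cost_b, 10)
```

On each of the two bad instances, one pass fails and the other recovers the good placement, so the maximum of the two avoids both failure modes in these examples.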
[Figure: hill-climbing. Bar chart of the reward (marginal gain) of candidate sensors a, b, c, d, e; at each step add the sensor with the highest marginal gain.]
¡ What do we know about optimizing submodular functions?
§ Hill-climbing (i.e., greedy) is near optimal (that is, $(1 - \frac{1}{e}) \cdot OPT$)
¡ But:
§ (2) Hill-climbing algorithm is slow!
§ At each iteration we need to re-evaluate marginal gains of all nodes
§ Runtime $O(|V| \cdot K)$ for placing $K$ sensors
¡ In round $i + 1$: So far we picked $S_i = \{s_1, \dots, s_i\}$
§ Now pick $s_{i+1} = \arg\max_u f(S_i \cup \{u\}) - f(S_i)$
§ This is our old friend, the greedy hill-climbing algorithm. It maximizes the “marginal gain” $\delta_i(u) = f(S_i \cup \{u\}) - f(S_i)$
¡ By the submodularity property:
$f(S_i \cup \{u\}) - f(S_i) \ge f(S_j \cup \{u\}) - f(S_j)$ for $i < j$
¡ Observation: By submodularity, for every $u$:
$\delta_i(u) \ge \delta_j(u)$ for $i < j$, since $S_i \subset S_j$
§ Marginal benefits $\delta_i(u)$ only shrink (as $i$ grows)!
§ Activating node $u$ in step $i$ helps at least as much as activating it at step $j$ ($j > i$)
¡ Idea:
§ Use $\delta_i$ as an upper bound on $\delta_j$ ($j > i$)
¡ Lazy hill-climbing:
§ Keep an ordered list of marginal benefits $\delta_i$ from the previous iteration
§ Re-evaluate $\delta_i$ only for the top node
§ Re-order and prune
$f(S \cup \{u\}) - f(S) \ge f(T \cup \{u\}) - f(T)$ for $S \subseteq T$
[Figure: marginal gains of nodes a, b, c, d, e kept in sorted order; after the first step $S_1 = \{a\}$.]
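Lazy hill-climbing can be sketched with a priority queue. This is one possible implementation under assumptions: unit costs, a made-up coverage objective, and a "stamp" recording the size of the solution when each gain was last computed, so that a stale gain is treated only as an upper bound. The names `lazy_greedy`, `COVERS`, and `coverage` are illustrative.

```python
import heapq


def lazy_greedy(f, V, k):
    """Lazy hill-climbing: re-evaluate only the node at the top of the queue."""
    S, order = set(), []
    base = f(S)
    # Max-heap via negated gains; each entry is (-gain, node, stamp), where
    # stamp = len(order) at the time the gain was computed.
    heap = [(-(f({u}) - base), u, 0) for u in sorted(V)]
    heapq.heapify(heap)
    evals = len(heap)  # number of marginal-gain evaluations so far
    while heap and len(order) < k:
        neg_gain, u, stamp = heapq.heappop(heap)
        if stamp == len(order):
            # Gain is up to date and beats every other (upper-bound) entry.
            S.add(u)
            order.append(u)
            base = f(S)
        else:
            # Stale upper bound: recompute against the current S and re-insert.
            gain = f(S | {u}) - base
            evals += 1
            heapq.heappush(heap, (-gain, u, len(order)))
    return order, evals


# Hypothetical coverage instance: each node "covers" a set of items.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(S):
    """Number of distinct items covered by the nodes in S."""
    return len(set().union(*(COVERS[u] for u in S)))


order, evals = lazy_greedy(coverage, set(COVERS), 2)
```

In this instance the second round re-evaluates only one node, because its refreshed gain still exceeds the stale upper bounds of the others; plain hill-climbing would re-evaluate all three remaining nodes.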