IEEE International Conference on Computer Communications
10-15 April 2016 San Francisco, CA, USA
Targeted Viral Marketing in
Billion-scale Networks
Hung T. Nguyen1, My T. Thai2 and Thang N. Dinh1
1 CS Dept., Virginia Commonwealth University, Richmond, VA 23284
2CISE Dept., University of Florida, Gainesville, FL 32611
Thang N. Dinh
1
I. Introduction: Viral Marketing
Marketing via the “word-of-mouth” effect
Influence Maximization: Find a small set of
users (seeds) to influence most of the network.
Intro.: Viral Marketing Examples
VIRAL MARKETING
ALS Ice Bucket Challenge
o 2.4 M videos uploaded on Facebook
o $98.2 M donated to ALS association
Toys"R"Us
#PlayItForward
o $35.5 donation
Always #LikeAGirl (YouTube)
~60 mil. views
Intro.: Targeted Viral Marketing
What’s wrong with choosing Mr. President to advertise
Shampoo?
Intro.: Targeted Viral Marketing
Targeted Marketing: Focus on customers with
certain traits
Age: 18-30, Like: Music
Tech hobbyists, Age: 25-50
Targeted Viral Marketing:
Seeding strategies to
influence customers of
certain traits.
Targeted Viral Marketing Problem
Real-world data: Social networks such as Twitter,
Stackexchange, etc.
o User relationships: Who follows whom?
o User attributes: Geo-location,
o User-generated contents: Tweets, posts, etc.
Targeted Viral Marketing:
o A company has a budget B to incentivize users
o It hopes to trigger a large cascade of adoption
o Whom to target for “3D printing”, “android”, etc.?
Targeted Viral Marketing (TVM)
Input: A graph 𝐺 = (𝑉, 𝐸, 𝑤), a budget B, and
a propagation model.
Each node 𝑢 has a cost 𝑐(𝑢) and a relevance score
𝑏(𝑢).
Output: A seed set of total cost at most B that
maximizes the expected relevance of the influenced
users (influence spread).
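Written as an optimization problem, the definition above can be read as follows. The notation $I(S)$ for the random set of users activated from seed set $S$ under the chosen propagation model is an assumption introduced here and does not appear on the slide.

```latex
\max_{S \subseteq V} \; \mathbb{E}\Big[ \sum_{v \in I(S)} b(v) \Big]
\quad \text{subject to} \quad \sum_{u \in S} c(u) \le B
```

With $b(v) = 1$ and $c(u) = 1$ for every node and $B = k$, this reduces to the standard influence maximization problem with $k$ seeds.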
Related Work: Influence Maximization
$(1 - 1/e - \epsilon)$-approximation with probability $1 - n^{-1}$

Method | Time complexity | Note
Greedy (KDD’03) | $O(kmn\epsilon^{-3})$ | Original greedy
CELF (KDD’07) | $O(kmn\epsilon^{-3})$ | Lazy-forward, up to 700 times faster than Greedy
TIM/TIM+ (SIGMOD’14) | $O((m + n)(\ln \binom{n}{k} + \ln 2n)\epsilon^{-2})$ | Up to 1000 times faster than CELF
IMM (SIGMOD’15) | $O((m + n)(\ln \binom{n}{k} + \ln 2n)\epsilon^{-2})$ | Up to 100 times faster than TIM/TIM+
SSA/D-SSA (to appear, ACM SIGMOD’16) | Near-linear time; sub-linear time for dense graphs | Guarantees minimum number of samples; up to 1000 times faster than IMM for InfMax
Related Work
Nguyen et al. JSAC’13: Budgeted influence
maximization
o Not scalable; does not consider users’ relevance
Topic-aware influence: No theoretical guarantees on
the quality (Barbieri et al. KAIS 2013, Barbieri et al.
EDBT 2014, Chen et al. VLDB 2015)
Cascading Models
Describe the cascading processes
Popular models:
o Linear Threshold
o Independent Cascades (or Bayesian Network)
o SI/SIS, SIR, SIRS, SEIRS, …
o Load shedding, DC/AC Power Flow Models
Independent Cascade Model
When node v becomes active, it has a chance of
activating each currently inactive neighbor w.
The activation attempt succeeds with probability $p_{vw}$.
[Figure: example network with nodes U, w, and X; edges are labeled with activation probabilities between 0.1 and 0.6.]
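The activation rule above can be sketched as a small Monte Carlo simulation. This is a minimal illustration, not code from the talk: the function names `simulate_ic` and `estimate_spread` and the toy graph `toy` (its edges and probabilities) are assumptions made up for the example.

```python
import random
from collections import deque

def simulate_ic(graph, seeds, rng):
    """Run one Independent Cascade; graph[v] is a list of (w, p_vw) pairs."""
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        v = frontier.popleft()
        for w, p_vw in graph.get(v, []):
            # v gets a single chance to activate each currently inactive neighbor w
            if w not in active and rng.random() < p_vw:
                active.add(w)
                frontier.append(w)
    return active

def estimate_spread(graph, seeds, runs=2000, seed=0):
    """Average number of active nodes over independent cascade runs."""
    rng = random.Random(seed)
    return sum(len(simulate_ic(graph, seeds, rng)) for _ in range(runs)) / runs

# Hypothetical toy graph (not the one drawn on the slide).
toy = {"v": [("w", 0.5), ("x", 0.2)], "w": [("u", 0.3)], "x": [("u", 0.1)]}
print(estimate_spread(toy, ["v"]))
```

Each node is dequeued once, so every edge is tried at most once per run, which matches the "single chance" semantics of the model.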
Example
[Figure: one Independent Cascade run on the example network (nodes U, v, w, X; edge activation probabilities between 0.1 and 0.6). Legend: inactive node, active node, newly active node, successful attempt, unsuccessful attempt.]
Challenges: Targeted VM
Exponential number of
possible worlds
Scale of the social networks
o Facebook: ≥ 1.5 billion nodes,
100 billion edges, 111 PB
adjacency matrix, 2.92 TB
adjacency list
o Twitter: ≥ 5 billion edges,
3 billion tweets/mo.
Heterogeneous nodes’ costs and relevance (benefit)
Difficult to estimate the number of needed samples.
General Framework
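Goal: $\max_{S \in \Omega} f(S)$
o RIS sampling: sample generator $\mathcal{T}$ [size $|\mathcal{T}| = \theta(\epsilon, \delta)$], so that $\hat{f}_{\mathcal{T}}(S) \sim f(S)$ w.h.p.
o Solve $\max_{S \in \Omega} \hat{f}_{\mathcal{T}}(S)$ with an $\alpha$-approx. algorithm $\mathcal{A}$ (Max-coverage, $(1 - 1/e)$-approx.), giving $S_{\mathcal{A}} \in \Omega$ with $\hat{f}_{\mathcal{T}}(S_{\mathcal{A}}) \ge \alpha \cdot OPT_{\hat{f}_{\mathcal{T}}}$
o Bounding techniques: $(\alpha - \epsilon)$-approx. solution $S_{\mathcal{A}}$, i.e. $f(S_{\mathcal{A}}) \ge (\alpha - \epsilon)\, OPT_f$ with prob. $(1 - \delta)$
Difficult to get $(\alpha - \epsilon)\,OPT$ multiplicative error
How many samples? $\theta(\epsilon, \delta) = ???$
How to achieve the minimum number of samples?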
RIS Sampling (Borgs et al. ’14)
Generate hypergraph ℋ with hyperedges:
o Select a random 𝑢 ∈ 𝑉 and a random graph sample 𝑔
o Hyperedge ℰ = { nodes that can reach 𝑢 in 𝑔}
• Note: Instead of generating 𝑔, we can use reverse BFS
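[Figure: example graph on nodes a, b, c with edge probabilities 0.6, 0.2, 0.3; one hyperedge is generated for each root u = a, u = b, u = c.]
Example: Assuming the
Independent Cascade model
$\mathcal{E}_1 = \{a, b\}$
$\mathcal{E}_2 = \{b, a, c\}$
$\mathcal{E}_3 = \{c, a\}$
$\mathcal{H} = (V, \{\mathcal{E}_1, \mathcal{E}_2, \mathcal{E}_3\})$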
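The reverse-BFS shortcut mentioned in the note can be sketched as below, assuming the Independent Cascade model. The helper names `reverse_graph` and `sample_hyperedge`, the optional `root` argument, and the edge directions in `edges` are assumptions for illustration; the slide's figure does not survive in enough detail to recover its exact edges.

```python
import random
from collections import deque

def reverse_graph(edges):
    """edges: iterable of (src, dst, p). Returns dst -> [(src, p)]."""
    rev = {}
    for src, dst, p in edges:
        rev.setdefault(dst, []).append((src, p))
    return rev

def sample_hyperedge(nodes, rev, rng, root=None):
    """Return the set of nodes that can reach the root in one random graph sample."""
    u = rng.choice(nodes) if root is None else root
    reached = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for src, p in rev.get(x, []):
            # Flip the coin for edge (src -> x) only when the traversal meets it,
            # instead of materializing the whole sample graph g up front.
            if src not in reached and rng.random() < p:
                reached.add(src)
                queue.append(src)
    return reached

# Hypothetical directed edges with the probabilities shown on the slide.
edges = [("a", "b", 0.6), ("a", "c", 0.2), ("b", "c", 0.3)]
rev = reverse_graph(edges)
rng = random.Random(0)
hyperedges = [sample_hyperedge(["a", "b", "c"], rev, rng) for _ in range(5)]
print(hyperedges)
```

Only edges entering already-reached nodes are ever examined, so the cost of one sample is proportional to the part of the graph the traversal touches rather than to the whole graph.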
RIS Sampling (cont.)
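[Figure: the same example graph on nodes a, b, c with edge probabilities 0.6, 0.2, 0.3.]
$\mathcal{E}_1 = \{a, b\}$
$\mathcal{E}_2 = \{b, a, c\}$
$\mathcal{E}_3 = \{c, a\}$
Observation: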
o Influential nodes appear more often in the hyperedges
o Influential seed set = one that covers most hyperedges
RIS framework (Borgs et al., Tang et al. 2014)
1. Generate multiple hyperedges
2. Find seed set that covers most hyperedges
using greedy algorithm for Max-Coverage.
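Step 2 can be sketched with the classic greedy algorithm for Max-Coverage, run here on the three hyperedges from the example. This is the unit-cost version with a cardinality limit `k`; the function name `greedy_max_coverage` is an assumption, and the budgeted, cost-aware variant needed for TVM is not shown.

```python
def greedy_max_coverage(hyperedges, k):
    """Pick up to k nodes that together cover as many hyperedges as possible."""
    # node -> indices of the hyperedges that contain it
    membership = {}
    for i, edge in enumerate(hyperedges):
        for node in edge:
            membership.setdefault(node, set()).add(i)

    covered, seeds = set(), []
    for _ in range(k):
        # Choose the node with the largest marginal gain in covered hyperedges.
        best = max(membership, key=lambda v: len(membership[v] - covered), default=None)
        if best is None or not (membership[best] - covered):
            break  # nothing left to gain
        seeds.append(best)
        covered |= membership[best]
    return seeds, len(covered)

hyperedges = [{"a", "b"}, {"b", "a", "c"}, {"c", "a"}]
seeds, covered = greedy_max_coverage(hyperedges, k=1)
print(seeds, covered)  # ['a'] 3
```

Node a appears in all three hyperedges, so a single seed already covers everything, which is the "influential nodes appear more often" observation in miniature.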
Number of Samples (Threshold)
Time complexity (expected) =
#Hyperedges [$m_{\mathcal{H}}$] x (Time to generate a hyperedge) [EPT]
Decides the running time
A - How many hyperedges are sufficient?
$\theta \ge (8 + \epsilon)\, n \cdot \frac{\ln \binom{n}{k} + \ln 2n}{OPT_k \, \epsilon^2}$
[$(1 - 1/e - \epsilon)$-approx. with probability $1 - n^{-1}$] (Tang et al. ’14)
($OPT_k$ is unknown in advance)
B - Can we generate just a little more than $\theta$ hyperedges?
- TIM: Lower-bound OPT by KPT ≤ OPT
- TIM+: Lower-bound OPT by KPT+ ∈ [KPT, OPT]
Highly sophisticated estimation
No guarantees on the number of samples
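To get a feel for the threshold in part A, the bound can be evaluated numerically. The sketch below assumes the reading $\theta \ge (8 + \epsilon)\, n\, (\ln \binom{n}{k} + \ln 2n) / (OPT_k\, \epsilon^2)$; the function names and the sample values of `n`, `k`, `eps`, and the lower bound on $OPT_k$ are illustrative assumptions, not figures from the talk.

```python
import math

def log_binomial(n, k):
    """ln C(n, k), computed with lgamma so it works for very large n."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def ris_threshold(n, k, eps, opt_lower_bound):
    """Number of hyperedges that suffices when OPT_k >= opt_lower_bound."""
    numerator = (8 + eps) * n * (log_binomial(n, k) + math.log(2 * n))
    return numerator / (opt_lower_bound * eps ** 2)

# Hypothetical inputs: a million-node graph, 50 seeds, eps = 0.1.
n, k, eps = 1_000_000, 50, 0.1
for bound in (1_000, 10_000, 100_000):
    print(bound, math.ceil(ris_threshold(n, k, eps, bound)))
```

Because $OPT_k$ sits in the denominator, plugging in any valid lower bound such as KPT yields a larger, still sufficient sample count, and a tighter lower bound directly reduces the number of hyperedges to generate.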
BCT Algorithm
Effective stopping conditions to generate “just
enough” samples.
Importance sampling to guarantee an almost linear
number of samples
Provably bounded errors and high confidence
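The slide does not spell out the importance-sampling scheme. One plausible reading, assumed here and not confirmed by the slide, is that the root of each hyperedge is drawn with probability proportional to its relevance score b(u) instead of uniformly, so that coverage of hyperedges tracks expected relevance rather than plain node counts. The function name `sample_root` and the scores in `relevance` are hypothetical.

```python
import random
from collections import Counter

def sample_root(nodes, relevance, rng):
    """Draw one root with probability proportional to its relevance score b(u)."""
    weights = [relevance[u] for u in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

# Hypothetical relevance scores: node c is irrelevant to the campaign.
nodes = ["a", "b", "c"]
relevance = {"a": 3.0, "b": 1.0, "c": 0.0}
rng = random.Random(0)
counts = Counter(sample_root(nodes, relevance, rng) for _ in range(4000))
print(counts)
```

Under this assumption, nodes with zero relevance are never chosen as roots, so no samples are spent on users outside the target group.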
Provable Guarantees
Experiments
Datasets
Results: Benefit comparison
BCT results in the best benefit
with the same budget!
Results: Quality & Running time
Results: Running time on Twitter
Seeding Quality
Twitter: 40 million nodes, 1.5 billion edges, 106
million tweets