is similar to paper citations in academia. A paper that is cited often is considered to
contain important ideas. A paper that is seldom or never cited is considered to be less
important. The following paragraphs present two algorithms for incorporating link
information into search engines: PageRank (Page et al., 1998) and Kleinberg’s Hubs
and Authorities (Kleinberg, 1999).
The PageRank algorithm takes a set of interconnected pages and calculates a
score for each. Intuitively, the score for a page is based on how many other pages
point to that page and what their scores are. A page that is pointed to by a few other
important pages is probably itself important. Similarly, a page that is pointed to by
numerous other marginally important pages is probably itself important. But a page
that is not pointed to by anything probably isn’t important.
A more formal definition taken from (Page et al., 1998) is: Let u be a web page. Then let F_u be the set of pages u points to and B_u be the set of pages that point to u. Let N_u = |F_u| be the number of links from u. Then let E(u) be an a priori score assigned to u. Then R(u), the score for u, is calculated:

R(u) = \sum_{v \in B_u} \frac{R(v)}{N_v} + E(u)
So the score for a page is some constant plus the sum of the scores of its incoming
links. Each incoming link has the score of the page it is from divided by the number
of outgoing links from that page (so a page’s score is divided evenly among its out-
going links). The constant E(u) serves a couple of functions. First, it counterbalances
the effect of ”sinks” in the network. These are pages or groups of pages that are dead
ends – they are pointed to, but they don’t point out to any other pages. E(u) pro-
vides a ”source” of score that counterbalances the ”sinks” in the network. Secondly,
it provides a method of introducing a priori scores if certain pages are known to be
authoritative.
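To make the calculation concrete, the following is a minimal sketch of this iteration in Python. The toy graph, the uniform E(u) term, the renormalization step, and the fixed iteration count are illustrative assumptions rather than details taken from the original paper.

def page_rank(out_links, e_weight=0.15, iterations=50):
    pages = list(out_links)
    n = len(pages)
    scores = {u: 1.0 / n for u in pages}
    for _ in range(iterations):
        # E(u): a uniform a priori "source" of score for every page
        new_scores = {u: e_weight / n for u in pages}
        for v in pages:
            if out_links[v]:
                share = scores[v] / len(out_links[v])   # R(v) / N_v
                for u in out_links[v]:
                    new_scores[u] += share
        total = sum(new_scores.values())
        scores = {u: s / total for u, s in new_scores.items()}  # keep scores comparable across iterations
    return scores

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(page_rank(toy_graph))

Running the sketch on the toy graph gives the highest score to the page that collects score from several others, which is the behavior the formula above describes.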
The PageRank algorithm can be combined with other techniques to create a
search engine. For example, PageRank is first used to assign a score to all pages
in a population. Next a simple keyword search is used to find a list of relevant can-
didate pages. The candidate pages are ordered according to their PageRank score.
The top ranked pages presented to the user are both relevant (based on the keyword
match) and authoritative (based on the PageRank score). The Google search engine
is based on PageRank combined with other factors including standard IR techniques
and the text of incoming links.
Kleinberg’s algorithm (Kleinberg, 1999) differs from the PageRank approach in
two important respects:
1. Whereas the PageRank approach assigns a score to each page before applying
a text search, Kleinberg assigns a score to a page within the context of a text
search. For example, a page containing both the words ”Microsoft” and ”Oracle”
may receive a high score if the search string is ”Oracle” but a low score if the
search string is ”Microsoft.” PageRank would assign a single score regardless of
the text search.

2. Kleinberg’s algorithm draws a distinction between ”hubs” and ”authorities.”
Hubs are pages that point out to many other pages on a topic (preferably many
are authorities). Authorities are pages that are pointed to by many other pages
(preferably many are hubs). Thus the two have a symbiotic relationship.
The first step in a search is to create a ”focused subgraph.” This is a small subset
of the Internet that is rich in pages relevant to the search and also contains many
of the strongest authorities on the searched topic. This is done by doing a pure text
search and retrieving the top t pages (t about 200). This set is augmented by adding all the pages pointed to by these top pages and all pages that point to them (for pages with a large in-degree, only a subset of the in-pointing pages are added). Note that
adding these pages may add pages that do not contain the original search string!
This is actually a good thing because often an authority on a topic may not contain
the search string. For example, toyota.com may not contain the string ”automobile
manufacturer” (Rokach and Maimon, 2006) or a page may discuss several machine
learning algorithms but not have the phrase ”machine learning” because the authors
always use the phrase ”Data Mining.” Adding the linked pages pulls in related pages
whether they contain the search text or not.
The second step calculates two scores for each page: an authority score and a hub
score. Intuitively, a page’s authority score is the normalized sum of the hub scores of
all pages that point to it. A page’s hub score is the normalized sum of the authority
scores of all the pages it points to. By iteratively recalculating each page's hub and
authority score, the scores converge to an equilibrium. The reinforcing relationship
between hubs and authorities helps the algorithm differentiate true authorities on a
topic from generally popular web sites such as amazon.com and yahoo.com.
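The iterative hub/authority calculation can be sketched in a few lines of Python. The toy focused subgraph and the iteration count below are illustrative assumptions.

from math import sqrt

def hits(out_links, iterations=50):
    nodes = list(out_links)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: normalized sum of hub scores of pages that point to the page
        auth = {n: sum(hub[m] for m in nodes if n in out_links[m]) for n in nodes}
        norm = sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # hub score: normalized sum of authority scores of pages the page points to
        hub = {n: sum(auth[m] for m in out_links[n]) for n in nodes}
        norm = sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

focused_subgraph = {"p1": ["a1", "a2"], "p2": ["a1", "a2"], "a1": [], "a2": ["a1"]}
hubs, authorities = hits(focused_subgraph)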
In summary, both algorithms measure a web page’s importance by its relation-
ships with other web pages – an extension of the notion of importance in a social
network being determined by position in the network.
18.4 Viral Marketing
Viral marketing relies heavily on ”word-of-mouth” advertising where one individual who has bought a product tells their friends about the product (Domingos and Richardson, 2001, Richardson and Domingos, 2002, Kempe et al., 2003). A famous example of viral market-
ing is the rapid spread of Hotmail as a free email service. Attached to each email
was a short advertisement and Hotmail’s URL, and customers spread the word about
Hotmail simply by emailing their family and friends. Hotmail grew from zero to 12
million users in 18 months (Richardson, 2002). Word-of-mouth is a powerful form
of advertising because if a friend tells me about a product, I don’t believe that they
have a hidden motive to sell the product. Instead I believe that they really believe in
the inherent quality of the product enough to tell me about it.
Products are not the only things that spread by word of mouth. Fashion trends
spread from person to person. Political ideas are transferred from one person to the
next. Even technological innovations are transferred among a network of coworkers
and peers. These problems can all be viewed as the diffusion of information, ideas,
or influence among the members of a social network (Kempe et al., 2003).
These social networks take on a variety of forms. The most easily understood
is the old-fashioned ”social network” where people are friends, neighbors, coworkers,
etc. and their connections are personal and often face-to-face. For example, when a
new medical procedure is invented, some physicians will be early adopters, but oth-
ers will wait until close friends have tried the procedure and been successful. The
Internet has also created social networks with virtual connections. In a collaborative
filtering system, a recommendation for a book, movie, or musical CD may be made
to Customer A based on N ”similar” customers – customers who have bought similar
items in the past. Customer A is influenced by these N other customers even though
Customer A never meets them, communicates with them, or even knows their identi-
ties. Knowledge-sharing sites provide a second type of virtual connection with more
explicit interaction. On these sites people provide reviews and ratings on things rang-
ing from books to cars to restaurants. As an individual follows the advice offered by
various ”experts” they grow to trust the advice of some and not trust the advice of
others.

Formally, this can be modeled as a social network where the nodes are people and node X_i is linked to node X_j if the person represented by X_i in some way influences X_j. From a marketing standpoint some natural questions emerge. ”Which nodes should I market to in order to maximize my profit?” Or alternatively, ”If I only have the budget to market to k of the n nodes in the network, which k should I choose
to maximize the spread of influence?” Based on work from the field of social net-
work analysis, two plausible approaches would be to pick nodes with the highest
out-degree (nodes that influence a lot of other nodes) or nodes with good distance
centrality (nodes that have a short average distance to the rest of the network). Two
recent approaches to these questions are described in the following paragraphs.
The first approach, proposed by (Domingos and Richardson, 2001) and (Richardson and Domingos, 2002), models the social network as a Markov random field. The probability that each node will purchase the product or adopt the idea is modeled as P(X_i | N_i, Y, M), where N_i are the neighbors of X_i, the ones who directly influence X_i. Y is a vector of attributes describing the product. This reflects the fact that X_i is influenced not only by neighbors but also by the attributes of the product itself. A bald man probably won't buy a hairbrush even if all the people he trusts most do. M_i is the marketing action taken for X_i. This reflects the fact that a customer's decision to buy is influenced by whether he is marketed to, for example whether he receives a discount. This probability can be combined with other information (how much it costs to market to a customer, what the revenue is from a customer who was marketed to, and what the revenue is from a customer who was not) to calculate the expected profit from a particular marketing plan.
Various search techniques such as greedy search and hill-climbing can be employed
to find local maxima for the profit.
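As a rough illustration only, the sketch below runs a greedy search over which customers to market to. The purchase-probability function is a simple stand-in, not the Markov random field model of the cited papers, and the influence weights, marketing boost, revenue, and cost figures are all assumptions.

def purchase_prob(i, marketed, neighbors, base, boost=0.2, weight=0.5):
    # Stand-in for P(X_i | N_i, Y, M_i): intrinsic interest blended with neighbor
    # influence, plus a fixed boost when the customer is marketed to directly.
    if neighbors[i]:
        influence = sum(min(1.0, base[j] + (boost if j in marketed else 0.0))
                        for j in neighbors[i]) / len(neighbors[i])
    else:
        influence = 0.0
    p = (1 - weight) * base[i] + weight * influence
    return min(1.0, p + (boost if i in marketed else 0.0))

def expected_profit(marketed, neighbors, base, revenue=10.0, cost=1.0):
    total = sum(revenue * purchase_prob(i, marketed, neighbors, base) for i in base)
    return total - cost * len(marketed)

def greedy_marketing_plan(neighbors, base):
    # Hill-climbing: keep adding the customer whose marketing most raises expected profit.
    marketed = set()
    while True:
        current = expected_profit(marketed, neighbors, base)
        gains = {i: expected_profit(marketed | {i}, neighbors, base) - current
                 for i in base if i not in marketed}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] <= 0:
            return marketed
        marketed.add(best)

neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
base = {"a": 0.1, "b": 0.3, "c": 0.3}
print(greedy_marketing_plan(neighbors, base))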
A second approach, proposed by (Kempe et al., 2003), uses a more operational model of how ideas spread within a network. A set of nodes is initialized to be active (indicating they bought the product or adopted the idea) at time t = 0. If an inactive node X_i has active neighbors, those neighbors exert some influence on X_i to become active. As more of X_i's neighbors become active, this may cause X_i to become active. Thus the process unfolds in a set of discrete steps where a set of nodes change their values at time t based on the set of active nodes at time t − 1. Two models for how nodes become activated are:
1. Linear Threshold Model. Each link coming into X_i from its neighbors has a weight. When the sum of the weights of the links from X_i's active neighbors surpasses a threshold θ_i, then X_i becomes active at time t + 1. The process runs until the network reaches an equilibrium state (a small simulation of this model is sketched after the list).
2. Independent Cascade Model. When a neighbor of an inactive node X_i first becomes active, it has one chance to activate X_i. It succeeds at activating X_i with probability p_{i,j}, in which case X_i becomes active at time t + 1. The process runs until no new nodes become active.
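The following is a minimal simulation of the Linear Threshold Model from item 1. The link weights, thresholds, and initial active set are illustrative assumptions.

def linear_threshold(in_weights, thresholds, initially_active):
    # in_weights[i] maps each neighbor j of node i to the weight of the link j -> i
    active = set(initially_active)
    while True:
        newly_active = {
            i for i in in_weights
            if i not in active
            and sum(w for j, w in in_weights[i].items() if j in active) >= thresholds[i]
        }
        if not newly_active:        # equilibrium reached: no node crossed its threshold
            return active
        active |= newly_active      # these nodes become active at time t + 1

in_weights = {"a": {}, "b": {"a": 0.6}, "c": {"a": 0.3, "b": 0.4}, "d": {"c": 0.8}}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
print(linear_threshold(in_weights, thresholds, {"a"}))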
Kempe et al. present a greedy hill-climbing algorithm and prove that the spread it achieves is within a factor of (1 − 1/e), roughly 63%, of optimal. Empirical experiments show that the greedy
algorithm performs better than picking nodes with the highest out-degree or the best
distance centrality.
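The sketch below pairs a simple Monte Carlo simulation of the Independent Cascade Model with greedy seed selection of the kind analyzed by Kempe et al. The toy graph, activation probabilities, simulation count, and budget k are illustrative assumptions; the cited work uses far more careful estimation.

import random

def independent_cascade(out_probs, seeds, rng):
    active, frontier = set(seeds), list(seeds)
    while frontier:
        next_frontier = []
        for i in frontier:
            for j, p in out_probs.get(i, {}).items():
                if j not in active and rng.random() < p:   # one chance to activate j
                    active.add(j)
                    next_frontier.append(j)
        frontier = next_frontier
    return active

def expected_spread(out_probs, seeds, runs=200, seed=0):
    rng = random.Random(seed)
    return sum(len(independent_cascade(out_probs, seeds, rng)) for _ in range(runs)) / runs

def greedy_seed_set(out_probs, k):
    # Greedily add the node that most increases the estimated spread.
    chosen = set()
    for _ in range(k):
        candidates = [n for n in out_probs if n not in chosen]
        best = max(candidates, key=lambda n: expected_spread(out_probs, chosen | {n}))
        chosen.add(best)
    return chosen

graph = {"a": {"b": 0.4, "c": 0.4}, "b": {"c": 0.3}, "c": {"d": 0.5}, "d": {}}
print(greedy_seed_set(graph, k=2))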
Areas of further research in viral marketing include dealing with the fact that
network knowledge is often incomplete. Network knowledge can be acquired, but
this involves a cost that must be factored into the overall marketing cost.
18.5 Law Enforcement & Fraud Detection
Link analysis was used in law enforcement long before the advent of computers.
Police and detectives would manually create charts showing how people and pieces
of evidence in a crime were connected. Computers greatly advanced these techniques
in two key ways:
1. Visualization of crime/fraud networks. Charts that were previously manually
drawn and static became automatically drawn and dynamic. A variety of link
analysis visualization tools allowed users to perform such operations as:
a) Automatically arranging networks to maximize clarity (e.g. minimize link
crossings),
b) Rearranging a network by dragging and dropping nodes,
c) Filtering out links by weight or type,
d) Grouping nodes by types.
2. Proliferation of databases containing information to link people, events, accounts,
etc. Two people or accounts could be linked because:
a) One sent a wire transfer to the other,
b) They were in the same auto accident and were mentioned in the insurance
claim together,
c) They both owned the same house at different times,
d) They share a phone number.
All these pieces of information were gathered for non-law-enforcement reasons and stored in databases, but they could be used to detect fraud and crime rings.
A pioneering work in automated link analysis for law enforcement is FAIS (Sen-
ator et al., 1995), a system for investigating money laundering developed at FINCEN
(Financial Crimes Enforcement Network). The data supporting FAIS was a database
of Currency Transaction Reports (CTRs) and other forms filed by banks, brokerages,
casinos, businesses, etc. when a customer conducts a cash transaction over $10,000.
Entities were linked to each other because they appeared on the same CTR, shared
an address, etc. The FAIS system provided leads to investigators on which people,
businesses, accounts, or locations they should investigate. Starting with a suspicious
entity, an investigator could branch out to everyone linked to that entity, then ev-
eryone linked to those entities, and so on. This information was then displayed in a
link analysis visualization tool where it could be more easily manipulated and un-
derstood.
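The branching-out step can be sketched as a breadth-first expansion over an entity link graph. The link structure below is a made-up stand-in for links derived from shared CTRs, addresses, and so on; it is not FAIS's actual data model.

from collections import deque

def expand_from(links, start, max_hops=2):
    # Breadth-first walk: the starting entity, everything linked to it,
    # everything linked to those entities, and so on, up to max_hops links away.
    hops = {start: 0}
    queue = deque([start])
    while queue:
        entity = queue.popleft()
        if hops[entity] == max_hops:
            continue
        for neighbor in links.get(entity, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[entity] + 1
                queue.append(neighbor)
    return hops   # entity -> number of links away from the starting entity

links = {"acct_1": ["person_A", "person_B"], "person_A": ["acct_1", "acct_2"],
         "person_B": ["acct_1"], "acct_2": ["person_A", "person_C"], "person_C": ["acct_2"]}
print(expand_from(links, "acct_1", max_hops=2))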
Insurance fraud is a crime that is usually carried out by rings of professionals.
A ringleader orchestrates staged auto accidents and partners with fraudulent doc-
tors, lawyers, and repair shops to file falsified claims. Over time, this manifests itself
in the claim data as many claims involving an interlinked group of drivers, passen-
gers, doctors, lawyers, and body shops. Each claim in isolation looks legitimate, but
taken together they are extremely suspicious. The ”NetMap for Claims” solution
from NetMap Analytics, like FAIS, allows users to start with a person, find everyone
directly linked to them (on a claim together), then find everyone two links away, then
three links away, etc. These people and their interconnections are then displayed in a
versatile visualization tool.
More recent work has focused on generating leads not based on any one node but
instead on the relationships among nodes. This is because many nodes in isolation
are perfectly legitimate or are only slightly suspicious but innocuous enough to stay
”under the radar.” But when several of these ”slightly suspicious” nodes are linked
together, it becomes very suspicious (Donoho and Johnstone, 1995). For example,
if a bank account has cash deposits of between $2000 and $5000 in a month’s time,
this is only slightly suspicious. At a large bank there will be numerous accounts that meet these criteria – too many to investigate. But if it is found that 10 such accounts
are linked by shared personal information or by transactions, this is suddenly very
suspicious because it is highly unlikely this would happen by chance and this is
exactly what money launderers do to hide their behavior.
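As a sketch of this idea, the code below groups accounts that individually fall just below the alert level into connected components over their shared links, and flags any group that grows unusually large. The link data, the suspicion test, and the size threshold are illustrative assumptions.

def linked_suspicious_groups(links, slightly_suspicious, min_group_size=5):
    flagged = {a for a in links if a in slightly_suspicious}
    groups, seen = [], set()
    for account in flagged:
        if account in seen:
            continue
        component, stack = set(), [account]
        while stack:                       # connected component among flagged accounts only
            a = stack.pop()
            if a in component:
                continue
            component.add(a)
            stack.extend(n for n in links.get(a, ()) if n in flagged)
        seen |= component
        if len(component) >= min_group_size:
            groups.append(component)       # many linked "slightly suspicious" accounts
    return groups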
Scenario-based approaches represent known instances of crimes as networks, and new situations are flagged as suspicious if they are sufficiently similar to a known crime instance. Fu et al. (2003) describe a system to detect contract murders by the Russian mafia. The system contains a library of known contract murders described as
people linked by phone calls, meetings, wire transfers, etc. A new set of events may
match a known instance if it has a similar network topology even if a phone call
in one matches a meeting in the other (both are communication events) or if a wire
transfer in one matches a cash payment in the other (both are payment events). The
LAW system (Ruspini et al., 2003) is a similar system to detect terrorist activities before a terrorist act occurs. LAW measures the similarity between two networks using edit distance – the number of edits needed to convert network #1 to be exactly like network #2.
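If the networkx library is available, a tiny illustration of comparing a new set of events to a known pattern by edit distance looks like the following; the two toy graphs are assumptions, not the representation used by the LAW system.

import networkx as nx

# A known contract-murder pattern and a new set of observed events, both as
# small graphs of linked people (illustrative only).
known_pattern = nx.Graph([("boss", "middleman"), ("middleman", "hitman"), ("boss", "hitman")])
new_events = nx.Graph([("X", "Y"), ("Y", "Z")])

# Number of node/edge edits needed to turn one graph into the other; a smaller
# value means the new events look more like the known crime instance.
print(nx.graph_edit_distance(known_pattern, new_events))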
In summary, crime rings and fraud rings are their own type of social network, and
analyzing the relationships among entities is a powerful method of detecting those
rings.
18.6 Combining with Traditional Methods
Many recent works focus on combining link analysis techniques with traditional
knowledge discovery techniques such as inductive inference and clustering.
Jensen (1999) points out that certain challenges arise when traditional induction
techniques are applied to linked data:
1. The linkages in the data may cause instances to no longer be statistically inde-
pendent. If multiple instances are all linked to the same entity and draw some of
their characteristics from that entity, then those instances are no longer independent.
2. Sampling becomes very tricky. Because instances are interlinked with each other, sampling breaks many of these linkages. Relational attributes of an instance such
as degree, closeness, betweenness can be drastically changed by sampling.
3. Attribute combinatorics are greatly increased in linked data. In addition to an in-
stance’s k intrinsic attributes, an algorithm can draw upon attributes from neigh-
bors, neighbors’ neighbors, etc. Yet more attributes arise from the combinations
of neighbors and the topologies with which they are linked.
These and other challenges are discussed more extensively in (Jensen, 1999).
Neville and Jensen (2000) present an iterative method of classification using re-
lational data. Some of an instance’s attributes are static. These include intrinsic at-
tributes (which contain information about an instance by itself regardless of linkages)
and static relational attributes (which contain information about an instance’s linked
neighbors but are not dependent on the neighbors’ classification). More interesting
are the dynamic relational attributes. These attributes may change value as an in-
stance’s neighbors change classification. So the output of one instance (its class) is
the input of a neighboring instance (in its dynamic relational attributes), and vice
versa. The algorithm iteratively recalculates instances' classes. There are m iterations, and at iteration i it accepts class labels on the N · (i/m) instances with the highest certainty. Classification proceeds from the instances with highest certainty to
those with lowest certainty until all N instances are classified. In this way instances
with the highest certainty have more opportunity to affect their neighbors.
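A compact sketch of this iterative scheme follows, where classify(x, accepted) is assumed to be any relational classifier that returns a (label, certainty) pair using the currently accepted labels of x's neighbors; both the classifier and the number of iterations are assumptions.

def iterative_classification(instances, classify, m=10):
    accepted = {}                          # instance -> accepted class label
    n = len(instances)
    for i in range(1, m + 1):
        predictions = {x: classify(x, accepted) for x in instances}
        # keep labels only for the n * (i / m) most certain instances this round
        ranked = sorted(instances, key=lambda x: predictions[x][1], reverse=True)
        accepted = {x: predictions[x][0] for x in ranked[: int(n * i / m)]}
    return accepted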
Using linkage data in classification has also been applied to the classification of
text (Chakrabarti et al., 1998, Slattery, 2000, Oh et al., 2000, Lu, 2003). Simply incor-
porating words from neighboring texts was not found to be helpful, but incorporating
more targeted information such as hierarchical category information, predicted class,
and anchor text was found to improve accuracy.
In order to cluster linked data, Neville et al. (2003) combine traditional clustering with graph partitioning techniques. Their work uses similarity metrics taken from traditional clustering to assign weights to linked graphs. Once this is done, several standard graph partitioning techniques can be used to partition the graph into clusters. Taskar et al. (2001) cluster linked data using probabilistic relational models.
While most work in link analysis assumes that the graph is complete and fairly
correct, this is often far from the truth. Work in the area of link completion (Kubica et al., 2002, Goldenberg et al., 2003, Kubica et al., 2003) induces missing links from
previously observed data. This allows users to ask questions about a future state of
the graph such as ”Who is person XYZ likely to publish a paper with in the next
year?”
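The cited systems use statistical and machine learning models for this; purely as a stand-in, a common-neighbors baseline shows the flavor of predicting which links are missing or likely to appear. The co-author data below is an illustrative assumption.

from itertools import combinations

def common_neighbor_scores(neighbors):
    # Score each currently unlinked pair by how many neighbors they already share.
    scores = {}
    for a, b in combinations(sorted(neighbors), 2):
        if b not in neighbors[a]:
            shared = len(set(neighbors[a]) & set(neighbors[b]))
            if shared:
                scores[(a, b)] = shared
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

coauthors = {"xyz": {"a", "b"}, "a": {"xyz", "c"}, "b": {"xyz", "c"}, "c": {"a", "b"}}
print(common_neighbor_scores(coauthors))   # ("c", "xyz") ranks high: a likely future co-authorship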
There are many possible ways link analysis can be combined with traditional
techniques, and many remain unexplored, making this a fruitful area for future re-
search.
18.7 Summary
Link analysis is a collection of techniques that operate on data that can be repre-
sented as nodes and links. A variety of applications rely on link analysis techniques
or can be improved by link analysis techniques. Among these are Internet search,
viral marketing, fraud detection, crime prevention, and sociological study. To support
these applications, a number of new link analysis techniques have emerged in recent
years. This chapter has surveyed several of these including subgraph matching, find-
ing cliques and K-plexes, maximizing spread of influence, visualization, and finding
hubs and authorities. A fruitful area for future research is the combination of link
analysis techniques with traditional Data Mining techniques.
References
Chakrabarti S, Dom B, Agrawal R, & Raghavan P. Scalable feature selection, classification
and signature generation for organizing large text databases into hierarchical topic tax-
onomies. VLDB Journal: Very Large Data Bases 1998. 7:163 – 178.
Domingos P & Richardson M. Mining the network value of customers. Proceedings of the
Seventh International Conference on Knowledge Discovery and Data Mining; 2001 Au-
gust 26 – 29; San Francisco, CA. ACM Press, 2001.
Donoho S. & Lewis S. Understand behavior detection technology: Emerging approaches to
dealing with three major consumer protection threats. April 2003.
Fu D, Remolina E, & Eilbert J. A CBR approach to asymmetric plan detection. Proceedings of Workshop on Link Analysis for Detecting Complex Behavior; 2003 August 27; Washington, DC.
Goldenberg A, Kubica J, & Komerak P. A comparison of statistical and machine learning
algorithms on the task of link completion. Proceedings of Workshop on Link Analysis
for Detecting Complex Behavior; 2003 August 27. Washington, DC.
Hanneman R. Introduction to social network methods. Univ of California, Riverside, 2001.
Jensen D. Statistical challenges to inductive inference in linked data. Preliminary papers of
the 7th International Workshop on Artificial Intelligence and Statistics; 1999 Jan 4 – 6; Fort Lauderdale, FL.
Kempe D, Kleinberg J, & Tardos E. Maximizing the spread of influence through a social
network. Proceedings of The Ninth ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining; 2003 August 24 – 27; Washington, DC. ACM Press,
2003.
Kleinberg J. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999,
46,5:604-632.
Kubica J, Moore A, Schneider J, & Yang Y. Stochastic link and group detection. Proceedings
of The Eighteenth National Conference on Artificial Intelligence; 2002 July 28 – Aug 1; Edmonton, Alberta, Canada. AAAI Press, 2002.
Kubica J, Moore A, Cohn D, & Schneider J. cGraph: A fast graph-based method for link anal-
ysis and queries. Proceedings of Text-Mining & Link-Analysis Workshop; 2003 August
9; Acapulco, Mexico.
Lu Q & Getoor L. Link-based text classification. Proceedings of Text-Mining & Link-
Analysis Workshop; 2003 August 9; Acapulco, Mexico.
Neville J & Jensen D. Iterative classification in relational data. Proceedings of AAAI-2000
Workshop on Learning Statistical Models from Relational Data; 2000 August 3; Austin,
TX. AAAI Press, 2000.
Neville J, Adler M, & Jensen D. Clustering relational data using attribute and link informa-
tion. Proceedings of Text-Mining & Link-Analysis Workshop; 2003 August 9; Acapulco,
Mexico.

Oh H, Myaeng S, & Lee M. A practical hypertext categorization method using links and
incrementally available class information. Proceedings of the 23rd ACM International
Conference on Research and Development in Information Retrieval (SIGIR-00); 2000
July; Athens, Greece.
Page L, Brin S, Motwani R, & Winograd T. The PageRank Citation Ranking: Bringing Order
to the Web. Stanford Digital Library Technologies Project. 1998.
Richardson M & Domingos P. Mining knowledge-sharing sites for viral marketing. Pro-
ceedings of Eighth International Conference on Knowledge Discovery and Data Mining;
2002 July 28 – Aug 1; Edmonton, Alberta, Canada. ACM Press, 2002.
Rokach, L., Averbuch, M., and Maimon, O., Information retrieval system for medical narra-
tive reports. Lecture notes in artificial intelligence, 3055. pp. 217-228, Springer-Verlag
(2004).
Rokach L. and Maimon O., Data mining for improving the quality of manufacturing: A
feature set decomposition approach. Journal of Intelligent Manufacturing 17(3): 285 – 299,
2006.
Ruspini E, Thomere J, & Wolverton M. Database-editing metrics for pattern matching. SRI International, March 2003.
Senator T, Goldberg H, Wooton J, Cottini A, Umar A, Klinger C, et al. The FinCEN Artificial
Intelligence System: Identifying Potential Money Laundering from Reports of Large
Cash Transactions. Proceedings of the 7th Conference on Innovative Applications of AI;
1995 August 21 – 23; Montreal, Quebec, Canada. AAAI Press, 1995.
Slattery S & Craven M. Combining statistical and relational methods for learning in hyper-
text domains. Proceedings of ILP-98, 8th International Conference on Inductive Logic
Programming; 1998 July 22 – 24; Madison, WI. Springer Verlag, 1998.
Taskar B, Segal E, & Koller D. Probabilistic clustering in relational data. Proceedings of
Seventeenth International Joint Conference on Artificial Intelligence; 2001 August 4 – 10; Seattle, Washington.
Wasserman S & Faust K, Social Network Analysis. Cambridge University Press, 1994.
Part IV
Soft Computing Methods
