AUTOMATIC TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK





Yueyu Fu





Submitted to the faculty of the University Graduate School
in partial fulfillment of the requirements
for the degree
Doctor of Philosophy
in the School of Library and Information Science,
Indiana University
October 2006






Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the
requirements for the degree of Doctor of Philosophy.




Doctoral Committee


Date of Oral Examination
(August 2nd, 2006)


Javed Mostafa, Ph.D.




Charles Davis, Ph.D.





Kiduk Yang, Ph.D.




David Leake, Ph.D. (Computer Science, minor)



© 2006
Yueyu Fu
ALL RIGHTS RESERVED














DEDICATION










To my beloved parents, Guanghui Fu and Lan Chen, my dear wife, Wenjie Sun, and my
grandparents, for their unconditional love and encouragement.

ACKNOWLEDGMENTS



I am deeply grateful to the many people who generously provided me with the guidance,
support, and encouragement needed to complete this dissertation.

First and foremost, I would like to thank Dr. Javed Mostafa, my committee chair, for his
professional and personal guidance that goes far beyond his responsibilities. It is his patient
guidance, sharp mind, and gentle encouragement that led me to the achievement I have
today.

Special thanks also go to the rest of my committee, Dr. Charles Davis, Dr. Kiduk Yang, and
Dr. David Leake, for their insightful comments and enduring support during the entire
process of my dissertation research.

I would like to thank my colleagues and the staff at Indiana University, especially Weimao
Ke, Kazuhiro Seki, Mary Kennedy, Arlene Merkel, Erica Bodnar, and Rhonda Spencer, for
their kind help and support throughout all these memorable years in Bloomington.


Finally, I must express my deepest gratitude to my parents, Guanghui Fu and Lan Chen, for
opening my eyes to the world and encouraging me to pursue my career abroad, and to my
beloved wife, Wenjie Sun, for making our family full of joy, support, and understanding.



ABSTRACT


Automatic text classification is an important operational problem in information systems.
Most automatic text classification efforts have so far concentrated on developing
centralized solutions. However, centralized classification approaches are often limited by
constraints on knowledge and computing resources. To overcome these limitations, an
alternative distributed approach based on a multi-agent framework is proposed. Three
major challenges associated with distributed text classification are examined:
1) coordinating classification activities in a distributed environment, 2) achieving
high-quality classification, and 3) minimizing communication overhead. This study
presents solutions to these specific challenges and describes a prototype system
implementation. As agent coordination is the key component in conducting multi-agent
text classification, two agent coordination protocols, namely the blackboard-bidding
protocol and the adaptive-blackboard protocol, are proposed in the study. To analyze the
performance of the distributed approach, a comparative evaluation methodology is
described that treats the outcome of a centralized approach as the baseline performance.
A series of experiments was conducted in a simulation environment, which permitted
manipulation of independent variables such as scalability and coordination strategy, and
investigation of their impact on two critical dependent variables, namely efficiency and
effectiveness. There were three critical findings. First, in dealing with automatic text
classification, the multi-agent approach can achieve improved system efficiency while
maintaining classification effectiveness comparable to a centralized approach. Second,
the agent protocols were effective in coordinating the text classification activities of
distributed agents. Third, the application of content-based adaptive learning for acquiring
knowledge about the agent community reduced communication cost and improved system
efficiency.





TABLE OF CONTENTS
1 INTRODUCTION 1
1.1 MANUAL CLASSIFICATION 1
1.2 AUTOMATIC CLASSIFICATION 2
1.3 MULTI-AGENT PARADIGM 5
2 PROBLEM STATEMENT 7
2.1 SPECIFIC CHALLENGES 8
2.2 VARIABLES 10
2.3 IMPLICATIONS OF THIS RESEARCH 16
3 LITERATURE REVIEW 18
3.1 AUTOMATIC TEXT CLASSIFICATION 18
3.1.1 Text classification task 19
3.1.2 Text classification methods 20
3.1.3 Evaluation metrics for text classification 24
3.1.4 Test Collections 26
3.1.5 Centralized Text Classification Procedure 27
3.2 TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK 29
3.2.1 Multi-agent paradigm 29
3.2.2 Differences between multi-agent systems and other concurrent systems 29

3.2.3 Connections between multi-agent paradigm and peer-to-peer paradigm 31
3.2.4 Recent applications of the multi-agent paradigm 34
3.2.5 Centralized vs. Multi-agent text classification 35
3.3 MULTI-AGENT COORDINATION PROTOCOLS 39
3.3.1 Definition of coordination 40
3.3.2 Coordination Protocols 41
3.3.2.1 Organizational Structuring 41
3.3.2.2 Multi-agent planning 43
3.3.2.3 Contract net protocol 44
3.3.2.4 Negotiation 45
4 METHODOLOGY 50
4.1 DATA 50
4.2 DESIGN METHODOLOGY 51
4.2.1 Multi-Agent Community for Text Classification 51
4.2.2 Classification Module 53
4.2.3 Algorithms of Agent Coordination Protocols 55
4.2.4 Proposed Agent Coordination Protocols 59
4.2.4.1 Blackboard-bidding Protocol 59
4.2.4.2 Adaptive-blackboard Protocol 61
4.3 IMPLEMENTATION 65
4.3.1 System Architecture 65
4.3.2 Alternative approach 67
4.4 EVALUATION METHODOLOGY 67
4.4.1 Measurements 67

4.4.1.1 Effectiveness Measurements 67
4.4.1.2 Efficiency Measurements 68
4.4.2 Variables 70
4.4.2.1 Centralized vs. Distributed 70

4.4.2.2 Coordination Protocols 71
4.4.2.3 Number of Agents 71
4.4.3 Experimental Settings 72
5 RESULTS 72
5.1 CENTRALIZED VS. DISTRIBUTED 72
5.2 COORDINATION PROTOCOLS 75
5.2.1 Effectiveness 76
5.2.2 Efficiency Measured by Messages 79
5.2.3 Efficiency Measured by Time 82
5.3 NUMBER OF AGENTS 85
5.3.1 Impact of the number of agents on effectiveness 86
5.3.2 Impact of the number of agents on efficiency 89
6 CONCLUSIONS 91
6.1 SUMMARY 91
6.2 FUTURE RESEARCH 95
REFERENCES 97


1 Introduction
Automatic text classification is an important operational problem in information systems.
Many tasks, such as retrieval, filtering, and indexing, in information systems can be
considered as classification problems. Most text classification efforts have so far
concentrated on developing centralized solutions, where data and computation are located
on a single computer. However, centralized classification approaches are often limited by
constraints on knowledge and computing resources. In addition, centralized approaches are
more vulnerable to attacks and system failures, and less robust in dealing with them. This
research presents an alternative classification approach, called distributed text
classification using a multi-agent framework, where data and computation are distributed
across a network of computers.

1.1 Manual Classification
In library and information science, class/classification and category/categorization are
sometimes considered as distinct terms (Jacob, 2004). Although they are both used to
organize related entities, these two terms have a fundamental difference. Classification
groups entities into mutually exclusive classes based on a set of predefined rules
regardless of the context, whereas categorization associates entities solely based on their
similarities within a given context (Jacob, 2004). This distinction makes categorization
more flexible than classification in organizing similar entities. However, to reach a
broader audience, this study uses class/classification and category/categorization
interchangeably.


One approach to text classification is manual classification, which involves human
experts manually classifying documents based on classification rules and subjective
judgment. This approach has been used in library practice for many years to organize,
index, and retrieve documents. Human experts typically assign each book a code
representing a category according to a set of classification schemes, such as the Dewey
Decimal Classification, the Universal Decimal Classification, and the Library of
Congress Classification. A recent application of this approach on the web is the Yahoo
Directory, which organizes web pages into a hierarchical structure.

The main challenge of manual classification is its demand on resources. Manual
classification is a time-consuming process that relies heavily on domain knowledge. It
requires significant investment of time from many human experts with knowledge of
different domains. Another associated problem is that subjective judgments can generate
inconsistent classification results. Because of these limitations, manual classification
works best for relatively small document collections.
1.2 Automatic Classification
To address the problems of manual classification, researchers have explored automatic

text classification as an alternative approach. Using machine learning techniques,
automatic text classification assigns documents to a set of pre-defined categories. This
approach has been applied in many areas, such as patent classification, news delivery,
and email spam filtering. In contrast to manual classification, automatic classification
offers the advantages of automation, efficiency, and consistency.


In automatic classification, documents are typically classified by a single classification
software system running on a single machine. This is also called centralized text
classification. Significant efforts have focused on developing document classifiers in a
centralized manner and various classification algorithms have been developed to improve
the performance of centralized classification systems. The advantages of centralized
classification stem from the centralized architecture. Because data and computing
resources are located in the same place, the management of the classification task is easy
and the classification speed is fast. Since the communication in centralized classification
takes place in the same machine, the communication cost is relatively small.

However, as information becomes more distributed and its volume increases
exponentially, several critical disadvantages of centralized classification are revealed.
The effectiveness of a classification system is mostly determined by the artificial
knowledge¹ maintained by the system, which typically comes from training data.
Currently, centralized classification systems suffer from the problem of scarcity of local
knowledge². The extent of local knowledge is limited by the cost and constraints of
storing complete knowledge in a single place, and it is sometimes impossible to collect
all the necessary knowledge and store it in a central location. Since the classification
system can successfully classify only documents that are within the scope of its limited
local knowledge, it is likely to fail when the expansion of its domain (i.e., local
knowledge) does not keep up with growing diversity in knowledge. Another disadvantage
is that, due to its centralized architecture, a centralized classification system has only a
certain amount of computing power and input/output capacity. When an information
system has to handle a large number of documents, the classification component may
become a performance bottleneck and suffer from the problem of single point of failure.

¹ Knowledge learned from training documents using machine learning techniques.
² Knowledge maintained by a single classification system.

Distributed text classification, which is an alternative approach to automatic centralized
classification, employs a de-centralized architecture for organizing knowledge and
computing resources. This approach allows multiple classification software systems to
collaborate with each other to fulfill the classification task in a distributed computing
environment. Distributed classification has several advantages over centralized
classification. The distributed architecture offers computational scalability for
classification. Mukhopadhyay et al. (2005) demonstrate that classification time decreases
dramatically with the increasing number of collaborating classification software systems.
Also, not relying completely on a single classification software system allows the overall
system to avoid the problem of single point of failure. When one of the classification
software systems fails, its tasks can be carried out by an alternative classification software
system. Lastly, distributed classification fits the web model better.
The Internet is a distributed system and it offers the opportunity to take advantage of
distributed computing paradigms and distributed knowledge resources for classification.


However, distributed classification has some disadvantages. Unlike centralized
classification, distributed classification, which consists of multiple independent
classification software systems, does not have global control of all the classification
activities. Such global control is essential for achieving coherent system performance.
5

Without global control, distributed classification activities can produce conflicting and
inconsistent results. Therefore, an alternative mechanism for coordinating distributed
classification activities is needed. Another limitation of distributed classification is the
large communication overhead. In order for classification software systems to collaborate,
they must communicate with each other and the amount of exchanged information can be
very large. For example, Mukhopadhyay et al. (2003) show that the average response
time for classification increases almost linearly with the number of classification software
systems, which is the direct result of increasing communication overhead. They also
show that the classification performance quickly saturates with an increasing number of
classification software systems. This latter result points to the potential of improving
overall performance by reducing communication overhead.
1.3 Multi-Agent Paradigm
This research employs a multi-agent paradigm for conducting distributed text
classification. The multi-agent paradigm evolved from distributed artificial intelligence in
the late 1980s, where agents are considered autonomous, intelligent software programs.
An agent exhibits three major characteristics, namely reactivity, proactiveness, and social
ability (Wooldridge & Jennings, 1995). Reactivity refers to the capability of sensing the
changes in its environment and taking fast corresponding actions. Proactiveness refers to
the capability of operating in an active fashion according to its design goal. Social ability
refers to the capability of working with other agents. A group of such agents forms a
multi-agent system. Durfee and Montgomery (1989) define a multi-agent system (MAS)
as “a loosely coupled network of problem solvers that work together to solve problems
that are beyond their individual capabilities.”
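The three characteristics can be sketched as a minimal agent interface; the class and method names below are illustrative assumptions for exposition, not the design of the prototype described later in this study.

```python
class Agent:
    """Minimal sketch of an agent with reactivity, proactiveness, and social ability."""

    def __init__(self, name, goal):
        self.name = name
        self.goal = goal    # design goal driving proactive behavior
        self.inbox = []     # messages received from other agents (social ability)

    def perceive(self, event):
        # Reactivity: sense a change in the environment and respond quickly.
        return f"{self.name} reacts to {event}"

    def act(self):
        # Proactiveness: take initiative in pursuit of the design goal.
        return f"{self.name} works toward {self.goal}"

    def send(self, other, message):
        # Social ability: communicate with other agents.
        other.inbox.append((self.name, message))

# A group of such agents forms a multi-agent system.
a, b = Agent("A", "classify news"), Agent("B", "classify patents")
a.send(b, "can you classify this document?")
```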



For text classification, a multi-agent paradigm offers several critical advantages.
According to Sycara (1998), the multi-agent paradigm distributes computing resources
and capabilities across a network of agents, which can avoid the single point of failure
problem. The modular, scalable architecture of the multi-agent paradigm provides
computational scalability and flexibility for agents entering and leaving agent
communities. The multi-agent paradigm can also make efficient use of spatially
distributed information resources and serve as a solution when expertise is distributed.
Because of these advantages, the multi-agent paradigm has been utilized in the design of
information retrieval systems and information management systems. However, the
applicability of the multi-agent paradigm in text classification has not been thoroughly
examined yet.

Agent coordination is a critical component of the multi-agent paradigm. It determines the
relationship among the agents in a multi-agent environment and governs the behaviors of
the interacting agents. The overall system performance, including both quality and
efficiency, depends on the appropriate design of the coordination mechanism. Quality
measures the correctness of the system behavior, which is the collective result of the
coordinated agents’ behaviors. Efficiency measures the timeliness of the system process,
which is driven mainly by the communication among the coordinated agents. Due to its
importance in system performance optimization, agent coordination has been well studied
in various domains, such as transportation, economics, and management. As the
multi-agent paradigm is applied to text classification, agent coordination will be the focus
of this research.

Evaluation of system performance is an essential aspect of the multi-agent
implementation plan. As the overall performance of a multi-agent system is a collective
result of multiple agents’ behaviors, the result alone may not be directly interpretable.
The evaluation framework typically reflects system performance at different levels,
including the agent level and the overall system level. The evaluation metrics also cover
different aspects of system performance, including effectiveness and efficiency.
Integrating the evaluation metrics of text classification with the multi-agent paradigm
may provide a powerful tool to validate the approach of automatic text classification
using a multi-agent framework.
2 Problem Statement
The primary purpose of this study is to investigate automatic text classification using a
multi-agent framework. Automatic text classification and the multi-agent paradigm
have each been extensively studied over the years. Although problems within
each area have been investigated, new problems that arise with the introduction of the
multi-agent paradigm into automatic text classification remain mostly unexplored. In this
section, three major challenges associated with distributed text classification will be
examined and key variables related to these challenges will be discussed.

2.1 Specific Challenges
Distributed text classification is different from centralized text classification because of
its distributed architecture. One of the main challenges in distributed text classification is
coordinating classification activities in a distributed environment. Unlike centralized
classification relying on a mediator to ensure the coherence of the overall system
performance, distributed classification lacks centralized control, and thus may produce
conflicting and inconsistent classification results. Consequently, an effective mechanism
of coordinating distributed classification activities is greatly needed. In the multi-agent
paradigm, such mechanisms (e.g., agent coordination protocols) have been extensively
studied. The agent coordination research has drawn on various domains including
artificial intelligence, social science, game theory, and economics. Many agent
coordination protocols, such as blackboard and contracting, have been explored in those
domains. Although these agent coordination protocols have been successfully applied in

many domains, they have not been seriously studied in information science, particularly
for text classification. The question is whether these agent coordination protocols will
work well for the classification task. Different coordination protocols will be explored for
designing suitable coordination mechanisms for text classification.
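To make the blackboard idea concrete, the sketch below shows agents claiming classification tasks from a shared blackboard and reposting tasks that fall outside their local knowledge. This is an illustrative simplification with invented names and a keyword-matching stand-in for a real classifier; it is not the blackboard-bidding or adaptive-blackboard protocol proposed later in this study.

```python
from collections import deque

class Blackboard:
    """Shared task space: agents post and claim classification tasks."""
    def __init__(self):
        self.tasks = deque()
        self.results = {}

    def post(self, doc_id, text):
        self.tasks.append((doc_id, text))

    def claim(self):
        return self.tasks.popleft() if self.tasks else None

def agent_loop(name, known_labels, board):
    # Each agent claims tasks and classifies only what its local knowledge covers.
    while (task := board.claim()) is not None:
        doc_id, text = task
        label = next((l for l in known_labels if l in text), None)
        if label:
            board.results[doc_id] = (name, label)
        else:
            board.post(doc_id, text)  # return the task for another agent
            break                     # avoid spinning on its own reposted task

board = Blackboard()
board.post(1, "a story about sports")
board.post(2, "news on politics")
agent_loop("agent-sports", ["sports"], board)
agent_loop("agent-politics", ["politics"], board)
```

No agent sees the whole collection, yet every task is eventually handled by an agent whose local knowledge covers it; this is the coherence that a coordination mechanism must provide without central control.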

Another challenge in distributed classification is achieving high quality of classification
in multi-agent environments. In automatic centralized classification, quality of
classification is mainly influenced by the quality of the test collection and the
classification algorithm. Most efforts so far have concentrated on developing new
methods to improve the classification performance. Several classification methods, such

as Support Vector Machines and k-Nearest Neighbor, have been applied in centralized
environments. In multi-agent environments, the classification task is distributed across a
network of classification agents. The classification process involves the actual
classification conducted by individual agents, the interactions among agents, and the
merging of individual classification results. Whether those well-established classification
methods are applicable in multi-agent environments has to be examined. To validate the
performance of these classification methods in multi-agent environments and identify the
suitable classification methods for distributed classification, a thorough evaluation of
quality of distributed classification needs to be conducted. The evaluation may cover the
comparison among different classification methods and the comparison between
distributed classification and centralized classification. The result from such an
evaluation may tell us whether certain distributed classification approaches can
achieve satisfactory quality of classification.
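As a reminder of how one of the methods named above works in a centralized setting, here is a minimal pure-Python sketch of k-Nearest Neighbor text classification over bag-of-words vectors with cosine similarity. The function names and toy training data are invented for illustration, not taken from the study's prototype.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc: str, training, k: int = 3) -> str:
    # Rank training examples by similarity, then take a majority vote of the top k.
    vec = Counter(doc.lower().split())
    ranked = sorted(training,
                    key=lambda ex: cosine(vec, Counter(ex[0].lower().split())),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    ("the team won the match", "sports"),
    ("final score of the game", "sports"),
    ("parliament passed the bill", "politics"),
    ("the election results", "politics"),
]
predicted = knn_classify("who won the game", training, k=3)  # → "sports"
```

In a multi-agent setting, each agent would run such a classifier over only its local training data, which is exactly why the merging and coordination questions above arise.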

Minimizing communication overhead in distributed classification without compromising
quality of classification is yet another challenge. Communication is a key issue in
distributed classification, where agents exchange information, interact with each other,
and work together through the means of communication to achieve satisfactory quality of

classification. In such an environment, the amount of communication greatly affects
system efficiency. Consequently, an appropriate agent coordination protocol that governs
the agents’ communication behavior in an effective and efficient manner may ensure high
quality of classification and reduce communication overhead. The key objective of the
agent coordination protocol is to balance between quality of classification and system

efficiency. To achieve such a balance, an evaluation procedure has to be established to
measure quality of classification and system efficiency in multi-agent environments.
However, there is no standard evaluation framework to fulfill this goal. An evaluation
framework for measuring system efficiency needs to be established. To summarize, the
three main challenges are: 1) Coordinating classification activities in a distributed
environment, 2) Achieving high quality of classification in multi-agent environments, and
3) Minimizing communication overhead in distributed classification without
compromising quality of classification.
2.2 Variables
This research will be conducted using an experimental study design. The study will
explore the applicability of different classification methods in multi-agent environments.
The result of this exploration will help researchers to choose appropriate classification
methods in distributed computing environments and design new methods for distributed
classification. The primary focus of this study will be to investigate the coordination of
distributed classification activities in multi-agent environments. A comparative study of
different coordination protocols for multi-agent classification will help in identifying the
best coordination protocol, which can achieve satisfactory classification performance
with acceptable communication overhead. A comparative study between centralized
classification and distributed classification will also be conducted to evaluate the
performance of the distributed classification approach. The evaluation framework will
draw on the centralized classification research and new approaches will be developed that
are uniquely suitable for distributed classification environments. To carry out this study


and address the three challenges discussed above, three variables will be studied: quality
of classification, system efficiency, and agent granularity.

One of the main goals is to achieve satisfactory classification performance in a multi-
agent environment. Therefore, quality of classification must be taken into consideration
throughout the study. The quality of classification refers to the accuracy of a completed
classification task. In contrast to centralized text classification, quality of classification in
a multi-agent context is determined not only by the performances of individual classifiers
but also by the agent coordination protocol. Researchers in information retrieval and
machine learning communities have tested various effectiveness measurements for
classification tasks. Lewis (1995) demonstrated using different families of single
effectiveness measures to estimate and optimize the performances of classification
systems. Joachims (2001) summarized the most commonly used effectiveness measures
for evaluating text classification systems. In this study, precision, recall, and the F
measure have been chosen to measure quality of classification.

Figure 1.1. Precision


Figure 1.2. Recall
A study by Mukhopadhyay et al. (2005) demonstrates how these evaluation measures can
be applied in a multi-agent environment. The study, which shows that as the number of
agents increases, precision drops (see Figure 1.1) while recall increases (see Figure 1.2),
proposes that quality of classification must be evaluated at both the system level and the
individual agent level. In addition to the overall performance evaluation at the system
level, the classification performance of individual agents helps in understanding each
agent’s behavior and relationship with other agents. In the study, the quality of
classification at the system level is calculated by averaging the corresponding
measurement scores across all agents. For each category, classification decisions can be
represented as a contingency table as follows:

                        Expert decision: Yes    Expert decision: No
  System decision: Yes           TP                      FP
  System decision: No            FN                      TN

Table 1: Contingency table

Based on the contingency table, recall and precision are defined as follows:

    Recall = TP / (TP + FN),    Precision = TP / (TP + FP)
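These two formulas translate directly into code. The sketch below also includes the balanced F measure, F1 = 2PR / (P + R), which is the standard form of the F measure adopted alongside precision and recall; the example counts are invented for illustration.

```python
def precision(tp: int, fp: int) -> float:
    # Fraction of "System decision: Yes" items that the expert also marked Yes.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    # Fraction of "Expert decision: Yes" items that the system also marked Yes.
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

# Illustrative contingency counts for one category.
tp, fp, fn = 40, 10, 20
p, r = precision(tp, fp), recall(tp, fn)
f = f1(p, r)
# p = 40/50 = 0.8, r = 40/60 ≈ 0.667, f ≈ 0.727
```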

Typically, both micro-averaging and macro-averaging methods are applied to calculate
the average scores. In the micro-averaging method, precision and recall are computed
based on a “global” contingency table, which is the sum of individual contingency tables.

In the macro-averaging method, precision and recall are computed by averaging the
precision and recall scores of all categories (Sebastiani, 2002). The micro-averaging and
macro-averaging scores reflect the classification performance on different categories.
Yang and Liu (1999) note that micro-averaging gives equal weights to each item (e.g.,
document) and can be dominated by large (common) categories, whereas macro-
averaging gives equal weights to each category, so small (rare) categories can unduly
influence the score.
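The difference between the two averaging methods can be sketched as follows, using invented per-category counts: micro-averaging sums the contingency tables into a "global" table first, while macro-averaging averages the per-category scores.

```python
def micro_macro_precision(tables):
    """tables: list of (tp, fp) pairs, one per category."""
    # Micro: build a "global" contingency table by summing counts, then compute once.
    tp_sum = sum(tp for tp, _ in tables)
    fp_sum = sum(fp for _, fp in tables)
    micro = tp_sum / (tp_sum + fp_sum)
    # Macro: compute precision per category, then average the per-category scores.
    macro = sum(tp / (tp + fp) for tp, fp in tables) / len(tables)
    return micro, macro

# One large (common) category and one small (rare) category, illustrative counts.
micro, macro = micro_macro_precision([(90, 10), (1, 1)])
# micro = 91/102 ≈ 0.892 — dominated by the large category
# macro = (0.9 + 0.5) / 2 = 0.7 — the rare category pulls the score down
```

The example shows Yang and Liu's point numerically: the same counts yield a high micro-average but a much lower macro-average once the rare category's poor precision gets equal weight.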

In this study, the main goal is not only to achieve high quality of classification, but also
to attain acceptable system efficiency. The efficiency of a multi-agent system largely
depends on its communication overhead. Agent interaction, agent coordination, and the
agent environment all influence communication overhead in multi-agent systems.
Different agent coordination protocols produce different amounts of communication
overhead. This variable, efficiency, will help in identifying coordination protocols that
can achieve acceptable communication overhead. Efficiency here refers to the time spent
completing a classification task. A study by Mukhopadhyay et al. (2003) shows that as
the number of agents increases, the communication overhead increases almost linearly
(see Figure 2.1) while the classification performance quickly saturates

(see Figure 2.2). This result shows the possibility of achieving a satisfactory classification
performance with reduced communication overhead by interacting with fewer agents.
Efficiency can be measured at both the system level and the individual agent level.
Efficiency at the system level represents the time spent to classify all the documents in
the multi-agent environment. Efficiency at the agent level represents the time that an
individual agent spends to classify its own documents.


Figure 2.1. Average response time




Figure 2.2. Number of successful classifications

Agent granularity refers to the amount of knowledge possessed by an agent. Agent
granularity affects not only agents’ classification capabilities but also efficiency at the
system level. Each agent possesses a certain amount of knowledge, which is a
proportion of the complete global knowledge. In an extreme case, each agent has only the
knowledge of one class. When the total number of classes is fixed, the number of agents
decreases as the number of classes possessed by each agent increases. Theoretically, the
classification capability of each agent is enhanced as its number of classes increases,
because the probability of a document being classifiable by such an agent increases. Also,
with increased knowledge, the communication overhead decreases because there are
fewer agents and less need for coordination (see Figure 3.1 & Figure 3.2).
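The trade-off described here can be stated arithmetically. Assuming classes are split evenly among agents and documents fall uniformly across classes (both simplifying assumptions made for illustration, not claims from the study), fixing the total number of classes means coarser-grained agents are fewer in number and each can classify a larger share of documents:

```python
def granularity_tradeoff(total_classes: int, classes_per_agent: int):
    # Fewer agents as each agent's knowledge (classes held) grows.
    n_agents = total_classes // classes_per_agent
    # Probability that a uniformly drawn document falls within one agent's knowledge.
    p_classifiable = classes_per_agent / total_classes
    return n_agents, p_classifiable

# With 20 classes total, varying how many classes each agent holds:
results = {cpa: granularity_tradeoff(20, cpa) for cpa in (1, 5, 10)}
# {1: (20, 0.05), 5: (4, 0.25), 10: (2, 0.5)}
```

The extreme case in the text (one class per agent) gives the most agents and the lowest per-agent classification probability, hence the highest coordination load.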



Figure 3.1. Precision


Figure 3.2. Average response time
2.3 Implications of this Research
This research has been developed to investigate automatic text classification using a
multi-agent framework. Automatic text classification has not been seriously tested in