Adaptive Dual Control of
Topic-Based Information Retrieval
A dissertation submitted in partial fulfilment of the requirements
for the degree of Doctor of Philosophy
by
Vitaliy Vitsentiy
M.Sc. Ternopil Academy of National Economy, Ukraine
School of Information Technology
Faculty of Science and Technology
Queensland University of Technology
Brisbane, Australia
2009
Keywords
topic-based information retrieval, dual control, stochastic programming
Abstract
Information Retrieval (IR) is an important, albeit imperfect, component of
information technologies. Insufficient diversity of retrieved documents is one of
the primary problems studied in this research. This study shows that this
problem leads to a decrease in precision and recall, the traditional measures of
information retrieval effectiveness.
This thesis presents an adaptive IR system based on the theory of adaptive
dual control. The aim of the approach is the optimization of retrieval precision after
all feedback has been issued. This is done by increasing the diversity of retrieved
documents. This study shows that the value of recall reflects this diversity.
The Probability Ranking Principle is viewed in the literature as the “bedrock”
of current probabilistic Information Retrieval theory. Neither the proposed approach
nor other methods of diversification of retrieved documents from the literature
conform to this principle. This study shows by counterexample that the Probability
Ranking Principle does not in general lead to optimal precision in a search session
with feedback (for which it may not have been designed but is actively used).
Retrieval precision of the search session should be optimized with a
multistage stochastic programming model to accomplish the aim. However, such
models are computationally intractable. Therefore, approximate linear multistage
stochastic programming models are derived in this study, where the multistage
improvement of the probability distribution is modelled using the proposed feedback
correctness method. The proposed optimization models are based on several
assumptions, starting with the assumption that Information Retrieval is conducted in
units of topics.
The use of clusters is the primary reason why a new method of probability
estimation is proposed.
The adaptive dual topic-based IR (ADTIR) system was evaluated in a series
of experiments conducted on the Reuters, Wikipedia and TREC document
collections. The Wikipedia experiment revealed that the dual control feedback
mechanism improves precision and S-recall when all the underlying assumptions are
satisfied. In the TREC experiment, this feedback mechanism was compared to a
state-of-the-art adaptive IR system based on BM-25 term weighting and the Rocchio
relevance feedback algorithm. The baseline system exhibited better effectiveness
than the cluster-based optimization model of ADTIR. The main reason was the
insufficient quality of the clusters generated for the TREC collection, which
violated the underlying assumption.
Table of Contents
Keywords
Abstract
Table of Contents
List of Tables
List of Figures
List of Acronyms and Abbreviations
Basic Mathematical Notation
Statement of Original Authorship
Acknowledgements
I. Introduction
    1. Motivation
        Importance of IR
        Empirical Evidence of Problems in IR
        Uncertainty as the Main Cause of the Problems
        Feedback in IR and its Problems
        Summary
    2. Definition of Adaptive Dual Topic-Based IR
        Adaptive Dual IR
        Topic-Based IR
        The Proposed Vision of IR
        Summary
    3. A Counterexample to PRP
        Relevance for a Minority User
        Expected Relevance across all Users
        Summary
II. Design of the Research
    1. Methodology of the Research
        Guidelines
        Hypotheses of the Research
        Contributions
        Summary
    2. Taxonomy of the Research Problem
        Information Retrieval
        Adaptive Dual Control and Stochastic Programming
        Theory of Algorithms
        Artificial Intelligence
        Machine Learning
        Summary
    3. Outline of the Further Narrative
III. Review of Probabilistic and Topic-Based IR
    1. Probabilistic Approaches to IR
        Probability Ranking Principle
        Probabilistic Models
        Language models
        Summary
    2. Topic-Based IR
        Latent Semantic Analysis
        Cluster model
        Probabilistic Latent Semantic Analysis
        Latent Dirichlet Allocation
        Summary
    3. Feedback in Adaptive IR
        Feedback for Vector Space Models
        Feedback for Probabilistic Models
        Feedback for Language Models
        Feedback for LSA model
        Summary
IV. Review of Uncertainty-Related Methods
    1. Problems of Uncertainty and Diversity in IR
        Uncertainty in IR
        Diversity in IR
        Evaluation of Diversity in IR
        Summary
    2. Approaches to Tackle Uncertainty and Diversity in IR
        Diversity Stimulation
        Multicriterion Matching Scores
        Active Learning
        Reinforcement Learning
        Summary
    3. Adaptive Dual Control and Stochastic Programming
        Adaptive Dual Control
        Direct Methods
        Indirect Methods
        Stochastic Programming
        Summary
V. Relevance Estimation
    1. Probability Estimation
        Modelling Probabilities Based on Searched Features
        The Language Modelling Approach
        The Document Sampling Approach
        Probabilistic User-based Model
        Summary
    2. Expected Relevance
        The General Approach
        Smoothing
        Learning Topic-Relevance and Bias Coefficients
        Feedback
        Summary
VI. Decision Optimization
    1. Two-stage Stochastic Program
        Optimization in Space of Documents
        Optimization in the Space of Clusters
        Approximate Formulation
        Relaxed Approximate Formulation
        Linear Approximate Formulation
        Summary
    2. Multistage Stochastic Program
        Feedback Correctness Approach
        Linear Equivalent
        Summary
VII. Experiments
    1. Experimental Design
        The Goals
        Document Collections
        Queries and Relevance Judgments
        Baseline Systems
        ADTIR Systems
        Evaluation Measures
        Reuters experiment
        Software
        Summary
    2. Results and Discussion
        Reuters Experiment
        Wikipedia Experiment
        TREC Experiment
        Summary
VIII. Conclusions
        Addressed Problems
        Findings
        Further Research
A. Derivation of Expected Relevance in the Example
B. Retrieved Documents for Query “Apple”
C. Query Set for Reuters Experiment
D. Candidate User-based Model Functions
E. An Example of Program Output
Bibliography
List of Tables
Table 1. Estimated relevance of one document
Table 2. Actual relevance of one document
Table 3. PRP-based approach. Case of the user’s given relevant topic
Table 4. Not-PRP-based approach. Case of the user’s given relevant topic
Table 5. Comparison of the results for the user’s given relevant topic
Table 6. Second page. PRP-based approach
Table 7. Second page. Non-PRP-based approach
Table 8. Document Collections
Table 9. Parameters of the TREC’s experiment baseline system
Table 10. Parameters of the ADTIR systems
Table 11. Topic-relevance and bias coefficients
Table 12. Bias coefficients only
Table 13. Retrieval effectiveness in Wikipedia experiment
Table 14. Retrieval effectiveness in TREC experiment
Table 15. PRP-based approach
Table 16. Not-PRP-based approach
List of Figures
Figure 1. Dynamics of adaptive IR with feedback
Figure 2. IR system as a control system
Figure 3. Adaptive Dual Information Retrieval System as a control system
Figure 4. Algorithm of ADTIR
Figure 5. Comparison of the results for an average user
Figure 6. Graphical model representation of PLSA
Figure 7. Graphical model representation of LDA
Figure 8. Precision in Wikipedia experiment
Figure 9. S-Recall in Wikipedia experiment
Figure 10. Precision on iterations in Wikipedia experiment
Figure 11. S-Recall on iterations in Wikipedia experiment
Figure 12. Precision in TREC experiment
Figure 13. Recall in TREC experiment
Figure 14. Precision on iterations in TREC experiment
Figure 15. Recall on iterations in TREC experiment
List of Acronyms and Abbreviations
ADIR     Adaptive Dual Information Retrieval
ADTIR    Adaptive Dual Topic-Based Information Retrieval
DB       Database
EM       Expectation Maximization
IN       Information Need
IR       Information Retrieval
KL       Kullback-Leibler
LSA      Latent Semantic Analysis
LSI      Latent Semantic Indexing
MMR      Maximal Marginal Relevance
PDF      Probability Density Function
PMF      Probability Mass Function
PRP      Probability Ranking Principle
RL       Reinforcement Learning
SVD      Singular Value Decomposition
SVM      Support Vector Machine
Basic Mathematical Notation
Notation from chapters III and IV is not included; see these chapters for the
description.
$x^g$    document’s feature provided in the user’s query;
$x^b$    document’s feature not provided in the user’s query;
$p(z \mid q)$    probability that topic $z$ is searched given query $q$;
$p(z \mid x^g)$    probability that topic $z$ is searched given $x^g$;
$p(z \mid x^g, x^b)$    probability that topic $z$ is searched given both features $x^g$ and $x^b$;
$p(z \mid q, f)$    probability that topic $z$ is searched given query $q$ and feedback $f$;
$p(z \mid q, z^f)$    probability that topic $z$ is searched given query $q$ and relevant topic $z^f$ given in feedback;
$p^f(z \mid q)$    derived from the user’s feedback, probability that topic $z$ is searched given query $q$;
$p(x^g, x^b)$    probability of a document with features $x^g$ and $x^b$;
$p(x_i \mid q)$    probability that document $x_i$ is searched given query $q$;
$p(x^g \mid q)$    probability that the searched feature has value $x^g$ given query $q$;
$p_t(x_j \mid q, f)$    probability at iteration $t$ that document $x_j$ is searched given query $q$ and feedback $f$;
$\Omega_z$    domain of documents that belong to topic $z$ in the space of features;
$\Omega_z^g$    domain of documents with feature $x^g$ that belong to topic $z$ in the space of features;
$n$    number of documents;
$n_z$    number of documents in cluster $z$;
$n_z^g$    number of documents with feature $x^g$ in cluster $z$;
$x_i$    document;
$x_i^g$    value of feature $x^g$ in document $x_i$;
$tf_{kj}$    frequency of term $t_k$ in document $x_j$;
$s_{kj}$    number of words of term $t_k$ in document $x_j$;
$p(x \mid t)$    probability of document $x$ given term $t$;
$f_v(x_i^g)$    a component of function of $x_i^g$ with coefficient $b_v$ that models selection of query terms by the user;
$\Gamma_{zz'}$    topic-relevance between topics $z$ and $z'$;
$r_{lz}$    relevance of topic $z$ for query $q_l$ judged by the user;
$r_z^f$    given in feedback, relevance of topic $z$ for the query judged by the user;
$a_{zz'}$    topic-relevance between topics $z$ and $z'$ multiplied by the sample bias coefficient and used in the optimization algorithm, usually is found from the learning query set;
$b_{ij}$    topic-relevance between documents $x_i$ and $x_j$;
$\{R_z \mid q\}$    expected relevance of topic $z$ given query $q$;
$k_z$    coefficient to correct sample bias of topic $z$;
$T$    set of topics;
$F$    set of feedback values;
$S$    set of iterations;
$D$    set of the documents;
$N$    set of natural numbers;
$L_t$    $= \{z \in T \mid \forall \tau < t \; h_{z\tau} = 0\}$, the set of topics not retrieved up to iteration $t$;
$R_t$    $= \{z \in T \mid \exists \tau < t \; h_{z\tau} > 0\}$, the set of topics retrieved before iteration $t$;
$m$    number of documents in a page;
$n_t$    number of topics;
$h'_i$    number of documents $x_i$ to retrieve, $h'_i \in \{0, 1\}$;
$y'^{f}_{it}$    number of documents $x_i$ to retrieve on iteration $t$ given feedback $f$;
$p(f \mid q, x_j)$    probability of feedback $f$ given query $q$ and document $x_j$;
$h_z$    number of documents to retrieve from cluster $z$;
$y_{zt}^f$    number of documents to retrieve from cluster $z$ on iteration $t$ given feedback $f$;
$p(f \mid q)$    probability of feedback $f$ for the given query $q$;
$h_{zt}$    number of documents to retrieve from cluster $z$ on iteration $t$;
$g_{zt}^{z'}, y_{zt}^{z'}$    number of documents to retrieve from cluster $z$ if according to feedback topic of cluster $z'$ is relevant;
$\rho_u$    probability of correct feedback on a topic if number of retrieved pages with a document on a topic is $u$;
$c_u$    a combination of numbers of retrieved documents on the iterations up to $u$.
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet
requirements for an award at this or any other higher education institution. To the
best of my knowledge and belief, the thesis contains no material previously
published or written by another person except where due reference is made.
Signed:............................................Date.......................
Acknowledgements
We sincerely thank the supervisor of this research, Prof. Peter Bruza, for his
continuous support throughout the dissertation work, and especially at those times
when the preliminary research results were difficult to convey to other people.
We also thank the associate supervisor, Prof. Amanda Spink, for her encouragement
to work on this research subject at QUT. We are grateful to Prof. Anatoly Sachenko
and Prof. George Markowsky, with whom we worked previously on this research
subject. Furthermore, we are indebted to our proofreaders.
This research has been supported by NICTA and QUT with a PhD
scholarship, equipment and an excellent working environment.