Cluster analysis and ontology generation techniques for the development of scholarly semantic web

CLUSTER ANALYSIS AND ONTOLOGY GENERATION
TECHNIQUES FOR THE DEVELOPMENT OF
SCHOLARLY SEMANTIC WEB

By
Quan Thanh Tho

SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT
SCHOOL OF COMPUTER ENGINEERING
NANYANG TECHNOLOGICAL UNIVERSITY
NANYANG AVENUE, SINGAPORE 639798
2005

Table of Contents

Table of Contents
List of Tables
List of Figures
Abstract
Acknowledgements

1 Introduction
  1.1 Scholarly Web Information
  1.2 Scholarly Information Retrieval
  1.3 The Semantic Web
  1.4 Semantic Web-based Retrieval Systems
  1.5 Objectives
  1.6 Major Contributions
  1.7 Organization of the Thesis

2 The Semantic Web
  2.1 Markup Languages
    2.1.1 Hypertext Markup Language
    2.1.2 Extensible Markup Language
    2.1.3 Resource Description Framework
  2.2 The Semantic Web
  2.3 Ontology
  2.4 Ontology Description Languages
    2.4.1 HTML-based Ontology Description Languages
    2.4.2 XML-based Ontology Description Languages
    2.4.3 RDF-based Ontology Description Languages
  2.5 Semantic Web Portals
    2.5.1 Semantic Web Portal Architecture
    2.5.2 Requirements on Semantic Web Portals
  2.6 Web Services and Semantic Web Services
  2.7 Summary

3 Context-based Cluster Analysis
  3.1 Clustering Methods
    3.1.1 Hierarchical Clustering Methods
    3.1.2 Partitioning Clustering Methods
    3.1.3 Other Clustering Methods
    3.1.4 Discussion
  3.2 Context-based Cluster Analysis
    3.2.1 Cross-Clustering Relation Generation
    3.2.2 Cross-Clustering Context Generation
  3.3 Performance Evaluation
    3.3.1 Experiment
    3.3.2 Evaluation Measures
    3.3.3 Experimental Results
  3.4 Summary

4 Expert and Expertise Finding
  4.1 Related Work
    4.1.1 Expertise Recommender Systems
    4.1.2 Web Mining for Finding Expertise
    4.1.3 Author Co-citation Analysis Approach
    4.1.4 Discussion
  4.2 CCA-based Expert Finding
  4.3 Document Clustering
    4.3.1 Feature Selection
    4.3.2 Pre-processing
    4.3.3 Transformation
    4.3.4 Document Clusters Generation
  4.4 Author Clustering
    4.4.1 Creating Author Co-Citation Pairs
    4.4.2 Creating Raw Co-Citation Matrix
    4.4.3 Converting into Correlation Matrix
    4.4.4 Generating Author Clusters
  4.5 Context-based Cluster Analysis
  4.6 Expert Information Generation
    4.6.1 Identifying Researchers’ Research Areas
    4.6.2 Ranking Expert
    4.6.3 Retrieving Expert Information
  4.7 Expert Retrieval and Visualization
    4.7.1 Expert Retrieval
    4.7.2 Expert Visualization
  4.8 Performance Evaluation
    4.8.1 Experiment
    4.8.2 Experimental Results
    4.8.3 Comparison with Other Approaches
  4.9 Summary

5 Research Trend Detection
  5.1 Related Work
    5.1.1 Semi-automatic Approaches
    5.1.2 Automatic Approaches
    5.1.3 Discussion
  5.2 CCA-based Trend Detection
  5.3 Keyword-based Clustering
    5.3.1 Document Clustering
    5.3.2 Publisher Clustering
    5.3.3 Temporal Clustering
  5.4 Context-based Cluster Analysis
  5.5 Trend Information Generation
    5.5.1 Current Trend Identification
    5.5.2 Trend Information Extraction
    5.5.3 Emerging Trend Identification
  5.6 Trend Retrieval
  5.7 Performance Evaluation
    5.7.1 Experiment
    5.7.2 Trend Identification
    5.7.3 Trend Information Extraction and Retrieval
    5.7.4 Trend Visualization
  5.8 Summary

6 Fuzzy Concept Hierarchy Generation
  6.1 Related Work
    6.1.1 Concept Hierarchy Generation
    6.1.2 Conceptual Clustering
    6.1.3 Formal Concept Analysis
    6.1.4 Discussion
  6.2 Fuzzy Theory
  6.3 Fuzzy Concept Hierarchy Generation
    6.3.1 Fuzzy Formal Concept Analysis
    6.3.2 Fuzzy Conceptual Clustering
    6.3.3 Hierarchical Relation Generation
  6.4 Research Concept Hierarchy Generation
    6.4.1 Fuzzy Formal Concept Analysis
    6.4.2 Fuzzy Conceptual Clustering
    6.4.3 Hierarchical Relation Generation
    6.4.4 Performance Evaluation
  6.5 Machine Faults Concept Hierarchy Generation
    6.5.1 Fuzzy Formal Concept Analysis
    6.5.2 Fuzzy Conceptual Clustering
    6.5.3 Hierarchical Relation Generation
    6.5.4 Performance Evaluation
  6.6 News Topic Themes Concept Hierarchy Generation
    6.6.1 Fuzzy Formal Concept Analysis
    6.6.2 Fuzzy Conceptual Clustering
    6.6.3 Hierarchical Relation Generation
    6.6.4 Performance Evaluation
  6.7 Summary

7 Scholarly Ontology Generation
  7.1 Related Work
    7.1.1 Ontology Generation
    7.1.2 Generating Ontology from Scholarly Knowledge
    7.1.3 Discussion
  7.2 Fuzzy Ontology Generation
    7.2.1 The FOGA Approach
    7.2.2 Incremental Ontology Update
    7.2.3 Research Hierarchy Ontology Generation
  7.3 Cluster-based Ontology Generation
    7.3.1 The COGA Approach
    7.3.2 Experts Ontology Generation
    7.3.3 Trends Ontology Generation
  7.4 Ontology Integration
    7.4.1 Ontology Integration Framework
    7.4.2 Scholarly Ontology Generation
  7.5 Semantic Web Representation
  7.6 Browsing Scholarly Ontology
  7.7 Summary

8 Scholarly Semantic Web
  8.1 Related Work
    8.1.1 Citation-based Retrieval
    8.1.2 Semantic Web-based Information Retrieval
    8.1.3 Discussion
  8.2 System Overview of SSWeb
  8.3 Scholarly Semantic Web Services
    8.3.1 Scholarly Service Provider
    8.3.2 Scholarly Service Requester
    8.3.3 Matchmaking Agent
    8.3.4 Scholarly Information Retrieval
  8.4 Summary

9 Conclusions
  9.1 Summary
  9.2 Future Work
    9.2.1 Discovering Other Scholarly Knowledge
    9.2.2 Fuzzy Semantic Query Languages
    9.2.3 Automatic Ontology Integration
    9.2.4 Fuzzy Query Expansion using Fuzzy Concept Hierarchy

A List of Publications
  A.1 Refereed Conferences and Workshops
  A.2 Book Chapters
  A.3 Journals

B 20 Queries for Performance Evaluation on Expert Finding

Bibliography

List of Tables

3.1 A distance matrix.
3.2 A cross-table of a document clustering context.
3.3 A cross-table of an author clustering context.
3.4 A cross-clustering context from the document and author clustering contexts.
3.5 Different combinations of clusters mining.
4.1 An example of the Keyword-Author Cross-Clustering Context.
4.2 Manually classified experts.
4.3 Performance results based on the average F-measure.
5.1 An example of a document clustering context.
5.2 An example of a topic clustering context.
5.3 An example of a temporal clustering context.
5.4 An example of the Keyword-Topic-Temporal Cross-Clustering Context.
5.5 Manually predefined trends in the Information Retrieval field.
5.6 Trends identification results using the single link method.
5.7 Trends identification results using the complete link method.
5.8 Trends identification results using the average link method.
5.9 Trends identification results using the Ward’s method.
5.10 Performance results of trends information extraction.
6.1 A cross-table of a formal context.
6.2 A cross-table of a fuzzy formal context.
6.3 Fuzzy formal context in Table 6.2 with α-cut = 0.5.
6.4 A cross-table of a L-fuzzy context.
6.5 Full context of a L-fuzzy context.
6.6 Number of research clusters using FCHG and LFCA-based conceptual clustering methods based on different similarity thresholds Ts.
6.7 Runtime (in sec.) required to generate conceptual clusters.
6.8 Performance results based on precision.
6.9 Performance results based on recall.
6.10 Performance results based on F-measure.
6.11 Performance comparison based on precision.
6.12 Performance comparison based on precision.
6.13 Performance comparison based on F-measure.
6.14 Number of research clusters using FCHG and LFCA-based conceptual clustering methods based on difference confidence thresholds TC.
6.15 Runtime (in sec.) required to generate conceptual clusters.
6.16 Retrieval accuracy.
6.17 Number of research clusters using FCHG and LFCA-based conceptual clustering methods based on difference confidence thresholds TC.
6.18 Runtime (in sec.) required to generate conceptual clusters.
6.19 Manually classified themes of Reuters news topics.
6.20 Performance results based on precision.
6.21 Performance results based on recall.
6.22 Performance results based on F-measure.
B.1 20 queries for performance evaluation on expert finding.

List of Figures

1.1 System architecture of the proposed Scholarly Semantic Web.
2.1 Representation of a publication using XML.
2.2 Another representation of the publication using XML.
2.3 RDF data model.
2.4 Representation of a publication using RDF.
2.5 Architecture of the Semantic Web.
2.6 Representation of semantic information using SHOE.
2.7 Representation of semantic information using Ontobroker.
2.8 Class representation using DAML-ONT.
2.9 Class representation using OIL.
2.10 Class representation using DAML+OIL.
2.11 Class Representation using OWL.
2.12 Semantic Web Portal.
2.13 Operational mechanism in Web Services.
2.14 Technologies used on the operational mechanism in Web Services.
3.1 Fuzzy clustering.
3.2 Context-based Cluster Analysis.
3.3 Cross-Clustering Relation Generation.
3.4 Vectorization from document and author clustering.
3.5 Algorithm for distance matrix generation.
3.6 Clustering multi-dimensional combined vectors with AHC.
3.7 Relationship vector generation.
3.8 A cross-clustering relation.
3.9 Number of clusters generated for different methods used in Vector Clustering.
3.10 Performance results based on IR measures.
3.11 Performance results based on entropy.
4.1 The CCA-based expert finding approach.
4.2 Document clustering process.
4.3 Author clustering process.
4.4 Context-based Cluster Analysis.
4.5 Visualizing research experts in research areas.
4.6 Performance results based on recall.
4.7 Performance results based on precision.
4.8 Performance results based on F-measure.
4.9 Performance comparison based on recall.
4.10 Performance comparison based on precision.
4.11 Performance comparison based on F-measure.
4.12 Performance comparison based on average F-measure.
5.1 The proposed CCA-based trend detection approach.
5.2 Document clustering.
5.3 Publisher clustering.
5.4 Temporal clustering.
5.5 Context-based Cluster Analysis.
5.6 Performance results of trends information retrieval.
5.7 Trend visualization.
6.1 The proposed Fuzzy Concept Hierarchy Generation technique.
6.2 A concept lattice generated from traditional FCA.
6.3 A fuzzy concept lattice generated from FFCA.
6.4 An L-fuzzy concept lattice.
6.5 The fuzzy conceptual clustering algorithm.
6.6 Conceptual clusters with confidence threshold TS = 0.4.
6.7 Conceptual clusters with confidence threshold TS = 0.5.
6.8 Conceptual clusters with associated objects and attribute sets.
6.9 Concept hierarchy.
6.10 An example of the Research Concept Hierarchy.
6.11 Performance results based on cluster goodness.
6.12 Performance results based on AUP.
6.13 An example fault-condition and checkpoint in a customer service record.
6.14 A part of the Machine Faults Concept Hierarchy for machine model AV 2011.
6.15 Performance results based on IR measures.
6.16 A part of the News Topic Themes Concept Hierarchy.
6.17 Average performance results based on IR measures.
6.18 Performance comparison based on IR measures.
7.1 An example scholarly ontology.
7.2 Fuzzy ontology generation process.
7.3 Research Hierarchy Ontology.
7.4 Cluster-based Ontology Generation Framework.
7.5 Experts Ontology.
7.6 Trends Ontology.
7.7 Ontology Integration Framework.
7.8 Sets of ontologies’ classes.
7.9 Integration rules.
7.10 Preliminary Scholarly Ontology.
7.11 Scholarly Ontology.
7.12 An example of Trends Ontology classes represented in OWL Full.
7.13 An example of a part of the Scholarly Ontology represented in OWL Full.
7.14 Browsing research areas on the Scholarly Ontology.
7.15 Browsing documents on the Scholarly Ontology.
8.1 General architecture of a typical citation-based retrieval system.
8.2 Scholarly Semantic Web.
8.3 Template profile of Scholarly Service Requester represented in OWL-S.
8.4 Search interface for Scholarly Service Requester.
8.5 A profile of a Scholarly Service Requester.
8.6 Scholarly information retrieval process.
8.7 Expert finding results.

Abstract
The Web has become one of the most important media for storing scholarly information.
Search engines such as Google and Yahoo are insufficient for finding relevant scholarly
information. Current citation-based retrieval systems such as ISI (Institute for
Scientific Information) and CiteSeer provide only basic citation search support, which
cannot cater for the needs of the growing scholarly research community. Moreover, the
scholarly knowledge discovered from a citation database cannot be shared among different
citation-based retrieval systems.
Nevertheless, a citation database contains very useful semantic scholarly knowledge
which can be further explored for supporting advanced search functions such as expert
finding and trend detection. The recent development of the Semantic Web has provided
a very suitable environment for supporting the sharing of scholarly knowledge among
different research communities. Therefore, in this research, we aim to develop a Semantic
Web-based system for the sharing and retrieval of scholarly information based on a
citation database.
To achieve the goal, we have proposed the Scholarly Semantic Web (or SSWeb), which
organizes scholarly knowledge as an ontology that is distributed across the Semantic
Web. As such, scholarly information can be managed and refined by domain experts,
and can be shared and accessed by programs. Moreover, the proposed SSWeb system
can support not only typical scholarly document and citation-based retrieval, but also
advanced scholarly search functions such as expert finding, trend detection and fuzzy
document retrieval.
In summary, this thesis has made the following contributions:
• A cluster analysis technique, namely Context-based Cluster Analysis (CCA), has
been developed to discover relationships among multiple sets of resultant clustering
data.
• A CCA-based expert finding approach has been developed to find experts in research
areas. The advantage of the proposed approach is its capability to find information
on experts from a global perspective instead of being limited to a specific
organization or domain.
• A CCA-based trend detection approach has been developed to detect trends in
research areas. The proposed approach can detect research trends from a citation
database in a fully automatic manner.
• A fuzzy concept hierarchy generation technique has been developed for domains with
uncertain information. This technique can be used for supporting fuzzy document
retrieval on the Scholarly Semantic Web.
• Ontology generation approaches have been developed for generating the Scholarly
Ontology. The Scholarly Ontology is generated by integrating the different scholarly
knowledge discovered from the citation database.
• A distributed architecture for the Scholarly Semantic Web has been developed to
support scholarly information retrieval over the Semantic Web environment. In the
Scholarly Semantic Web, the scholarly knowledge is shareable over multiple locations
through Semantic Web Services.

Acknowledgements
I would like to express my gratitude and sincere appreciation to my supervisor, Assoc.
Prof. Hui Siu Cheung, for his guidance, constant encouragement, thoughtful criticism
and invaluable suggestions. Without his support and invaluable patience, the thesis
would not have been possible. He has also always kept on pushing me to advance to
higher research levels.
I would like to thank Dr. Alvis Fong for his review and invaluable comments on my
research.
I would also like to thank Dr. Tru Hoang Cao for rendering help in my work and
career.
I would also like to thank Dr. He Yulan for her help and useful related materials
provided about the citation database in the initial stage of my research.
I would also like to thank Nanyang Technological University (NTU) for providing me
with the financial support to do research in Singapore.
I want to thank all my friends and colleagues, who assisted me in many ways throughout
the duration of the research. In particular, I would like to express my special thanks
to my friend, Mr. Do Tien Dung, for his constant support and helpful advice.
My gratitude also goes to Mr. Teo Choo Eng and Ms. Eng Hui Fang, the laboratory
technicians of the Database Technology Laboratory, for their technical support and help.
I am also grateful to the staff of the BioInformatics Research Centre (BIRC) for their
efforts to maintain a good working environment for my research.
Last but not least, my deepest gratitude goes to my family, my parents and sister,
for their love, support and patience. I would not have had these days without their
endless encouragement.


Chapter 1
Introduction

1.1 Scholarly Web Information

With the rapid growth of the World Wide Web (or Web), more and more information about
scholarly scientific publications is stored on the Web. Researchers can
access and even download scientific documents from various online repositories on the
Web. Electronic archives such as CiteBase [1] and Open Archives Initiative (OAI) [2]
provide free access to scholarly documents. In electronic journals such as Elsevier [3]
and IOP [4], scientific documents are also provided with links to discussions, related
documents, and notification and alerting services. Digital libraries [5] such as the ACM
Digital Library [6] and ScienceDirect [7] contain a large quantity of organized
publications collected from various publishers in various forms (e.g. articles and books) and
formats (e.g. texts, animations and movies). In fact, most of the scientific research work
published in scholarly journals and conference proceedings is now available online in
the form of digital libraries [8].
As a result, researchers can browse and search the Web to keep abreast of
the research trends that are relevant to them, and focus more on new research issues.
Currently, researchers from most universities and institutions rely heavily on the Web
to find scholarly information related to their work [9]. However, as the number of
scientific documents available on the Web increases tremendously, efficient and effective
mechanisms are needed to help researchers locate the scholarly information and
documents they need.

1.2 Scholarly Information Retrieval

A number of search engines such as Google [10] and Yahoo [11] have been developed to
help users find information on the Web. In particular, Vivisimo [12] is a search engine
that applies clustering techniques to categorize search results. These search engines
retrieve documents relevant to the search queries based on keywords. In order to support
search queries, search engines crawl web sites and index them based on the extracted
keywords and linkage information. Generally, search engines are not very effective in
helping researchers to find related scholarly information because they are mainly
developed for general searching purposes. The returned search results are too generalized
and often mixed up with non-scholarly information. Thus, specialized search engines or
systems such as ISI [13] and CiteSeer [14, 15, 16, 17], which can overcome the
over-generalization problem of traditional search engines, have been developed for
retrieving scholarly information.
In scholarly publications, papers or books are usually cited as references. Citations
help readers to locate relevant papers for a further understanding of the discussed topic. A citation index records the references that each document cites, providing
links between source documents and the cited documents. Thus, citation indexes provide
useful information to help researchers to conduct scientific research, such as identifying
researchers working on their research areas, finding publications from a certain research
area, and even analyzing research trends. Therefore, citation indexes are employed by
specialized search engines to index scientific publications for retrieval. Citation indexes
are stored in citation databases. In a citation database, citation indexes are stored
together with publication-related information such as title, author, keywords, date of
publication, etc.
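As an illustration only, the link structure that a citation index provides can be sketched as a mapping from each source document to the documents it cites, together with the inverted mapping from each cited document to its citing documents. The paper identifiers and the function name `build_cited_by` below are hypothetical and are not taken from ISI, CiteSeer or any other system discussed here.

```python
from collections import defaultdict

# Hypothetical toy data: each paper ID maps to the IDs of the papers it cites.
references = {
    "P1": ["P2", "P3"],
    "P2": ["P3"],
    "P3": [],
    "P4": ["P1", "P3"],
}


def build_cited_by(refs):
    """Invert the reference lists: cited paper -> sorted list of citing papers."""
    cited_by = defaultdict(list)
    for source, targets in refs.items():
        for target in targets:
            cited_by[target].append(source)
    return {paper: sorted(citers) for paper, citers in cited_by.items()}


cited_by = build_cited_by(references)

# Forward lookup: what does P4 cite?  Reverse lookup: who cites P3?
print(references["P4"])
print(cited_by.get("P3", []))
```

Keeping both directions makes the two basic citation queries (cited documents and citing documents) simple lookups; a real citation database would store these links alongside the title, author and date fields mentioned above.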
Citation-based retrieval systems have been developed for retrieving scholarly information. The Institute for Scientific Information (ISI) [13] provides a commercial citation-based
retrieval system over the Web, which allows users to search for cited documents from
the citation database. CiteSeer [14, 15, 16, 17], which is also known as ResearchIndex,
provides another very powerful and popular citation-based retrieval system on the Web.
Both ISI and CiteSeer support citation-based retrieval. Users can search for documents
using keyword-based queries. The related documents and their citation information,
such as cited and citing documents, are then retrieved. In addition to traditional citation search, PubSearch [18, 19] is a retrieval system that also supports document clustering retrieval and author clustering retrieval. Similar documents and authors working
in similar research areas can be retrieved based on input keywords and author names
respectively.
Although citation-based retrieval systems have provided useful search functions for
finding scientific publications, they still have the following limitations:
• Search functions. Only basic search functions such as keyword search, author
search and traditional citation search are supported. However, this will not be
sufficient to keep up with the needs of the rapidly growing scholarly community.
Advanced search functions such as expert finding in certain research areas, and
the detection of current research trends are highly desirable.
• Sharing of scholarly information. There is currently no significant sharing of knowledge between different citation-based retrieval systems. As such, the exchange of
knowledge is difficult. This will hamper the development of a global, distributed
environment for supporting scholarly information retrieval.

1.3 The Semantic Web

Currently, information available on the Web has been designed for humans to understand. Programs can be written to process, analyze and index web pages to help humans
process the information. However, due to the lack of machine-readable structure and
knowledge representation in web documents, programs are unable to comprehend web
page contents precisely, and hence semantic information from web documents cannot
be extracted. As such, a method for representing knowledge such that programs can
understand, share and exchange the knowledge is needed.
To tackle this problem, Tim Berners-Lee et al. [20] have proposed the Semantic Web,
which is an extension of the current Web in which information is given well-defined
meaning, better enabling computers and people to work in cooperation. Ontology is
used to represent knowledge on the Semantic Web. Generally, an ontology is a conceptualization of a domain in a human-understandable but machine-readable format
consisting of entities, attributes, relationships and axioms [21]. As such, programs can
use the knowledge from the Semantic Web for processing information in a semantic manner. Web Services [22] have been introduced to make the knowledge conveyed by the
ontology on the Semantic Web accessible across different applications. Semantic Web
Services [23] represent Web Services as ontologies, thereby making the provided services
not only accessible but understandable by other programs.
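To make the notion of machine-readable knowledge more concrete, the following is a minimal sketch, assuming that an ontology is held as a set of (subject, predicate, object) triples in memory. All class names, instance names and predicate names are hypothetical, and the single inference rule (instances of a subclass are also instances of its superclass) merely stands in for the axioms mentioned above. It is not a representation prescribed by any Semantic Web standard.

```python
# Hypothetical mini-ontology stored as (subject, predicate, object) triples.
triples = {
    ("Paper", "subClassOf", "Document"),
    ("Book", "subClassOf", "Document"),
    ("paper42", "type", "Paper"),
    ("paper42", "hasAuthor", "alice"),
    ("alice", "type", "Researcher"),
}


def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def instances_of(store, cls):
    """Instances typed with cls or with any of its transitive subclasses."""
    classes, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for sub, _, _ in match(store, p="subClassOf", o=current):
            if sub not in classes:
                classes.add(sub)
                frontier.append(sub)
    return sorted(s for s, _, o in match(store, p="type") if o in classes)


# paper42 is declared only as a Paper, yet it is found as a Document.
print(instances_of(triples, "Document"))
print(match(triples, s="paper42", p="hasAuthor"))
```

The point of the sketch is that a program can answer a question ("which documents exist?") that no single stored statement answers directly, which is what processing information "in a semantic manner" amounts to at the smallest scale.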

1.4 Semantic Web-based Retrieval Systems

The Semantic Web has provided a very suitable environment for supporting knowledge
management and retrieval. In the traditional Web, semantic information can be annotated in web pages using enhanced metadata description languages as shown in a number
of systems such as SHOE [24], Ontobroker [25], WebKB [26], Quest [27], and the Expressive and
Efficient Language for XML Information Retrieval (ELIXIR) [28]. However, due to the
lack of a standard for knowledge annotation, such systems are still unable to share their
information.

With the advancement of the Semantic Web, Semantic Web Portals [29] (SW Portals)
have been developed based on Semantic Web technologies. AIFB [30], Esperonto
[31], OntoWeb [32], Empolis K42 [33] and Mondeca ITM [34] are some well-known
portals that are currently being developed. Generally, Semantic Web-based information
retrieval is supported through Semantic Web Services provided by SW portals. To help
a retrieval system to locate suitable Semantic Web Services automatically, intelligent
agents called matchmaking agents [35] or matchmaker agents [36] have been proposed.
Swoogle [37], a Semantic Web-based search engine, has used intelligent agents as crawlers
to collect information provided by existing Semantic Web Services over the Web. The
collected information is then indexed for information retrieval.
Recently, several Semantic Web-based systems have been developed for the retrieval
of scholarly information. The E-scholar Knowledge Inference Model (ESKIMO) [38, 39] is
a Semantic Web-based scholarly information management system built
on hypertext links borrowed from the traditional Web. The Scholarly Ontology Project
[40] supports the management of scholarly publications using a collaborative approach. The Research in Semantic Scholarly Publishing (RSSP) project [41]
supports the retrieval of scholarly publications from on-line archives
based on the Semantic Web.
However, one of the major obstacles to developing Semantic Web-based retrieval
systems is the construction of an ontology for the corresponding domain. In the scholarly
domain, the scholarly ontology of the existing Semantic Web-based retrieval systems is
constructed mainly from explicit information in scientific documents (such as
titles, authors and abstracts) or by manual methods. However, explicit document
information can only provide knowledge or ontology for supporting basic search functions
such as keyword search or author search, and constructing a scholarly ontology
manually is a tedious and difficult task.

1.5 Objectives

The Web has become one of the most important media for storing scholarly
information. General search engines are insufficient for finding relevant scholarly information. The current citation-based retrieval systems can only provide basic citation search
support, which cannot cater for the needs of the growing scholarly research community. Moreover, the scholarly knowledge discovered from a citation database cannot
be shared among different citation-based retrieval systems. Nevertheless, a citation
database contains very useful semantic scholarly knowledge that can be further explored
for supporting advanced search functions such as expert finding and trend detection.
The development of the Semantic Web has provided a very suitable environment
for supporting the sharing of scholarly knowledge among different scholarly research
communities. However, one of the challenges in developing Semantic Web-based retrieval systems is the construction of scholarly ontology. The construction
process for scholarly ontology should be easy and preferably automatic rather than
manual, which is tedious and time-consuming.
This research aims to develop a Semantic Web-based system for the sharing and
retrieval of scholarly information based on a citation database. The proposed system
is known as Scholarly Semantic Web (or SSWeb). The proposed SSWeb system will
organize scholarly knowledge as ontology, which is distributed on the Semantic Web. As
such, scholarly information can be managed and refined by the corresponding domain
experts, and can be shared and accessed by programs. Moreover, the proposed SSWeb
system can support not only typical scholarly documents and citation-based retrieval,
but also advanced scholarly search functions such as expert finding, trend detection and
fuzzy document retrieval, in which fuzzy membership can be used to weight the query
terms accordingly to improve the retrieval performance.

[Figure 1.1: System architecture of the proposed Scholarly Semantic Web. Recoverable labels: Citation Database, Cluster Analysis, Ontology Generation, Scholarly Ontology (Organizations 1, 2 and 3), Scholarly Web Services, Web Browser, User.]
Figure 1.1 shows the proposed distributed architecture of the Scholarly Semantic
Web. In SSWeb, each organization (or institution) maintains its own scholarly ontology.
To generate scholarly ontology, we first investigate data mining techniques based on
cluster analysis and fuzzy conceptual clustering for the discovery of semantic knowledge
from a citation database. The discovered knowledge will be used for supporting advanced
search functions such as expert finding, trend detection and fuzzy document retrieval.
In ontology generation, the discovered knowledge will then be converted into scholarly
ontology. Semantic Web Services will also be investigated in order to provide support
for scholarly information retrieval.
Therefore, to achieve our primary aim of developing the Scholarly Semantic Web,
we will carry out research in the following areas:
• Cluster Analysis. Advanced search functions such as expert finding and trend
detection are highly desirable for the scholarly research community. In this research,
we will investigate a clustering technique for analyzing cluster relationships among
multiple clustering results obtained from data on documents, authors, publishers and dates
of publication in a citation database. The discovered cluster relationships will be
used for developing the functions for expert finding and trend detection.
• Fuzzy Conceptual Clustering. Traditional clustering techniques may not be sufficient for discovering and representing certain types of scholarly information, such
as hierarchical relations and uncertain information, that commonly occur in scholarly documents. The incorporation of hierarchical structures of scholarly knowledge with uncertain information will provide support for fuzzy document retrieval,
which will enhance the retrieval performance. In this research, we will investigate a
fuzzy-based conceptual clustering technique for deriving scholarly knowledge from
uncertain information in scholarly documents from a citation database. The derived
knowledge will be hierarchically organized as a fuzzy concept hierarchy.
• Ontology Generation. In order to make the scholarly knowledge derived from
cluster analysis and fuzzy conceptual clustering available in the Semantic Web
environment, it is necessary to convert the knowledge into an ontology formalism.
In this research, we will investigate techniques for automatic generation of scholarly ontology from the generated scholarly knowledge. In addition, we will also
investigate the integration of different types of ontologies that are generated from
cluster analysis and fuzzy conceptual clustering.
• Scholarly Semantic Web Services. To provide scholarly information retrieval services over the Semantic Web, we will investigate a Semantic Web-based architecture for the delivery of Scholarly Semantic Web Services. The proposed architecture should enable the retrieval of scholarly information from multiple Scholarly
Service Providers, thereby enabling the sharing and reusability of scholarly knowledge in a distributed manner.
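The idea of fuzzy document retrieval mentioned in the list above, where membership degrees are used to weight query terms, can be illustrated with a small sketch. This is only one plausible scoring scheme (a weighted average of membership degrees) chosen for illustration; the membership values, document names and the functions `fuzzy_score` and `rank` are hypothetical and do not reproduce the technique developed in later chapters.

```python
# Hypothetical membership degrees in [0, 1] of documents in research areas.
memberships = {
    "doc1": {"data mining": 0.9, "clustering": 0.6},
    "doc2": {"data mining": 0.2, "semantic web": 0.8},
    "doc3": {"clustering": 0.7, "semantic web": 0.4},
}


def fuzzy_score(doc_terms, query_weights):
    """Weighted average of the document's membership degrees over the query terms."""
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    hit = sum(w * doc_terms.get(term, 0.0) for term, w in query_weights.items())
    return hit / total


def rank(query_weights, threshold=0.0):
    """Documents scoring above the threshold, best first (ties broken by name)."""
    scored = [
        (doc, fuzzy_score(terms, query_weights))
        for doc, terms in memberships.items()
    ]
    kept = [(doc, score) for doc, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# The query weights "data mining" twice as heavily as "clustering".
query = {"data mining": 1.0, "clustering": 0.5}
for doc, score in rank(query):
    print(doc, round(score, 3))
```

Unlike exact keyword matching, every document receives a graded score, so a document that belongs only partially to a queried research area is ranked lower rather than excluded outright.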

1.6 Major Contributions

As a result of this research, we have developed several novel techniques in the areas of
knowledge management and data mining listed in Section 1.5. The major contributions
of this research are summarized as follows:
• Context-based Cluster Analysis (CCA) Technique. We have proposed a cluster
analysis technique based on the Formal Concept Analysis (FCA) [42] theory. The
proposed technique aims to find relationships among multiple sets of resultant
clustering data. The CCA technique is designed generically so that it can be used
by other applications. In this research, the CCA technique is used to support
advanced search functions for expert finding and research trend detection.
• CCA-based Expert Finding Approach. We have proposed a CCA-based approach
for finding experts in research areas. The proposed approach is able to find information on experts from a global perspective instead of being limited to a specific
organization or domain as in most existing approaches. In addition, the proposed
approach can also provide information on the experts' expertise or research areas, which are
represented by significant keywords. The identified experts
and expertise information can be used for visualization and information retrieval.

• CCA-based Research Trend Detection Approach. We have proposed a CCA-based
approach for detecting trends in research areas. The proposed approach can detect
research trends in a fully automatic manner. In addition, the trends detected using
the proposed approach can always be treated as current trends, and the associated
statistical information can be gathered without being limited to any specific time
period. The detected trends can be used for information retrieval.
• Fuzzy Conceptual Clustering. We have proposed a fuzzy conceptual clustering
technique for generating a fuzzy concept hierarchy based on the FCA theory. The
proposed technique first incorporates fuzzy logic into FCA in order to deal with
uncertain data, and then constructs the fuzzy concept hierarchy through the proposed fuzzy conceptual clustering technique. This technique aims to construct
a concept hierarchy of conceptual clusters of research areas for supporting fuzzy
document retrieval on the Scholarly Semantic Web.
• Scholarly Ontology Generation. We have proposed an ontology generation approach for generating scholarly ontology from the scholarly knowledge discovered
from a citation database. The scholarly knowledge that is converted into ontology
includes expert knowledge, trend knowledge and the research concept hierarchy. The
scholarly ontologies derived from a citation database are then integrated and populated into the Scholarly Ontology. The generated Scholarly Ontology can be used to
support advanced scholarly information retrieval functions. Using our techniques, many
Scholarly Ontologies may be generated and stored in various academic Semantic Web sites
in a distributed manner.
• Scholarly Semantic Web Architecture. In this research, we have proposed a distributed architecture for the Scholarly Semantic Web. With the proposed architecture, the system is capable of exploring, exchanging and sharing scholarly
information in the Semantic Web environment. The proposed system provides not
only basic citation-based search functions, but also advanced search functions for
expert finding, trend detection and fuzzy document retrieval.
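Several of the contributions above build on Formal Concept Analysis (FCA) [42]. As background, a formal concept is a pair (extent, intent) in which the extent is exactly the set of objects sharing all attributes of the intent, and the intent is exactly the set of attributes shared by all objects of the extent. The sketch below enumerates the formal concepts of a tiny context by brute force. The papers and keywords are hypothetical, and the enumeration is a textbook illustration of plain FCA, not the CCA or fuzzy conceptual clustering techniques themselves.

```python
from itertools import combinations

# Hypothetical formal context: objects (papers) and the attributes (keywords) they have.
context = {
    "p1": {"clustering", "fuzzy"},
    "p2": {"clustering", "ontology"},
    "p3": {"clustering", "fuzzy", "ontology"},
}


def common_attributes(objects):
    """Attributes shared by every given object (all attributes for an empty set)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def objects_having(attrs):
    """Objects that possess every attribute in attrs."""
    return {obj for obj, has in context.items() if attrs <= has}


def formal_concepts():
    """Enumerate (extent, intent) pairs by closing every subset of objects.

    Brute force: exponential in the number of objects, fine only for toy contexts.
    """
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset)
            extent = objects_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Print concepts from the most general (largest extent) to the most specific.
ordered = sorted(formal_concepts(), key=lambda c: (-len(c[0]), sorted(c[0])))
for extent, intent in ordered:
    print(sorted(extent), sorted(intent))
```

Ordering the concepts by inclusion of their extents yields the concept lattice; here the most general concept groups all three papers under "clustering", and the more specific concepts refine it by "fuzzy" and "ontology".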

1.7 Organization of the Thesis

This chapter has discussed the background and motivation of this research work. The
objectives of the research have been given. We have also listed the contributions that
have been achieved. The rest of the thesis is organized as follows.
Chapter 2 reviews the Semantic Web and the state-of-the-art Semantic Web technologies, which include ontology, Semantic Web Portals and Semantic Web Services.
In Chapter 3, we discuss the proposed cluster analysis technique for mining cluster
relationships from multiple clustering data. The technique, which is known as Context-based Cluster Analysis, is capable of representing cluster relationships among multiple
clusters as mathematical models. The performance of the proposed technique is also