

DATA MINING AND KNOWLEDGE
DISCOVERY APPROACHES BASED ON
RULE INDUCTION TECHNIQUES

Edited by
EVANGELOS TRIANTAPHYLLOU
Louisiana State University, Baton Rouge, Louisiana, USA
GIOVANNI FELICI
Consiglio Nazionale delle Ricerche, Rome, Italy

Springer

Library of Congress Control Number: 2006925174
ISBN-10: 0-387-34294-X

e-ISBN: 0-387-34296-6

ISBN-13: 978-0-387-34294-8

Printed on acid-free paper.


© 2006 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY
10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed in the United States of America.
987654321
springer.com


I gratefully dedicate this book to my new life's inspiration, my mother
Helen and late father John (Ioannis), my late Grandfather (Evangelos), and
also to my beloved Ragus and Ollopa ("Ikasinilab"). It would have never
been prepared without their encouragement, patience, and unique
inspiration. —Evangelos Triantaphyllou
I wish to dedicate this book to la Didda, le Pullalle, and Misty—four special
girls who are always on my side—and to all my friends, who make me strong;
to them goes my gratitude for their warm support. —Giovanni Felici


TABLE OF CONTENTS

List of Figures  xxiii
List of Tables  xxix
Foreword  xxxvii
Preface  xxxix
Acknowledgements  xlvii

Chapter 1
A COMMON LOGIC APPROACH TO DATA MINING AND PATTERN RECOGNITION, by A. Zakrevskij  1
1. Introduction  2
1.1 Using Decision Functions  2
1.2 Characteristic Features of the New Approach  4
2. Data and Knowledge  6
2.1 General Definitions  6
2.2 Data and Knowledge Representation - the Case of Boolean Attributes  9
2.3 Data and Knowledge Representation - the Case of Multi-Valued Attributes  10
3. Data Mining - Inductive Inference  12
3.1 Extracting Knowledge from the Boolean Space of Attributes  12
3.2 The Screening Effect  18
3.3 Inductive Inference from Partial Data  20
3.4 The Case of Multi-Valued Attributes  21
4. Knowledge Analysis and Transformations  23
4.1 Testing for Consistency  23
4.2 Simplification  27
5. Pattern Recognition - Deductive Inference  28
5.1 Recognition in the Boolean Space  28
5.2 Appreciating the Asymmetry in Implicative Regularities  31
5.3 Deductive Inference in Finite Predicates  34
5.4 Pattern Recognition in the Space of Multi-Valued Attributes  36
6. Some Applications  38
7. Conclusions  40
References  41
Author's Biographical Statement  43

Chapter 2
THE ONE CLAUSE AT A TIME (OCAT) APPROACH TO DATA MINING AND KNOWLEDGE DISCOVERY, by E. Triantaphyllou  45
1. Introduction  46
2. Some Background Information  49
3. Definitions and Terminology  52
4. The One Clause at a Time (OCAT) Approach  54
4.1 Data Binarization  54
4.2 The One Clause at a Time (OCAT) Concept  58
4.3 A Branch-and-Bound Approach for Inferring Clauses  59
4.4 Inference of the Clauses for the Illustrative Example  62
4.5 A Polynomial Time Heuristic for Inferring Clauses  65
5. A Guided Learning Approach  70
6. The Rejectability Graph of Two Collections of Examples  72
6.1 The Definition of the Rejectability Graph  72
6.2 Properties of the Rejectability Graph  74
6.3 On the Minimum Clique Cover of the Rejectability Graph  76
7. Problem Decomposition  77
7.1 Connected Components  77
7.2 Clique Cover  78
8. An Example of Using the Rejectability Graph  79
9. Conclusions  82
References  83
Author's Biographical Statement  87


Chapter 3
AN INCREMENTAL LEARNING ALGORITHM FOR INFERRING LOGICAL RULES FROM EXAMPLES IN THE FRAMEWORK OF THE COMMON REASONING PROCESS, by X. Naidenova  89
1. Introduction  90
2. A Model of Rule-Based Logical Inference  96
2.1 Rules Acquired from Experts or Rules of the First Type  97
2.2 Structure of the Knowledge Base  98
2.3 Reasoning Operations for Using Logical Rules of the First Type  100
2.4 An Example of the Reasoning Process  102
3. Inductive Inference of Implicative Rules From Examples  103
3.1 The Concept of a Good Classification Test  103
3.2 The Characterization of Classification Tests  105
3.3 An Approach for Constructing Good Irredundant Tests  106
3.4 Structure of Data for Inferring Good Diagnostic Tests  107
3.5 The Duality of Good Diagnostic Tests  109
3.6 Generation of Dual Objects with the Use of Lattice Operations  110
3.7 Inductive Rules for Constructing Elements of a Dual Lattice  111
3.8 Special Reasoning Operations for Constructing Elements of a Dual Lattice  112
3.8.1 The Generalization Rule  112
3.8.2 The Diagnostic Rule  113
3.8.3 The Concept of an Essential Example  114
4. Algorithms for Constructing All Good Maximally Redundant Tests  115
4.1 NIAGaRa: A Non-Incremental Algorithm for Constructing All Good Maximally Redundant Tests  115
4.2 Decomposition of Inferring Good Classification Tests into Subtasks  122
4.2.1 Forming the Subtasks  123
4.2.2 Reducing the Subtasks  125
4.2.3 Choosing Examples and Values for the Formation of Subtasks  127
4.2.4 An Approach for Incremental Algorithms  129
4.3 DIAGaRa: An Algorithm for Inferring All GMRTs with the Decomposition into Subtasks of the First Kind  130
4.3.1 The Basic Recursive Algorithm for Solving a Subtask of the First Kind  130
4.3.2 An Approach for Forming the Set STGOOD  131
4.3.3 The Estimation of the Number of Subtasks to Be Solved  131
4.3.4 CASCADE: Incrementally Inferring GMRTs Based on the Procedure DIAGaRa  132
4.4 INGOMAR: An Incremental Algorithm for Inferring All GMRTs  132
5. Conclusions  138
Acknowledgments  138
Appendix  139
References  143
Author's Biographical Statement  147

Chapter 4
DISCOVERING RULES THAT GOVERN MONOTONE PHENOMENA, by V.I. Torvik and E. Triantaphyllou  149
1. Introduction  150
2. Background Information  152
2.1 Problem Descriptions  152
2.2 Hierarchical Decomposition of Variables  155
2.3 Some Key Properties of Monotone Boolean Functions  157
2.4 Existing Approaches to Problem 1  160
2.5 An Existing Approach to Problem 2  162
2.6 Existing Approaches to Problem 3  162
2.7 Stochastic Models for Problem 3  162
3. Inference Objectives and Methodology  165
3.1 The Inference Objective for Problem 1  165
3.2 The Inference Objective for Problem 2  166
3.3 The Inference Objective for Problem 3  166
3.4 Incremental Updates for the Fixed Misclassification Probability Model  167
3.5 Selection Criteria for Problem 1  167
3.6 Selection Criteria for Problems 2.1, 2.2, and 2.3  168
3.7 Selection Criterion for Problem 3  169
4. Experimental Results  174
4.1 Experimental Results for Problem 1  174
4.2 Experimental Results for Problem 2  176
4.3 Experimental Results for Problem 3  179
5. Summary and Discussion  183
5.1 Summary of the Research Findings  183
5.2 Significance of the Research Findings  186
5.3 Future Research Directions  187
6. Concluding Remarks  187
References  188
Authors' Biographical Statements  191

Chapter 5
LEARNING LOGIC FORMULAS AND RELATED ERROR DISTRIBUTIONS, by G. Felici, F. Sun, and K. Truemper  193
1. Introduction  194
2. Logic Data and Separating Set  197
2.1 Logic Data  197
2.2 Separating Set  198
3. Problem Formulation  200
3.1 Logic Variables  201
3.2 Separation Condition for Records in A  201
3.3 Separation Condition for Records in B  201
3.4 Selecting a Largest Subset  202
3.5 Selecting a Separating Vector  203
3.6 Simplification for 0/1 Records  204
4. Implementation of Solution Algorithm  204
5. Leibniz System  205
6. Simple-Minded Control of Classification Errors  206
7. Separations for Voting Process  207
8. Probability Distribution of Vote-Total  208
8.1 Mean and Variance for Z  209
8.2 Random Variables Yi  211
8.3 Distribution for Y  212
8.4 Distribution for Z  213
8.5 Probabilities of Classification Errors  213
8.6 Summary of Algorithm  216
9. Computational Results  216
9.1 Breast Cancer Diagnosis  218
9.2 Australian Credit Card  219
9.3 Congressional Voting  219
9.4 Diabetes Diagnosis  219
9.5 Heart Disease Diagnosis  220
9.6 Boston Housing  221
10. Conclusions  221
References  222
Authors' Biographical Statements  226

Chapter 6
FEATURE SELECTION FOR DATA MINING, by V. de Angelis, G. Felici, and G. Mancinelli  227
1. Introduction  228
2. The Many Routes to Feature Selection  229
2.1 Filter Methods  232
2.2 Wrapper Methods  234
3. Feature Selection as a Subgraph Selection Problem  237
4. Basic IP Formulation and Variants  238
5. Computational Experience  241
5.1 Test on Generated Data  242
5.2 An Application  246
6. Conclusions  248
References  249
Authors' Biographical Statements  252

Chapter 7
TRANSFORMATION OF RATIONAL AND SET DATA TO LOGIC DATA, by S. Bartnikowski, M. Granberry, J. Mugan, and K. Truemper  253
1. Introduction  254
1.1 Transformation of Set Data  254
1.2 Transformation of Rational Data  254
1.3 Computational Results  256
1.4 Entropy-Based Approaches  257
1.5 Bottom-up Methods  258
1.6 Other Approaches  258
2. Definitions  259
2.1 Unknown Values  259
2.2 Records  260
2.3 Populations  260
2.4 DNF Formulas  260
2.5 Clash Condition  261
3. Overview of Transformation Process  262
4. Set Data to Logic Data  262
4.1 Case of Element Entries  262
4.2 Case of Set Entries  264
5. Rational Data to Logic Data  264
6. Initial Markers  265
6.1 Class Values  265
6.2 Smoothed Class Values  266
6.3 Selection of Standard Deviation  266
6.4 Definition of Markers  269
6.5 Evaluation of Markers  271
7. Additional Markers  271
7.1 Critical Interval  272
7.2 Attractiveness of Pattern Change  272
7.3 Selection of Marker  273
8. Computational Results  274
9. Summary  275
References  276
Authors' Biographical Statements  278

Chapter 8
DATA FARMING: CONCEPTS AND METHODS, by A. Kusiak  279
1. Introduction  280
2. Data Farming Methods  281
2.1 Feature Evaluation  282
2.2 Data Transformation  282
2.2.1 Filling in Missing Values  282
2.2.2 Discretization  283
2.2.3 Feature Content Modification  283
2.2.4 Feature Transformation  286
2.2.5 Data Evolution  289
2.3 Knowledge Transformation  290
2.4 Outcome Definition  295
2.5 Feature Definition  297
3. The Data Farming Process  298
4. A Case Study  299
5. Conclusions  301
References  302
Author's Biographical Statement  304

Chapter 9
RULE INDUCTION THROUGH DISCRETE SUPPORT VECTOR DECISION TREES, by C. Orsenigo and C. Vercellis  305
1. Introduction  306
2. Linear Support Vector Machines  308
3. Discrete Support Vector Machines with Minimum Features  312
4. A Sequential LP-based Heuristic for Problems LDVM and FDVM  314
5. Building a Minimum Features Discrete Support Vector Decision Tree  316
6. Discussion and Validation of the Proposed Classifier  319
7. Conclusions  322
References  324
Authors' Biographical Statements  326

Chapter 10
MULTI-ATTRIBUTE DECISION TREES AND DECISION RULES, by J.-Y. Lee and S. Olafsson  327
1. Introduction  328
2. Decision Tree Induction  329
2.1 Attribute Evaluation Rules  330
2.2 Entropy-Based Algorithms  332
2.3 Other Issues in Decision Tree Induction  333
3. Multi-Attribute Decision Trees  334
3.1 Accounting for Interactions between Attributes  334
3.2 Second Order Decision Tree Induction  335
3.3 The SODI Algorithm  339
4. An Illustrative Example  344
5. Numerical Analysis  347
6. Conclusions  349
Appendix: Detailed Model Comparison  351
References  355
Authors' Biographical Statements  358




Chapter 11
KNOWLEDGE ACQUISITION AND UNCERTAINTY IN FAULT DIAGNOSIS: A ROUGH SETS PERSPECTIVE, by L.-Y. Zhai, L.-P. Khoo, and S.-C. Fok  359
1. Introduction  360
2. An Overview of Knowledge Discovery and Uncertainty  361
2.1 Knowledge Acquisition and Machine Learning  361
2.1.1 Knowledge Representation  361
2.1.2 Knowledge Acquisition  362
2.1.3 Machine Learning and Automated Knowledge Extraction  362
2.1.4 Inductive Learning Techniques for Automated Knowledge Extraction  364
2.2 Uncertainties in Fault Diagnosis  366
2.2.1 Inconsistent Data  366
2.2.2 Incomplete Data  367
2.2.3 Noisy Data  368
2.3 Traditional Techniques for Handling Uncertainty  369
2.3.1 MYCIN's Model of Certainty Factors  369
2.3.2 Bayesian Probability Theory  370
2.3.3 The Dempster-Shafer Theory of Belief Functions  371
2.3.4 The Fuzzy Sets Theory  372
2.3.5 Comparison of Traditional Approaches for Handling Uncertainty  373
2.4 The Rough Sets Approach  374
2.4.1 Introductory Remarks  374
2.4.2 Rough Sets and Fuzzy Sets  375
2.4.3 Development of Rough Set Theory  376
2.4.4 Strengths of Rough Sets Theory and Its Applications in Fault Diagnosis  376
3. Rough Sets Theory in Classification and Rule Induction under Uncertainty  378
3.1 Basic Notions of Rough Sets Theory  378
3.1.1 The Information System  378
3.1.2 Approximations  379
3.2 Rough Sets and Inductive Learning  381
3.2.1 Inductive Learning, Rough Sets and the RClass  381
3.2.2 Framework of the RClass  382
3.3 Validation and Discussion  384
3.3.1 Example 1: Machine Condition Monitoring  385
3.3.2 Example 2: A Chemical Process  386
4. Conclusions  388
References  389
Authors' Biographical Statements  394

Chapter 12
DISCOVERING KNOWLEDGE NUGGETS WITH A GENETIC ALGORITHM, by E. Noda and A.A. Freitas  395
1. Introduction  396
2. The Motivation for Genetic Algorithm-Based Rule Discovery  399
2.1 An Overview of Genetic Algorithms (GAs)  400
2.2 Greedy Rule Induction  402
2.3 The Global Search of Genetic Algorithms (GAs)  404
3. GA-Nuggets  404
3.1 Single-Population GA-Nuggets  404
3.1.1 Individual Representation  405
3.1.2 Fitness Function  406
3.1.3 Selection Method and Genetic Operators  410
3.2 Distributed-Population GA-Nuggets  411
3.2.1 Individual Representation  411
3.2.2 Distributed Population  412
3.2.3 Fitness Function  414
3.2.4 Selection Method and Genetic Operators  415
4. A Greedy Rule Induction Algorithm for Dependence Modeling  415
5. Computational Results  416
5.1 The Data Sets Used in the Experiments  416
5.2 Results and Discussion  417
5.2.1 Predictive Accuracy  419
5.2.2 Degree of Interestingness  422
5.2.3 Summary of the Results  426
6. Conclusions  428
References  429
Authors' Biographical Statements  432

Chapter 13
DIVERSITY MECHANISMS IN PITT-STYLE EVOLUTIONARY CLASSIFIER SYSTEMS, by M. Kirley, H.A. Abbass and R.I. McKay  433
1. Introduction  434
2. Background - Genetic Algorithms  436
3. Evolutionary Classifier Systems  439
3.1 The Michigan Style Classifier System  439
3.2 The Pittsburgh Style Classifier System  440
4. Diversity Mechanisms in Evolutionary Algorithms  440
4.1 Niching  441
4.2 Fitness Sharing  441
4.3 Crowding  443
4.4 Isolated Populations  444
5. Classifier Diversity  446
6. Experiments  448
6.1 Architecture of the Model  448
6.2 Data Sets  449
6.3 Treatments  449
6.4 Model Parameters  449
7. Results  450
8. Conclusions  452
References  454
Authors' Biographical Statements  457

Chapter 14
FUZZY LOGIC IN DISCOVERING ASSOCIATION RULES: AN OVERVIEW, by G. Chen, Q. Wei and E.E. Kerre  459
1. Introduction  460
1.1 Notions of Associations  460
1.2 Fuzziness in Association Mining  462
1.3 Main Streams of Discovering Associations with Fuzzy Logic  464
2. Fuzzy Logic in Quantitative Association Rules  465
2.1 Boolean Association Rules  465
2.2 Quantitative Association Rules  466
2.3 Fuzzy Extensions of Quantitative Association Rules  468
3. Fuzzy Association Rules with Fuzzy Taxonomies  469
3.1 Generalized Association Rules  470
3.2 Generalized Association Rules with Fuzzy Taxonomies  471
3.3 Fuzzy Association Rules with Linguistic Hedges  473
4. Other Fuzzy Extensions and Considerations  474
4.1 Fuzzy Logic in Interestingness Measures  474
4.2 Fuzzy Extensions of Dsupport / Dconfidence  476
4.3 Weighted Fuzzy Association Rules  478
5. Fuzzy Implication Based Association Rules  480
6. Mining Functional Dependencies with Uncertainties  482
6.1 Mining Fuzzy Functional Dependencies  482
6.2 Mining Functional Dependencies with Degrees  483
7. Fuzzy Logic in Pattern Associations  484
8. Conclusions  486
References  487
Authors' Biographical Statements  493

Chapter 15
MINING HUMAN INTERPRETABLE KNOWLEDGE WITH FUZZY MODELING METHODS: AN OVERVIEW, by T.W. Liao  495
1. Background  496
2. Basic Concepts  498
3. Generation of Fuzzy If-Then Rules  500
3.1 Grid Partitioning  501
3.2 Fuzzy Clustering  506
3.3 Genetic Algorithms  509
3.3.1 Sequential Pittsburgh Approach  510
3.3.2 Sequential IRL+Pittsburgh Approach  511
3.3.3 Simultaneous Pittsburgh Approach  513
3.4 Neural Networks  517
3.4.1 Fuzzy Neural Networks  518
3.4.2 Neural Fuzzy Systems  519
3.4.2.1 Starting Empty  519
3.4.2.2 Starting Full  520
3.4.2.3 Starting with an Initial Rule Base  524
3.5 Hybrids  526
3.6 Others  526
3.6.1 From Exemplar Numeric Data  527
3.6.2 From Exemplar Fuzzy Data  527
4. Generation of Fuzzy Decision Trees  527
4.1 Fuzzy Interpretation of Crisp Trees with Discretized Intervals  528
4.2 Fuzzy ID3 Variants  529
4.2.1 From Fuzzy Vector-Valued Examples  529
4.2.2 From Nominal-Valued and Real-Valued Examples  530
5. Applications  532
5.1 Function Approximation Problems  532
5.2 Classification Problems  532
5.3 Control Problems  533
5.4 Time Series Prediction Problems  534
5.5 Other Decision-Making Problems  534
6. Discussion  534
7. Conclusions  537
References  538
Appendix 1: A Summary of Grid Partitioning Methods for Fuzzy Modeling  545
Appendix 2: A Summary of Fuzzy Clustering Methods for Fuzzy Modeling  546
Appendix 3: A Summary of GA Methods for Fuzzy Modeling  547
Appendix 4: A Summary of Neural Network Methods for Fuzzy Modeling  548
Appendix 5: A Summary of Fuzzy Decision Tree Methods for Fuzzy Modeling  549
Author's Biographical Statement  550

Chapter 16
DATA MINING FROM MULTIMEDIA PATIENT RECORDS, by A.S. Elmaghraby, M.M. Kantardzic, and M.P. Wachowiak  551
1. Introduction  552
2. The Data Mining Process  554
3. Clinical Patient Records: A Data Mining Source  556
3.1 Distributed Data Sources  560
3.2 Patient Record Standards  560
4. Data Preprocessing  563
5. Data Transformation  567
5.1 Types of Transformation  567
5.2 An Independent Component Analysis: Example of an EMG/ECG Separation  571
5.3 Text Transformation and Representation: A Rule-Based Approach  573
5.4 Image Transformation and Representation: A Rule-Based Approach  575
6. Dimensionality Reduction  579
6.1 The Importance of Reduction  579
6.2 Data Fusion  581
6.3 Example 1: Multimodality Data Fusion  584
6.4 Example 2: Data Fusion in Data Preprocessing  584
6.5 Feature Selection Supported By Domain Experts  588
7. Conclusions  589
References  591
Authors' Biographical Statements  595

Chapter 17
LEARNING TO FIND CONTEXT-BASED SPELLING ERRORS, by H. Al-Mubaid and K. Truemper  597
1. Introduction  598
2. Previous Work  600
3. Details of Ltest  601
3.1 Learning Step  602
3.2 Testing Step  605
3.2.1 Testing Regular Cases  605
3.2.2 Testing Special Cases  606
3.2.3 An Example  607
4. Implementation and Computational Results  607
5. Extensions  614
6. Summary  616
References  616
Appendix A: Construction of Substitutions  619
Appendix B: Construction of Training and History Texts  620
Appendix C: Structure of Characteristic Vectors  621
Appendix D: Classification of Characteristic Vectors  624
Authors' Biographical Statements  627

Chapter 18
INDUCTION AND INFERENCE WITH FUZZY RULES FOR TEXTUAL INFORMATION RETRIEVAL, by J. Chen, D.H. Kraft, M.J. Martin-Bautista, and M.-A. Vila  629
1. Introduction  630
2. Preliminaries  632
2.1 The Vector Space Approach To Information Retrieval  632
2.2 Fuzzy Set Theory Basics  634
2.3 Fuzzy Hierarchical Clustering  634
2.4 Fuzzy Clustering by the Fuzzy C-means Algorithm  634
3. Fuzzy Clustering, Fuzzy Rule Discovery and Fuzzy Inference for Textual Retrieval  635
3.1 The Air Force EDC Data Set  636
3.2 Clustering Results  637
3.3 Fuzzy Rule Extraction from Fuzzy Clusters  638
3.4 Application of Fuzzy Inference for Improving Retrieval Performance  639
4. Fuzzy Clustering, Fuzzy Rules and User Profiles for Web Retrieval  640
4.1 Simple User Profile Construction  641
4.2 Application of Simple User Profiles in Web Information Retrieval  642
4.2.1 Retrieving Interesting Web Documents  642
4.2.2 User Profiles for Query Expansion by Fuzzy Inference  643
4.3 Experiments of Using User Profiles  644
4.4 Extended Profiles and Fuzzy Clustering  646
5. Conclusions  646
Acknowledgements  647
References  648
Authors' Biographical Statements  652

Chapter 19
STATISTICAL RULE INDUCTION IN THE PRESENCE OF PRIOR INFORMATION: THE BAYESIAN RECORD LINKAGE PROBLEM, by D.H. Judson  655
1. Introduction  656
2. Why is Record Linkage Challenging?  657
3. The Fellegi-Sunter Model of Record Linkage  658
4. How Estimating Match Weights and Setting Thresholds is Equivalent to Specifying a Decision Rule  660
5. Dealing with Stochastic Data: A Logistic Regression Approach  661
5.1 Estimation of the Model  665
5.2 Finding the Implied Threshold and Interpreting Coefficients  665
6. Dealing with Unlabeled Data in the Logistic Regression Approach  668
7. Brief Description of the Simulated Data  669
8. Brief Description of the CPS/NHIS to Census Record Linkage Project  670
9. Results of the Bayesian Latent Class Method with Simulated Data  672
9.1 Case 1: Uninformative  673
9.2 Case 2: Informative  677
9.3 False Link and Non-Link Rates in the Population of All Possible Pairs  678
10. Results from the Bayesian Latent Class Method with Real Data  679
10.1 Steps in Preparing the Data  679
10.2 Priors and Constraints  681
10.3 Results  682
11. Conclusions and Future Research  690
References  691
Author's Biographical Statement  694

Chapter 20
FUTURE TRENDS IN SOME DATA MINING AREAS, by X. Wang, P. Zhu, G. Felici, and E. Triantaphyllou  695
1. Introduction  696
2. Web Mining  696
2.1 Web Content Mining  697
2.2 Web Usage Mining  698
2.3 Web Structure Mining  698
2.4 Current Obstacles and Future Trends  699
3. Text Mining  700
3.1 Text Mining and Information Access  700
3.2 A Simple Framework of Text Mining  701
3.3 Fields of Text Mining  701
3.4 Current Obstacles and Future Trends  702
4. Visual Data Mining  703
4.1 Data Visualization  704
4.2 Visualizing Data Mining Models  705
4.3 Current Obstacles and Future Trends  705
5. Distributed Data Mining  706
5.1 The Basic Principle of DDM  707
5.2 Grid Computing  707
5.3 Current Obstacles and Future Trends  708
6. Summary  708
References  710
Authors' Biographical Statements  715

Subject Index  717
Author Index  727
Contributor Index  739
About the Editors  747


LIST OF FIGURES

Chapter 1
A COMMON LOGIC APPROACH TO DATA MINING AND PATTERN RECOGNITION, by A. Zakrevskij  1
Figure 1. Using a Karnaugh Map to Find a Decision Boolean Function  3
Figure 2. Illustrating the Screening Effect  19
Figure 3. A Search Tree  25
Figure 4. The Energy Distribution of the Pronunciation of the Russian Word "nool" (meaning "zero")  39

Chapter 2
THE ONE CLAUSE AT A TIME (OCAT) APPROACH TO DATA MINING AND KNOWLEDGE DISCOVERY, by E. Triantaphyllou  45
Figure 1. The One Clause At a Time Approach (for the CNF case)  59
Figure 2. Continuous Data for Illustrative Example and Extracted Sets of Classification Rules  63
Figure 3. The RA1 Heuristic [Deshpande and Triantaphyllou, 1998]  67
Figure 4. The Rejectability Graph for E+ and E-  74
Figure 5. The Rejectability Graph for the Second Illustrative Example  75
Figure 6. The Rejectability Graph for the New Sets E+ and E-  80

Chapter 3
AN INCREMENTAL LEARNING ALGORITHM FOR INFERRING LOGICAL RULES FROM EXAMPLES IN THE FRAMEWORK OF THE COMMON REASONING PROCESS, by X. Naidenova  89
Figure 1. Model of Reasoning: a) Under Pattern Recognition, b) Under Learning  93
Figure 2. The Beginning of the Procedure for Inferring GMRTs  116
Figure 3. The Procedure for Determining the Set of Indices for Extending s  117
Figure 4. The Procedure for Generating All Possible Extensions of s  118
Figure 5. The Procedure for Analyzing the Set of Extensions of s  119
Figure 6. The Main Procedure NIAGaRa for Inferring GMRTs  120
Figure 7. The Algorithm DIAGaRa  130
Figure 8. The Procedure for Generalizing the Existing GMRTs  133
Figure 9. The Procedure for Preparing the Data for Inferring the GMRTs Contained in a New Example  134
Figure 10. The Incremental Procedure INGOMAR  135

Chapter 4
DISCOVERING RULES THAT GOVERN MONOTONE PHENOMENA, by V.I. Torvik and E. Triantaphyllou  149
Figure 1. Hierarchical Decomposition of the Breast Cancer Diagnosis Variables  156
Figure 2. The Poset Formed by {0,1}^n and the Relation ≤  157
Figure 3. The Average Query Complexities for Problem 1  175
Figure 4. The Average Query Complexities for Problem 2  177
Figure 5. Increase in Query Complexities Due to Restricted Access to the Oracles  178
Figure 6. Reduction in Query Complexity Due to the Nestedness Assumption  178
Figure 7. Average Case Behavior of Various Selection Criteria for Problem 3  181
Figure 8. The Restricted and Regular Maximum Likelihood Ratios Simulated with Expected q = 0.2 and n = 3  183


Chapter 5
LEARNING LOGIC FORMULAS AND RELATED ERROR DISTRIBUTIONS, by G. Felici, F. Sun, and K. Truemper  193
Figure 1. Distributions for Z = Z_A and Z = Z_B and related Z  217
Figure 2. Estimated and verified FA and GB for Breast Cancer  218
Figure 3. Estimated and verified FA and GB for Australian Credit Card  219
Figure 4. Estimated and verified FA and GB for Congressional Voting  220
Figure 5. Estimated and verified FA and GB for Diabetes  220
Figure 6. Estimated and verified FA and GB for Heart Disease  221
Figure 7. Estimated and verified FA and GB for Boston Housing  221

Chapter 6
FEATURE SELECTION FOR DATA MINING, by V. de Angelis, G. Felici, and G. Mancinelli  227
Figure 1. Wrappers and Filters  231

Chapter 7
TRANSFORMATION OF RATIONAL AND SET DATA TO LOGIC DATA, by S. Bartnikowski, M. Granberry, J. Mugan, and K. Truemper  253

Chapter 8
DATA FARMING: CONCEPTS AND METHODS, by A. Kusiak  279
Figure 1. A Data Set with Five Features  284
Figure 2. Rule Set Obtained from the Data Set in Figure 1  284
Figure 3. Modified Data Set with Five Features  285
Figure 4. Two Rules Generated from the Data Set of Figure 3  285
Figure 5 (Part 1). Cross-validation Results: (a) Confusion Matrix for the Data Set in Figure 1, (b) Confusion Matrix for the Modified Data Set of Figure 3  286
Figure 5 (Part 2). Cross-validation Results: (c) Classification Accuracy for the Data Set of Figure 1, (d) Classification Accuracy for the Data Set in Figure 3  287
Figure 6. A Data Set with Four Features  288
Figure 7. Transformed Data Set of Figure 6  288
Figure 8. Cross Validation Results: (a) Average Classification Accuracy for the Data Set in Figure 6, (b) Average Classification Accuracy for the Transformed Data Set of Figure 7  289
Figure 9. Data Set and the Corresponding Statistical Distributions  290
Figure 10. Rule-Feature Matrix with Eight Rules  291
Figure 11. Structured Rule-Feature Matrix  291
Figure 12. Visual Representation of a Cluster of Two Rules  294
Figure 13. A Data Set with Five Features  296
Figure 14. Rules from the Data Set of Figure 13  296
Figure 15. Rules Extracted from the Transformed Data Set of Figure 13  296
Figure 16. Cross-validation Results: (a) Average Classification Accuracy for the Modified Data Set in Figure 13, (b) Average Classification Accuracy of the Data Set with Modified Outcome  297
Figure 17. Average Classification Accuracy for the 599-Object Data Set  300
Figure 18. Average Classification Accuracy for the 525-Object Data Set  301
Figure 19. Average Classification Accuracy for the 525-Object Data Set with the Feature Sequence  301

Chapter 9
RULE INDUCTION THROUGH DISCRETE SUPPORT VECTOR DECISION TREES, by C. Orsenigo and C. Vercellis  305
Figure 1. Margin Maximization for Linearly non Separable Sets  310
Figure 2. Axis Parallel Versus Oblique Splits  316

Chapter 10
MULTI-ATTRIBUTE DECISION TREES AND DECISION RULES, by J.-Y. Lee and S. Olafsson  327
Figure 1. The SODI Decision Tree Construction Algorithm  341
Figure 2. The SODI Rules for Pre-Pruning  343
Figure 3. Decision Trees Built by (a) ID3, and (b) SODI  345
Figure 4. Improvement of Accuracy Over ID3 for SODI, C4.5, and PART  348
Figure 5. Reduction in the Number of Decision Rules over ID3 for SODI, C4.5, and PART  349

Chapter 11
KNOWLEDGE ACQUISITION AND UNCERTAINTY IN FAULT DIAGNOSIS: A ROUGH SETS PERSPECTIVE, by L.-Y. Zhai, L.-P. Khoo, and S.-C. Fok  359
Figure 1. Knowledge Acquisition Techniques  362
Figure 2. Machine Learning Taxonomy  363
Figure 3. Processes for Knowledge Extraction  364
Figure 4. Basic Notions of Rough Set Theory for Illustrative Example  381
Figure 5. Framework of the RClass System  383

Chapter 12
DISCOVERING KNOWLEDGE NUGGETS WITH A GENETIC ALGORITHM, by E. Noda and A.A. Freitas  395
Figure 1. Pseudocode for a Genetic Algorithm at a High Level of Abstraction  400
Figure 2. An Example of Uniform Crossover in Genetic Algorithms  402
Figure 3. The Basic Idea of a Greedy Rule Induction Procedure  402
Figure 4. Attribute Interaction in a XOR (eXclusive OR) Function  403
Figure 5. Individual Representation  406
Figure 6. Examples of Condition Insertion/Removal Operations  411

Chapter 13
DIVERSITY MECHANISMS IN PITT-STYLE EVOLUTIONARY CLASSIFIER SYSTEMS, by M. Kirley, H.A. Abbass and R.I. McKay  433
Figure 1. Outline of a Simple Genetic Algorithm  437
Figure 2. The Island Model  446

Chapter 14
FUZZY LOGIC IN DISCOVERING ASSOCIATION RULES: AN OVERVIEW, by G. Chen, Q. Wei and E.E. Kerre  459
Figure 1. Fuzzy Sets Young(Y), Middle(M) and Old(O) with Y(20, 65), M(25, 32, 53, 60), O(20, 65)  468
Figure 2. Exact Taxonomies and Fuzzy Taxonomies  470
Figure 3. Part of a Linguistically Modified Fuzzy Taxonomic Structure  473
Figure 4. Static Matching Schemes  485

Chapter 15
MINING HUMAN INTERPRETABLE KNOWLEDGE WITH FUZZY MODELING METHODS: AN OVERVIEW, by T.W. Liao  495

Chapter 16
DATA MINING FROM MULTIMEDIA PATIENT RECORDS, by A.S. Elmaghraby, M.M. Kantardzic, and M.P. Wachowiak  551
Figure 1. Phases of the Data Mining Process  555
Figure 2. Multimedia Components of the Patient Record  558
Figure 3. Phases in Labels and Noise Elimination for Digitized Mammography Images  567
Figure 4. The Difference Between PCA and ICA Transforms  570
Figure 5. Three EMG/ECG Mixtures (left) Separated into EMG and ECG Signals by ICA (right). Cardiac Artifacts in the EMG are Circled in Gray (upper left)  572
Figure 6. Sample of an Image of Size 5 x 5  577
Figure 7. Feature Extraction for the Image in Figure 6 by Using the Association Rules Method  578
Figure 8. Shoulder Scan  585
Figure 9. Parameter Maps: (a) INV (Nakagami Distribution); (b) TP (Nakagami Distribution); (c) SNR Values (K Distribution); (d) Fractional SNR (K Distribution)  587

Chapter 17
LEARNING TO FIND CONTEXT-BASED SPELLING ERRORS, by H. Al-Mubaid and K. Truemper  597
