Table 40.6. Benefits (US $) using Single Classifiers and Classifier Ensembles (Original Stream).

Chunk   G0       G1=E1    G2       E2       G4       E4       G8       E8
12000   201717   203211   197946   253473   211768   269290   215692   289129
6000    103763   98777    101176   121057   102447   138565   106576   143620
4000    69447    65024    68081    80996    69346    90815    70325    96153
3000    43312    41212    42917    59293    44977    67222    46139    71660
Cost-sensitive Learning
For cost-sensitive applications, we aim to maximize benefits. In Figure 40.7(a), we compare the single classifier approach with the ensemble approach on the credit card transaction stream. The benefits are averaged over multiple runs with different chunk sizes (ranging from 3000 to 12000 transactions per chunk). Starting from K = 2, the advantage of the ensemble approach becomes obvious.
In Figure 40.7(b), we average the benefits of E_K and G_K (K = 2, ..., 8) for each fixed chunk size. The benefits increase with the chunk size, since more fraudulent transactions are discovered in larger chunks. Again, the ensemble approach outperforms the single classifier approach.
To study the impact of concept drifts of different magnitudes, we derive data streams from the credit card transactions. The simulated stream is obtained by sorting the original 5 million transactions by their transaction amount. We perform the same test on the simulated stream, and the results are shown in Figures 40.7(c) and 40.7(d). Detailed results of the above tests are given in Tables 40.5 and 40.6.
40.5 Discussion and Related Work
Data stream processing has recently become a very important research domain. Much work has been done on modeling (Babcock et al., 2002), querying (Babu and Widom, 2001, Gao and Wang, 2002, Greenwald and Khanna, 2001), and mining data streams; for instance, several papers have been published on classification (Domingos and Hulten, 2000, Hulten et al., 2001, Street and Kim, 2001), regression analysis (Chen et al., 2002), and clustering (Guha et al., 2000).
Traditional Data Mining algorithms are challenged by two characteristic features of data streams: the infinite data flow and the drifting concepts. Because methods that require multiple scans of the dataset (Shafer et al., 1996) cannot handle infinite data flows, several incremental algorithms (Gehrke et al., 1999, Domingos and Hulten, 2000) that refine models by continuously incorporating new data from the stream have been proposed. To handle drifting concepts, these methods are further revised so that the effect of old examples is eliminated at a certain rate. For an incremental decision tree classifier, this means we have to discard or re-grow subtrees, or build alternative subtrees under a node (Hulten et al., 2001). The resulting algorithms are often complicated, which indicates that substantial effort is required to adapt state-of-the-art learning methods to the infinite, concept-drifting streaming environment. Aside from this undesirable aspect, incremental methods are also hindered by their prediction accuracy. Since old examples are discarded at a fixed rate (whether or not they represent the changed concept), the learned model is supported only by the current snapshot, a relatively small amount of data. This usually results in larger prediction variance.
Classifier ensembles are increasingly gaining acceptance in the data mining community. Popular approaches to creating ensembles include changing the instances used for training through techniques such as Bagging (Bauer and Kohavi, 1999) and Boosting (Freund and Schapire, 1996). Classifier ensembles have several advantages over single-model classifiers. First, they offer a significant improvement in prediction accuracy (Freund and Schapire, 1996, Tumer and Ghosh, 1996). Second, building a classifier ensemble is more efficient than building a single model, since most model construction algorithms have super-linear complexity. Third, classifier ensembles lend themselves naturally to scalable parallelization (Hall et al., 2000) and to on-line classification of large databases. Previously, we used averaging ensembles for scalable learning over very large datasets (Fan, Wang, Yu, and Stolfo, 2002). We showed that a model's performance can be estimated before it is completely learned (Fan, Wang, Yu, and Lo, 2002, Fan, Wang, Yu, and Lo, 2003). In this work, we use weighted ensemble classifiers on concept-drifting data streams. The approach combines multiple classifiers weighted by their expected prediction accuracy on the current test data. Compared with incremental models trained on data in the most recent window, our approach combines the talents of a set of experts based on their credibility and adjusts much more gracefully to the underlying concept drift. Also, we introduced the dynamic classification technique (Fan, Chu, Wang, and Yu, 2002) to the concept-drifting streaming environment, and our results show that it enables us to dynamically select a subset of classifiers in the ensemble for prediction without loss in accuracy.
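To make the chunk-based, weighted-ensemble idea concrete, the sketch below trains one classifier per chunk, weights every member by its accuracy on the newest chunk, and predicts by weighted voting. It is a simplified illustration only (scikit-learn-style estimators assumed, class and parameter names hypothetical); the chapter's method weights classifiers by expected accuracy or benefits estimated on the current data.

```python
# Sketch of a chunk-based weighted ensemble for a drifting stream.
# Assumes scikit-learn-style estimators; names are illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class WeightedStreamEnsemble:
    def __init__(self, capacity=8):
        self.capacity = capacity          # K: number of classifiers kept
        self.members = []                 # list of (classifier, weight)

    def update(self, X_chunk, y_chunk):
        """Train a classifier on the newest chunk and re-weight all members
        by their accuracy on that chunk (a proxy for the current concept)."""
        clf = DecisionTreeClassifier().fit(X_chunk, y_chunk)
        self.members.append((clf, 1.0))
        reweighted = []
        for member, _ in self.members:
            acc = float(np.mean(member.predict(X_chunk) == np.asarray(y_chunk)))
            reweighted.append((member, acc))
        # keep only the top-K classifiers by weight
        reweighted.sort(key=lambda mw: mw[1], reverse=True)
        self.members = reweighted[: self.capacity]

    def predict(self, X):
        """Weighted vote over the ensemble (binary labels 0/1 assumed)."""
        total = sum(w for _, w in self.members) or 1.0
        votes = sum(w * clf.predict(X) for clf, w in self.members) / total
        return (votes >= 0.5).astype(int)
```

In use, `update` is called once per arriving chunk and `predict` on the incoming test records; classifiers trained on outdated concepts receive small weights and are eventually dropped.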
Acknowledgements
We thank Wei Fan of IBM T. J. Watson Research Center for providing us with a
revised version of the C4.5 decision tree classifier and running some experiments.
References
Babcock B., Babu S., Datar M., Motwani R., and Widom J., Models and issues in data
stream systems, In ACM Symposium on Principles of Database Systems (PODS), 2002.
Babu S. and Widom J., Continuous queries over data streams. SIGMOD Record, 30:109–
120, 2001.
Bauer, E. and Kohavi, R., An empirical comparison of voting classification algorithms: Bag-
ging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999.
Chen Y., Dong G., Han J., Wah B. W., and Wang B. W., Multi-dimensional regression anal-
ysis of time-series data streams. In Proc. of Very Large Database (VLDB), Hong Kong,
China, 2002.
Cohen W., Fast effective rule induction. In Int’l Conf. on Machine Learning (ICML), pages
115–123, 1995.
Domingos P., and Hulten G., Mining high-speed data streams. In Int’l Conf. on Knowledge
Discovery and Data Mining (SIGKDD), pages 71–80, Boston, MA, 2000. ACM Press.
Fan W., Wang H., Yu P., and Lo S., Progressive modeling. In Int'l Conf. Data Mining
(ICDM), 2002.
Fan W., Wang H., Yu P., and Lo S., Inductive learning in less than one sequential scan, In
Int'l Joint Conf. on Artificial Intelligence, 2003.
Fan W., Wang H., Yu P., and Stolfo S., A framework for scalable cost-sensitive learning based
on combining probabilities and benefits. In SIAM Int’l Conf. on Data Mining (SDM),
2002.
Fan W., Chu F., Wang H., and Yu P. S., Pruning and dynamic scheduling of cost-sensitive
ensembles, In Proceedings of the 18th National Conference on Artificial Intelligence
(AAAI), 2002.
Freund Y., and Schapire R. E., Experiments with a new boosting algorithm, In Int’l Conf. on
Machine Learning (ICML), pages 148–156, 1996.
Gao L. and Wang X., Continually evaluating similarity-based pattern queries on a streaming
time series, In Int’l Conf. Management of Data (SIGMOD), Madison, Wisconsin, June
2002.
Gehrke J., Ganti V., Ramakrishnan R., and Loh W., BOAT– optimistic decision tree con-
struction, In Int’l Conf. Management of Data (SIGMOD), 1999.
Greenwald M., and Khanna S., Space-efficient online computation of quantile summaries,
In Int’l Conf. Management of Data (SIGMOD), pages 58–66, Santa Barbara, CA, May
2001.
Guha S., Mishra N., Motwani R., and O’Callaghan L., Clustering data streams, In IEEE
Symposium on Foundations of Computer Science (FOCS), pages 359–366, 2000.
Hall L., Bowyer K., Kegelmeyer W., Moore T., and Chao C., Distributed learning on very
large data sets, In Workshop on Distributed and Parallel Knowledge Discovery, 2000.
Hulten G., Spencer L., and Domingos P., Mining time-changing data streams, In Int’l Conf.
on Knowledge Discovery and Data Mining (SIGKDD), pages 97–106, San Francisco,
CA, 2001. ACM Press.
Quinlan J. R., C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Rokach L., Mining manufacturing data using genetic algorithm-based feature set decompo-
sition, Int. J. Intelligent Systems Technologies and Applications, 4(1):57-78, 2008.
Shafer C., Agrawal R., and Mehta M., SPRINT: A scalable parallel classifier for Data Mining,
In Proc. of Very Large Database (VLDB), 1996.
Stolfo S., Fan W., Lee W., Prodromidis A., and Chan P., Credit card fraud detection using
meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and
Risk Management, 1997.
Street W. N. and Kim Y. S., A streaming ensemble algorithm (SEA) for large-scale classifi-
cation. In Int’l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2001.
Tumer K. and Ghosh J., Error correlation and error reduction in ensemble classifiers, Con-
nection Science, 8(3-4):385–403, 1996.

Utgoff, P. E., Incremental induction of decision trees, Machine Learning, 4:161–186, 1989.
Wang H., Fan W., Yu P. S., and Han J., Mining concept-drifting data streams using ensemble
classifiers, In Int’l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2003.
41
Mining High-Dimensional Data
Wei Wang (1) and Jiong Yang (2)
(1) Department of Computer Science, University of North Carolina at Chapel Hill
(2) Department of Electronic Engineering and Computer Science, Case Western Reserve University
Summary. With the rapid growth of computational biology and e-commerce applications, high-dimensional data has become very common. Mining high-dimensional data is therefore an urgent problem of great practical importance. However, mining data of high dimensionality poses some unique challenges, including (1) the curse of dimensionality and, more crucially, (2) the meaningfulness of the similarity measure in a high-dimensional space. In this chapter, we present several state-of-the-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We will discuss how these methods deal with the challenges of high dimensionality.
Key words: High-dimensional Data Mining, frequent pattern, clustering high-
dimensional data, classifying high-dimensional data
41.1 Introduction
The emergence of various new application domains, such as bioinformatics and e-commerce, underscores the need for analyzing high-dimensional data. In a gene expression microarray data set, there may be tens or hundreds of dimensions, each of which corresponds to an experimental condition. In a customer purchase behavior data set, there may be up to hundreds of thousands of merchandise items, each of which is mapped to a dimension. Researchers and practitioners are very eager to analyze these data sets.
Various Data Mining models have proven very successful for analyzing very large data sets. Among them, frequent patterns, clusters, and classifiers are three widely studied models to represent, analyze, and summarize large data sets. In this chapter, we focus on state-of-the-art techniques for constructing these three Data Mining models on massive high-dimensional data sets.
41.2 Challenges
Before presenting algorithms for building individual Data Mining models, we first discuss two common challenges in analyzing high-dimensional data. The first is the curse of dimensionality. The complexity of many existing Data Mining algorithms is exponential with respect to the number of dimensions. With increasing dimensionality, these algorithms soon become computationally intractable and therefore inapplicable in many real applications.
Secondly, the specificity of similarities between points in a high-dimensional space diminishes. It was proven in (Beyer et al., 1999) that, for any point in a high-dimensional space, the expected gap between the Euclidean distance to its closest neighbor and that to its farthest point shrinks as the dimensionality grows. This phenomenon may render many Data Mining tasks (e.g., clustering) ineffective and fragile, because the model becomes vulnerable to the presence of noise. In the remainder of this chapter, we present several state-of-the-art algorithms for mining high-dimensional data sets.
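A quick way to see this concentration effect empirically (an illustration added here, not part of the original experiments) is to sample random points and compare the nearest and farthest Euclidean distances from a query point as the dimensionality grows; their ratio approaches 1.

```python
# Illustrative sketch: distance concentration in high dimensions.
# The data is random; only the trend of the ratio matters.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))            # 1000 random points in [0, 1]^d
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d={d:4d}  nearest/farthest = {dists.min() / dists.max():.3f}")
# The ratio climbs toward 1 as d grows, i.e., the nearest and farthest
# neighbors become almost equally far away.
```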
41.3 Frequent Pattern
Frequent patterns are a useful model for extracting salient features of the data. They were originally proposed for analyzing market basket data (Agrawal and Srikant, 1994). A market basket data set is typically represented as a set of transactions. Each transaction contains a set of items from a finite vocabulary. In principle, we can represent the data as a matrix in which each row represents a transaction and each column represents an item. The goal is to find the collection of itemsets appearing in a large number of transactions, as defined by a support threshold t. Most algorithms for mining frequent patterns utilize the Apriori property, stated as follows: if an itemset A is frequent (i.e., present in more than t transactions), then every subset of A must be frequent. Conversely, if an itemset A is infrequent (i.e., present in fewer than t transactions), then any superset of A is also infrequent. This property is the basis of all level-wise search algorithms.
The general procedure consists of a series of iterations beginning with counting item
occurrences and identifying the set of frequent items (or equivalently, frequent 1-
itemsets). During each subsequent iteration, candidates for frequent k-itemsets are
proposed from frequent (k-1)-itemsets using the Apriori property. These candidates
are then validated by explicitly counting their actual occurrences. The value of k is
incremented before the next iteration starts. The process terminates when no more
frequent itemsets can be generated. We often refer to this level-wise approach as the
breadth-first approach because it evaluates the itemsets residing at the same depth
in the lattice formed by imposing the partial order of the subset-superset relationship
between itemsets.
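As an illustration of this level-wise procedure (a minimal sketch, not one of the optimized algorithms discussed in this chapter), the code below generates candidate k-itemsets from frequent (k-1)-itemsets, prunes them with the Apriori property, and validates the survivors by counting:

```python
# Minimal Apriori-style level-wise mining over a list of transactions.
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    # frequent 1-itemsets
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= min_support}
    all_frequent, k = set(freq), 2
    while freq:
        # candidate k-itemsets: unions of frequent (k-1)-itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # validate candidates by counting their actual occurrences
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= min_support}
        all_frequent |= freq
        k += 1
    return all_frequent

# Example: itemsets appearing in at least 2 of 4 transactions
print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c", "d"}], 2))
```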
It is a well-known problem that the full set of frequent patterns contains significant redundant information, and consequently the number of frequent patterns is often too large. To address this issue, Pasquier et al. (1999) proposed to mine a selective subset of frequent patterns, called closed frequent patterns. A pattern is considered closed if none of its immediate superpatterns has the same number of occurrences. The CLOSET algorithm (Pei et al., 2000) was proposed to expedite the mining of closed frequent patterns. CLOSET uses a novel frequent pattern tree (FP-tree) structure as a compact representation to organize the data set. It performs a depth-first search, that is, after discovering a frequent itemset A, it searches for superpatterns of A before checking A's siblings.
A more recent algorithm for mining frequent closed patterns is CHARM (Zaki and Hsiao, 2002). Similar to CLOSET, CHARM searches for patterns in a depth-first manner. The difference between CHARM and CLOSET is that CHARM stores the data set in a vertical format, where a list of row IDs is maintained for each dimension. These row ID lists are then merged during a "column enumeration" procedure that generates row ID lists for other nodes in the enumeration tree. In addition, a technique called diffset is used to reduce the length of the row ID lists as well as the computational complexity of merging them.
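With a vertical representation, support counting reduces to intersecting row ID lists. The sketch below shows only that basic idea (it is not CHARM itself and omits the diffset optimization):

```python
# Illustrative vertical (item -> row IDs) representation and support counting.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c", "d"}]

# Build the vertical format: one row-ID set per item.
vertical = {}
for rid, t in enumerate(transactions):
    for item in t:
        vertical.setdefault(item, set()).add(rid)

# Support of an itemset = size of the intersection of its items' row-ID sets.
def support(itemset):
    rows = set.intersection(*(vertical[i] for i in itemset))
    return len(rows)

print(support({"a", "b"}))   # 2: transactions 0 and 1
print(support({"b", "c"}))   # 2: transactions 0 and 3
```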
All of the previous algorithms can find frequent closed patterns when the dimensionality is low to moderate. When the number of dimensions is very high, e.g., greater than 100, the efficiency of these algorithms can be significantly impacted. CARPENTER (Pan et al., 2003) was therefore proposed to solve this problem. It first transposes the matrix representing the data set. Next, CARPENTER performs a depth-first row-wise enumeration on the transposed matrix. It has been shown that this algorithm can greatly reduce the computation time, especially when the dimensionality is high.
41.4 Clustering
Clustering is a widely adopted Data Mining model that partitions data points into a set of groups, each of which is called a cluster. A data point has a shorter distance to points within its cluster than to points outside the cluster. In a high-dimensional space, however, for any point, its distance to its closest point and its distance to the farthest point tend to be similar. This phenomenon may render the clustering result sensitive to any small perturbation of the data due to noise and make the exercise of clustering useless. To solve this problem, Agrawal et al. proposed the subspace clustering model (Agrawal et al., 1998). A subspace cluster consists of a subset of objects and a subset of dimensions such that the distance among these objects is small within the given set of dimensions. The CLIQUE algorithm (Agrawal et al., 1998) was proposed to find subspace clusters.
In many applications, users are more interested in objects that exhibit a consistent trend (rather than points having similar values) within a subset of dimensions. One such example is the bicluster model (Cheng and Church, 2000), proposed for analyzing gene expression profiles. A bicluster is a subset of objects (U) and a subset of dimensions (D) such that the objects in U have the same trend (i.e., fluctuate simultaneously) across the dimensions in D. This is particularly useful in analyzing gene expression levels in a microarray experiment, since the expression levels of some genes may be inflated or deflated systematically in some experiments. Thus, the absolute value is not as important as the trend. If two genes have similar trends across a large set of experiments, they are likely to be co-regulated. In the bicluster model, the mean squared error residue is used to qualify a bicluster. Cheng and Church (2000) used a heuristic randomized algorithm to find biclusters. It consists of a series of iterations, each of which locates one bicluster. To prevent the same bicluster from being reported again in subsequent iterations, each time a bicluster is found, the values in the bicluster are replaced by uniform noise before the next iteration starts. This procedure continues until a desired number of biclusters are discovered.
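For reference, the mean squared residue of a submatrix, the quality measure used above, can be computed as follows. This is a minimal sketch assuming the candidate bicluster is given as a NumPy array:

```python
# Mean squared residue H(U, D) of a bicluster, following Cheng and Church:
# residue(i, j) = a_ij - rowmean_i - colmean_j + overall_mean.
import numpy as np

def mean_squared_residue(sub):
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    residue = sub - row_means - col_means + sub.mean()
    return float((residue ** 2).mean())

# A submatrix whose rows differ only by additive shifts has residue 0
# (a perfect bicluster under this measure).
perfect = np.array([[1.0, 3.0, 5.0],
                    [2.0, 4.0, 6.0]])
print(mean_squared_residue(perfect))   # 0.0
```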
Although the bicluster model and algorithm have been used in several applications in bioinformatics, the approach has two major drawbacks: (1) the mean squared error residue may not be the best measure to qualify a bicluster. A big cluster may have a small mean squared error residue even if it includes a small number of objects whose trends are vastly different in the selected dimensions; (2) the heuristic algorithm may be interfered with by the noise artificially injected after each iteration and hence may not discover overlapping clusters properly. To solve these two problems, the authors of (Wang et al., 2002) proposed the p-cluster model. A p-cluster consists of a subset of objects U and a subset of dimensions D where, for each pair of objects u1 and u2 in U and each pair of dimensions d1 and d2 in D, the change of u1 from d1 to d2 should be similar to the change of u2 from d1 to d2. A threshold is used to evaluate the dissimilarity between two objects on two dimensions. Given a subset of objects and a subset of dimensions, if the dissimilarity between every pair of objects on every pair of dimensions is less than the threshold, then these objects constitute a p-cluster in the given dimensions. A novel deterministic algorithm is developed in (Wang et al., 2002) to find all maximal p-clusters, which utilizes the Apriori property that holds on p-clusters.
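To make the pairwise condition concrete, a p-cluster membership check can be sketched as below. This is an illustration only (the data is assumed to be an object-by-dimension NumPy array), not the mining algorithm of (Wang et al., 2002):

```python
# Sketch of the pairwise p-cluster condition: for every pair of objects and
# every pair of dimensions, the difference of their changes must stay below delta.
import numpy as np
from itertools import combinations

def is_pcluster(data, objects, dims, delta):
    """data: 2-D array (rows = objects, columns = dimensions)."""
    for u1, u2 in combinations(objects, 2):
        for d1, d2 in combinations(dims, 2):
            change_u1 = data[u1, d1] - data[u1, d2]
            change_u2 = data[u2, d1] - data[u2, d2]
            if abs(change_u1 - change_u2) > delta:   # dissimilarity on the 2x2 block
                return False
    return True

data = np.array([[1.0, 3.0, 5.0],
                 [2.0, 4.0, 6.0],
                 [9.0, 1.0, 7.0]])
print(is_pcluster(data, [0, 1], [0, 1, 2], delta=0.5))   # True: parallel trends
print(is_pcluster(data, [0, 2], [0, 1, 2], delta=0.5))   # False: divergent trends
```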
41.5 Classification
Classification is also a very powerful data analysis tool. In a classification problem, the dimensions of an object can be divided into two types: one dimension records the class type of the object, and the remaining dimensions are attributes. The goal of classification is to build a model that captures the intrinsic associations between the class type and the attributes, so that the (unknown) class type can be accurately predicted from the attribute values. For this purpose, the data is usually divided into a training set and a test set, where the training set is used to build the classifier, which is then validated on the test set. Several models have been developed for classifying high-dimensional data, e.g., naïve Bayesian classifiers, neural networks, decision trees (Mitchell, 1997), SVMs, rule-based classifiers, and so on.
The support vector machine (SVM) (Vapnik, 1998) is one of the more recently developed classification models. The success of SVMs in practice rests on a solid mathematical foundation that conveys the following two salient properties. (1) The classification boundary functions of SVMs maximize the margin, which equivalently optimizes the generalization performance given a training data set. (2) SVMs handle nonlinear classification efficiently using the kernel trick, which implicitly transforms the input space into another, higher-dimensional feature space. However, SVMs suffer from two problems. First, the complexity of training an SVM is at least O(N^2), where N is the number of objects in the training data set. This can be too costly when the training data set is large. Second, since an SVM essentially draws a hyperplane in a transformed high-dimensional space, it is very difficult to identify the principal (original) dimensions that are most responsible for the classification.
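As a usage illustration (assuming scikit-learn is available; this example is not part of the original text), a kernel SVM can be trained in a few lines, with the kernel choice controlling the implicit feature-space transformation:

```python
# Small kernel-SVM example; the RBF kernel implicitly maps inputs into a
# higher-dimensional feature space (the "kernel trick").
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```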
Rule-based classifiers (Liu et al., 2000) offer some potential to address the above two problems. A rule-based classifier consists of a set of rules of the following form: A1[l1, u1] ∩ A2[l2, u2] ∩ ... ∩ Am[lm, um] → C, where Ai[li, ui] is the range of attribute Ai's value and C is the class type. The rule can be interpreted as follows: if an object's attribute values fall in the ranges on the left-hand side, then its class type is likely to be C (with some high probability). Each rule is also associated with a confidence level that depicts the probability that the rule holds. When an object satisfies several rules, either the rule with the highest confidence (e.g., CBA (Liu et al., 2000)) or a weighted vote over all valid rules (e.g., CPAR (Yin and Han, 2003)) may be used for class prediction. However, neither CBA nor CPAR is targeted at high-dimensional data. An algorithm called FARMER (Cong et al., 2004) was proposed to generate rule-based classifiers for high-dimensional data sets. It first quantizes the attributes into a set of bins; each bin is subsequently treated as an item. FARMER then generates the closed frequent itemsets using a method similar to CARPENTER. These closed frequent itemsets are the basis for generating rules. Since the dimensionality is high, the number of possible rules in the classifier could be very large. FARMER therefore organizes all rules into compact rule groups.
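A minimal sketch of this kind of rule-based prediction is given below. It is illustrative only (not CBA, CPAR, or FARMER): each rule holds attribute ranges, a class label, and a confidence, and prediction uses the highest-confidence matching rule.

```python
# Illustrative rule-based prediction: pick the highest-confidence matching rule.
from dataclasses import dataclass

@dataclass
class Rule:
    ranges: dict        # attribute name -> (low, high) interval A_i[l_i, u_i]
    label: str          # predicted class C
    confidence: float   # estimated probability that the rule holds

    def matches(self, obj):
        return all(lo <= obj[a] <= hi for a, (lo, hi) in self.ranges.items())

def predict(rules, obj, default="unknown"):
    matching = [r for r in rules if r.matches(obj)]
    if not matching:
        return default
    return max(matching, key=lambda r: r.confidence).label

rules = [
    Rule({"age": (0, 30), "income": (0, 40)}, label="low_risk", confidence=0.7),
    Rule({"age": (25, 60)}, label="high_risk", confidence=0.9),
]
print(predict(rules, {"age": 28, "income": 35}))   # "high_risk" (higher confidence)
```

A weighted vote, as in CPAR, would instead sum the confidences of all matching rules per class and return the class with the largest total.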
References

Agrawal R., Gehrke J., Gunopulos D., Raghavan P.: "Automatic Subspace Clustering of High
Dimensional Data for Data Mining Applications", Proc. ACM SIGMOD Int. Conf. on
Management of Data, Seattle, WA, 1998, pp. 94-105.
Agrawal R., and Srikant R., Fast Algorithms for Mining Association Rules in Large
Databases. In Proc. of the 20th VLDB Conf., pages 487-499, 1994.
Beyer K.S., Goldstein J., Ramakrishnan R. and Shaft U.: "When Is 'Nearest Neigh-
bor' Meaningful?", Proceedings 7th International Conference on Database Theory
(ICDT'99), pp. 217-235, Jerusalem, Israel, 1999.
Cheng Y., and Church, G., Biclustering of expression data. In Proceedings of the Eighth
International Conference on Intelligent Systems for Molecular Biology, pp. 93-103. San
Diego, CA, August 2000.
Cong G., Tung Anthony K. H., Xu X., Pan F., and Yang J., Farmer: Finding interesting rule
groups in microarray datasets. In the 23rd ACM SIGMOD International Conference on
Management of Data, 2004.
Liu B., Ma Y., Wong C. K., Improving an Association Rule Based Classifier, Proceedings of
the 4th European Conference on Principles of Data Mining and Knowledge Discovery,
p.504-509, September 13-16, 2000.
Mitchell T., Machine Learning. WCB McGraw Hill, 1997.
Pan F., Cong G., Tung A. K. H., Yang J., and Zaki M. J., CARPENTER: finding closed
patterns in long biological data sets. Proceedings of ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, 2003.
Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for as-
sociation rules. In Beeri, C., Buneman, P., eds., Proc. of the 7th Int’l Conf. on Database
Theory (ICDT’99), Jerusalem, Israel, Volume 1540 of Lecture Notes in Computer Sci-
ence., pp. 398-416, Springer-Verlag, January 1999.
Pei, J., Han, J., and Mao, R., CLOSET: an efficient Algorithm for mining frequent closed
itemsets. In D. Gunopulos and R. Rastogi, eds., ACM SIGMOD Workshop on Research
Issues in Data Mining and Knowledge Discovery, pp 21-30, 2000.
Vapnik, V.N., Statistical Learning Theory. John Wiley and Sons, 1998.

Wang H., Wang W., Yang J. and Yu P., Clustering by pattern similarity in large data sets.
Proceedings of the ACM SIGMOD International Conference on Management of Data
(SIGMOD), pp. 394-405, 2002.
Yin X., Han J., CPAR: classification based on predictive association rules. Proceedings of
SIAM International Conference on Data Mining, San Francisco, CA, pp. 331-335, 2003.
Zaki M. J. and Hsiao C., CHARM: An efficient algorithm for closed itemset mining. In
Proceedings of the Second SIAM International Conference on Data Mining, Arlington,
VA, 2002. SIAM
42
Text Mining and Information Extraction
Moty Ben-Dov (1) and Ronen Feldman (2)
(1) MDX University, London
(2) Hebrew University, Israel
Summary. Text mining is the automatic discovery of new, previously unknown information through the automatic analysis of various textual resources. Text mining starts by extracting facts and events from textual sources and then enables forming new hypotheses that are further explored by traditional Data Mining and data analysis methods. In this chapter we will define text mining and describe the three main approaches for performing information extraction. In addition, we will describe how we can visually display and analyze the outcome of the information extraction process.
Key words: text mining, content mining, structure mining, text classification, information extraction, rule-based systems
42.1 Introduction
The information age has made it easy for us to store large amounts of text. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of information available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes; so-called "push" technology makes the problem even worse by constantly reminding us that we are failing to track news, events, and trends everywhere. We experience information overload, and miss important patterns and relationships even as they unfold before us. As the old adage goes, "we can't see the forest for the trees."
Text mining (TM), also known as knowledge discovery from text (KDT), refers to the process of extracting interesting patterns from very large text databases for the purpose of discovering knowledge. Text mining applies the same analytical functions as data mining, but also applies analytic functions from natural language (NL) processing and information retrieval (IR) techniques (Dorre et al., 1999).
Text-mining tools are used for:
