
1240 Oded Maimon and Abel Browarnik
Taxonomies and ontologies. NHECD uses manually prepared taxonomies at several stages.
Using an ontology of the Nanotox domain could arguably enhance the quality of information
extraction (whether textual, graphic, or tabular), but no Nanotox ontology exists yet.
Research towards ontology learning could use NHECD results. In turn, the learned ontology
could improve information extraction, implementing a kind of bootstrapping process. Data
mining on the second NHECD product can strongly influence the ontology learning process,
so that the ontology can be further enhanced.
64 NHECD - Nano Health and Environmental Commented Database 1241

Part VIII
Software

65

Commercial Data Mining Software
Qingyu Zhang and Richard S. Segall
Arkansas State University, Department of Computer and Info. Tech., Jonesboro, AR
72467-0130, USA.
Summary. This chapter discusses selected commercial software for data mining, supercomputing
data mining, text mining, and web mining. The selected packages are compared by their
features and are also applied to available data sets. The software for data mining comprises
SAS Enterprise Miner, Megaputer PolyAnalyst 5.0, PASW (formerly SPSS Clementine), IBM
Intelligent Miner, and BioDiscovery GeneSight. The software for supercomputing comprises
Avizo by Visualization Science Group and JMP Genomics from SAS Institute. The software for
text mining comprises SAS Text Miner and Megaputer PolyAnalyst 5.0. The software for web
mining comprises Megaputer PolyAnalyst and SPSS Clementine. Background on related
literature and software is presented. Screen shots of each selected package are presented,
as are conclusions and future directions.
65.1 Introduction
In the data mining community, there are three basic types of mining: data mining, web min-
ing, and text mining (Zhang and Segall, 2008). In addition, there is a special category called
supercomputing data mining, which is today used for high performance data mining and data
intensive computing of large and distributed data sets. Much software has been developed
for visualization of data intensive computing for use with supercomputers, including that for
large-scale parallel data mining.
Data mining primarily deals with structured data. Text mining mostly handles unstructured
data/text. Web mining lies in between and copes with semi-structured and/or unstructured
data. The mining process includes preprocessing, pattern analysis, and visualization. To
mine data effectively, software with sufficient functionality should be used. Many different
software packages, commercial or free, are currently available on the market. A comprehensive
list of mining software is available on the KDnuggets web page
(http://www.kdnuggets.com/software/index.html).
This chapter discusses selected software for data mining, supercomputing data mining,
text mining, and web mining that is not available as free open source software. The selected
software for data mining comprises SAS Enterprise Miner, Megaputer PolyAnalyst 5.0, PASW
(formerly SPSS Clementine), IBM Intelligent Miner, and BioDiscovery GeneSight. The selected
software for text mining comprises SAS Text Miner and Megaputer PolyAnalyst 5.0. The selected
software for web mining comprises Megaputer PolyAnalyst and SPSS Clementine. The software for
supercomputing comprises Avizo by Visualization Science Group and JMP Genomics from SAS
Institute. Avizo is 3-D visualization software for scientific and industrial data that can
process very large datasets at interactive speed. JMP Genomics from SAS is used for
discovering the biological patterns in genomics data.
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_65, © Springer Science+Business Media, LLC 2010
These packages are described and compared in terms of their existing features and algorithms,
and are also applied to different available data sets. Background on related literature and
software is also presented. Screen shots of each selected package are reported, as are
conclusions and future directions.
65.2 Literature Review
Data mining is defined by the Data Intelligence Group (1995) as the extraction of hidden
predictive information from large databases. According to them, “data mining tools scour
databases for hidden patterns, finding predictive information that experts may miss because
it lies outside their expectations.” According to StatSoft (2006), algorithms are operations
or procedures that will produce a particular outcome with a completely defined set of steps
or operations. This is in contrast to heuristics, which are general recommendations or guides
based upon theoretical reasoning or statistical evidence, such as “data mining can be a useful
tool if used appropriately.” Data mining and algorithms are widely implemented and rapidly
developed (Kim et al., 2008; Nayak, 2008; Segall and Zhang, 2006).
According to Wikipedia (2009), supercomputers or HPC (High Performance Computing) are used
for highly calculation-intensive tasks such as problems involving quantum mechanical physics,
weather forecasting, global warming, molecular modeling, and physical simulations (such as
the simulation of airplanes in wind tunnels and of the detonation of nuclear weapons).
Sanchez (1996) cited the importance of data mining using supercomputers by stating “Data
mining with these big, superfast computers is a hot topic in business, medicine and research
because data mining means creating new knowledge from vast quantities of information, just
like searching for tiny bits of gold in a stream bed”. According to Sanchez (1996), The
Children’s Hospital of Pennsylvania used supercomputing to take MRI scans of a child’s brain
in 17 seconds, a task that would otherwise normally require 17 minutes, assuming no movement
of the patient.
The increasing availability of textual knowledge applications and online textual sources
has caused a boost in text mining and web mining research. Hearst (2003) defines text mining
as “the discovery of new, previously unknown information, by automatically extracting
information from different written sources.” Hearst distinguishes text mining from data
mining by noting that “in text mining the patterns are extracted from natural language rather
than from structured database of facts.” Metz (2003) describes text mining applications as
“clever enough to run conceptual searches, locating, say, all the phone numbers and place
names buried in a collection of intelligence communiqués.” More impressive, the software can
identify relationships, patterns, and trends involving words, phrases, numbers, and other data.
Web mining is the application of data mining techniques to discover patterns from the
Web and can be classified into three types: web content mining, web usage mining, and web
structure mining (Pabarskaite and Raudys, 2007; Sanchez et al., 2008). Web content mining
is the process of discovering useful information from the content of a web page, which may
consist of text, image, audio, or video data; web usage mining is the application of data
mining to analyze and discover interesting patterns in users’ usage of data on the web; and
web structure mining is the process of using graph theory to analyze the node and connection
structure of a web site (Wikipedia, 2007). An example of the latter would be discovering the
authorities and hubs of any web document, e.g. identifying the most appropriate web links
for a web page.
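
The notion of hubs and authorities can be illustrated with a minimal sketch in the style of the HITS (hyperlink-induced topic search) algorithm. The chapter does not name a specific algorithm, so the choice of HITS, the function name `hits`, the dictionary-based link graph, and the page names below are assumptions made purely for illustration.

```python
import math

def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    links: dict mapping each page to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}
        # Hub score: sum of the authority scores of the pages it links to.
        new_hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}
    return hub, auth

# Hypothetical three-page site: pages "a" and "b" both link to page "c".
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
best_authority = max(auth, key=auth.get)
```

In this toy graph the page that receives links from both others ends up as the strongest authority, while the two linking pages share the hub role equally.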

A wealth of software exists today for data, supercomputing, text, and web mining, such as
that presented by the American Association for Artificial Intelligence (AAAI) (2002) and
Ducatelle (2006) for teaching data mining, by Nisbet (2006) for CRM (Customer Relationship
Management), and in the software review of Deshmukah (1997). StatSoft (2006) presents screen
shots of several software packages that are used for exploratory data analysis and various
data mining techniques. Kim et al. (2008) classify software changes in data mining and
Ceccato et al. (2006) combine three mining techniques. Nayak (2008) develops and applies data
mining techniques in web services discovery and monitoring.
Davi et al. (2005) review two text mining packages, SAS Text Miner and WordStat.
Chou et al. (2008) apply a text mining approach to Internet abuse detection and Lau et al.
(2005) discuss text mining for the hotel industry. Lazarevic et al. (2006) discuss a software
system for spatial data analysis and modeling. Leung (2004) compares microarray data mining
software. The National Center for Biotechnology Information (NCBI) (2006) provides tools for
data mining, including tools specific to each of the following categories: nucleotide
sequence analysis, protein sequence analysis and proteomics, genome analysis, and gene
expression.
Chang and Lee (2006) find frequent itemsets using online data streams. Pabarskaite and
Raudys (2007) review the knowledge discovery process from web log data. Sanchez et al.
(2008) integrate software engineering and web mining techniques in the development of an
e-commerce recommender system capable of predicting the preferences of its users and
presenting them with a personalized catalogue. Ganapathy et al. (2004) discuss visualization
strategies and tools for enhancing customer relationship management.
Some applications of supercomputers for data mining include those of Davies (2007) using
Internet distributed supercomputers, Seigle (2002) for CIA/FBI, Mesrobian et al. (1995) for
real time data mining, and Curry et al. (2007) for detecting changes in large data sets of
payment card data. DMMGS06 conducted a workshop on data mining and management on
the grid and supercomputers in Nottingham, UK. Grossman (2007) wrote a survey of high
performance and distributed data mining. Sekijima (2007) studied the application of HPC to
the analysis of disease-related protein.
65.3 Data Mining Software

This section compares five selected data mining packages: SAS Enterprise Miner, Megaputer
PolyAnalyst 5.0, PASW Modeler (formerly SPSS Clementine), IBM Intelligent Miner, and
BioDiscovery GeneSight. The data mining algorithms to be performed include those for neural
networks, genetic algorithms, clustering, and decision trees. As Table 65.1 shows, SAS
Enterprise Miner, PolyAnalyst 5, PASW, and IBM Intelligent Miner offer more algorithms than
GeneSight.
Table 65.1. Data Mining Software
ALGORITHMS; GeneSight; PolyAnalyst; SAS Enterprise Miner; PASW Modeler/SPSS Clementine; IBM Intelligent Miner
Statistical Analysis x x x x x
Neural Networks x x x(add on) x
Decision Trees x x x
Regression Analysis x x x x
Cluster Analysis x x x x x
Self-Organizing Map (SOM) x x
Link/Association Analysis x x x x
65.3.1 BioDiscovery GeneSight
GeneSight is a product of BioDiscovery, Inc. of El Segundo, CA that focuses on cluster
analysis, using two main techniques, hierarchical and partitioning clustering, for data
mining of microarray gene expressions.
Figure 65.1 shows the k-means clustering of global variations using the Pearson correlation.
This can also be done by self-organizing map (SOM) clustering using the Euclidean distance
metric for the first three variables of aspect, slope, and elevation. Figure 65.2 shows the
two-dimensional self-organizing map (SOM) for the eleven variables for all of the data using
the Chebychev distance metric.
Fig. 65.1. K-means clustering of global variations with the Pearson correlation using GeneSight
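
The three measures named in this subsection (Pearson correlation, Euclidean distance, and Chebychev distance) differ in what they treat as "close". The following is a minimal sketch of the standard definitions, not GeneSight's implementation; the function names and the sample vectors are assumptions for illustration, and the Pearson measure is expressed here as a distance, 1 - r, which is one common convention.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    """Chebychev distance: the largest absolute difference in any coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for opposite ones."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    dev_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if dev_a == 0 or dev_b == 0:
        raise ValueError("Pearson correlation is undefined for a constant vector")
    return 1.0 - cov / (dev_a * dev_b)

# Hypothetical profiles: the second is the first scaled by two.
d_euclid = euclidean([0, 0], [3, 4])
d_cheby = chebyshev([0, 0], [3, 4])
d_same_shape = pearson_distance([1, 2, 3], [2, 4, 6])
d_opposite = pearson_distance([1, 2, 3], [3, 2, 1])
```

A correlation-based measure groups profiles by shape regardless of magnitude, which is why two profiles that differ only by a scale factor have a Pearson distance of essentially zero while their Euclidean distance is not zero.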
65.3.2 Megaputer PolyAnalyst 5.0
PolyAnalyst 5 is a product of Megaputer Intelligence, Inc. of Bloomington, IN and contains
sixteen (16) advanced knowledge discovery algorithms.
Fig. 65.2. Self-organizing map (SOM) with the Chebychev distance metric using GeneSight
Figure 65.3 shows the input data window for the forest cover type data in PolyAnalyst 5.0.
The link diagram in Figure 65.4 illustrates, for each of the six (6) forest cover types,
each of the 5 elevations present for each of the 40 soil types. Figure 65.5 provides the bin
selection rule for the variable of selection. The Decision Tree Report indicates a
classification probability of 80.19% with a total classification error of 19.81%. According
to the PolyAnalyst output, the decision tree has a tree depth of 100 with 210 leaves, a depth
of constructed tree of 16, and a classification efficiency of 47.52%.
Fig. 65.3. Input data window for the forest cover type data in PolyAnalyst 5.0
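
The two Decision Tree Report percentages quoted above are complements: the share of correctly classified records plus the total classification error is 100%. A minimal sketch of that relationship follows; the function name `classification_rates` and the toy labels are hypothetical and are not PolyAnalyst output.

```python
def classification_rates(actual, predicted):
    """Return (percent correct, percent error) for two equal-length label lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    accuracy_pct = 100.0 * correct / len(actual)
    return accuracy_pct, 100.0 - accuracy_pct

# Hypothetical toy labels, not taken from the forest cover type data.
actual = ["spruce", "pine", "spruce", "pine"]
predicted = ["spruce", "pine", "pine", "pine"]
accuracy_pct, error_pct = classification_rates(actual, predicted)

# The reported figures follow the same relationship: 100 - 80.19 = 19.81.
reported_error = round(100.0 - 80.19, 2)
```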
65.3.3 SAS Enterprise Miner
SAS Enterprise Miner is a product of SAS Institute Inc. of Cary, NC and is based on the
SEMMA approach, that is, the process of Sampling (S), Exploring (E), Modifying (M),
Modeling (M), and Assessing (A) large amounts of data. SAS Enterprise Miner utilizes a
workspace with a drag-and-drop icon approach to constructing data mining models. SAS
Enterprise Miner utilizes algorithms for decision trees, regression, neural networks, cluster
analysis, and association and sequence analysis.
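
The five SEMMA stages can be read as a simple pipeline. The sketch below is a toy stand-in written with the Python standard library only; it does not use or imitate any SAS Enterprise Miner API. All function names, the one-variable threshold model, and the sample data are assumptions chosen to keep the example short.

```python
import random
import statistics

def sample(rows, fraction, seed=0):
    """Sample: draw a reproducible random subset of (value, label) rows."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)

def explore(rows):
    """Explore: summarize the input variable."""
    values = [x for x, _ in rows]
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}

def modify(rows, stats):
    """Modify: standardize the input variable using the explored statistics."""
    scale = stats["stdev"] or 1.0
    return [((x - stats["mean"]) / scale, y) for x, y in rows]

def model(rows):
    """Model: threshold classifier at the midpoint between the two class means."""
    mean0 = statistics.mean(x for x, y in rows if y == 0)
    mean1 = statistics.mean(x for x, y in rows if y == 1)
    threshold = (mean0 + mean1) / 2

    def classify(x):
        above = x > threshold
        return int(above) if mean1 > mean0 else int(not above)

    return classify

def assess(classifier, rows):
    """Assess: fraction of rows the classifier labels correctly."""
    return sum(classifier(x) == y for x, y in rows) / len(rows)

def run_semma(rows, fraction=1.0):
    subset = sample(rows, fraction)
    prepared = modify(subset, explore(subset))
    return assess(model(prepared), prepared)

# Hypothetical data: low values belong to class 0, high values to class 1.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
accuracy = run_semma(data, fraction=1.0)
```

Note that this sketch assesses the model on the same rows it was fitted to, and that sampling a small fraction could leave one class empty; a real workflow would hold out separate data for the assessment stage.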
