Data Mining and Knowledge Discovery Handbook, 2nd Edition (part 129)

1260 Qingyu Zhang and Richard S. Segall
Fig. 65.26. Concept Links for Term of “statistical” in SAS Text Miner using SASPDF-SYNONYMS text file (Woodfield, 2004)
SAS Text Miner follows the “drag-and-drop” principle: the user drags a selected icon from the tool set and drops it into the workspace. The workspace of SAS Text Miner was constructed with a data icon for selected animal data that SAS provides in its Instructor’s Trainer Kit, as shown in Figure 24. Figure 25 shows the results of using SAS Text Miner, with individual plots for “role by frequency”, “number of documents by frequency”, “frequency by weight”, “attribute by frequency”, and a “number of documents by frequency” scatter plot. Figure 26 shows the “Concept Linking Figure” generated by SAS Text Miner using the SASPDF-SYNONYMS text file.
65.5.2 Megaputer PolyAnalyst
Previous work by the authors (Segall and Zhang, 2006) used Megaputer PolyAnalyst for data mining. The new release, PolyAnalyst version 6.0, includes text mining and, specifically, new features for text OLAP (on-line analytical processing) and taxonomy-based categorization, as discussed in Megaputer Intelligence Inc. (2007). The latter states that taxonomy-based classification is useful when dealing with large collections of unstructured documents, for example when tracking the number of known issues in product repair notes and customer support letters.
According to Megaputer Intelligence Inc. (2007), PolyAnalyst “provides simple means for creating, importing, and managing taxonomies, and carries out automated categorization of text records against existing taxonomies.” Megaputer Intelligence Inc. (2007) provides examples of applications for executives, customer support specialists, and analysts. According to Megaputer Intelligence Inc. (2007), “executives are able to make better business decisions upon viewing a concise report on the distribution of tracked issues during the latest observation period”.
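To make the idea of categorizing text records against an existing taxonomy concrete, here is a minimal Python sketch. It is not PolyAnalyst's method or API: the taxonomy contents, the simple keyword-matching rule, and the function names `categorize` and `issue_distribution` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical taxonomy: category name -> keywords that signal it.
TAXONOMY = {
    "battery": {"battery", "charge", "charging"},
    "screen": {"screen", "display", "cracked"},
    "shipping": {"delivery", "shipping", "late"},
}


def categorize(record, taxonomy=TAXONOMY):
    """Return the sorted categories whose keywords appear in the text record."""
    words = {w.strip(".,;:!?").lower() for w in record.split()}
    return sorted(cat for cat, keywords in taxonomy.items() if words & keywords)


def issue_distribution(records, taxonomy=TAXONOMY):
    """Count how many records fall under each taxonomy category."""
    counts = Counter()
    for record in records:
        counts.update(categorize(record, taxonomy))
    return counts


if __name__ == "__main__":
    notes = [
        "Battery will not charge.",
        "Cracked screen after late delivery",
        "Customer praised the staff",
    ]
    for category, n in sorted(issue_distribution(notes).items()):
        print(category, n)
```

The resulting counts correspond to the kind of "distribution of tracked issues" report mentioned above. A real system would presumably use richer matching than exact keyword lookup.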
This chapter provides several screen shots of Megaputer PolyAnalyst version 6.0 for text mining. Figure 27 shows the workspace for text mining in Megaputer PolyAnalyst, Figure 28 shows the “Suffix Tree Clustering” report for the text cluster (desk; front), and Figure 29 shows the “Link Term” report for hotel customer survey text. Megaputer PolyAnalyst can also produce displays with drill-down text analysis and histogram plots of text analysis.
65 Commercial Data Mining Software 1261
Fig. 65.27. Workspace for Text Mining in Megaputer PolyAnalyst
Fig. 65.28. Clustering Results in Megaputer PolyAnalyst
Fig. 65.29. Link Term Report using Text Analysis in Megaputer PolyAnalyst
65.6 Web Mining Software
Two selected software packages are reviewed and compared in terms of data preparation, data analysis, and results reporting (see Table 4). As the table shows, Megaputer PolyAnalyst has the unique feature of being a data and text mining tool integrated with web site data source input, while SPSS Clementine takes a linguistic approach rather than a statistics-based approach. Table 4 gives a visual summary of the differences and similarities between the two packages.
Table 65.4. Web Mining Software
Features  Megaputer PolyAnalyst  SPSS Clementine
Data Preparation
Data extraction  x (web site as data source input)  Import server files
Automatic data cleaning  x  x
User segmentation  x  x
Data Analysis
Detect users’ sequences  x
Understand product and content affinities (link analysis)  x  x
Predict user propensity to convert, buy, or churn  x
Navigation report  x
Keyword and search engine  x  x
Results Reporting
Interactive results window  x
Support for multiple languages  x  x
Visual presentation  x  x
Unique features  Data and text mining tool integrated with web site data source input  Linguistic approach rather than statistics-based approach
65.6.1 Megaputer PolyAnalyst
Megaputer PolyAnalyst is an enterprise analytical system that integrates Web mining with data and text mining; it does not have a separate module for Web mining. Web pages or sites can be input directly to Megaputer PolyAnalyst as data source nodes.
Megaputer PolyAnalyst has the standard data and text mining functionalities such as Categorization, Clustering, Prediction, Link Analysis, Keyword and entity extraction, Pattern discovery, and Anomaly detection. These functional nodes can be connected directly to the web data source node to perform web mining analysis. The Megaputer PolyAnalyst user interface allows the user to develop complex data analysis scenarios without loading data into the system, thus saving the analyst’s time. According to Megaputer (2007), whatever data sources are used, PolyAnalyst provides means for loading and integrating these data. PolyAnalyst can load data from disparate data sources, including all popular databases, statistical systems, and spreadsheet systems. In addition, it can load collections of documents in html, doc, pdf, and txt formats, as well as data from an internet web source. PolyAnalyst offers visual “on-the-fly integration” and merging of data coming from disparate sources to create data marts for further analysis. It supports incremental data appending and referencing data sets in previously created PolyAnalyst projects.
Figures 30-32 are screen shots illustrating the application of Megaputer PolyAnalyst for web mining to available data sets. Figure 30 shows an expanded view of the PolyAnalyst workspace. Figure 31 shows a screen shot of PolyAnalyst using the website of Arkansas State University (ASU) as the web data source. Figure 32 shows a keyword extraction report from the undergraduate admission web page of the ASU website.
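As a rough illustration of what a keyword extraction report over a web page involves, the following Python sketch strips the markup from an HTML string and ranks the remaining words by frequency. It is a simplified stand-in, not PolyAnalyst's algorithm: the stopword list, the `TextExtractor` class, the `extract_keywords` function, and the sample page are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "is", "at"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_keywords(html, top_n=5):
    """Return the top_n (word, count) pairs from the page's visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)


if __name__ == "__main__":
    page = (
        "<html><body><h1>Undergraduate Admission</h1>"
        "<p>Apply for admission to the university.</p></body></html>"
    )
    print(extract_keywords(page, top_n=3))
```

The sketch works on an HTML string already in memory; fetching the page from a live site is left out so that the example stays self-contained.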
Fig. 65.30. PolyAnalyst workspace with Internet data source
Fig. 65.31. PolyAnalyst using www.astate.edu as web data source
Fig. 65.32. Keyword extraction report
65.6.2 SPSS Clementine
“Web Mining for Clementine is an add-on module that makes it easy for analysts to perform ad hoc predictive Web analysis within Clementine’s intuitive visual workflow interface.” Web Mining for Clementine combines Web analytics and data mining with SPSS analytical capabilities to transform raw Web data into “actionable insights”. It enables business decision makers to take more effective actions in real time. SPSS (2007) cites examples such as automatically discovering user segments, detecting the most significant sequences, understanding product and content affinities, and predicting user intention to convert, buy, or churn.
Fig. 65.33. SPSS Clementine workspace
Fig. 65.34. Decision rules for determining clusters of web data
SPSS (2007) claims four key data mining capabilities: segmentation, sequence detection, affinity analysis, and propensity modeling. Specifically, SPSS (2007) lists six Web analysis application modules within SPSS Clementine: search engine optimization, automated user and visit segmentation, Web site activity and user behavior analysis, home page activity, activity sequence analysis, and propensity analysis.
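Of the capabilities listed above, sequence detection is perhaps the easiest to illustrate in a few lines. The Python sketch below counts how many visitor sessions contain each contiguous page sequence and keeps those that meet a minimum support. It is only an assumed, simplified formulation: the function name `frequent_sequences`, its parameters, and the sample sessions are hypothetical, and nothing here reflects how SPSS Clementine implements the capability.

```python
from collections import Counter


def frequent_sequences(sessions, length=2, min_support=2):
    """Return page sequences of the given length seen in at least min_support sessions.

    Each session is a list of page names in visit order. A sequence is
    counted at most once per session, so the count is a session-level support.
    """
    support = Counter()
    for pages in sessions:
        seen = {tuple(pages[i:i + length]) for i in range(len(pages) - length + 1)}
        support.update(seen)
    frequent = [(seq, n) for seq, n in support.items() if n >= min_support]
    # Highest support first; ties broken alphabetically for a stable order.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sessions = [
        ["home", "search", "product", "cart"],
        ["home", "search", "product"],
        ["home", "product", "cart"],
    ]
    for seq, n in frequent_sequences(sessions):
        print(" -> ".join(seq), n)
```

Counting each sequence once per session keeps a single visitor who reloads the same pages from inflating the result.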
Unlike other platforms used for Web mining that provide only simple frequency counts (e.g., number of visits, ad hits, top pages, total purchase visits, and top click streams), SPSS Clementine, according to SPSS (2007), provides more meaningful customer intelligence, such as likelihood to convert by individual visitor, likelihood to respond by individual prospect, content clusters by customer value, missed cross-sell opportunities, and event sequences by outcome.
Fig. 65.35. Decision tree results
Figures 33-35 are screen shots illustrating the applications of SPSS Clementine for web
mining to available data sets. Figure 33 shows the SPSS Clementine workspace. Different
user modes can be defined including research mode, shopping mode, search mode, evaluation
mode, and so on. Decision rules for determining clusters of web data are demonstrated in
Figure 34. Figure 35 exhibits decision tree results with classifiers using different model types
(e.g., CHAID, logistic, neural).
65.7 Conclusion and Future Research

The conclusions of this research include the fact that each of the software packages selected for this research has its own unique characteristics and properties that can be displayed when applied to the available data sets. As indicated, each package has its own set of algorithm types to which it can be applied.
Of the five data mining software packages compared, BioDiscovery GeneSight focuses on cluster analysis and provides a variety of data mining visualization charts and colors. BioDiscovery GeneSight has fewer data mining functions than the other four. SAS Enterprise Miner, Megaputer PolyAnalyst, PASW, and IBM Intelligent Miner each employ the same algorithms, as illustrated in Table 1, except that SAS has a separate product, SAS Text Miner, for text analysis. The regression results obtained using these packages are comparable. The cluster analysis results for SAS Enterprise Miner, BioDiscovery GeneSight, and Megaputer PolyAnalyst are each presented in a way that is unique to the package. In conclusion, SAS Enterprise Miner, Megaputer PolyAnalyst, PASW, and IBM Intelligent Miner offer the greatest diversity of data mining algorithms.
This chapter has discussed commercial data mining software that is applicable to supercomputing for 3-D visualization and very large microarray databases. Specifically, it illustrated the applications of supercomputing for data visualization using two selected packages, Avizo and JMP Genomics. Avizo is general-purpose supercomputing software, and JMP Genomics is specialized software for genetic data. Supercomputing data mining for 3-D visualization with Avizo is applied to diverse subjects such as the human skull for medical research and atomic structure, which can serve multipurpose applications such as chemical or nuclear. We have also presented, using JMP Genomics, the data distributions of condition, patient, frequencies, and characteristics for patient data of adenocarcinoma cancer. The figures of this chapter illustrate the level of visualization that these two packages can provide.
Of the two text mining packages compared, both Megaputer PolyAnalyst and SAS Text Miner have extensive text mining capabilities. SAS Text Miner is an add-on to base SAS Enterprise Miner that inserts an additional Text Miner icon on the SAS Enterprise Miner workspace toolbar. SAS Text Miner tags parts of speech and performs transformations, such as Singular Value Decomposition (SVD), on the term-document frequency matrix for viewing in the Text Miner node. Megaputer PolyAnalyst similarly combines data mining and text mining, but also includes web mining capabilities. Megaputer also has standalone Text Analyst software for text mining.
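For readers unfamiliar with the term-document frequency matrix and its singular value decomposition, the following pure-Python sketch builds such a matrix from a few short documents and approximates its largest singular value and vectors by power iteration. This is an illustrative assumption, not SAS Text Miner's implementation: the whitespace tokenization, the function names `term_document_matrix` and `top_singular_triplet`, and the choice of power iteration (rather than a full SVD routine) are all ours.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def top_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its vectors.

    Uses power iteration on A^T A. Assumes a non-empty, non-zero matrix.
    Returns (sigma, u, v) with A v ~= sigma * u.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]  # A v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


if __name__ == "__main__":
    docs = ["text mining text", "web mining", "text and web mining"]
    terms, matrix = term_document_matrix(docs)
    sigma, u, v = top_singular_triplet(matrix)
    print(terms)
    print(round(sigma, 3))
```

The leading singular vectors give one weight per term (`u`) and per document (`v`). Keeping the first few such directions is the usual way an SVD reduces a large, sparse term-document matrix to a small number of dimensions.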
Regarding web mining software, PolyAnalyst can mine web data within an integrated data mining enterprise analytical system and provide visual tools such as link analysis of the critical terms of the text. SPSS Clementine can be used for graphical illustrations of customer web activities as well as for link analysis of different data categories such as campaign, age, gender, and income. The selection of appropriate web mining software should be based on both its available web mining technologies and the type of data to be encountered.
The future direction of this research is to investigate other data, text, web, and supercomputing mining software for analyzing various types of data and to compare the capabilities of these packages with one another. This future research would also include the acquisition of other data sets to perform these new analyses and comparisons.
Acknowledgement. The authors would like to acknowledge the support provided by a 2009 Summer Faculty Research Grant awarded to them by the College of Business of Arkansas State University, without whose program and support this work could not have been done. The authors also want to acknowledge each of the software manufacturers for their support of this research.
References
AAAI (2002), American Association for Artificial Intelligence (AAAI) Spring Symposium on Information Refinement and Revision for Decision Making: Modeling for Diagnostics, Prognostics, and Prediction, Software and Data, retrieved from http://www.cs.rpi.edu/~goebel/ss02/software-and-data.html.
Ceccato, M., M. Marin, K. Mens, L. Moonen, et al. (2006), Applying and combining three different aspect Mining Techniques, Software Quality Journal. 14(3), 209-214.
Chang, J. and Lee, W. (2006), Finding frequent itemsets over online data streams, Information and Software Technology. 48(7), 606-619.
Chou, C., Sinha, A. and Zhao, H. (2008), A text mining approach to Internet abuse detection, Information Systems and eBusiness Management. 6(4), 419-440.
Curry, C., Grossman, R., Locke, D., Vejcik, S., and Bugajski, J. (2007), Detecting changes in large data sets of payment card data: A case study, KDD’07, August 12-15, San Jose, CA.
Data Intelligence Group (1995), An overview of data mining at Dun & Bradstreet, DIG White Paper 95/01, retrieved from t/wp9501/wp9501.htm.
Davi, A., Dominique Haughton, Nada Nasr, Gaurav Shah, et al. (2005), A Review of Two Text-Mining Packages: SAS TextMining and WordStat, The American Statistician. 59(1), 89-104.
Davies, A. (2007), Identification of spurious results generated via data mining using an Internet distributed supercomputer grant, Duquesne University Donahue School of Business, />
Deshmukah, A. V. (1997), Software review: ModelQuest Expert 1.0, ORMS Today, December 1997, retrieved from />review.html.
Ducatelle, F. (2006), Software for the data mining course, School of Informatics, The University of Edinburgh, Scotland, UK, retrieved from />
Ganapathy, S., Ranganathan, C. and Sankaranarayanan, B. (2004), Visualization strategies and tools for enhancing customer relationship management, Communications of the ACM. 47(11), 92-98.
Grossman, R. (2007), Data grids, data clouds and data webs: a survey of high performance and distributed data mining, HPC Workshop: Hardware and software for large-scale biological computing in the next decade, December 11-14, Okinawa, Japan, />
Hearst, M. A. (2003), What is Data Mining?, />~hearstr/text mining.html
IBM DB2 Intelligent Miner Visualization: Using the Intelligent Miner Visualizers Version 8.2 SH12, Second Edition, August 2004
Kim, S., E James Whitehead Jr and Yi Zhang (2008), Classifying Software Changes: Clean or Buggy? IEEE Transactions on Software Engineering. 34(2), 181-197.
Lau, K., Lee, K. and Ho, Y. (2005), Text Mining for the Hotel Industry, Cornell Hotel and Restaurant Administration Quarterly. 46(3), 344-363.
Lazarevic A., Fiea T., & Obradovic, Z. (2006), A software system for spatial data analysis and modeling, retrieved from ?~zoran/papers/lazarevic00.pdf.
Leung, Y. F. (2004), My microarray software comparison - Data mining software, September 2004, Chinese University of Hong Kong, retrieved from mining specific.html.
Megaputer Intelligence Inc. (2007), Data Mining, Text Mining, and Web Mining Software, http://www.megaputer.com
Mesrobian, E., Muntz, R., Shek, E., Mechoso, C. R., Farrara, J. D., Spahr, J. A., Stolorz, P. (1995), Real time data mining, management, and visualization of GCM output, IEEE Computer Society, v.81, />94.ps.gz
Metz, C. (2003), Software: Text Mining, PC Magazine, July 1, />article2/0,1217.a=43573,00.asp
National Center for Biotechnology Information (2006), National Library of Medicine, National Institutes of Health, NCBI tools for data mining, retrieved from ,nih.gov/Tools/.
Nayak, R. (2008), Data Mining in Web Services Discovery and Monitoring, International Journal of Web Services Research. 5(1), 63-82.
Nisbet, R. A. (2006), Data mining tools: Which one is best for CRM? Part 3, DM Review, March 21, 2006, retrieved from />print action.cfm?articleId=1049954.
Pabarskaite, Z. and Raudys, A. (2007), A process of knowledge discovery from web log data: Systematization and critical review, Journal of Intelligent Information Systems. 28(1), 79-105.
Rokach L., Mining manufacturing data using genetic algorithm-based feature set decomposition, Int. J. Intelligent Systems Technologies and Applications, 4(1):57-78, 2008.
Rokach, L. and Maimon, O., Theory and applications of attribute decomposition, IEEE International Conference on Data Mining, IEEE Computer Society Press, pp. 473–480, 2001.
Rokach, L. and Maimon, O. and Averbuch, M., Information Retrieval System for Medical Narrative Reports, Lecture Notes in Artificial intelligence 3055, page 217-228 Springer-Verlag, 2004.
Sanchez, E. (1996), Speedier: Penn researchers to link supercomputers to community problems, The Compass, v. 43, n. 4, p. 14, September 17, />features/1996/091796/research
Sanchez, M., Moreno, M., Segrera, S. and Lopez, V. (2008), Framework for the development of a personalised recommender system with integrated web-mining functionalities, International Journal of Computer Applications in Technology, 33(4), 312-327.
SAS (2009), JMP Genomics 4.0 Product Brief, />/pdf/103112 jmpg4 prodbrief.pdf
Segall, R. and Zhang, Q. (2006), Data visualization and data mining of continuous numerical and discrete nominal-valued microarray databases for biotechnology, Kybernetes: International Journal of Systems and Cybernetics, 35(9/10), 1538-1566.
Seigle, G. (2002), CIA, FBI developing intelligence supercomputer, Global Security.
Sekijima, M. (2007), Application of HPC to the analysis of disease related protein and the design of novel proteins, HPC Workshop: “Hardware and software for large-scale biological computing in the next decade”, December 11-14, Okinawa, Japan, />
SPSS (2009a), PASW Modeler 13: Overview Demo, />modeler/ demo-modeler-overview/index.htm
SPSS (2009b), PASW Modeler Auto Cluster and Cluster Viewer, />
SPSS (2007), Web Mining for Clementine, />mining for clementine, viewed 16 May 2007.
StatSoft, Inc. (2006), Electronic textbook, retrieved from />
VSG Visualization Sciences Group (2009), Avizo The 3D visualization software for scientific and industrial data, />prod avizo overview.php
Wikipedia (2006), Supercomputers, Retrieved May 19, 2009 from BookRags.com: />
Wikipedia (2007), Web mining, />mining
Woodfield, Terry (2004), Mining Textual Data Using SAS Text Miner for SAS9 Course Notes, SAS Institute, Inc., Cary, NC.
Zhang, Q. and Segall, R. (2008), Web mining: a survey of current research, techniques, and software, International Journal of Information Technology & Decision Making, 7(4), 683-720.
66
Weka-A Machine Learning Workbench for Data Mining
Eibe Frank (1), Mark Hall (1), Geoffrey Holmes (1), Richard Kirkby (1), Bernhard Pfahringer (1), Ian H. Witten (1), and Len Trigg (2)
(1) Department of Computer Science, University of Waikato, Hamilton, New Zealand, {eibe, mhall, geoff, rkirkby, bernhard, ihw}@cs.waikato.ac.nz
(2) Reel Two, P O Box 1538, Hamilton, New Zealand

Summary. The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced environment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.
Key words: machine learning software, Data Mining, data preprocessing, data visualization,
extensible workbench
66.1 Introduction
Experience shows that no single machine learning method is appropriate for all possible learning problems. The universal learner is an idealistic fantasy. Real datasets vary, and to obtain accurate models the bias of the learning algorithm must match the structure of the domain.
The Weka workbench is a collection of state-of-the-art machine learning algorithms and data preprocessing tools. It is designed so that users can quickly try out existing machine learning methods on new datasets in very flexible ways. It provides extensive support for the whole process of experimental Data Mining, including preparing the input data, evaluating learning schemes statistically, and visualizing both the input data and the result of learning. This has been accomplished by including a wide variety of algorithms for learning different types of concepts, as well as a wide range of preprocessing methods. This diverse and comprehensive set of tools can be invoked through a common interface, making it possible for users
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_66, © Springer Science+Business Media, LLC 2010
