64 NHECD - Nano Health and Environmental Commented Database

Oded Maimon and Abel Browarnik

Department of Industrial Engineering, Tel-Aviv University, Ramat-Aviv 69978, Israel

Summary. The impact of nanoparticles on health and the environment is a significant research subject, attracting increasing interest from the scientific community, regulatory bodies and the general public. We present a smart repository system with text and data mining for this domain. The growing body of knowledge in this area, consisting of scientific papers and other types of publications (such as surveys and whitepapers), emphasizes the need for a methodology to alleviate the complexity of reviewing all the available information and discovering all the underlying facts, using data mining algorithms and methods.
The European Commission-funded project NHECD (whose full name is “Creation of a
critical and commented database on the health, safety and environmental impact of nanopar-
ticles”) converts the unstructured body of knowledge produced by the different groups of
users (such as researchers and regulators) into a repository of scientific papers and reviews
augmented by layers of information extracted from the papers. Towards this end we use tax-
onomies built by domain experts and metadata, using advanced methodologies. We implement
algorithms for textual information extraction, graph mining and table information extraction.
Rating and relevance assessment of the papers are also part of the system. The project is composed of two major layers: a backend consisting of all the above taxonomies, algorithms and methods, and a frontend consisting of a query and navigation system. The frontend has a web interface which addresses the needs (and knowledge) of the different user groups. Documentum, a content management system (CMS), is the backbone of the backend process component. The frontend is a customized application built using an open source CMS. It is designed to take advantage of the taxonomies and metadata for search and navigation, while allowing the user to query the system, taking advantage of the extracted information.
64.1 Introduction
Nanoparticle toxicity (or NanoTox) is currently one of the main concerns for the scientific community, for regulators and for the public. The impact of nanoparticles on health and the environment is a research subject attracting increasing interest. This fact is reflected by the number of papers published on the subject, both in scientific journals and in the press.
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_64, © Springer Science+Business Media, LLC 2010
The published material (e.g., scientific papers) is essentially unstructured. It always uses natural language (in the form of text), sometimes accompanied by tables and/or graphs.
Usually, when searching a body of unstructured knowledge (such as a corpus of scientific papers), the "search engine" uses a method called "full text search". Full text search can be done either directly, by scanning all the available text, or by using indexing mechanisms. Direct search is feasible only for small volumes of data. Index-based search applies when the amount of data rules out direct search. There are several indexing mechanisms, the most famous being Google's PageRank (Brin, 1998). Indexing mechanisms are rated according to the results returned by the search engines using them. In either case, users interact with the search engine (and through it with the indexing mechanism) by means of queries.
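The difference between the two strategies can be sketched with a toy inverted index (a hypothetical illustration: the corpus and function names are invented here, and real engines add ranking, positional data and compression on top of this idea):

```python
from collections import defaultdict

# Toy corpus: document id -> text (invented for illustration).
corpus = {
    1: "nanoparticle toxicity in aquatic environments",
    2: "health effects of engineered nanoparticles",
    3: "toxicity screening methods for chemicals",
}

# Build an inverted index once: term -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def direct_search(term):
    """Direct search: scan every document (feasible only for small volumes)."""
    return {d for d, text in corpus.items() if term in text.lower().split()}

def indexed_search(term):
    """Index-based search: a single lookup in the precomputed index."""
    return index.get(term, set())

# Both strategies agree; the index merely avoids rescanning the corpus.
print(direct_search("toxicity"), indexed_search("toxicity"))  # {1, 3} {1, 3}
```

The index is built once, after which each query costs a single lookup instead of a full scan; this is why index-based search scales where direct search does not.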
Scientific papers are written in natural language, so it could be easier for users to formulate queries in that same natural language. However, understanding natural language is an extremely non-trivial task, and using it for queries would add complexity to a problem that is complex enough by itself. To avoid this, search engines use different approaches to deal with queries:
1. Keywords: Document creators (or trained indexers) are asked to supply a list of words that
describe the subject of the text, including synonyms of words that describe this subject.
Keywords improve recall, particularly if the keyword list includes a search word that is
not in the document text.
2. Boolean queries: Searches using Boolean operators can dramatically increase the precision of a free text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If a search retrieves too few documents, the OR operator can be used to increase recall.
3. Phrase search: A phrase search matches only those documents that contain a specified
phrase.
4. Concordance search: A concordance search produces an alphabetical list of all principal
words that occur in a text with their immediate context.
5. Proximity search: A proximity search matches only those documents that contain two or more specified words within a given number of words of each other.
6. Regular expression: A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
7. Wildcard search: A search that replaces one or more characters in a search query with a wildcard character such as an asterisk.
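A few of these query styles can be sketched over a toy corpus (the documents and helper names are invented for illustration; production engines implement these operators over an index rather than by scanning each document):

```python
import re

docs = [
    "silver nanoparticles induce oxidative stress",
    "oxidative stress was not observed for gold particles",
    "titanium dioxide nanoparticles and skin absorption",
]

def boolean_and(corpus, *terms):
    """Boolean AND: retrieve only documents containing every term."""
    return [d for d in corpus if all(t in d.split() for t in terms)]

def phrase(corpus, p):
    """Phrase search: the exact word sequence must occur."""
    return [d for d in corpus if p in d]

def proximity(corpus, t1, t2, max_gap):
    """Proximity search: t1 and t2 at most max_gap words apart."""
    hits = []
    for d in corpus:
        words = d.split()
        if t1 in words and t2 in words:
            gap = abs(words.index(t1) - words.index(t2)) - 1
            if gap <= max_gap:
                hits.append(d)
    return hits

def wildcard(corpus, pattern):
    """Wildcard search: '*' stands for any run of word characters."""
    rx = re.compile(pattern.replace("*", r"\w*"))
    return [d for d in corpus if rx.search(d)]

print(len(boolean_and(docs, "oxidative", "stress")))       # 2
print(len(proximity(docs, "nanoparticles", "stress", 2)))  # 1
```

Note how the operators trade off differently: the Boolean AND matches both of the first two documents, while the proximity constraint keeps only the one where the terms are close together.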
"Skin Deep", a product safety guide dealing with cosmetics, run by the Environmental Working Group, grants public access to a database containing more than 42,000 products with more than 8,300 ingredients from the U.S., nearly a quarter of all products on the market (figures updated to May 2009). The database is based on a link between a collection of personal care product ingredient listings and more than 50 toxicity and regulatory databases.
Skin Deep uses a restricted user interface for simple queries. The visitor is asked for a product, ingredient or company (see Figure 1).
A query for “vitamin a” returned 614 results, matching at least one word. The advanced
query screen allows for a much more detailed search (see Figure 2). Visitors can ask to find
products, ingredients or companies with higher granularity.
The results returned by Skin Deep consist of an exhaustive analysis of the substance, as
shown in Figure 3.
Fig. 64.1. Skin Deep simple query.
Fig. 64.2. Skin Deep advanced queries.
Fig. 64.3. Skin Deep result example.
ICON, the International Council on Nanotechnology, based at Rice University, uses an approach that constrains the user to formulate a query within a restricted (although very rich) template, together with a "controlled vocabulary" (i.e., a list of predefined values), as shown in Figure 4.
Fig. 64.4. ICON database.
Results obtained from ICON are, as stated on the ICON website:
". . . a quick and thorough synopsis of our Environment, Health and Safety Database using two types of analyses. The first is a Simple Distribution Analysis (pie chart) which compares categories within a specified time range. The second type is a Time Progressive Distribution Analysis (histogram) which compares categories over a specified overall time range and data grouping period.
Other useful features include the ability to:
1. Generate and export custom reports in pdf and xls formats.
2. Click on a report result to generate a list of publications meeting your criteria."
TOXNET - Databases on toxicology, hazardous chemicals, environmental health, and
toxic releases, an initiative by the US National Library of Medicine, lets visitors query its
network of databases by using keywords, as shown in Figure 5.
Fig. 64.5. Toxnet query.
There are several initiatives related to the toxicity of nanoparticles, but to date none of them is a real alternative to the existing (and limited) databases. Examples of such initiatives are the Environmental Defense Fund Nanotech section and the NANO Risk Framework.
There are also initiatives that aim at mapping current nanotox research.
The OECD (Organisation for Economic Co-operation and Development) runs a "Database on Research into the Safety of Manufactured Nanomaterials". As suggested by its name, the database maps research in the area. It uses extensive metadata ("data about data"), as seen in Figure 6.
Fig. 64.6. OECD NanoTox advanced search.
NIOSH, the U.S. National Institute for Occupational Safety and Health, runs a Nanoparticle Information Library (NIL).
IMPART-Nanotox, an EU-funded project that ended in 2008, includes a public, web-accessible database of nanotox publications. The search can be done by publications' metadata, as seen in Figure 7.
Fig. 64.7. Impart-Nanotox extended search.
SAFENANO, another EU-funded project, contains a database of publications and metadata searchable on the web (see Figure 8).
Fig. 64.8. SAFENANO Publication Search.
Nano Archive, another EU FP7 project, has the objective of allowing researchers to share and search information, mainly through metadata exchange (see Figure 9). ObservatoryNano, yet another EU FP7-funded project, has an ambitious target:
"to create a European Observatory on Nanotechnologies to present reliable, complete and responsible science-based and economic expert analysis, across different technology sectors, establish dialogue with decision makers and others regarding the benefits and opportunities, balanced against barriers and risks, and allow them to take action to ensure that scientific and technological developments are realized as socio-economic benefits."
Fig. 64.9. Nano Archive search.
The review above brought us to the conclusion that the following shortcomings should be
dealt with:
1. Many efforts are being dedicated to creating repositories of raw metadata of nanotox
publications. There is no evidence as to the contribution of such repositories to the ad-
vancement of nanotox research and implementation.
2. No significant searchable repository of nanotox data (as compared to metadata) exists
currently.
3. The query capabilities of widely used search engines do not include the option to query
the text for fact patterns (as well as more complex, derived patterns), such as “what con-
clusions were reached in scientific papers where <fact X> and <fact Y> occurred in that
order?”. Those are examples of queries that may help nanotox researchers and regulators,
as well as the general public.
4. There is no tool capable of extracting information specific to nanotox.
NHECD, an EU FP7-funded project, aims at transforming the emerging body of unstructured knowledge (in the form of scientific papers and other publications) into structured data by means of textual information extraction. It addresses the above shortcomings by:
1. Developing taxonomies for the nanotox domain
2. Developing and implementing algorithms for information extraction from nanotox papers
3. Creating a repository of papers augmented by structured knowledge extracted from the
papers
4. Allowing visitors (e.g., nanotox scientists, regulators, general public) to navigate the
repository using the taxonomies
5. Letting visitors search the repository using complex patterns (such as facts)
6. Enabling data mining algorithms to predict toxicity based on characteristics extracted by text mining methods. Thus free text can be used for data mining inference.
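As a rough illustration of items 1, 2 and 5, the sketch below matches a tiny invented taxonomy against sentences and reports a simple co-occurrence "fact". The concepts, terms and sentences are all hypothetical, and the actual NHECD extraction algorithms are considerably more sophisticated:

```python
# Toy slice of a nanotox taxonomy: concept -> surface terms (invented).
taxonomy = {
    "particle": ["silver nanoparticle", "TiO2", "carbon nanotube"],
    "effect": ["oxidative stress", "inflammation", "cytotoxicity"],
}

def annotate(sentence):
    """Tag every taxonomy term that appears in the sentence."""
    lowered = sentence.lower()
    return [(concept, term)
            for concept, terms in taxonomy.items()
            for term in terms if term.lower() in lowered]

def fact_pattern(sentence):
    """A toy 'fact': a particle term co-occurring with an effect term."""
    ann = annotate(sentence)
    particles = [t for c, t in ann if c == "particle"]
    effects = [t for c, t in ann if c == "effect"]
    return (particles[0], effects[0]) if particles and effects else None

print(fact_pattern("Exposure to TiO2 induced oxidative stress in lung cells."))
# ('TiO2', 'oxidative stress')
```

A repository of such extracted pairs, attached to each paper, is what makes queries like "which particles were linked to which effects" answerable without re-reading the full text.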
64.2 The NHECD Model
NHECD is, as suggested by its full name (Nano Health and Environmental Commented Database), an initiative to obtain a database (i.e., structured information that can be queried) from available unstructured information such as scientific papers and other publications. The process of obtaining the structured data involves many resources, from the domain of Nanotox and from the areas of information sciences and technologies (IT).
The NHECD model is depicted in Figure 10.
The process starts with a collection of documents (e.g., scientific papers) gathered by means of a search using criteria given by Nanotox experts. The process used to populate the repository is called crawling. The documents are accompanied by the corresponding metadata (e.g., authors, publication dates, journals, keywords supplied by the authors, abstract and more). The process requires Nanotox taxonomies. Taxonomies are classification artifacts used at the information extraction stage (taxonomies are also used in NHECD for document navigation). Taxonomy building tasks are "located" at the boundary between the Nanotox experts and the IT experts (see Figure 10), due to their interdisciplinary nature.
Nanotox experts annotate papers to train the system for the information extraction stage. This stage is implemented using text mining algorithms. Following the information extraction process, a set of rating algorithms is applied to the documents to provide an additional layer of information (i.e., the rating).
The result of the process consists of:
1. A corpus of results, updated on an ongoing, asynchronous basis.
2. A commented collection of scientific papers. By commented we refer to the added layer
of metadata, rating and other information extracted from the document.
The whole process can be represented with a block diagram, as shown in Figure 11.
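The flow of that block diagram (crawl, extract, rate, store) can be caricatured as a chain of stages; every function body below is a placeholder invented for illustration, standing in for the real NHECD crawler, text-mining and rating components:

```python
def crawl(criteria):
    """Placeholder crawler: fetch papers matching expert criteria."""
    return [{"id": 1, "text": "TiO2 caused inflammation in mice.",
             "meta": {"year": 2009, "journal": "Toy Journal"}}]

def extract(doc):
    """Placeholder text-mining stage: attach extracted facts."""
    doc["facts"] = ["TiO2 -> inflammation"] if "TiO2" in doc["text"] else []
    return doc

def rate(doc):
    """Placeholder rating stage: score papers that yielded facts."""
    doc["rating"] = 1.0 if doc["facts"] else 0.0
    return doc

def build_repository(criteria):
    """Run the chain: each paper gains extraction and rating layers."""
    return [rate(extract(d)) for d in crawl(criteria)]

repo = build_repository({"keywords": ["nanotox"]})
print(repo[0]["rating"])  # 1.0
```

The essential point is that each stage only adds layers (facts, rating) on top of the original document and its metadata, which is exactly the "commented" structure of the NHECD repository.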
