Data Mining and Knowledge Discovery Handbook, 2nd Edition

1230 Oded Maimon and Abel Browarnik
Fig. 64.10. NHECD model.
64.3 NHECD implementation
NHECD is built around Documentum, an enterprise content management (ECM) system.
Documentum acts as a central repository for documents (e.g., unstructured data), metadata
(semi-structured to structured data, mostly in XML) and extracted information (mostly
structured, tabular data).
NHECD assumes that all target scientific papers to be included in the repository are in
Adobe PDF format. A review of several websites hosting scientific papers shows that this is
a safe choice. If other formats are found, the crawler can be instructed to convert them
to PDF almost seamlessly.
64.3.1 Taxonomies
Each taxonomy deals with a certain aspect of the Nanotox domain. The taxonomies were built by
teams of domain experts and information management experts. Taxonomies are not an expected
output of NHECD, yet they are essential to the NHECD process. Hence, building them was one of
the initial steps in the NHECD implementation. The taxonomies are listed in Table 64.1.
Fig. 64.12 shows the taxonomies describing the subject “commercial NP characterization”.
64 NHECD - Nano Health and Environmental Commented Database 1231
Fig. 64.11. NHECD process. [Figure labels: NHECD, rating, information extraction, annotation, metadata, taxonomies]
Fig. 64.12. Commercial characterization of NP.
Table 64.1. Taxonomies.
Subject                             Taxonomies
animal model                        animal gender
                                    Species
experimental exposure parameters    mode of exposure
                                    Metabolism
                                    Excretion
                                    Distribution
                                    NP exposure protocol
                                    effecting agents
                                    standard test protocol
NP chemical characterization        NP chemical composition
                                    core impurities
                                    coat impurities
                                    coat chemical composition
                                    NP carrier solubility
commercial NP characterization      Substance
                                    fraction NP stated
                                    Mixture (GHS or 1999/45/EC)
                                    Article
NP characterization methods         X-ray and neutron based instruments
                                    Electron beam methods
                                    Ion Beam analysis
                                    Optical methods
                                    other NP measurements methods
NP general characterization         Specific Surface Area
                                    NP shape
                                    dispersion and adsorption
                                    Nano Delivery system
                                    Structure
                                    zeta potential
                                    NP type
Overall NP                          taxonomies of NP characterization
                                    taxonomies of NP chemical characterization
                                    taxonomies of NP characterization methods
Results                             visible toxic effect by system
                                    biological effects
                                    pathologic effects
                                    time for effect manifestation
                                    Reversibility
                                    site of effect
                                    measurement methods for biological effect
                                    measurements methods for pathological effects
64.3.2 Crawling
Crawling is the process of automatically obtaining scientific papers and data about each paper
(such as the names of the authors, the publication date, the name of the journal, keywords, the
abstract, and in general any detail made available with the paper itself) by visiting scientific
paper repositories available on the web (whether restricted to subscribers or available to
everyone) and searching by keywords on the paper text. NHECD developed a crawler for PubMed.
Crawlers for other leading scientific sites, such as ISIWEB or SciFinder, are in a development
stage. The main obstacles encountered relate to intellectual property issues and to the
publishers' efforts to enforce their rights.
The crawler is written in Java. It takes as input a set of keywords. Using the website's API
(Application Program Interface), it obtains a list of pointers to the targeted scientific papers.
These pointers are processed to transform them into downloadable links. If a downloadable link
is obtained, the scientific paper is downloaded, provided that NHECD has access to the paper
(e.g., there is a subscription to the resource or it is publicly available). The paper (if
available, otherwise a placeholder for it), along with the metadata already converted to XML,
is uploaded to the NHECD document repository.
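The crawl loop described in this section (keywords, pointers, downloadable links, download if accessible, upload with XML metadata) can be illustrated with a minimal sketch. This is not the NHECD crawler, which is written in Java; it is a Python illustration under stated assumptions. The function names (`crawl`, `metadata_to_xml`), the callables passed in (`search`, `resolve_link`, `download`, `upload`) and the metadata keys are all hypothetical, and no real repository API is called.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List, Optional


def metadata_to_xml(meta: Dict[str, str]) -> str:
    """Serialize one paper's metadata as a small XML document (assumed layout)."""
    root = ET.Element("paper")
    for key, value in meta.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def crawl(
    keywords: List[str],
    search: Callable[[List[str]], Iterable[Dict[str, str]]],
    resolve_link: Callable[[str], Optional[str]],
    download: Callable[[str], Optional[bytes]],
    upload: Callable[[Optional[bytes], str], None],
) -> int:
    """Run one crawl and return the number of records uploaded.

    All four callables are hypothetical stand-ins:
      search       - queries a repository API and yields metadata dicts,
                     each holding a "pointer" to the paper
      resolve_link - turns a pointer into a downloadable link, or None
      download     - returns the PDF bytes, or None when access is denied
      upload       - stores the paper (or None as a placeholder) with its XML
    """
    uploaded = 0
    for meta in search(keywords):
        link = resolve_link(meta["pointer"])
        # Download only when a link exists; otherwise keep a placeholder (None).
        pdf = download(link) if link is not None else None
        upload(pdf, metadata_to_xml(meta))
        uploaded += 1
    return uploaded
```

Passing the repository access functions in as parameters keeps the sketch independent of any particular site, which matters here because each publisher exposes a different interface and different access restrictions.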
64.3.3 Information extraction
The goals of information extraction in NHECD are:
1. To enable users to ask specific questions about specific attributes and receive answers. If
possible, a link to the paper is given, along with a pointer to the location of the requested
information within the document.
2. To enable, in the future, data mining on extracted data (e.g., patterns).
The process starts with a multistep preprocessing stage:
1. convert the input documents from PDF to text
2. perform parsing and stemming
3. perform zoning within the document
4. classify the document according to NHECD taxonomies
Next in the process is the tagging stage, used to recognize keywords, either by using the
taxonomies or the values involved. As an example, the input phrase
“To determine the effect of particle size, labeled microspheres of 500 and 1000
nm in diameter were incubated with mouse melanoma B16 cells”
would result in the tagged form
“To determine the effect of particle size, labeled microspheres of <NUMBER_1>
and <NUMBER_2> <LENGTH-UNIT_3> in diameter were incubated with
<SPECIES_4> <CELL-TYPE_5> <CELL-LINE_6> cells”
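The tagging step in the example above can be sketched as follows. This is a minimal illustration, assuming a small hand-written lexicon in place of the NHECD taxonomies; the names `LEXICON` and `build_tagger` and the dictionary entries are hypothetical, and the real tagger is not described in enough detail here to reproduce it.

```python
import re
from itertools import count

# Hypothetical mini-dictionaries standing in for the NHECD taxonomies.
LEXICON = {
    "LENGTH-UNIT": ["nm", "um", "mm"],
    "SPECIES": ["mouse", "rat", "human"],
    "CELL-TYPE": ["melanoma", "fibroblast"],
    "CELL-LINE": ["B16", "HeLa"],
}


def build_tagger(lexicon):
    """Return a function that replaces numbers and known terms with numbered tags."""
    term_to_tag = {term: tag for tag, terms in lexicon.items() for term in terms}
    # Longest terms first so that a longer term wins over a shorter prefix.
    terms = sorted(term_to_tag, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:(?P<number>\d+(?:\.\d+)?)|(?P<term>"
        + "|".join(re.escape(t) for t in terms)
        + r"))\b"
    )

    def tag(text):
        counter = count(1)  # one running index shared by all tags in the text

        def replace(match):
            if match.group("number") is not None:
                name = "NUMBER"
            else:
                name = term_to_tag[match.group("term")]
            return f"<{name}_{next(counter)}>"

        return pattern.sub(replace, text)

    return tag


if __name__ == "__main__":
    tagger = build_tagger(LEXICON)
    print(tagger(
        "To determine the effect of particle size, labeled microspheres of 500 and 1000 "
        "nm in diameter were incubated with mouse melanoma B16 cells"
    ))
```

The word-boundary anchors keep the digits inside a cell-line name such as B16 from being tagged as a separate number.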
The pattern matching stage is based on the output of previous stages and on the process
of annotation, an auxiliary step performed by Nanotox domain experts to prepare a training
set for this stage.
The tasks needed to obtain patterns are:
1. Define the list of features to be extracted (based on the taxonomy).
2. For each feature to be extracted, define a list of extraction patterns.
3. Each extraction pattern (p) consists of the following items:
a) p.attributes: the associated attributes to be extracted (note that the same pattern can be
used to extract several attributes concurrently)
b) p.precondition: a pre-condition
c) p.match: a regular expression to be matched
d) p.extraction: a regular expression used for extracting the values, assuming that p.match
has been matched
e) p.scope: the scope of the extracted values in the text
f) p.store: an SQL query for storing the results in the database
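One way to picture the six items of an extraction pattern is as a small record plus a function that applies it. The sketch below is an assumption-laden illustration, not the NHECD implementation: the class `ExtractionPattern`, the helpers `apply_pattern` and `store_values`, the example pattern `SIZE_PATTERN` and the table `particle_size` are all hypothetical, and the precondition is modeled as a simple predicate on the text.

```python
import re
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ExtractionPattern:
    attributes: List[str]                # p.attributes: names of extracted values
    precondition: Callable[[str], bool]  # p.precondition: must hold before matching
    match: str                           # p.match: regular expression to be matched
    extraction: str                      # p.extraction: regex with one group per attribute
    scope: str                           # p.scope: recorded only, not interpreted here
    store: str                           # p.store: parameterized SQL insert


def apply_pattern(p: ExtractionPattern, text: str) -> Optional[Dict[str, str]]:
    """Return the extracted attribute values, or None if the pattern does not apply."""
    if not p.precondition(text):
        return None
    if re.search(p.match, text) is None:
        return None
    found = re.search(p.extraction, text)
    if found is None:
        return None
    return dict(zip(p.attributes, found.groups()))


def store_values(conn: sqlite3.Connection, p: ExtractionPattern,
                 paper_id: str, values: Dict[str, str]) -> None:
    """Run the pattern's SQL statement with the paper id and extracted values."""
    conn.execute(p.store, [paper_id] + [values[a] for a in p.attributes])


# Hypothetical pattern: particle size stated as "<number> <unit> in diameter".
SIZE_PATTERN = ExtractionPattern(
    attributes=["size", "unit"],
    precondition=lambda text: "microspheres" in text,
    match=r"\bin diameter\b",
    extraction=r"(\d+(?:\.\d+)?)\s*(nm|um)\s+in diameter",
    scope="sentence",
    store="INSERT INTO particle_size (paper_id, size, unit) VALUES (?, ?, ?)",
)
```

In this sketch only the value closest to the unit is captured, so a phrase listing two sizes yields one of them; handling such cases is the kind of decision the real patterns and the conflict resolution stage would have to make.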
The closing stage of the process is the conflict resolution stage. It is required for cases
where several possibly contradicting patterns can be matched to the same text, or the same
pattern can be matched to different parts of the text.
The information extraction process is depicted in Fig. 64.13.
Fig. 64.13. The information extraction process.
64.3.4 NHECD products
The results of NHECD consist mainly of two products:
1. A repository of scientific papers related to Nanotox, augmented by metadata provided by
authors and publishers, metadata extracted from the papers using text mining algorithms,
and ratings for the articles based on methods adopted by NHECD. All of the above is indexed
using NHECD taxonomies. As a result, it is possible to retrieve scientific papers using
sophisticated queries.
2. A set of structured facts extracted from the scientific papers in tabular format. The
structured facts should make it possible to perform data mining to obtain new, unforeseen
knowledge.
64.3.5 Scientific paper rating
A scientific paper has a well established life cycle. After the paper is written, refereed and
eventually accepted, it is published. From this point in time the paper can be cited.
The rating of a paper depends on several variables:
1. Journal name
2. Publication year
3. Full author names
4. For each citing article:
a) Citing article name (and a unique identifier for the paper itself; NHECD decided to adopt
SICI for this purpose)
b) Citing journal name
5. From JCR (Journal Citation Reports), for each journal name (including citing journals):
a) Impact Factor
b) Cited Half Life
6. H-indices per author, from PoP
The rating algorithm is applied when the paper is loaded and then on a periodic basis, to
reflect changes such as new citations, changes in impact factors, in “Cited Half Life”, in JCR
data and more. The rating algorithm takes into account the publication date of newly published
papers to avoid less-than-fair ratings for such papers.
The scientific paper rating devised by NHECD is composed of a rating by Journal Impact Factor
and a rating by H-indices. These components are defined below.
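1. Rating by Journal Impact Factor
\[ \mathrm{Rating}_1(\mathrm{Article}(i)) = 1 - 2^{-0.6 \cdot \mathrm{CitationScore}_{\mathrm{Article}(i)}} \]
where
\[ \mathrm{CitationScore}_{\mathrm{Article}(i)} = \frac{\sum_{\mathrm{Article}(j) \in \mathrm{citations}(\mathrm{Article}(i))} \mathrm{Map}(\mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j))))}{\mathrm{Age}(\mathrm{Article}(i))} \]
and
\[ \mathrm{Map}(\mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j)))) =
\begin{cases}
0.08 & 0 \le \mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j))) \le 1.296 \\
0.4 & 1.297 \le \mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j))) \le 3.76 \\
1 & 3.77 \le \mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j))) \le \infty
\end{cases} \]
2. Rating by H-Indices
\[ \mathrm{Rating}_2(\mathrm{Article}(i)) = 1 - 1.05^{-\mathrm{HScore}_{\mathrm{Article}(i)}} \]
where
\[ \mathrm{HScore}_{\mathrm{Article}(i)} = \frac{\sum_{\mathrm{Article}(j) \in \mathrm{citations}(\mathrm{Article}(i))} \mathrm{Average}_{\mathrm{Author}(k) \in \mathrm{Article}(j)}(\text{H-Index}_k)}{\mathrm{Age}(\mathrm{Article}(i))} \]
3. Final Rating
\[ \mathrm{Rating} = \frac{\alpha_1 \mathrm{Rating}_1 + \alpha_2 \mathrm{Rating}_2}{\alpha_1 + \alpha_2}, \qquad 0 \le \alpha_i \le 1 \]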
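The three rating formulas can be turned into a short computation. The sketch below follows the definitions above, under a few assumptions that the text does not settle: the age of the article is taken as a positive number (its unit is not stated here), impact factors falling in the small gaps between the listed intervals are assigned to the lower interval, each citing article has at least one author, and the default weights of 0.5 are arbitrary. All function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def map_impact(impact: float) -> float:
    """Map a citing journal's impact factor to its weight (thresholds from the text)."""
    if impact <= 1.296:
        return 0.08
    if impact <= 3.76:
        return 0.4
    return 1.0


def rating_by_impact(citing_impacts: Sequence[float], age: float) -> float:
    """Rating_1: citing_impacts holds one journal impact factor per citing article."""
    citation_score = sum(map_impact(x) for x in citing_impacts) / age
    return 1 - 2 ** (-0.6 * citation_score)


def rating_by_h_index(citing_author_h: Sequence[Sequence[float]], age: float) -> float:
    """Rating_2: citing_author_h holds, per citing article, its authors' H-indices."""
    h_score = sum(mean(hs) for hs in citing_author_h) / age
    return 1 - 1.05 ** (-h_score)


def final_rating(r1: float, r2: float, a1: float = 0.5, a2: float = 0.5) -> float:
    """Weighted combination of the two ratings, with 0 <= a1, a2 <= 1."""
    if not (0 <= a1 <= 1 and 0 <= a2 <= 1) or a1 + a2 == 0:
        raise ValueError("weights must lie in [0, 1] and must not both be zero")
    return (a1 * r1 + a2 * r2) / (a1 + a2)
```

Both component ratings start at 0 for an uncited paper and approach 1 as the score grows, so the final rating also stays between 0 and 1.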
64.3.6 NHECD Frontend
NHECD provides a free-access website with information retrieval functionality to facilitate
searching the NHECD repository.
It includes the following components:
1. An open source content management system implemented on Drupal, which stores and
manages the entire frontend database (including user information and usage patterns).
2. The user interface component that handles all the input or requests from the user.
The frontend interacts with the backend repository, stored and managed on Documentum.
Fig. 64.14 shows the architecture design of the NHECD frontend.
User communities and characteristics: the NHECD frontend is designed to meet the different
needs of three main communities and an additional group, the administrators.
1. Scientists: Users in this community are scientists from academia and industry, the
most expert users among all three communities. These users should have extensive
prior knowledge in the domain of nanotoxicology. The system assumes that these users
are proficient in information searches.
2. Regulators: Users working for (or on behalf of) government institutes and regulatory
agencies are part of the NHECD regulatory community. This community aims at providing
legislation and regulation on the health, safety or environmental concerns regarding
the use of nanoparticles. Usage patterns of this group often overlap with those of the
other communities.
3. General public: This community is composed of individuals and NGOs who are active
in a wide range of fields where information provided by NHECD may be relevant. We
assume that most of the general public users are not able to read or evaluate the
scientific material NHECD provides. Therefore, for this community the frontend mainly
provides answers to queries on general information, light reviews or news on the impact of
exposure to nanoparticles.
4. Administrators: The administrators are in charge of managing the daily operation of the
system. They are responsible for managing user accounts, general settings and
monitoring.
Fig. 64.14. Architecture.
The NHECD frontend provides the following features:
1. Basic search
2. Advanced search
3. Intelligent search
4. Taxonomic navigation
5. Recommender results (i.e., recommendations based on the analysis of usage patterns of
other users)
6. Option to resubmit queries, adding criteria for the refinement of results
7. Site registration
8. Personalization features
9. Displaying a list of most viewed papers
10. Links to other Nanotox-related sites
11. NHECD news, updates and FAQs
64.4 Conclusions
NHECD provides two important products:
1. An extensive and commented repository of scientific papers and other publications in
the Nanotox area, searchable using taxonomies and full text search. The scientific papers
are rated according to published NHECD criteria, to help users better assess their
findings. Such a repository significantly expands on currently available repositories
because it goes beyond the mapping of existing research in Nanotox (as most current
initiatives do): NHECD gives access to the results of the research papers, extracted from
the sources using text mining algorithms. Access to scientific papers is granted to visitors
subject to copyright and to the restrictions imposed by publishers. This NHECD result is
intended for Nanotox scientists, regulators and the general public.
2. A set of structured results extracted from the scientific papers populating the NHECD
repository. These results will make it possible to perform data mining, which is expected
to yield validated results and further knowledge discovery. This part of the NHECD results
is targeted at Nanotox scientists and regulators.
Fig. 64.15. NHECD 2.0. [Figure labels: NHECD 2.0, rating, information extraction, table extraction, graph mining, annotation, metadata, taxonomies]
64.5 Further research
Graph and table mining
NHECD relies on text mining algorithms, which allow information to be extracted from
textual data. Scientific Nanotox papers (as in many other areas) often include other types
of elements, such as graphs and tables. Moreover, these elements are generally more
expressive than the text. Hence, expanding NHECD to include graph and table mining seems
desirable. Preliminary research on these subjects by the NHECD team shows that, at least
for some types of graphs and tables, the task is feasible.
The concept of the future NHECD (dubbed NHECD 2.0) is shown in Fig. 64.15.
