Alon Halevy
University of Washington
Joint work with Anhai Doan and Pedro Domingos
Learning to Map Between Schemas and Ontologies
Agenda
Ontology mapping is a key problem in many applications:
– Data integration
– Semantic web
– Knowledge management
– E-commerce
LSD:
– Solution that uses multi-strategy learning.
– We've started with schema matching (i.e., very simple ontologies).
– Currently extending to more expressive ontologies.
– Experiments show the approach is very promising!
The Structure Mapping Problem
Types of structures:
– Database schemas, XML DTDs, ontologies, …
Input:
– Two (or more) structures, S1 and S2
– Data instances for S1 and S2
– Background knowledge
Output:
– A mapping between S1 and S2
– Should enable translating between data instances.
– Semantics of mapping?
Semantic Mappings between Schemas
Source schemas = XML DTDs
[Figure: two XML DTD trees. One schema: house → (location, contact → (name, phone)); the other: house → (address, num-baths → (full-baths, half-baths), contact-info → (agent-name, agent-phone)). location ↔ address is a 1-1 mapping; num-baths ↔ (full-baths, half-baths) is a non 1-1 mapping.]
Motivation
Database schema integration
– A problem as old as databases themselves.
– Database merging, data warehouses, data migration.
Data integration / information gathering agents
– On the WWW, in enterprises, large science projects.
Model management:
– Model matching: key operator in an algebra where models and mappings are first-class objects.
– See [Bernstein et al., 2000] for more.
The Semantic Web
– Ontology mapping.
System interoperability
– E-services, application integration, B2B applications, …
Desiderata from Proposed Solutions
Accuracy, efficiency, ease of use.
Realistic expectations:
– Unlikely to be fully automated. Need user in the loop.
Some notion of semantics for mappings.
Extensibility:
– Solution should exploit additional background knowledge.
"Memory", knowledge reuse:
– System should exploit previous manual or automatically generated matchings.
– Key idea behind LSD.
LSD Overview
L(earning) S(ource) D(escriptions)
Problem: generating semantic mappings between a mediated schema and a large set of data source schemas.
Key idea: generate the first mappings manually, and learn from them to generate the rest.
Technique: multi-strategy learning (extensible!)
Step 1:
– [SIGMOD, 2001]: 1-1 mappings between XML DTDs.
Current focus:
– Complex mappings
– Ontology mapping.
Outline
Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.
Data Integration
Find houses with four bathrooms priced under $500,000
[Figure: a mediated schema sits above three source schemas (realestate.com, homeseekers.com, homes.com), connected through wrappers; queries against the mediated schema are answered via query reformulation and optimization.]
Applications: WWW, enterprises, science projects
Techniques: virtual data integration, warehousing, custom code.
Semantic Mappings between Schemas
Source schemas = XML DTDs
[Figure: two XML DTD trees. One schema: house → (location, contact → (name, phone)); the other: house → (address, num-baths → (full-baths, half-baths), contact-info → (agent-name, agent-phone)). location ↔ address is a 1-1 mapping; num-baths ↔ (full-baths, half-baths) is a non 1-1 mapping.]
Semantics (preliminary)
Semantics of mappings has received no attention.
Semantics of 1-1 mappings. Given:
– R(A1,…,An) and S(B1,…,Bm)
– 1-1 mappings (Ai,Bj)
Then, we postulate the existence of a relation W, s.t.:
– Π(C1,…,Ck)(W) = Π(A1,…,Ak)(R)
– Π(C1,…,Ck)(W) = Π(B1,…,Bk)(S)
– W also includes the unmatched attributes of R and S.
In English: R and S are projections on some universal relation W, and the mappings specify the projection variables and correspondences.
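A minimal sketch of this postulated semantics, using Python sets of tuples for relations; the relation names, attribute names, and data values below are illustrative assumptions, not from the talk:

```python
# Sketch of the universal-relation semantics of 1-1 mappings (illustrative;
# the relations and the mapping are made-up examples).

def project(rel, attrs, schema):
    """Project a set of tuples `rel` (attribute order given by `schema`) onto `attrs`."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in rel}

# R(address, price) and S(location, phone), with the 1-1 mapping
# (address, location): both attributes denote a house address.
R_schema = ["address", "price"]
S_schema = ["location", "phone"]
R = {("12 Elm St", "$250,000"), ("9 Oak Ave", "$110,000")}
S = {("12 Elm St", "(305) 729 0831"), ("9 Oak Ave", "(617) 253 1429")}

# Postulated universal relation W: one column C1 = addr for the matched pair,
# plus the unmatched attributes of R and S.
W_schema = ["addr", "price", "phone"]
W = {("12 Elm St", "$250,000", "(305) 729 0831"),
     ("9 Oak Ave", "$110,000", "(617) 253 1429")}

# The mapping's semantics: projecting W onto the matched column agrees with
# projecting R and S onto their matched attributes.
assert project(W, ["addr"], W_schema) == project(R, ["address"], R_schema)
assert project(W, ["addr"], W_schema) == project(S, ["location"], S_schema)
```
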
Why Matching is Difficult
Aims to identify same real-world entity
– using names, structures, types, data values, etc.
Schemas represent same entity differently
– different names => same entity: area & address => location
– same names => different entities: area => location or square-feet
Schema & data never fully capture semantics!
– not adequately documented, not sufficiently expressive
Intended semantics is typically subjective!
– IBM Almaden Lab = IBM?
Cannot be fully automated. Often hard for humans. Committees are required!
Current State of Affairs
Finding semantic mappings is now the bottleneck!
– largely done by hand
– labor intensive & error prone
– GTE: 4 hours/element for 27,000 elements [Li&Clifton00]
Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data
– reconciling ontologies on semantic web
Need semi-automatic approaches to scale up!
Outline
Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.
The LSD Approach
User manually maps a few data sources to the mediated schema.
LSD learns from the mappings, and proposes mappings for the rest of the sources.
Several types of knowledge are used in learning:
– Schema elements, e.g., attribute names
– Data elements: ranges, formats, word frequencies, value frequencies, length of texts.
– Proximity of attributes
– Functional dependencies, number of attribute occurrences.
One learner does not fit all. Use multiple learners and combine with a meta-learner.
Example

Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

realestate.com listings:
– location: Miami, FL / Boston, MA
– listed-price: $250,000 / $110,000
– phone: (305) 729 0831 / (617) 253 1429
– comments: Fantastic house / Great location

homes.com listings:
– price: $550,000 / $320,000
– contact-phone: (278) 345 7215 / (617) 335 2315
– extra-info: Beautiful yard / Great beach

Learned hypotheses:
– If "fantastic" & "great" occur frequently in data values => description
– If "phone" occurs in the name => agent-phone
Multi-Strategy Learning
Use a set of base learners:
– Name learner, Naïve Bayes, Whirl, XML learner
And a set of recognizers:
– County name, zip code, phone numbers.
Each base learner produces a prediction weighted by confidence score.
Combine base learners with a meta-learner, using stacking.
Base Learners

Name Learner
– training examples: (contact, agent-phone), (contact-info, office-address), (phone, agent-phone), (listed-price, price)
– prediction: contact-phone => (agent-phone, 0.7), (office-address, 0.3)
Naive Bayes Learner [Domingos&Pazzani 97]
– "Kent, WA" => (address, 0.8), (name, 0.2)
Whirl Learner [Cohen&Hirsh 98]
XML Learner
– exploits hierarchical structure of XML data
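As a rough illustration of how such a base learner can emit confidence-weighted predictions, here is a minimal word-level naive Bayes classifier. This is a sketch under assumed details, not LSD's actual implementation, and the training data and test value are made up:

```python
# Minimal word-level naive Bayes base learner (illustrative sketch,
# not LSD's actual implementation).
from collections import Counter, defaultdict
import math

class NaiveBayesLearner:
    def train(self, examples):
        """examples: list of (text value, mediated-schema label) pairs."""
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for w in text.lower().split():
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def predict(self, text):
        """Return [(label, confidence), ...] sorted by decreasing confidence."""
        scores = {}
        total = sum(self.label_counts.values())
        for label in self.label_counts:
            logp = math.log(self.label_counts[label] / total)  # prior
            n = sum(self.word_counts[label].values())
            for w in text.lower().split():
                # Laplace smoothing over the vocabulary
                logp += math.log((self.word_counts[label][w] + 1)
                                 / (n + len(self.vocab)))
            scores[label] = math.exp(logp)
        z = sum(scores.values())
        return sorted(((l, s / z) for l, s in scores.items()),
                      key=lambda p: -p[1])

nb = NaiveBayesLearner()
nb.train([("Miami, FL", "address"), ("Boston, MA", "address"),
          ("Fantastic house", "description"), ("Great location", "description")])
print(nb.predict("Boston, FL"))  # 'address' ranks first
```
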
Training the Base Learners

Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

Training data extracted from realestate.com:
<location> Miami, FL </> <listed-price> $250,000 </> <phone> (305) 729 0831 </> <comments> Fantastic house </>
<location> Boston, MA </> <listed-price> $110,000 </> <phone> (617) 253 1429 </> <comments> Great location </>

Name Learner training pairs: (location, address), (listed-price, price), (phone, agent-phone)
Naive Bayes training pairs: ("Miami, FL", address), ("$250,000", price), ("(305) 729 0831", agent-phone)
Entity Recognizers
Use pre-programmed knowledge to identify specific types of entities
– date, time, city, zip code, name, etc.
– house-area (30 X 70, 500 sq. ft.)
– county-name recognizer
Recognizers often have nice characteristics
– easy to construct
– many off-the-shelf research & commercial products
– applicable across many domains
– help with special cases that are hard to learn
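For instance, a phone-number recognizer can be little more than a regular expression. This is an illustrative sketch: the pattern and the fixed confidence score are assumptions, tuned only to formats like those in the example listings:

```python
import re

# Recognizer for US-style phone numbers such as "(305) 729 0831"
# (illustrative pattern and confidence, not LSD's actual recognizer).
PHONE_RE = re.compile(r"\(\d{3}\)\s*\d{3}[\s-]?\d{4}")

def recognize_phone(value):
    """Return a (label, confidence) pair if the value looks like a phone number."""
    if PHONE_RE.fullmatch(value.strip()):
        return ("agent-phone", 0.9)  # assumed confidence score
    return None

print(recognize_phone("(305) 729 0831"))  # ('agent-phone', 0.9)
print(recognize_phone("Seattle, WA"))     # None
```
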
Meta-Learner: Stacking
Training of the meta-learner produces a weight for every pair of (base-learner, mediated-schema element):
– weight(Name-Learner, address) = 0.1
– weight(Naive-Bayes, address) = 0.9
Combining predictions with the meta-learner:
– computes weighted sum of base-learner confidence scores
Example: for the input <area> Seattle, WA </>, the Name Learner predicts (address, 0.6) and Naive Bayes predicts (address, 0.8); the meta-learner outputs (address, 0.6*0.1 + 0.8*0.9 = 0.78).
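The combination step can be sketched as a plain weighted sum; the weights and predictions below are the numbers from the example:

```python
# Meta-learner combination by stacking: weighted sum of base-learner
# confidence scores (weights and predictions from the example above).
weights = {("name-learner", "address"): 0.1,
           ("naive-bayes", "address"): 0.9}
predictions = {"name-learner": {"address": 0.6},
               "naive-bayes": {"address": 0.8}}

def combine(weights, predictions, element):
    """Weighted sum of each base learner's confidence for `element`."""
    return sum(weights[(learner, element)] * preds.get(element, 0.0)
               for learner, preds in predictions.items())

score = combine(weights, predictions, "address")
print(round(score, 2))  # 0.78
```
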
Training the Meta-Learner

For each mediated-schema element, the meta-learner collects the base learners' confidence scores on extracted XML instances together with the true 0/1 labels, and fits the weights by least-squares linear regression.

Extracted XML instances: <location> Miami, FL </>, <listed-price> $250,000 </>, <area> Seattle, WA </>, <house-addr> Kent, WA </>, <num-baths> 3 </>

Scores for address:
Name Learner | Naive Bayes | True
0.5 | 0.8 | 1
0.4 | 0.3 | 0
0.3 | 0.9 | 1
0.6 | 0.8 | 1
0.3 | 0.3 | 0

Resulting weights:
Weight(Name-Learner, address) = 0.1
Weight(Naive-Bayes, address) = 0.9
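The fit itself can be sketched with the normal equations. This stdlib-only sketch uses the table's score columns; the solved weights are whatever least squares produces on these five rows (they illustrate the method, and need not equal the 0.1/0.9 shown on the slide):

```python
# Least-squares fit of stacking weights for one mediated-schema element
# (address), from the confidence-score table above. Illustrative sketch.
rows = [  # (name-learner score, naive-bayes score, true label)
    (0.5, 0.8, 1),
    (0.4, 0.3, 0),
    (0.3, 0.9, 1),
    (0.6, 0.8, 1),
    (0.3, 0.3, 0),
]

# Solve the 2x2 normal equations (X^T X) w = X^T y by hand,
# where X holds the two score columns and y the true labels.
s11 = sum(r[0] * r[0] for r in rows)
s12 = sum(r[0] * r[1] for r in rows)
s22 = sum(r[1] * r[1] for r in rows)
t1 = sum(r[0] * r[2] for r in rows)
t2 = sum(r[1] * r[2] for r in rows)
det = s11 * s22 - s12 * s12
w_name = (t1 * s22 - s12 * t2) / det
w_nb = (s11 * t2 - s12 * t1) / det
print(w_name, w_nb)  # Naive Bayes gets the larger weight on this data
```

On this table, the fit gives Naive Bayes the dominant weight, matching the intuition behind the slide: its scores track the true labels far better than the Name Learner's.
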
Applying the Learners

Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info

<area> instances (Seattle, WA / Kent, WA / Austin, TX):
– Name Learner: (address, 0.8), (description, 0.2)
– Naive Bayes: (address, 0.6), (description, 0.4)
– Meta-Learner: (address, 0.7), (description, 0.3)
<day-phone> instances ((278) 345 7215 / (617) 335 2315 / (512) 427 1115):
– Meta-Learner: (agent-phone, 0.9), (description, 0.1)
<extra-info> instances (Beautiful yard / Great beach / Close to Seattle):
– Meta-Learner: (description, 0.8), (address, 0.2)
The Constraint Handler

Extends learning to incorporate constraints:
– hard constraints
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
– soft constraints
– a = agent-phone & b = agent-name => a & b are usually close to each other
– user feedback = hard or soft constraints
Details in [Doan et al., SIGMOD 2001]
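One way to picture the handler, as a sketch under assumed data structures rather than the SIGMOD 2001 algorithm, is as a search over combinations of candidate mappings that discards any combination violating a hard constraint and keeps the best-scoring survivor:

```python
from itertools import product

# Candidate (source element -> mediated element) predictions with scores;
# the numbers are illustrative.
candidates = {
    "area":       [("address", 0.7), ("description", 0.3)],
    "extra-info": [("address", 0.4), ("description", 0.6)],
}

def violates_hard_constraints(mapping):
    """Hard constraint: two distinct source elements cannot both map to address."""
    targets = [target for target, _ in mapping.values()]
    return targets.count("address") > 1

best, best_score = None, -1.0
for choice in product(*candidates.values()):
    mapping = dict(zip(candidates.keys(), choice))
    if violates_hard_constraints(mapping):
        continue
    score = sum(conf for _, conf in mapping.values())
    if score > best_score:
        best, best_score = mapping, score

print(best)  # {'area': ('address', 0.7), 'extra-info': ('description', 0.6)}
```

Even though address has the highest raw score for both elements, the hard constraint forces extra-info to its second choice, description.
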
The Current LSD System

[Figure: system architecture. In the training phase, the mediated schema, source schemas, and data listings are used to train Base-Learner 1 … Base-Learner k and the Meta-Learner. In the matching phase, the learners' predictions pass through the Constraint Handler, which also takes domain constraints and user feedback, to produce the mappings.]