Mining Database Structure; Or, How to Build a Data Quality Browser
Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk
AT&T Labs–Research
ABSTRACT
1. INTRODUCTION
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ACM SIGMOD ’2002 June 4-6, Madison, Wisconsin, USA
Copyright 2002 ACM 1-58113-497-5/02/06 ...$5.00.
1.1 Related Work
2. SUMMARIZING VALUES OF A FIELD
2.1 Set Resemblance
2.2 Multiset Resemblance
2.3 Substring Resemblance
2.3.1 Q-gram Signature
2.4 Q-gram Sketches
2.5 Finding Keys
3. MINING DATABASE STRUCTURES
3.1 Finding Join Paths
3.2 Finding Composite Fields
3.3 Finding Heterogeneous Tables
4. BELLMAN
5. EXPERIMENTS
5.1 Estimating Field Intersection Size
5.2 Estimating Join Sizes


Errorinintersectionsizeestimation,50samples
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 0.2 0.4 0.6 0.8 1
Resemblance
Errorinestimation
Errorinintersectionsizeestimation,100samples
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 0.2 0.4 0.6 0.8 1
Resemblance
Erroinestimation
ErrorinJoinSizeEstimation,100samples
0
0.2
0.4

0.6
0.8
1
1.2
1.4
0 0.2 0.4 0.6 0.8 1
Resemblance
Estimationerror
Errorinjoinsizeestimation,250samples
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
0 0.2 0.4 0.6 0.8 1 1.2
Resemblance
Errorinestimation
Unadjustedjoinsizevs.actualjoinsize,100samples
1
10
100
1000
10000
100000
1000000
10000000

1 10 100 1000 10000 10000
0
1E+06 1E+07 1E+08
Actualjoinsize
Estimatedjoinsize
5.3 Q-gram Signatures
Adjustedjoinsizevs.actualjoinsize,100samples
1
10
100
1000
10000
100000
1000000
10000000
100000000
1 10 100 1000 10000 10000
0
1E+06 1E+07 1E+08
Actualjoinsize
Estimatedjoinsize
Estimatedvs.ActualQ-gramResemblance,50samples
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
Actualresemblance

Estimatedresemblance
5.4 Q-gram Sketches
Estimatedvs.ActualQ-gramResemblance,150Samples
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
Actualresemblance
Estimatedresemblance
Estimatedvs.actualq-gramvectordistance,50
sketchsamples
0
0.2
0.4
0.6
0.8
1
1.2
0 0.2 0.4 0.6 0.8 1 1.2 1.4
Actualq-gramvectordistance
Estimatedq-gramdistance
Estimatedvs.actualq-gramvectordistance,150sketch
samples
0
0.2
0.4
0.6

0.8
1
1.2
0 0.2 0.4 0.6 0.8 1 1.2 1.4
Actualq-gramvectordistance
Estimatedq-gramdistance
Q-gramvectordistancevs.g-gramresemblance
0
0.2
0.4
0.6
0.8
1
1.2
1.4
0 0.2 0.4 0.6 0.8 1
Q-gramresemblance
Q-gramvectordistance
5.5 Qualitative Experiments
5.5.1 Using Multiset Resemblance
5.5.2 Using Q-gram Similarity
6. CONCLUSIONS
7. REFERENCES
