Mining Database Structure; Or, How to Build a Data Quality Browser
Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk
AT&T Labs–Research
ABSTRACT
1. INTRODUCTION
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ACM SIGMOD ’2002 June 4-6, Madison, Wisconsin, USA
Copyright 2002 ACM 1-58113-497-5/02/06 $5.00.
1.1 Related Work
2. SUMMARIZING VALUES OF A FIELD
2.1 Set Resemblance
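A minimal sketch of the quantity named in this section's title, assuming that set resemblance means the Jaccard coefficient |A ∩ B| / |A ∪ B| of the value sets of two fields, and that it is estimated by comparing min-hash signatures. The function names, the salted SHA-1 hash family, the signature length of 100, and the sample data are all illustrative assumptions rather than details taken from the paper.

```python
import hashlib

def resemblance(a, b):
    """Exact set resemblance (Jaccard coefficient) of two collections of values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets (assumption)
    return len(a & b) / len(a | b)

def _hash(value, seed):
    # Hypothetical hash family: SHA-1 over "seed:value", truncated to 64 bits.
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(values, num_hashes=100):
    """One minimum hash value per hash function; values must be non-empty."""
    values = set(values)
    return [min(_hash(v, seed) for v in values) for seed in range(num_hashes)]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of signature positions on which the two fields agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Two synthetic fields sharing 500 of 1500 distinct values.
field_a = [f"cust{i}" for i in range(0, 1000)]
field_b = [f"cust{i}" for i in range(500, 1500)]
exact = resemblance(field_a, field_b)
approx = estimate_resemblance(minhash_signature(field_a),
                              minhash_signature(field_b))
```

The signatures are fixed-size regardless of field cardinality, so two fields can be compared without scanning either one again; a longer signature trades space for a lower-variance estimate.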
2.2 Multiset Resemblance
2.3 Substring Resemblance
2.3.1 Q-gram Signature
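A minimal sketch of the idea behind this heading, assuming a q-gram is a length-q substring of a field value and that a field is summarized by the set of q-grams drawn from all of its values; a min-hash signature of that set could then stand in for it. The helper names, the default q = 3, and the treatment of strings shorter than q are illustrative assumptions.

```python
def qgrams(s, q=3):
    """Set of length-q substrings of s; a shorter string is kept whole (assumption)."""
    s = str(s)
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def field_qgrams(values, q=3):
    """Union of the q-gram sets of every value in a field."""
    grams = set()
    for v in values:
        grams |= qgrams(v, q)
    return grams

def qgram_resemblance(values_a, values_b, q=3):
    """Jaccard coefficient of the two fields' q-gram sets."""
    ga, gb = field_qgrams(values_a, q), field_qgrams(values_b, q)
    if not ga and not gb:
        return 1.0  # convention chosen here for two empty fields (assumption)
    return len(ga & gb) / len(ga | gb)
```

Under these assumptions, two fields whose values never match exactly can still score well above zero when one field's values are embedded in, or are slight variations of, the other's.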
2.4 Q-gram Sketches
2.5 Finding Keys
3. MINING DATABASE STRUCTURES
3.1 Finding Join Paths
3.2 Finding Composite Fields
3.3 Finding Heterogeneous Tables
4. BELLMAN
5. EXPERIMENTS
5.1 Estimating Field Intersection Size
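A worked sketch of how an intersection size can be recovered from a resemblance value, assuming resemblance is the Jaccard coefficient ρ = |A ∩ B| / |A ∪ B|. Since |A ∪ B| = |A| + |B| - |A ∩ B|, solving for the intersection gives |A ∩ B| = ρ (|A| + |B|) / (1 + ρ). The function name is hypothetical, and whether the experiments use exactly this estimator is an assumption.

```python
def intersection_size(rho, size_a, size_b):
    """Intersection size implied by resemblance rho and the two set sizes."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("resemblance must lie in [0, 1]")
    return rho * (size_a + size_b) / (1.0 + rho)

# Two fields of 1000 distinct values each that share 500 values
# have resemblance 500 / 1500.
example = intersection_size(500 / 1500, 1000, 1000)
```

Because ρ enters both the numerator and the denominator, an error in the estimated resemblance propagates nonlinearly into the estimated intersection size.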
5.2 Estimating Join Sizes
[Figure: Error in intersection size estimation, 50 samples. x-axis: Resemblance; y-axis: Error in estimation.]
[Figure: Error in intersection size estimation, 100 samples. x-axis: Resemblance; y-axis: Error in estimation.]
[Figure: Error in join size estimation, 100 samples. x-axis: Resemblance; y-axis: Estimation error.]
[Figure: Error in join size estimation, 250 samples. x-axis: Resemblance; y-axis: Error in estimation.]
[Figure: Unadjusted join size vs. actual join size, 100 samples. x-axis: Actual join size; y-axis: Estimated join size; both axes logarithmic.]
5.3 Q-gram Signatures
[Figure: Adjusted join size vs. actual join size, 100 samples. x-axis: Actual join size; y-axis: Estimated join size; both axes logarithmic.]
[Figure: Estimated vs. actual q-gram resemblance, 50 samples. x-axis: Actual resemblance; y-axis: Estimated resemblance.]
5.4 Q-gram Sketches
[Figure: Estimated vs. actual q-gram resemblance, 150 samples. x-axis: Actual resemblance; y-axis: Estimated resemblance.]
[Figure: Estimated vs. actual q-gram vector distance, 50 sketch samples. x-axis: Actual q-gram vector distance; y-axis: Estimated q-gram distance.]
[Figure: Estimated vs. actual q-gram vector distance, 150 sketch samples. x-axis: Actual q-gram vector distance; y-axis: Estimated q-gram distance.]
[Figure: Q-gram vector distance vs. q-gram resemblance. x-axis: Q-gram resemblance; y-axis: Q-gram vector distance.]
5.5 Qualitative Experiments
5.5.1 Using Multiset Resemblance
5.5.2 Using Q-gram Similarity
6. CONCLUSIONS
7. REFERENCES