Summary of mathematics doctoral thesis: Discovering functional dependencies and relaxed functional dependencies in databases

MINISTRY OF EDUCATION
AND TRAINING

VIETNAM ACADEMY OF
SCIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY

-----------------------------

VU QUOC TUAN

DISCOVERING FUNCTIONAL DEPENDENCIES
AND RELAXED FUNCTIONAL DEPENDENCIES
IN DATABASES

Major: Math Fundamentals for Informatics
Code: 9 46 01 10

SUMMARY OF MATHEMATICS DOCTORAL THESIS

Ha Noi - 2019


This work is completed at:
Graduate University of Science and Technology
Vietnam Academy of Science and Technology

Supervisor 1: Assoc. Prof. Dr. Ho Thuan
Supervisor 2: Assoc. Prof. Dr. Nguyen Thanh Tung


Reviewer 1: …..................................................
Reviewer 2: …..................................................
Reviewer 3: …..................................................

This thesis will be officially presented in front of the Doctoral thesis
Grading Committee, meeting at:
Graduate University of Science and Technology
Vietnam Academy of Science and Technology
At ............. hrs ....... day ....... month....... year .......

This thesis is available at:
1. Library of Graduate University of Science and Technology
2. National Library of Vietnam.


INTRODUCTION
Data dependencies play important roles in database design, data
quality management and knowledge representation. Dependencies in
knowledge discovery are extracted from the existing data of the
database. This extraction process is called dependency discovery.
The aim of dependency discovery is to find important
dependencies holding on the data of the database. These discovered
dependencies represent domain knowledge and can be used to verify
database design and assess data quality.
Dependency discovery has attracted considerable research interest
since the early 1980s. Today, discovering data dependencies in big
data sets is becoming more important because these data sets contain
a great deal of valuable knowledge.
Currently, with the development of digital devices, and especially of
social networks and smartphone applications, the amount of data in
applications grows very quickly. This raises problems in data
storage and data management, and especially in knowledge discovery
from these big data sets. Discovering functional dependencies (FDs)
and relaxed functional dependencies (RFDs) in databases is one of the
important problems of knowledge discovery. Three typical types of
data dependencies of interest for discovery are the FD, the
approximate functional dependency (AFD) and the conditional
functional dependency (CFD). An AFD is an extension of an FD in which
the "approximation" is based on a degree of satisfaction or an error
measure; a CFD is an extension of an FD that aims to capture
inconsistencies in data.
Research on RFD discovery in databases focuses first on FD discovery,
because the FD is a special case of every type of RFD and the results
on FD discovery can be adapted to discover other types of data
dependencies (such as AFDs). The general model of the FD discovery
problem includes the following steps: generating a search space of
FDs, verifying the satisfaction of each FD, pruning the search space,
outputting the set of satisfied FDs, and reducing redundancy in this
set. Key discovery is a special case of the FD discovery problem and
is also an important problem in normalizing relational databases.
The time complexity of the FD discovery problem is polynomial in the
number of tuples in the relation but exponential in the number of
attributes of that relation. Therefore, effective pruning rules
should be developed to reduce the processing time. Among the proposed
pruning rules, key pruning is important: if a key is discovered, then
all sets containing that key can be pruned (deleted) from the search
space. However, the disadvantage of existing key pruning rules is
that keys are sought over the entire set of attributes Ω of the
database (a very difficult problem, because the time complexity can
be exponential in the number of attributes of Ω). So is there a way
to find keys within a proper subset of Ω? This question is one of the
basic motivations of this thesis.
After a set of data dependencies has been discovered, it can be very
large and difficult to use because it contains unnecessary
redundancy. An important problem is how to eliminate as much
redundancy as possible from the set of discovered data dependencies.
This problem is also addressed in the thesis.
Another research direction in the thesis is the discovery of two
typical types of RFD, namely AFD and CFD. Both have many applications
and occur frequently in relational databases; the CFD in particular
is a powerful tool for data cleaning. For AFDs, the most important
problem is to improve and develop techniques for computing
approximate measures. For CFDs, in addition to discovering them, the
study of a unified hierarchy between CFDs and other types of data
dependencies is also a very interesting problem.
The problems studied in the thesis are topical and are continually
revisited in a series of works by international authors; in Vietnam,
there are many published works on methods and algorithms for finding
reducts of a decision table by different approaches.
The objective of the thesis is to study the problems analyzed above
in the setting of relational databases. The main contents of the
thesis are described as follows:
Chapter 1. An overview of relational data model, concepts of
functional dependency, closure of a set of attributes, key for a
relational schema, etc. This chapter also focuses on RFD and the
generalization of methods used for discovering FDs and RFDs.
Chapter 2. The presentation of AFD and CFD (two typical types
of RFD) and some related results.
Chapter 3. The presentation of the closure computing algorithms
of a set of attributes under a set of FDs, reducing the key finding
problem of a relation schema and some related results.
Chapter 4. The presentation of an effective preprocessing
transformation for sets of FDs (to reduce redundancies in a given set
of FDs) and some related results.

Chapter 1
FUNCTIONAL DEPENDENCIES AND
RELAXED FUNCTIONAL DEPENDENCIES IN THE
RELATIONAL DATA MODEL
1.1. Recalling some basic notions
A relation r on the set of attributes Ω = {A1, A2,…, An} is a subset
r ⊆ {(a1, a2,…, an) | ai ∈ Dom(Ai), i = 1, 2,…, n}
where Dom(Ai) is the domain of Ai, i = 1, 2,…, n.
A relation schema S is an ordered pair S = <Ω, F>, where Ω is a
finite set of attributes and F is a set of FDs. S can also be denoted
by S(Ω).
1.2. Functional dependency
Functional dependency. Given X, Y ⊆ Ω. Then X → Y if, for all
relations r over the relation schema S(Ω) and all t1, t2 ∈ r,
t1[X] = t2[X] implies t1[Y] = t2[Y].
Armstrong's axioms. For all X, Y, Z ⊆ Ω, we have:
Q1. (Reflexivity): If Y ⊆ X then X → Y.
Q2. (Augmentation): If X → Y then XZ → YZ.
Q3. (Transitivity): If X → Y and Y → Z then X → Z.
The closure of X ⊆ Ω under a set of FDs F is the set X_F^+:
X_F^+ = {A ∈ Ω | (X → A) ∈ F+}
Keys for a relation schema. Let S = <Ω, F> be a relation schema
and K ⊆ Ω. We say that K is a key of S if the following two
conditions are simultaneously satisfied:
(i). (K → Ω) ∈ F+
(ii). If K' ⊂ K then (K' → Ω) ∉ F+
If K only satisfies (i) then K is called a superkey.
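The definition above quantifies over all relations of the schema. On
a single relation instance, satisfaction of X → Y can be tested
directly, as in the following minimal sketch. The function name
satisfies, the dictionary representation of tuples and the sample
data are assumptions made for illustration; they do not come from the
thesis.

```python
def satisfies(relation, lhs, rhs):
    """Return True if every two tuples that agree on lhs also agree on rhs.

    relation: list of dicts mapping attribute name -> value (hypothetical layout).
    lhs, rhs: lists of attribute names (X and Y in the text).
    """
    seen = {}  # X-value -> the Y-value first observed with it
    for t in relation:
        x_val = tuple(t[a] for a in lhs)
        y_val = tuple(t[a] for a in rhs)
        # setdefault stores y_val on first sight and returns the stored value
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True


# Illustrative relation over attributes A, B, C
r = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "x", "C": 20},
    {"A": 2, "B": "y", "C": 10},
]
print(satisfies(r, ["A"], ["B"]))
print(satisfies(r, ["A"], ["C"]))
```

A single pass with a dictionary is enough because a violation always
shows up as two different Y-values recorded for the same X-value.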

1.3. Relaxed functional dependency (RFD)
1.3.1. Approximate functional dependency (AFD)
An AFD is an FD that almost holds. To determine the degree of
violation of X → Y in a given relation r, an error measure, denoted
e(X → Y, r), is used. Given an error threshold ε, 0 ≤ ε ≤ 1, we
say that X → Y is an AFD if and only if e(X → Y, r) ≤ ε.
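As an illustration of an error measure, the sketch below uses the
measure g3 in its usual form from the literature (the minimum
fraction of tuples that must be removed so that the FD holds); this
concrete definition is an assumption here, since the summary does not
restate it. The names g3 and is_afd and the sample data are
hypothetical.

```python
from collections import Counter, defaultdict


def g3(relation, lhs, rhs):
    """Fraction of tuples to delete so that lhs -> rhs holds exactly."""
    if not relation:
        return 0.0
    groups = defaultdict(Counter)  # X-value -> counts of co-occurring Y-values
    for t in relation:
        x_val = tuple(t[a] for a in lhs)
        y_val = tuple(t[a] for a in rhs)
        groups[x_val][y_val] += 1
    # In each X-group we can keep only the tuples of the most frequent Y-value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(relation)


def is_afd(relation, lhs, rhs, threshold):
    """X -> Y is an AFD when the error does not exceed the threshold."""
    return g3(relation, lhs, rhs) <= threshold


r = [
    {"A": 1, "B": "x"},
    {"A": 1, "B": "x"},
    {"A": 1, "B": "y"},
    {"A": 2, "B": "z"},
]
print(g3(r, ["A"], ["B"]))
```

With this data one tuple out of four has to be removed, so A → B is
accepted as an AFD for a threshold of 0.3 but not for 0.2.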
1.3.2. Metric functional dependency (MFD)
Consider X → Y in a given relation r. An MFD extends the functional
dependency by replacing the condition t1[Y] = t2[Y] with
d(t1[Y], t2[Y]) ≤ δ, where d is a metric on Y, d: dom(Y) × dom(Y) → R,
and δ ≥ 0 is a parameter.
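A minimal sketch of an MFD test on one relation instance follows. The
function name satisfies_mfd, the sample attributes and the distance
function are assumptions chosen for illustration only.

```python
from itertools import combinations


def satisfies_mfd(relation, lhs, rhs, dist, delta):
    """Check that tuples equal on lhs are within distance delta on rhs.

    dist takes two tuples of rhs-values and returns a non-negative number.
    """
    for t1, t2 in combinations(relation, 2):
        if all(t1[a] == t2[a] for a in lhs):
            y1 = tuple(t1[a] for a in rhs)
            y2 = tuple(t2[a] for a in rhs)
            if dist(y1, y2) > delta:
                return False
    return True


# Hypothetical data: the same title reported with slightly different durations
r = [
    {"title": "T1", "minutes": 120},
    {"title": "T1", "minutes": 123},
    {"title": "T2", "minutes": 90},
]
abs_diff = lambda u, v: abs(u[0] - v[0])
print(satisfies_mfd(r, ["title"], ["minutes"], abs_diff, 5))
```

Setting delta to 0 with a proper metric gives back the ordinary FD
test, which matches the remark that the MFD relaxes the equality on Y.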

1.3.3. Conditional functional dependency (CFD)
A CFD is a pair φ = (X → Y, Tp), where X → Y is an FD and Tp is a
pattern tableau with all attributes in X and Y. Intuitively, the
pattern tableau Tp of φ refines the FD embedded in φ by enforcing the
binding of semantically related data values.
1.3.4. Fuzzy functional dependency (FFD)
Let r be a relation on Ω = {A1, A2,…, An} and X, Y ⊆ Ω. For each
Ai ∈ Ω, the degree of equality of data values in Dom(Ai) is defined
by the fuzzy tolerance relation Ri.
Given a parameter α (0 ≤ α ≤ 1), we say that two tuples t1[X] and
t2[X] are equal with the degree α, denoted t1[X] E(α) t2[X], if
Rk(t1[Ak], t2[Ak]) ≥ α for all Ak ∈ X. Then X → Y is called an FFD
with the degree α if ∀t1, t2 ∈ r, t1[X] E(α) t2[X] ⇒ t1[Y] E(α) t2[Y].
1.3.5. Differential dependency (DD)
DD extends the equality relation (=) in the FD X → Y. The conditions
t1[X] = t2[X] and t1[Y] = t2[Y] are replaced, respectively, by the
conditions that t1, t2 satisfy differential functions φL and φR.
In fact, the differential functions use metric distances to extend
the equality relation used in FD.
FD is a special case of DD if φL(t1[X], t2[X]) = 0 and
φR(t1[Y], t2[Y]) = 0. In addition, DD is also an extension of MFD if
φL(t1[X], t2[X]) = 0 and φR(t1[Y], t2[Y]) ≤ δ.
1.3.6. Other types of RFDs
There are many other types of RFDs. Motivated by real applications,
each type of RFD results from extending (relaxing) the equality
relation in the traditional FD concept in a certain way.
1.4. FD Discovery

Top-down methods. These methods generate candidate FDs following an
attribute lattice, test their satisfaction, and then use the
satisfied FDs to prune candidate FDs at lower levels of the lattice
to reduce the search space. An important problem is how to check
whether a candidate FD is satisfied. Two specific methods have been
used: the partition method (algorithms: TANE, FD_Mine) and the
free-set method (algorithm: FUN).
Bottom-up methods. Unlike the top-down methods above, bottom-up
methods compare the tuples of the relation to find agree-sets or
difference-sets. These sets are then used to derive the FDs satisfied
by the relation. The characteristic feature of these methods is that
they do not check candidate FDs against the relation for
satisfaction, but against the computed agree-sets and
difference-sets. Two typical algorithms using these methods are
Dep-Miner and FastFDs.
The worst-case time complexity of the FD discovery problem is
exponential in the number of attributes of Ω.
There are several topics related to FD discovery, such as sampling,
maintenance of discovered FDs, key discovery, etc.
1.5. RFD Discovery
1.5.1. AFD Discovery
FD discovery methods can be adapted to discover AFDs by adding a
suitable approximate measure to test the satisfaction of AFDs.
1.5.2. CFD Discovery
The discovery of CFDs faces challenges from two sides. As with
classical FDs, the number of candidate embedded FDs for possible CFDs
is exponential. At the same time, the discovery of the optimal
tableau for an embedded FD is NP-complete.
Three typical algorithms for CFD discovery are CFDMiner, CTANE and
FastCFD.
1.6. Summary of chapter 1
This chapter presents an overview of FD and RFD in the relational
data model. The search space of the dependency discovery problem is
exponential in the number of attributes involved in the data.
The FD discovery methods can be adapted to discover RFDs. For
example, an error measure can be used in an FD discovery algorithm
to find AFDs.
Several algorithms have been proposed for discovering FDs and RFDs.

Chapter 2.
APPROXIMATE FUNCTIONAL DEPENDENCIES AND
CONDITIONAL FUNCTIONAL DEPENDENCIES
2.1. About some results relating to FD and AFD
This section shows the relationships between the results of two
groups of authors ([Y. Huhtala et al., 1999] and [S. King et al.,
2003]) and proves some important lemmas that serve as the foundation
for discovering FDs and AFDs (these lemmas had not been proven).
2.1.1. Partitions
For t ∈ r and X ⊆ Ω, let us denote:
[t]X = {u ∈ r | t[X] = u[X]} and πX = {[t]X | t ∈ r}
The product of two partitions πX and πY is denoted by πX · πY. The
number of equivalence classes in πX is denoted by |πX|.
2.1.2. Some results
The following theorems of [S. King et al., 2003] are in fact lemmas
of [Y. Huhtala et al., 1999]; these lemmas are proven in detail in
the thesis.
Theorem 2.1. FD X → A holds if and only if πX refines πA.
Theorem 2.2. FD X → A holds if and only if |πX| = |πX∪{A}|.
Theorem 2.3. FD X → A holds if and only if g3(X) = g3(X ∪ {A}).
Theorem 2.4. We have πX · πY = πX∪Y.
Theorem 2.5. For B ∈ X and X - {B} → B. Then, if X → A then
X - {B} → A. If X is a superkey then X - {B} is also a superkey.
Theorem 2.6. C+(X) = {A ∈ R | ∀B ∈ X, X - {A, B} → B does not
hold}.
Theorem 2.7. For A ∈ X and X - {A} → A. FD X - {A} → A is
minimal iff for all B ∈ X, we have A ∈ C+(X - {B}).
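Theorems 2.2 and 2.4 translate directly into a partition-based test.
The sketch below represents a partition as a set of frozensets of
tuple indices; the function names partition, product and fd_holds and
the sample relation are illustrative assumptions, not the thesis
implementation.

```python
from collections import defaultdict


def partition(relation, attrs):
    """Equivalence classes of tuple indices under equality on attrs."""
    classes = defaultdict(set)
    for i, t in enumerate(relation):
        classes[tuple(t[a] for a in sorted(attrs))].add(i)
    return {frozenset(c) for c in classes.values()}


def product(p1, p2):
    """Product of two partitions: all non-empty pairwise intersections."""
    return {c1 & c2 for c1 in p1 for c2 in p2 if c1 & c2}


def fd_holds(relation, lhs, a):
    """Theorem 2.2: X -> A holds iff adding A does not split any class of X."""
    return len(partition(relation, lhs)) == len(partition(relation, set(lhs) | {a}))


r = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "x", "C": 20},
    {"A": 2, "B": "y", "C": 10},
]
# Theorem 2.4 on this instance: the product equals the partition of the union
print(product(partition(r, {"A"}), partition(r, {"C"})) == partition(r, {"A", "C"}))
print(fd_holds(r, {"A"}, "B"), fd_holds(r, {"A"}, "C"))
```

Comparing class counts avoids comparing the partitions themselves,
which is the point of Theorem 2.2.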
2.2. FD and AFD discovery
Several approximate measures have been proposed and are commonly
used for discovering AFDs: TRUTHr(X → Y), g1(X → Y, r),
g2(X → Y, r) and g3(X → Y, r).
The choice of approximate measure affects the output of AFD
discovery. In the thesis, we establish some new relationships between
the measures:
- TRUTHr(X → Y) = 1 − (|r| / (|r| − 1)) · g1(X → Y, r)
- g2(X → Y, r) [?] (|r| / 2) · g1(X → Y, r)
- g2(X → Y, r) [?] g3(X → Y, r) [?] Σ_{c ∈ πX(r)} max{|c'| : c' ∈ πY(c), c' ⊆ c} / |r|

Given a relation r on a schema S(Ω). For each X ⊆ Ω, we define
an equivalence relation ≡X on r as follows:
t ≡X u if and only if t[X] = u[X], for all t, u ∈ r
Suppose r = {t1, t2,..., tm}. Each equivalence relation ≡X on r can
be expressed as a binary matrix with elements 1 or 0 (called an
equivalence matrix), where aij = 1 if ti ≡X tj and aij = 0 otherwise.

≡X    t1    t2    ...   tj    ...   tm
t1    a11   a12   ...   a1j   ...   a1m
t2    a21   a22   ...   a2j   ...   a2m
...   ...   ...   ...   ...   ...   ...
ti    ai1   ai2   ...   aij   ...   aim
...   ...   ...   ...   ...   ...   ...
tm    am1   am2   ...   amj   ...   amm

Using equivalence matrices (attribute matrices), we give algorithms
whose time complexity is only O(m²) for discovering FDs (testing
satisfaction) and AFDs (computing the measures TRUTHr(X → Y),
g1(X → Y, r), g2(X → Y, r)).
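The matrix approach can be sketched as follows. The code assumes the
usual definitions of g1 (fraction of ordered tuple pairs that violate
the FD) and g2 (fraction of tuples involved in some violation), which
are not restated in this summary; the function names and the sample
relation are likewise hypothetical. Each function makes one pass over
the m × m matrices.

```python
def equivalence_matrix(relation, attrs):
    """a[i][j] = 1 if tuples i and j agree on attrs, else 0."""
    keys = [tuple(t[a] for a in attrs) for t in relation]
    m = len(keys)
    return [[1 if keys[i] == keys[j] else 0 for j in range(m)] for i in range(m)]


def fd_holds(ax, ay):
    """X -> Y holds iff every 1 in the X-matrix is also a 1 in the Y-matrix."""
    m = len(ax)
    return all(ay[i][j] for i in range(m) for j in range(m) if ax[i][j])


def g1(ax, ay):
    """Assumed definition: violating ordered pairs divided by m squared."""
    m = len(ax)
    if m == 0:
        return 0.0
    bad = sum(1 for i in range(m) for j in range(m) if ax[i][j] and not ay[i][j])
    return bad / (m * m)


def g2(ax, ay):
    """Assumed definition: tuples taking part in a violation divided by m."""
    m = len(ax)
    if m == 0:
        return 0.0
    bad = sum(1 for i in range(m) if any(ax[i][j] and not ay[i][j] for j in range(m)))
    return bad / m


r = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "x", "C": 20},
    {"A": 2, "B": "y", "C": 10},
]
a_mat = equivalence_matrix(r, ["A"])
b_mat = equivalence_matrix(r, ["B"])
c_mat = equivalence_matrix(r, ["C"])
print(fd_holds(a_mat, b_mat), g1(a_mat, c_mat), g2(a_mat, c_mat))
```

The matrix of a set of attributes is the element-wise AND of the
matrices of its attributes, so the attribute matrices can be built
once and combined for each candidate FD.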
2.3. Conditional Functional Dependencies (CFD)
Definition. A CFD φ on a relation schema R is a pair φ = (X → Y,
Tp), where X → Y is a standard FD (referred to as the FD embedded
in φ) and Tp is a tableau with all attributes in X ∪ Y (referred to
as the pattern tableau of φ), where for each A in X or Y and each
tuple t ∈ Tp, t[A] is either a constant "a" in the domain Dom(A) of A
or an unnamed variable "_".
Semantics. The pattern tableau Tp of the CFD φ = (X → Y, Tp)
defines the tuples (in the relation) which must satisfy the FD X → Y.
Intuitively, the pattern tableau Tp of φ refines the standard FD
embedded in φ by enforcing the binding of semantically related data
values.
The consistency problem for CFDs is NP-complete. The inference
system ℐ is sound and complete for implication of CFDs. The
proposed algorithms for discovering CFDs are CFDMiner, CTANE
and FastCFD.
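The semantics can be made concrete with a small satisfaction test.
The sketch assumes the standard CFD semantics from the literature
(for every pattern tuple tp and all tuples t1, t2, if t1[X] = t2[X]
matches tp[X] then t1[Y] = t2[Y] must match tp[Y]); the function
names, the "_" wildcard string and the sample data are illustrative
assumptions.

```python
WILDCARD = "_"  # stands for the unnamed variable; assumed not to occur as a data value


def matches(values, pattern):
    """A value matches a pattern entry if the entry is the wildcard or equal."""
    return all(p == WILDCARD or v == p for v, p in zip(values, pattern))


def satisfies_cfd(relation, lhs, rhs, tableau):
    """Check the CFD (lhs -> rhs, tableau) on one relation instance."""
    for tp in tableau:
        px = [tp[a] for a in lhs]
        py = [tp[a] for a in rhs]
        for t1 in relation:
            x1 = [t1[a] for a in lhs]
            if not matches(x1, px):
                continue  # the pattern does not apply to t1
            y1 = [t1[a] for a in rhs]
            for t2 in relation:  # includes t2 = t1, which enforces constants in py
                if [t2[a] for a in lhs] != x1:
                    continue
                if [t2[a] for a in rhs] != y1 or not matches(y1, py):
                    return False
    return True


r = [
    {"cc": "44", "zip": "EH4", "street": "Mayfield"},
    {"cc": "44", "zip": "EH4", "street": "Mayfield"},
    {"cc": "01", "zip": "07974", "street": "Tree Ave"},
    {"cc": "01", "zip": "07974", "street": "Elm St"},
]
uk_only = [{"cc": "44", "zip": "_", "street": "_"}]
everywhere = [{"cc": "_", "zip": "_", "street": "_"}]
print(satisfies_cfd(r, ["cc", "zip"], ["street"], uk_only))
print(satisfies_cfd(r, ["cc", "zip"], ["street"], everywhere))
```

The tableau with only wildcards is the plain FD, which fails on this
data, while the conditional version restricted to cc = "44" holds.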
2.4. About a unified hierarchy for FDs, CFDs and ARs
The work of [R.Medina et al., 2009] is interesting and original.
The authors have shown a hierarchy between FDs, CFDs and ARs:
FDs are the union of CFDs while CFDs are the union of ARs. The
hierarchy between FDs, CFDs and ARs has many benefits:
algorithms for discovering ARs can be adapted to discover many
other types of data dependencies and, further, to generate a reduced
set of dependencies.
The following remarks and preliminary results were obtained from
studying the work of [R.Medina et al., 2009]:
Remark 2.1. Unlike most authors working on CFDs, [R.Medina et al.,
2009] extend every tp ∈ Tp: these pattern tuples are now defined on
the whole set Attr(R), where tp[A] = * if A ∉ X ∪ Y.
Remark 2.2. Instead of matching a tuple t ∈ r with a tuple tp ∈ Tp
(tp is now defined on Attr(R)), we match t(X) with tp(X) and t(Y)
with tp(Y). More formally, t(X) and tp(X) (respectively t(Y) and
tp(Y)) match if
∀A ∈ X: t(X)[A] = tp(X)[A] = a ∈ Dom(A)
or t(X)[A] = a and tp(X)[A] = _
Remark 2.3. Consider a pattern tuple tp defining a fragment relation
of [R.Medina et al., 2009] as follows:
r_tp = {t ∈ r | tp ≤ t}    (*)
It is clear that the formula (*) is incorrect, because in almost all
cases (*) returns the empty set. Indeed, if tp contains at least one
component _, then there is no t ∈ r such that tp ≤ t. In the opposite
case (tp does not contain the component _) with X ∪ Y ⊂ Attr(R), we
have tp[A] = * and t[A] = a for A ∉ X ∪ Y, so again there is no
t ∈ r such that tp ≤ t. Hence r_tp, as defined by (*), is non-empty
only when X ∪ Y = Attr(R) and tp coincides with a certain tuple t
in r. The expression (*) must therefore be changed to
r_tp = {t ∈ r | t(X ∪ Y) ≍ tp(X ∪ Y)}
[R.Medina et al., 2009] used the following definitions:
- X-complete property. A relation r is said to be X-complete if and
only if ∀t1, t2 ∈ r we have t1[X] = t2[X].
- X-complete pattern: γ(X, r) = ∇{t ∈ r}
- X-complete horizontal decomposition:
RX(r) = {r' ⊆ r | r' is X-complete}
- Set of X-patterns: Γ(X, r) = {γ(X, r') | r' ∈ RX(r)}
- Closure operator:
θ(X, r) = {A ∈ Attr(R) | ∀tp ∈ Γ(X, r), tp[A] ≠ *}
Remark 2.4. Let r be an X-complete relation and r' ⊆ r. Then
γ(X, r') = ∇{t ∈ r'}. Returning to the definition of ∇ on two tuples
t1, t2 ∈ r: using the order relation between the special symbols and
any constant a of an attribute to compute t1 ∇ t2 causes unnecessary
difficulties. Essentially, we only have to compare the respective
components of the two tuples t1 and t2 to know whether they are equal
or not. Therefore, instead of the operator ∇, it is better to use the
simple operator ∇′ defined as follows:
For all t1, t2 ∈ r, t1 ∇′ t2 = t such that ∀A ∈ Attr(R),
t[A] = t1[A] if t1[A] = t2[A]
t[A] = * if t1[A] ≠ t2[A]
The following proposition, which expresses the relationship
between θ(X, r) and X_F^+, is proved in detail in the thesis.
Proposition. Let r be a relation defined over the set of attributes
Attr(R), X ⊆ Attr(R), and let r satisfy the set of FDs F.
Then θ(X, r) = {A ∈ Attr(R) | ∀tp ∈ Γ(X, r), tp[A] ≠ *}
= X_F^+ = {A ∈ Attr(R) | (X → A) ∈ F+}
2.5. Summary of chapter 2
This chapter presents some results relating to FDs and AFDs, the
matrix method for discovering FDs and AFDs, and some preliminary
results relating to a unified hierarchy for FDs, CFDs and ARs.
FD, AFD and CFD are three important types of data dependencies.
Continuing to study and solve problems relating to these three types
is a new and very interesting direction.
The main results of this chapter are published in the works [CT1,
CT2, CT8, CT9].

Chapter 3.
THE CLOSURE COMPUTING ALGORITHMS AND
REDUCING THE KEY FINDING PROBLEM OF
A RELATION SCHEMA
3.1. The closure computing algorithms
3.1.1. The closure concept
Let F be a set of FDs defined over Ω and X ⊆ Ω. We have:
X_F^+ = {A ∈ Ω | (X → A) ∈ F+}
We use the symbol X+ instead of X_F^+ when F is clear from the
context.
3.1.2. Some closure computing algorithms
This section reviews some closure computing algorithms. The main
content is the improvement of the algorithm of Mora et al.
Experimental results show that the algorithm of Mora et al. is more
efficient than other algorithms. However, the correctness of this
algorithm has not been proved. Furthermore, its disadvantage is that
each time F is scanned, every FD whose left and right sides are both
contained in Xnew is still checked on the left side in order to
compute the new value of Xnew (this takes unnecessary time because
Xnew is essentially unchanged).
The improved algorithm avoids this disadvantage because all FDs
whose right side is contained in Xnew are removed before the closure
is computed.
The correctness of the algorithm of Mora et al. (and of the improved
algorithm) is proved in detail in the thesis. Furthermore, the
improved algorithm is shown to be more efficient than that of Mora
et al.

The algorithm of Mora et al
Input: Ω, F, X ⊆ Ω
Output: X+
begin
  Xnew = X;
  repeat
    Xold = Xnew;
    for each Y → Z ∈ F do
      if Y ⊆ Xnew then                  (I)
        Xnew = Xnew ∪ Z;
        F = F - {Y → Z};
      elseif Z ⊆ Xnew then              (II)
        F = F - {Y → Z};
      else                              (III)
        F = F - {Y → Z};
        F = F ∪ {Y-Xnew → Z-Xnew};
      end if;
    end for each;
  until ((Xnew = Xold) or (|F| = 0));
  return(Xnew);
end;

The improved algorithm
Input: Ω, F, X ⊆ Ω
Output: X+
begin
  Xnew = X;
  repeat
    Xold = Xnew;
    for each Y → Z ∈ F do
      if (Z ⊆ Xnew) then                (I)
        F = F - {Y → Z}
      else if (Y ⊆ Xnew) then           (II)
        Xnew = Xnew ∪ Z;
        F = F - {Y → Z}
      else                              (III)
        F = F - {Y → Z};
        F = F ∪ {Y-Xnew → Z-Xnew};
      end if;
    end for each;
  until (Xnew = Xold) or (|F| = 0);
  return(Xnew);
end;
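For readers who prefer executable code, the improved algorithm can
be sketched in Python as follows. This is an illustrative
transcription under assumptions: FDs are pairs of attribute sets, the
function name closure_improved is hypothetical, and the working copy
of F is rebuilt on each pass instead of being modified while it is
scanned.

```python
def closure_improved(x, fds):
    """Closure of attribute set x under fds, following cases (I), (II), (III)."""
    x_new = set(x)
    work = [(frozenset(lhs), frozenset(rhs)) for lhs, rhs in fds]
    while True:
        x_old = set(x_new)
        remaining = []
        for lhs, rhs in work:
            if rhs <= x_new:
                continue                      # (I) nothing to gain: drop the FD
            if lhs <= x_new:
                x_new |= rhs                  # (II) the FD fires, then is dropped
                continue
            # (III) replace the FD by a simpler one without known attributes
            remaining.append((lhs - x_new, rhs - x_new))
        work = remaining
        if x_new == x_old or not work:
            return x_new


# Illustrative set of FDs: CD -> E, B -> C, A -> B
fds = [({"C", "D"}, {"E"}), ({"B"}, {"C"}), ({"A"}, {"B"})]
print(sorted(closure_improved({"A", "D"}, fds)))
```

Testing the right side first means that an FD which cannot add any
attribute is discarded without its left side ever being examined,
which is the saving described in the text.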

3.2. Reducing the key finding problem of a relation schema
3.2.1. Some known results
Let S = <Ω, F> be a relation schema, where Ω = {A1, A2,..., An}
and F = {L1 → R1,..., Lm → Rm | Li, Ri ⊆ Ω, i = 1,...,m}. Let us
denote:
L = ∪_{i=1..m} Li, R = ∪_{i=1..m} Ri, K_S = {Kj | Kj is a key of S},
G = ∩_{Kj ∈ K_S} Kj, H = ∪_{Kj ∈ K_S} Kj and H̄ = Ω \ H.


Theorem 3.1 (Ho Thuan and Le Van Bao, 1985). Let S = <Ω, F> be
a relation schema. Then:
- If X is a key of S then
Ω \ R ⊆ X ⊆ (Ω \ R) ∪ (L ∩ R)    (1)
- We have G = Ω \ R and R \ L ⊆ H̄


The concepts of core and body were proposed by P. Cordero et al. in
2013. Let S = <Ω, F> be a relation schema. Then the core and body of
S are defined as follows:
core(Ω, F) = Ω \ ( ∪_{(Li → Ri) ∈ F} Ri ) and
body(Ω, F) = ( ∪_{(Li → Ri) ∈ F} Li ) ∩ [Ω \ core(Ω, F)+]
It is easily seen that
core(Ω, F) = Ω \ R and body(Ω, F) = L ∩ [Ω \ (Ω \ R)+]
Theorem 3.2 (Mora et al, 2011). Let S = <Ω, F> be a relation
schema and K be a (minimal) key of S. Then we have:
core ⊆ K ⊆ (core ∪ body), that is,
Ω \ R ⊆ K ⊆ (Ω \ R) ∪ [L ∩ [Ω \ (Ω \ R)+]]    (2)
It is clear that (2) is a necessary condition for K to be a key of S.
3.2.2. An improved form of the necessary condition (1)
Based on (1) and the familiar semantics of FD in the relational
model, we have the following theorem:
Theorem 3.3. Let S = <Ω, F> be a relation schema. Then
Ω \ R ⊆ K ⊆ (Ω \ R) ∪ [(L ∩ R) \ (R ∩ (Ω \ R)+)], ∀K ∈ K_S    (3)
It is clear that (3) is an improved form of (1).
3.2.3. Comparing the necessary conditions
Theorem 3.4. The two necessary conditions (2) and (3) are actually
equivalent expressions.
Theorem 3.5 (Ho Thuan et al, 1996). Let S = <Ω, F> be a relation
schema and K be a key of S. Then:
(Ω \ R) ⊆ K ⊆ (Ω \ R) ∪ [(L ∩ R) \ (Ω \ R)+]    (4)

The following theorem shows the relationship between (2) and (4).
Theorem 3.6. The necessary condition (2) actually coincides with
the necessary condition (4).
Theorems 3.3, 3.4 and 3.6 are proved in detail in the thesis.
Intuitively, the purpose of the key finding problem is to find all
keys K of S = <Ω, F>, and we always know that all keys K are contained
in Ω. Searching for keys in the universe set Ω is not effective,
because Ω is the largest superkey containing all keys. Therefore, it
is important to find a superkey Z (with as few attributes as possible)
containing all keys of S such that Z ⊂ Ω. If such a set Z is found,
then finding keys in Z instead of in Ω will be simpler.
The necessary conditions (1), (2), (3) and (4) are restated below;
they show the general structure of all keys K of S, and the
right-hand sets are superkeys containing all keys of S.
Ω \ R ⊆ K ⊆ (Ω \ R) ∪ (L ∩ R)    (1)
Ω \ R ⊆ K ⊆ (Ω \ R) ∪ [L ∩ [Ω \ (Ω \ R)+]]    (2)
Ω \ R ⊆ K ⊆ (Ω \ R) ∪ [(L ∩ R) \ (R ∩ (Ω \ R)+)]    (3)
Ω \ R ⊆ K ⊆ (Ω \ R) ∪ [(L ∩ R) \ (Ω \ R)+]    (4)
In the thesis, we show that the right-hand set of (2) is better than
that of (1) and prove that (2), (3) and (4) are equivalent
expressions.
As analyzed, we want the right-hand sets to have as few attributes
as possible (the smaller the better). This is obviously related to
reducing the key finding problem. Indeed, suppose that we have Z (a
proper subset of Ω) which contains all keys of S = <Ω, F>. Then the
reduction of the key finding problem of S is performed through the
following steps:
Step 1. Creating the schema S' = <Ω', F'> where
Ω' = Z \ (Ω \ R)
and F' = {Li ∩ Ω' → Ri ∩ Ω' | (Li → Ri) ∈ F, i = 1, 2,..., m}
Step 2. Finding K_S' by a certain algorithm.
Step 3. Finding K_S = {(Ω \ R) ∪ K | K ∈ K_S'}.
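The three steps can be sketched as follows, taking for Z the
right-hand set of condition (4). The brute-force key search used in
Step 2 is only a stand-in for "a certain algorithm"; all function
names and the sample schema are assumptions made for illustration.

```python
from itertools import combinations


def closure(attrs, fds):
    """Plain fixpoint closure of attrs under fds (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def keys_bruteforce(omega, fds):
    """All minimal keys, found by testing subsets in order of increasing size."""
    omega = frozenset(omega)
    keys = []
    for size in range(len(omega) + 1):
        for cand in combinations(sorted(omega), size):
            cand = frozenset(cand)
            if any(k <= cand for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(cand, fds) >= omega:
                keys.append(cand)
    return set(keys)


def keys_reduced(omega, fds):
    """Steps 1-3 with Z = (omega - R) | ((L & R) - closure(omega - R))."""
    omega = frozenset(omega)
    left = frozenset().union(*(lhs for lhs, _ in fds))
    right = frozenset().union(*(rhs for _, rhs in fds))
    core = omega - right                                  # omega \ R
    z = core | ((left & right) - closure(core, fds))
    omega2 = z - core                                     # Step 1: reduced attribute set
    fds2 = [(lhs & omega2, rhs & omega2) for lhs, rhs in fds]
    reduced_keys = keys_bruteforce(omega2, fds2)          # Step 2
    return {core | k for k in reduced_keys}               # Step 3


fs = frozenset
omega = fs("ABCDE")
fds = [(fs("B"), fs("C")), (fs("C"), fs("B")), (fs("AB"), fs("D")), (fs("D"), fs("E"))]
print(sorted("".join(sorted(k)) for k in keys_reduced(omega, fds)))
```

On this schema the search in Step 2 runs over three attributes
instead of five, and the keys found agree with a direct search over
the whole attribute set.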
3.3. Summary of Chapter 3
In experiments, the closure computing algorithm of Mora et al. is
more effective than other algorithms. However, this algorithm still
has some limitations and its correctness had not been proven. We
prove the correctness of the algorithm and improve it.
The improved algorithm is more efficient than that of Mora et al.
because it replaces FDs with simpler FDs during the computation; in
many cases, the closure computation and the set F become much simpler
because all FDs whose right side is contained in Xnew are removed
before the closure is built.
For the reduction of the key finding problem, based on the familiar
semantics of FD in the relational data model, we improve the
necessary condition (1) to obtain the necessary condition (3), and
prove that (2), (3) and (4) are actually equivalent expressions.
Finding a necessary condition which is better than (2), (3) or (4),
in order to further reduce the key finding problem, is a very
interesting problem.
The main results of this chapter are published in the works [CT3,
CT4, CT6, CT7].

Chapter 4.
ABOUT AN EFFICIENT PREPROCESSING
TRANSFORMATION FOR SETS OF FUNCTIONAL
DEPENDENCIES
4.1. Introduction
Let r be a relation defined on Ω. Each assertion of the form X→Y,
where X, Y ⊆ Ω, is called an FD in r. We say r satisfies X→Y if for
all t1, t2 ∈ r such that t1[X] = t2[X] we have t1[Y] = t2[Y].
4.2. Redundancy in a set of FDs
Given a set of FDs F, the notation F ⊢ (Z → W) indicates that
Z → W can be derived from F through Armstrong's inference rules.
Consider an FD f = X→Y ∈ F:
We say that f is redundant in F if F \ {f} ⊢ f.
We say that f is l-redundant in F if there exists Z ≠ ∅, Z ⊆ X such
that (F \ {f}) ∪ {(X − Z)→Y} ⊢ f
We say that f is r-redundant in F if there exists U ≠ ∅, U ⊆ Y such
that (F \ {f}) ∪ {X→(Y − U)} ⊢ f
We say that F has redundancy if it has an element which is
redundant or l-redundant or r-redundant.
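The three kinds of redundancy can be tested with attribute closures,
as in the hedged sketch below. One reading is an assumption: for
l-redundancy the code checks that the shortened FD (X − Z)→Y is
itself derivable from F, since f always follows from the shortened FD
by reflexivity and transitivity. In both the l- and r-cases it is
enough to try single attributes, because removing fewer attributes
only makes the condition easier to meet. All names and the sample
set are illustrative.

```python
def closure(attrs, fds):
    """Plain fixpoint closure of attrs under fds (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_redundant(fds, i):
    """f = fds[i] is derivable from the other FDs."""
    lhs, rhs = fds[i]
    rest = fds[:i] + fds[i + 1:]
    return rhs <= closure(lhs, rest)


def is_l_redundant(fds, i):
    """Some attribute of the left side can be dropped (assumed reading)."""
    lhs, rhs = fds[i]
    return any(rhs <= closure(lhs - {a}, fds) for a in lhs)


def is_r_redundant(fds, i):
    """Some attribute of the right side can be dropped."""
    lhs, rhs = fds[i]
    rest = fds[:i] + fds[i + 1:]
    return any(rhs <= closure(lhs, rest + [(lhs, rhs - {a})]) for a in rhs)


fs = frozenset
# A -> B, B -> C, A -> C, AB -> D, A -> BE
fds = [(fs("A"), fs("B")), (fs("B"), fs("C")), (fs("A"), fs("C")),
       (fs("AB"), fs("D")), (fs("A"), fs("BE"))]
print(is_redundant(fds, 2), is_l_redundant(fds, 3), is_r_redundant(fds, 4))
```

In the sample set, A → C follows from A → B and B → C, the attribute
B is superfluous in AB → D, and B is superfluous on the right of
A → BE.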
4.3. An efficient preprocessing transformation for sets of FDs
Mora et al. have designed an efficient preprocessing transformation
based on the substitution paradigm to eliminate redundancy in a set
of FDs. The basis and correctness of this preprocessing
transformation is theorem 4.1. In this section, we show an
unacceptable error in the proof of theorem 4.1 and give a new,
correct and simpler proof for that theorem; these findings have been
confirmed by Mora.

4.3.1. The Paredaens Logic
- The Par language: Par = {X→Y | X, Y ∈ 2^Ω and X ≠ ∅}
- The Paredaens Logic (LFD) is the logic given by the pair (Par,
SPar), where SPar has an axiom schema
AxPar: ⊢SPar X→Y if Y ⊆ X
and the following inference rules:
Trans:  X→Y, Y→Z ⊢SPar X→Z   (Transitivity Rule)
Augm:   X→Y ⊢SPar X→XY   (Augmentation Rule)
Union:  X→Y, X→Z ⊢SPar X→YZ   (Union Rule)
Comp:   X→Y, W→Z ⊢SPar XW→YZ   (Composition Rule)
Inters: X→Y, X→Z ⊢SPar X→Y∩Z, where Y ∩ Z ≠ ∅   (Intersection Rule)
Reduc:  X→Y ⊢SPar X→Y−Z, where Y − Z ≠ ∅   (Reduction Rule)
Frag:   X→YZ ⊢SPar X→Y   (Fragmentation Rule)
gAug:   X→Y ⊢SPar U→V, where X ⊆ U and V ⊆ XY
        (Generalized Augmentation Rule)
gTrans: X→Y, Z→U ⊢SPar V→W, where Z ⊆ XY, X ⊆ V and W ⊆ UV
        (Generalized Transitivity Rule)


Theorem 4.1 (P.Cordero et al.):
Let X→Y, U→V ∈ LFD be such that X ∩ Y = ∅.
(a). If X ⊆ U then
{X→Y, U→V} ≡SPar {X→Y, (U − Y)→(V − Y)}    (1)
(b). If X ⊄ U and X ⊆ UV then
{X→Y, U→V} ≡SPar {X→Y, U→(V − Y)}    (2)
The value and novelty of theorem 4.1 is that it gives two important
substitution rules. Clearly, no other axiomatic system for FDs
includes such rules, which can detect and remove redundancy in a set
of FDs effectively.
The proof of theorem 4.1 was given by P.Cordero et al. Consider
their proof of (b) in the direction ⇐:
1. U → X    (AxFD)
2. X → Y    (Hypothesis)
3. U → Y    (1, 2, Transitivity)
4. U → (V − Y)    (Hypothesis)
5. U → VY    (3, 4, Union)
6. U → V    (2, 5, Generalized Augmentation)
Next, consider step 1:
1. U → X    (AxFD)
This assertion is obviously wrong, because the hypotheses of (b) in
theorem 4.1 are X ⊄ U, X ⊆ UV and X ∩ Y = ∅, while the axiom yields
U → X only when X ⊆ U.
4.3.2. A new proof for theorem 4.1
To simplify the proof of theorem 4.1, we use the system of three
Armstrong's axioms A1, A2, A3 and the following two rules:
- If X → Y and U → V then XU → YV    (Union Rule)
- If X → Y then X → Z for all Z ⊆ Y    (Decomposition Rule)
Proof.
(a). ⇒
Since X ⊆ U, we have X − Y ⊆ U − Y.
Since X ∩ Y = ∅, we have X − Y = X ⊆ U − Y.
Therefore, we have:
1. (U − Y) → X    (A1)
2. X → Y    (Hypothesis)
3. (U − Y) → Y    (1, 2, A3)
4. (U − Y) → (U − Y)    (A1)
5. (U − Y) → UY    (3, 4, Union)
6. (U − Y) → U    (5, Decomposition)
7. U → V    (Hypothesis)
8. (U − Y) → V    (6, 7, A3)
9. (U − Y) → (V − Y)    (8, Decomposition)
(a). ⇐
1. X → Y    (Hypothesis)
2. (U − Y) → (V − Y)    (Hypothesis)
3. U → X    (A1, since X ⊆ U)
4. U → Y    (3, 1, A3)
5. U → VY    (2, 4, Union)
6. U → V    (5, Decomposition)
(b). ⇒
1. U → V    (Hypothesis)
2. U → (V − Y)    (1, Decomposition)
(b). ⇐
1. X → Y    (Hypothesis)
2. U → (V − Y)    (Hypothesis)
3. U → U(V − Y)    (2, A2)
4. U(V − Y) → (UV − Y)    (A1)
   since U ∪ (V − Y) ⊇ (U − Y) ∪ (V − Y)
   and (U − Y) ∪ (V − Y) = UV − Y
5. U → (UV − Y)    (3, 4, A3)
6. (UV − Y) → X    (A1)
   since X ⊆ UV and X ∩ Y = ∅, so X = (X − Y) ⊆ UV − Y
7. (UV − Y) → Y    (6, 1, A3)
8. U → Y    (5, 7, A3)
9. U → UVY    (5, 8, A2)
10. U → V    (9, Decomposition)


In the new proof of theorem 4.1, the proof of (a) is essentially the
same as the original proof. The difference lies in the way the
reasoning steps are explained. Moreover, the new proof uses
Armstrong's axioms, so the steps are simpler and clearer.
To overcome the error in the proof of (b) of theorem 4.1, our proof
of (b) is completely new. It makes theorem 4.1 (a very good theorem,
serving as the basis for the preprocessing transformation which
effectively eliminates redundancy in a given set of FDs) sound and
usable.
In practice, in many cases, for simplicity, we can use the
following substitution rule:
If X ⊄ U, X ⊆ V and X ∩ Y = ∅ then
{X → Y, U → V} ≡ {X → Y, U → (V − Y)}
(3)
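The two substitution rules of theorem 4.1 can be sketched as a
function that rewrites U → V with the help of X → Y, together with an
equivalence check based on attribute closures. The names substitute
and equivalent and the sample FDs are hypothetical; the sketch only
illustrates the rules and is not the preprocessing transformation of
Mora et al.

```python
def closure(attrs, fds):
    """Plain fixpoint closure of attrs under fds (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def equivalent(f, g):
    """Two sets of FDs are equivalent if each implies every FD of the other."""
    return (all(rhs <= closure(lhs, g) for lhs, rhs in f)
            and all(rhs <= closure(lhs, f) for lhs, rhs in g))


def substitute(fd1, fd2):
    """Rewrite fd2 = U -> V using fd1 = X -> Y; return None if no rule applies."""
    (x, y), (u, v) = fd1, fd2
    if x & y:
        return None                  # the theorem requires X and Y to be disjoint
    if x <= u:
        return (u - y, v - y)        # rule (a)
    if x <= (u | v):
        return (u, v - y)            # rule (b): X not inside U, but inside UV
    return None


fs = frozenset
xy = (fs("A"), fs("B"))                              # A -> B
case_a = substitute(xy, (fs("ABC"), fs("BD")))       # ABC -> BD
case_b = substitute(xy, (fs("C"), fs("AB")))         # C -> AB
print(case_a, case_b)
```

In the first case ABC → BD shrinks to AC → D, and in the second
C → AB shrinks to C → A; in both cases the new pair of FDs is
equivalent to the original pair.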

The experimental results on many sets of FDs of different quantities
and sizes indicate that the percentages of application of the
substitution rules are very high and increase substantially with the
complexity of the sets of FDs.
- In 28.25% of the sets of FDs, it is not necessary to apply the
transitivity rule, and the preprocessing transformation eliminates
redundancy efficiently.
- The size of the sets of FDs has been reduced by 52.89%.
- When the number of attributes increases, the number of cases in
which it is not necessary to apply the transitivity rule (A3) also
increases. This demonstrates that the substitution rules are
especially suitable for dealing with large database schemas.
- The percentages of application of the substitution rules are
independent of the number of attributes and of the length of the
FDs.
4.4. Summary of Chapter 4
Redundancy unnecessarily increases the space needed to store data,
causes data inconsistencies and reduces the efficiency of the
management and exploitation of database systems.
The preprocessing transformation for eliminating redundancy in sets
of FDs is new and very effective. The basis of this preprocessing
transformation is theorem 4.1. Unfortunately, the published proof of
(b) of theorem 4.1 is wrong. In this chapter, we give a new proof for
theorem 4.1 and a simple substitution rule which is easy to apply in
practice. This makes theorem 4.1 sound and usable.
Creating new substitution rules for preprocessing sets of FDs is
also an interesting research direction.
The main results of this chapter are published in the work [CT5].


CONCLUSIONS
The thesis presents an overview of FD and RFD in the relational
data model and studies the closure computing algorithm for a set of
attributes, the reduction of the key finding problem of a relation
schema, an efficient preprocessing transformation for sets of FDs,
and AFDs and CFDs.
The results of the thesis are summarized as follows:
- Some results about FDs and AFDs (showing the relationships between
the results of two works, proving some lemmas, finding some new
relationships between the approximate measures usually used for
discovering AFDs, proposing a new approach using matrices to
discover FDs and AFDs) and some preliminary results about a unified
hierarchy for FDs, CFDs and ARs (the appropriate adjustment of the
expression which defines a fragment relation; the improvement of the
pattern intersection operator; the proof of the relationship
θ(X, r) = X_F^+).
- Proposing an improved algorithm for computing the closure of a
set of attributes under a set of FDs.
- For the reduction of the key finding (discovering) problem,
based on the familiar semantics of FDs in the relational model, we
improve a necessary condition (for a subset of attributes to be a key
of a relation schema) and prove that the three necessary conditions
are actually equivalent expressions.
- Showing a serious error in the proof of a theorem which is the
basis of an efficient preprocessing transformation for eliminating
redundancy in a set of FDs. Moreover, we give a new, correct and
simpler proof for that theorem, as well as a simple substitution
rule which is easy to apply in practice.

