Applying combinatorial techniques to two problems in computational biology

Copyright
by
Chen Wei
2004


The Dissertation Committee for Chen Wei
certifies that this is the approved version of the following dissertation:

APPLYING COMBINATORIAL
TECHNIQUES TO TWO PROBLEMS IN
COMPUTATIONAL BIOLOGY

Committee:

Dr. Sung Wing-kin, Supervisor


APPLYING COMBINATORIAL
TECHNIQUES TO TWO PROBLEMS IN
COMPUTATIONAL BIOLOGY
by

Chen Wei, B.Sc., Shanghai JiaoTong University

Thesis
Presented to the Faculty of the Graduate School of
National University of Singapore
in Partial Fulfillment
of the Requirements
for the Degree of



Master of Science

National University of Singapore
January 2004


To my beloved parents


Acknowledgments
I would like to take this opportunity to thank all those who have helped me
during my pursuit of the Master's degree.
First of all, I want to express my sincere thanks to my supervisor, Dr.
Sung Wing-kin. When I started my Master's study, I knew almost nothing
about either the field I was going to work in or the methods used in doing
research. It was Dr. Sung who helped me in every aspect throughout my
whole study. His diligent attitude and perseverance will encourage me forever.
I would also like to thank my good friends in the Computational
Biology lab. They provided me with a relaxed and free working environment.
When I met problems, they were the people I most wanted to discuss them with,
not only because they have the same background as me, but also because they are
really warm-hearted.
Last, and most of all, I wish to thank my parents for their endless support.
They are the ones who were proud of me when I succeeded, they are
the ones who encouraged me never to lose heart when I met difficulties, and
they are the ones who taught me by their own example how to become a real person.



My beloved parents, I dedicate this thesis to you.

Chen Wei
National University of Singapore
January 2004


APPLYING COMBINATORIAL
TECHNIQUES TO TWO PROBLEMS IN
COMPUTATIONAL BIOLOGY

Chen Wei, M.Sc.
National University of Singapore, 2004

Supervisor: Dr. Sung Wing-kin

Computational biology is one of the fastest-growing research areas today.
The homology searching problem and the motif-finding problem are two important
problems in this area, since they are related to many critical applications, such
as the Human Genome Project and the Genome to Life Project.
For the homology searching problem, the most popular tools in use today
are BLAST-like tools. Although they are successful in performing homology
search, they have difficulty increasing efficiency and sensitivity
simultaneously with the original searching pattern. To address this problem,
a new type of searching pattern was recently introduced, together with a new
searching programme known as PatternHunter. But this programme is not
flexible enough to allow fine tuning between the sensitivity and the efficiency
of the search. In our work, we propose a new searching pattern that aims to
solve this problem, and it proves to be successful. The result is presented
in the paper On Half Gapped Seeds, GIW 2003.
For the motif-finding problem, there has been quite a lot of previous research.
However, the state of the art is still far from handling the realistic setting,
that is, recovering the motifs from corrupted biological data. Apart
from this, most existing algorithms also suffer from long execution times and
incomplete outputs. This thesis presents a new algorithm that addresses the
above difficulties while computing a complete set of all motifs within a
reasonable period of time.
Keywords:
half match, half gapped seeds, motif, partial candidate, partial motif


Contents
Acknowledgments
Abstract
List of Tables
List of Figures
Summary
Chapter 1 Introduction
Part I Research on Homology Search Problem
Chapter 2 Background Knowledge
Chapter 3 What is a Half Seed?
Chapter 4 Half Seeds vs Gapped Seeds
Chapter 5 Further Study of Half Seeds
    5.1 The number of ‘half match’ positions
    5.2 The definition of neighbor nucleotides
    5.3 The number of ‘don’t care’ positions
    5.4 The usage of the 3 key parameters
Chapter 6 Conclusion
Part II Research on Motif-finding Problem
Chapter 7 Background Knowledge
    7.1 What is Motif-finding Problem
    7.2 Our Contributions
Chapter 8 Related Works
Chapter 9 A New Algorithm for Motif-finding Problem
    9.1 Problem Statement
    9.2 Brute Force Algorithm
    9.3 The New Idea
    9.4 Preliminary
    9.5 The Core Algorithm
        9.5.1 The Outline of Our Algorithm
        9.5.2 Algorithm Generate-Motifs
        9.5.3 Algorithm Generate
    9.6 Analysis of Our Algorithm
        9.6.1 Handle the Real Difficulty
        9.6.2 Analysis of Space Usage
        9.6.3 Analysis of Time Complexity
        9.6.4 Determine the Parameter k
    9.7 Conclusion
Chapter 10 Benchmark and Experiments
    10.1 Benchmark
    10.2 Finding Regulatory Patterns in DNA Sequences
    10.3 Discussions
    10.4 Conclusion
Chapter 11 Concluding Remarks
    11.1 Conclusion
    11.2 Future Works
Bibliography


List of Tables
5.1 The top three sensitivities for “half gapped seeds” differing only
    in the number of ‘half match’ positions when the similarity between
    query and database sequences is 0.6
5.2 The top three sensitivities for “half gapped seeds” using different
    neighbor nucleotides definition when the similarity of query and
    database sequence is 0.6
5.3 The top three sensitivities for “half gapped seeds” having different
    number of ‘don’t care’ positions when the similarity between
    query and database sequence is 0.6
10.1 Biology data experiment


List of Figures
4.1 Comparison on Sensitivity between Weight 6,7 Optimal Gapped
    Seed and Half Seeds
4.2 Comparison on Sensitivity between Weight 6,7 Optimal Gapped
    Seed and Half Seeds
5.1 Comparison on the expected number of hits between “half gapped
    seeds” differing only in the number of ‘half match’ positions on
    64-bits regions
5.2 Comparison on the expected numbers of hits between “half gapped
    seeds” differing only in neighbor nucleotides definition on
    64-bits regions
5.3 Comparison on sensitivities between the four listed “half gapped
    seeds” and the optimal weight 6 and 7 “gapped seeds” on 64-bits
    regions
5.4 Comparison on efficiencies between the four listed “half gapped
    seeds” and the optimal weight 6 and 7 “gapped seeds” on 64-bits
    regions

Summary
Computational biology has developed rapidly and has become one of
the most challenging and attractive research areas. Although quite a lot of
problems have been solved in the last two decades, more and more
new problems are being discovered and are waiting to be solved. Homology
searching and motif finding are probably two of the hottest problems. The
first relates to recognizing the structure of genomes, and the second
relates to identifying functional units in genes. Both play
important roles in many critical biological research efforts such as the Human
Genome Project.
In this thesis, we present two studies, on the homology searching problem and
the motif-finding problem respectively.
For the first, we propose a new type of searching pattern for Blast-like
searching tools. With the help of this new pattern, we can considerably increase
the efficiency and sensitivity of the search compared with the
original pattern. Our new pattern also supports
fine tuning between sensitivity and efficiency to meet different requirements.
For the second, we propose a new algorithm for the motif-finding problem.
Compared with current algorithms, it is more efficient and is able to
give high-quality results. In addition, our new algorithm can handle the real
difficulty that all the current algorithms fail to handle.
We give thorough discussions in both parts based on the experimental
results. We also point out directions for possible improvements to the
current approaches.


Chapter 1
Introduction
Computational biology promises to be a challenging and exciting area
over the next several decades. With the rapid development of this area, many
biological problems have been solved using computational methods.
Advances in biological technology have already pushed research to
the level of the genome, the gene, or even the motif. Biologists want precise
research tools for their projects, so there is a growing demand for tools that
work at these finer levels. Two kinds of tools are particularly important in
this respect: homology searching tools and motif-finding tools.
We carried out in-depth research on both topics, and we present the two
studies in turn in this thesis.
This thesis is organized as follows. Part I introduces our work on
homology searching. Chapter 2 gives the background knowledge of the homology
searching problem. Chapter 3 provides the definitions needed to understand
the new idea we use. Chapter 4 compares our proposed new searching pattern
with the current best searching patterns. Chapter 5 provides the experimental
results of our searching pattern together with a thorough study of the
key parameters involved in the pattern. We conclude our work on homology
searching in Chapter 6.
In Part II, we present our research on the motif-finding problem. Chapter 7
gives the background knowledge of the motif-finding problem.
Chapter 8 summarizes current algorithms for this problem and points out
their shortcomings. Chapter 9 first shows how our new idea resolves the
bottleneck of the motif-finding problem, then presents the complete
algorithm based on this idea, and finally provides an in-depth analysis
of the proposed algorithm, covering its time complexity, its space usage, and
how to determine its key parameters. We give the experimental results in
Chapter 10, with a discussion based on them. Chapter 11 concludes the work
and outlines plans for future work.


Part I
Research on Homology Search
Problem


Chapter 2
Background Knowledge
Homology search is the problem of locating approximate matches within
one DNA sequence or between two sequences. This problem has many applications
in biology, and finding faster and more sensitive methods for homology
search has attracted a lot of research.
The first solution to the homology search problem was contributed by
Smith and Waterman [1]. Their method is dynamic programming in nature
and compares every base in the first sequence with every base in the other
sequence to generate a precise local alignment. Although this method gives
the most sensitive solution, it is also the slowest one. To improve
efficiency without losing too much sensitivity, many ideas have been proposed.
Among them, FASTA [3], SIM [4], the Blast family (Altschul [2]; Gish [5];
Altschul [6]; Zhang [7]; Tatusova and Madden [8]), Blat [14], SENSEI [9],
MUMmer [10], QUASAR [11], REPuter [12] and PatternHunter [15] are the
most famous ones. All of these methods can be divided into two major tracks.
The first track is represented by MUMmer [10], QUASAR [11] and REPuter [12],
which use suffix trees [13]. Two major problems make them less
popular. First, although a suffix tree is good at dealing with exact matches,
it is not good at finding approximate matches. Therefore, methods based on
suffix trees can normally only find matches with high homology. Second, a
suffix tree is very large, so methods based on suffix trees suffer from
storage limitations.
The second track is represented by Blast, which is probably the most
widely used approach now. The basic idea is to first find short exact matches
(hits) in the whole sequence, which are then extended into longer alignments
through a dynamic programming process. FASTA [3], SIM [4], Blastn [8],
WU-Blast [5], and Psi-Blast [6] encounter space and efficiency problems when
they are used to compare relatively long sequences. SENSEI [9] is much faster
and uses less working space, though it cannot produce gapped alignments.
Blat [14] is a Blast-like homology searching tool that returns results very
quickly but is limited by its high similarity requirement. MegaBlast [7] is
the most efficient member of the Blast family, but its output is also rough.
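As a minimal sketch of the dynamic programming introduced by Smith and Waterman [1], the following Python function computes the best local alignment score of two sequences. The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions and are not parameters taken from any of the tools cited above.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The double loop visits every pair of bases from the two sequences, which is why this approach is the most sensitive but also the slowest.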
Blast-type methods all face an inevitable dilemma caused by the length
of the exact-match hit: a longer exact-match hit increases the efficiency
but reduces the sensitivity, while a shorter one gives better sensitivity but
prolongs the execution time.
Ma et al. proposed PatternHunter [15] to resolve this dilemma.
They introduced a new idea, the gapped seed, which is used to seek
nonconsecutive short matches. The total number of these nonconsecutive match
positions is called the weight of the seed. Once these matches are found, they
are extended to longer alignments by dynamic programming. According to their
experimental results, “gapped seeds” can reach both higher efficiency and
better sensitivity than Blast’s original consecutive seeds.
Depending on the application, we sometimes require better sensitivity and
can tolerate a small decrease in efficiency. “Gapped seeds” allow us to
perform such tuning only by changing the weight. More precisely, reducing the
weight of a “gapped seed” brings better sensitivity, but at a large cost
in efficiency. In other words, “gapped seeds” cannot provide
fine-grained tradeoff choices. For example, when we reduce the weight from
7 to 6, the sensitivity improves from 0.8 to 0.9 when the two sequences
have 0.6 similarity, but at the same time the searching time is prolonged by
a factor of 4. Such tuning is too coarse for many applications. Therefore,
we would like to ask whether there is a better way to handle the
tradeoff between sensitivity and efficiency.
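One way to see where a factor of 4 can come from is the following back-of-the-envelope calculation. It assumes, purely for illustration, that the four nucleotides are uniformly and independently distributed in unrelated sequences, and that the searching time is dominated by random hits; the symbols w (seed weight), m and n (sequence lengths), and E_w (expected number of random hits) are introduced here and are not notation from the thesis.

```latex
\[
  \Pr[\text{random hit at a given pair of positions}] = \left(\tfrac{1}{4}\right)^{w},
  \qquad
  E_{w} \approx m\,n\,4^{-w},
  \qquad
  \frac{E_{6}}{E_{7}} = \frac{m\,n\,4^{-6}}{m\,n\,4^{-7}} = 4 .
\]
```

Under these assumptions, each unit decrease in weight multiplies the expected number of random hits by 4, so changing the weight alone can only adjust the workload in steps of this size.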
This thesis gives a positive answer to this question. We propose a new
type of seed called the “half seed”. This new type of seed is a generalization
of the gapped seed and will be defined in detail in Chapter 3. Like
the gapped seed, the half seed is better than the existing consecutive seed
in both sensitivity and efficiency. Moreover, half seeds provide a more
flexible tradeoff between speed and sensitivity. They are particularly useful
in cases where we cannot afford a big jump in either efficiency or sensitivity.
This part is organized as follows. Chapter 3 gives all the definitions needed
to fully understand what a half seed is. We also give a
convenient notation for the different classes of seeds, which is used
throughout this part. Chapter 4 compares half seeds with gapped
seeds in terms of both sensitivity and efficiency through a series of
experiments. The results show that half seeds offer more flexible choices of
tradeoff between sensitivity and efficiency than gapped seeds. In Chapter 5, we
examine the impact on sensitivity and efficiency of changing the parameters
of our new seeds. From those results, we obtain a basic idea of how
to tune the tradeoff for “half seeds”.


Chapter 3
What is a Half Seed?
Before describing our new seeds, let us briefly review the seeds used
in the Blast family and PatternHunter. These seeds can be represented as
0-1 strings of length L. The 0s and 1s correspond to two kinds of
positions, ‘match’ positions and ‘don’t care’ positions, defined as follows.
Definition 1 Consider two length L substrings S and S′ from the query sequence
and the database sequence, respectively. Suppose position i of the seed
is 1, which is called a ‘match’ position. Then (S, S′) is said to have a
match at position i if S[i] = S′[i].
Definition 2 Consider two length L substrings S and S′ from the query sequence
and the database sequence, respectively. Suppose position i of the seed
is 0, which is called a ‘don’t care’ position. Then (S, S′) is said to have a
match at position i regardless of whether S[i] = S′[i].
Definition 3 For a length L seed, we say there is a hit when two length L
substrings from the query and database sequences match at all the corresponding
positions in the seed.
Definition 4 We call a seed that contains only ‘match’ positions a
“consecutive seed”. We call a seed that contains both ‘match’ positions and
‘don’t care’ positions a “gapped seed”.
Blast uses the “consecutive seed” 11111111111, which means that
every pair of length 11 substrings from the query and database sequences must
be identical at all 11 ‘match’ positions to produce a hit. PatternHunter
uses the “gapped seed” 110100110010101111, which means that there is a hit
for a pair of length 18 substrings from the query and database sequences when
they are identical at the 11 ‘match’ positions, regardless of the characters at
the 7 ‘don’t care’ positions.
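To make Definitions 1 to 4 concrete, the following Python sketch enumerates the hits of a seed given as a 0-1 string. The function name seed_hits and the use of a hash index keyed on the characters at the ‘match’ positions are illustrative assumptions; this is not the implementation used by Blast or PatternHunter.

```python
from collections import defaultdict


def seed_hits(seed, query, database):
    """Return all (i, j) such that query[i:i+L] and database[j:j+L]
    agree at every '1' (match) position of the length-L seed.

    Hypothetical helper for illustration only.
    """
    span = len(seed)
    care = [p for p, c in enumerate(seed) if c == "1"]  # 'match' positions

    def key(s, start):
        # Characters at the 'match' positions; 'don't care' positions are skipped.
        return "".join(s[start + p] for p in care)

    # Index every length-L window of the database by its key.
    index = defaultdict(list)
    for j in range(len(database) - span + 1):
        index[key(database, j)].append(j)

    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(key(query, i), ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    gapped = "110100110010101111"  # the seed quoted above
    print(len(gapped), gapped.count("1"))  # span and weight
    print(seed_hits("101", "ACG", "ATG"))  # mismatch falls on the 'don't care' position
    print(seed_hits("111", "ACG", "ATG"))  # consecutive seed: no hit
```

In this sketch a consecutive seed is simply the special case in which every position is 1, so the same function covers both classes of seeds from Definition 4.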
Having reviewed the seeds used in Blast and PatternHunter,
we now introduce our new seeds. We start with a fundamental
definition, that of a ‘neighbor nucleotide’.
Definition 5 Recall that every DNA sequence is composed of 4 different
nucleotides, N = {A, C, G, T}. For every nucleotide x ∈ N, neig{x}
is a predefined subset of N − {x}, which represents the set of neighbor
nucleotides of x. When |neig{x}| = 2, we call it the ‘two neighbor’ definition,
and when |neig{x}| = 1, we call it the ‘one neighbor’ definition.
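Definition 5 can be sketched in Python as follows. The helper name and the example mapping are hypothetical: the mapping pairs each nucleotide with its transition partner (A with G, C with T) purely for illustration, and it is not claimed to be the neighbor definition adopted in this thesis.

```python
NUCLEOTIDES = frozenset("ACGT")  # the set N of Definition 5


def neighbor_definition(neig):
    """Check that neig[x] is a subset of N - {x} for every x in N and
    report whether it is a 'one neighbor' or 'two neighbor' definition.

    Hypothetical helper for illustration only.
    """
    sizes = set()
    for x in NUCLEOTIDES:
        neighbors = set(neig[x])
        if not neighbors <= NUCLEOTIDES - {x}:
            raise ValueError(f"neig{{{x}}} must be a subset of N - {{{x}}}")
        sizes.add(len(neighbors))
    if sizes == {1}:
        return "one neighbor"
    if sizes == {2}:
        return "two neighbor"
    return "other"


# Assumed example mapping (not from the thesis): transition partners.
EXAMPLE_ONE_NEIGHBOR = {"A": "G", "G": "A", "C": "T", "T": "C"}

if __name__ == "__main__":
    print(neighbor_definition(EXAMPLE_ONE_NEIGHBOR))
```

Any other assignment that satisfies the subset condition would serve equally well as an example; the choice only matters once the neighbor sets are used inside a seed.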
