MINISTRY OF EDUCATION
AND TRAINING

VIETNAM ACADEMY OF
SCIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
-------------------------------

DINH DUC LUONG

BIOINFORMATICS XML DOCUMENTS INDEX
METHOD BASED ON R-TREE METHOD

Major: Mathematical Foundations for Computer Science
Code: 9 46 01 10

SUMMARY OF MATHEMATICS DOCTORAL THESIS

Ha noi, 2019

List of works of author
1. Dinh Duc Luong, Hoang Do Thanh Tung, “A Survey on Indexing for Gene
Database”, International Clustering Workshop: Teaching, Research, Business,
December 27-29, 2014, pp. 50-54.

2. Hoang Do Thanh Tung, Dinh Duc Luong, “A proposed Indexing Method for
Treefarm database”, International Conference on Information and
Convergence Technology for Smart Society, Vol.2 No.1, Jan 19-21, 2016,
Ho Chi Minh, Vietnam, pp. 79-81.

3. Vuong Quang Phuong, Le Thi Thuy Giang, Dinh Duc Luong, Ngo Van Binh,
Hoang Do Thanh Tung, “Technology solution of managing pig breed”,
Proceedings of the XXI National Conference: Some selected issues of
Information Technology and Communications, Thanh Hoa, 27-28/7/2018, pp.
110-116.

4. Hoang Do Thanh Tung, Dinh Duc Luong, “An Improved Indexing Method for
Xpath Queries”, Indian Journal of Science and Technology, Vol 9(31),
DOI:10.17485/ijst/2016/v9i31/92731, August 2016, pp. 1-7 (SCOPUS).

5. Dinh Duc Luong, Vuong Quang Phuong, Hoang Do Thanh Tung, “A new
Indexing technique XR+tree for Bioinformatic XML data compression”,
International Journal of Engineering and Advanced Technology (IJEAT),
ISSN: 2249-8958 (Online), Volume-8, Issue-5, June 2019, pp. 1-7 (SCOPUS).

INTRODUCTION
XML documents hold structured, or more precisely semi-structured, text data. They have been
popular for decades because they store data flexibly and are easy to share and use over the
Internet. XML documents used to be fairly small, but in recent years, driven by the rapid
development of biotechnology, bioinformatics XML documents have appeared that reach
gigabytes or even terabytes in size. Such data are available from reputable sources such as
SRA (decoded sequences), NCBI Genome (sequenced species), and ensembl.org (which
aggregates a large amount of data into BioMart).
Bioinformatics XML documents contain two kinds of data: biological data (DNA, protein,
subspecies, etc.) and data that describe the biological data. The data structures are defined by
tags; they are flexible and may differ from one source to another because the individuals and
organizations working in biology customize them.
Because the documents are so large, they must be stored on hard disk or in a distributed
storage system, and only a small portion is loaded into main memory (RAM) when further
analysis is needed. Hard disk access is sequential and far slower than RAM access. Query
methods that read from hard disk therefore try to minimize the number of disk accesses and
to make maximum use of main memory, for example through caches and buffers.
Practical queries rely on algorithms designed for specific query types so that they return
the desired results in a short time. For example:

1. XPath query on an XML document (exact search): extract all data whose tags have the
same origin as (are siblings of) a White Mouse type, or extract all data that are
descendants of the African pig.
2. Homology query on DNA fragments (approximate search): find all genes that are
homologous to a sample gene of a new species.
The traditional solution for such queries is to select and install indexing methods that suit
certain types of data and specific queries. Such methods already exist, but they have
limitations when applied to text data of this size.
With text data, the index is often very large, even much larger than the original data, which
causes two problems: (1) storing the index is difficult; (2) compressing the data and
exploiting them at the same time is inefficient. Moreover, if the index itself is text data,
query speed remains a hard problem.
Therefore, recent studies on indexing XML documents tend to:
- Separate the XML document into two parts and apply a different indexing method to each,
suited to its data type and query type. In detail:
1. Methods for indexing structural data (tag data) that support specific queries such
as XPath.
2. Methods for indexing biological data (such as DNA fragments) that support specific
queries such as searching for homologous DNA sequences.
- Convert the original text data into numeric form in order to:
1. Reduce the original data size.
2. Apply appropriate indexing methods.
3. Improve the speed of queries.

The problems to be solved are broad and span both informatics and biology, so the thesis
focuses on indexing methods that speed up specific queries by reducing the number of hard
disk accesses while still returning the expected results.
The thesis delivers a method for indexing structural data (tag data) that supports XPath
queries. For the problem of indexing biological data (such as DNA fragments) in support of
queries such as homologous DNA sequence search, the thesis surveys existing methods and
sets directions for further research. The objectives and results of the thesis are as follows.
The objectives of the thesis
- Study an indexing method based on the R-tree to increase the efficiency of XPath
queries on XML data, by way of intermediate data in which tags are converted into
numerical coordinates. The target XML data come from bioinformatics XML documents.
- Convert structured XML text data into numeric data that can be represented in
2-dimensional space (extensible to more dimensions), in order to reduce the size of the
original data and to apply the proposed indexing method.

The achieved results of the thesis are as follows:
- Experiments show that converting bioinformatics XML data into spatial data reduces the
data size at a fairly good rate in general. However, the compression ratio is not uniformly
good across the tested bioinformatics XML documents (DNA, protein, subspecies, ...).
- The thesis proposes the BioX-tree indexing method and its extension, the BioX+tree.
Experiments show that the proposed methods (improved R-tree methods) are more effective
than the R-tree when indexing data converted from XML. In particular, sibling queries, and
queries that use sibling queries within their algorithms, give good results. Theory and
experiment show that the queries avoid redundant traversal steps on the index tree (stored on
hard disk), thereby reducing the number of disk accesses needed to bring data into main
memory while still returning the desired results.


CHAPTER 1. OVERVIEW
1.1. Bioinformatics and data sources
1.1.1. Bioinformatics
Bioinformatics is a field of science that applies techniques from applied mathematics,
informatics, statistics, computer science, biology, chemistry, physics, and mathematical
biology. It is often associated with computational biology or systems biology.
1.1.2 Data sources
- NCBI database
- EMBL/EBI database
- DDBJ database
1.1.3 Bioinformatics and bio databases issues
Biological databases contain a huge amount of information, such as DNA sequences,
proteins, functions, and subspecies, and they grow quickly as data are added continuously,
especially with the development of current biotechnologies. Biological databases can be
stored on computers; however, searching or querying such large databases is often difficult
because of the space and time the queries require. Indexing to speed up the processing of
bioinformatics data is therefore of great interest to many researchers and has great practical
significance.

1.2. Methods for indexing biological and bioinformatics data
1.2.1. Index and external memory model
A complete database usually does not fit in the main memory (RAM) of a computer
system, so it is usually stored on hard drives. Access to a hard drive is about 100,000 times
slower than access to main memory, which makes it the bottleneck of database management
systems.
The effectiveness of an algorithm is therefore measured by the number of I/O operations
needed to perform an action. An index is a special lookup structure that the database search
engine can use to speed up data retrieval, improving performance by reducing, where
possible, the number of disk blocks that have to be accessed.
1.2.2. Indexing methods for biological data
There are two main groups:
- Group 1: Methods that compare sequences by comparing their segments and
optimizing the similarity.
- Group 2: Methods that use a special transformation to build the index.

There are many types of bioinformatics documents, stored in many different formats. This
thesis focuses on large XML data, which is one of the standard output formats that users can
download from the data sources mentioned above.
The author has studied the methods XGrind [78], Xpress [52], XQzip [15], XQueC [7],
Arroyuelo et al. [8], Qian et al. [62], Dietz [21], and Li and Moon's XISS [61]; their
advantages and disadvantages are presented and analyzed in more detail in the following
sections.

1.3. Method of indexing XML documents
1.3.1. XML documents and XPath
- XML document: XML (eXtensible Markup Language) [77] is a hierarchical data model
derived from SGML; it allows a document to be modeled as a tree structure.
- XPath: The structure of an XML document can be visualized as a tree with many
branches and sub-branches. An axis indicates which nodes, relative to the context
node, should be included in the search. The XPath specification [11] lists a family of 13
axes, shown in Table 1.1:
Table 1.1: XPath axes
Axis                  Description
self                  The context node itself
parent                Parent of the context node, if it exists
child                 Children of the context node, if they exist
ancestor              Ancestors of the context node, if they exist
ancestor-or-self      Ancestors of the context node, and the context node itself
descendant            Descendants of the context node
descendant-or-self    Descendants of the context node, and the context node itself
following             Nodes after the context node in the XML document, not including descendants
following-sibling     Sibling nodes after the context node
preceding             Nodes before the context node in the XML document, not including ancestors
preceding-sibling     Sibling nodes before the context node
attribute             Attributes of the context node
namespace             Namespace nodes of the context node

A predicate can also be specified at each step of the path to restrict the set of nodes
selected at that step. In other words, predicates identify the needed data more precisely,
giving smaller and more usable results. Some indexing methods are described below:
1.3.1.1. Numbering on the schema
The XML document is built as a tree with parent-child hierarchy relations. Each tag name of
the document is then serialized and labelled with two numbers, its pre-order value and its
post-order value; this pair forms the NodeID [22] [63]:
- pre-order: nodes are numbered in the order in which they are first visited when the tree is
read from top to bottom, so a node is numbered before its children.
- post-order: nodes are numbered in the order in which their traversal is completed, from
left to right, so a node is numbered after all of its children.
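The numbering can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: both counters start at 0, tag names are unique in the sample document, and the helper names `number_nodes` and `is_ancestor` are hypothetical. The ancestor test uses the standard property of this labelling: an ancestor starts before and finishes after each of its descendants.

```python
import xml.etree.ElementTree as ET

def number_nodes(root):
    """Return {tag: (pre, post)}; assumes unique tag names (illustration only)."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]              # assigned when the node is first reached
        counters["pre"] += 1
        for child in node:
            visit(child)
        labels[node.tag] = (pre, counters["post"])  # assigned after all children
        counters["post"] += 1

    visit(root)
    return labels

def is_ancestor(a, d):
    """a is an ancestor of d iff pre(a) < pre(d) and post(a) > post(d)."""
    return a[0] < d[0] and a[1] > d[1]

doc = ET.fromstring("<a><b><c/><d/></b><e/></a>")
labels = number_nodes(doc)
print(labels)
print(is_ancestor(labels["b"], labels["c"]), is_ancestor(labels["b"], labels["e"]))
```

In this sample, node b receives the pair (1, 2) and its child c the pair (2, 0), so the test confirms the relationship without walking the tree.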

1.3.1.2. Structural joins
The simplest way to evaluate and improve path queries is to divide a large expression into
several smaller expressions (called sub-expressions) and to search for results in each of those
sub-steps. The drawback is that the ancestor-descendant (A-D) relationship has to be
determined for every pair of nodes, and an element may be searched and examined
repeatedly in different steps, which is costly and time-consuming.
1.3.1.3. Conversion into multi-dimensional space
This approach converts the paths and A-D relationships of the input XML documents into
multi-dimensional data sets. The main idea is to avoid structural joins, which can be
suboptimal and slow in many situations, and to take advantage of multi-dimensional data
structures such as the R-tree, which are becoming more effective. The works in [37] and [51]
propose a new indexing method for XML trees based on multidimensional data sets, called
MDX (Multidimensional Approach to Indexing XML).
1.3.1.4. Mapping to a relational database
[36] presents a method designed specifically for XPath queries and path expressions. It
represents each node of the input XML file in 5 dimensions: entry(E) = {pre(E); post(E);
par(E); att(E); tag(E)}. For a node E, pre and post are its pre-order and post-order traversal
ranks; par is the pre-order rank of the parent of E; att is a flag that marks attribute nodes;
and tag holds the node's tag name. An XPath query is evaluated as a window query and
expressed as an SQL (Structured Query Language) query. Because nodes are represented in
5-dimensional space, this solution uses R-trees for indexing, which many studies have found
to perform well for XPath queries.
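A minimal Python sketch of the window-query idea for the four main axes follows. It filters a plain list rather than an R-tree, and the names `Entry`, `window`, and `window_query` are hypothetical; the sample entries and the value -1 for the parent of the root are assumptions made for the illustration.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    pre: int
    post: int
    par: int      # pre-order rank of the parent; -1 for the root (assumption)
    att: bool     # True for attribute nodes
    tag: str

INF = float("inf")

def window(axis, v):
    """Return ((pre_lo, pre_hi), (post_lo, post_hi)); all bounds are exclusive."""
    if axis == "descendant":
        return (v.pre, INF), (-1, v.post)
    if axis == "ancestor":
        return (-1, v.pre), (v.post, INF)
    if axis == "following":
        return (v.pre, INF), (v.post, INF)
    if axis == "preceding":
        return (-1, v.pre), (-1, v.post)
    raise ValueError(axis)

def window_query(entries, axis, v):
    """Linear scan standing in for a spatial range query on the pre/post plane."""
    (plo, phi), (qlo, qhi) = window(axis, v)
    return [e.tag for e in entries if plo < e.pre < phi and qlo < e.post < qhi]

# Sample document <a><b><c/><d/></b><e/></a> after conversion.
entries = [
    Entry(0, 4, -1, False, "a"),
    Entry(1, 2, 0, False, "b"),
    Entry(2, 0, 1, False, "c"),
    Entry(3, 1, 1, False, "d"),
    Entry(4, 3, 0, False, "e"),
]
b, e = entries[1], entries[4]
print(window_query(entries, "descendant", b))
print(window_query(entries, "preceding", e))
```

Each axis corresponds to one quadrant around the context node in the pre/post plane, which is why a spatial index can answer these four axes with a single rectangle.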

1.4. R-tree method
1.4.1. Concept of R-tree
The R-tree was designed for fast access to spatial regions. It divides the space into regions,
builds an index over these smaller regions, and manages them with a tree structure.
Specifically, the R-tree partitions the data space into minimum bounding rectangles (MBRs),
the smallest rectangles that contain the data. The tree stores the MBRs (as a kind of
metadata) rather than the data themselves, so searches are performed on the tree nodes.
1.4.2. R-tree structure
In general, the R-tree is an index structure for n-dimensional spatial objects and is similar to a
B-tree. Leaf nodes contain index entries of the form (MBR, object_ptr), where object_ptr
refers to a data record in the database and MBR is the n-dimensional rectangle that bounds
the spatial object it represents. Non-leaf nodes contain entries of the form (MBR, child_ptr),
where child_ptr is the address of another node in the tree and MBR covers the rectangles in
that lower node. An R-tree satisfies the following properties:
- Every node except the root contains between m and M entries.
- For each entry (MBR, object_ptr) in a leaf node, MBR is the smallest rectangle that contains
the n-dimensional data object referred to by object_ptr.
- For each entry (MBR, child_ptr) in a non-leaf node, MBR is the smallest rectangle that
contains the rectangles in the child node.
- The root node has at least 2 children unless it is a leaf node.
- The tree is balanced.
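The MBR properties above can be sketched in a few lines of Python. The class `MBR` and its methods are hypothetical names for this illustration, not part of any library; points are (pre, post) pairs as used elsewhere in the thesis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MBR:
    lo: tuple  # lower corner, one value per dimension
    hi: tuple  # upper corner, one value per dimension

    @staticmethod
    def of_points(points):
        """Smallest rectangle containing all points (a non-empty list of tuples)."""
        dims = range(len(points[0]))
        return MBR(tuple(min(p[d] for p in points) for d in dims),
                   tuple(max(p[d] for p in points) for d in dims))

    def union(self, other):
        """Smallest rectangle containing both rectangles (a parent entry's MBR)."""
        return MBR(tuple(map(min, self.lo, other.lo)),
                   tuple(map(max, self.hi, other.hi)))

    def intersects(self, other):
        """True if the rectangles overlap in every dimension."""
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for a_lo, a_hi, b_lo, b_hi
                   in zip(self.lo, self.hi, other.lo, other.hi))

    def contains(self, other):
        """True if other lies entirely inside this rectangle."""
        return all(s_lo <= o_lo and o_hi <= s_hi
                   for s_lo, s_hi, o_lo, o_hi
                   in zip(self.lo, self.hi, other.lo, other.hi))

leaf1 = MBR.of_points([(1, 2), (2, 0), (3, 1)])
leaf2 = MBR.of_points([(4, 3)])
parent = leaf1.union(leaf2)
print(leaf1, leaf2, parent)
```

A search descends only into children whose MBR intersects the query rectangle, so tight, non-overlapping MBRs translate directly into fewer node reads.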
1.4.3. Some basic algorithms in R-tree method [30]
a) Searching in a data structure organized as an R-tree
b) Insertion
c)….
1.4.4. Some improved methods
- XR-tree [32]
- AR*-tree [87]

1.5. The remaining issues
The model shown in Figure 1.1 converts XML data into multi-dimensional space so that
spatial indexing and querying methods can be applied to increase processing speed and to
reduce the data size when indexing. Because bioinformatics XML is in practice very diverse,
the R-tree is the more suitable choice and is selected as the basis of this thesis.

Figure 1.1: Overview Model

Figure 1.2: Model shows data conversion and
indexing on hard disk

The R-tree method still has some problems when applied to bioinformatics XML data:

1) The first is the overlap problem. For spatial index techniques, the larger the search space,
the more time it takes to obtain the returned node set. The weakness of R-tree-based
methods is that their queries require a fairly large scan window, which has a considerable
impact on query performance.

2) The second is the problem of the sibling connections between tags after conversion to
space, where tags are expressed as points whose relationships (parent, preceding, sibling,
descendant, child, following, etc.) correspond to XPath axes. Figure 2.2 shows the
distribution of the data in the coordinate system; the author observes that all the data are
skewed into a trapezoid aligned along the diagonal (like an airplane wing). Previous methods
did not take this into account, so queries in this airplane-wing data zone have not improved
significantly.

1.6 Conclusion
Chapter 1 presents fundamental concepts of bioinformatics and bioinformatics data.
Bioinformatics data are becoming huge because the research community regularly
contributes and shares them. Because the problems of bioinformatics data analysis are very
diverse, the documents that store the information need a structure that is flexible, easy to
change, able to accommodate diverse data, and especially easy to share and contribute to.
XML documents are currently an important standard for describing and storing huge
bioinformatics data sets. However, XML documents hold text and semi-structured data, so
extracting data from them differs from extracting regular data. Chapter 1 also reviews related
work on XML data extraction, including the indexing methods and algorithms proposed in
previous studies, among which the R-tree appears effective for XML documents and XPath
queries. On that basis, Chapter 1 analyzes and presents the research issues of the thesis.

CHAPTER 2. BIOX-TREE INDEXING METHOD
2.1. INTRODUCTION

The R-tree-based spatial indexing methods described above have two problems.
First, there is the overlap problem. For spatial index techniques, the larger the search
space, the more time it takes to obtain the returned node set. The weakness of R-tree-based
methods is that they produce suboptimal query windows. Figure 2.1 illustrates an XML
document in which small points represent the XML data in the plane. Assume that, from the
context node E, we want to get all of its descendant nodes by using the window query
{pre(E), ∞; 0, post(E)} [36]. The window that is really needed is the white one, bounded by
the pre-order value of the left-most descendant and the post-order value of the right-most
descendant of node E. As a result, the wasted area (shown in dark color) covered by the
query window of the descendant axis, which can be very large in many cases, has a
considerable impact on query performance.

Figure 2.1: Scan range given by the pre-order and post-order values (gray zone) and the
reduced range (white zone) for a descendant query on the sample document

Second, there is the matter of the connections between tags after conversion to space,
where tags are expressed as points whose relationships (parent, preceding, sibling,
descendant, child, following, etc.) correspond to XPath axes. Figure 2.2 shows the
distribution of the data in the coordinate system for rice DNA data with 1,000 nodes
(Figure 2.2a) and Swissprot data with about 20,000 nodes (Figure 2.2b). The author observes
that all the data are skewed into a trapezoid aligned along the diagonal (like an airplane
wing). The author tested many different XML documents, from a few hundred to several
hundred thousand nodes, with the same result.

Figure 2.2: Example of the distribution of converted points for an XML document: (a) rice
DNA data, (b) Swissprot data
Previous methods did not take this into account. They focused only on the parent/child
and ancestor/descendant relationships and omitted the other axes, which are an important
part of query processing, especially for streams of XPath queries with predicates. As a result,
queries in the airplane-wing data zone have not improved significantly.
For this reason, the author investigates a new indexing method, improved from the R-tree,
that helps XPath queries run more efficiently on a number of axes.
Based on the model selected in Chapter 1, the author proposes improvements to three
components: conversion, indexing, and the query processing module.

Figure 2.3: Proposed parts for improvement in the BioX-tree method
The results of this chapter are published in works 1, 2, 3 and 4 in the "List of author's works".

2.2. BioX-tree Indexing method
2.2.1. XML document conversion
Following the same general principle of XML document analysis and transformation, the
author built a separate conversion program so that the results can be compared accurately
with the R-tree-based methods of previous studies. In [20], the conversion is implemented
with two procedures, startEuity (t, a, att) and endEuity (t). In addition to the pre-order and
post-order values, the author adds a new parameter to increase search efficiency: the level l
of each node. The two procedures are modified accordingly and renamed startElement and
endElement, as presented in Algorithm 2.1.
Algorithm ConvertXMLDocument(XMLdoc)
Input: XML document to be converted
Output: txt file containing the spatial values of each node, node(E) = {pre (E), post (E),
par (E), att (E), tag (E)}.
Begin
  l = -1;
  startElement(t, a, atts)
    l ← l + 1;
    v ← (pre = gpre, post = _, par = (S.top()).pre, level = l, att = a, tag = t)
    S.push(v);
    gpre ← gpre + 1;
    for v’ IN atts do
      startElement(v’, true, nil);
      endElement(v’);
  endElement(t)
    v ← S.pop();
    v.post ← gpost;
    gpost ← gpost + 1;
    insert v in the result table;
    l ← l - 1;
End
Algorithm 2.1: Two modified algorithms in XML document conversion
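As a rough illustration of Algorithm 2.1, the following Python sketch performs the same bookkeeping with the standard `xml.sax` event parser. It is a sketch under assumptions, not the author's program: both counters start at 0, the parent of the root is recorded as -1, rows are kept in memory instead of a txt file, and the names `Converter` and `convert` are hypothetical.

```python
import xml.sax

class Converter(xml.sax.ContentHandler):
    """Collect (pre, post, par, level, att, tag) rows from SAX events."""

    def __init__(self):
        super().__init__()
        self.gpre = 0        # global pre-order counter (assumed to start at 0)
        self.gpost = 0       # global post-order counter (assumed to start at 0)
        self.level = -1      # becomes 0 at the root, as in Algorithm 2.1
        self.stack = []      # open nodes, innermost last
        self.rows = []       # result table, in order of completion

    def _open(self, tag, is_attribute):
        self.level += 1
        parent = self.stack[-1]["pre"] if self.stack else -1  # -1: assumption for the root
        self.stack.append({"pre": self.gpre, "post": None, "par": parent,
                           "level": self.level, "att": is_attribute, "tag": tag})
        self.gpre += 1

    def _close(self):
        v = self.stack.pop()
        v["post"] = self.gpost
        self.gpost += 1
        self.rows.append(v)
        self.level -= 1

    def startElement(self, name, attrs):
        self._open(name, False)
        for attr_name in attrs.getNames():   # attributes become child nodes
            self._open(attr_name, True)
            self._close()

    def endElement(self, name):
        self._close()

def convert(xml_text):
    handler = Converter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.rows

rows = convert('<gene id="1"><seq/><name/></gene>')
for r in rows:
    print(r)
```

Rows are emitted when a node is closed, so they appear in post-order; the root is written last with level 0.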
2.2.2. Index structure on BioX-tree
BioX-tree applies a different insert/split strategy so that the sibling relationships of the
XML data can be retrieved more easily, without affecting the spatial discrimination ability of
the index too much. As in the XPath and R-tree method described in Chapter 1, each tag
name in the converted XML document is represented as an entry [30] with 5 attributes,
node (E) = {pre (E), post (E), par (E), att (E), tag (E)}. A tree node has the size of one hard
disk block.
Non-leaf nodes have the form (pointer, MBR), in which the pointer points to a child node
and the MBR is the smallest rectangle enclosing all entries attached to it. Put simply,
non-leaf nodes hold metadata about the leaf nodes: information about the leaf nodes can be
found there.
Leaf nodes, which contain the converted elements, are responsible for maintaining the
aligned trajectories of the actual XML data. To do that, the author uses double linking to
keep connections with the preceding and following XML siblings, and also uses pointers to
connect XML children with their parents. In short, a leaf node uses 3 pointers to connect to
its preceding siblings, its following siblings, and their parent, so each leaf node carries a
tuple of pointers (previouspointer, nextpointer, parpointer).
The purpose is to maintain relationships that reflect the airplane-wing distribution of the
data in space, which makes the query windows smaller, and to force each leaf node to
contain only siblings, which makes it possible to find sibling relationships in BioX-tree
quickly.

Figure 2.4: Tree hierarchy under tags in rice DNA XML documents

Figure 2.5: Leaf nodes show a connection on the structure tree of BioX-tree
For example, Figure 2.4 depicts the tree structure of a rice DNA document that the author
tests in the following sections; it is an XML data set published in the NCBI Gene bank. The
nodes are labelled with their pre-order and post-order values according to the numbering
scheme and algorithm described above. After the data are transformed (for simplicity, only
the pre-order value is used in the description), the nodes are represented in the BioX-tree
structure as shown in Figure 2.5: data nodes that have the same parent are stored in the same
leaf node. If a leaf node receives too many entries and overflows, it is split, and the resulting
leaf nodes are connected by pointers so that the siblings stay connected. Straight arrows
represent pointers from a leaf node to its previous and next sibling leaf nodes; curved arrows
represent connections to the parents.
In this example, the entries with pre-order values 21, 22, and 23 are sibling nodes in the
XML document; they are inserted into the same leaf node, and a pointer connects them to
their parent node, which is the node containing entry 24. That is, 21, 22, 23 and 24
are siblings and have the same parent.
2.2.3. Algorithms
Because changing the tree structure affects the insertion, deletion, and querying of nodes,
the author redesigns some algorithms accordingly. This section presents the modified
algorithms; algorithms that are not shown are reused unchanged from the original method.

2.2.3.1. Insertion algorithm
The goal is to preserve the sibling connections of the XML data, which makes the insertion
algorithm fairly complicated. The pseudo-code in Algorithm 2.2 describes the insertion
process and the split strategy used when a leaf node is full.
Algorithm Insert(N, E)
Input: context node N and entry E to be inserted.
Begin
1. Call FindSiblingNode(N, E) to find the leaf node N’ containing siblings of the new entry E
2. if node N’ is found
3.    if (N’ has space to add) then
4.       insert entry E into N’
5.    else
6.       Call CreateNewLeafNode(E) to create a new leaf node on the tree for entry E
7.    endif
8. else
9.    Call CreateNewLeafNode(E) to create a new leaf node on the tree for entry E
10. endif
End.
Algorithm 2.2: Insertion algorithm
Algorithm FindSiblingNode(N, E)
Input: context node N, entry E whose siblings are sought.
Output: node N containing siblings of entry E.
Begin
1. if N is not a leaf node
2.    Browse the entries E’ in N whose MBR intersects the MBR of entry E
3.    Call FindSiblingNode(N’, E), where N’ is the child node of N pointed to by E’
4. else
5.    if N contains an entry that is a sibling of E
6.       return N
7. endif
End.
Algorithm 2.3: Algorithm FindSiblingNode
Algorithm CreateNewLeafNode(E)
Input: entry E to be inserted.
Begin
1. Find the sibling node of entry E; N’ is found
2. if (N’ has room to add entry E) then
3.    Add entry E to N’
4. else
5.    Search from bottom to top until the parent Q is reached
6.    Go along the nearest right path from parent node Q to a node P at level 1
7.    if non-leaf node P is not full
8.       Add entry E to a leaf node in P
9.    else
10.      Create a new non-leaf node R at level 1, create a new leaf node, and insert entry E
11.   endif
12. endif
End.
Algorithm 2.4: Algorithm CreateNewLeafNode
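The leaf-level effect of this insertion strategy (siblings share a leaf, and an overflowing leaf is chained to a new one) can be sketched in Python as follows. This is a deliberate simplification, not the BioX-tree itself: the MBR-guided descent of FindSiblingNode is replaced by a dictionary keyed on the parent's pre-order value, non-leaf nodes are omitted, and `CAPACITY`, `Leaf`, and `SiblingLeaves` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional

CAPACITY = 3  # hypothetical number of entries per leaf (one disk block)

@dataclass(eq=False)
class Leaf:
    par: int                                  # pre-order value of the common parent
    entries: list = field(default_factory=list)
    prev: Optional["Leaf"] = None             # previous leaf holding siblings
    next: Optional["Leaf"] = None             # next leaf holding siblings

class SiblingLeaves:
    """Leaf layer only; a dict stands in for the MBR-guided sibling search."""

    def __init__(self):
        self.head = {}   # par -> first leaf of the sibling chain
        self.tail = {}   # par -> last leaf of the sibling chain

    def insert(self, pre, par):
        leaf = self.tail.get(par)
        if leaf is None:                      # no sibling leaf yet: create one
            leaf = Leaf(par)
            self.head[par] = self.tail[par] = leaf
        elif len(leaf.entries) >= CAPACITY:   # sibling leaf is full: chain a new leaf
            new_leaf = Leaf(par, prev=leaf)
            leaf.next = new_leaf
            self.tail[par] = leaf = new_leaf
        leaf.entries.append(pre)
        return leaf

index = SiblingLeaves()
for pre in [1, 2, 3, 4, 5]:
    index.insert(pre, par=0)
index.insert(7, par=6)
print(index.head[0].entries, index.head[0].next.entries, index.head[6].entries)
```

Keeping the chain doubly linked is what later allows a sibling query to start from any leaf of the chain and reach all the others without returning to the upper levels of the tree.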
2.2.3.2. Query algorithm
BioX-tree differs from the R-tree method in that it can answer queries on most axes directly,
without a refinement step, whereas the R-tree-based method can directly answer queries on
only the 4 main axes (ancestor, preceding, following, descendant), as noted at the beginning.
Before going into the details of each algorithm, Algorithm 2.5 and Algorithm 2.6 show the
algorithms used for point and range queries. These basic spatial queries are sub-algorithms
already available in the R-tree method, and the author makes no further improvements to
them. With these queries, the result can be returned as a set of nodes or as exactly one node,
as desired.
Algorithm FindNode(N, E) // point query
Input: context node N and entry E to be found
Output: node N containing entry E
Begin
1. if (N is not a leaf node)
2.    Browse the entries E’ in N whose MBR intersects the MBR of entry E
3.    Call FindNode(N’, E), where N’ is the child node of N pointed to by E’
4. else
5.    if N contains entry E
6.       return N
7. endif
End.
Algorithm 2.5: Point query algorithm
Algorithm RangeQuery(N, Q, RESULT) // window query
Input: context node N (initially the root node) and query window Q
Output: RESULT list containing all entries whose MBR intersects Q
Begin
1. if N is not a leaf node
2.    Browse the entries E’ in N whose MBR intersects Q
3.    Call RangeQuery(N’, Q, RESULT), where N’ is the child node of N pointed to by E’
4. else
5.    Browse the entries E’’ in N whose MBR intersects Q
6.    Add E’’ to RESULT
7. endif
End.
Algorithm 2.6: Range query algorithm
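A small Python sketch of the range query follows, with a counter of visited nodes standing in for hard disk accesses. The tree is built by hand for the illustration; `Node`, `intersects`, and `range_query` are hypothetical names, and rectangles are written as (pre_lo, post_lo, pre_hi, post_hi) with inclusive bounds.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple                                      # (pre_lo, post_lo, pre_hi, post_hi)
    children: list = field(default_factory=list)    # child nodes (non-leaf node)
    points: list = field(default_factory=list)      # (pre, post) pairs (leaf node)

def intersects(a, b):
    """True if rectangles a and b overlap in both dimensions."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(node, q, result, stats):
    stats["visited"] += 1                           # one node read ~ one disk access
    if node.children:                               # non-leaf: descend where MBRs overlap
        for child in node.children:
            if intersects(child.mbr, q):
                range_query(child, q, result, stats)
    else:                                           # leaf: report the points inside q
        for pre, post in node.points:
            if q[0] <= pre <= q[2] and q[1] <= post <= q[3]:
                result.append((pre, post))

leaf1 = Node(mbr=(1, 0, 3, 2), points=[(1, 2), (2, 0), (3, 1)])
leaf2 = Node(mbr=(4, 3, 5, 5), points=[(4, 3), (5, 5)])
root = Node(mbr=(1, 0, 5, 5), children=[leaf1, leaf2])

result, stats = [], {"visited": 0}
range_query(root, (2, 0, 3, 1), result, stats)
print(result, stats)
```

Only the root and the first leaf are read here because the second leaf's MBR does not overlap the window; a larger window would have forced a read of both leaves.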
To avoid listing query types that differ only slightly, the author divides the query
processing algorithms on BioX-tree into two categories: algorithms for sibling queries and
algorithms for the other queries.
2.2.4. Query processing
2.2.4.1 Sibling query algorithm

Algorithm SiblingQuery(N, E, RESULT)
Input: context node N (initially the root node) and entry E whose siblings are sought
Output: RESULT list containing the entries that are siblings of entry E
Begin
1. Call FindNode(N, E) to find the node N’ containing entry E
2. if (N’ is found)
3.    Browse the entries E’ in N’
4.    Add E’ to RESULT
5.    if (following-sibling pointer F is not null)
6.       Call FollowingSiblingQuery(NF, RESULT), where NF is the node pointed to by F
7.    if (preceding-sibling pointer P is not null)
8.       Call PrecedingSiblingQuery(NP, RESULT), where NP is the node pointed to by P
9. else
10.   Sibling node is not found
11. endif
End.
Algorithm 2.7: Sibling query algorithm
Algorithm FollowingSiblingQuery(NF, RESULT)
Input: context node NF
Output: RESULT list containing the entries that are following siblings in NF
Begin
1. for each entry E’ in NF
2.    Add E’ to RESULT
3. endfor
4. if (following-sibling pointer F is not null)
5.    Call FollowingSiblingQuery(NF’, RESULT), where NF’ is the node pointed to by F
6. endif
End.
Algorithm 2.8: Following sibling query algorithm

Algorithm PrecedingSiblingQuery(NP, RESULT)
Input: context node NP
Output: RESULT list containing the entries that are preceding siblings in NP
Begin
1. for each entry E’ in NP
2.    Add E’ to RESULT
3. endfor
4. if (preceding-sibling pointer P is not null)
5.    Call PrecedingSiblingQuery(NP’, RESULT), where NP’ is the node pointed to by P
6. endif
End.
Algorithm 2.9: Preceding sibling query algorithm
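The three sibling algorithms amount to walking a doubly linked chain of leaf nodes. The Python sketch below shows that walk iteratively rather than recursively; it assumes the starting leaf has already been located by the point query, and `Leaf`, `link`, and `sibling_query` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Leaf:
    entries: list                       # pre-order values of sibling entries
    prev: Optional["Leaf"] = None       # leaf holding the preceding siblings
    next: Optional["Leaf"] = None       # leaf holding the following siblings

def link(leaves):
    """Connect consecutive leaves of one sibling chain in both directions."""
    for left, right in zip(leaves, leaves[1:]):
        left.next, right.prev = right, left

def sibling_query(start):
    """Entries of the start leaf, then following leaves, then preceding leaves."""
    result = list(start.entries)
    node = start.next
    while node is not None:             # FollowingSiblingQuery, unrolled
        result.extend(node.entries)
        node = node.next
    node = start.prev
    while node is not None:             # PrecedingSiblingQuery, unrolled
        result.extend(node.entries)
        node = node.prev
    return result

chain = [Leaf([1, 2]), Leaf([3, 4]), Leaf([5])]
link(chain)
print(sibling_query(chain[1]))
```

The number of leaves read is proportional to the number of siblings returned, which matches the k term in the complexity figures given for the sibling query.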
2.2.4.2. Other query algorithms
Algorithm ChildrenQuery(N, Q, RESULT)
Input: context node N and window query Q
Output: RESULT list containing all children of entry E
Begin
1. if (N is a non-leaf node)
2.    Browse the entries E’ in N whose MBR intersects Q
3.    Call ChildrenQuery(N’, Q, RESULT), where N’ is the child node of N pointed to by E’
4. else
5.    Browse the entries E’’ in N whose MBR intersects Q
6.    Call SiblingQuery(N, E’’, RESULT)
7. endif
End.
Algorithm 2.10: Children query algorithm
Algorithm AncestorQuery(N, E, RESULT)
Input: context node N and entry E whose ancestors are sought
Output: RESULT list containing all ancestors of E
Begin
1. Call FindNode(N, E) to find the node N containing entry E
2. if (N is not null)
3.    if (parent pointer P is not null)
4.       Browse the entries E’ in NP, where NP is the node pointed to by P
5.       if E’ is an ancestor of E, add E’ to RESULT
6.       Jump to step 3, with NP replacing N
7.    else
8.       The input node is the root
9. else
10.   The node is not found
11. endif
End.
Algorithm 2.11: Ancestor query algorithm
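The parent-pointer walk of the ancestor query can be sketched in Python as below. The sketch assumes that each leaf holds (pre, post) pairs and a single pointer to the leaf that contains their parent, and that the leaf containing E has already been found; `Leaf` and `ancestor_query` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Leaf:
    entries: list                        # (pre, post) pairs of sibling entries
    parent: Optional["Leaf"] = None      # leaf that holds the parent of these entries

def ancestor_query(leaf, e):
    """Follow parent pointers upward and keep the entries that are ancestors of e."""
    result = []
    node = leaf.parent
    while node is not None:
        for a in node.entries:
            if a[0] < e[0] and a[1] > e[1]:   # pre(a) < pre(e) and post(a) > post(e)
                result.append(a)
        node = node.parent
    return result

# Sample document <a><b><c/><d/></b><e/></a>: a=(0,4), b=(1,2), c=(2,0), d=(3,1), e=(4,3).
root_leaf = Leaf([(0, 4)])
children_of_a = Leaf([(1, 2), (4, 3)], parent=root_leaf)
children_of_b = Leaf([(2, 0), (3, 1)], parent=children_of_a)

print(ancestor_query(children_of_b, (2, 0)))
```

Each step reads one leaf per tree level, so the walk touches at most as many leaves as the depth of the XML document.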

Unlike the algorithms above, the descendant query only minimizes the size of the query
window and then passes this window to a normal range query. As described for the
conversion step, the author adds a level parameter l to the nodes; using this parameter
reduces the window size of the search space.
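The text does not spell out how the level parameter shrinks the window. One identity known from the pre/post-plane literature is that, when both counters start at 0 and the root has level 0, the number of descendants of a node equals post - pre + level. Whether BioX-tree uses exactly this bound is an assumption; the Python sketch below only shows how such a bound yields a tight window, with hypothetical function names.

```python
def descendant_count(pre, post, level):
    """Number of descendants, assuming 0-based counters and root level 0."""
    return post - pre + level

def descendant_window(pre, post, level):
    """Inclusive bounds (pre_lo, pre_hi, post_lo, post_hi) of the tight window."""
    size = descendant_count(pre, post, level)
    return (pre + 1, pre + size, post - size, post - 1)

def in_window(point, w):
    return w[0] <= point[0] <= w[1] and w[2] <= point[1] <= w[3]

# Sample document <a><b><c/><d/></b><e/></a> as tag -> (pre, post, level).
nodes = {"a": (0, 4, 0), "b": (1, 2, 1), "c": (2, 0, 2), "d": (3, 1, 2), "e": (4, 3, 1)}

w_b = descendant_window(*nodes["b"])
found = sorted(t for t, (pre, post, _) in nodes.items() if in_window((pre, post), w_b))
print(w_b, found)
```

For node b the window is bounded on all four sides, whereas the plain descendant window is open towards large pre-order values and reaches down to post-order 0.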
2.2.5. Complexity assessment of the algorithms
The complexity of the sibling query of the proposed method is as follows:
- Best case: O(k + log_m N), where N is the number of nodes in the tree, m is the number
of entries in one node, and k is the number of siblings found by the query.
- Worst case: O(k + N)
- Average case: O(k + m log_m N)
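To give a feel for how far apart these three bounds are, the sketch below evaluates them for illustrative values. The numbers N = 80,000, m = 50, and k = 10 are assumptions chosen only for the example (N borrows the largest document size from the experiments), and constant factors are ignored.

```python
import math


def sibling_query_costs(n_nodes: int, m: int, k: int) -> dict:
    """Evaluate the three stated bounds (constant factors ignored)."""
    height = math.log(n_nodes, m)  # log base m of N
    return {
        "best": k + height,
        "average": k + m * height,
        "worst": k + n_nodes,
    }


costs = sibling_query_costs(n_nodes=80_000, m=50, k=10)
for case, value in costs.items():
    print(f"{case:>7}: {value:,.1f}")
```

With these values the best case stays close to k, the average case grows with m times the tree height, and the worst case is dominated by N.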

2.6. Experimental results of BioX-tree method
2.6.1 Experimental model and environment
- Test model

Figure 2.6: Experimental model of the BioX-tree and R-tree methods
- Experimental data
The author uses four bioinformatics XML documents taken from reputable data sources. They
describe different kinds of biological data (DNA, protein, and descriptions of subspecies):
DNACorn, DNARice, Swissprot, and Allhomologies.

- Scenario
In XPath queries, queries on the sibling axes are the most important and most frequently used,
because they return small, specific result sets that are then used in calculations. For example,
the user needs to search the XML data for the descendants of the current node or the siblings of
the current node. Queries on the other axes, such as ancestor, descendant, preceding, and
following, usually return large result sets and are rarely used in practice.
In the experimental scenario, the two types of XPath queries on the sibling axis and the children
axis are run as point queries. The remaining XPath axes use range queries. Each of the XML
documents above is treated as a separate database and is processed separately. On each XML
document, the author randomly selects 200 tag names and then runs queries that find the sets of
tags related to them along the XPath axes.
The queries are run on XML documents of increasing size, measured by the complexity of the
XML tree, with 20,000, 40,000, 60,000, and 80,000 tag names/nodes respectively. For each type
of query, the average number of hard disk accesses over the 200 selections above is used to
evaluate the performance of the methods. Fewer hard disk accesses mean higher query
performance.
- Experimental tooling and environment
The algorithms are implemented in C++ using Visual Studio 2008. The experiments run on a
computer with the following configuration: CPU Intel Xeon E5520 2.7 GHz, RAM 20 GB, OS
Windows Server 2008 R2 Enterprise.
2.6.2. Program construction
- Index file design
- Program design
- Class diagram

Figure 2.7: Class diagram of BioX-tree
2.6.3. Assessing the effectiveness of data size reduction
To evaluate the actual effectiveness of compressing XML documents into documents converted
to digital space, the author experimented on practical documents, as shown in Figure 2.8. The
compression ratio is quite good, especially for the DNA description documents. However, the
Allhomologies document, which describes species information, has a surprisingly low
compression ratio. To understand why compression ratios differ between documents, the author
analyzed the XML files and identified the cause. Because the Allhomologies document describes
species information, most of its content consists of Attribute tags, each of which describes
many attributes in a single string. When the conversion algorithm encounters such a tag, it has
to split each attribute in the string into a separate tag, which increases the size of the
converted document. Thus, in practice, this conversion method does not always achieve a good
compression ratio, because the ratio depends on the structure of the XML document.

Figure 2.8: Size comparison of the XML documents and the documents converted to digital space

2.6.4. Comparison of the results of the BioX-tree and R-tree methods
In spatial indexing methods, performance is measured by the average number of nodes
accessed, because the actual processing time depends on how many hard disk accesses (I/O) a
query needs in order to read the blocks. That is, the fewer node blocks a query reads, the
shorter the processing time.
a. Node query
Figures 2.9 and 2.10 show that the BioX-tree performs much better than the R-tree. The reason
is that, to obtain the results, the R-tree must use range queries to scan all sibling or
descendant nodes and then filter out the expected nodes. The BioX-tree, in contrast, handles
these queries by first reaching only the one leaf node that contains the object and then
finding all of its sibling and child nodes through pointers. This avoids the R-tree overlap
problem. The larger the XML data, the more the R-tree overlaps, which is why R-tree
performance decreases rapidly as the data size increases.

Figure 2.10: Children query

Figure 2.9: Sibling query
b. Range query

Figure 2.11 shows that the performance of the BioX-tree is slightly lower than that of the
R-tree, except for large data. The reason is that the author has forced the sibling nodes of an
XML data object into the same leaf nodes of the R-tree. This makes the index structure less
optimal and leads to overlap problems. However, thanks to the parent pointers of the BioX-tree,
its performance is not much worse than that of the R-tree. Figure 2.12 shows that the
performance of the BioX-tree is slightly better than that of the R-tree. Instead of scanning
the whole of one of the four discrete areas on the plane, the BioX-tree only looks for the
children of a descendant node and then uses the sibling pointers to reach the rest. Similar to
Figure 2.12, Figures 2.13 and 2.14 show that the performance of the BioX-tree is a little lower
than that of the R-tree. The reason is that range queries are forced to scan one of the four
discrete regions, which results in a serious overlap problem.

Figure 2.12: Descendant query

Figure 2.11: Ancestor query


Figure 2.14: Preceding query node

Figure 2.13: Following query node

In summary, the proposed indexing method performs much better on node queries, while its
performance on range queries is almost the same as that of the R-tree. In practice, node
queries are used more often than range queries, because users rarely need all the ancestor or
descendant data of an object. Preceding-sibling, following-sibling, and children queries in
particular bring many benefits when searching a DNA database.

2.7. Conclusion
In this chapter, the author has researched and presented an improved method for processing
XPath queries more efficiently; such queries are considered the basis for building complex
queries. The BioX-tree method proposes several important enhancements: adding pointers that
indicate the ancestor-descendant, parent-child, and sibling relationships; adding some
parameters to the step that converts XML documents into space; and redesigning the query
algorithms on the main XPath axes to speed up execution. The experiments were carried out on
bioinformatics XML data from reputable sources that are popular in the biological community.
The experimental results show that the new structure is much more efficient than the
R-tree-based indexing method on point queries, which are the most commonly used queries in
practice, because the new algorithms follow chains of links through pointers, which is highly
efficient in terms of hard disk I/O. However, some disadvantages remain; the author continues
to study them and presents solutions in chapter 3.
The research results of Chapter 2 are published in works 2, 3 and 4 of the "List of author's
works".

Chapter 3. BIOX+-TREE EXTENSION INDEXING METHOD
3.1. Introduction

Figure 3.1: Improved parts in the BioX+-tree extended indexing method
In chapter 2, the author proposed increasing the efficiency of XPath axis queries for preceding
tags (preceding sibling), following tags (following sibling), and ancestors by adding pointers
to the leaf nodes of the R-tree. Thanks to these pointers, other XPath axis queries whose
results include sibling tags also perform better. However, the disadvantage of the method is
that it may create a non-optimal R-tree index. Therefore, in what follows, we analyze the
converted data space of the XML document and propose algorithms that can overcome some of
these defects. In this chapter, we apply a method of converting name tags to numerical spatial
coordinates to reduce the size of the document, and we propose a new indexing method to
increase the query efficiency on XML documents. The results also point to some practical issues
caused by the variety of bioinformatics XML documents. The proposed improvements concern two
components: the indexing module and the query processing module.
The research results of this chapter are published in work 5 of the "List of author's
works".

3.2. XR+ tree method
3.2.1. Analyzing the conversion data space of XML documents
The study in chapter 2 showed that the distribution of XML data in space tends to concentrate
along the two XPath axes preceding sibling and following sibling.

Figure 3.2: XML document tree is numbered
Figure 3.2 shows the leaf nodes of the XR-tree in 2-dimensional space as rectangles (MBRs).
Here, R is the MBR of the sub-tags of the root tag (1-31) of Figure 3.1. Similarly, R1, R2,
and R3 in Figure 3.2 are the MBRs of the sub-tags of the tags 2-10, 12-20, and 22-30 in Figure
3.1, and the same holds for tags at lower levels. From the visual spatial description in Figure
3.2, we derive the following theorems about the MBRs of the subtrees of the tags of a document
X when it is represented in BioX-tree space.
Theorem 1: Suppose that, in an XML document, the tag T is the parent of the sibling tags t1,
t2, ..., tn. The MBRs that contain the subtrees of t1, t2, ..., tn in XR-tree space are always
separate (they do not intersect). For example, R1, R2, and R3 are the bounding rectangles of
subtrees; they are always separate and are spread from left to right in the space of Figure 3.2.
To prove this theorem, observe that, among siblings with the same parent, the elements in the
MBR bounding the subtree of a sibling on the left always have smaller values on the pre-order
axis than the elements in the MBR bounding the subtree of a sibling on the right. Conversely,
the values on the left are always greater than the values on the right on the post-order axis.
Therefore, the MBRs of the subtrees of child nodes with the same parent in a document X are
always separate in XR-tree space. Theorem 1 thus shows that forcing the sibling XML tags with
the same parent into the same BioX-tree leaf node may not do much to optimize the BioX-tree
structure for queries. The experimental results on the BioX-tree in chapter 2 have confirmed
this observation.
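Theorem 1 can be checked on a small example. The sketch below assumes the usual pre-order/post-order numbering with both ranks starting at 1; the thesis's own conversion may orient the axes differently, which does not affect the check, since two rectangles are separate as soon as they are separated on one axis. The tree and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, List[str]]       # tag -> ordered list of child tags
Rect = Tuple[int, int, int, int]  # (min pre, max pre, min post, max post)


def number_tree(tree: Tree, root: str) -> Dict[str, Tuple[int, int]]:
    """Assign (pre-order, post-order) ranks, both starting at 1."""
    ranks: Dict[str, Tuple[int, int]] = {}
    counters = {"pre": 0, "post": 0}

    def visit(tag: str) -> None:
        counters["pre"] += 1
        pre = counters["pre"]
        for child in tree.get(tag, []):
            visit(child)
        counters["post"] += 1
        ranks[tag] = (pre, counters["post"])

    visit(root)
    return ranks


def subtree_mbr(tree: Tree, ranks: Dict[str, Tuple[int, int]], tag: str) -> Rect:
    """Bounding rectangle of `tag` and all of its descendants."""
    stack, points = [tag], []
    while stack:
        current = stack.pop()
        points.append(ranks[current])
        stack.extend(tree.get(current, []))
    pres = [p for p, _ in points]
    posts = [q for _, q in points]
    return (min(pres), max(pres), min(posts), max(posts))


def disjoint(a: Rect, b: Rect) -> bool:
    """True when the rectangles are separated on at least one axis."""
    return a[1] < b[0] or b[1] < a[0] or a[3] < b[2] or b[3] < a[2]


tree: Tree = {"T": ["t1", "t2", "t3"], "t1": ["a", "b"], "t2": ["c"], "t3": ["d", "e"]}
ranks = number_tree(tree, "T")
mbrs = [subtree_mbr(tree, ranks, t) for t in tree["T"]]
all_disjoint = all(disjoint(mbrs[i], mbrs[j])
                   for i in range(len(mbrs)) for j in range(i + 1, len(mbrs)))
```

For this tree the three subtree rectangles do not overlap, in line with the theorem.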

Theorem 2: Suppose that, in an XML document, the tag T is the parent of the sibling tags t1,
t2, ..., tn. Except for the MBRs bounding the subtrees of t1 and tn (the first and last
siblings), the MBR that includes t1, t2, ..., tn covers all the MBRs of the subtrees of t2,
..., tn-1. Example: R covers all of R2 and R21, R22, R23. To prove this theorem, observe that
the pre-post values of t1 and tn come first and last among the pre-post values of the children
of T. Therefore, the pre-post values in the subtrees of t2, ..., tn-1 clearly lie within the
MBR range spanned by t1 and tn. From Theorem 2, we draw Consequence 2.1 for the algorithm that
searches for an XML tag in the XR-tree, as follows.
Consequence 2.1: Suppose that, when searching for a tag t of an XML document in the XR-tree,
the location of t is found to lie inside rectangles R1 and R2. If R1 is inside R2, the search
algorithm does not need to browse the other subtrees of R2 to find t, because t is certainly
in the subtree of R1. Based on these theorems and this consequence, we redesigned the XR-tree
algorithms to re-optimize queries and overcome the structural weaknesses.
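Theorem 2 can be illustrated in the same way. The sketch below again assumes the usual pre-order/post-order numbering (an assumption, not necessarily the exact conversion used in the thesis) and tests, for each sibling, whether its whole subtree lies inside the rectangle spanned by the sibling tags themselves. The tree and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]  # (pre-order rank, post-order rank)


def number_tree(tree: Dict[str, List[str]], root: str) -> Dict[str, Point]:
    """Assign (pre-order, post-order) ranks, both starting at 1."""
    ranks: Dict[str, Point] = {}
    counters = {"pre": 0, "post": 0}

    def visit(tag: str) -> None:
        counters["pre"] += 1
        pre = counters["pre"]
        for child in tree.get(tag, []):
            visit(child)
        counters["post"] += 1
        ranks[tag] = (pre, counters["post"])

    visit(root)
    return ranks


def descendants(tree: Dict[str, List[str]], tag: str) -> List[str]:
    """All tags strictly below `tag`."""
    out: List[str] = []
    for child in tree.get(tag, []):
        out.append(child)
        out.extend(descendants(tree, child))
    return out


def inside(point: Point, lo: Point, hi: Point) -> bool:
    """True when `point` lies in the rectangle with corners `lo` and `hi`."""
    return lo[0] <= point[0] <= hi[0] and lo[1] <= point[1] <= hi[1]


tree = {"T": ["t1", "t2", "t3", "t4"],
        "t1": ["a"], "t2": ["b", "c"], "t3": ["d"], "t4": ["e"]}
ranks = number_tree(tree, "T")
siblings = tree["T"]

# Under this numbering the first sibling has the smallest ranks on both axes
# and the last sibling the largest, so they are the corners of the sibling MBR.
lo, hi = ranks[siblings[0]], ranks[siblings[-1]]

covered = {s: all(inside(ranks[n], lo, hi) for n in [s] + descendants(tree, s))
           for s in siblings}
```

In this example the subtrees of the two middle siblings are fully covered, while those of the first and last siblings stick out of the rectangle, which is the exception the theorem states.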
3.2.2. The proposed algorithm
From the conclusions obtained in Section 3.2.1, we propose new Insert and Query algorithms,
and we call the new extension of the BioX-tree the BioX+-tree. The goal of the algorithms is
to eliminate redundant tree-browsing steps to optimize query speed while reducing the
structural disadvantages of the BioX-tree. The BioX+-tree removes the Par pointer (which
points to the parent node), because it has not yielded the expected benefit and consumes a lot
of storage.
Algorithm Insert(N, E)
Input: context node N and entry E to be inserted
Output: node N contains entry E
Begin
Invoke FindNode(N, E) to find the leaf node N’ containing the sibling predecessor of the new
context entry E to be inserted (stage 1)
if node N’ is found
    if N’ is the first node or the last node
        fullnode = m // m is the minimum number of entries in a node
    else
        fullnode = M // M is the maximum number of entries in a node
    if |N’| < fullnode
        Insert the new context entry E into N’
    else
        CreateNewLeafNode(E) // create a new leaf node for the new context entry E,
                             // then insert the new leaf node into the tree
else
    CreateNewLeafNode(E) // create a new leaf node for the new context entry E,
                         // then insert the new leaf node into the tree
CreatePointers(pre, post, N’)
End

Algorithm 3.1: Insertion Algorithm
The difference in this algorithm is that it takes into account the first and last tags of the
leaf nodes in the BioX+-tree. In particular, the nodes containing the first or last sibling
tags hold at most m tags (the minimum) instead of M (the maximum). In this way, the method
reduces the size of the MBR that covers all the sibling tags, so there are fewer intersections
between the MBRs of the BioX+-tree. Queries benefit from this structure.
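The capacity rule can be illustrated with a simplified batch version. Note that this is a swapped-in simplification and not Algorithm 3.1 itself: instead of inserting entries one at a time, the hypothetical function `pack_siblings` distributes an already ordered list of sibling tags over leaf nodes, giving the first and last leaf at most m entries and every leaf in between at most M.

```python
from typing import List


def pack_siblings(siblings: List[str], m: int, M: int) -> List[List[str]]:
    """Group ordered sibling tags into leaves: end leaves hold <= m, others <= M."""
    if not 0 < m <= M:
        raise ValueError("expected 0 < m <= M")
    if len(siblings) <= m:
        return [list(siblings)] if siblings else []
    first, rest = siblings[:m], siblings[m:]
    if len(rest) <= m:
        return [first, rest]
    middle, last = rest[:-m], rest[-m:]
    leaves = [first]
    # Middle leaves are filled up to the maximum capacity M.
    leaves.extend(middle[i:i + M] for i in range(0, len(middle), M))
    leaves.append(last)
    return leaves


tags = [f"t{i}" for i in range(1, 13)]  # twelve sibling tags t1..t12
leaves = pack_siblings(tags, m=2, M=4)
sizes = [len(leaf) for leaf in leaves]
```

With m = 2 and M = 4, the twelve siblings end up in four leaves, and only the two small end leaves hold the first and last sibling tags.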
Algorithm FindNode(N, E)
Input: context node N and entry E to be searched for
Output: node N that contains entry E
Begin
minN = N
if (N is not a leaf)
    Browse the entries E’ in N whose MBR (N’) intersects the MBR of E
        if (N’ intersects or is inside minN)
            Invoke FindNode(N’, E), where N’ is the child node of N pointed to by E’
        if N’ is inside N
            minN = N’
else
    if N contains entry E
        Return N
End
Algorithm 3.2: Querying Algorithm
This algorithm applies Theorem 2 and Consequence 2.1: when searching for E, if R1 is inside
R2, the algorithm skips the subtree of R2 and saves the value of R1. At any intermediate node,
for each rectangle R:
- If R contains R1, the subtree of R is ignored.
- If R is inside R1, R is saved for comparison in the next step, and browsing continues in R.
- If R falls into neither of the two cases above, browsing continues in R.
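The pruning rule can be sketched as follows. The sketch is not the thesis implementation: `Node`, `find_node`, and the rectangle layout are hypothetical, and the rule is only safe when the index has the nesting property of Theorem 2, which the example data is built to satisfy; in a general R-tree it could miss results.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
Point = Tuple[int, int]


@dataclass
class Node:
    rect: Rect
    children: List["Node"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)  # filled for leaves only


def contains_point(r: Rect, p: Point) -> bool:
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]


def inside(inner: Rect, outer: Rect) -> bool:
    """True when `inner` lies within `outer` and is not the same rectangle."""
    return (inner != outer
            and outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def find_node(node: Node, p: Point, visited: List[Node]) -> Optional[Node]:
    """Search for the leaf holding `p`, recording every node that is read."""
    visited.append(node)
    if not node.children:
        return node if p in node.points else None
    hits = [c for c in node.children if contains_point(c.rect, p)]
    # Consequence 2.1: skip a candidate when another candidate lies inside it.
    innermost = [c for c in hits
                 if not any(inside(o.rect, c.rect) for o in hits)]
    for child in innermost:
        found = find_node(child, p, visited)
        if found is not None:
            return found
    return None


inner = Node((4, 4, 6, 6), points=[(5, 5)])
outer = Node((2, 2, 9, 9), points=[(2, 2), (9, 9)])
other = Node((10, 1, 12, 3), points=[(11, 2)])
root = Node((1, 1, 12, 9), children=[outer, inner, other])

visited: List[Node] = []
hit = find_node(root, (5, 5), visited)
```

The target point lies in both `outer` and `inner`, but only the root and `inner` are read; the enclosing rectangle is skipped, which is where the saving in node accesses comes from.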

3.3. Experimental results of the BioX+-tree method
3.3.1 Experimental model and data
- Experimental model
The test model is similar to the one used for the BioX-tree, except that the BioX-tree and
BioX+-tree methods are the ones tested and recorded. To evaluate the practical effectiveness of
the method, the author experimented on bioinformatics XML documents of common and diverse
origins.
- Experimental data
As in chapter 2, the author uses four different bioinformatics documents from various
reputable data sources.
- Scenario
As in chapter 2, the experimental scenario runs the two types of XPath axis queries, sibling
and children, as point queries. The remaining XPath axes use range queries. This section goes
deeper into sibling queries by distinguishing between preceding and following siblings.


Unlike chapter 2, this section does not compare results for increasing sizes (numbers of tag
names/nodes); instead, the author runs the algorithms on the same 80,000 input tags for each
of the four types of bioinformatics data files, which have different data properties.
3.3.2 Experimental tools and environment
The author uses exactly the same environment and equipment as in the tests of chapter 2.
3.3.3. Comparison of the results of the BioX+-tree and BioX-tree methods
a. Sibling queries: The query must find all the sibling tags of any tag in the XML document.
Figure 3.3 shows that the average performance of the XR+ tree is always better than that of
the XR-tree for all XML documents. The best results appear to be obtained with the
Allhomologies document.
b. Preceding-sibling queries: The query only looks for the sibling tags that precede a tag in
the XML document. Figure 3.4 also shows that the average performance of the XR+ tree is always
better than that of the XR-tree for all XML documents. The Allhomologies document again
appears to give the best results.

Figure 3.3: Sibling query result

Figure 3.4: Preceding-sibling query result

Figure 3.5: Following-sibling query result

Figure 3.6: Children query result

Figure 3.7: Range query result

c. Following-sibling queries: The query only looks for the sibling tags that follow any tag in
the XML document. Figure 3.5 also shows that the average performance of the XR+ tree is always
better than that of the XR-tree for all XML documents. However, the results for the DNARice
document are markedly different. This shows that there are situations in which the XR+ tree
gives very good results, namely when the query eliminates many redundant tree-browsing steps
according to Theorem 2 and Consequence 2.1.
d. Children queries: These queries need to find the child tags of any tag in the XML document.
The best way to do this is to locate one sub-tag (the first or the last) and then use the
sibling query to find the other sub-tags. The query then becomes the sibling query described
above. In this work, however, we create a range query with a limited search window that can
find the sub-tags of any tag. Figure 3.6 shows that the average results of the BioX+-tree and
the BioX-tree are basically the same for all XML documents. It seems that the limited search
window is good enough that no difference between the two tree structures is visible.
e. Range queries: Apart from the queries on the XPath axes above, the remaining queries use
range windows with a limited spatial scope; the author experiments with these range queries
together. These queries do not take advantage of the pointers that connect the leaf nodes of
the BioX-tree and the BioX+-tree. Thus, the results depend entirely on how well the tree
structure is optimized. To compare the query performance of the BioX-tree and BioX+-tree
structures objectively, we use 50 fixed (non-random) queries. The results in Figure 3.7 show
that not all documents give different query results, but there are XML documents for which the
BioX+-tree structure provides better performance than the BioX-tree structure. The experiments
above show that constructing the algorithms according to the proposed theorems and consequence
is effective, not only for queries related to sibling tags but also for common queries, thanks
to the more optimal tree structure.

3.4. Chapter 3 conclusion
The goal is to find new indexing methods that can retrieve information from large
bioinformatics XML documents and at the same time reduce storage size. In chapter 3, the thesis
analyzed the correlation between the XML tree structure after conversion to digital spatial
data and the BioX-tree structure. From this analysis, it derived theorems and a consequence and
applied them to the construction of new algorithms; the new method is the BioX+-tree.
Experimental results have shown that the new BioX+-tree indexing method is superior to the
BioX-tree on most XPath axes and on regular queries. The experiments used bioinformatics data
sets from reputable sources that describe different kinds of biological data, in order to
confirm the objectivity and practicality of the algorithms.
The research results of chapter 3 are published in work 5 of the "List of author's works".
