Tải bản đầy đủ (.docx) (31 trang)

Tóm tắt tiếng anh: Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.

Bạn đang xem bản rút gọn của tài liệu. Xem và tải ngay bản đầy đủ của tài liệu tại đây (654.81 KB, 31 trang )

MINISTRY OF EDUCATION AND TRAINING
HANOI NATIONAL UNIVERSITY OF EDUCATION

NGUYEN VAN TINH

LINK PREDICTION IN HETEROGENEOUS
INFORMATION NETWORKS AND ITS
APPLICATIONS IN PREDICTING ASSOCIATIONS
BETWEEN NON-CODING RNAS AND DISEASES

Major: Computer Science
Major code: 9480101

SUMMARY OF DISSERTATION IN COMPUTER
SCIENCE

Hanoi, 2023


The dissertation was completed at the Faculty
of
Information Technology, Hanoi National University of
Education.

Supervisors:
1. Assoc. Proc. Dr. Tran Dang Hung
2. Dr. Le Thi Tu Kien
Reviewer 1: Assoc.Prof.Dr. Nguyen Long Giang,  Institute
of Information Technology- Vietnam Academy of Science
and Technology.
Reviewer 2:


University.

Assoc.Prof.Dr. Le Duc Hau, Thuyloi

Reviewer 3:
Assoc.Prof.Dr. Nguyen Ngoc Hoa,
University of Engineering and Technology-Vietnam
National University, Hanoi (VNU).
The dissertation will be defensed at The university
committee: Hanoi National University of Education.
At: ………………..

The dissertation can be read at: National Library of
Vietnam, Hanoi, Vietnam or Library of Hanoi National
University of education, Hanoi, Vietnam


PUBLICATION
[VTN1] Van Tinh Nguyen, Thi Tu Kien Le and Dang Hung Tran,
"A new method on lncRNA-disease-miRNA tripartite graph to
predict lncRNA-disease associations", 2020 12th International
Conference on Knowledge and Systems Engineering (KSE), 2020,
pp.

287-293,

doi:

10.1109/KSE50997.2020.9287563


(Scopus

indexed).
[VTN2] Van Tinh Nguyen, Thi Tu Kien Le, Tran Quoc Vinh
Nguyen

and

Dang

Hung

Tran,

 “Inferring

miRNA-disease

associations using collaborative filtering and resource allocation on a
tripartite

graph”, BMC

Med

Genomics 14, 225

(2021).

(ISI Q2 journal).

[VTN3] Van Tinh Nguyen and Dang Hung Tran, "An improved
computational method for prediction of lncRNA-disease associations
based on collaborative filtering and resource allocation", 2021 13th
International Conference on Knowledge and Systems Engineering
(KSE), 2021, pp. 1-6, doi: 10.1109/KSE53942.2021.9648632
(Scopus indexed).
[VTN4] Van Tinh Nguyen, Thi Tu Kien Le, Khoat Than and Dang
Hung Tran, “Predicting miRNA–disease associations using improved
random walk with restart and integrating multiple similarities”, Sci
Rep 11, 21071 (2021). />(ISI Q1 journal).


INTRODUCTION
Nowadays, we are in a connected world where data or objects’
information, actors or agents, object groups or component groups are
interacted with each other to compose large networks. These
networks are complex. They contain multiple types of nodes and
multiple types of interactions. These networks are called
heterogeneous information networks (HINs). They are rich in
semantic information and can be constructed from multiple data
sources. Analyzing of heterogeneous information network (HIN)
generates a trendy research of mining of data, retrieving of
information, link prediction, mining of graph, network science, and
so forth…
Link prediction is a crucial and active task in HIN analysis. It
benefits many researchers and organizations in a variety of fields.
The link prediction’s main objective is to discover absent links in a
network or to forecast links which may soonly occur in a network.
Link prediction has been broadly applied in various domains
from social networks to biological systems. For biological systems,

link prediction has been used to discover the relationships or
associations
among
biological
objects
such
as
disease-phenotype/gene associations, drug-protein interactions, drugmiRNA associations, disease-drug associations, non-coding RNAdisease associations, and so forth. Especially, for a long time,
identifying non-coding RNAs (ncRNAs) in the human genome is
difficult. They were treated as noise. However, ncRNAs play vital
roles in life activities.
Identifying relationships between ncRNAs and diseases has
exposed opportunities for therapeutic and diagnostic of human


diseases. Therefore, the studies of ncRNA-disease relationships have
extensively been executed in recent years
Conventional biological experiments make it costly, timeconsuming, and laborious to discover potential ncRNA-disease
relationships. Therefore, it requires to have computational methods
for identifying ncRNA-disease associations, especially associations
between non-coding RNAs (ncRNAs) and diseases. In the past few
years, various computational methods for predicting ncRNA-disease
associations have been developed.
Although actual computational methods have made massive
benefits in revealing disease‐associated ncRNAs in each category
and typically decrease the cost as well as time of biological
experiments. Howerver, there are still some limitations which are
needed to be solved as follows.
Firstly, the computational approaches for predicting ncRNAdisease associations ought to deal with sparse data problem. It bases
on the reality that the known ncRNA-disease associations' number is

quite smaller compared to the unknown associations. Hence, it is
difficult to obtain a reliable network to represent a reasonable
biological network. Therefore, it limits prediction accuracy.
Secondly, due to the sparsity data problem, it causes another
issue that the is unbalancing of positive and negative samples in
performing computational methods for predicting ncRNA-disease
associations. It is the reason that the prediction performance of
computational methods is not very reliable.
Thirdly, the similarity calculation in existing computational
methods depends excessively on known associations between
ncRNAs and diseases. It could generate the noticeable bias to
construct computational models for predicting ncRNA-disease


associations. Therefore, it requires to reasonably fuse different
similarity scores from different souces of biological information to
enhance ncRNA-disease association prediction performance.
Fourthly, most of existing computational methods are not
applicable to predict associations for isolated diseases or ncRNAs
(miRNAs or lncRNAs) which have not any known association with
other ncRNAs or other diseases in the examined data sets. So, it is
nessessary to combine different biological information to improve
the capability of prediction ncRNA-disease association for isolated
cases.
Fifthly, there are too many parameters that need to be adjusted
in many computational methods leading to the difficulty in
performing ncRNA-disease association prediction. It means that the
researchers need to develop more computational methods which will
be easier to employ in ncRNA-disease association prediction.
And finally, since more and more biological databases become

available so it requiresto effectively fuse data from multiple data
sources to enhance the reliability and performance of prediction.
Up to date, a numerous number of research are weekly
published in scientific journals or conferences to show new results of
research on developing computational methods for ncRNA-disease
association prediction. Many of them concentrates on solving the
above mentioned limitations. Additionally, based on the fact that
selecting useful data from heterogeneous information to build up a
reliable HIN is still a challenge, it remains room for scientists and
researchers to research for constructing a reliable HIN and training
an useful computational method to achieve more decisive
performance of ncRNA-disease associations prediction.


It is the reason that the Ph.D student selects the topic “Link
prediction in heterogeneous information networks and its
applications in predicting associations between non-coding RNAs
and diseases” for this dissertation.
 Dissertation objective and research problem
Through this dissertation, the research will focus on: proposing
computational methods or models to improve prediction performance
for predicting human non-coding RNA-disease associations on
heterogeneous information networks by solving the following
problems:
- Solving the sparse data problem to improve the accuracy of
human ncRNA-disease associations prediction performance.
- Fusing multi-types of information from different biological
datasets to have more realistic similarities and to decrease the
impact of depending on known human ncRNA-disease
associations excessively.

- Inheriting the computational methods from other domains such
as predicting of microbe-disease associations, drug-disease
relationships, drug-target interactions and so forth, and
improving them to achieve better performance in predicting
human non-coding RNA-disease associations.
 Thesis’s scientific contributions:
The thesis has the following scientific contributions:
Contribution 1: Proposed an improved computational model by
combining collaborative filtering and a resource allocation
process on a tripartite graph using multiple known associations'
types to forecast ncRNA-disease associations.
Contribution 2: Proposed a new miRNA-disease associations
prediction method which used a WKNKN algorithm to pre-


process data to decrease unverified associations in miRNAdisease association dataset and uncover latent miRNA-disease
associations using improved random walk with restart (RWR)
algorithm and fusing multiple similarities from HINs.
The contribution 1 is presented in Chapter 2 of the dissertation,
related contents of this contribution were published the Proceeding
of the KSE2020 ([VTN1], Scopus Indexed), BMC Medical
Genomics journal in 2021 ([VTN2], ISI Q2 journal), and Proceeding
of the KSE2021 ([VTN3], Scopus Indexed).
Thesis’s structure: Thesis structure is shown in the below figure.

CHAPTER 1. BACKGROUND
1.1. Basic concepts


1.1.1. Heterogeneous information networks

 Information network
Definition 1.1. Information network. An information network
is represented by a graph G=(V , E) where V represents set of
nodes and E represents set of edges of the graph. The graph G
contains an mapping function of object type, ϕ: V → A. It also
contains a mapping function of link type, ψ: E → R. Each node v ϵ
V has only one particular object type, ϕ(v) ϵ A. Each link e ϵ E has
only one particular link type, ψ(e) ϵ R. If two links share the same
starting and the same ending object type, they are same link type.
 Heterogeneous/Homogeneous information network.

Definition
1.2.
Heterogeneous/Homogeneous
information network. If the information network contains
moreover one object type or one link type, it is known as a
HIN, typically |A|>1 or |R|>1; otherwise it is called a
homogeneous information network, typically, |A|=1 and |R|=1.
1.1.2. Biological systems
Biological systems are a special class of heterogeneous information
networks which consists of a large number of biological entities such
as genes, miRNAs, lncRNAs, gene expressions, phenotypes, and so
forth.
1.1.3. Non-coding RNAs (ncRNAs)
The RNAs which can not be transferred into proteins are
referred to as non-coding RNAs Micro RNAs (miRNAs)
MiRNAs are a class of single-stranded, endogenous, small,
evolutionarily conserved ncRNAs with length of 22-26 nucleotides.
Long non-coding RNAs (lncRNAs)
LncRNAs belong to a subclass of ncRNAs. They lengthen more

than 200 nucleotides.


1.2. Link prediction in heterogeneous information networks
1.2.1. Link prediction problem
Definition 1.5. Link prediction in heterogeneous information
networks
Giving a network which is represented by a graph G=¿
V 2 ∪… ∪ V M , E1 ∪ E2 ∪… ∪ E N ¿, where V i (i=1,2 , … , M )
indicates the node set with type i, E j ( j=1,2, … , N ) relects the link
set with type j. The link prediction task is to find if there will exist a
link e k between vi ( v i ∈ V i )and v j ( v j ∈ V j ). In this sense, e k reflects
the prediction task's target link.
Input:
The
graph

G=(V 1 ∪ V 2 ∪ … ∪ V M , E1 ∪ E2 ∪… ∪ EN ) :
V i (i=1,2 , … , M ) is the node set with type i and E j ( j=1,2, … , N )
reflects the link set with type j.
Output: For any two potentially connected objects vi ( v i ∈ V i )
and v j ( v j ∈ V j ), whether link e k exists (1) or not exist (0)?
1.2.2. Link prediction methods
In general, the link prediction methods can roughly be
categorized into the following types: Network similarity-based
methods, probabilistic and maximum likelihood-based methods,
machine learning based methods and deep learning-based methods.
1.2.3. Link prediction applications in biological systems
In biological networks, link prediction is normally used in
predicting associations between diseases and genes, prioritizing of

disease candidate metabolites, drug development, predicting drugprotein interactions, identifying drug-miRNA associations,
discovering disease-drug associations, predicting associations
between non-coding RNAs and diseases, and so forth.


1.3.

Computational methods for predicting associations
between non-coding RNAs and diseases
1.3.1. Predicting non coding RNA-disease association prediction
as a link prediction problem
Predicting ncRNA-disease associations problem is considered as
a HIN-based link prediction problem. It typically uses a
heterogeneous network with multiple biological objects and
relationships among them. These biological objects and links among
them were collected from different sources including nodes
belonging to ncRNA (miRNA, lncRNA) type and disease type. Then,
it predicts potential associations between ncRNAs and diseases. The
potential associations may be new associations or missing ones.
1.3.2. Materials used for ncRNA-disease association prediction
Information about miRNAs and miRNA-target associations could
be gathered from different data sources such as miRBase, miReg,
miRTarBase, miRecords, and so on. The verified miRNA-disease
associations could be collected from multiple public databases such
as
MiRCancer, MiR2Disease, HMDD, MiREC, DbDEMC, and so on.
LncRNAs’ information could be obtained from different data sources
such as LNCipedia, NONCODE database, LncRBase, and so on.
LncRNA-related interactions’ information could be gathered from
different databases such as DIANA-LncBase, lncRNA2Target, and

so on. LncRNA-disease associations’ information could be collected
from divergent databases such as LncRNADisease, Lnc2Cancer,
MNDR, and so on.
1.3.3. Similarity calculation and network construction
Disease similarity calculation


A typical method calculates disease similarity by computing the
the disease's ancestral nodes' contribution in a tree structure like
MeSH. One other type of method used other related biological
molecules' information to calculate disease’s similarity.
Non-coding RNA similarity calculation
The most popular method is to measure ncRNA similarity using
the biological information of ncRNAs themselves
Heterogeneous network construction
After obtaining similarities, a heterogeneous network is constructed.
1.3.4. Literature review of computational methods to predict
ncRNA-disease associations
Numerous computational approaches have recently been
developed for ncRNA-disease associations prediction. They can
generally be arranged into the accompanying categories: networkbased, recommendation-based, resource allocation-based, machine
learning-based, deep learning-based, and multi-biological
information and multi-model integration-based methods. Each
computational methods' category contains their own advantages and
disadvantages.
1.4. Thesis’s research directions
The research can be implemented in the following directions
Firstly, it is necessary to develop network representation
methods and more reasonable methods for feature extraction,
similarity calculation, and fusion to solve sparse data problem or

enhance the reliability of prediction performance.
Secondly, the thesis can concentrate on integrating different
biological datasets to construct more reasonable similarities and
proposing new computational approachess for forecasting noncoding RNA-disease associations.


Thirdly, the computational approaches for revealing ncRNAdisease associations can also be applied to other research areas such
as drug-disease associations, microbe-disease associations,
metabolite-disease associations, and so on. Therefore, the new
approaches or models for predicting non-coding RNA-disease
associations can also borrow the computational techniques from the
aforementioned fields as well and acclimate them in order to improve
performance in the interested area.
1.5. Some evaluation methods and metrics to evaluate
prediction performance
In this dissertation, the prediction performance of the proposed
models is evaluated by using Area under Roc Curve (AUC) as well
as Area under Precision-Recall Curve (AUPR) by doing 5-foldcross-validation and leave-one-out-cross-validation (LOOCV)
experiments. Besides that, to support the prediction performance
reliability, some case studies could be implemented in each proposed
method. Additionally, although the time complexity was not usually
taken into account to evaluate the performance of a method but in
this dissertation, the time complexity of proposed methods were
quantitatively estimated to assure that they will be finished in
acceptable execution time.
1.6. Chapter summary
In this chapter, firstly, some basic concepts are introduced.
Secondly, the problem of link prediction in HINs was stated and its
applications in biological systems were summarized. Thirdly, the
computational methods for predicting associations between ncRNAs

and diseases were reviewed. From the literature review of
computational methods to predict ncRNA-disease associations, the
thesis research directions are already figured out. Finally, some


evaluation methods and metrics which are used in evaluating
prediction performance of the proposed models, were presented.
CHAPTER 2. NCRNA-DISEASE ASSOCIATIONS
PREDICTION WITH COLLABORATIVE FILTERING AND
RESOURCE ALLOCATION PROCESS ON A TRIPARTITE
GRAPH
2.1. Motivations
Numerous computational approaches have recently been
proposed
to forecast potential ncRNA-disease associations, particularly
miRNA-disease associations as well as lncRNA-disease associations.
Many of them heavily depended on known ncRNA-disease
associations. They need to utilize different similarity matrices which
are not straightforwardly connected with the ncRNA-disease
associations. Therefore, recently, numerous computational methods
have been constructed by using different known associations' types
among different objects' types to discover latent ncRNA-disease
associations. Generally, computational methods that take into
account a variety of known associations among a variety of objects
typically contributes to an increase in prediction accuracy for
predicting ncRNA-disease associations.This chapter introduces a
brand-new computational model that integrates a variety of known
associations between multiple objects to solve the sparsity data
problem and enhance prediction accuracy. The ncRNA-disease
associations were predicted by employing a process for allocating

resources on a tripartite graph.
2.2. Main related works


2.2.1. The item-based collaborative filtering algorithm for
ncRNA-disease association prediction.
Our proposed model used the item-based CF algorithms for
recommendation to solve the sparse data problem.
2.2.2. Resource allocation on a tripartite graph
The resource allocation algorithm on a tripartite graph was
successfully implemented in some computational approaches to
forecast associations between ncRNAs and diseases, including
TPGLDA and ncPred models.
2.3. The proposed model
The proposed model's flowchart is provided in Figure 2.1. It
typically consists of four main stages. In the initial stage, by using
known miRNA-disease associations, known lncRNA-disease
associations, and virified miRNA-lncRNA interactions, a tripartite
graph G0 is constructed. During the second stage, an item-based CF
algorithm
is


Figure 2.1. The proposed model's flowchart
applied on graph G0 to solve the problem of sparsity data and to
produce a tripartite graph G u. At the third stage, a resource allocation
algorithm uses the graph Gu to determine the final resource score of
each disease's ncRNA candidates. And finally, all ncRNA
candidates’ resource score for each disease are ranked in descending
order at the final stage, so that the candidate with higher resource

score will have greater possibility of being verified in the future
2.4. Employing the proposed model to infer miRNA-disease
associations based on collaborative filtering and resource
allocation
2.4.1. Proposed model's stages in inferring miRNA-disease
associations
 Stage 1: Construction of a tripartite graph G0
 Stage 2: Construction of a tripartite graph Gu
 Stage 3: Employing resource allocation process on the tripartite
graph Gu to infer miRNA-disease associations




Stage 4: Ranking all candidate miRNAs’ Rscores for each disease
in descending order
2.4.2. Proposed method’s experiments and results
 Step 1. Preparing datasets.
The proposed method used datasets which came from the study of
Zhao et al. There are 190 diseases, 111 lncRNAs, 264 miRNAs,
1880 known lncRNA-miRNA associations, 936 known associations
between diseases and lncRNAs, and 3552 verified miRNA-disease
associations in these datasets. Figure 2.2 shows the relationships
among different datasets and the number of data nodes used in the
proposed method.
 Step 2: Implementing the proposed method and estimating its
computational time complexity
The proposed method was implemented by using the Python

Figure 2.2. The datasets and the numbers of data nodes in the

proposed method
programming language. Its time complexity is equivalent to O(n3). It
is polynomial time complexity.
 Step 3. Evaluating prediction performance
The 5-fold cross-validation experiments, and the Area under roc
curve (AUC) as well as Area under precision-recall curve (AUPR)
are used to evaluate the proposed method's performance in
identifying miRNA-disease associations.


Evaluating the AUC under 5-fold-cross validation

Figure 2.3. ROC curve and AUC value of the proposed method with
γ = 0.9 in one experimental running time
The proposed method achieves the best averaged value of
0.9788
with γ = 0.9 after performing the experiments for 10 times under 5fold-cross-validation. Figure 2.3 illustrates ROC curve and AUC
value of proposed method with γ = 0.9 in one experimental running
time.
Evaluating the AUPR under 5-fold-cross validation
The proposed method achieves the best averaged AUPR value
0.9373 with γ = 0.9 under 5-fold cross experiments. Figure 2.4 shows
Precision-Recall curve and AUPR value of proposed method with γ
= 0.9 in one experimental running time.


Figure 2.4. Precision-Recall curve and AUPR value of the
proposed method with γ = 0.9 in one experimental running
time
Comparing prediction performance with other related models

The performance of proposed method is compared with the
performance of DCSMDA and TPGLDA methods. The
performances of these methods are shown in the Table 2.1.
Table 2.1. Performance comparison with other related methods
Method
AUC value
AUPR value
TPGLDA
0.9703
0.7421
DCSMDA
0.8155
The proposed method
0.9788
0.9373
 Step 4. Checking case studies
Case studies on Prostatic Neoplasms, Heart Failure, Glioma and
Open-angle Glaucoma are constructed to show the ability of our
model in order to identify new disease-associated miRNAs.
2.5. Employing the proposed model to predict lncRNA-disease
associations based on collaborative filtering and resource
allocation
2.5.1. Proposed model's stages in predicting lncRNA-disease
associations
 Stage 1: Construction of a tripartite graph G0
 Stage 2: Applying collaborative filtering algorithm on known
lncRNA-disease associations and verified lncRNA-miRNA
interactions to obtain a new tripartite graph Gu
 Stage 3: Using improved resource allocation process to obtain
predicted lncRNA-disease associations



 Stage 4: Ranking all predicted lncRNAs for each disease to have
final results
2.5.2. Proposed method’s experiments and results
 Step 1. Preparing datasets

Figure 2.5. The relationships between the different data sources and
the numbers of data nodes used in the proposed method
The proposed method used the datasets came from studies of
Fu et al and Yao et al. These datasets contained 2697 tentatively
known lncRNA-disease associations (LDAs), 13562 known miRNAdisease associations (MDAs) and 1002 verified lncRNA-miRNA
interactions (LMIs) among 240 lncRNAs, 420 diseases and 495
miRNAs. Figure 2.5 illustrates the data nodes’ numbers and the
relationships of the different data sources used in the proposed
method.


Step 2: Implementing the proposed method and estimating
time complexity of proposed method
This proposed method was also implemented by using the Python
programming language. The time complexity of the proposed
method
is O(nl*nd*nm) ≈ O(n3). It means its time complexity is polynomial.
 Step 3: Evaluating prediction performance
The 5-fold cross-validation experiments were employed and AUC
and AUPR were used to estimate the proposed method's
performance.




×