MINISTRY OF EDUCATION AND TRAINING
HANOI NATIONAL UNIVERSITY OF EDUCATION

NGUYEN VAN TINH

LINK PREDICTION IN HETEROGENEOUS INFORMATION
NETWORKS AND ITS APPLICATIONS IN PREDICTING
ASSOCIATIONS BETWEEN NON-CODING RNAS AND DISEASES

Major: Computer Science
Code: 9480101

DOCTORAL DISSERTATION IN COMPUTER SCIENCE

SUPERVISORS
1. Assoc. Prof. Dr. TRAN DANG HUNG
2. Dr. LE THI TU KIEN

Hanoi-2023



AUTHORSHIP'S DECLARATION

I, NGUYEN VAN TINH, affirm that the dissertation entitled “Link
prediction in heterogeneous information networks and its applications in
predicting associations between non-coding RNAs and diseases” has been
completed by myself under the supervision of Assoc. Prof. Dr. Tran Dang Hung and
Dr. Le Thi Tu Kien. I affirm the following points:
- This dissertation was done during my Ph.D. research at Hanoi National
University of Education.
- This work has not been submitted for any other degree or qualification at
Hanoi National University of Education or any other institution.
- Appropriate acknowledgment has been given in the thesis where reference
has been made to other published works.
- The submitted thesis is my own work, except where work done in collaboration
has been included. The collaborative contributions have been indicated.

Hanoi, 2023
Ph.D. Student

SUPERVISORS:
1. Assoc. Prof. Dr. TRAN DANG HUNG

2. Dr. LE THI TU KIEN



ACKNOWLEDGEMENT
The dissertation was completed during my Ph.D. course at Hanoi
National University of Education (HNUE). HNUE is a special place where I obtained
valuable knowledge and skills on the way to becoming a researcher. I am very grateful
to all the people who have always supported and encouraged me in completing the
dissertation.
Firstly, I would like to thank my advisors, Assoc. Prof. Dr. Tran Dang
Hung and Dr. Le Thi Tu Kien, for their instruction, advice, and encouragement
throughout my Ph.D. course. My dissertation could not have been completed without
my advisors’ scientific direction, encouragement, and support.
Secondly, I wish to thank all members of the Faculty of Information
Technology, HNUE, for their frequent support during my Ph.D. course. I also
wish to thank all my colleagues in the Faculty of Information Technology, Hanoi
University of Industry (HaUI), for their support in professional work during the
Ph.D. course.
Next, I wish to thank Assoc. Prof. Dr. Than Quang Khoat, Hanoi University
of Science and Technology, and Dr. Nguyen Tran Quoc Vinh, Faculty of
Information Technology, The University of Da Nang - University of Science and
Education, for their contributions and suggestions during my Ph.D. course.
I would also like to thank all reviewers for their valuable comments
and suggestions on the completion of the dissertation.
Additionally, this work was funded by Gia Lam Urban Development and
Investment Company Limited, Vingroup, and supported by Vingroup Innovation
Foundation (VINIF) under project code VINIF.2019 DA18.
Finally, I would like to express my sincere gratitude to my family and friends
for their continuous support and encouragement to complete the Ph.D. course.
Hanoi, 2023
Ph.D. Student

Nguyen Van Tinh



CONTENTS
AUTHORSHIP'S DECLARATION ...................................................................... i
ACKNOWLEDGEMENT..................................................................................... ii
CONTENTS ..........................................................................................................iii
ABBREVIATIONS............................................................................................... vi
LIST OF TABLES............................................................................................... vii
LIST OF FIGURES ............................................................................................ viii
INTRODUCTION ................................................................................................. 1

CHAPTER 1. BACKGROUND .......................................................................... 10
1.1. Basic concepts................................................................................................. 10
1.1.1. Heterogeneous information networks ................................................. 11
1.1.2. Biological systems ............................................................................. 13
1.1.3. Non-coding RNAs (ncRNAs) ............................................................ 14
1.2. Link prediction in heterogeneous information networks .................................. 15
1.2.1. Link prediction problem..................................................................... 15
1.2.2. Link prediction methods .................................................................... 16
1.2.3. Link prediction applications in biological systems ............................. 19

1.3. Computational methods for predicting associations between non-coding RNAs
and diseases ........................................................................................................... 22
1.3.1. Predicting non-coding RNA-disease association prediction as a link
prediction problem .......................................................................................... 22
1.3.2. Materials used for ncRNA-disease association prediction .................. 22
1.3.3. Similarity calculation and network construction ................................. 26
1.3.4. Literature review of computational methods to predict ncRNA-disease
associations ..................................................................................................... 27
1.4. Thesis’s research directions ............................................................................. 36
1.5. Some evaluation methods and metrics to evaluate prediction performance ...... 37
1.5.1. Cross-validation ................................................................................. 37
1.5.2. Area under ROC Curve (AUC) ............................................................ 38
1.5.3. Area under Precision-Recall Curve (AUPR) ...................................... 39
1.5.4. Checking case studies ........................................................................ 40

1.6. Chapter summary ............................................................................................ 41
CHAPTER 2. NCRNA-DISEASE ASSOCIATIONS PREDICTION WITH
COLLABORATIVE FILTERING AND RESOURCE ALLOCATION
PROCESS ON A TRIPARTITE GRAPH .......................................................... 43
2.1. Motivations ..................................................................................................... 43
2.2. Main related works ......................................................................................... 45
2.2.1. The item-based collaborative filtering algorithm for ncRNA-disease
association prediction ...................................................................................... 45
2.2.2. Resource allocation on a tripartite graph ................................................ 46
2.3. The proposed model for predicting ncRNA-disease associations based on a
collaborative filtering algorithm and a resource allocation process on a tripartite
graph ..................................................................................................................... 48
2.4. Employing the proposed model to infer miRNA-disease associations based on
collaborative filtering and resource allocation........................................................ 50
2.4.1. Detailed description of proposed model's stages in inferring
miRNA-disease associations ......................................................................................... 50
2.4.2. Proposed method's experiments and results ........................................... 54
2.5. Employing the proposed model to predict lncRNA-disease associations based
on collaborative filtering and resource allocation ................................................... 66
2.5.1. Detailed description of proposed model's stages in predicting
lncRNA-disease associations ......................................................................................... 66
2.5.2. Proposed method’s experiments and results ........................................... 71
2.6. Chapter summary ............................................................................................ 79
CHAPTER 3. MIRNA-DISEASE ASSOCIATIONS PREDICTION USING
IMPROVED RANDOM WALK WITH RESTART AND INTEGRATING
MULTIPLE SIMILARITIES ............................................................................. 81
3.1. Motivation and main related works ................................................................. 81



3.2. Datasets used in the proposed method ............................................................. 83
3.2.1. Human miRNA-disease associations ..................................................... 83
3.2.2. Disease semantic similarity ................................................................... 83
3.2.3. MiRNA functional similarity ................................................................. 84
3.3. Proposed method ............................................................................................ 85
3.3.1. Proposed method overview .................................................................... 85
3.3.2. Calculating Gaussian interaction profile kernel similarity for miRNAs
and diseases. ................................................................................................... 87
3.3.3. Calculating Integrated similarity for miRNAs and diseases ................... 88
3.3.4. Weighted K-nearest known neighbors algorithm ................................... 88
3.3.5. Constructing miRNA similarity-based and disease similarity based
heterogeneous networks .................................................................................. 89
3.3.6. Employing improved random walk with restart to predict miRNA-disease
associations ..................................................................................................... 91
3.3.7. Rank the final prediction score of associations to obtain predicted
miRNA-disease associations. .......................................................................... 94
3.4. Experiments and results .................................................................................. 94
3.4.1. Datasets ................................................................................................. 94
3.4.2. Implementing and Estimating time complexity of the proposed method 95
3.4.3. Performance measures ........................................................................... 96
3.4.4. Performance comparison with other related models ............................. 100
3.4.4. Case studies ......................................................................................... 102
3.5. Chapter summary and discussion .................................................................. 108

CONCLUSION AND FUTURE WORKS ........................................................ 110
PUBLICATIONS............................................................................................... 113
REFERENCES .................................................................................................. 114



ABBREVIATIONS
No   Abbreviation   Meaning
1    AUC            Area Under ROC Curve
2    AUPR           Area Under Precision-Recall Curve
3    CF             Collaborative filtering
4    CNN            Convolutional neural network
5    CRC            Colorectal cancer
6    DAGs           Directed acyclic graphs
7    DBN            Deep belief network
8    FN             False negative
9    FP             False positive
10   FPR            False positive rate
11   GCN            Graph convolutional network
12   GIP            Gaussian interaction profile
13   HCC            Hepatocellular carcinoma
14   HF             Heart failure
15   HIN            Heterogeneous information network
16   lncRNAs        Long non-coding RNAs
17   LOOCV          Leave-one-out cross validation
18   MF             Matrix factorization
19   miRNAs         Micro RNAs
20   ncRNAs         Non-coding RNAs
21   NMF            Non-negative matrix factorization
22   OAG            Open-angle glaucoma
23   POAG           Primary open-angle glaucoma
24   ROC            Receiver operating characteristic
25   RWR            Random Walk with Restart
26   SVM            Support vector machine
27   TN             True negative
28   TP             True positive
29   TPR            True positive rate
30   WKNKN          Weighted K nearest known neighbors



LIST OF TABLES
Table 1.1. Databases containing miRNA-related information and miRNA-disease
associations ........................................................................................................... 23
Table 1.2. Databases containing lncRNA-related information................................ 24
Table 2.1. Performance comparison with other related models .............................. 60
Table 2.2. Top 40 predicted miRNAs for Prostatic Neoplasms .............................. 62
Table 2.3. Top 40 predicted miRNAs for Heart failure .......................................... 63
Table 2.4. Top 40 predicted miRNAs for Glioma .................................................. 64
Table 2.5. Top 20 miRNAs for Glaucoma, Open-Angle ........................................ 65
Table 2.6. AUC and AUPR values of related methods in comparison .................... 76
Table 2.7. Top 10 predicted Prostate cancer-related lncRNAs ............................... 78
Table 2.8. Top 10 predicted lncRNAs related to Stomach cancer ........................... 78
Table 3.1. AUC and AUPR One-sample t-test ....................................................... 97

Table 3.2. Evaluation of index changes in WKNKN algorithm .............................. 99
Table 3.3. AUC and AUPR values RWRMMDA and other latest methods in
comparison .......................................................................................................... 102
Table 3.4. Top 40 predicted Breast Neoplasms-associated miRNAs .................... 103
Table 3.5. Top 40 predicted Hepatocellular carcinoma-associated miRNAs ........ 105
Table 3.6. Top 40 predicted Stomach Neoplasms-associated miRNAs ................ 106
Table 3.7. Top 10 predicted associations between Lung Neoplasms and miRNAs
from the simulated experiment for predicting new disease-related miRNAs ........ 107
Table 3.8. Top 10 predicted associations for Ovarian Neoplasms and miRNAs from
the simulated experiment for predicting new disease-related miRNAs ................. 108



LIST OF FIGURES
Figure 0.1. The dissertation outline .......................................................................... 8
Figure 1.1. An illustration of HIN with multiple node types and multiple link types.
.............................................................................................................................. 11
Figure 1.2. An illustration of HIN’s network schema. ............................................ 12
Figure 1.3. An illustration of a link prediction problem ......................................... 16
Figure 1.4. A ROC curve and AUC's illustration ................................................... 39
Figure 1.5. An illustration of a Precision-recall curve and AUPR. ......................... 40
Figure 2.1. The proposed model's flowchart .......................................................... 49
Figure 2.2. The datasets and the numbers of data nodes in the proposed method ... 56
Figure 2.3. ROC curve and AUC value of the proposed method with γ = 0.9 in one
experimental running time ..................................................................................... 59
Figure 2.4. Precision-Recall curve and AUPR value of the proposed method with γ =
0.9 in one experimental running time ..................................................................... 60
Figure 2.5. The relationships between the different data sources and the numbers of
data nodes used in the proposed method ................................................................ 72

Figure 2.6. The proposed method's ROC curves and AUC values in 5 running times
of experiments with 𝛾 = 0.8. ................................................................................. 75
Figure 2.7. The proposed method's Precision-Recall curves and AUPR values in 5
running times of experiments with 𝛾 = 0.8 ........................................................... 76
Figure 3.1. Illustration of computing miRNA functional similarity ........................ 84
Figure 3.2. The workflow of the proposed method (RWRMMDA) ........................ 85
Figure 3.3. Illustration of the process of weight assignment in disease space and
miRNA space ........................................................................................................ 91
Figure 3.4. The improved RWR process's steps to predict miRNA-disease
associations ........................................................................................................... 92
Figure 3.5. ROC curves and AUC values (a) and PR curves and AUPR values (b) in
5 running times of 5-fold cross-validation experiments.......................................... 97



Figure 3.6. ROC curve and AUC value (a) and PR curve and AUPR value (b) under
global LOOCV experiment .................................................................................... 98
Figure 3.7. ROC curves and AUC values (a) and Precision-Recall curves and AUPR
values (b) in comparison with other related approaches ....................................... 101
Figure 3.8. ROC curves and AUC values (a) and Precision-Recall curves and AUPR
values (b) in different cases of RWRMMDAs ..................................................... 101



INTRODUCTION

Nowadays, we live in a connected world in which data objects, actors or agents,
and groups of objects or components interact with each other to form large
networks. These networks are complex: they contain multiple types of nodes and
multiple types of interactions. Such networks are called heterogeneous
information networks (HINs). They are rich in semantic information and can be
constructed from multiple data sources. The analysis of HINs has become an active
research area spanning data mining, information retrieval, link prediction,
graph mining, network science, and so forth [1]–[3].
Link prediction is a crucial and active task in HIN analysis. It benefits many
researchers and organizations in a variety of fields. The main objective of link
prediction is to discover missing links in a network or to forecast links that may
appear in the network in the near future. It has been extensively studied in the
literature [4]–[8]. Link prediction has been broadly applied in various domains,
from social networks to biological systems. In biological systems, link prediction
has been used to discover relationships or associations among biological objects,
such as disease-phenotype/gene associations, disease-metabolite associations,
drug-protein interactions, drug-miRNA associations, disease-drug associations,
non-coding RNA-disease associations, and so forth. In particular, identifying
non-coding RNAs (ncRNAs) in the human genome was difficult for a long time, and
they were treated as noise. However, ncRNAs play vital roles in life activities.
Additionally, it has been demonstrated that they have a significant impact on the
occurrence, progression, and development of human diseases. Identifying
relationships between ncRNAs and diseases has opened up opportunities for the
therapy and diagnosis of human diseases. Therefore, ncRNA-disease relationships
have been studied extensively in recent years.
Recently, a large number of experimental methods have been developed to help
determine the relationships between ncRNAs and diseases. However, discovering
potential ncRNA-disease relationships with conventional biological experiments is
costly, time-consuming, and laborious. Therefore, computational methods for
identifying ncRNA-disease associations are required. Among the types of ncRNAs,
two in particular, micro RNAs (miRNAs) and long non-coding RNAs (lncRNAs), have
been carefully studied and have attracted a lot of attention from researchers. In
the past few years, various computational methods for predicting ncRNA-disease
associations have been developed. They can be divided into the following
categories: network-based, recommendation-based, resource allocation-based,
machine learning-based, deep learning-based, and multi-model and biological
information integration-based methods [9]–[12]. Existing computational methods in
each category have brought substantial benefits in revealing disease-associated
ncRNAs and typically decrease the cost and time of biological experiments. For
example, network-based methods are easy to understand and normally make
predictions quickly. Machine learning-based methods can effectively learn and
derive features of ncRNAs or diseases. Deep learning-based approaches, with the
development of graph neural networks, have strong learning and prediction
abilities for combining network and biological features. However, some
limitations remain to be solved, as follows.
Firstly, computational approaches for predicting ncRNA-disease associations
have to deal with the sparse data problem. This stems from the fact that the
number of known ncRNA-disease associations is much smaller than the number of
unknown associations. Hence, it is difficult to obtain a reliable network that
reasonably represents the underlying biological network, which limits prediction
accuracy [11].
Secondly, the sparse data problem causes another issue: the imbalance between
positive and negative samples when applying computational methods to predict
ncRNA-disease associations. As a result, the prediction performance of
computational methods is not very reliable.
Thirdly, the similarity calculation in existing computational methods depends
excessively on known associations between ncRNAs and diseases. This can introduce
noticeable bias into computational models for predicting ncRNA-disease
associations. Therefore, different similarity scores from different sources of
biological information need to be fused in a reasonable way to enhance
ncRNA-disease association prediction performance [10].
Fourthly, most existing computational methods are not applicable to predicting
associations for isolated diseases or ncRNAs (miRNAs or lncRNAs), that is, those
without any known association with ncRNAs or diseases, respectively, in the
examined datasets. So, it is necessary to combine different biological
information to improve the capability of predicting ncRNA-disease associations
for isolated cases.
Fifthly, many computational methods have too many parameters that need to be
tuned, which makes ncRNA-disease association prediction difficult to perform.
This means that researchers need to develop computational methods that are easier
to employ in ncRNA-disease association prediction.
And finally, as more and more biological databases become available, data from
multiple sources need to be fused effectively to enhance the reliability and
performance of prediction.
To date, numerous studies are published every week in scientific journals and
conference proceedings, presenting new results on developing computational
methods for ncRNA-disease association prediction. Many of them concentrate on
solving the above-mentioned limitations. Additionally, because selecting useful
data from heterogeneous information to build a reliable HIN is still a challenge,
there remains room for researchers to construct a reliable HIN and train a useful
computational method that achieves better performance in ncRNA-disease
association prediction [11]. Future research on developing computational methods
for ncRNA-disease association prediction can follow the directions below.
Firstly, the sparse data problem needs to be solved to enhance the reliability of
prediction performance. It can be addressed by selecting reasonable similarity
calculations and network representation methods, as well as reasonable and
meaningful pre-processing algorithms or methods that have already been applied in
other recent studies.
Secondly, future research needs to integrate different biological datasets to
construct more reasonable similarities and to reduce the impact of relying too much
on known ncRNA-disease associations. Thereby, the performance and reliability of
computational prediction methods can be enhanced.
Thirdly, computational methods from other domains, such as microbe-disease
association prediction, metabolite-disease association prediction, drug-disease
association prediction, drug-target prediction, and so on, can also be applied to
ncRNA-disease association prediction. Therefore, future research can borrow
computational methods from these areas and adapt them to attain better
performance in ncRNA-disease association prediction.
For these reasons, the Ph.D. student selected the topic “Link prediction in
heterogeneous information networks and its applications in predicting
associations between non-coding RNAs and diseases” for this dissertation.
❖ Dissertation objective and research problem
This dissertation focuses on proposing computational methods or models that
improve the performance of predicting human non-coding RNA-disease associations
on heterogeneous information networks by solving the following problems:
- Solving the sparse data problem to improve the accuracy of human ncRNA-disease
association prediction.
- Fusing multiple types of information from different biological datasets to obtain
more realistic similarities and to decrease the impact of depending excessively on
known human ncRNA-disease associations.
- Inheriting computational methods from other domains, such as predicting
microbe-disease associations, drug-disease relationships, drug-target interactions,
and so forth, and improving them to achieve better performance in predicting
human non-coding RNA-disease associations.



❖ Research questions to be answered:
To solve the above problems and achieve the research objective, the following
research questions need to be answered.
The first question is "How to solve the sparse data problem?". To date, several
methods have been used to decrease the effects of the sparse data problem. For
example, collaborative filtering (CF) algorithms, originally from recommender
systems, have been used in different studies to mitigate the sparse data problem
in revealing ncRNA-disease associations [13]. The weighted K-nearest known
neighbors (WKNKN) algorithm has been applied as a data pre-processing step to
reduce the number of unknown ncRNA-disease associations in different works
[14]–[17]. It is based on the assumption that unknown associations in the training
datasets could be true associations, whose likelihood can be estimated by
measuring the similarities of an ncRNA or disease to other ncRNAs or diseases,
respectively. Therefore, the dissertation research can employ the CF or WKNKN
algorithm to solve the sparse data problem, depending on the biological datasets
used. However, integrating the biological datasets to construct reasonable
similarities among various data sources is also an issue. It depends on the types
of biological data selected to measure similarities. Hence, the next question that
the dissertation research has to answer is "Which types of biological data should
be used to obtain reasonable network representations for predicting associations
between non-coding RNAs and diseases?". Recently, various studies have integrated
biological information from different data sources to measure the similarities
among different biological objects. For instance, Ding et al. [18] used
information on lncRNAs, diseases, and genes to build a tripartite graph to
forecast lncRNA-disease associations. Yu et al. [13] relied on information on
lncRNAs, diseases, and miRNAs and on a lncRNA-miRNA-disease network to reveal
lncRNA-disease associations. Other works used multi-type biological networks or
multi-omics data to forecast ncRNA-disease associations [19], [20]. Besides that,
each category of computational methods has its own issues. Therefore, another
question that the research has to answer is how to combine or integrate multiple
models and multiple types of biological information to effectively overcome the
issues of single models and enhance prediction performance.
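To make the WKNKN idea mentioned above more concrete, the following is a minimal Python sketch, assuming a binary ncRNA-disease matrix Y and an ncRNA-ncRNA similarity matrix S. The function name wknkn_fill_rows, the parameters k and decay, and the toy data are illustrative assumptions rather than the dissertation's implementation. Only the ncRNA side is shown; the disease side would be handled symmetrically.

```python
def wknkn_fill_rows(Y, S, k=2, decay=0.5):
    """Estimate soft association scores for each row of Y from similar rows.

    Y: list of rows (ncRNAs) over columns (diseases), entries 0 or 1.
    S: square row-to-row similarity matrix with values in [0, 1].
    k, decay: hypothetical defaults chosen only for this illustration.
    """
    n = len(Y)
    filled = [row[:] for row in Y]
    for i in range(n):
        # Candidate neighbours: other rows with at least one known association.
        known = [j for j in range(n) if j != i and any(Y[j])]
        nearest = sorted(known, key=lambda j: S[i][j], reverse=True)[:k]
        norm = sum(S[i][j] for j in nearest)
        if norm == 0:
            continue  # no usable neighbour, leave the row as it is
        for col in range(len(Y[i])):
            # Closer neighbours weigh more: decay**rank times similarity.
            estimate = sum((decay ** rank) * S[i][j] * Y[j][col]
                           for rank, j in enumerate(nearest)) / norm
            # Known associations (value 1) are never lowered.
            filled[i][col] = max(Y[i][col], estimate)
    return filled


# Toy data (hypothetical): 3 ncRNAs x 2 diseases; ncRNA 1 has no known association.
Y = [[1, 0],
     [0, 0],
     [0, 1]]
S = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
filled = wknkn_fill_rows(Y, S, k=2, decay=0.5)
print(filled[1])
```

In this toy example, the ncRNA in row 1, which has no known association, receives soft scores borrowed from its most similar known neighbours, while entries already equal to 1 are left unchanged.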
❖ Thesis’s research scope and methodology:
To achieve the objective of proposing computational methods or models that
improve the performance of predicting human non-coding RNA-disease associations
on heterogeneous information networks, both theoretical and experimental
methodologies are employed in the thesis.
Theory research
For the theory research, firstly, the author reviewed the literature to obtain
background knowledge of heterogeneous information networks, the link prediction
problem, biological systems, biological objects, link prediction applications
in biological systems, and so on. Secondly, the computational methods for
predicting ncRNA-disease associations were reviewed and analyzed to understand
their strengths and to detect their weaknesses and problems. Finally, new
computational methods were proposed by combining, integrating, and improving
different types of biological information and different computational methods to
solve the detected weaknesses and problems.
Experimental research
To evaluate the performance of the computational methods proposed in the
dissertation, the author implemented them in the Python programming language
(using the PyCharm IDE, Python libraries, and so on) on different biological
datasets. The prediction performance of the proposed methods was then compared
with that of other related approaches on the same experimental datasets.
Additionally, to support the reliability of the prediction performance, the
author also conducted case studies, checking whether the predicted ncRNA-disease
associations were confirmed in other biological literature or databases.
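As an illustration of the kind of performance measure used in such comparisons, the AUC listed among the evaluation metrics of Chapter 1 can be computed directly from its pairwise-ranking interpretation: the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The sketch below is a toy under stated assumptions (the function name auc_score and the labels and scores are hypothetical) and is not the evaluation code used in the dissertation.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for a known association, 0 otherwise.
    scores: predicted association scores, higher means more likely.
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for five candidate ncRNA-disease pairs.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = auc_score(labels, scores)
print(round(auc, 4))
```

The quadratic pairwise loop is chosen for readability; for large datasets a rank-based computation or a library routine would normally be used instead.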
❖ Thesis’s scientific contributions:
The thesis makes the following scientific contributions:
- Contribution 1: Proposed an improved computational model that combines a
collaborative filtering algorithm and a resource allocation process on a
tripartite graph, using multiple types of known associations, to forecast
ncRNA-disease associations.
- Contribution 2: Proposed a new miRNA-disease association prediction method
that uses the WKNKN algorithm to pre-process data, decreasing the number of
unverified associations in the miRNA-disease association dataset, and uncovers
latent associations between miRNAs and diseases using an improved random walk
with restart (RWR) algorithm and fusing multiple similarities from HINs.
Contribution 1 is presented in Chapter 2 of the dissertation. Related contents
of this contribution were published in the Proceedings of KSE2020 ([VTN1], Scopus
indexed), the BMC Medical Genomics journal in 2021 ([VTN2], ISI Q2 journal), and
the Proceedings of KSE2021 ([VTN3], Scopus indexed).
Contribution 2 is presented in Chapter 3 of the dissertation. Related contents
of this contribution were published in the Scientific Reports journal (ISI Q1
journal) in 2021 [VTN4].
• Thesis’s structure:
The dissertation outline is illustrated in Figure 0.1, which contains Introduction,
three Chapters, and Conclusion and Future Works. Each part of the dissertation is
briefly described as follows:
Introduction
In this section, an overview of heterogeneous information networks and of link
prediction in such networks is introduced first. Next, the importance of
developing computational methods for predicting ncRNA-disease associations,
as well as some limitations in predicting associations between non-coding
RNAs and diseases, is presented. Then, the thesis objective, research problems,
and research questions are identified, and the thesis scope and methodology are
summarized. The scientific contributions of the thesis follow, and finally the
thesis structure is outlined.




Figure 0.1. The dissertation outline
Chapter 1. Background
In this chapter, some fundamentals of HINs, biological systems, and non-coding
RNAs are provided first. Second, the problem of link prediction in HINs and
popular link prediction methods are summarized. Third, an overview of
computational methods for predicting ncRNA-disease associations, along with
their strengths and weaknesses, is given. Based on these strengths and
weaknesses, some research directions of the thesis are drawn. Finally, the
methods and metrics used to evaluate the prediction performance of the models
proposed in the following chapters are presented.
Chapter 2. NcRNA-disease associations prediction with collaborative
filtering and resource allocation process on a tripartite graph



In this chapter, some fundamentals of the CF algorithm and of the resource
allocation process on a tripartite graph are introduced first.
Second, a new computational model for predicting non-coding RNA-disease
associations is proposed, which uses a CF algorithm and a resource allocation
process on a tripartite graph built from multiple types of biological objects.
Finally, the proposed model is applied to two tasks, miRNA-disease association
prediction and lncRNA-disease association prediction, to demonstrate that it
outperforms other related methods. The proposed model can also be considered a
useful tool for inferring ncRNA-disease associations, owing to its high
performance both in inferring potential associations between miRNAs and
diseases and in discovering associations for new diseases (or ncRNAs) without
any known associations.
Chapter 3. MiRNA–disease associations prediction using improved random
walk with restart and integrating multiple similarities

In this chapter, a method named “Predicting miRNA–disease associations using
improved random walk with restart and integrating multiple similarities” is presented.
The proposed method uses the WKNKN algorithm as a pre-processing step to address the
data sparsity issue. It also integrates multiple data sources to increase prediction
reliability. In addition, it adopts an RWR method from the microbe-disease
association prediction field and improves the RWR process to uncover latent
miRNA-disease associations. It can be considered a valuable tool for inferring
miRNA–disease associations.
Conclusion and future works
In this section, the thesis's conclusion and future works are given. Firstly, the
proposed methods of miRNA-disease associations prediction and lncRNA-disease
associations prediction are summarized. Next, the limitations of the proposed
methods are analyzed. And finally, strategies for increasing the reliability of
prediction performance in the future are pointed out.



CHAPTER 1. BACKGROUND
As an essential problem in heterogeneous information networks, link
prediction aims to discover missing links in a network or to predict links
that may appear in the future, where objects are represented by nodes and
interactions among objects are represented by links [7], [21]. Link prediction is
useful to researchers in many fields and has emerged as a significant and active
area of study in recent years. It has a wide range of applications in different domains
such as social networks, biological networks, scientific networks, and so forth. In
systems biology, link prediction plays a crucial role in discovering relationships or
associations among biological objects, which helps us to understand many
biological processes [6], [7], [22]. In particular, link prediction is used to predict
ncRNA-disease associations and provides opportunities for understanding and
analyzing these associations. Inferring and analyzing ncRNA-disease
relationships plays a vital role in understanding disease mechanisms and helps in
diagnosing, curing, and preventing complex human diseases [9]–[11], [23].
As previously mentioned, the objective of this dissertation is to propose
computational methods or models that improve the performance of predicting
human non-coding RNA-disease associations on heterogeneous information
networks. Therefore, in this chapter, some basic concepts are introduced first.
Second, the link prediction problem is stated and link prediction applications are
shown. Third, the computational methods for predicting ncRNA-disease
associations are summarized. Finally, the methods and metrics used to assess the
prediction performance of the methods proposed in this dissertation are presented.
1.1. Basic concepts
As mentioned before, most real-world systems can be treated as
heterogeneous information networks (HINs). Biological systems are a special class
of HINs. Therefore, in this section, some basic concepts related to HINs and
biological systems are introduced.



1.1.1. Heterogeneous information networks


Information network
Most real-world systems are composed of multi-typed
components and a large number of interactions or associations among them, for
example, human social activity systems, communication systems, computer
systems, biological systems, and so forth. Without loss of generality, such systems
can be regarded as information networks. Formally, an information network is defined
as follows:

Definition 1.1. Information network [2]. An information network is
represented by a graph G = (V, E), where V is the set of nodes and E is the
set of edges of the graph. The graph G has an object type mapping function,
ϕ: V → A, and a link type mapping function, ψ: E → R. Each node v ∈ V
belongs to exactly one object type, ϕ(v) ∈ A, and each link e ∈ E belongs to
exactly one link type, ψ(e) ∈ R. Links of the same type share the same starting
object type and the same ending object type.
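
Definition 1.1 can be made concrete with a small data structure. The following is a minimal sketch, not the dissertation's implementation: the class name `InformationNetwork`, its methods, and all node and link labels are hypothetical and chosen for illustration only. As a simplification, it stores at most one link per ordered pair of nodes.

```python
from dataclasses import dataclass, field


@dataclass
class InformationNetwork:
    """Toy information network G = (V, E) with type mappings phi and psi."""

    phi: dict = field(default_factory=dict)  # node -> object type in A
    psi: dict = field(default_factory=dict)  # (source, target) -> link type in R

    def add_node(self, node, object_type):
        # Each node has exactly one object type, so phi is a plain dict.
        self.phi[node] = object_type

    def add_link(self, source, target, link_type):
        # A link may only connect nodes that are already in V.
        if source not in self.phi or target not in self.phi:
            raise KeyError("both endpoints must be added as nodes first")
        self.psi[(source, target)] = link_type

    @property
    def object_types(self):
        """The set A of object types that actually occur in the network."""
        return set(self.phi.values())

    @property
    def link_types(self):
        """The set R of link types that actually occur in the network."""
        return set(self.psi.values())


# Hypothetical nodes and links, for illustration only.
net = InformationNetwork()
net.add_node("miRNA_1", "miRNA")
net.add_node("gene_1", "gene")
net.add_node("disease_1", "disease")
net.add_link("miRNA_1", "gene_1", "miRNA-gene")
net.add_link("gene_1", "disease_1", "gene-disease")
net.add_link("miRNA_1", "disease_1", "miRNA-disease")

print(sorted(net.object_types))
print(sorted(net.link_types))
```

Storing ϕ and ψ as dictionaries enforces the "exactly one type per node and per link" condition of the definition by construction.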


Heterogeneous/Homogeneous information network.
Definition 1.2. Heterogeneous/Homogeneous information network [2]. If
an information network contains more than one object type or more than one link
type, that is, |A| > 1 or |R| > 1, it is called a HIN; otherwise, that is, when
|A| = 1 and |R| = 1, it is called a homogeneous information network. Figure 1.1
illustrates a heterogeneous information network with multiple node types and
multiple link types.

Figure 1.1. An illustration of HIN with multiple node types and multiple link types.
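
The condition in Definition 1.2 reduces to counting distinct types. The sketch below assumes the type mappings are given as plain dictionaries; the function name `is_heterogeneous` and the example nodes are hypothetical.

```python
def is_heterogeneous(phi, psi):
    """Return True when |A| > 1 or |R| > 1, as in Definition 1.2.

    phi maps each node to its object type; psi maps each link,
    given as a (source, target) pair, to its link type.
    """
    object_types = set(phi.values())  # the set A
    link_types = set(psi.values())    # the set R
    return len(object_types) > 1 or len(link_types) > 1


# A homogeneous example: one object type and one link type.
homo_phi = {"user_1": "user", "user_2": "user"}
homo_psi = {("user_1", "user_2"): "friend"}

# A heterogeneous example: two object types.
hetero_phi = {"miRNA_1": "miRNA", "disease_1": "disease"}
hetero_psi = {("miRNA_1", "disease_1"): "miRNA-disease"}

print(is_heterogeneous(homo_phi, homo_psi))
print(is_heterogeneous(hetero_phi, hetero_psi))
```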
• Network schema



Definition 1.3. Network schema [2]. A network schema is denoted by T_G =
(A, R). It is a meta template for a heterogeneous network G = (V, E) with the
object type mapping function ϕ: V → A and the link type mapping function
ψ: E → R, and it is defined over the object types in A and the link types in R.
A HIN’s network schema specifies type constraints on the set of objects and on
the set of relationships among objects. These constraints can be used to guide
the semantic exploration of the network. A heterogeneous network that follows a
network schema is called a network instance of that schema. For a link type R
that connects object type X to object type Y, written $X \xrightarrow{R} Y$,
X and Y are called the source and target object types of R, respectively, and
are denoted by R.X and R.Y. Figure 1.2 illustrates a HIN’s network schema.

Figure 1.2. An illustration of HIN’s network schema.
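
As an illustration of Definition 1.3, the schema of a network instance can be read off by collecting its object types and, for every link type, the source and target object types it connects. This is a sketch under the assumption that the instance is given by dictionaries for ϕ and ψ; the function name `network_schema` and the example data are hypothetical.

```python
def network_schema(phi, psi):
    """Derive T_G = (A, R) from a network instance.

    Each element of R is a triple (link type, source object type,
    target object type), i.e. (R, R.X, R.Y) in the notation above.
    """
    object_types = set(phi.values())
    relations = {
        (link_type, phi[source], phi[target])
        for (source, target), link_type in psi.items()
    }
    return object_types, relations


# Hypothetical network instance, for illustration only.
phi = {
    "miRNA_1": "miRNA",
    "miRNA_2": "miRNA",
    "gene_1": "gene",
    "disease_1": "disease",
}
psi = {
    ("miRNA_1", "disease_1"): "associates",
    ("miRNA_2", "disease_1"): "associates",
    ("miRNA_1", "gene_1"): "targets",
}

A, R = network_schema(phi, psi)
print(sorted(A))
print(sorted(R))
```

Many links of the instance collapse into a single relation of the schema, which is why the schema acts as a template for the network rather than a copy of it.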


Meta path
In a HIN, two objects can be linked through various paths, and each path has a
particular meaning. These paths are known as HIN meta paths.
Definition 1.4. Meta path [2]. A HIN meta path $P$ on schema $T_G = (A, R)$ is
denoted by

$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} A_3 \cdots \xrightarrow{R_l} A_{l+1},$$

which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$
among nodes, where $\circ$ is the composition operator on relations.
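
The composite relation of a meta path can be illustrated by chaining relations and counting the path instances that connect each pair of end nodes. This is a minimal sketch, assuming each relation is stored as a `Counter` over (source, target) pairs; the helper names `compose` and `meta_path_instances` and the example relations are hypothetical.

```python
from collections import Counter


def compose(first, second):
    """Count path instances of the composite relation: first, then second.

    Each relation is a Counter mapping (source, target) to the number of
    path instances between the two nodes (1 for a plain link).
    """
    # Index the second relation by its source node for fast lookup.
    by_source = {}
    for (middle, target), count in second.items():
        by_source.setdefault(middle, []).append((target, count))

    result = Counter()
    for (source, middle), count_first in first.items():
        for target, count_second in by_source.get(middle, []):
            result[(source, target)] += count_first * count_second
    return result


def meta_path_instances(relations):
    """Fold a list [R_1, ..., R_l] into the composite relation."""
    composite = Counter(relations[0])
    for relation in relations[1:]:
        composite = compose(composite, relation)
    return composite


# Hypothetical meta path: miRNA -targets-> gene -associated-> disease.
targets = Counter({
    ("miRNA_1", "gene_1"): 1,
    ("miRNA_1", "gene_2"): 1,
    ("miRNA_2", "gene_2"): 1,
})
associated = Counter({
    ("gene_1", "disease_1"): 1,
    ("gene_2", "disease_1"): 1,
})

paths = meta_path_instances([targets, associated])
print(paths[("miRNA_1", "disease_1")])  # two instances: via gene_1 and gene_2
print(paths[("miRNA_2", "disease_1")])  # one instance: via gene_2
```

The same counts are obtained by multiplying the adjacency matrices of the relations in order; the dictionary form is used here only to keep the example free of external libraries.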



Reasons for heterogeneous information networks
In numerous studies, information networks are treated as
homogeneous networks, in which all nodes have the same object type and all links
have the same relation type. In fact, however, most networks are heterogeneous:
their nodes and links belong to different types. Generally, HINs can be
constructed in any area. For example, in biological networks, nodes can be genes,
proteins, microbes, diseases, miRNAs, lncRNAs, gene expressions, phenotypes, and
so on [24]. Therefore, HINs are powerful and expressive representations of
natural interactions among different object types in diverse domains [25].
An information network can be analyzed in a variety of ways. In homogeneous
information networks, and in social networks in particular, various data mining
tasks including ranking, clustering, link prediction, and influence analysis
have been investigated. However, most methods designed for homogeneous
information networks cannot be applied directly to HINs to mine heterogeneous
data. The reason is that a HIN contains heterogeneous links across objects of
different types and generally holds richer information than a homogeneous
network [25]. In recent years, objects of various types have become increasingly
interconnected, which makes them difficult to model with homogeneous
networks. Therefore, HINs are naturally used to represent different object
types and relationships.
1.1.2. Biological systems
Biological systems are a special class of heterogeneous information networks
that consist of a large number of biological entities such as genes, miRNAs,
lncRNAs, gene expressions, phenotypes, and so forth [7], [22], [26], [27]. Normally,
all biological processes are regulated by molecular entities and their interactions or
associations. Understanding biological processes therefore requires knowledge not only
of the biological entities themselves but also of the relationships among
them. These biological processes are naturally represented by a graph, also called a
network, in which nodes represent biomolecules and links represent interactions or
associations among molecular entities [22], [27]. In other words, networks are
representations of heterogeneous and complex biological systems. The analysis of
interactions or associations among biomolecules plays a crucial role in understanding the
physiology and pathology of various forms of life, including the development of new
drugs and the discovery of disease mechanisms. Therefore, the study of biological
interactions has become a dominant topic in biological networks [26]. In recent years,
the diversity of biological objects and the rapid growth in the amount of biological
data have confronted biological networks with the challenges of big data. Therefore,
HINs are considered powerful tools for dealing with the heterogeneity and complexity
of biological networks [3].
Biological networks are commonly classified into types such as
protein-protein interaction (PPI) networks, gene regulatory networks (GRNs),
metabolic networks, disease networks, and so forth. Depending on the research
purpose, biological networks can also be defined as gene-disease, drug-disease,
drug-disease-gene, lncRNA-disease, or miRNA-disease networks, and so forth [7].
1.1.3. Non-coding RNAs (ncRNAs)
As is known, most of the human genome is transcribed into RNAs. RNAs
are divided into two forms. The first form can encode proteins and
accounts for only approximately 2% of the human genome. The second form,
which accounts for nearly 98% of the human genome, cannot be translated into
proteins. RNAs that cannot be translated into proteins are referred to as
non-coding RNAs (ncRNAs) [11], [23].
Non-coding RNAs can be divided into different types such as tRNAs (transfer
RNAs), rRNAs (ribosomal RNAs), snRNAs (small nuclear RNAs), smcRNAs (small
non-coding RNAs), lncRNAs, miRNAs, and circRNAs. In particular, miRNAs
and lncRNAs are two thoroughly studied types of ncRNAs, because miRNAs
regulate most protein-coding genes, whereas lncRNAs are the most ubiquitously
found ncRNAs in mammals [11].
Micro RNAs (miRNAs)
MiRNAs are a class of small, single-stranded, endogenous, evolutionarily
conserved ncRNAs with a length of 22-26 nucleotides [28], [29].
Long non-coding RNAs (lncRNAs)
LncRNAs are a subclass of ncRNAs that are longer than 200
nucleotides [11].

