MICROARRAY DATA ANALYSIS TOOL (MAT)
A Thesis
Presented to
The Graduate Faculty of The University of Akron
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
Sudarshan Selvaraja
December, 2008
MICROARRAY DATA ANALYSIS TOOL (MAT)
Sudarshan Selvaraja
Thesis
Approved:
_______________________ Advisor, Dr. Zhong-Hui Duan
_______________________ Committee Member, Dr. Yingcai Xiao
_______________________ Committee Member, Dr. Xuan-Hien Dang
Accepted:
_______________________ Department Chair, Dr. Wolfgang Pelz
_______________________ Dean of the College, Dr. Ronald F. Levant
_______________________ Dean of the Graduate School, Dr. George R. Newkome
_______________________ Date
ABSTRACT
Microarray technology is widely used by biologists to probe for the presence of genes in a sample of DNA or RNA. With this technology, oligonucleotide probes can be immobilized on a microarray chip in a massively parallel fashion, allowing biologists to measure the expression levels of thousands of genes simultaneously. This thesis develops a software system that includes a database repository for storing microarray datasets and a microarray data analysis tool for analyzing the stored data. The repository currently accepts datasets in Genepix Pro format, although it can be extended to other formats. The user interface of the repository allows users to conveniently upload data files and perform their preferred data preprocessing and analysis. The analysis methods implemented include the traditional k-nearest neighbor (kNN) method and two new kNN methods developed in this study. Additional analysis methods can be added by future developers. The system was tested using a set of microRNA gene expression data. The design and implementation of the software tool are presented in the thesis along with the testing results from the microRNA dataset. The results indicate that the new weighted kNN method proposed in this study outperforms the traditional kNN method and the proposed mean method. We conclude that the system developed in the thesis effectively provides a structured microarray data repository, a flexible graphical user interface, and rational data mining methods.
ACKNOWLEDGEMENTS
I would like to thank my advisor, Dr. Zhong-Hui Duan, for giving me the opportunity to work on this project for my Master’s thesis. I was motivated to choose this topic after I took the Introduction to Bioinformatics course. I would like to thank her for her invaluable suggestions and steady guidance during the entire course of the project.
I am thankful to my committee members, Dr. Yingcai Xiao and Dr. Xuan-Hien Dang, for their guidance, invaluable suggestions, and time.
I would like to thank my friends Shanth Anand and Prashanth Puliyadi for helping me to pursue my Master’s degree and change my career path. I couldn’t have achieved this without their help.
I would like to thank my friend Manik Dhawan for his guidance in writing and formatting this report.
Finally, I would like to express my gratitude to my parents and all my family members, who were always there for me, cheering me on in all situations, and who took a great interest in my venture.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. INTRODUCTION
    1.1 Introduction to Bioinformatics
    1.2 Introduction to Microarray Technology
        1.2.1 Genepix Experiment Procedural
    1.3 Applications of Microarrays
    1.4 Need for Automated Analysis
    1.5 Knowledge Discovery in Data
        1.5.1 KDD Steps
    1.6 Classification
        1.6.1 General Approach
        1.6.2 Decision Trees
        1.6.3 k-Nearest Neighbor Classifiers
    1.7 Outline of the Current Study
II. LITERATURE REVIEW
    2.1 Previous Work
    2.2 Existing Tools for Normalizing GPR Datasets
    2.3 Stanford Microarray Database (SMD)
    2.4 Microarray Tools
    2.5 Available Source for Microarray Data
III. MATERIALS AND METHODS
    3.1 Database Design
        3.1.1 Schema Design
        3.1.2 Table Details
        3.1.3 Attributes
    3.2 Description of Genepix Data Format
        3.2.1 Features and Blocks
        3.2.2 Sample Dataset
        3.2.3 Transferring Genepix Dataset to Database
    3.3 Data Selection
        3.3.1 Creation of Training and Testing Dataset
    3.4 Preprocessing
        3.4.1 Preprocessing in MAT
    3.5 Normalization
    3.6 Feature Selection
        3.6.1 Student T-Test
        3.6.2 Implementation of T-Test in MAT
    3.7 Classification
        3.7.1 Classical kNN Method
        3.7.2 Weighted kNN Method
        3.7.3 Mean kNN Method
IV. RESULTS AND DISCUSSIONS
    4.1 A Case Study
    4.2 Results
    4.3 Discussion
V. CONCLUSIONS AND FUTURE WORK
    5.1 Conclusion
    5.2 Future Work
REFERENCES
APPENDICES
    APPENDIX A COPYRIGHT PERMISSION FOR FIGURE 1.2
    APPENDIX B PERL SCRIPT FOR T-TEST - TTEST.PL
    APPENDIX C CLASSIFICATION ALGORITHMS
LIST OF TABLES
Table
1.1 Confusion matrix for a 2-class problem
1.2 Software used
2.1 List of microarray tools
2.2 Available source for microarray data
3.1 Tables used in MAT
3.2 Attributes and their description
3.3 List of default choices for feature selection
4.1 Training and testing samples – Experiment 1
4.2 Accuracy of three classification methods for different N features
4.3 Training and testing samples – Experiment 2
4.4 Accuracy of three classification methods for different N features
LIST OF FIGURES
Figure
1.1 Schematic view of a typical microarray experiment
1.2 Genepix experimental procedure
1.3 Overview of KDD process
1.4 Mapping an input attribute set x into its class label y
1.5 A decision tree for the mammal classification problem
1.6 Classifying an unlabeled vertebrate
1.7 Schematic representation of k-NN classifier
1.8 System diagram
1.9 Application flow diagram
2.1 Sketch of the ProGene algorithm
3.1 Database schema
3.2 Genepix_version table design
3.3 Genepix_header table design
3.4 Genepix_sequence table design
3.5 Hypothetical arrays of blocks
3.6 Sample dataset
3.7 Creation of repository
3.8 Selection of datasets
3.9 Temporary table names for training and testing datasets
3.10 Flowchart – Creation of dataset
3.11 Replication of gene
3.12 Sample training dataset with median intensity values
3.13 Preprocessing in MAT
3.14 T-Test formulas
3.15 Calculated p-values for the genes
3.16 Pseudo code of kNN classical method
3.17 Pseudo code of kNN mean method
4.1 Training samples selected for the experiment
4.2 Testing samples selected for the experiment
4.3 Attribute selection and constraint specification for normalization
4.4 Training datasets
4.5 Testing datasets
4.6 Feature selection and normalization
CHAPTER I
INTRODUCTION
1.1 Introduction to Bioinformatics
The central dogma of molecular biology is that DNA (deoxyribonucleic acid) acts as a template to replicate itself, DNA is transcribed to RNA, and RNA is translated into protein. DNA is the genetic material. It answers a question that researchers and scientists have asked for years: “What is the basis of inheritance?” The information stored in DNA allows the organization of inanimate molecules into functioning, living cells and organisms that are able to regulate their internal chemical composition, growth, and reproduction [1]. This is what allows us to inherit our parents’ features, for example their curly hair or the shape of their nose. The units that govern those characteristics at the genetic level are called genes. The term bioinformatics refers to the use of computers to retrieve, process, analyze, and simulate biological information. Bioinformatics has stimulated a large body of research and has proven itself in the diagnosis, classification, and discovery of many factors that lead to disease. Although bioinformatics began with sequence comparison, it now encompasses a wide range of activities in modern scientific research. It requires mathematical, biological, physical, and chemical knowledge, and its implementation further requires knowledge of computer science.
1.2 Introduction to Microarray Technology
A DNA microarray is an orderly arrangement of tens to hundreds of thousands of DNA fragments (probes) of known sequence. It provides a platform for probe hybridization to radioactively or fluorescently labeled cDNAs (targets). The intensity of the radioactive or fluorescent signals generated by the hybridization reveals the level of the cDNAs in the biological samples under study. Figure 1.1 shows the major processes in a typical microarray experiment. Microarray technology has been widely used to investigate gene expression levels on a genome-wide scale [1, 2, 5, 10]. It can be used to identify the genetic changes associated with diseases, drug treatments, or stages in cellular processes such as apoptosis or the cycle of cell growth and division [10]. The scientific tasks involved in analyzing microarray gene expression data include the identification of co-expressed genes, discovery of sample or gene groups with similar expression patterns, study of gene activity patterns under various stress conditions, and identification of genes whose expression patterns are highly discriminative for differentiating between biological samples.
Microarray platforms include Affymetrix GeneChips, which use presynthesized oligonucleotides as probes, and cDNA microarrays, which use full-length cDNAs as probes. The array experiment uses slides or blotting membranes. The spots are typically less than 200 microns in diameter, and an array usually contains thousands of them. The spotted samples are known as probes and can be DNA, cDNA, or oligonucleotides [2]. They are used to determine complementary binding of the unknown sequences, thus allowing parallel analysis for gene expression and gene discovery. An orderly arrangement of probes is important because the location of each spot on the array is used to identify a gene.
Figure 1.1 Schematic view of a typical microarray experiment.
In the current study we use microarray datasets that were generated through cDNA microarray experiments. The arrays were scanned using the Genepix Pro biological kit. The following section explains the experimental procedure used to create the datasets.
1.2.1 Genepix Experiment Procedural
Genepix Pro is an automatic microarray slide scanner. It automatically loads and scans the slides, analyzes them, and saves the results. It can accommodate up to 36 slides. The autoloader accommodates microarrays on microscope slides labeled with up to four fluorescent dyes. These microarrays can contain a few hundred spots or a few thousand spots representing an entire genome.
Figure 1.2 Genepix experimental procedure [Copyright – Appendix A]
When the slide carrier is inserted into the scanner, sensors detect the location of the slides. The software lets the user select the slides to be scanned. A graphical representation of the slides is shown on the screen, which makes it easier for the user to identify and select a slide. The settings for the experiment can be set for each slide or for a group of slides. An automatic analysis option can also be chosen in the software. If an email address is specified in the settings, the results are sent to that address once the experiment is done.
The robotic arm takes the first slide from the slide carrier, scans the bar code on the slide, and positions the slide for scanning. Genepix can be configured with four lasers. A laser power wheel is used to adjust the laser strength for especially bright samples. The laser excitation beam is delivered to the surface of the microarray slide and scans across the short axis of the slide. As the robotic arm slowly moves the slide, the fluorescent signals emitted from the sample are collected by a photomultiplier tube. Sensors detect any non-uniformity in the slide surface, and the robotic arm is used to adjust the focus of the scan. Each channel is scanned sequentially and the developing images are displayed on the monitor. The multichannel TIFF images are saved automatically according to the file naming conventions specified by the user.
Once the scan has been completed, the robotic arm returns the slide to the carrier and repeats the process for the other slides selected from the tray. Genepix automatically finds the spots, calculates up to 108 measures, and saves the results as GPR files. If the experiment is conducted with a single channel, the number of measures is 50; otherwise the number of measures ranges from 50 to 108.
1.3 Applications of Microarrays
Having covered the basic workings of microarrays, we can now explore the different applications of microarray technology.
Gene discovery: Microarray technology helps in the identification of new genes and in learning about their function and expression levels under different conditions.
Disease diagnosis: Microarray technology helps researchers learn more about different diseases such as heart disease, mental illness, infectious disease, and especially cancer. Different types of cancer have been classified on the basis of the organs in which the tumors develop. With the help of microarray technology, researchers will be able to further classify the types of cancer on the basis of the patterns of gene activity in the tumor cells. This will help the pharmaceutical community develop more effective drugs, as treatment strategies can be targeted directly to the specific type of cancer.
Drug discovery: Pharmacogenomics is the study of correlations between therapeutic responses to drugs and the genetic profiles of the patients [2]. Comparative analysis of the genes from a diseased cell and a normal cell helps identify the biochemical constitution of the proteins synthesized by the diseased genes. Researchers can use this information to synthesize drugs that combat these proteins and reduce their effect.
Toxicological research: Microarray technology provides a robust platform for research into the impact of toxins on cells and the passing of those effects on to the progeny [2]. Toxicogenomics establishes correlations between responses to toxicants and the changes in the genetic profiles of the cells exposed to such toxicants [2].
1.4 Need for Automated Analysis
The intrinsic problem of a typical dataset produced by microarrays lies in its sample size and its high dimensionality. The dataset created by Genepix Pro contains various measures for thousands of genes, so analyzing the samples manually is not feasible. In this study we propose a microarray analysis tool (MAT) that is able to accommodate new methods of classification and to find new classes. The tool follows the knowledge discovery in data (KDD) steps, which are explained in detail in the following section.
1.5 Knowledge Discovery in Data
The term knowledge discovery in data (KDD) refers to the overall process of finding knowledge in data, including the application of particular data mining methods. It involves the evaluation and possible interpretation of the discovered patterns, which are regarded as knowledge. The unifying goal of the KDD process is to extract useful information from large databases. An overview of the KDD process is shown in Figure 1.3.
Figure 1.3 Overview of KDD process
1.5.1 KDD Steps
Data selection applies knowledge of the application domain to select the datasets that are relevant to the problem to be solved. The preprocessing step removes unwanted data from the database and finds strategies for handling missing fields in the dataset. Transformation is the process of converting data from one form to another; in this step we find the useful features to represent the data, depending on the goal of the task, and normalize the dataset. In the data mining step we decide on the algorithms suitable for the study. The current study is mainly about classification, and hence a classification algorithm is chosen for implementation in this step. Interpretation and evaluation is the process of creating the model, testing it with the test samples, and calculating the accuracy of its predictions.
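The first three KDD steps can be illustrated with a minimal sketch. It is not the MAT implementation: the function names, the toy sample identifiers, the mean-based filling of missing values, and the min-max normalization are all assumptions chosen only to make the steps concrete.

```python
def select(samples, wanted_ids):
    """Data selection: keep only the samples relevant to the problem."""
    return {sid: samples[sid] for sid in wanted_ids if sid in samples}


def fill_missing(rows):
    """Preprocessing: replace missing values (None) with the column mean.

    Mean imputation is an assumed strategy used here for illustration.
    """
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        col_mean = sum(observed) / len(observed)
        for r in filled:
            if r[j] is None:
                r[j] = col_mean
    return filled


def min_max_normalize(rows):
    """Transformation: rescale every column to the range [0, 1] (assumed scheme)."""
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    return [
        [(r[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
         for j in range(n_cols)]
        for r in rows
    ]


# Hypothetical toy data: sample id -> intensity values for three genes.
raw = {
    "s1": [1.0, None, 10.0],
    "s2": [3.0, 4.0, 30.0],
    "s3": [5.0, 8.0, 20.0],
    "s4": [9.0, 9.0, 9.0],  # not selected below
}
selected = select(raw, ["s1", "s2", "s3"])
rows = min_max_normalize(fill_missing(list(selected.values())))
print(rows)
```

The data mining and evaluation steps would then take `rows` together with the class labels of the samples as input.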
1.6 Classification
Classification is the task of learning a target function f that maps each attribute set
x to one of the predefined class labels y [4].
Input: attribute set (x) -> Classification model -> Output: class label (y)
Figure 1.4 Mapping an input attribute set x into its class label y [4]
The input data for the classification model is a collection of records. Each record
is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute,
designated as the class label. A classification model can also serve as an explanatory tool
to distinguish between objects of different classes.
1.6.1 General Approach
Several approaches can be taken to build a classification model, including decision trees, networks, kNN classifiers, and others. Each approach has a learning algorithm which creates a model based on the input attribute set given. The model generated by the learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before.
The training set consists of records whose class labels are known. The classification model is built using the training set and is then applied to the test data, whose class labels are unknown. The classification model is evaluated using the confusion matrix.
Table 1.1 Confusion matrix for a 2-class problem [4]
                              Predicted class
                              Class = 1     Class = 0
Actual class    Class = 1     $f_{11}$      $f_{10}$
                Class = 0     $f_{01}$      $f_{00}$
Each entry $f_{ij}$ in this table denotes the number of records from class i predicted to be of class j. For instance, $f_{01}$ is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is ($f_{11} + f_{00}$) and the total number of incorrect predictions is ($f_{10} + f_{01}$). Accuracy is calculated using Eq. 1.1 and the error rate is calculated using Eq. 1.2.
$$\text{Accuracy} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}} \qquad (1.1)$$
$$\text{Error rate} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}} \qquad (1.2)$$
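As a small worked example of Eq. 1.1 and Eq. 1.2, the sketch below counts the four confusion matrix entries for a 2-class problem and derives accuracy and error rate from them. The function names and the label lists are hypothetical and serve only as an illustration.

```python
def confusion_counts(actual, predicted):
    """Return (f11, f10, f01, f00); f_ij counts class-i records predicted as class j."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts[(1, 1)], counts[(1, 0)], counts[(0, 1)], counts[(0, 0)]


def accuracy(f11, f10, f01, f00):
    """Eq. 1.1: correct predictions divided by all predictions."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)


def error_rate(f11, f10, f01, f00):
    """Eq. 1.2: incorrect predictions divided by all predictions."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)


# Hypothetical class labels for eight test records.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_counts(actual, predicted)
print(f)               # (3, 1, 1, 3)
print(accuracy(*f))    # 0.75
print(error_rate(*f))  # 0.25
```

Accuracy and error rate always sum to 1, since every record is counted in exactly one of the four entries.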
1.6.2 Decision Trees
In data mining, a decision tree is a predictive model; that is, a mapping from
observations about an item to conclusions about its target value [1]. More descriptive
names for such tree models are classification tree or regression tree. In these tree
structures, leaves represent classifications and branches represent conjunctions of features
that lead to those classifications. The machine learning technique for inducing a decision
tree from data is called decision tree learning or decision trees. The tree has three types of
nodes [4].
• A root node that has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
Body Temperature (root node)
    Warm -> Gives Birth (internal node)
        Yes -> Mammals (leaf node)
        No -> Non-mammals (leaf node)
    Cold -> Non-mammals (leaf node)
Figure 1.5 A decision tree for the mammal classification problem [4].
In the decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics. For example, the root node shown in Figure 1.5 uses the attribute Body Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-blooded creatures, which are mostly birds.
Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node, for which a new test condition is applied, or to a leaf node. The class label associated with the leaf node is then assigned to the record. As an illustration, Figure 1.6 traces the path in the decision tree that is used to predict the class label of a flamingo. The path terminates at a leaf node labeled Non-mammals.
Figure 1.6 Classifying an unlabeled vertebrate [4].
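The traversal described above can be sketched in a few lines. The nested-tuple representation of the tree and the `classify` function are assumptions made for illustration; only the attributes, branch outcomes, and class labels are taken from the mammal classification example.

```python
# Each internal node is (attribute, {outcome: child}); each leaf is a class label string.
TREE = ("Body Temperature", {
    "Warm": ("Gives Birth", {"Yes": "Mammals", "No": "Non-mammals"}),
    "Cold": "Non-mammals",
})


def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


# A flamingo is warm-blooded and does not give birth.
flamingo = {"Body Temperature": "Warm", "Gives Birth": "No"}
print(classify(TREE, flamingo))  # Non-mammals
```

For a cold-blooded record the loop stops after the root test, because the Cold branch leads directly to a leaf.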
1.6.3 k-Nearest Neighbor Classifiers
The k-nearest neighbor (kNN) method is a simple machine learning algorithm that classifies an object based on the training samples in the feature space. The target object is classified by a majority vote of its neighbors, that is, it is assigned to the class to which most of its neighbors belong (Figure 1.7). For the purpose of identifying neighbors, objects are represented by position vectors in a multidimensional feature space. The k training samples whose attributes are most similar to those of the test sample are found; these are considered the nearest neighbors and are used to determine the class label of the test sample. The distance between samples x and y can be calculated using the Euclidean distance (Eq. 1.3), the Manhattan distance (Eq. 1.4), or other distance measures.
Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1.3)$$
Manhattan distance:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \qquad (1.4)$$
where $x_i$ is the expression level of gene i in sample x; $y_i$ is the expression level of gene i in sample y; and n is the number of genes whose expression values are measured.
Figure 1.7 Schematic representation of k-NN classifier
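A minimal sketch of majority-vote kNN classification with the two distance measures of Eq. 1.3 and Eq. 1.4 is given below. It is a generic illustration, not the classification code used in MAT; the function names, the toy expression vectors, and the labels "normal" and "tumor" are hypothetical. Ties in the vote are resolved in favor of the label encountered first among the nearest neighbors, which is an assumption of this sketch.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Eq. 1.3: square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan(x, y):
    """Eq. 1.4: sum of the absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def knn_classify(train, test_sample, k, distance=euclidean):
    """Assign the majority class among the k training samples nearest to test_sample.

    train is a list of (expression_vector, class_label) pairs.
    """
    neighbors = sorted(train, key=lambda item: distance(item[0], test_sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training samples with two gene expression values each.
train = [
    ([1.0, 1.0], "normal"),
    ([1.5, 2.0], "normal"),
    ([2.0, 1.0], "normal"),
    ([6.0, 6.0], "tumor"),
    ([7.0, 5.5], "tumor"),
]

print(knn_classify(train, [2.0, 2.0], k=3))                      # normal
print(knn_classify(train, [6.5, 6.0], k=3, distance=manhattan))  # tumor
```

In the second call the three nearest neighbors are two "tumor" samples and one "normal" sample, so the majority vote yields "tumor".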
1.7 Outline of the Current Study
The objective of this study is to create a database repository to store different microarray datasets and to create a microarray analysis tool (MAT) that can be used for the analysis of gene expression. The tool has been designed to follow the KDD steps. The database repository currently accepts Genepix datasets, although it can be expanded to include other formats. The analysis methods implemented include three different kNN methods: classical kNN, weighted kNN, and mean kNN. The system diagram, application flow diagram, and software used are shown below.
Figure 1.8 System diagram
Components shown in the diagram:
• User Interface (C++): screens for the data mining process
• Database (SQL Server 2005): dynamic scripts for creation of training and testing datasets
• Text files: training and testing datasets, input file for the t-test, cls file for identifying the type of samples
• Perl script: feature selection and classification
• Schema for the text files
Figure 1.9 Application flow diagram