Big Data Analysis for
Bioinformatics and
Biomedical Discoveries
Preface
Acknowledgments
Editor
Contributors
Section I
Commonly Used Tools for Big Data Analysis
Chapter 1
Linux for Big Data Analysis
Shui Qing Ye and Ding-You Li
Chapter 2
Python for Big Data Analysis
Dmitry N. Grigoryev
Chapter 3
R for Big Data Analysis
Stephen D. Simon
Section II
Next-Generation DNA Sequencing Data Analysis
Chapter 4
Genome-Seq Data Analysis
Min Xiong, Li Qin Zhang, and Shui Qing Ye
Chapter 5
RNA-Seq Data Analysis
Li Qin Zhang, Min Xiong, Daniel P. Heruth, and Shui Qing Ye
Chapter 6
Microbiome-Seq Data Analysis
Daniel P. Heruth, Min Xiong, and Xun Jiang
Chapter 7
miRNA-Seq Data Analysis
Daniel P. Heruth, Min Xiong, and Guang-Liang Bi
Chapter 8
Methylome-Seq Data Analysis
Chengpeng Bi
Chapter 9
ChIP-Seq Data Analysis
Shui Qing Ye, Li Qin Zhang, and Jiancheng Tu
Section III
Integrative and Comprehensive Big Data Analysis
Chapter 10
Integrating Omics Data in Big Data Analysis
Li Qin Zhang, Daniel P. Heruth, and Shui Qing Ye
Chapter 11
Pharmacogenetics and Genomics
Andrea Gaedigk, Katrin Sangkuhl, and Larisa H. Cavallari
Chapter 12
Exploring De-Identified Electronic Health Record Data with i2b2
Mark Hoffman
Chapter 13
Big Data and Drug Discovery
Gerald J. Wyckoff and D. Andrew Skaff
Chapter 14
Literature-Based Knowledge Discovery
Hongfang Liu and Majid Rastegar-Mojarad
Chapter 15
Mitigating High Dimensionality in Big Data
Deendayal Dinakarpandian
Index
We are entering an era of Big Data. Big Data offer both unprecedented opportunities and overwhelming challenges. This book is
intended to provide biologists, biomedical scientists, bioinformaticians,
computer data analysts, and other interested readers with a pragmatic
blueprint to the nuts and bolts of Big Data so that they can more quickly, easily,
and effectively harness the power of Big Data in their ground-breaking
biological discoveries, translational medical research, and personalized
genomic medicine.
Big Data refers to increasingly larger, more diverse, and more complex
data sets that challenge the abilities of traditional or most commonly
used approaches to access, manage, and analyze data effectively. The monumental completion of human genome sequencing ignited the generation of
big biomedical data. With the advent of ever-evolving, cutting-edge, high-throughput omic technologies, we are facing an explosive growth in the
volume of biological and biomedical data. For example, as of June 9, 2015, the Gene Expression
Omnibus held 3,848 transcriptome data sets
derived from 1,423,663 samples.
Big biomedical data come from government-sponsored projects
such as the 1000 Genomes Project, international consortia such as the ENCODE Project, millions of individual investigator-initiated research projects,
and vast pharmaceutical R&D projects. Data management can become a
very complex process, especially when large volumes of data come from
multiple sources and in diverse types, such as images, molecules, phenotypes,
and electronic medical records. These data need to be linked, connected,
and correlated, which will enable researchers to grasp the information that
is supposed to be conveyed by these data. It is evident that these Big Data
with high-volume, high-velocity, and high-variety information provide us
both tremendous opportunities and compelling challenges. By leveraging
the diversity of available molecular and clinical Big Data, biomedical scientists can now gain new unifying global biological insights into human
physiology and the molecular pathogenesis of various human diseases or
conditions at an unprecedented scale and speed; they can also identify
new potential candidate molecules that have a high probability of being
successfully developed into drugs that act on biological targets safely and
effectively. On the other hand, major challenges in using biomedical Big
Data are very real, such as how to become proficient with Big Data analysis
software tools, how to analyze and interpret various next-generation DNA
sequencing data, and how to standardize and integrate various big biomedical data to make global, novel, objective, and data-driven discoveries.
Users of Big Data can be easily “lost in the sheer volume of numbers.”
The objective of this book is in part to contribute to the NIH Big Data to
Knowledge (BD2K) initiative and enable biomedical scientists to capitalize on the Big Data being generated in the omic
age; this goal may be accomplished by enhancing the computational and
quantitative skills of biomedical researchers and by increasing the number
of computationally and quantitatively skilled biomedical trainees.
This book covers many important topics of Big Data analyses in bioinformatics for biomedical discoveries. Section I introduces commonly used
tools and software for Big Data analyses, with chapters on Linux for Big
Data analysis, Python for Big Data analysis, and the R project for Big Data
computing. Section II focuses on next-generation DNA sequencing data
analyses, with chapters on whole-genome-seq data analysis, RNA-seq
data analysis, microbiome-seq data analysis, miRNA-seq data analysis,
methylome-seq data analysis, and ChIP-seq data analysis. Section III discusses comprehensive Big Data analyses of several major areas, with chapters on integrating omics data with Big Data analysis, pharmacogenetics
and genomics, exploring de-identified electronic health record data with
i2b2, Big Data and drug discovery, literature-based knowledge discovery,
and mitigating high dimensionality in Big Data analysis. All chapters in
this book are organized in a consistent and easily understandable format.
Each chapter begins with a theoretical introduction to the subject matter
of the chapter, which is followed by its exemplar applications and data
analysis principles, followed in turn by a step-by-step tutorial to help readers obtain a good theoretical understanding and master related practical applications. Experts in their respective fields have contributed to this
book, writing in plain English. Complex mathematical deductions
and jargon have been avoided or reduced to a minimum. Even a novice,
with little knowledge of computers, can learn Big Data analysis from this
book without difficulty. At the end of each chapter, several original and
authoritative references have been provided, so that more experienced
readers may explore the subject in depth. The intended readership of this
book comprises biologists and biomedical scientists; computer specialists
may find it helpful as well.
I hope this book will help readers demystify, humanize, and foster their
biomedical and biological Big Data analyses. I welcome constructive criticism and suggestions for improvement so that they may be incorporated
in a subsequent edition.
Shui Qing Ye
University of Missouri at Kansas City
I sincerely thank Dr. Sunil Nair, a visionary publisher from
CRC Press/Taylor & Francis Group, for granting us the opportunity to
contribute this book. I also thank Jill J. Jurgensen, senior project coordinator; Alex Edwards, editorial assistant; and Todd Perry, project editor, for
their helpful guidance, genial support, and patient nudges throughout
our writing and publishing process.
I thank all contributing authors for committing their precious time and
effort to pen their valuable chapters and for their gracious tolerance of
my haggling over revisions and deadlines. I am particularly grateful to my
colleagues, Dr. Daniel P. Heruth and Dr. Min Xiong, who have not only
contributed several chapters but also carefully double-checked all next-generation DNA sequencing data analysis pipelines and other
steps presented in the tutorial sections of all chapters.
Finally, I am deeply indebted to my wife, Li Qin Zhang, for standing
beside me throughout my career and editing this book. She has not only
contributed chapters to this book but also shouldered most responsibilities of gourmet cooking, cleaning, washing, and various household chores
while I have been working and writing on weekends, nights, and other
times inconvenient to my family. I have also relished the understanding,
support, and encouragement of my lovely daughter, Yu Min Ye, who is also
a writer, during this endeavor.
Shui Qing Ye, MD, PhD, is the William R. Brown/Missouri endowed chair
in medical genetics and molecular medicine and a tenured full professor
in biomedical and health informatics and pediatrics at the University of
Missouri–Kansas City, Missouri. He is also the director of the Division of
Experimental and Translational Genetics, Department of Pediatrics, and
director of the Core of Omic Research at The Children’s Mercy Hospital.
Dr. Ye completed his medical education at Wuhan University School
of Medicine, Wuhan, China, and earned his PhD from the University of
Chicago Pritzker School of Medicine, Chicago, Illinois. Dr. Ye’s academic
career has progressed from an assistant professorship at Johns Hopkins
University, Baltimore, Maryland, through an associate professorship at
the University of Chicago, to a tenured full professorship at the University
of Missouri at Columbia and his current positions.
Dr. Ye has been engaged in biomedical research for more than 30 years;
he has experience as a principal investigator in NIH-funded R01 or
pharmaceutical company–sponsored research projects as well as a coinvestigator in NIH-funded R01, Specialized Centers of Clinically
Oriented Research (SCCOR), Program Project Grant (PPG), and private
foundation–funded projects. He has served on grant review panels or study sections
of the National Heart, Lung, and Blood Institute (NHLBI)/National Institutes of Health (NIH), Department of Defense, and American Heart
Association. He is currently a member of the American Association for
the Advancement of Science, American Heart Association, and American
Thoracic Society. Dr. Ye has published more than 170 peer-reviewed
research articles, abstracts, reviews, and book chapters, and he has participated in peer review for a number of scientific journals.
Dr. Ye is keen on applying high-throughput genomic and transcriptomic approaches, or Big Data, in his biomedical research. Using direct
DNA sequencing to identify single-nucleotide polymorphisms in patient
DNA samples, his lab was the first to report that a susceptible haplotype and
a protective haplotype in the human pre-B-cell colony-enhancing factor
gene promoter are associated with acute respiratory distress syndrome.
Using DNA microarrays to detect differentially expressed genes,
Dr. Ye’s lab discovered that the pre-B-cell colony-enhancing factor gene
was highly upregulated as a biomarker in acute respiratory distress syndrome. Dr. Ye previously served as the director of the Gene Expression
Profiling Core at the Center of Translational Respiratory Medicine in
Johns Hopkins University School of Medicine and the director of the Molecular
Resource Core in an NIH-funded Program Project Grant on Lung
Endothelial Pathobiology at the University of Chicago Pritzker School
of Medicine. He is currently directing the Core of Omic Research at The
Children’s Mercy Hospital, University of Missouri–Kansas City, which
has conducted exome-seq, RNA-seq, miRNA-seq, and microbiome-seq
using state-of-the-art next-generation DNA sequencing technologies. The
Core is continuously expanding its scope of service on omic research. As editor, Dr.
Ye has also published the book Bioinformatics: A Practical
Approach (CRC Press/Taylor & Francis Group, New York). One of Dr. Ye’s
current and growing research interests is the application of translational
bioinformatics to leverage Big Data to make biological discoveries and
gain new unifying global biological insights, which may lead to the development of new diagnostic and therapeutic targets for human diseases.
Chengpeng Bi
Division of Clinical Pharmacology,
Toxicology, and Therapeutic
The Children’s Mercy Hospital
University of Missouri-Kansas
City School of Medicine
Kansas City, Missouri
Guang-Liang Bi
Department of Neonatology
Nanfang Hospital, Southern
Medical University
Guangzhou, China
Larisa H. Cavallari
Department of Pharmacotherapy
and Translational Research
Center for Pharmacogenomics
University of Florida
Gainesville, Florida
Deendayal Dinakarpandian
Department of Computer
Science and Electrical
University of Missouri-Kansas
City School of Computing and
Kansas City, Missouri
Andrea Gaedigk
Division of Clinical Pharmacology,
Toxicology & Therapeutic
Children’s Mercy Kansas City
Department of Pediatrics
University of Missouri-Kansas
City School of Medicine
Kansas City, Missouri
Dmitry N. Grigoryev
Laboratory of Translational
Studies and Personalized
Moscow Institute of Physics and
Dolgoprudny, Moscow, Russia
Daniel P. Heruth
Division of Experimental and
Translational Genetics
Children’s Mercy Hospitals and
University of Missouri-Kansas
City School of Medicine
Kansas City, Missouri
Mark Hoffman
Department of Biomedical
and Health Informatics and
Department of Pediatrics
Center for Health Insights
University of Missouri-Kansas
City School of Medicine
Kansas City, Missouri
Xun Jiang
Department of Pediatrics, Tangdu
The Fourth Military Medical
Xi’an, Shaanxi, China
Ding-You Li
Division of Gastroenterology
Children’s Mercy Hospitals and
University of Missouri-Kansas
City School of Medicine
Kansas City, Missouri
Hongfang Liu
Biomedical Statistics and
Mayo Clinic
Rochester, Minnesota
Majid Rastegar-Mojarad
Biomedical Statistics and
Mayo Clinic
Rochester, Minnesota
Katrin Sangkuhl
Department of Genetics
Stanford University
Stanford, California
Stephen D. Simon
Department of Biomedical
and Health Informatics
University of Missouri-Kansas City School of Medicine
Kansas City, Missouri
D. Andrew Skaff
Division of Molecular Biology and
University of Missouri-Kansas
City School of Biological
Kansas City, Missouri
Jiancheng Tu
Department of Clinical
Laboratory Medicine
Zhongnan Hospital
Wuhan University School of
Wuhan, China
Gerald J. Wyckoff
Division of Molecular Biology
and Biochemistry
University of Missouri-Kansas
City School of Biological
Kansas City, Missouri
Min Xiong
Division of Experimental and
Translational Genetics
Children’s Mercy Hospitals and
University of Missouri-Kansas
City School of Medicine
Kansas City, Missouri
Li Qin Zhang
Division of Experimental and
Translational Genetics
Children’s Mercy Hospitals and
University of Missouri-Kansas
City School of Medicine
Kansas City, Missouri
Commonly Used Tools for Big Data Analysis
Linux for Big Data Analysis
Shui Qing Ye and Ding-You Li
1.1 Introduction
1.2 Running Basic Linux Commands
1.2.1 Remote Login to Linux Using Secure Shell
1.2.2 Basic Linux Commands
1.2.3 File Access Permission
1.2.4 Linux Text Editors
1.2.5 Keyboard Shortcuts
1.2.6 Write Shell Scripts
1.3 Step-By-Step Tutorial on Next-Generation Sequence Data Analysis by Running Basic Linux Commands
1.3.1 Step 1: Retrieving a Sequencing File
    Locate the File
    Downloading the Short-Read Sequencing File (SRR805877) from NIH GEO Site
    Using the SRA Toolkit to Convert .sra Files into .fastq Files
1.3.2 Step 2: Quality Control of Sequences
    Make a New Directory “Fastqc”
    Run “Fastqc”
1.3.3 Step 3: Mapping Reads to a Reference Genome
    Downloading the Human Genome and Annotation from Illumina iGenomes
    Decompressing .tar.gz Files
    Link Human Annotation and Bowtie Index to the Current Working Directory
    Mapping Reads into Reference Genome
1.3.4 Step 4: Visualizing Data in a Genome Browser
    Go to Human (Homo sapiens) Genome Browser Gateway
    Visualize the File
As biological data sets have grown larger and biological problems have
become more complex, the requirements for computing power have also
grown. Computers that can provide this power generally use the Linux/
Unix operating system. Linux was developed by Linus Benedict Torvalds
when he was a student at the University of Helsinki, Finland, in the early
1990s. Linux is a modular Unix-like computer operating system assembled
under the model of free and open-source software development and distribution. It is the leading operating system on servers and other big iron systems such as mainframe computers and supercomputers. Compared to
the Windows operating system, Linux has the following advantages:
1. Low cost: You don’t need to spend time and money to obtain licenses
since Linux and much of its software come with the GNU General
Public License. GNU is a recursive acronym for “GNU’s Not Unix!”
Additionally, there are large software repositories from which you
can freely download software for almost any task you can think of.
2. Stability: Linux doesn’t need to be rebooted periodically to maintain
performance levels. It doesn’t freeze up or slow down over time due
to memory leaks. Continuous uptimes of hundreds of days (up to a
year or more) are not uncommon.
3. Performance: Linux provides persistent high performance on workstations and on networks. It can handle unusually large numbers
of users simultaneously and can make old computers sufficiently
responsive to be useful again.
4. Network friendliness: Linux has been continuously developed by a
group of programmers over the Internet and therefore has strong