Methods in Molecular Biology 1611

Daisuke Kihara Editor

Protein Function Prediction
Methods and Protocols


METHODS IN MOLECULAR BIOLOGY

Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK


Protein Function Prediction
Methods and Protocols

Edited by


Daisuke Kihara
Department of Biological Sciences and Computer Science
Purdue University
West Lafayette, Indiana, USA


ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-4939-7013-1
ISBN 978-1-4939-7015-5 (eBook)
DOI 10.1007/978-1-4939-7015-5
Library of Congress Control Number: 2017937538
© Springer Science+Business Media LLC 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Printed on acid-free paper
This Humana Press imprint is published by Springer Nature
The registered company is Springer Science+Business Media LLC
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.


Preface
Knowing the function of a protein and understanding how it is carried out are the ultimate
goals of molecular biology and biochemistry. From the early stage of bioinformatics in the
1980s, the development of computational tools to aid in elucidating protein function was a
major focus of the field. Numerous methods have been developed since then. Computationally, protein function can be predicted through similarity searches because similarity
implies homology from an evolutionary standpoint, and also because it indicates that the
proteins have the same physical structures where the function takes place. Thus, based on
this similarity principle, methods were developed to compare global or local sequences and
the structures of proteins. Databases were also developed, which organize function information of proteins and serve as references to be queried against. In this book, well-established sequence- and structure-based tools and databases are introduced, which are
very useful for biology labs. In addition, this book introduces software which addresses
function beyond its conventional meaning, reflecting the diversity of the current active
research field.
This book begins by introducing two sequence-based function prediction methods,
PFP and ESG, in Chapter 1. The chapter also describes a web server, NaviGO, which can
analyze Gene Ontology annotations. Then, Chapters 2, 3, and 4 discuss tools suitable for
the functional analysis of metagenomics data. The tools in these three chapters are based on
sequence database searches faster than conventional homology search methods, a necessity
when processing the large amounts of sequence data which typify metagenome sequences.
Chapter 2 introduces GhostX, which uses a suffix array for fast sequence comparison.
Fun4Me in Chapter 3 is a pipeline that combines protein coding gene detection in query
sequences and a fast sequence database search utilizing a hashing technique. SUPER-FOCUS in Chapter 4 combines fast search algorithms with preclustered reference sequence
databases. Chapter 5 presents MPFit, a program that detects whether a query protein is a
moonlighting protein, i.e., a protein with dual functions.

The next chapter (Chapter 6) describes SignalP, a well-established web server that
predicts subcellular localization by recognizing a signal peptide in a query sequence. Subcellular localization is one of the three functional categories in the Gene Ontology (Cellular
Component), and it can be a clue for other biological functions of a protein since localization and biological function are closely correlated.
The following four chapters deal with protein structures. ProFunc in Chapter 7 is a
popular web server that performs multiple different analyses on a query protein structure,
including global and local structure matching to known proteins. Chapter 8 describes G-LoSA, which finds ligand binding sites similar to a query binding site within a reference
database. eMatchSite, the following chapter (Chapter 9), aligns two ligand binding sites to
quantify similarities between them. In Chapter 10, WATsite2.0 is introduced, which predicts
bound water molecules in a ligand binding site. Water molecules bound to proteins mediate
ligand-protein interactions and are thus important in protein function.
The subsequent five chapters cover resources that address protein function through
pathways, networks, and genomes. Chapter 11 discusses recent updates of KEGG, focusing
on enzymes and pathways. KEGG is one of the most comprehensive databases of pathways,
genomes, and other biomolecules and is a fundamental resource for understanding protein
function at a systems level. Chapter 12 is about the Microbial Genome Database, a valuable
resource to perform comparative genomics. The Saccharomyces Genome Database (SGD) is
described in Chapter 13. S. cerevisiae is one of the most extensively studied organisms. SGD
has long served as a reliable source for protein function and other resources, including gene
expression and phenotypes, in S. cerevisiae. Chapter 14 introduces MouseNet, which predicts gene function in mice from a gene expression network. FANTOM5 in Chapter 15 is a
database of human and mouse genomes. Transcription start sites and promoter activities of
various cells can be browsed and searched. The last chapter (Chapter 16) introduces
Spatiocyte, a software for simulating the diffusion and localization of proteins in a cell.
Results from the simulation, i.e., a phenotype, can be compared against microscope observations. Proteins exhibit their function through dynamic interactions in a cell environment.
Thus, ultimately functions must be considered in a dynamic system, which this software aims
to do.
I hope readers enjoy this book as a practical guide for using bioinformatics tools related
to protein function prediction. Moreover, I also hope that this compilation itself exhibits a
snapshot of the current research field and our understanding of the concept of protein
function, while indicating the future direction of the field.
Editing of this book was greatly aided by Mr. Joshua McGraw, Ms. Sarah Rodenbeck,
Ms. Lenna X. Peterson, and Mr. Charles Christoffer of my research group. I would like to
conclude this preface by recognizing and acknowledging their help as a happy memory of
my research activities.
West Lafayette, IN, USA

Daisuke Kihara


Contents

Preface
Contributors

1 Using PFP and ESG Protein Function Prediction Web Servers
   Qing Wei, Joshua McGraw, Ishita Khan, and Daisuke Kihara
2 GHOSTX: A Fast Sequence Homology Search Tool for Functional Annotation of Metagenomic Data
   Shuji Suzuki, Takashi Ishida, Masahito Ohue, Masanori Kakuta, and Yutaka Akiyama
3 From Gene Annotation to Function Prediction for Metagenomics
   Fatemeh Sharifi and Yuzhen Ye
4 An Agile Functional Analysis of Metagenomic Data Using SUPER-FOCUS
   Genivaldo Gueiros Z. Silva, Fabyano A.C. Lopes, and Robert A. Edwards
5 MPFit: Computational Tool for Predicting Moonlighting Proteins
   Ishita Khan, Joshua McGraw, and Daisuke Kihara
6 Predicting Secretory Proteins with SignalP
   Henrik Nielsen
7 The ProFunc Function Prediction Server
   Roman A. Laskowski
8 G-LoSA for Prediction of Protein-Ligand Binding Sites and Structures
   Hui Sun Lee and Wonpil Im
9 Local Alignment of Ligand Binding Sites in Proteins for Polypharmacology and Drug Repositioning
   Michal Brylinski
10 WATsite2.0 with PyMOL Plugin: Hydration Site Prediction and Visualization
   Ying Yang, Bingjie Hu, and Markus A. Lill
11 Enzyme Annotation and Metabolic Reconstruction Using KEGG
   Minoru Kanehisa
12 Ortholog Identification and Comparative Analysis of Microbial Genomes Using MBGD and RECOG
   Ikuo Uchiyama
13 Exploring Protein Function Using the Saccharomyces Genome Database
   Edith D. Wong
14 Network-Based Gene Function Prediction in Mouse and Other Model Vertebrates Using MouseNet Server
   Eiru Kim and Insuk Lee
15 The FANTOM5 Computation Ecosystem: Genomic Information Hub for Promoters and Active Enhancers
   Imad Abugessaisa, Shuhei Noguchi, Piero Carninci, and Takeya Kasukawa
16 Multi-Algorithm Particle Simulations with Spatiocyte
   Satya N.V. Arjunan and Koichi Takahashi

Index

List of Contributors
IMAD ABUGESSAISA  Division of Genomics Technologies, RIKEN Center for Life Science
Technologies, Yokohama, Kanagawa, Japan
YUTAKA AKIYAMA  Department of Computer Science, School of Computing, Tokyo Institute of
Technology, Tokyo, Japan; Education Academy of Computational Life Sciences (ACLS),
Tokyo Institute of Technology, Yokohama, Japan; Department of Computer Science,
Graduate School of Information Science and Engineering, Tokyo Institute of Technology,
Tokyo, Japan
SATYA N.V. ARJUNAN  Laboratory for Biochemical Simulation, RIKEN Quantitative
Biology Center, Suita, Osaka, Japan
MICHAL BRYLINSKI  Department of Biological Sciences, Louisiana State University, Baton
Rouge, LA, USA; Center for Computation & Technology, Louisiana State University,
Baton Rouge, LA, USA
PIERO CARNINCI  Division of Genomics Technologies, RIKEN Center for Life Science
Technologies, Yokohama, Kanagawa, Japan
ROBERT A. EDWARDS  Computational Science Research Center, San Diego State University,
San Diego, CA, USA; Department of Biology, San Diego State University, San Diego, CA,
USA; Department of Computer Science, San Diego State University, San Diego, CA, USA
BINGJIE HU  Department of Medicinal Chemistry and Molecular Pharmacology, College of
Pharmacy, Purdue University, West Lafayette, IN, USA; Computational ADME, Drug
Disposition, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN, USA
WONPIL IM  Department of Biological Sciences and Bioengineering Program, Lehigh
University, Bethlehem, PA, USA
TAKASHI ISHIDA  Department of Computer Science, School of Computing, Tokyo Institute of
Technology, Tokyo, Japan; Education Academy of Computational Life Sciences (ACLS),
Tokyo Institute of Technology, Yokohama, Japan; Department of Computer Science,
Graduate School of Information Science and Engineering, Tokyo Institute of Technology,
Tokyo, Japan
TAKEYA KASUKAWA  Division of Genomics Technologies, RIKEN Center for Life Science
Technologies, Yokohama, Kanagawa, Japan
MASANORI KAKUTA  Department of Computer Science, Graduate School of Information
Science and Engineering, Tokyo Institute of Technology, Tokyo, Japan
MINORU KANEHISA  Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
ISHITA KHAN  Department of Computer Science, Purdue University, West Lafayette, IN,
USA
DAISUKE KIHARA  Department of Biological Sciences and Computer Science, Purdue
University, West Lafayette, IN, USA
EIRU KIM  Department of Biotechnology, College of Life Science and Biotechnology, Yonsei
University, Seoul, Korea
ROMAN A. LASKOWSKI  European Bioinformatics Institute, Hinxton, Cambridge, UK
HUI SUN LEE  Department of Biological Sciences and Bioengineering Program, Lehigh
University, Bethlehem, PA, USA
INSUK LEE  Department of Biotechnology, College of Life Science and Biotechnology, Yonsei
University, Seoul, Korea
MARKUS A. LILL  Department of Medicinal Chemistry and Molecular Pharmacology, College
of Pharmacy, Purdue University, West Lafayette, IN, USA
JOSHUA MCGRAW  Department of Biological Sciences, Purdue University, West Lafayette,
IN, USA
FABYANO A.C. LOPES  Cellular Biology Department, Universidade de Brasília (UnB),
Brasília, DF, Brazil
HENRIK NIELSEN  Department of Bio and Health Informatics, Technical University
of Denmark, Lyngby, Denmark
SHUHEI NOGUCHI  Division of Genomics Technologies, RIKEN Center for Life Science
Technologies, Yokohama, Kanagawa, Japan
MASAHITO OHUE  Department of Computer Science, Graduate School of Information
Science and Engineering, Tokyo Institute of Technology, Tokyo, Japan; Department of
Computer Science, School of Computing, Tokyo Institute of Technology, Tokyo, Japan
FATEMEH SHARIFI  School of Informatics and Computing, Indiana University, Bloomington,
IN, USA
GENIVALDO GUEIROS Z. SILVA  Computational Science Research Center, San Diego State
University, San Diego, CA, USA
SHUJI SUZUKI  Department of Computer Science, Graduate School of Information Science
and Engineering, Tokyo Institute of Technology, Tokyo, Japan; Education Academy of
Computational Life Sciences (ACLS), Tokyo Institute of Technology, Yokohama, Japan
KOICHI TAKAHASHI  Laboratory for Biochemical Simulation, RIKEN Quantitative Biology
Center, Suita, Osaka, Japan
IKUO UCHIYAMA  Laboratory of Genome Informatics, National Institute for Basic Biology,
National Institutes of Natural Sciences, Okazaki, Aichi, Japan
QING WEI  Department of Computer Science, Purdue University, West Lafayette, IN, USA

EDITH D. WONG  Department of Genetics, Stanford University, Stanford, CA, USA
YING YANG  Department of Medicinal Chemistry and Molecular Pharmacology, College of
Pharmacy, Purdue University, West Lafayette, IN, USA
YUZHEN YE  School of Informatics and Computing, Indiana University, Bloomington, IN,
USA


Chapter 1
Using PFP and ESG Protein Function Prediction Web Servers
Qing Wei, Joshua McGraw, Ishita Khan, and Daisuke Kihara
Abstract
Elucidating biological function of proteins is a fundamental problem in molecular biology and bioinformatics. Conventionally, protein function is annotated based on homology using sequence similarity search
tools such as BLAST and FASTA. These methods perform well when obvious homologs exist for a query
sequence; however, they will not provide any functional information otherwise. As a result, the functions of
many genes in newly sequenced genomes are left unknown, which await functional interpretation. Here, we
introduce two webservers for function prediction methods, which effectively use distantly related sequences
to improve function annotation coverage and accuracy: Protein Function Prediction (PFP) and Extended
Similarity Group (ESG). These two methods have been tested extensively in various benchmark studies and
ranked among the top in community-based assessments for computational function annotation, including
Critical Assessment of Function Annotation (CAFA) in 2010–2011 (CAFA1) and 2013–2014 (CAFA2).
Both servers are equipped with user-friendly visualizations of predicted GO terms, which provide intuitive
illustrations of relationships of predicted GO terms. In addition to PFP and ESG, we also introduce
NaviGO, a server for the interactive analysis of GO annotations of proteins. All the servers are available
at />

Keywords Protein function prediction, Genome annotation, BLAST, Gene Ontology, Automated
function prediction, Sequence analysis

1 Introduction
Functional interpretation of novel proteins is a central problem in molecular biology and
bioinformatics. As genome sequencing and proteomic technologies advance at a striking pace,
an overwhelming amount of sequence data awaits analysis and functional interpretation. Since
performing biological experiments for such purposes does not scale in terms of time, effort,
and expense, automatic function prediction (AFP) methods have been pursued, and their
development has become one of the important problems in bioinformatics. Many AFP algorithms
have been developed in the past years to achieve accurate annotation and wider coverage, with
the aim of replacing the conventional function prediction methods, which use homology as the
source of information [1, 2]. A review by Hawkins & Kihara summarizes several categories of
AFP methods beyond traditional sequence similarity, which leverage sequence, structural,
genomic, cellular and metabolic context-based information [3]. A review by Sael et al. [4]
focuses on AFP methods for non-homologous proteins in the sequence and structure-based
categories.

Daisuke Kihara (ed.), Protein Function Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1611,
DOI 10.1007/978-1-4939-7015-5_1, © Springer Science+Business Media LLC 2017

For the advancement of such computational techniques, it is
very important that there are community-wide efforts for objective
evaluation of prediction accuracy. Among several efforts carried out
in the protein function prediction community in the past, a recent
notable one is CAFA (Critical Assessment of Function Annotation)
[5]. The first round of CAFA was held in 2010–2011 [5], and the
second round, CAFA2, was held in 2013–2014 [6]. CAFA3 is
planned for 2016–2017.
Here, we introduce two publicly available webservers for function prediction methods: Protein Function Prediction (PFP) [7, 8]
and Extended Similarity Group (ESG) [9]. Both webservers take a
list of query sequences and output a list of predicted Gene Ontology (GO) terms [10, 11]. The servers have been maintained over
years and extensively benchmarked in the past [12, 13]. In both
CAFA1 and CAFA2, PFP and ESG were ranked among the top
function prediction methods. In the CAFA1 experiment, ESG was
ranked fourth in the molecular function (MF) GO category among
54 participating groups [5], while PFP did well in all the three
categories in CAFA2 [6]. In an earlier community-based assessment, the function prediction
category of the Critical Assessment of Techniques for Protein Structure Prediction (CASP) held in 2006,
PFP was ranked at the top [14].
PFP and ESG were designed to achieve complementary goals.
PFP aims at large prediction coverage by retrieving annotations
widely, including from weakly similar sequences. ESG, on the
other hand, aims at improving specificity by accumulating the
contributions of GO terms that are predicted consistently in an
iterative search. The interactive webserver of PFP and ESG [15] was developed to assist in sequence-based function prediction and to
enhance the understanding of predicted functions by an effective
visualization of the predictions in a hierarchical GO topology. In
addition, we also describe NaviGO, a newly developed web-based
tool for interactive analysis of GO term annotations of proteins.
All the servers are available at />
2 Function Prediction Algorithms in PFP and ESG
In this section, we briefly explain the main idea of PFP and ESG
algorithms. For more details, please refer to the original papers
[7–9].



2.1 The PFP Algorithm

The PFP algorithm uses PSI-BLAST [1] to obtain sequence hits for
a target sequence and computes the score for GO term fa as follows:
$$ s(f_a) = \sum_{i=1}^{N} \sum_{j=1}^{N_{\mathrm{func}}(i)} \bigl( -\log(E\text{-value}(i)) + b \bigr)\, P(f_a \mid f_j) \tag{1} $$

where N is the number of sequence hits considered among the PSI-BLAST hits; Nfunc(i) is the
number of GO annotations for the sequence hit i; E-value(i) is the PSI-BLAST E-value for the
sequence hit i; fj is the jth annotation of the sequence hit i; and the
constant b takes the value 2 (≈ log10 125) to keep the score positive
when retrieved sequences up to an E-value of 125 are used. The
conditional probabilities P(fa|fj) are used to consider co-occurrence
of GO terms in a single sequence annotation, which are computed
as the ratio of the number of proteins co-annotated with GO terms
fa and fj as compared with ones annotated only with the term fj. To
take into account the hierarchical structure of GO, PFP transfers
the raw score to the parental terms by computing the proportion of
proteins annotated with fa relative to all proteins that belong to the
parental GO term in the database. The score of a GO term computed as the sum of the directly computed score by Eq. 1 and the
ones from the parental propagation is called the raw score.
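
The co-occurrence probabilities P(fa|fj) described above can be illustrated with a minimal Python sketch. It is not the PFP implementation: the function name, the toy annotation table, and the choice of denominator (the number of proteins annotated with fj) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probabilities(annotations):
    """Estimate P(fa | fj) from a mapping protein ID -> GO terms.

    Assumption: P(fa | fj) = (# proteins annotated with both fa and fj)
                             / (# proteins annotated with fj).
    """
    term_count = Counter()  # proteins annotated with each term
    pair_count = Counter()  # proteins annotated with both terms of an ordered pair
    for terms in annotations.values():
        terms = set(terms)
        term_count.update(terms)
        pair_count.update(permutations(sorted(terms), 2))
    return {
        (fa, fj): n_both / term_count[fj]
        for (fa, fj), n_both in pair_count.items()
    }


# Hypothetical toy annotation table (protein IDs and GO terms are placeholders).
toy_annotations = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:B"},
    "P3": {"GO:A", "GO:B"},
    "P4": {"GO:A"},
}
toy_prob = cooccurrence_probabilities(toy_annotations)
print(toy_prob[("GO:A", "GO:B")])  # 2 of the 3 proteins with GO:B also carry GO:A
```

Keys are ordered pairs (fa, fj), so the table is not symmetric in general.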
Compared to the conventional usage of PSI-BLAST that uses a
strict E-value cutoff, e.g., 0.001, for transferring function annotations, the characteristic of PFP is that it collects GO annotations even
from very weakly similar sequences up to an E-value of 125. Individual
weakly similar sequences do not contribute much to a raw score,
but a GO term can accumulate a substantially large score and be
predicted with confidence if the GO term appears in many sequences.
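
The scoring in Eq. 1 can be sketched in a few lines of Python. This is a hedged illustration rather than the server's code: the function name, the data layout, and the example hits are hypothetical, the logarithm is assumed to be base 10 (consistent with b = 2), and a term annotating a hit is assumed to contribute with probability 1 to itself. Parental propagation is not included.

```python
import math


def pfp_direct_scores(hits, cond_prob, b=2.0):
    """Directly computed score of Eq. 1 (no parental propagation).

    hits:      list of (e_value, [GO terms annotating the hit])
    cond_prob: {fj: {fa: P(fa | fj)}} for co-occurring terms (fa != fj)
    """
    scores = {}
    for e_value, terms in hits:
        weight = -math.log10(e_value) + b  # shifted -log(E-value)
        for fj in terms:
            # Assumption: P(fj | fj) = 1, so the annotation itself counts fully.
            scores[fj] = scores.get(fj, 0.0) + weight
            for fa, p in cond_prob.get(fj, {}).items():
                if fa != fj:
                    scores[fa] = scores.get(fa, 0.0) + weight * p
    return scores


# Hypothetical example: 40 weak hits (E-value 10) that all carry the same term.
weak_hits = [(10.0, ["GO:Z"])] * 40
print(pfp_direct_scores(weak_hits, cond_prob={})["GO:Z"])
```

Each weak hit contributes only 1 (= -log10(10) + 2), but the contributions add up across hits, which is the accumulation effect described in the paragraph above.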

2.2 The ESG Algorithm

ESG first runs a PSI-BLAST search from the query sequence Q, which
retrieves N sequence hits (N is “the number of hits per stage”
parameter in the ESG input page as shown in the next section),
S1, S2, ..., SN, with E-values E1, E2, ..., EN, respectively, and then
recursively performs PSI-BLAST searches from these hits. Each
sequence hit in a search is assigned a weight Wi that is computed
as the proportion of the −log(E-value) of the sequence relative to
the sum of the −log(E-value) from all the sequence hits considered
in the search of the same level:
$$ W_i = \frac{-\log(E_i) + b}{\sum_{j=1}^{N} \bigl\{ -\log(E_j) + b \bigr\}} \tag{2} $$

where score –log(Ei) is shifted by a constant value b, which makes
the score a nonnegative value. This weight is assigned for GO terms
annotating the sequence hit and the probability of the GO term fa
annotating the query sequence Q is defined as the sum of weights
of fa that come from sequences annotated with fa:


$$ P_Q^d(f_a) = \sum_{i=1}^{N} W_i \cdot I_{S_i}(f_a) \tag{3} $$

The function I indicates whether the given sequence Si has
annotation fa:

$$ I_{S_i}(f_a) = \begin{cases} 1 & \text{if } S_i \text{ has } f_a \text{ annotation} \\ 0 & \text{otherwise} \end{cases} \tag{4} $$

Fig. 1 Computing the ESG score. (a) For a single-layer search, a score of a function fa is computed as a sum of
the weight of sequences that have fa in their GO annotation. (b) When a two-layer search is performed, a score
comes from a weighted combination of the second level search and the first level search. This figure is
adapted from the original paper of ESG (Chitale, Hawkins, Park, & Kihara, Bioinformatics, 25: 1739–1745,
2009) with permission from the publisher
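
Equations 2 to 4 for a single-level search can be sketched as follows. This is a minimal illustration, not the ESG code: the function names and example hits are hypothetical, and the base-10 logarithm and b = 2 are assumptions, since the text only states that b makes the shifted score nonnegative.

```python
import math


def esg_weights(e_values, b=2.0):
    """Eq. 2: shifted -log(E-value) of each hit, normalized over all hits."""
    shifted = [-math.log10(e) + b for e in e_values]
    total = sum(shifted)
    return [s / total for s in shifted]


def esg_single_level(hits, b=2.0):
    """Eqs. 3 and 4: sum the weights of the hits annotated with each term.

    hits: list of (e_value, set of GO terms annotating the hit)
    """
    weights = esg_weights([e for e, _ in hits], b)
    scores = {}
    for w_i, (_, terms) in zip(weights, hits):
        for fa in set(terms):  # indicator I is 1 only for annotated terms
            scores[fa] = scores.get(fa, 0.0) + w_i
    return scores


# Hypothetical hits: the shifted scores are 10, 6, and 4, so the weights are 0.5, 0.3, 0.2.
example_hits = [
    (1e-8, {"GO:A"}),
    (1e-4, {"GO:A", "GO:B"}),
    (1e-2, {"GO:B"}),
]
print(esg_single_level(example_hits))
```

Because the weights sum to 1, each term's score stays between 0 and 1, and a term shared by all hits receives a score of 1.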
The index d on the left side of Eq. 3 indicates that function
information comes from direct annotations to sequences. Additionally, multilevel exploration (“the number of stages” parameter
in the ESG input page) of the sequence-similarity space (PSI-BLAST) shown in Fig. 1 is performed around the target protein
by sharing the weights between levels using a weight parameter v.
In the second round, each of the sequences S1, S2,. . .SN retrieved in
the first round is in turn used as a query. Suppose sequence Si
obtains Ni sequences by a PSI-BLAST run, each referred to as Sij.
The weights for Sij, Wij can be computed in a similar manner to
Eq. 2. Combining the two levels of searches:


$$ P_Q^d(f_a) = \sum_{i=1}^{N} W_i \cdot P_{S_i}^d(f_a) \tag{5} $$

$$ P_{S_i}^d(f_a) = v \cdot I_{S_i}(f_a) + (1 - v) \cdot \sum_{j=1}^{N_i} W_{ij} \cdot I_{S_{ij}}(f_a) \tag{6} $$

Equation 5 is a variation of Eq. 3, representing that the score of
a GO term fa for the query Q is contributed by sequences retrieved
at the first level (S1 to SN). The weights for GO terms found in the
second level search are computed similarly, where Eq. 2 defines the
weight Wi. Eq. 6 defines the score for fa for sequence Si as a
combination of $I_{S_i}(f_a)$, which is sequence Si’s annotation, and
the second level search. The first and the second terms are weighted
by a factor v. Moreover, the equations can be recursively extended
to multiple levels of searches to explore broader space around the
query sequence. The score for each GO term ranges from 0.0 to
1.0.
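
The two-level combination in Eqs. 5 and 6 can be sketched in Python as below. This is an illustration under stated assumptions, not the ESG implementation: the function names, the data layout, and the values v = 0.5 and b = 2 are hypothetical choices, and the logarithm is assumed to be base 10.

```python
import math


def normalized_weights(e_values, b=2.0):
    """Eq. 2 applied to the hits of one search."""
    shifted = [-math.log10(e) + b for e in e_values]
    total = sum(shifted)
    return [s / total for s in shifted]


def esg_two_level(term, first_level, second_level, v=0.5, b=2.0):
    """Score of one GO term for the query using Eqs. 5 and 6.

    first_level:  list of (hit_id, e_value, set of GO terms) from the query search
    second_level: {hit_id: list of (e_value, set of GO terms)} from searching each hit
    """
    w = normalized_weights([e for _, e, _ in first_level], b)
    score = 0.0
    for w_i, (hit_id, _, terms_i) in zip(w, first_level):
        sub_hits = second_level.get(hit_id, [])
        second = 0.0
        if sub_hits:
            w_ij = normalized_weights([e for e, _ in sub_hits], b)
            second = sum(wj for wj, (_, t) in zip(w_ij, sub_hits) if term in t)
        direct = 1.0 if term in terms_i else 0.0
        p_si = v * direct + (1.0 - v) * second  # Eq. 6
        score += w_i * p_si                     # Eq. 5
    return score


# Hypothetical example with two first-level hits and their second-level hits.
first = [("S1", 1e-8, {"GO:A"}), ("S2", 1e-4, set())]
second = {
    "S1": [(1e-8, {"GO:A"}), (1e-8, set())],
    "S2": [(1e-3, {"GO:A"})],
}
print(esg_two_level("GO:A", first, second))
```

In this example the unannotated hit S2 still contributes to GO:A through its own second-level hit, which shows how the second level widens the search around the query.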

ESG predicts a GO term with a high score if it appears many
times consistently in the multiple searches including the initial
search and the second level searches. In general, the number of
GO terms predicted by ESG is smaller (5–10 GO terms) than PFP
(often over 50 terms), and terms predicted with high scores by ESG
are usually highly accurate.

3 Input and Output of the Servers

3.1 Query Input Page of PFP and ESG

In Subheading 3, we explain how to use the webservers with an
example. PFP and ESG are available on separate server pages, and query
sequences can be submitted to PFP, ESG, or both from the combined submission
page. Detailed instructions for PFP and ESG are provided on their
respective help pages. Both servers may be used without an
account; however, users are encouraged to create one. With an
account, users can automatically keep and
refer to prediction results that have been processed earlier.
PFP and ESG accept query inputs of FASTA formatted protein
sequences. Users may submit sequences separated by line breaks in
the text box titled “Enter Query Sequence(s)” or upload a FASTA
file containing multiple sequences (Fig. 2). To view a sample of the
format, users may click on “Load Sample” to fill the field with an
example sequence. Selecting “Clear” will remove all input
sequences, including uploaded files. Currently, up to 100 sequences
may be submitted at a time to avoid overloading the computer
server by the job queue.

Fig. 2 Query input page of ESG. Query sequences can be pasted in the submission window or a sequence file
can be uploaded. The query page of PFP is essentially the same, except that it does not have the number of
hits and the number of stages parameters
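
Since the servers accept FASTA-formatted input with at most 100 sequences per submission, a larger sequence set has to be split before uploading. The following Python sketch shows one way to do that locally. The function names are hypothetical and nothing here calls the servers; it only prepares text that could be pasted or uploaded.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_batches(records, max_per_job=100):
    """Split records into consecutive batches of at most max_per_job sequences."""
    return [records[i:i + max_per_job] for i in range(0, len(records), max_per_job)]


def to_fasta(records, width=60):
    """Format (header, sequence) tuples back into FASTA text."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Each batch returned by `split_into_batches` can be passed to `to_fasta` and submitted as a separate job.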
For ESG, there are two more parameters that must be entered:
“Number of hits” and “Number of stages.” “Number of hits”
indicates the number of PSI-BLAST hits to be considered at each
level of ESG. The default value of this parameter is set to 10 in our
web server. “Number of stages” indicates the level of searches to be
performed by ESG. The default value for this parameter is chosen as
2. We recommend not changing the “Number of stages” parameter
to a larger value, as the computational time will increase exponentially
and we did not observe an improvement in the benchmark in the
original paper [9]. The “Number of hits” parameter can be
increased if a prediction result with the default value is not satisfactory. For example, we used 50 for this value since it performed well
in the benchmark [9]. However, if the parameter value is
increased from 10 to 50, roughly five times more computational time is required (with the two-stage setting).
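
The scaling of computational time with the two parameters can be estimated by counting PSI-BLAST runs. The sketch below is a rough model, not a measurement: it assumes one search for the query plus one search for every hit retained at each earlier level, with no sharing of results between searches, and the function name is hypothetical.

```python
def psiblast_runs(hits_per_stage, stages):
    """Estimated number of PSI-BLAST searches for a recursive search.

    Level 0 is the query itself; each sequence searched at one level
    contributes hits_per_stage sequences to be searched at the next level.
    """
    return sum(hits_per_stage ** level for level in range(stages))


default_runs = psiblast_runs(10, 2)  # 1 query search + 10 hit searches
larger_runs = psiblast_runs(50, 2)   # 1 query search + 50 hit searches
print(default_runs, larger_runs, larger_runs / default_runs)
```

Under this model, raising "Number of hits" from 10 to 50 with two stages increases the number of searches from 11 to 51, which agrees with the roughly fivefold increase noted above, and each additional stage multiplies the cost by about the number of hits.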
3.2 Output Page with Case Studies

After selecting the submit button at the bottom section of the page,
users will be directed to the job page displaying the status of that
job. The job will be queued and assigned CPU time when available.
You may refresh the page manually to check the status. Average
computational time for PFP and ESG is 40.1 s and 7.5 min [15],
respectively. When the job is completed, clicking on the job ID will
display the predicted GO terms for the query sequences. Below we
explain in detail how the results are presented.

3.2.1 PFP Output Page

The PFP results page shows the input sequences at the top section
followed by the predicted terms for each GO category (Molecular
Function (MF), Biological Process (BP), and Cellular Component
(CC)) that have confidence greater than 5% of the score of the top
hit (Fig. 3). The results page also provides a link to the results in the
XML format, which users may download for further processing.
Selecting “Visualization of Predicted GO terms” will allow users to
view the predicted terms in an interactive GO hierarchy. This tool
allows users to pan and zoom through sub-nodes of related
branches, and nodes are color mapped based on their assigned probability.
Alternatively, users may choose to color the nodes based on the
number of child nodes under predicted terms. Users may choose among
three layouts (tree, radial, and circle) for
visualizing the GO hierarchy, and the visualization offers configurable layouts and
interactive nodes in Cytoscape [16] (Fig. 4).
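
The display cutoff relative to the top hit can be reproduced on downloaded results with a short filter. This is a sketch under assumptions: the function name and the example scores are hypothetical, and the input is taken to be (GO term, score) pairs for a single GO category that the user has already extracted from the results.

```python
def filter_by_top_score(predictions, fraction=0.05):
    """Keep predictions whose score exceeds a fraction of the top score.

    predictions: list of (go_term, score) for one GO category
    Returns the kept predictions sorted by decreasing score.
    """
    if not predictions:
        return []
    top = max(score for _, score in predictions)
    kept = [(term, score) for term, score in predictions if score > fraction * top]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical scores; the last term falls below 5% of the top score.
example = [("GO:term1", 20000.0), ("GO:term2", 1500.0), ("GO:term3", 600.0)]
print(filter_by_top_score(example))
```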
Three links are provided below the visualization redirect links,
which allow users to download static images of the GO hierarchy
visualization. Choosing to download an image renders the SVG
image and generates a figure. At the top of each static image there is also a
link to download the PNG image file. Users may also save the SVG
image by bookmarking the static page for future reference.


Fig. 3 An example of GO terms predicted by PFP, as shown in the PFP output page. The query used is hemF,
oxygen-dependent coproporphyrinogen-III oxidase (UniProt ID: Q87FB2). GO terms are separated by category:
Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Prediction confidence is indicated
by the color of the PFP Score, where red is very high confidence (>20 K) and blue is low confidence (100–500)

Fig. 4 Cytoscape output demonstrating a hierarchical Tree Layout of the PFP prediction. Each node represents
a predicted GO term. Red shades in this figure indicate the prediction confidence

At the bottom section of the output page, the predicted results
are categorized by MF, BP, and CC GO terms including the confidence, term ID, and term description. GO terms are colored red,
orange, green, or black, where red indicates high confidence of
prediction (>70%) and black represents low confidence (<30%).
PFP allows users to trace the origin of the predicted GO terms
through a dropdown list. Since the PFP algorithm often retrieves
GO annotations from distantly related sequences that may not be
obvious homologs, this tool provides useful insights as to how
predictions are computed and the function of the query sequence.
For each predicted GO term, clicking the [+] sign will open a
dropdown list of sequence IDs which contributed toward the
prediction. The contribution of each sequence is shown as the
percentage of the score that originates from similar sequences
(Fig. 5).
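The percentage shown for each contributing sequence can be illustrated with a short sketch. This is not the PFP implementation: the function name, the sequence IDs, and the per-sequence score values below are hypothetical, and the sketch only assumes that the score of a GO term is a sum of per-sequence contributions.

```python
def contribution_percentages(score_by_sequence):
    """Return each sequence's share of the total score, in percent.

    `score_by_sequence` maps a sequence ID to the (hypothetical) portion of
    the GO term score that originates from that sequence.
    """
    total = sum(score_by_sequence.values())
    if total <= 0:
        raise ValueError("total score must be positive")
    shares = {
        seq_id: 100.0 * score / total
        for seq_id, score in score_by_sequence.items()
    }
    # List the largest contributors first, as a dropdown list might.
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


# Made-up contributions for one predicted GO term (illustrative values only).
example_scores = {"seqA": 30.0, "seqB": 50.0, "seqC": 20.0}
shares = contribution_percentages(example_scores)
for seq_id, percent in shares.items():
    print(f"{seq_id}: {percent:.1f}%")
```

Sorting by share puts the sequences that dominate a prediction at the top, which is the information a user needs when judging whether a predicted term rests on one distant hit or on many.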




Qing Wei et al.

Fig. 5 Example of the PFP GO term dropdown box displaying several links to other UniProt proteins that
conferred the prediction, as well as the percent of their contribution. This list is shown for a GO term,
GO:0004109, predicted for a query protein, Q87FB2

As an example, here we discuss the PFP prediction for oxygen-dependent coproporphyrinogen-III oxidase (UniProt ID: Q87FB2)
(Fig. 3). This protein is involved in the first step of the
pathway that synthesizes protoporphyrinogen-IX from coproporphyrinogen-III
during heme biosynthesis. According to the EMBL-EBI
database, this protein is annotated with four MF, four BP, and one CC GO
terms. PFP correctly predicts two of the four MF terms with medium
to high confidence: GO:0004109 (coproporphyrinogen oxidase
activity) and GO:0042803 (protein homodimerization activity). By
expanding the dropdown list of GO:0004109 (coproporphyrinogen
oxidase activity), we can trace the proteins that confer this prediction
(Fig. 5). Proteins in the list, including hemF of Escherichia coli O6:K15:H31
(UniProt ID: Q0TF33) (the protein at the bottom of Fig. 5),
catalyze the aerobic oxidative decarboxylation of the
propionate groups of rings A and B of coproporphyrinogen-III to
yield the vinyl groups in protoporphyrinogen-IX, and thus carry the
annotation GO:0004109.
PFP predicts all four BP terms with very high confidence: GO:0006779 (porphyrin-containing compound
biosynthetic process), GO:0006782 (protoporphyrinogen IX
biosynthetic process), GO:0006783 (heme biosynthetic process),
and GO:0055114 (oxidation-reduction process). Expanding the
dropdown of GO:0006779 (porphyrin-containing compound
biosynthetic process) reveals other hemF proteins, such as that of Escherichia coli O8 (strain IAI1) (UniProt ID: B7M6U5), which support
this prediction. PFP also correctly predicts the only CC term,
GO:0005737 (cytoplasm), with very high confidence (Fig. 3).
3.2.2 ESG Output Page

ESG and PFP display identically formatted results pages. To
understand ESG’s output page, refer to Subheading 3.2.1, PFP
Output Page.



3.3 GO Term Analysis Using NaviGO

In the last section, we introduce NaviGO, a recently developed
web-based tool for Gene Ontology visualization and similarity
quantification, which is useful for understanding the relationships
between predicted GO terms. It is accessible at
http://kiharalab.org/web/navigo.
To enable quantitative analysis of GO terms and gene functions from various aspects, NaviGO lets users compute the similarity of GO terms using six different scoring schemes that incorporate
a variety of information, including GO topological structure,
contextual association, and GO annotation frequency. Four major
functionalities are accessible through tabs on the
top bar of the web site: GO Parents, GO Set, GO Enrichment,
and Protein Set.
In the GO Parents page, users can retrieve the parental GO
terms in the GO hierarchy (Directed Acyclic Graph, DAG) for a list
of query GO terms. The page uses a lite version of GO Visualizer [15] to
help users understand the topological relationships of GO terms in
the GO DAG. Results are rendered in an interactive DAG where
query GO terms are circled with bold black outlines. Additionally,
parental GO terms are listed in the text area below the
visualization.
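Retrieving parental terms amounts to walking parent links upward in a DAG. The following minimal sketch shows the idea on a toy graph. The term names and the `parents_of` mapping are hypothetical, and the sketch does not reflect how NaviGO stores or queries the Gene Ontology.

```python
def all_parents(term, parents_of):
    """Collect every ancestor of `term` by following parent links upward.

    `parents_of` maps a term to the list of its direct parents. Because the
    hierarchy is a DAG, a term may be reached along several paths, so visited
    terms are tracked to avoid repeating work.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy DAG with hypothetical term names: term_d has two direct parents.
toy_parents = {
    "term_d": ["term_b", "term_c"],
    "term_b": ["term_a"],
    "term_c": ["term_a"],
    "term_a": [],
}

ancestors = {t: all_parents(t, toy_parents) for t in ("term_d", "term_b")}
common = ancestors["term_d"] & ancestors["term_b"]
print(sorted(ancestors["term_d"]), sorted(common))
```

Intersecting the ancestor sets of two query terms gives their common parents, which is the kind of information listed alongside pairwise scores in the GO Set page.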
In the GO Set page, the tool computes pairwise GO similarity scores for a list of input GO terms and outputs them in three
formats (Fig. 6): a table, a network graph, and a bubble chart.

Fig. 6 (a) Workflow for NaviGO. Two types of input data are accepted, a set of GO terms or a set of genes with
GO annotations. Similarity of GO terms is computed with six different GO scores including IAS, CAS, and PAS. If
input data is a list of genes, then pairwise similarity scores for each pair of genes are computed. If GO
enrichment analysis is selected, statistical significance of enrichment of GO terms is computed. (b)
Presentation of results in NaviGO. Results are provided in a network view, where similar GO terms or genes
are connected; in a bubble chart, where similarity of GO terms is shown in a 2D plot of multidimensional
scaling; or in a tabulated fashion, where significance of score similarity is indicated by a color scale



In the result table, Resnik’s Similarity, Lin’s Similarity, Relevance
Similarity [17], Co-occurrence Association Score, PubMed Association Score [18], and Interaction Association Score [19] of pairs of
input GO terms are colored based on score cutoffs. Table columns
can be sorted by clicking on the score names in the top row of the table.
Common parents of a pair of GO terms are shown in the last
column, together with a link to the interactive visualization, which
illustrates parental GO terms in the GO DAG. The network
graph format shows an interactive network that summarizes
the GO similarity as clusters, where nodes are GO terms and edges
indicate similarity scores above a user-defined cutoff. The bubble
chart format uses multidimensional scaling [20] to map the similarity into 2D coordinates, and the user can choose the scoring
scheme for either the X or Y coordinate.
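To make the first three scores concrete, the sketch below computes them on a toy DAG using their standard textbook definitions: Resnik's similarity is the information content (IC) of the most informative common ancestor, Lin's similarity normalizes it by the IC of the two terms, and the relevance similarity additionally weights each common ancestor by one minus its annotation probability. The term names, the probabilities, and the function names are hypothetical, and this is not NaviGO's code. The three association scores (CAS, PAS, IAS) are not covered here.

```python
import math


def ancestors_inclusive(term, parents_of):
    """Return `term` together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def similarity_scores(a, b, parents_of, prob):
    """Return (Resnik, Lin, relevance) similarity for terms a and b.

    `prob[t]` is the annotation probability of term t, so that
    IC(t) = -log(prob[t]).
    """
    def ic(term):
        return -math.log(prob[term])

    common = ancestors_inclusive(a, parents_of) & ancestors_inclusive(b, parents_of)
    resnik = max(ic(c) for c in common)
    denom = ic(a) + ic(b)
    if denom == 0:
        # Both terms carry no information (e.g., both are the root).
        return resnik, 0.0, 0.0
    lin = 2.0 * resnik / denom
    relevance = max(2.0 * ic(c) / denom * (1.0 - prob[c]) for c in common)
    return resnik, lin, relevance


# Toy hierarchy and made-up annotation probabilities.
toy_parents = {"leaf_x": ["mid"], "leaf_y": ["mid"], "mid": ["root"], "root": []}
toy_prob = {"root": 1.0, "mid": 0.2, "leaf_x": 0.05, "leaf_y": 0.1}

resnik, lin, rel = similarity_scores("leaf_x", "leaf_y", toy_parents, toy_prob)
print(f"Resnik={resnik:.3f}  Lin={lin:.3f}  Relevance={rel:.3f}")
```

In this toy case the only informative common ancestor is `mid`, so the relevance score equals Lin's score scaled by 1 - 0.2. The root contributes nothing to either score because its IC is zero.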
In the GO Enrichment tab, NaviGO takes the NCBI taxonomy ID of the organism and a list of annotated genes in the
organism, and outputs the enrichment p-value for each unique GO
term in the input annotation. Enriched GO terms are color mapped
in GO Visualizer. Users can also adjust the number of enriched GO
terms to visualize.
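The text does not state which statistical test NaviGO uses for enrichment. As an illustration only, the sketch below assumes a one-sided hypergeometric test, a common choice for GO enrichment, and computes the probability of observing at least the given number of annotated genes in the input list. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: number of genes in the organism (background)
    annotated:  number of background genes annotated with the GO term
    sample:     number of genes in the input list
    hits:       number of input genes annotated with the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probabilities of drawing k annotated genes for k = hits..upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Small made-up example: 10 genes, 5 annotated, 5 sampled, all 5 are hits.
p_all_hits = hypergeom_enrichment_p(10, 5, 5, 5)
p_any = hypergeom_enrichment_p(10, 5, 5, 0)
print(p_all_hits, p_any)
```

In practice one p-value is computed per GO term, so a multiple-testing correction would normally be applied before terms are called enriched. Whether and how NaviGO does this is not described in the text.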
In the Protein Set tab, users can input a list of annotated
proteins, and NaviGO will calculate the functional similarity
between each pair of input proteins using the Funsim score [8, 17]
with the same choice of similarity schemes as in the GO Set tab. The
confidence of similarity predictions is classified into five levels: very
high, high, moderate, low, and the rest. The first four levels indicate that the score is
within the top 1%, 5%, 10%, and 20%, respectively, of the score distribution
of all protein pairs of an arbitrary organism specified by the user.
The upper section of the result page shows an interactive clustering
view based on the protein similarity score (Fig. 6). A user-defined
cutoff value controls the connectivity of edges between nodes,
and scoring schemes can be switched using the bar in the top
right-hand corner of the network panel. The computed analysis
results can also be downloaded as a table in CSV format.
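The five confidence levels can be sketched as a lookup against a background score distribution. The cutoffs (1%, 5%, 10%, 20%) come from the text. The function name, the toy background, and the treatment of ties (a score counts as being within the top fraction occupied by background scores greater than or equal to it) are assumptions made for illustration.

```python
def confidence_level(score, background_scores):
    """Label `score` by where it falls in the background distribution."""
    if not background_scores:
        raise ValueError("background distribution must not be empty")
    # Fraction of background pairs scoring at least as high as the query.
    n_at_least = sum(1 for s in background_scores if s >= score)
    top_fraction = n_at_least / len(background_scores)
    levels = ((0.01, "very high"), (0.05, "high"), (0.10, "moderate"), (0.20, "low"))
    for cutoff, label in levels:
        if top_fraction <= cutoff:
            return label
    return "rest"


# Toy background: 100 protein-pair scores with the values 1..100.
background = list(range(1, 101))
labels = [confidence_level(s, background) for s in (100, 96, 91, 81, 50)]
print(labels)
```

Because the background is taken from all protein pairs of a user-specified organism, the same raw score may receive a different label when a different organism is chosen.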

Acknowledgments
This work was supported partly by the National Institutes of Health
(R01GM097528) and the National Science Foundation (IIS1319551,
DBI1262189, IOS1127027).
References

1. Altschul SF, Madden TL, Schaffer AA, Zhang
J, Zhang Z, Miller W, Lipman DJ (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res 25(17):3389–3402

2. Pearson WR (1990) Rapid and sensitive
sequence comparison with FASTP and
FASTA. Methods Enzymol 183:63–98


3. Hawkins T, Kihara D (2007) Function prediction of uncharacterized proteins. J Bioinforma
Comput Biol 5(1):1–30
4. Sael L, Chitale M, Kihara D (2012) Structure- and sequence-based function prediction for
non-homologous proteins. J Struct Funct
Genom 13(2):111–123. doi:10.1007/s10969-012-9126-6
5. Radivojac P, Clark WT, Oron TR, Schnoes
AM, Wittkop T, Sokolov A, Graim K, Funk
C, Verspoor K, Ben-Hur A, Pandey G, Yunes
JM, Talwalkar AS, Repo S, Souza ML, Piovesan
D, Casadio R, Wang Z, Cheng J, Fang H,
Gough J, Koskinen P, Toronen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan
DWA, Bryson K, Jones DT, Limaye B, Inamdar
H, Datta A, Manjari SK, Joshi R, Chitale M,
Kihara D, Lisewski AM, Erdin S, Venner E,
Lichtarge O, Rentzsch R, Yang H, Romero
AE, Bhat P, Paccanaro A, Hamp T, Kaszner
R, Seemayer S, Vicedo E, Schaefer C, Achten
D, Auer F, Boehm A, Braun T, Hecht M,
Heron M, Honigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer
C, Mahlich Y, Roos M, Bjorne J, Salakoski T,
Wong A, Shatkay H, Gatzmann F, Sommer I,
Wass MN, Sternberg MJE, Skunca N, Supek F,
Bosnjak M, Panov P, Dzeroski S, Smuc T,
Kourmpetis YAI, van Dijk ADJ, Braak CJF,
Zhou Y, Gong Q, Dong X, Tian W, Falda M,
Fontana P, Lavezzo E, Di Camillo B, Toppo S,
Lan L, Djuric N, Guo Y, Vucetic S, Bairoch A,
Linial M, Babbitt PC, Brenner SE, Orengo C,
Rost B, Mooney SD, Friedberg I (2013) A
large-scale evaluation of computational protein
function prediction. Nat Methods 10(3):221–227.
doi:10.1038/nmeth.2340
6. Jiang Y, Ronnen Oron T, Clark WT, Bankapur
AR, D’Andrea D, Lepore R, Funk CS, Kahanda
I, Verspoor KM, Ben-Hur A, Koo E, Penfold-Brown D, Shasha D, Youngs N, Bonneau R,
Lin A, Sahraeian SM, Martelli PL, Profiti G,
Casadio R, Cao R, Zhong Z, Cheng J, Altenhoff A, Skunca N, Dessimoz C, Dogan T,
Hakala K, Kaewphan S, Mehryary F, Salakoski
T, Ginter F, Fang H, Smithers B, Oates M,
Gough J, Törönen P, Koskinen P, Holm L,
Chen C-T, Hsu W-L, Bryson K, Cozzetto D,
Minneci F, Jones DT, Chapman S, Dukka
BKC, Khan IK, Kihara D, Ofer D, Rappoport
N, Stern A, Cibrian-Uhalte E, Denny P, Foulger RE, Hieta R, Legge D, Lovering RC,
Magrane M, Melidoni AN, Mutowo-Meullenet P, Pichler K, Shypitsyna A, Li B,
Zakeri P, ElShal S, Tranchevent L-C, Das S,
Dawson NL, Lee D, Lees JG, Sillitoe I, Bhat
P, Nepusz T, Romero AE, Sasidharan R, Yang
H, Paccanaro A, Gillis J, Sedeño-Cortés AE,
Pavlidis P, Feng S, Cejuela JM, Goldberg T,
Hamp T, Richter L, Salamov A, Gabaldon T,
Marcet-Houben M, Supek F, Gong Q, Ning
W, Zhou Y, Tian W, Falda M, Fontana P,
Lavezzo E, Toppo S, Ferrari C, Giollo M, Piovesan D, Tosatto S, del Pozo A, Fernández JM,
Maietta P, Valencia A, Tress ML, Benso A, Di
Carlo S, Politano G, Savino A, Rehman HU,
Re M, Mesiti M, Valentini G, Bargsten JW, van
Dijk AD, Gemovic B, Glisic S, Perovic V, Veljkovic V, Veljkovic N, Almeida-e-Silva DC, Vencio RZ, Sharan M, Vogel J, Kansakar L, Zhang
S, Vucetic S, Wang Z, Sternberg MJ, Wass MN,
Huntley RP, Martin MJ, O’Donovan C,
Robinson PN, Moreau Y, Tramontano A, Babbitt PC, Brenner SE, Linial M, Orengo CA,
Rost B, Greene CS, Mooney SD, Friedberg I,
Radivojac P (2016) An expanded evaluation of
protein function prediction methods shows an
improvement in accuracy. Genome Biol 17(1):184.
doi:10.1186/s13059-016-1037-6
7. Hawkins T, Luban S, Kihara D (2006)
Enhanced automated function prediction
using distantly related sequences and contextual association by PFP. Protein Sci
15(6):1550–1556. doi:10.1110/ps.062153506

8. Hawkins T, Chitale M, Luban S, Kihara D
(2009) PFP: automated prediction of Gene
Ontology functional annotations with confidence scores using protein sequence data. Proteins
74(3):566–582. doi:10.1002/prot.22172
9. Chitale M, Hawkins T, Park C, Kihara D
(2009) ESG: extended similarity group
method for automated protein function prediction.
Bioinformatics 25(14):1739–1745.
doi:10.1093/bioinformatics/btp309
10. Seok YJ, Sondej M, Badawi P, Lewis MS,
Briggs MC, Jaffe H, Peterkofsky A (1997)
High affinity binding and allosteric regulation
of Escherichia coli glycogen phosphorylase by
the histidine phosphocarrier protein, HPr. J
Biol Chem 272(42):26511–26521
11. D’Ari L, Rabinowitz JC (1991) Purification,
characterization, cloning, and amino acid
sequence of the bifunctional enzyme 5,10-methylenetetrahydrofolate dehydrogenase/
5,10-methenyltetrahydrofolate cyclohydrolase
from Escherichia coli. J Biol Chem
266(35):23953–23958
12. Khan IK, Wei Q, Chapman S, Kc DB, Kihara D
(2015) The PFP and ESG protein function
prediction methods in 2014: effect of database
updates and ensemble approaches. GigaScience
4:43. doi:10.1186/s13742-015-0083-4
13. Chitale M, Khan IK, Kihara D (2013) In-depth
performance evaluation of PFP and ESG
sequence-based function prediction methods
in CAFA 2011 experiment. BMC Bioinform
14(Suppl 3):S2. doi:10.1186/1471-2105-14-S3-S2
14. Lopez G, Rojas A, Tress M, Valencia A (2007)
Assessment of predictions submitted for the
CASP7 function prediction category. Proteins
69(Suppl 8):165–174. doi:10.1002/prot.21651
15. Khan IK, Wei Q, Chitale M, Kihara D (2015)
PFP/ESG: automated protein function prediction servers enhanced with Gene Ontology
visualization tool. Bioinformatics
31(2):271–272. doi:10.1093/bioinformatics/btu646
16. Shannon P, Markiel A, Ozier O, Baliga NS,
Wang JT, Ramage D, Amin N, Schwikowski
B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular
interaction networks. Genome Res
13(11):2498–2504. doi:10.1101/gr.1239303

17. Schlicker A, Domingues FS, Rahnenfuhrer J,
Lengauer T (2006) A new measure for functional similarity of gene products based on
Gene Ontology. BMC Bioinform 7:302.
doi:10.1186/1471-2105-7-302
18. Chitale M, Palakodety S, Kihara D (2011)
Quantification of protein group coherence
and pathway assignment using functional association. BMC Bioinform 12:373.
doi:10.1186/1471-2105-12-373
19. Yerneni S, Khan I, Wei Q, Kihara D (2015)
IAS: interaction specific GO term associations
for predicting protein–protein interaction networks. IEEE/ACM Trans Comput Biol Bioinform. doi:10.1109/TCBB.2015.2476809
20. Sánchez J, Mardia KV, Kent JT, Bibby JM
(1982) Multivariate analysis. Academic Press,
London-New York-Toronto-Sydney-San Francisco 1979. xv, 518 pp., $ 61.00. Biom J
24(5):502–502. doi:10.1002/bimj.4710240520


Chapter 2
GHOSTX: A Fast Sequence Homology Search Tool
for Functional Annotation of Metagenomic Data
Shuji Suzuki, Takashi Ishida, Masahito Ohue, Masanori Kakuta,
and Yutaka Akiyama
Abstract
Metagenomic analysis based on whole genome shotgun sequencing data requires fast protein sequence
homology searches for predicting the function of proteins coded on metagenome short reads. However,
huge amounts of sequence data cause even general homology search analyses using BLASTX to become
difficult in terms of computational cost. GHOSTX is a sequence homology search tool specifically developed for functional annotation of metagenome sequences. The tool is more than 160 times faster than
BLASTX and has sufficient search sensitivity for metagenomic analysis. Using this tool, users can perform
functional annotation of metagenomic data within a short time and infer metabolic pathways within an
environment.
Keywords Metagenomic analysis, Sequence homology search, Whole genome shotgun sequencing,
Functional annotation, Substitution-score matrix


1 Introduction
Metagenomics is the study of the genomes of uncultured microbes
obtained directly from microbial communities in their natural habitats. Such analyses have recently become more popular and important as the throughput of DNA sequencers has increased.
Previously, metagenomic analysis was performed based on 16S
rRNA data obtained from Sanger-sequencing methods, and the
aim was to obtain the phylogenetic profiles of microbial communities from a target environment. However, whole-genome shotgun (WGS) sequencing, carried out using next-generation
sequencing (NGS) technologies, produces huge amounts of metagenomic data. This enables us to uncover an abundance of orthologous groups, i.e., the distribution of gene/protein functions, in
environmental samples. Based on such information, we can infer
metabolic pathways within an environment and compare a metagenomic sample to the others based on its functions or functional

Daisuke Kihara (ed.), Protein Function Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1611,
DOI 10.1007/978-1-4939-7015-5_2, © Springer Science+Business Media LLC 2017
