
INSECT MOLECULAR
BIOLOGY AND
BIOCHEMISTRY


Edited by
LAWRENCE I. GILBERT
Department of Biology
University of North Carolina
Chapel Hill, NC

Amsterdam • Boston • Heidelberg • London • New York • Oxford
Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo
Academic Press is an imprint of Elsevier


32 Jamestown Road, London NW1 7BY, UK
225 Wyman Street, Waltham, MA 02451, USA
525 B Street, Suite 1800, San Diego, CA 92101-4495, USA
First edition 2012
Copyright © 2012 Elsevier B.V. All Rights Reserved


No part of this publication may be reproduced, stored in a retrieval system
or transmitted in any form or by any means electronic, mechanical, photocopying,
recording or otherwise without the prior written permission of the publisher
Permissions may be sought directly from Elsevier’s Science & Technology Rights
Department in Oxford, UK: phone (+ 44) (0) 1865 843830; fax (+44) (0) 1865 853333;
email: Alternatively, visit the Science and Technology Books website at
www.elsevierdirect.com/rights for further information
Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons
or property as a matter of products liability, negligence or otherwise, or from any use or
operation of any methods, products, instructions or ideas contained in the material herein.
Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses
and drug dosages should be made
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
ISBN: 978-0-12-384747-8
For information on all Academic Press publications
visit our website at elsevierdirect.com
Typeset by TNQ Books and Journals Pvt Ltd.
www.tnq.co.in

Printed and bound in China


CONTENTS

Preface

Contributors

1 Insect Genomics
Subba R. Palli, Hua Bai, and John Wigginton

2 Insect MicroRNAs: From Molecular Mechanisms to Biological Roles
Xavier Belles, Alexandre S. Cristino, Erica D. Tanaka, Mercedes Rubio, and Maria-Dolors Piulachs

3 Insect Transposable Elements
Zhijian Tu

4 Transposable Elements for Insect Transformation
Alfred M. Handler and David A. O’Brochta

5 Cuticular Proteins
Judith H. Willis, Nikos C. Papandreou, Vassiliki A. Iconomidou, and Stavros J. Hamodrakas

6 Cuticular Sclerotization and Tanning
Svend O. Andersen

7 Chitin Metabolism in Insects
Subbaratnam Muthukrishnan, Hans Merzendorfer, Yasuyuki Arakane, and Karl J. Kramer

8 Insect CYP Genes and P450 Enzymes
René Feyereisen

9 Lipid Transport
Dick J. Van der Horst and Robert O. Ryan

10 Insect Proteases
Michael R. Kanost and Rollie J. Clem

11 Biochemistry and Molecular Biology of Digestion
Walter R. Terra and Clélia Ferreira

12 Programmed Cell Death in Insects
Susan E. Fahrbach, John R. Nambu, and Lawrence M. Schwartz

13 Regulation of Insect Development by TGF-β Signaling
Philip A. Jensen

14 Insect Immunology
Ji Won Park and Bok Luel Lee

15 Molecular and Neural Control of Insect Circadian Rhythms
Yong Zhang and Patrick Emery

Index




PREFACE


In 2005 the seven-volume series “Comprehensive Molecular Insect Science” appeared and summarized research
in many fields of insect science, including one volume on Biochemistry and Molecular Biology. That volume covered
many, but not all, fields, and the newest references were from 2004, with many chapters having 2003 references as the
latest in a particular field. The series did very well and chapters were cited quite frequently, although, because of the
price and the inability to purchase single volumes, the set was purchased mainly by libraries. In 2010 I was approached
by Academic Press to think about bringing two major fields up to date with volumes that could be purchased singly, and
would therefore be available to faculty members, scientists in industry and government, postdoctoral researchers, and
interested graduate students. I chose Insect Molecular Biology and Biochemistry for one volume because of the remarkable
advances that have been made in those fields in the past half dozen years.
With the help of outside advisors in these fields, we decided to revise 10 chapters from the series and select five more
chapters to bring the volume in line with recent advances. Of these five new chapters, two, by Subba Palli and by Xavier
Belles and colleagues, are concerned with techniques and very special molecular mechanisms that influence greatly the
ability of the insect to control its development and homeostasis. Another chapter, by Park and Lee, summarizes in a
sophisticated but very readable way the immunology of insects, a field that has exploded in the past six years and which
was noticeably absent from the Comprehensive series. The other two new chapters are by Yong Zhang and Pat Emery,
who deal with circadian rhythms and behavior at the molecular genetic level, and by Philip Jensen, who reviews the
role of TGF-β in insect development, again mainly at the molecular genetic level. In most cases the main protagonist
is Drosophila melanogaster, but where information is available representative insects from other orders are discussed in
depth. The 10 updated chapters have been revised with care, and in several cases completely rewritten. The authors are
leaders in their research fields, and have worked hard to contribute chapters that they are proud of.
I was mildly surprised that, almost without exception, authors whom I invited to contribute to this volume accepted the
invitation, and I am as proud of this volume as any of the other 26 volumes I have edited in the past half-century. This
volume is splendid, and will be of great help to senior and beginning researchers in the fields covered.
LAWRENCE I. GILBERT
Department of Biology,
University of North Carolina,
Chapel Hill




CONTRIBUTORS

Svend O. Andersen
The Collstrop Foundation, The Royal Danish Academy of Sciences and Letters, Copenhagen, Denmark

Yasuyuki Arakane
Division of Plant Biotechnology, Chonnam National University, Gwangju, South Korea

Hua Bai
Department of Ecology and Evolutionary Biology, Brown University, Providence, RI, USA

Xavier Belles
Instituto de Biología Evolutiva (CSIC-UPF), Barcelona, Spain

Rollie J. Clem
Division of Biology, Kansas State University, Manhattan, KS, USA

Alexandre S. Cristino
Queensland Brain Institute, The University of Queensland, Brisbane St Lucia, Queensland, Australia

Patrick Emery
University of Massachusetts Medical School, Department of Neurobiology, Worcester, MA, USA

Susan E. Fahrbach
Department of Biology, Wake Forest University, Winston-Salem, NC, USA

Clélia Ferreira
University of São Paulo, São Paulo, Brazil

René Feyereisen
INRA Sophia Antipolis, France

Stavros J. Hamodrakas
Department of Cell Biology and Biophysics, Faculty of Biology, University of Athens, Athens, Greece

Alfred M. Handler
USDA, ARS, Center for Medical, Agricultural, and Veterinary Entomology, Gainesville, FL, USA

Vassiliki A. Iconomidou
Department of Cell Biology and Biophysics, Faculty of Biology, University of Athens, Athens, Greece

Philip A. Jensen
Department of Biology, Rocky Mountain College, Billings, MT, USA

Michael R. Kanost
Department of Biochemistry, Kansas State University, Manhattan, KS, USA

Karl J. Kramer
Department of Biochemistry, Kansas State University, and USDA-ARS, Manhattan, KS, USA

Bok Luel Lee
Pusan National University, Busan, Korea

Hans Merzendorfer
University of Osnabrueck, Osnabrueck, Germany

Subbaratnam Muthukrishnan
Department of Biochemistry, Kansas State University, Manhattan, KS, USA

John R. Nambu
Department of Biological Sciences, Charles E. Schmidt College of Science, Florida Atlantic University, Boca Raton, FL, USA

David A. O’Brochta
University of Maryland, Department of Entomology and The Institute for Bioscience and Biotechnology Research, College Park, MD, USA

Subba R. Palli
Department of Entomology, University of Kentucky, Lexington, KY, USA

Nikos C. Papandreou
Department of Cell Biology and Biophysics, Faculty of Biology, University of Athens, Athens, Greece

Ji Won Park
Pusan National University, Busan, Korea

Maria-Dolors Piulachs
Instituto de Biología Evolutiva (CSIC-UPF), Barcelona, Spain

Mercedes Rubio
Instituto de Biología Evolutiva (CSIC-UPF), Barcelona, Spain

Robert O. Ryan
Children’s Hospital Oakland Research Institute, Oakland, CA, USA

Lawrence M. Schwartz
Department of Biology, 221 Morrill Science Center, University of Massachusetts, Amherst, MA, USA

Erica D. Tanaka
Instituto de Biología Evolutiva (CSIC-UPF), Barcelona, Spain

Walter R. Terra
University of São Paulo, São Paulo, Brazil

Zhijian Tu
Department of Biochemistry, Virginia Tech, Blacksburg, VA, USA

Dick J. Van der Horst
Utrecht University, Utrecht, The Netherlands

John Wigginton
Department of Entomology, University of Kentucky, Lexington, KY, USA

Judith H. Willis
Department of Cellular Biology, University of Georgia, Athens, GA, USA

Yong Zhang
University of Massachusetts Medical School, Department of Neurobiology, Worcester, MA, USA


1  Insect Genomics
Subba R. Palli
Department of Entomology, University of Kentucky,
Lexington, KY, USA
Hua Bai
Department of Ecology and Evolutionary Biology,
Brown University, Providence, RI, USA
John Wigginton
Department of Entomology, University of Kentucky,
Lexington, KY, USA
© 2012 Elsevier B.V. All Rights Reserved

Summary
Genomic sequencing has become a routinely used molecular biology tool in many insect science laboratories. In
fact, whole-genome sequences for 22 insects have already been completed, and sequencing of genomes of many
more insects is in progress. This information explosion on gene sequences has led to the development of
bioinformatics and several “omics” disciplines, including proteomics, transcriptomics, metabolomics, and
structural genomics. Considerable progress has already been made by utilizing these technologies to address
long-standing problems in many areas of molecular entomology. Attempts at integrating these independent
approaches into a comprehensive systems biology view or model are just beginning. In this chapter, we provide
a brief overview of insect whole-genome sequencing as well as information on 22 insect genomes and recent
developments in the fields of insect proteomics, transcriptomics, and structural genomics.

1.1. Introduction
1.2. Genome Sequencing
1.2.1. Genome Assembly
1.2.2. Homology Detection
1.2.3. Gene Ontology Annotation
1.2.4. Conserved Domains and Localization Signal Recognition
1.2.5. Fisher’s Exact Test
1.2.6. Sequenced Genomes
1.3. Genome Analysis
1.3.1. Forward and Reverse Genetics
1.3.2. DNA Microarray
1.3.3. Next Generation Sequencing (NGS)
1.3.4. Other Methods
1.4. Proteomics
1.4.1. Sample Protein Labeling and Separation
1.4.2. Enrichment for PTM
1.4.3. Applications of Proteomics
1.5. Structural Genomics
1.5.1. Analysis of Protein–Ligand Interactions
1.5.2. Cytochrome C: A Case Study
1.5.3. Selecting a Template Structure
1.5.4. Target–Template Sequence Alignment
1.5.5. Modeling Suite Choice
1.5.6. Critical Assessment of Protein Structure
1.5.7. Structural Determination
1.6. Metabolomics
1.7. Systems Biology
1.8. Conclusions and Future Prospects

DOI:10.1016/B978-0-12-384747-8.10001-7




1.1. Introduction
Research on insects, especially in the areas of physiology, biochemistry, and molecular biology, has undergone
notable transformations during the past two decades. Completion of the sequencing of the first insect genome,
that of the fruit fly Drosophila melanogaster, in 2000 was followed by a flurry of activities aimed at
sequencing the genomes of several additional insect species. Indeed, genome sequencing has become a routinely
used method in molecular biology laboratories. Initial expectations of genome sequencing were that much could
be learned by simply looking at the genetic code. In practice, insects are too complex for a complete
understanding based on nucleotide sequences alone, and this has led to the realization that insect genome
sequences must be complemented with information on mRNA expression as well as on the proteins they encode.
This has led to the development of a variety of “omics” technologies, including functional genomics,
transcriptomics, proteomics, metabolomics, and others. The vast amount of data generated by these technologies
has led to rapid growth in bioinformatics, a field that focuses on the interpretation of biological data.
Developments in the World Wide Web have allowed the distribution of these “omics” data, along with analysis
tools, to people all over the world. Integrating these data into a holistic view of all the simultaneous
processes occurring within an organism allows complex hypotheses to be developed. Instead of breaking down
interactions into smaller, more easily understandable units, scientists are moving towards creating models
which encompass the totality of an organism’s molecular, physical, and chemical phenomena. This movement,
known as systems biology, focuses on the integration and analysis of all the available data about an entire
biological system, and it aims to paint an authentic and comprehensive portrait of biology.
During the past two decades, research on insects has produced large volumes of information on the genome
sequences of several model insects. Genome sequences allow quantification of mRNAs and proteins, as well as
predictions of protein structure and function. Attempts to integrate these data into systems biology models
are just beginning. While it is difficult to cover all the developments in these disciplines, we will try to
summarize the latest developments in these fields. In the first section of this chapter, insect genome
sequencing and the lessons learned from it will be presented. In the next section, analysis of sequenced
genomes using “omics” and high-throughput sequencing technologies will be summarized. In the third part of
this chapter, an overview of proteomics and structural genomics will be given. A brief overview of insect
systems biology approaches will be presented at the end of this chapter.

[Figure 1 flowchart: genomic DNA; fragment genomic DNA; clone into vector; sequence clones; assemble sequence
into contigs; assemble contigs into scaffolds; map scaffolds to chromosomes; genome map]

Figure 1  The whole-genome shotgun sequencing (WGS)
method begins with isolation of genomic DNA from nuclei
isolated from isogenic lines of insects. The DNA is then
sheared and size-selected. The size-selected DNA is then
ligated to restriction enzyme adaptors and cloned into
plasmid vectors. The plasmid DNA is purified and sequenced.
The sequences are assembled using bioinformatics tools.

1.2. Genome Sequencing
Almost all insect genomes sequenced to date employed the whole-genome shotgun sequencing (WGS) method
(Figure 1). Shotgun genome sequencing begins with isolation of high molecular weight genomic DNA from nuclei
isolated from isogenic lines of insects. The genomic DNA is then randomly sheared, end-polished with Bal31
nuclease and T4 DNA polymerase and, finally, size-selected. The size-selected, sheared DNA is then ligated to
restriction enzyme adaptors such as the BstXI adaptors. The genomic fragments are then inserted into
restriction enzyme-linearized plasmid vectors. The plasmid DNA is purified (generally by the alkaline lysis
plasmid purification method), isolated, sequenced, and assembled using bioinformatics tools. Automated Sanger
sequencing technology has been the main sequencing method used during the past two decades, and most genomes
sequenced to date employed this technology. Sanger sequencing must be distinguished from next generation
sequencing (NGS) technology, which has entered the marketplace during the past four years and is rapidly
changing the approaches used to sequence genomes. Genomes sequenced by NGS technologies will be completed
more quickly and at a lower cost than the first few insect genomes.



1.2.1. Genome Assembly

Genomes and transcriptomes are assembled from shorter
reads that vary in size, depending on the sequencing technology used. Contigs are created from these short reads by
comparing all reads against each other. If sequence identity and overlap length pass a certain threshold value, they
are lumped together into a contig by a program called an
assembler. Many assembly programs are available, which
differ mainly in the details of their implementation and
of the algorithms employed. The most commonly used
assembler programs are: The Institute for Genomic
Research (TIGR) Assembler; the Phrap assembly program developed at the University of Washington; the
Celera Assembler; Arachne, the Broad Institute of MIT
assembler; Phusion, an assembly program developed by
the Sanger Center; and Atlas, an assembly program developed at the Baylor College of Medicine.
The contigs produced by an assembly program are then ordered and oriented along a chromosome using a
variety of additional information. The sizes of the fragments generated by the shotgun process are carefully
controlled to establish a link between the sequence reads generated from the two ends of the same fragment.
In WGS projects, multiple libraries with varying insert sizes are normally sequenced. Additional markers such
as ESTs are also used during the assembly of genome sequences. The ultimate goal of any sequencing project is
to determine the sequence of every chromosome in a genome at single base-pair resolution. Gaps usually remain
within the genome after assembly is completed. These gaps are filled in through directed sequencing
experiments using DNA from a variety of sources, including clones isolated from libraries, direct PCR
amplification, and other methods.
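
The overlap-and-merge idea behind contig building can be illustrated with a toy example. The following is a minimal sketch, assuming error-free reads from a single strand and exact suffix/prefix overlaps; the function names and the `min_overlap` parameter are hypothetical, and real assemblers such as those listed above use far more elaborate algorithms that tolerate sequencing errors and repeats.

```python
from itertools import permutations


def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Longest suffix of `left` that exactly matches a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 5) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Toy model only: assumes error-free, same-strand reads (hypothetical
    simplification, not how production assemblers work).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_pair = 0, None
        for i, j in permutations(range(len(contigs)), 2):
            size = overlap_length(contigs[i], contigs[j], min_overlap)
            if size > best_size:
                best_size, best_pair = size, (i, j)
        if best_pair is None:  # nothing left that passes the threshold
            return contigs
        i, j = best_pair
        merged = contigs[i] + contigs[j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)


if __name__ == "__main__":
    # First line of the example sequence shown in Figure 1, cut into three
    # overlapping "reads" and supplied out of order.
    source = "ctgagcgggtcggcgcgttcgtccgtcatatacggcaag"
    reads = [source[20:], source[:18], source[10:28]]
    print(greedy_assemble(reads))
```

Raising `min_overlap` makes merging more conservative, which mirrors the threshold on identity and overlap length described above.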
1.2.2. Homology Detection

After assembly, sequences representing the genome or
transcriptome are analyzed for functional interpretation
by comparing them with known homologous sequences.
Proteins typically carry out the cellular functions encoded
in the genome. Protein coding sequences, in the form of
open reading frames (ORFs), must first be distinguished
from other sequences or those that encode other types
of RNA. Transcriptome analysis is simplified by the fact
that the sequenced mRNAs have already been processed
for intron removal in the cell. Distinguishing the correct
ORF where translation occurs, from 5′ and 3′ untranslated
regions, is easily accomplished by a blast search against a
protein database, or possibly by selecting the longest ORF.
Finding genes in eukaryotic genomes is more complex,
and presents a unique set of challenges.
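
The "longest ORF" heuristic mentioned above for processed transcripts can be sketched as follows. This is an illustration only, assuming the standard start codon ATG and the three standard stop codons; the function names are hypothetical, and the sketch ignores partial ORFs that run off the end of a sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq: str) -> str:
    """Return the longest ATG...stop ORF (stop codon included) on either strand.

    Scans all three reading frames of both strands; returns "" if no
    complete ORF is found.
    """
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None  # position of the first ATG of the current ORF
            for pos in range(frame, len(strand) - 2, 3):
                codon = strand[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:pos + 3]
                    if len(orf) > len(best):
                        best = orf
                    start = None
    return best
```

In practice the longest ORF is only a first guess; as the text notes, a blast search against a protein database is the more reliable way to confirm the reading frame.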
1.2.2.1. Genomic ORF detection  Detection of ORFs is more complex in eukaryotes than in prokaryotes due to
the presence of alternative splicing, poorly understood promoter sequences, and the under-representation of
protein coding segments relative to the whole genome. If transcriptome data are available, a number of
programs exist to map these sequences back to an organism’s genome (Langmead et al., 2009; Clement et al.,
2010). This strategy is especially useful when analyzing non-model organisms, or for projects that lack the
manpower of worldwide genome sequencing consortia. In this manner a large number of transcripts can
potentially be identified, along with their regulatory and promoter sequences, and information on gene synteny.
De novo gene prediction algorithms often use Hidden Markov Models or other statistical methods to recognize ORFs, which are significantly longer than might
be expected by chance. These algorithms also search for
sequences containing start and stop codons, polyA tails,
promoter sequences, and other characteristics indicative
of protein coding segments (Burge and Karlin, 1997).
De novo gene discovery is partially dependent on the
organism used, since compositional differences such as GC
content and codon frequency introduce bias, which must
be considered for each organism. Artificial intelligence
algorithms can be trained to recognize these differences
when a sufficient number of protein coding sequences
are available. These may originate from transcriptome
sequencing, or more traditional approaches such as PCR
amplification and Sanger sequencing of mRNAs. Based
on a small sample proportion of known genes, artificial
intelligence programs can learn the codon bias and splice
sites, for example, and extrapolate these findings to the rest
of the genome. However, this process is often inaccurate
(Korf, 2004).
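
The compositional features mentioned above, GC content and codon frequency, are simple to tabulate from a set of known coding sequences. The sketch below is illustrative only: the function names are hypothetical, and it shows just the counting step that a trained gene predictor would build on, not a gene prediction method.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_frequencies(coding_sequences: list[str]) -> dict[str, float]:
    """Relative frequency of each codon across in-frame coding sequences.

    Each sequence is assumed to start in frame; trailing bases that do not
    fill a complete codon are ignored.
    """
    counts: Counter = Counter()
    for cds in coding_sequences:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3
        counts.update(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}
```

Tables like these differ between organisms, which is why a predictor trained on one species may perform poorly on another.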
Comparative genomics is the process of comparing newly sequenced genomes to better-curated reference
genomes. Two highly related species will likely have well conserved protein coding sequences with similar
order along a chromosome. The contigs or scaffolds from a newly assembled genome can be mapped to the
reference, or the shorter reads can be mapped and assembled in a hybrid approach. Programs that perform this
task may often be used to map transcriptome data to a genome, since the two approaches are mechanistically
similar.
1.2.2.2. Transcriptome gene annotation  By definition, mRNA represents protein coding sequences, and
finding the correct ORF requires only a blast search. However, ribosomal RNA (rRNA) may represent more
than 99% of cellular RNA content. The presence of rRNA may be detrimental to the assembly process
because stretches of mRNA may overlap, and thus cause erroneously assembled RNA amalgams. Strategies to
reduce the amount of sequenced rRNA include mRNA purification and rRNA removal. Oligo(dT)-based strategies,
such as the Promega PolyATract mRNA isolation kit, use oligo(dT) sequences that bind to the poly(A) tail
of mRNA. The oligo(dT) tract is linked to a purification tag, such as biotin, which binds to
streptavidin-coated magnetic beads. The beads can be captured, allowing the non-polyadenylated RNA to be
washed away. The Invitrogen Ribominus kit uses a similar principle, except that oligo sequences complementary
to conserved portions of rRNA allow it to be subtracted from total RNA.
During RNA amplification, oligo(dT) primers may be used to increase the proportion of mRNA to total
RNA. This process may introduce bias near the 3′ end of mRNA, and thus protocols have been developed to
normalize the representation of 5′, 3′, and middle segments of mRNA (Meyer et al., 2009). If the rRNA
sequence has already been determined, many assembly programs can be supplied with a filter file of rRNA and
other detrimental contaminant sequences, such as common vectors, which will be excluded from the assembly
process.
1.2.2.3. Homology detection  Annotation is the step of linking sequences with their functional relevance.
Since protein homology is the best predictor of function, the NCBI blastx algorithm (Altschul et al., 1990)
is a good place to start in predicting homology and thus function. The blastx algorithm translates sequences
in all six possible reading frames and compares them against a database of protein sequences.
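
The six-frame translation that precedes a blastx comparison can be sketched in a few lines. This is an illustration of the translation step only, assuming the standard genetic code; the function names are hypothetical and the database comparison itself is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate from the first base; '*' marks stops, 'X' unknown codons."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict[str, str]:
    """Translations of frames +1..+3 (forward) and -1..-3 (reverse strand)."""
    forward = seq.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

Because only one of the six frames normally encodes the real protein, the frame that produces the significant protein hit also identifies the correct open reading frame.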
For less technically inclined users, the blastx algorithm may be most easily implemented in Windows-based
programs such as Blast2GO (Conesa et al., 2005; Conesa and Gotz, 2008). Blast2GO offers
a comprehensive suite of tools for blasting and advanced functional annotation. However, relying on the NCBI
server to perform blast steps often introduces a substantial bottleneck between the server and the querying
computer. Local blast searches, performed on the end user’s computer(s), may significantly reduce annotation time.
The blast program suite and associated databases may be
downloaded for local blast searches (.
nih.gov/blast/executables/blast+/LATEST/). The NCBI
non-redundant protein database is quite large and time-consuming to search. Meyer et al. (2009) advocate a
local approach in which sequences are first queried against the smaller, better curated Swiss-Prot database,
and then sequences with no match are blasted against the NR protein database. Faster algorithms such as
AB-Blast (previously known as WU-Blast) may also speed up the blasting process. After a blastx search,
sequences may be compared to other nucleotide sequences (blastn), or translated and compared to a translated
sequence to help identify unigenes, or unique sequences. However, blastx is the first choice, since the amino
acid sequence is more conserved than the nucleotide sequence. This step will also yield the correct open
reading frame of a sequence. In some cases, homologous relationships may be discovered using blastn and
tblastn where blastx did not find them. The expectation value (also called the e value), which is the number
of hits with an equal or better score expected to arise by chance in a database of the size searched, is an
important consideration in blasting, because setting the e value cutoff too high (too permissive) may create
false relationships, while setting it too low (too strict) may exclude real ones. As sequence length
increases, the probability of finding significant blast hits also increases. In practice, blasting with a
permissive e value cutoff and a small minimum overlap length initially, and then filtering the results based
on the distribution of hits obtained, may be beneficial.
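
The "search permissively, then filter" strategy can be sketched as a post-processing step on BLAST tabular output. The example assumes the default 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, e value, bit score); the function names, thresholds, and the three example rows are invented for illustration and are not real search results.

```python
def parse_hits(lines):
    """Parse tab-separated BLAST rows (default 12-column tabular layout)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        hits.append({
            "query": fields[0],
            "subject": fields[1],
            "length": int(fields[3]),
            "evalue": float(fields[10]),
            "bitscore": float(fields[11]),
        })
    return hits


def best_hits(hits, max_evalue, min_length):
    """Keep, per query, the highest-scoring hit that passes both cutoffs."""
    best = {}
    for hit in hits:
        if hit["evalue"] > max_evalue or hit["length"] < min_length:
            continue
        current = best.get(hit["query"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["query"]] = hit
    return best


# Invented rows for illustration only (not real BLAST output).
EXAMPLE_ROWS = [
    "contig1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-80\t290.0",
    "contig1\tprotB\t40.0\t60\t36\t0\t10\t190\t5\t64\t0.5\t30.0",
    "contig2\tprotC\t35.0\t25\t16\t0\t1\t75\t1\t25\t2.0\t22.0",
]

if __name__ == "__main__":
    hits = parse_hits(EXAMPLE_ROWS)
    for cutoff in (10.0, 1e-5):
        kept = best_hits(hits, max_evalue=cutoff, min_length=30)
        print(cutoff, sorted(kept))
```

Keeping the raw permissive results and re-filtering them is cheap, whereas repeating the search with a new cutoff is not, which is the practical reason for filtering after the search.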
1.2.3. Gene Ontology Annotation

Gene Ontology (GO) provides a structured and controlled vocabulary to describe cellular phenomena in terms of
biological processes, molecular function, and subcellular localization. These terms do not directly describe
the gene or protein; rather, they describe phenomena, and if there is sufficient evidence that the product of
a gene, a protein, is involved in a phenomenon, then the probability increases that a homologous protein is
also involved (Ashburner et al., 2000).
For example, GO analysis of the molecular functions of the Drosophila melanogaster protein Tango indicates
that it is a transcription factor which heterodimerizes with other proteins, binds to specific DNA elements,
and recruits RNA polymerase. The evidence shows what types of experiments or analyses were performed to
determine the function. GO evidence codes record whether an annotation was inferred experimentally (from
assays, mutant phenotypes, genetic interactions, or expression patterns) or computationally (from sequence,
sequence model, or sequence or structural similarity). The biological process information shows that Tango is
involved in brain, organ, muscle, and neuron development. The cellular component information indicates that
Tango’s subcellular localization is primarily nuclear. Gene Ontology annotation programs often allow the user
to set evidence code weights manually. For example, evidence inferred from direct experiments may provide more
confidence than evidence inferred from computational analysis which has been manually curated. Uncurated
computational evidence may carry the lowest confidence level. Tango and its human homolog, the Aryl
Hydrocarbon Receptor Nuclear Translocator (ARNT), are both well-studied proteins. However, when using the
Tribolium castaneum sequence, for example, a good GO mapping algorithm must decide how to report the more
relevant information on Tango without losing pertinent information about the better studied ARNT.
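
The effect of evidence code weights can be illustrated with a small scoring sketch. The evidence codes used (IDA, IMP, ISS, IEA) are standard GO codes, but the weights, the scoring rule (percent similarity multiplied by weight), the threshold, and the example annotations are all assumptions made for illustration; they are not the rule used by any particular annotation program.

```python
# Hypothetical weights: experimental codes count fully, curated sequence
# similarity less, uncurated electronic annotation least.
EVIDENCE_WEIGHTS = {"IDA": 1.0, "IMP": 1.0, "ISS": 0.75, "IEA": 0.5}


def score_terms(annotations, weights=EVIDENCE_WEIGHTS, default_weight=0.5):
    """Score each GO term as the best (similarity x evidence weight) seen.

    `annotations` is an iterable of (go_term, evidence_code, percent_similarity)
    tuples gathered from the blast hits of one query sequence.
    """
    scores = {}
    for term, code, similarity in annotations:
        score = similarity * weights.get(code, default_weight)
        scores[term] = max(scores.get(term, 0.0), score)
    return scores


def select_terms(annotations, threshold=50.0):
    """Return the GO terms whose best score reaches the threshold, sorted."""
    scores = score_terms(annotations)
    return sorted(term for term, score in scores.items() if score >= threshold)


# Invented annotations for one query (illustration only).
EXAMPLE_ANNOTATIONS = [
    ("GO:0003700", "IDA", 60.0),  # experimental evidence, moderate similarity
    ("GO:0003700", "IEA", 90.0),  # electronic evidence, high similarity
    ("GO:0005634", "IEA", 90.0),
]
```

With these assumed weights, a term supported by direct experimental evidence from a moderately similar hit can outrank a term supported only by electronic annotation from a closer hit, which is the trade-off the text describes.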
Gene ontology mapping works well when a well-studied homologous protein is available and the blast e value is
low enough to provide statistical confidence in the evolutionary relatedness and conservation of function
between two proteins. In our example, the user now has a wealth of information about T. castaneum Tango
function, and can design primers for qRT-PCR, RNAi, or protein expression, or link function to the mRNAs
whose levels may have changed between two treatment groups in a transcriptome expression survey such as
microarray analysis.
Enzyme codes are a numerical classification for reactions that are catalyzed by enzymes, given by the
Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) in
consultation with the IUPAC-IUBMB Joint Commission on Biochemical Nomenclature (JCBN). Enzyme codes can be
inferred from GO relationships.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database of enzymatic, biochemical, and signaling
pathways that also maps a variety of other data. KEGG is an integrated database resource consisting of
systems, genomic, and chemical information (Kanehisa and Goto, 2000; Kanehisa et al., 2006). The KEGG pathway
database consists of hand-drawn maps for cell signaling and communication, ligand receptor interactions, and
metabolic pathways gathered from the literature. Figure 2 shows the pathway for D. melanogaster hormone
biosynthesis annotated in KEGG. The information in this database could help in interpretation of data from
genome analysis employing “omics” methods.

Domain detection algorithms do not require an absolute paralog to predict function, but often use multiple
sequence alignments and Hidden Markov Models based
on a number of homologous proteins that share common domains. Examples include SMART (Schultz et€al.,
1998), PFAM (Finn et€ al., 2010), and the NCBI Conserved Domain Database (CDD) (Marchler-Bauer et€al.,
2002). Some databases, such as SCOP (Lo Conte et€al.,
2002), CATH (Martin et€ al., 1998), and DALI (Holm
and Rosenstrom, 2010), focus on structural relationships
and evolution. These databases group and classify protein
folds based on their structural and evolutionary relatedness. Domain recognition programs have strengths and
weaknesses depending on their focus, algorithm implementation, and the database used. Interproscan (�Zdobnov
and Apweiler, 2001) is a direct or indirect gateway to
the majority of these programs and the information they
can reveal. Interproscan may be accessed on the web, or
through the Blast2GO program suite. Other programs

accessed via Interproscan allow the identification of localization signals (i.e., nuclear localization signals), transmembrane spanning domains, sites for post-translational
modifications, sequence repeats, intrinsically disordered
regions, and many more.

1.2.4. Conserved Domains and Localization Signal Recognition

Conserved domains often act as modular functional units and can be useful in predicting a protein’s function.

Figure 2  The pathway for D. melanogaster hormone biosynthesis annotated in the Kyoto Encyclopedia of Genes and Genomes (KEGG). Reproduced from KEGG database (www.genome.jp/dbget-bin/www_bget?pathway+map00981). [KEGG map 00981, "Insect hormone biosynthesis", with juvenile hormone and molting hormone branches; (c) Kanehisa Laboratories.]

1.2.5. Fisher’s Exact Test

Differences between two treatment groups in the expression levels of gene products that are involved in particular GO categories or KEGG signaling pathways, or that belong to particular domain/protein families, can indicate the physiologic effects of the treatment and the mechanisms that are ultimately responsible for changes in phenotypes. Changes in mRNA expression must be tested for statistical significance to ensure that differences between treatments are not simply the result of sampling a variable population. Fisher’s Exact Test calculates a p-value that corresponds to the probability that a functional group would be over-represented to the observed extent by chance alone. A low p-value might indicate that the over-represented functional groups share some regulatory mechanism which was perturbed by the treatment.
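As a rough illustration of the test described above, the one-sided (over-representation) p-value can be computed from the hypergeometric distribution. The following is a minimal sketch using only the Python standard library (Python 3.8 or later for math.comb); the function name, the argument names, and the gene counts in the usage lines are hypothetical and are not taken from any dataset or software package discussed in this chapter.

```python
from math import comb


def fisher_enrichment_p(hits_in_sample, sample_size, hits_in_population, population_size):
    """One-sided Fisher's exact test for over-representation.

    hits_in_sample      -- differentially expressed genes that belong to the functional group
    sample_size         -- all differentially expressed genes
    hits_in_population  -- all genes in the genome that belong to the functional group
    population_size     -- all genes in the genome (the background)

    Returns the probability of seeing at least `hits_in_sample` group members
    in a random sample of `sample_size` genes (hypergeometric upper tail).
    """
    largest_possible = min(sample_size, hits_in_population)
    misses_in_population = population_size - hits_in_population
    # Number of ways to draw the sample with i group members, summed over i >= observed.
    tail = sum(
        comb(hits_in_population, i) * comb(misses_in_population, sample_size - i)
        for i in range(hits_in_sample, largest_possible + 1)
    )
    return tail / comb(population_size, sample_size)


# Hypothetical usage: 200 genes change expression after a treatment, 8 of them
# belong to a functional group that has 90 members in a 13,600-gene background.
p_value = fisher_enrichment_p(8, 200, 90, 13600)
print(f"p = {p_value:.3g}")
```

When many functional groups are tested in this way, the resulting p-values would normally also be corrected for multiple testing.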
1.2.6. Sequenced Genomes

Table 1 lists some sequenced genomes.

Table 1  List of Sequenced Genomes

Common name               Scientific name           Genome size (Mb)  Number of genes predicted  Reference
Beetle, Red flour         Tribolium castaneum       160               16404                      Richards et al., 2008
Fruit fly                 Drosophila ananassae      176               15276                      Drosophila 12 Genome Consortium, 2007
Fruit fly                 Drosophila erecta         134               15324                      Drosophila 12 Genome Consortium, 2007
Fruit fly                 Drosophila grimshawi      138               15270                      Drosophila 12 Genome Consortium, 2007
Fruit fly                 Drosophila melanogaster   120               13600                      Adams et al., 2000
Fruit fly                 Drosophila mojavensis     161               14849                      Drosophila 12 Genome Consortium, 2007
Fruit fly                 Drosophila persimilis     138               17325                      Drosophila 12 Genome Consortium, 2007
Fruit fly                 Drosophila pseudoobscura  127               16363                      Richards et al., 2005
Fruit fly                 Drosophila sechellia      115               16884                      Drosophila 12 Genome Consortium, 2007
Fruit fly                 Drosophila simulans       111               15983                      Drosophila 12 Genome Consortium, 2007
Fruit fly                 Drosophila virilis        172               14680                      Drosophila 12 Genome Consortium, 2007
Fruit fly                 Drosophila willistoni     187               15816                      Drosophila 12 Genome Consortium, 2007
Fruit fly                 Drosophila yakuba         127               16423                      Drosophila 12 Genome Consortium, 2007
Honey bee                 Apis mellifera            236               10157                      The Honey Bee Genome Consortium, 2006
Louse, body               Pediculus humanus         108               10773                      Kirkness et al., 2010
Malaria mosquito          Anopheles gambiae         278               14000                      Holt et al., 2002
Yellow fever mosquito     Aedes aegypti             1380              15419                      Nene et al., 2007
Southern house mosquito   Culex quinquefasciatus    579               18883                      Arensburger et al., 2010
Pea aphid                 Acyrthosiphon pisum       464               10249                      The Pea Aphid Genome Consortium, 2010
Wasp, parasitoid          Nasonia vitripennis       240               17279                      Werren et al., 2010
                          Nasonia giraulti
                          Nasonia longicornis
Silkworm                  Bombyx mori               432               14623                      The International Silkworm Genome Consortium, 2008

Fruit fly, Drosophila melanogaster. The D. melanogaster sequencing project used several types of sequencing strategies, including sequencing of individual clones and sequencing of genomic libraries with three insert sizes (Adams et al., 2000). A portion of the D. melanogaster genome corresponding to approximately 120 megabases of euchromatin was assembled. This assembled genomic sequence contained 13,600 predicted genes. Some of the proteins encoded by these predicted genes showed high similarity with vertebrate homologs involved in processes such as replication, chromosome segregation, and iron metabolism. About 700 transcription factors have been identified based on their sequence similarity with those reported from other organisms. Half of these transcription factors are zinc-finger proteins, and 100 of them contain homeodomains. Genome sequencing identified 22 additional homeodomain-containing proteins and 4 additional nuclear receptors. Nuclear receptors are sequence-specific, ligand-dependent transcription factors that function as both transcriptional activators and repressors, and which regulate many physiological and metabolic processes. The D. melanogaster genome encodes 20 nuclear receptor proteins. General translation factors identified in other sequenced genomes are also present in the D. melanogaster genome. Interestingly, the D. melanogaster genome contains six genes encoding proteins highly similar to the messenger RNA (mRNA) cap-binding protein, eIF4E, suggesting that there may be an added level of complexity to the regulation of cap-dependent translation in the fruit fly. The cytochrome P450 monooxygenases (P450s) are a large superfamily of proteins that are involved in the synthesis or degradation of hormones and pheromones, as well as the metabolism of natural and synthetic toxins and insecticides (Feyereisen, 2006; see also Chapter 8 in this volume). Eighty-six genes coding for P450 enzymes and four P450 pseudogenes were identified in the D. melanogaster genome. About 20% of the proteins encoded by the D. melanogaster genome are likely targeted to cellular membranes, since they contain four or more hydrophobic helices. The largest families of membrane proteins are sugar permeases, mitochondrial carrier proteins, and the ATP-binding cassette (ABC) transporters, encoded by 97, 38, and 48 genes, respectively. Among the proteins involved in biosynthetic networks, 31 triacylglycerol lipases (which are involved in lipolysis and in energy storage and redistribution) and 32 uridine diphosphate (UDP) glycosyl transferases (which participate in the production of sterol glycosides and in the biodegradation of hydrophobic compounds) are encoded by the D. melanogaster genome. One additional ferritin gene and two additional transferrin genes have been identified by genome sequencing.
In 2005, Richards and colleagues published the genome of a second Drosophila species, Drosophila pseudoobscura (Richards et al., 2005). In 2007, the Drosophila 12 Genome Consortium completed the sequencing of 10 additional Drosophila genomes: D. sechellia, D. simulans, D. yakuba, D. erecta, D. ananassae, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi (Drosophila 12 Genome Consortium, 2007). Comparative analysis of sequences from these 10 genomes and the 2 genomes published earlier (D. melanogaster and D. pseudoobscura) identified many changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. Many characteristics of the genomes, such as the overall size, the total number of genes, the distribution of transposable element classes, and the patterns of codon usage, are well conserved among these 12 genomes. Interestingly, a number of genes coding for proteins involved in environmental interactions and reproduction showed rapid change. In these 12 genomes, microRNA genes are more conserved than the protein-coding genes (see Chapter 2 in this volume). Genome-wide alignments of the 12 Drosophila species resulted in the prediction and refinement of thousands of protein-coding exons, genes coding for RNAs such as miRNAs, transcriptional regulatory motifs, and functional regulatory regions (Stark et al., 2007). For more information on the comparative analysis of the 12 Drosophila genomes, the reader is directed to Ashburner’s excellent preface article (Ashburner, 2007).
Malaria mosquito, Anopheles gambiae. 278 Mb of genome sequence from An. gambiae was obtained by the WGS method (Holt et al., 2002). About 10-fold coverage of the genome sequence was achieved. The size of the assembled An. gambiae genome is larger than that of D. melanogaster (120 Mb). About 14,000 predicted genes were identified in the assembled genome sequence. When compared to the D. melanogaster genome, the An. gambiae genome contained 100 additional serine proteases, which are central effectors of innate immunity and other proteolytic processes (see Chapters 10 and 14 in this volume). The presence of additional serine proteases in An. gambiae may be due to differences in feeding behavior, as well as its intimate interactions with both vertebrate hosts and parasites. Also, 36 additional proteins containing fibrinogen domains (carbohydrate-binding lectins that participate in the first line of defense against pathogens by activating the complement pathway in association with serine proteases) and 24 additional cadherin domain-containing proteins were found in An. gambiae. Most of the genes coding for transcription factors (the C2H2 zinc-finger, POZ, Myb-like, basic helix–loop–helix, and homeodomain-containing proteins) reported from other sequenced genomes are also present in the An. gambiae genome. An over-representation of the MYND domain was observed in the An. gambiae genome. This domain is predominantly found in chromatin proteins, which are believed to mediate transcriptional repression.
Genes coding for proteins involved in the visual system, structural components of the cell adhesion and contractile machinery, and energy-generating glycolytic enzymes that are required for active food seeking are present in higher numbers in the An. gambiae genome than in the D. melanogaster genome. Genes coding for salivary gland components, as well as anabolic and catabolic enzymes involved in protein and lipid metabolism, are over-represented in the An. gambiae genome. Genes coding for proteins involved in insecticide resistance, such as transporters and detoxification enzymes, were also found in higher numbers in the An. gambiae genome than in the D. melanogaster genome.
Red flour beetle, Tribolium castaneum. The 160-Mb T. castaneum genome sequence was obtained by WGS, and contained 16,404 predicted genes (Richards et al., 2008). The T. castaneum genome showed expansions in odorant and gustatory receptors, as well as P450s and other detoxification enzyme families (see also Chapter 7 in this volume). In addition, the T. castaneum genome contained more ancestral genes involved in cell–cell communication than other insect genomes sequenced to date. RNA interference is systemic in T. castaneum, and therefore works very well. The SID-1 multi-transmembrane protein involved in double-stranded RNA (dsRNA) uptake in C. elegans was not found in D. melanogaster. However, three genes that encode proteins similar to SID-1 were found in the T. castaneum genome. The expansions of odorant receptors, CYP proteins, proteinases, diuretic hormones, a vasopressin hormone and receptor, and chemoreceptors suggest adaptations that allowed T. castaneum to become a serious pest of stored grain.
Honeybee, Apis mellifera. The 236-Mb A. mellifera genome was assembled based on 1.8 Gb of sequence obtained by WGS (The Honey Bee Genome Consortium, 2006). A total of 10,157 potential genes were identified in the assembled genome sequence. Genes coding for most of the highly conserved cell signaling pathways are present in the A. mellifera genome. Seventy-four genes coding for 96 homeobox domains were identified in the A. mellifera genome. When compared to the D. melanogaster genome, the A. mellifera genome contained more genes coding for odorant receptors and proteins involved in nectar and pollen utilization. This genome also showed fewer genes coding for proteins involved in innate immunity, detoxification enzymes, cuticle-forming proteins, and gustatory receptors.
Parasitoid wasps, Nasonia vitripennis, N. giraulti, and N. longicornis. 240 Mb of the N. vitripennis genome was assembled from sequences obtained by the Sanger sequencing method (Werren et al., 2010). Sequences from two sibling species, N. giraulti and N. longicornis, were completed with one-fold Sanger and 12-fold, 45 base-pair (bp) Illumina genome coverage. The assembled genome sequence contained 17,279 predicted genes. About 60% of Nasonia genes code for proteins showing high similarity with human proteins, 18% of the genes code for proteins showing similarity with other arthropod homologs, and about 2.4% of Nasonia genes code for proteins similar to those in A. mellifera, which could therefore be Hymenoptera-specific. About 12% of genes code for proteins that showed no similarity with known proteins, and therefore may be Nasonia-specific.
Body louse, Pediculus humanus humanus. 108 Mb of the P. h. humanus genome was assembled from 1.3 million paired-end reads from plasmid libraries obtained by WGS (Kirkness et al., 2010). The body louse has the smallest genome size of all the insect genomes sequenced so far. The assembled genome contained 10,773 protein-coding genes and 57 microRNAs. Compared with other insect genomes, the body-louse genome contains significantly fewer genes associated with environmental sensing and response. These genes include those encoding odorant and gustatory receptors and detoxifying enzymes. Only 104 non-sensory G protein-coupled receptors (GPCRs) and 3 opsins were identified in the P. h. humanus genome. This insect has the smallest repertoire of GPCRs identified in any sequenced insect genome to date. Only 10 odorant receptors were detected in the P. h. humanus genome. Only 37 genes in the P. h. humanus genome encode P450s. Despite its smaller size, the P. h. humanus genome contains homologs of all 20 nuclear receptors identified in the D. melanogaster genome.
Pea aphid, Acyrthosiphon pisum. The 464-Mb genome of A. pisum was assembled from 4.4 million Sanger sequencing reads (The Pea Aphid Genome Consortium, 2010). Analysis of the A. pisum genome showed extensive gene duplication events. As a result, the aphid genome appears to have more genes than any of the previously sequenced insects. Genes coding for proteins involved in chromatin modification, miRNA synthesis, and sugar transport are over-represented in the A. pisum genome when compared with other insect genomes sequenced to date. About 20% of the predicted genes in the A. pisum genome code for proteins with no significant similarity to other known proteins. Proteins involved in amino acid and purine metabolism are encoded by both host and symbiont genomes at different enzymatic steps. Selenocysteine biosynthesis is not present in the pea aphid, and selenoproteins are absent. Several genes in the A. pisum genome were found to have arisen from bacterial ancestors, and some of these genes are highly expressed in bacteriocytes, where they may function in the regulation of symbiosis. Interestingly, the genes coding for proteins that function in the IMD pathway of the immune system are absent in the A. pisum genome.
Yellow fever mosquito, Aedes aegypti. The 1.38-Gb genome of Ae. aegypti was assembled from sequence reads obtained by WGS (Nene et al., 2007). This is the largest insect genome sequenced to date, and is about five times larger than the An. gambiae and D. melanogaster genomes. Approximately 47% of the Ae. aegypti genome consists of transposable elements. The presence of large numbers of transposable elements could have contributed to the larger size of the Ae. aegypti genome. A total of 15,419 predicted genes were identified in the assembled genome. Compared to the genome of An. gambiae, an increase in the number of genes encoding odorant binding proteins, cytochrome P450s, and cuticle proteins was observed in the Ae. aegypti genome.
Silk moth, Bombyx mori. The silkworm genome was sequenced by Japanese and Chinese laboratories simultaneously. The Japanese group used the sequence data derived from WGS to assemble 514 Mb including gaps, and 387 Mb without gaps (Mita et al., 2004). Chinese scientists assembled sequences obtained by WGS into a 429-Mb genome (Xia et al., 2004). The two data sets were merged and assembled recently (The International Silkworm Genome Consortium, 2008). This resulted in 8.5-fold sequence coverage of an estimated 432-Mb genome. The repetitive sequence content of this genome was estimated at 43.6%. A total of 14,623 gene models were predicted using a GLEAN-based algorithm. Among the predicted genes, 3000 showed no homologs in insects or vertebrates. The presence of specific tRNA clusters and several sericin gene clusters correlates with the main function of this insect: the massive production of silk.
Recently, a consortium of international scientists sequenced the genomic DNA of 40 domesticated and wild silkworm strains to a coverage of approximately threefold. This represents 99.88% of the genome, and led to the development of a single base-pair resolution silkworm genetic variation map (Xia et al., 2009). This effort identified ~16 million single-nucleotide polymorphisms, many indels, and structural variations. These studies showed that domesticated silkworms are genetically different from wild ones; nonetheless, they have maintained high levels of genetic variability. These findings suggest a short domestication event involving a large number of individuals. A total of 354 candidate genes that are expressed in the silk gland, midgut, and testes may have played an important role during domestication.
The southern house mosquito, Culex quinquefasciatus. C. quinquefasciatus is a vector of important viruses such as the West Nile virus and the St Louis encephalitis virus, and harbors nematodes that cause lymphatic filariasis. Arensburger and colleagues sequenced and assembled the whole genome of C. quinquefasciatus (Arensburger et al., 2010). A total of 18,883 genes, a larger number than reported from the other two mosquito genomes (Aedes aegypti and Anopheles gambiae), were identified in the assembled C. quinquefasciatus genome. An increase in the number of genes coding for olfactory and gustatory receptors, immune proteins, and enzymes involved in xenobiotic detoxification, such as cytosolic glutathione transferases and cytochrome P450s, was observed.

1.3. Genome Analysis

Since its development, Sanger sequencing has been applied in most genome sequencing projects (Sanger et al., 1977); as a result, a large volume of sequence information from a variety of species has been deposited into various databases. With deciphered full genome sequences for a number of species, scientists can now begin to address biological questions on a genome-wide level. These analyses include the measurement of global gene expression, the identification of functional elements, and the mapping of genome regions associated with quantitative traits. Various new technologies have also been developed to assist with genome analysis. These include DNA microarrays (Schena et al., 1995), serial analysis of gene expression (SAGE) (Schena et al., 1995), chromatin immunoprecipitation microarrays (Ren et al., 2000; Iyer et al., 2001; Lieb et al., 2001), next generation sequencing (NGS) (Margulies et al., 2005; Shendure et al., 2005), genome-wide RNAi screens (Kiger et al., 2003), comparative genomics (Kiger et al., 2003), and metagenomics (Chen and Pachter, 2005). These genomic analysis tools have greatly improved our understanding of how biological and cellular functions are regulated by the RNAs or proteins encoded in an organism’s genome. Especially in the agricultural research field, functional genomics studies will enhance our understanding of the biology of insect pests and disease vectors, which in turn will assist the design of future pest control strategies. Here, we will discuss technologies used for functional genomics studies, with an emphasis on forward genetics, DNA microarray, and NGS technologies, and their applications in research on insects.

1.3.1. Forward and Reverse Genetics

The function of genes is often studied using forward genetics approaches. In forward genetic screens, insects are treated with mutagens to induce DNA lesions, followed by a screen to identify mutants with a phenotype of interest. The mutated gene is then identified by employing standard genetic and molecular methods. Follow-up studies on the mutant phenotype, including molecular analyses of the gene, often lead to determination of its function. Forward genetics approaches have been used for determining the function of many genes. In the fruit fly, D. melanogaster, genetic screens have been used for a number of years to discover gene–phenotype associations. With the availability of massive amounts of data derived from whole-genome and omics studies, a systems biology approach needs to be applied to enhance the power of gene function discovery in vivo. Mobile elements or chemicals are often used as mutagenesis tools (Ryder and Russell, 2003). The P element has been widely used in D. melanogaster forward genetics since its development as a tool for transgenesis in 1982 (Rubin and Spradling, 1982). The insertion of P elements into the D. melanogaster genome allowed subsequent cloning and characterization of a large number of fly genes. P-element-mediated transgenesis is often used to create mutants by excising flanking genes through imprecise mobilization of the P elements. P elements were also modified to study genes based not only on a phenotype, but also on RNA or protein expression patterns; these approaches are often referred to as enhancer trap and gene trap technologies. P elements are also being used as mutagenesis agents in a project aimed at generating insertions in every predicted gene in the fruit fly genome.
Recent developments in transgenic techniques that focus on the integration of transgenes at specific genomic sites, and that employ recombinases and integrases, have made forward genetics in D. melanogaster effective and specific. One of the major drawbacks of P-element-mediated transgenesis is the non-specific and positional effects caused by inserting exogenous DNA into the insect genome. Recently, several methods have been developed to eliminate these unwanted, non-specific effects in transgenic insects. Transgene co-placement was developed by Siegal and Hartl (1996). This method uses two transgenes, a rescue fragment and its mutant version, which are inserted into the same locus by using a P-element vector that contains the recognition sites FRT (the FLP recombinase recognition site) and loxP (the Cre recombinase recognition site). After integration, FLP can remove one transgene, such as the rescue gene. Cre can remove the other transgene, which may be the mutant version. Golic and colleagues (Golic et al., 1997) developed a method that uses FLP recombinase to remobilize a transgene. The method uses a donor transposon that contains a transgenic insert together with a marker gene such as white, flanked by two FRT sites, and an acceptor transposon that contains a second marker and one FRT site. The remobilization of the donor transposon by FLP can be followed by the changes in the expression of the white gene. The remobilization results in the excision of the transgene and its potential integration into the FRT site of the acceptor transposon.



Homologous recombination is the best method for in vivo gene targeting, since positional effects can be eliminated completely. Insertional gene targeting (Rong and Golic, 2000) and replacement gene targeting (Gong and Golic, 2003) are two alternative methods that have been developed. Insertional gene targeting results in the insertion of a target gene at a region of homology. Replacement gene targeting results in replacement of endogenous homologous DNA sequences with exogenous DNA through a double reciprocal recombination between two stretches of homologous sequences. Site-specific zinc-finger-nuclease-stimulated gene targeting has been developed to further improve in vivo gene targeting (Bibikova et al., 2003; Beumer et al., 2006). The most widely used site-specific integration system in D. melanogaster employs the bacteriophage ΦC31 integrase. The bacteriophage ΦC31 integrase catalyzes the recombination between the phage attachment site (attP), previously integrated into the fly genome, and a bacterial attachment site (attB) present in the injected transgenic construct (Groth et al., 2004). A combination of different transgenic methods should aid in D. melanogaster functional genomics studies aimed at determining the function of every gene in this insect.
In the reverse genetics approach, studies on the function of genes start with the gene sequences, rather than with a mutant phenotype, which is the usual starting point in forward genetics approaches. In this approach, the gene sequence is used to alter the gene function by employing a variety of methods. The effect of the altered gene function on physiological and developmental processes of insects is then determined. Reverse genetics is an excellent complement to forward genetics, and some experiments are much easier to perform using reverse genetics than forward genetics. For example, RNA interference, a reverse genetics method (covered in Chapter 2 in this volume), is better suited than forward genetics to investigating the functions of all the members of a gene family. The availability of whole-genome sequences for a number of insects and the functioning of RNAi in these insects will keep scientists busy studying the functions of all genes in insects during the next few years.


1.3.2. DNA Microarray

In most cases, a group of functionally associated genes share similar expression patterns, which may be temporal, spatial, developmental, or physiological. For example, environmental changes and pathological conditions could alter global gene expression patterns. To understand and characterize the biological roles of an individual gene or a cluster of genes, a high-throughput quantitative method is needed to detect gene expression at the whole-genome level. The DNA microarray technique is one such method that has been developed for monitoring global gene expression patterns. Through robotic printing of thousands of DNA oligonucleotides onto a solid surface, one DNA microarray chip can accommodate more than 50,000 probes (unique DNA sequences). DNA microarrays utilize the hybridization principle of Southern blotting (Schena et al., 1995). First, fluorescently labeled cDNA targets are synthesized from RNA samples by reverse transcription; the labeled targets are then hybridized to DNA microarrays, which carry the complementary probe DNA. After washing away the unbound targets, the intensity of the fluorescent signal for each spot is captured using a microarray scanner. DNA microarrays have been widely used in functional genomics research. In addition to their application in gene expression profiling, DNA microarrays can also be used to identify transcriptional or functional elements in the genome, or to identify single nucleotide polymorphisms (SNPs) among alleles within or between populations. The applications of DNA microarrays and various other types of arrays are listed in Table 2.
1.3.2.1. Global gene expression analysis (transcriptome analysis)

1.3.2.1.1. DNA microarray fabrication.  The DNA microarrays used for global gene expression analysis usually contain tens of thousands of probes which cover all the predicted genes in a genome, or sequences representing transcribed regions, also called expressed sequence tags (ESTs). For example, the Affymetrix GeneChip®

Table 2  List of Applications of DNA Microarray

Application                   Description                                                                       Type of microarray
Gene expression               Measuring global gene expression pattern under various biological conditions      Expression array
ChIP-on-chip                  Identifying transcriptional or functional elements at a whole-genome level        Tiling array
DamID                         Genome-wide scanning of adenosine methylation events, analogous to ChIP-on-chip   DNA methylation array
miRNA profiling               Genome-wide detection of the expression of miRNAs (small non-coding RNAs)         miRNA array
SNP detection                 Detecting polymorphisms within a population                                       SNP array
Pathogen and virus detection  Low-density DNA microarray for the identification of viruses and pathogens        Virus Chip, FluChip

1.3.2.1.2. Target preparation and hybridization.  Total RNA or mRNA is isolated from experimental samples using commercial TRIzol reagent or RNA isolation and purification kits. Total RNA (1 μg to 15 μg) or mRNA (0.2 μg to 2 μg) is reverse transcribed into first-strand cDNA. For smaller amounts of total starting RNA (10 ng to 100 ng), Affymetrix offers a two-cycle target labeling method to obtain sufficient amounts of labeled targets for DNA hybridization. Then, cDNAs are labeled and hybridized to spotted or oligonucleotide microarrays. In oligonucleotide microarrays, one mRNA sample labeled with one fluorescent dye is analyzed on a single channel. Alternatively, two different fluorescent dyes, such as Cy3 and Cy5, can be used to determine gene expression changes between two different experimental conditions.
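For a two-dye design such as the Cy3/Cy5 example above, the comparison for each spot is commonly summarized as the ratio of the two background-corrected intensities. The following is a minimal sketch in plain Python; taking the logarithm to base 2, putting Cy5 in the numerator, and clamping corrected intensities at a small positive floor are assumptions made for illustration, and the function and parameter names are hypothetical rather than part of any vendor software.

```python
from math import log2


def log2_ratio(cy5_signal, cy5_background, cy3_signal, cy3_background, floor=1.0):
    """Background-corrected log2(Cy5/Cy3) ratio for a single spot.

    Corrected intensities are clamped to `floor` so that spots at or below
    background do not produce a zero or negative value inside the logarithm.
    """
    cy5 = max(cy5_signal - cy5_background, floor)
    cy3 = max(cy3_signal - cy3_background, floor)
    return log2(cy5 / cy3)


# Hypothetical spot: Cy5 = treated sample, Cy3 = control sample.
ratio = log2_ratio(cy5_signal=2100, cy5_background=100,
                   cy3_signal=600, cy3_background=100)
print(ratio)  # positive values mean higher signal in the Cy5-labeled sample
```

A log ratio treats increases and decreases symmetrically: a two-fold increase gives +1 and a two-fold decrease gives -1.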

1.3.2.1.3. Data analysis.  Although the data analysis methods among commercial microarrays vary, the basic concepts are similar. After hybridization, the fluorescence images are captured by a microarray scanner. The fluorescence intensity data are then corrected for background (noise), which may result from non-specific hybridization or autofluorescence. In two-channel arrays, the fluorescence intensity ratio between the two dyes is calculated and adjusted. If the data from a different array or hybridization are to be compared, they need to be normalized before further analysis.

After normalization, various statistical analysis methods can be applied to identify differentially expressed genes between two treatments. Usually, a t-test is used for comparing the means of two sample populations, while ANOVA (analysis of variance) is applied for comparing multiple sets of samples or treatments to obtain more accurate variance estimates. Since many genes are tested for statistical differences, multiple test corrections, such as the Bonferroni correction and the Benjamini and Hochberg false discovery rate (FDR) (Benjamini and Hochberg, 1995), are applied to adjust the P-value and reduce the occurrence of false positives. Bonferroni correction is a very stringent method that uses α/n as the threshold P-value for each test, where α is the desired overall significance level and n is the number of tests (here, the number of genes). In contrast, the Benjamini and Hochberg FDR is less stringent, and produces fewer false negatives. Various statistical analysis programs are now available from either commercial microarray providers or open source websites. These include GeneSpring from Silicon Genetics (acquired by Agilent in 2004) and Significance Analysis of Microarrays (SAM) (Tusher et al., 2001). Besides differential expression analysis, genes with similar expression patterns can be grouped into one or more clusters using hierarchical clustering methods. Hierarchical clustering analysis helps to visualize gene expression patterns and identify relationships between functionally associated genes (Eisen et al., 1998). In addition, programs such as Gene Set Enrichment Analysis (GSEA) are used to determine whether there is a statistically significant, coordinated difference between control and treatment samples for a predefined set of genes that are involved in a similar biological process (Subramanian et al., 2005). Unlike traditional microarray analyses at the single gene level, GSEA addresses the situation where the fold change between control and treatment samples is small, but there is a concordant difference in the representation of functionally related genes. Several published microarray datasets have been deposited in various online databases, including Gene Expression Omnibus (GEO) at NCBI, ArrayExpress at the European Bioinformatics Institute, and Stanford Genomic Resource at Stanford University. A list of microarray analysis tools and databases is shown in Table 3.
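The two multiple-test corrections discussed in this section (Bonferroni and the Benjamini and Hochberg FDR) can be sketched in a few lines. The example below is a minimal illustration in plain Python, not the implementation used by GeneSpring, SAM, or any other package; the function names and the p-values in the usage lines are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold under the Bonferroni correction (alpha / n)."""
    return alpha / n_tests


def benjamini_hochberg(p_values):
    """Benjamini and Hochberg adjusted p-values, returned in the input order.

    Each p-value is scaled by n / rank (rank 1 = smallest p-value), and the
    scaled values are then made monotone by walking from the largest p-value
    down to the smallest while keeping a running minimum.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical per-gene p-values from a two-treatment comparison.
raw = [0.01, 0.04, 0.03, 0.005]
alpha = 0.05
threshold = bonferroni_threshold(alpha, len(raw))
bonferroni_hits = [p <= threshold for p in raw]
fdr_hits = [q <= alpha for q in benjamini_hochberg(raw)]
print(bonferroni_hits, fdr_hits)
```

In this toy input, the Benjamini and Hochberg procedure retains all four genes at an FDR of 0.05, whereas the Bonferroni threshold of 0.0125 retains only two, which illustrates why the former is described as less stringent.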

1.3.2.1.4. Applications.  The primary goal of developing gene expression microarray technology is to monitor differentially expressed genes at the whole-genome level. Therefore, microarray technology has been used to study the molecular basis of pesticide resistance (Djouaka et al., 2008; Zhu et al., 2010) (Figure 3), insect–plant interactions (Held et al., 2004), insect host–parasitoid associations (Lawniczak and Begun, 2004; Barat-Houari et al., 2006; Mahadav et al., 2008; Kankare et al., 2010),

Drosophila Genome 2.0 Array contains over 500,000 data
points representing 18,500 transcripts and various SNPs
(Affymetrix technical data sheets). DNA microarrays can
be prepared by various methods, including photolithography,
ink-jet technology, and spotted array technology.
Photolithography and ink-jet technologies are used for
fabricating so-called oligonucleotide microarrays, which
are made by synthesizing or printing short oligonucleotide sequences (25-mer in Affymetrix arrays or 60-mer
in Agilent arrays) directly onto a solid array surface. The
photolithography method is used by Affymetrix and NimbleGen, while the ink-jet printing method is used by Agilent.
Typically, multiple probes per gene are used in order to
achieve precise estimation of gene expression. Long oligonucleotides have better hybridization specificities than
short ones, although short oligonucleotides can be printed
at a higher density and synthesized at lower cost. In contrast, spotted microarrays are made by synthesizing probes
prior to deposition onto the array surface. The probes
used for spotted microarrays can be oligonucleotides,
cDNA or PCR products. Because of their relatively low
cost and flexibility, spotted microarray technology
has been widely used to produce custom arrays in many
academic laboratories and facilities. However, spotted
microarrays are less uniform and have a lower probe density than oligonucleotide arrays. As the
cost of custom commercial arrays such as Agilent Custom
Gene Expression Microarrays (eArray) has decreased, the
use of spotted microarrays is decreasing as well.
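The use of several short probes per gene can be sketched as follows. This is an illustration under stated assumptions rather than any vendor's design or summarization algorithm: the function names, the 5-base step, the toy transcript, and the intensities are all hypothetical, and the median of log2 intensities is only one simple way to combine probes.

```python
import math
import statistics


def tile_probes(transcript, probe_len=25, step=5):
    """Return probe_len-mers taken every `step` bases along a transcript."""
    last_start = len(transcript) - probe_len
    return [transcript[i:i + probe_len] for i in range(0, last_start + 1, step)]


def summarize_probe_set(intensities):
    """Combine the probes of one gene into a single log2 expression estimate."""
    return statistics.median(math.log2(x) for x in intensities)


# Hypothetical 40-nt transcript tiled with 25-mer probes.
transcript = "ACGT" * 10
probes = tile_probes(transcript)

# Hypothetical background-corrected intensities for three probes of one gene.
estimate = summarize_probe_set([100.0, 200.0, 400.0])
```

Taking a robust summary such as the median limits the influence of a single probe that hybridizes poorly or cross-hybridizes.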


12  1: Insect Genomics
Table 3  List of Microarray Data Analysis Tools and Microarray Databases

Statistical Analysis Programs
  GeneSpring
  SAM
  Bioconductor
  Partek

Cluster and Pathway Analysis Tools
  Cluster and TreeView
  Cluster 3.0
  Java TreeView
  Gene Set Enrichment Analysis (GSEA)    www.broadinstitute.org/gsea/
  Gene Set Analysis (GSA)
  Genepattern
  Genecruiser
  Advanced Pathway Painter

Microarray Databases
  Gene Expression Omnibus
  ArrayExpress Archive
  Stanford Genomic Resources
  Arraytrack
  Genevestigator

insect behavior (McDonald and Rosbash, 2001; Etter and
Ramaswami, 2002; Dierick and Greenspan, 2006; Adams
et al., 2008; Kocher et al., 2008), development and reproduction (White et al., 1999; Kawasaki et al., 2004; Dana
et al., 2005; Kijimoto et al., 2009; Bai and Palli, 2010;
Parthasarathy et al., 2010a, 2010b), etc. Understanding
the mechanisms of pesticide resistance is critical for prolonging the life of existing insecticides, designing novel
pest control reagents, and improving control strategies.
As a result, several laboratories have begun using microarrays to identify genes responsible for insecticide resistance.
For example, using a custom microarray, one cytochrome
P450 gene, CYP6BQ9, has been identified to be responsible for the majority of deltamethrin resistance in
T. castaneum (Zhu et al., 2010) (Figure 3). Another microarray study discovered that two cytochrome P450 genes,
CYP6P3 and CYP6M2, are upregulated in multiple pyrethroid-resistant Anopheles gambiae populations collected
in Southern Benin and Nigeria (Djouaka et al., 2008).
A global view of tissue-specific gene expression profiling
has been reported in Drosophila melanogaster (Chintapalli
et al., 2007). This study identified many genes that are
uniquely expressed in specific fly tissues, and provided
useful information for understanding the tissue-specific
functions of these candidate genes.
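The selection criteria described for Figure 3A (a minimum 2.0-fold difference, a nominal significance level of 0.001, and a Bonferroni multiple-testing correction) can be sketched as below. This is a hedged illustration, not the analysis of Zhu et al. (2010): the gene labels, intensities, and P values are invented, the family-wise level of 0.05 and the number of tests (10,000) are assumptions, and the P values are taken as already computed by a t-test.

```python
import math


def classify_genes(records, n_tests=None, fold_cutoff=2.0,
                   nominal_alpha=0.001, family_alpha=0.05):
    """Label each gene by fold change and P value thresholds.

    records: iterable of (gene, mean_resistant, mean_susceptible, p_value).
    n_tests: number of genes tested on the array (defaults to len(records)).
    """
    records = list(records)
    n_tests = n_tests or len(records)
    bonferroni_cutoff = family_alpha / n_tests  # Bonferroni-adjusted threshold
    min_log2_fc = math.log2(fold_cutoff)

    results = {}
    for gene, resistant, susceptible, p_value in records:
        log2_fc = math.log2(resistant / susceptible)
        passes_fold = abs(log2_fc) >= min_log2_fc
        if passes_fold and p_value < bonferroni_cutoff:
            label = "bonferroni"
        elif passes_fold and p_value < nominal_alpha:
            label = "nominal"
        else:
            label = "not significant"
        results[gene] = (log2_fc, label)
    return results


# Hypothetical genes; geneA to geneD are placeholders, not genes from the study.
records = [
    ("geneA", 800.0, 100.0, 1e-7),   # strongly overexpressed in the resistant strain
    ("geneB", 250.0, 100.0, 5e-4),   # passes only the nominal threshold
    ("geneC", 150.0, 100.0, 1e-7),   # significant P value but below 2.0-fold
    ("geneD", 100.0, 400.0, 1e-7),   # strongly suppressed in the resistant strain
]
results = classify_genes(records, n_tests=10000)
```

Plotting `log2_fc` against the negative logarithm of the P value for every gene gives the kind of plot shown in Figure 3A.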
Biological processes and cellular functions are rarely
regulated by only one or a few genes. Therefore, monitoring the expression changes of a group of genes under different biological conditions could provide useful insights
into biological processes and cellular functions. Microarrays have been applied to detect gene expression patterns
during insect embryonic development (Furlong et al.,
2001; Stathopoulos et al., 2002; Tomancak et al., 2002;
Altenhein et al., 2006; Sandmann et al., 2007) and metamorphosis (White et al., 1999; Butler et al., 2003), under
various nutrient conditions (Zinke et al., 2002; Fujikawa
et al., 2009), with aging (Weindruch et al., 2001; Pletcher
et al., 2002; Terry et al., 2006; Pan et al., 2007), and in
many other circumstances.
In combination with newly developed statistical and
bioinformatics methods, and gene ontology and signaling
pathway databases, microarray technology has also been
applied to identify a signaling pathway or a specific cellular function that is altered under various biological conditions (Subramanian et al., 2005). With these approaches,
it is possible to discover the interactions between individual pathways and obtain a global network view (Costello
et al., 2009; Avet-Rochex et al., 2010).
1.3.2.2. DNA–protein interaction (chromatin immunoprecipitation)
Chromatin immunoprecipitation (ChIP)
was developed in the late 1980s (Hebbes et al., 1988)
and has been widely applied to the study of protein–
DNA interactions in vivo. Particularly, transcription factors, histone modifications, and DNA replication-related
proteins can be studied using ChIP. By combining ChIP
with DNA microarray technology, a process typically
called ChIP-on-chip, all the possible DNA-binding sites
of a protein of interest throughout the genome can be
examined. ChIP-on-chip technology first appeared in
2000 in studies of DNA-binding proteins in the budding
yeast, Saccharomyces cerevisiae (Ren et al., 2000; Iyer et al.,
2001). With the availability of high-density oligonucleotide arrays which contain short sequences representing
non-coding regions or entire genomes, ChIP-on-chip
has also been applied to the global identification of


Figure 3  Application of microarray and RNA interference
technologies to identify and fight insecticide resistance.
Reprinted with permission from Zhu et al. (2010).
(A) The V plot of differentially expressed genes identified by
microarrays. Fold suppression or overexpression of genes in
the QTC279 strain when compared with their levels in the Lab-S
strain was plotted against the P values of the t-test. The
horizontal bar in the plot shows the nominal significance level
of 0.001. The vertical bars separate the genes that show a minimum
of 2.0-fold difference. Three genes identified by the Bonferroni
multiple-testing correction as differentially expressed between
resistant and susceptible strains are shown.
(B) Injection of CYP6BQ9 dsRNA into Tribolium castaneum
QTC279 beetles reduces CYP6BQ9 mRNA levels. The
mRNA levels of CYP6BQ9 were quantified by qRT-PCR at 5
days after dsRNA injection. The relative mRNA levels are
shown as a ratio in comparison with the levels of rp49 mRNA.
(C) Dose–response curves for T. castaneum adults exposed
to deltamethrin. At 5 days after dsRNA injection, the following
were exposed to various doses of deltamethrin: Lab-S (◯),
a susceptible strain; QTC279 (▽), a deltamethrin-resistant
strain; QTC279-CYP6BQ9 RNAi (●), a QTC279 strain injected
with CYP6BQ9 dsRNA; and QTC279-malE RNAi (▼), a
QTC279 strain injected with malE dsRNA as a control.

transcriptional regulatory networks in various organisms. These projects include ENCODE (human) (The
ENCODE Project Consortium, 2004) and modENCODE (worm and fly) (Celniker et al., 2009). The goal
of these projects is the genome-wide characterization of
all possible functional elements using ChIP-on-chip and
other high-throughput technologies. ChIP-on-chip technology will likely contribute to a better understanding of
genome organization, including functionally important
elements, non-coding RNA, and chromatin markers.
This may eventually lead to the comprehensive understanding of gene regulatory networks within an organism’s genome.
Many ChIP-on-chip protocols have been published, or
are available online. In general, cells or tissues are treated
using a reversible cross-linker (e.g., formaldehyde), so that
protein and DNA are fixed in vivo. Then the protein–
DNA complex within the nucleus is extracted and separated from cytoplasm. Purified protein–DNA complexes
(referred to as “chromatin” hereafter) are sonicated using
a conventional sonicator or Bioruptor® in order to generate DNA fragments that range from 200 to 1000 bp. The
sonication conditions need to be pre-adjusted to obtain
optimally sized DNA fragments. Before sonication, an aliquot of chromatin needs to be saved as a reference sample
(or input sample). Usually a chromatin pre-clean step
using protein-A beads is included to remove non-specific
binding during the immunoprecipitation step. For the
immunoprecipitation step, a certain amount (e.g., 10 μg)
of antibody and protein-A beads are added to the pre-cleaned
chromatin. Chromatin bound to protein-A beads is
then purified, eluted, and reverse-cross-linked. Since the
amount of a single ChIP DNA sample is normally around
a few nanograms, and this is not enough for microarray
hybridization, an amplification step is required. There
are two ways to amplify ChIP DNA: ligation-mediated
PCR (LM-PCR) and whole-genome amplification
(WGA). The WGA method is considered to have lower
background compared to the LM-PCR method (O’Geen


1.3.2.3. DNA–protein interaction (chromatin immunoprecipitation)
Due to the availability of whole-genome sequences,
the application of ChIP-on-chip technology is mainly
used in model insects. ChIP-on-chip has been applied
to dissecting the transcriptional regulatory network of

[Figure 4 box labels: Cross-link; Fragmentation; Immunoprecipitation; Reverse cross-link; DNA purification; Amplification; Chip hybridization; Data analysis]

et al., 2006). Amplified ChIP DNA and Input DNA are
then denatured, fluorescently labeled, and hybridized to
either a spotted or an oligonucleotide microarray (typically
a tiling array). If there is a known target binding site for
the protein of interest, the quality of ChIP samples can
be assessed using real-time qPCR before submitting the
samples for microarray analysis.
The data preprocessing steps of ChIP-on-chip are similar to those used in gene expression microarrays. After
microarray scanning and fluorescence intensity recording,
the enrichment of each binding site across the genome is
obtained by comparing the intensity of each spot between
ChIP DNA and Input DNA. Enriched regions can then
be further analyzed, including identification of genes
associated with each binding region, and conserved motif
searching. The enrichment can also be visualized using
many freely available genome browsers, such as UCSC
Genome Browser, Integrated
Genome Browser (IGB), and
Integrative Genomics Viewer (IGV).
The workflow of a chromatin immunoprecipitation experiment is shown in Figure 4.
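The comparison of ChIP DNA and Input DNA intensities can be sketched as a per-probe log2 ratio followed by a search for runs of neighbouring enriched probes on a tiling array. This is a minimal sketch under stated assumptions, not a published peak-calling method: the function names, the pseudocount, the 2-fold threshold, the three-probe minimum, and the intensities are all hypothetical.

```python
import math


def log2_enrichment(chip, input_dna, pseudocount=1.0):
    """Per-probe log2(ChIP / Input) ratio; the pseudocount avoids division by zero."""
    return [math.log2((c + pseudocount) / (i + pseudocount))
            for c, i in zip(chip, input_dna)]


def enriched_runs(positions, ratios, min_log2=1.0, min_probes=3):
    """Return (start, end) positions of runs of consecutive enriched probes."""
    runs, start_idx = [], None
    # A sentinel value at the end closes a run that reaches the last probe.
    for idx, ratio in enumerate(list(ratios) + [float("-inf")]):
        if ratio >= min_log2:
            if start_idx is None:
                start_idx = idx
        elif start_idx is not None:
            if idx - start_idx >= min_probes:
                runs.append((positions[start_idx], positions[idx - 1]))
            start_idx = None
    return runs


# Hypothetical background-corrected intensities for seven tiled probes.
positions = [100, 150, 200, 250, 300, 350, 400]
chip = [99.0, 99.0, 399.0, 799.0, 399.0, 99.0, 99.0]
input_dna = [99.0] * 7

ratios = log2_enrichment(chip, input_dna)
regions = enriched_runs(positions, ratios)
```

Requiring several adjacent probes to pass the threshold reflects the fact that sonicated fragments of 200 to 1000 bp span more than one probe on a tiling array, so an isolated high probe is more likely to be noise.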
Antibody quality is a critical factor for successful ChIP-on-chip
experiments. Since there are a variety of antibodies for a protein of interest, each with a specific affinity,
it is always better to examine all the available antibodies
in a small-scale ChIP-PCR experiment. If there are no
suitable antibodies for a protein of interest, an epitope-tagged
protein can be used (Zhang et al., 2008). In this
way, an antibody for the epitope instead of one for the
protein of interest can be used in immunoprecipitation.
In Drosophila, transgenic flies may be generated to express
epitope-tagged proteins in vivo.
The success of ChIP experiments also depends on the
sonication step. It is suggested that 200- to 1000-bp DNA
fragments should be obtained after sonication or DNA
shearing. Undersonication will result in many large fragments (larger than 1000 bp) and lead to loss of resolution.
Oversonication could interfere with the protein–DNA
complex formation, and may result in more noise.
As mentioned above, the WGA amplification method
is considered better than the LM-PCR method. Due to
the bias caused by PCR amplification, the signal-to-noise
ratio normally decreases after a PCR reaction; therefore,
minimizing the number of PCR cycles is suggested. As
reported by O’Geen et al. (2006), the WGA amplification
method has higher signal-to-noise ratio and more enriched
binding sites when compared to the LM-PCR method.

[Figure 4 box labels, continued: Chromatin immunoprecipitation; Sequencing; Chip normalization; Base calling; Background adjustment; Reference genome alignment; Binding site mapping; Target gene identification; Motif analysis]
Figure 4  The workflow of a chromatin immunoprecipitation-sequence
identification experiment. After cross-linking, the
chromatin is precipitated with antibodies; the cross-links in the
precipitated chromatin are reversed, and the DNA is purified and
amplified. The amplified DNA is then sequenced and aligned
to the reference genome and potential binding sites are
identified.

embryogenesis (Sandmann et al., 2007; Zeitlinger et al.,
2007; Liu et al., 2009), chromatin modification (Alekseyenko et al., 2008; Smith et al., 2009; Tie et al., 2009),
epigenetic silencing (Negre et al., 2006), etc. Interestingly,
a high-resolution transcriptional regulatory atlas of mesoderm development was constructed through the analysis
of a key set of transcription factors, including Twist, Tinman, Myocyte enhancer factor 2, Bagpipe and Biniou, in
the Drosophila embryo (Zinzen et al., 2009).
1.3.3. Next Generation Sequencing (NGS)

Although DNA microarray technologies are widely used
in many aspects of biological and medical research, there
are some limitations. The design of the microarrays is
based on our current knowledge of sequenced genomes
from computationally predicted raw genome structures.
These structures include gene coding regions, introns,
enhancers, and non-coding RNAs. Due to a lack of comprehensive knowledge on the chromosome landscape,

