Understanding and Assembling
454 Transcriptome sequences
Transcriptome Workshop
Nov 2010
Stephen Bridgett
Aims
• Why sequence transcriptomes?
• How does 454 sequencing work?
• What are ‘sff’ files?
• Using sff tools
• What is assembly?
• Challenges to assembly
• Newbler assembler and Output files
• Exercises with sample data
Why sequence transcriptomes?
• Gives a more dynamic view of the activity in a cell than
genome sequencing would, as it:
• Gives relative expression levels for different cells under
different conditions.
• Could identify alternative splicing and fusion genes
(important in several cancers).
• Focuses on gene sequences, which are often the main
research focus.
How does 454 sequencing work?
454 sequencer
DNA Capture bead,
emPCR,
Pyrosequencing reaction,
Signal image,
Base calling
An Animation of 454 sequencing
• Animation of 454 sequencing from the Wellcome Trust
website, to explain flowgrams:
454Animation_from_Wellcome_WTX056030.swf
• Helps in understanding the alignment output from the assembler.
Data obtained from 454 sequencing
Roche 454 ‘titanium’ genome reads are approx. 400 bases long.
Transcriptome reads tend to be a bit shorter, eg. 350 bases.
Typically 700,000 reads from one sequencing plate.
Plates can be divided into 2, 4, 8 or 16 lanes.
Samples can have an MID (multiplex identifier) ‘barcode’
added, so several samples can be run together in the same
lane.
What are ‘sff’ files?
• ‘Sff’ files are Roche’s “Standard Flowgram Format” files,
containing the sequence data produced from a 454 run.
• The sff files contain:
• a Manifest header at the start describing the contents,
• flow intensity signal values, base calls and quality scores for each read.
• They are in binary format, so need to be converted to a text
format, such as a fasta file (using the ‘sffinfo’ program).
• The Sequence Read Archive (SRA at EBI or NCBI)
requests that these .sff files be uploaded, to obtain an accession
number for publications.
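To make "binary format" concrete, here is a minimal Python sketch that decodes only the fixed-size common header found at the start of an sff file. The field layout follows the publicly described SFF structure as I understand it, and the demo bytes are synthetic (the read count, flow characters and key are made up), so treat it as an illustration rather than a replacement for ‘sffinfo’.

```python
import struct

# Fixed-size part of the SFF common header (big-endian), assumed layout:
# magic, version, index offset, index length, number of reads,
# header length, key length, flows per read, flowgram format code.
COMMON_HEADER = struct.Struct(">4s4sQIIHHHB")


def parse_sff_common_header(data):
    """Decode the common header from the first bytes of an sff file."""
    (magic, version, index_offset, index_length, number_of_reads,
     header_length, key_length, flows_per_read, format_code) = \
        COMMON_HEADER.unpack_from(data, 0)
    if magic != b".sff":
        raise ValueError("not an sff file (bad magic number)")
    pos = COMMON_HEADER.size
    flow_chars = data[pos:pos + flows_per_read].decode("ascii")
    pos += flows_per_read
    key_sequence = data[pos:pos + key_length].decode("ascii")
    return {
        "number_of_reads": number_of_reads,
        "header_length": header_length,
        "flows_per_read": flows_per_read,
        "flow_chars": flow_chars,
        "key_sequence": key_sequence,
    }


def build_demo_header(number_of_reads, flow_chars, key_sequence):
    """Build synthetic header bytes for the demo (not a real sff file)."""
    raw_length = COMMON_HEADER.size + len(flow_chars) + len(key_sequence)
    header_length = (raw_length + 7) // 8 * 8  # padded to a multiple of 8
    body = COMMON_HEADER.pack(
        b".sff", b"\x00\x00\x00\x01", 0, 0, number_of_reads,
        header_length, len(key_sequence), len(flow_chars), 1)
    body += flow_chars.encode("ascii") + key_sequence.encode("ascii")
    return body.ljust(header_length, b"\x00")


if __name__ == "__main__":
    demo = build_demo_header(700000, "TACG" * 2, "TCAG")
    print(parse_sff_common_header(demo))
```

On a real file the same function could be given the first few hundred bytes read in binary mode; the per-read records that follow the header are not handled here.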
What is Assembly?
Merge the short reads into long contigs (ideally a full transcript),
by finding the best sequence overlaps between reads.
Eg: Roche’s Newbler assembler, MIRA assembler, TGICL assembler, Phrap, CAP3,
MOSAIK reference-guided assembler, etc.
Reads overlapped to form a contig, viewed in the gsAssembler graphical interface.
Newbler is an ‘overlap’ assembler. There are also de Bruijn graph assemblers, designed to
cope with the very large numbers of short reads from Illumina or SOLiD, such as Velvet,
CLC Assembly Cell, Cortex, SOAPdenovo and ABySS.
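The idea of merging reads by their best overlap can be sketched in a few lines of Python. This is a toy that requires exact suffix/prefix matches and uses an arbitrary minimum overlap; it is not how Newbler or any of the assemblers listed above is implemented, and the two reads are invented.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_pair(left, right, min_overlap=3):
    """Merge two reads into one contig, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


if __name__ == "__main__":
    read_a = "ACGTTGCA"   # made-up reads for illustration
    read_b = "TGCAGGT"
    print(overlap_length(read_a, read_b))  # overlap is "TGCA"
    print(merge_pair(read_a, read_b))
```

Real assemblers must also tolerate sequencing errors within the overlap and check both strands, which this sketch leaves out.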
Challenges for assembly (1)
• Contaminants in samples (eg. from Bacteria or Human).
• Ribosomal RNA (small and large sub-units).
• PCR artifacts (eg. Chimeras and Mutations)
• Sequencing errors, such as “homopolymer” errors, eg. in a run of 3 or more
of the same base.
• MIDs (multiplex identifiers) and primers/adapters (eg. SMART adapters
used to synthesise cDNA) still in the raw reads.
• Repeats and large or polyploid genomes – repeated sequences in the
transcriptome make assembly more difficult.
Challenges for assembly (2)
• Extra sample preparation steps in cDNA synthesis - more risk of
cloning errors or contamination, wider range of read lengths.
• Large expression level range (eg. 10^5) - some transcripts have low
read coverage and some very high coverage.
• Alternative splicing - differing reads from the same part of the genome.
• Roche’s Newbler 2.3 assembler sometimes didn’t finish transcriptome
assembly and seemed to get lost when “Detangling Alignments”, but the
latest Newbler 2.5 beta is able to finish.
Blast search to check for contaminants
• Blastx search of 5,000 randomly picked reads against UniRef90 or
Non-redundant dataset.
• Sorted by frequency of Description (or Tax), for hits with evalue < 1e-8
Frequency        Subject_description
1689 (16.9 %)    Picea sitchensis (Sitka Spruce)
 907 (9.1 %)     Vitis vinifera (Common Grape Vine)
 311 (3.1 %)     Physcomitrella patens subsp. patens (Moss)
 282 (2.8 %)     Arabidopsis thaliana (Thale cress)
 218 (2.2 %)     Oryza sativa Japonica Group (Rice)
 153 (1.5 %)     Zea mays (Maize)
  58 (0.6 %)     Oryza sativa Indica (Rice)
  58 (0.6 %)     Oryza sativa (Rice)
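A frequency table of this kind can be produced with a short tally. The Python sketch below assumes the Blast results have already been reduced to (read id, subject description, e-value) tuples, keeps the best hit per read at or below an assumed cutoff of 1e-8, and counts descriptions. The hit data and the function name are hypothetical.

```python
from collections import Counter


def tally_best_hits(hits, max_evalue=1e-8):
    """Count subject descriptions, using the best-scoring hit of each read.

    `hits` is an iterable of (read_id, subject_description, evalue) tuples.
    Hits with an e-value above `max_evalue` are ignored.
    """
    best = {}
    for read_id, description, evalue in hits:
        if evalue > max_evalue:
            continue
        if read_id not in best or evalue < best[read_id][1]:
            best[read_id] = (description, evalue)
    counts = Counter(description for description, _ in best.values())
    return counts.most_common()


if __name__ == "__main__":
    # Invented hits, for illustration only.
    demo_hits = [
        ("read1", "Picea sitchensis", 1e-30),
        ("read1", "Vitis vinifera", 1e-12),
        ("read2", "Picea sitchensis", 1e-20),
        ("read3", "Escherichia coli", 1e-40),
        ("read4", "Zea mays", 1e-3),  # too weak, filtered out
    ]
    total = len({read_id for read_id, _, _ in demo_hits})
    for description, count in tally_best_hits(demo_hits):
        print(f"{count} ({100 * count / total:.1f} %)  {description}")
```

An unexpected organism near the top of such a list (a bacterium in a plant sample, for example) is the signal to look for contamination.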
Homopolymer error
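[Figure: flowgram of signal intensity (scale 0 to 6) for the nucleotides A, C, T and G
over five cycles. Base calls: Cycle 1: A; Cycle 2: ?c; Cycle 3: TT; Cycle 4: -;
Cycle 5: AAAAA ?a]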
• Difference between a signal of 1 and a signal of 2 = 100%.
• Difference between a signal of 5 and a signal of 6 is only 20%, so errors
are more likely after eg. AAAAA.
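The two percentages above follow from one piece of arithmetic: adding one more base to a run of n raises the ideal signal from n to n+1, a relative step of 1/n. A small Python sketch, assuming an idealised signal that is exactly proportional to run length:

```python
def relative_signal_step(run_length):
    """Relative increase in ideal signal when a run grows by one base."""
    return (run_length + 1 - run_length) / run_length  # simplifies to 1/n


if __name__ == "__main__":
    for n in range(1, 7):
        step = relative_signal_step(n)
        print(f"{n} -> {n + 1} bases: signal increases by {step:.0%}")
```

The step shrinks as the run gets longer, which is why the base caller finds it harder to tell 5 from 6 than 1 from 2.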
Roche software
• Roche have developed Data-Analysis software for
processing, assembling and mapping the 454 reads:
• sffinfo - extract fasta, quality and flowgrams as text from .sff files.
• sfffile - join, split or trim sff files.
• gsAssembler (Newbler) - to assemble reads into contigs/isotigs.
• gsMapper - to map reads to a transcriptome or genome reference.
• gsAmplicon – to analyse Variants in Amplicons.
• (These run on 32 and 64 bit Linux. There is information on
the wiki about obtaining and installing these.)
Exercise 1A – sff files
Aims:
• Using ‘sffinfo’ and ‘sfffile’
• Summarise the read statistics
• Blast the reads for contaminants
The exercises are on the wiki:
http://taw2010wiki
What is “Newbler”?
Roche's “GS De Novo Assembler” (where “GS” = “Genome Sequencer”)
Designed to assemble reads from the Roche 454 sequencer.
Accepts:
454 FLX Standard reads, and
454 Titanium reads.
Single and paired-end reads.
Optionally can include Sanger reads.
Initial versions focused on assembling Genomic reads.
Latest versions (2.3 and now 2.5) improve transcriptome assembly.
Runs on Linux, and has 32 bit and 64 bit versions.
Has command-line and Java-based GUI interfaces.
Rarely called “Newbler” (for “New Assembler”) in Roche's
documentation, rather “runAssembler”, or “gsAssembler”.
How does Newbler work?
cDNA reads -> Alignments -> Contig graph -> Final untangled assembly
Inputs to Newbler assembler
Newbler accepts:
Roche's .sff files (standard flowgram format)
Fasta files, with or without Quality files, such as Sanger
reads (which can be used as scaffolds).
Parameters specified by the user, to guide the assembly,
(or parameters can all be left at their default values.)
Command-line interface
• The simplest command to run Newbler is:
runAssembly [options] reads.sff
• This creates the assembly in an output directory called:
P_yyyy_mm_dd_hh_min_sec_runAssembly
where P_ = Project, followed by date and time
• There are a large number of optional parameters available for
controlling and refining the assembly.
Common command-line options
• -cdna for transcriptome (cDNA) assembly
• -urt ‘use read tips’ to produce longer isotigs
• -o output_directory to set name of output directory
• -vt trimmingFile.fasta to trim primers, adapters from
start or end of reads
• -vs screeningFile.fasta to remove reads that closely
match screening sequences, such as a cloning vector, E. coli or rRNA.
• (-vs and -vt also match reverse-complements of given sequences.)
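The options listed above can be combined into one command line. The Python sketch below only builds and prints the command, using the flags from this slide. The file and directory names are placeholders, and the function name is made up for illustration.

```python
import shlex


def build_run_assembly_command(sff_files, output_dir,
                               trim_fasta=None, screen_fasta=None,
                               use_read_tips=True):
    """Build a runAssembly argument list for a transcriptome (cDNA) assembly."""
    command = ["runAssembly", "-cdna"]
    if use_read_tips:
        command.append("-urt")
    command += ["-o", output_dir]
    if trim_fasta is not None:
        command += ["-vt", trim_fasta]    # primers/adapters to trim
    if screen_fasta is not None:
        command += ["-vs", screen_fasta]  # sequences to screen out
    command += list(sff_files)
    return command


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    command = build_run_assembly_command(
        ["reads.sff"], "my_assembly",
        trim_fasta="trimmingFile.fasta",
        screen_fasta="screeningFile.fasta")
    print(shlex.join(command))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later, or to log exactly which options an assembly was run with.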
Isogroups, Isotigs, Contigs?
• Some definitions to understand Newbler output:
• An isogroup: - tries to represent a gene
- collection of isotigs containing reads that imply
connections between the isotigs.
• An isotig: - represents an individual transcript.
- different isotigs from a given isogroup can be inferred
splice-variants.
• Contigs:
- contigs forming an isotig may be thought of as exons.
- this is not strictly correct, as untranslated regions (UTRs)
and introns (in the case of primary transcripts) may exist in the reads
generated from the sample.
Isotigs - more details
• Connections between contigs in an isogroup are represented by sequences (reads)
that have alignments diverging consistently towards two or more different
contigs or by a depth spike.
• The assembler trims and ignores any poly-A tails, so the true orientation of
reads in the assembly cannot be determined. So an isotig may be output as the
reverse-complement of the true biological transcript.
• For more details see pages 165 - 169 of the Roche software manual (which is on
your computer’s Desktop in the ‘manual’ folder)
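Because an isotig may be reported on either strand, downstream comparisons often need the reverse-complement as well. A minimal Python sketch (the example sequence is invented):

```python
# Translation table for DNA bases, including N for unknown bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse-complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    isotig = "ATGGCC"  # made-up sequence
    print(isotig, "->", reverse_complement(isotig))
```

Searching with both the isotig and its reverse-complement (or using a tool that checks both strands, as Blast does) avoids missing matches caused by orientation.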
Output files for Transcriptome projects (1)
In the Assembly subdirectory:
• 454Isotigs.fna fasta file of all Isotigs, and Contigs which are not in an isotig.
• 454Isotigs.qual quality scores (Phred-based) for each base in the ‘454Isotigs.fna’
file. (eg: 20 = 1 in 100 probability of incorrect base call; 50 = 1 in 100,000)
• 454Contigs.fna fasta file of all contigs, which are used to create the Isotigs.
• 454Contigs.qual quality scores for each base.
• 454NewblerMetrics.txt statistics of the assembly, eg: number of reads and
bases aligned, overlaps found, mean contig sizes, etc.
• 454ReadStatus.txt status of each read in assembly (Assembled,
PartiallyAssembled, Singleton, TooShort, Outlier), and alignment 3' and 5' positions
within contig.
• 454TrimStatus.txt each read's original and revised trim-points used in the
assembly.
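The Phred-based scores in the .qual files convert to error probabilities with the standard relation P = 10^(-Q/10). A short Python sketch of the conversion in both directions:

```python
import math


def phred_to_error_probability(quality):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred score corresponding to a given error probability."""
    return -10 * math.log10(probability)


if __name__ == "__main__":
    for q in (20, 30, 40, 50):
        p = phred_to_error_probability(q)
        print(f"Q{q}: 1 in {round(1 / p):,} chance of an incorrect base")
```

This reproduces the two examples given for 454Isotigs.qual: a score of 20 is 1 in 100, and a score of 50 is 1 in 100,000.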
Output files (2)
• 454AlignmentInfo.tsv base consensus and quality, read-depth and flow-signal,
at each position in each contig.
• Can easily be parsed by a Perl script to obtain eg: average coverage depth for each
contig and isotig.
• eg:
Position  Consensus  Quality  Unique  Align Depth         Signal  Signal
                     Score    Depth   (incl. duplicates)          StdDev
>contig00008
1         G          64       26      32                  0.98    0.05
2         A          64       27      33                  0.94    0.13
3         T          64       27      33                  1.97    0.14
4         T          64       27      33                  1.97    0.14
5         G          64       27      33                  0.97    0.06
...etc...
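The average coverage depth per contig can be computed by the same kind of parsing in Python. This sketch assumes the file is tab-separated, that each contig starts with a line beginning ">", and that the columns come in the order shown in the example (so "Unique Depth" is the fourth column); check these assumptions against a real 454AlignmentInfo.tsv before relying on it. The demo text reuses the five rows from the slide.

```python
from collections import defaultdict


def mean_depth_per_contig(lines, depth_column=3):
    """Average depth per contig from 454AlignmentInfo.tsv-style lines.

    `depth_column` is the 0-based column index (3 = Unique Depth and
    4 = Align Depth under the layout assumed here).
    """
    totals = defaultdict(lambda: [0, 0])  # contig -> [depth sum, positions]
    contig = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # blank line
        if fields[0].startswith(">"):
            contig = fields[0][1:]
            continue
        if contig is None or not fields[0].isdigit():
            continue  # column header or other non-data line
        totals[contig][0] += int(fields[depth_column])
        totals[contig][1] += 1
    return {name: depth_sum / count
            for name, (depth_sum, count) in totals.items()}


if __name__ == "__main__":
    demo = (
        "Position\tConsensus\tQuality Score\tUnique Depth\tAlign Depth\tSignal\tSignal StdDev\n"
        ">contig00008\n"
        "1\tG\t64\t26\t32\t0.98\t0.05\n"
        "2\tA\t64\t27\t33\t0.94\t0.13\n"
        "3\tT\t64\t27\t33\t1.97\t0.14\n"
        "4\tT\t64\t27\t33\t1.97\t0.14\n"
        "5\tG\t64\t27\t33\t0.97\t0.06\n"
    )
    print(mean_depth_per_contig(demo.splitlines()))
```

For a real file, pass an open file handle instead of `demo.splitlines()`; the function reads line by line, so large files do not need to fit in memory.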
Output files (3)
• 454Contigs.ace = ACE format file, showing how reads were aligned
to form contigs, viewable in eg. Tablet, or Consed.
• Unlike traditional ace files, in Newbler’s ace files:
• the same read can be in several contigs (but is given an extra suffix),
eg: if one contig is in a repeat (higher coverage) region, and the next contig is in a
non-repeat (low coverage) region, and the read spans the junction.
• a contig (and hence a read) can be shared between several isotigs.
• But a read should only be in one isogroup.
Output files (4)
Only with -cdna option:
• 454IsotigLayout.txt how contigs are laid along each isotig in the isogroup (454RefLink
also gives which isotigs are in each isogroup).
• eg:
>isogroup00003 numIsotigs=8 numContigs=11
Length (bp): 495   508   142   171   251   308   98    61    61    566   306
Contig     : 02209 02600 02782 00425 02597 00426 02119 02340 02624 02132 02630
Each isotig row places a >>>>> marker under every contig it passes through,
followed by the isotig's total length:
isotig00004 Total: 1484
isotig00005 Total: 1484
isotig00006 Total: 1497
isotig00007 Total: 1497
isotig00008 Total: 1472
isotig00009 Total: 1485
etc……