Ion Torrent have released a set of longer read 314 data, along with this technical note.
(Graphic from PRINSEQ)
The reads are indeed much longer than we have seen with our previous 316 runs, with a mean of 223bp and longest read being 398bp. Curiously this longer-read protocol has been done on a 314 chip rather than a 316 but Life inform me that it should work on the other chip just as well.
I thought it might be informative to assess how well this dataset performs in a real-world example, in this case de novo assembly.
The first question I wanted to look at is:
How useful are these long reads in improving assemblies?
I ran the SFF files from Life Tech’s long-read 314 chip run (B14_387) and their 316 chip run (B13_328). Both data files are from the standard E. coli K-12 DH10B library.
For assembly I decided to use Newbler 2.6, run under default settings. I use Newbler because it runs fairly quickly and I’m impatient. For “real work” I tend to use MIRA.
| Sample | Contigs | N50 | % aligned bases | Coverage | Inferred error rate |
|---|---|---|---|---|---|
| 314-long | 452 | 22080 | 90.46% | 13.82x | 2.79% |
| 316 | 225 | 56368 | 96.51% | 36.04x | 1.52% |
Above is an A50 plot. A50 plots are a way of visualising how contiguous and complete an assembly is. They are described in more detail at the great BioHaskell blog. Suffice to say, the steeper the slope, the better the assembly is.
So, in fact, the long-read 314 data set doesn’t give as good an assembly (judged by metrics) as the short-read 316 data.
Is this due to the much lower depth of coverage (~14x vs 36x), or the increased error rate seen in the 314 long-read data(2.8% vs 1.5%) ?
What is the effect of depth of coverage on Ion Torrent assemblies?
Sub-sampling the larger 316 dataset to generate reads approximating different depths of coverage from 10x to 35x (assuming a 4.7Mb genome) we can test the effect of reducing coverage levels on the higher quality 316 data. This gives the following A50 curves:
| Sample | Contigs | N50 | % aligned bases | Actual coverage | Inferred error rate |
|---|---|---|---|---|---|
| 314-long | 452 | 22080 | 90.46% | 13.82x | 2.79% |
| 316-15x | 383 | 27922 | 96.26% | 13.97x | 1.66% |
| 316-20x | 253 | 45490 | 96.11% | 18.59x | 1.64% |
Actual coverage is the number of aligned bases in the assembly divided by the theoretical size of the genome (4.7Mb) and thus represents bases that made it through to the assembly rather than simply the number of bases in the input.
So it seems that the longer read set perform worse in assembly than the shorter reads, at nearly identical depth of coverage.
Higher error rate can be seen to have a deleterious effect on the percentage aligned bases (96 vs 90%) and negatively affects the resulting assembly.
My conclusion is that the higher error rate seen in the long-read data results in this less contiguous assembly, ameliorating the effects of the longer read. It is possible that Newbler is clipping the low quality tails from the data. I would like to repeat this analysis with MIRA.
However, I suspect at a much higher coverage level these effects will start to melt away and the assemblies will improve significantly compared to the shorter data. It would be nice to have some more long-read data to throw into the mix to test this prediction.
Coverage over a certain threshold does not improve contig statistics
Another finding here is that although the 20x coverage genome looks significantly better than the 15x genome there is not much improvement in the assembly metrics when going up from 20x to 25 – 35x.
This is down to repeats in the E. coli genome. When there are repeats in the genome longer than the read length, this becomes the bottleneck for getting longer contigs. The only way out of this problem is to get longer reads, or to use a mate-pair protocol. Piling a load more reads will not make much difference to the assembly at this point.
(This is why it is a bit pointless to do what many do and add a huge pile of short reads to a 454 assembly unless your specific aim is to correct indel errors from homopolymers).
Shooting for 20x coverage means that you could aim to sequence perhaps two or three 4Mb genomes per 316 chip (assuming a yield of 200-250Mb which is what we’ve seen) or one 4Mb genome on a very good 314 chip (now priced at $99).
In the next blog post I will look more carefully at the resulting assemblies to assess quality at different coverage levels, and also compare with the Illumina HiSeq data.
Update 10th August 2011
Thanks for all your comments. This data set does seem curious.
I wanted to just check that these results weren’t just Newbler 2.6 being weird, so I also assembled just the 314-long dataset with MIRA 3.2.1.17_dev and CLC Genomics Workbench 4.7.2. The results were pretty meh;
| Assembler | Contigs>200 | N50 | Total |
|---|---|---|---|
| CLC Genomics Workbench 4.7.2 | 2956 | 2429 | 4535270 |
| MIRA 3.2.1.17_dev | 2630 | 3036 | 5462034 |
I feel I should add – just in case there is any confusion – that this is not a criticism of either assembler which usually give similar results to Newbler when run on a regular dataset.
A reader, aeonsim asked in the comments what happens if you combine the long and short read dataset. Actually the long read set when added to the 316 data worsens the assembly, despite increasing the coverage.
| Dataset | Contigs | N50 | Coverage |
|---|---|---|---|
| 316 | 225 | 56368 | 36.04x |
| 314-long | 452 | 22080 | 13.82x |
| 316 + 314-long | 253 | 44823 | 49.86x |
| 316-30x | 224 | 58657 | 27.93x |
| 316-15x + 314-long | 279 | 40185 | 27.82x |
Lex Nederbragt has verified my findings, and had a closer look at the reads and although he finds signal suggestive of homopolymer errors towards the ends of the reads, this does not fully explain why the reads don’t assemble well. He compares to a 454 dataset and finds the Ion Torrent reads lacking.
–
Comments, criticisms as always welcome. Please post in the comments below rather than by personal e-mail if your organisation allows it.



Thanks for the comparison, Nick, and the result do indeed look strange. I’d be interested to see, as you mentioned, whether newbler has removed a lot of reads and bases. If you look at the ‘runMetrics’ section of the 454NewblerMetrics.txt file, the number of reads and bases _after_ trimming is reported. Do these numbers differ a lot from your raw input?
Also, in your experience, does MIRA outperform newbler on IonTorrent data? (and what about 454 reads…)
Here are the relevant metrics.
314-long
runMetrics
{
inputFileNumReads = 350109;
inputFileNumBases = 78073794;
totalNumberOfReads = 346959;
totalNumberOfBases = 71849454;
}
So about 8% of the input bases are thrown away by Newbler.
readStatus
{
numAlignedReads = 338692, 97.62%;
numAlignedBases = 64993374, 90.46%;
inferredReadError = 2.79%, 1811297;
numberAssembled = 262949;
numberPartial = 75727;
numberSingleton = 2043;
numberRepeat = 29;
numberOutlier = 1700;
numberTooShort = 4511;
}
28% of reads are partially assembled.
316
runMetrics
{
inputFileNumReads = 1687490;
inputFileNumBases = 175570901;
totalNumberOfReads = 1685548;
totalNumberOfBases = 175490875;
}
Less than 0.05% of these bases are thrown away.
readStatus
{
numAlignedReads = 1649835, 97.88%;
numAlignedBases = 169373996, 96.51%;
inferredReadError = 1.52%, 2581134;
numberAssembled = 1524554;
numberPartial = 125272;
numberSingleton = 23590;
numberRepeat = 56;
numberOutlier = 4109;
numberTooShort = 7967;
}
~8% are partially assembled.
Note that my coverage values are computed from bases incorporated into the assembly so reads or bases being thrown away don’t affect my coverage comparison.
Oh and MIRA doesn’t usually outperform Newbler in terms of contiguity, but I like the output more, particularly as there is a better audit trail for figuring out what is going on and where misassemblies may be found. And contigs are explicitly marked as repeats.
Other options include CLC. I did a round-up a while back @ http://pathogenomics.bham.ac.uk/blog/2011/05/first-look-at-ion-torrent-data-de-novo-assembly/
Would it be useful to generate perfect datasets at different read length & coverage (i.e. computationally fragment & sample the genome sequence) to establish upper bounds on what to expect?
Matt Cahill et al already did that for this useful paper in PLoS ONE (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011518). Looking at their table S1 for K-12 DH10B we can see they estimate the following number of gaps for a given read length:
36bp = 723 contigs
75bp = 294 contigs
125bp = 199 contigs
250bp = 129 contigs
500bp = 84 contigs
1000bp = 53 contigs
So the 316 chip at 37x coverage giving 225 contigs for ~100bp reads is in very good agreement with those data.
As regards coverage, I guess this can be modelled as a simple Poisson process using Lander-Waterman statistics.
krobison, another article that could be useful to generate perfect datasets to investigate sequencing strategies is:
http://www.mdpi.com/2073-4425/1/2/263/
Interesting
Would be good to see what happens if you merge the two datasets and try an assembly using both the 314-long and 316 data.
And how 13x 316 and 13x 314-long compare to a 26x 316 dataset when assembled.
Good idea, I will do that in a bit!
[...] stab at using these reads for assembly with newbler, the program developed by 454 Life Sciences, on his blog. He reported, compared to the shorter Ion reads, much worse results for the longer reads. On [...]
I’ve updated the post with a few more details including results in response to @aeonsim’s useful suggestion.
[...] that was the first thing that popped into my head. Nick Loman has already taken a look at Ion Torrent and the impact of the new longer reads on assembly? Long story short, it doesn’t help. We see that too. Even after accounting for potential [...]
[...] I wanted to assess the quality of the assemblies that you might expect to get from assembling Ion Torrent reads de novo for bacterial genomes (see my last post for the initial results of assembly). [...]