Well we’ve had our Ion Torrent for nearly 6 weeks now, but we haven’t yet run it. A combination of factors have impeded our progress. After waiting for installation there have been various changes to the protocols and reagents, plus we needed to buy some more equipment to make libraries (a Bioruptor, for shearing). Our training is now scheduled for next week. We’re not complaining too much, we ideally need the 316 chips which produce >100Mb to start doing the work we have pencilled in for it (genomic epidemiology of Acinetobacter).
However, I couldn’t help but be intruiged when I saw that Ion Torrent posted an E. coli K-12 DH10B dataset (registration required) on their Torrent Dev community site. This is from a single run of a 314 chip and consists of 49Mb of raw sequence – way above the minimum spec (10Mb).
An accompanying Ion Torrent application note is a useful initial guide to these reads. And in fact whilst I was preparing this post Keith Robison recently posted a more detailed analysis of the quality scores.
However, I am interested in a particular application: de novo assembly, where the genome sequence is predicted without a reference sequence. This is crucial for novel gene discovery and detection of large-scale insertions and deletions, both important for the kind of bacterial genomic epidemiology work we want to use the instrument for.
Read Analysis
Firstly, as is recommended when assembling de novo, I assessed the read data for quality.
The Ion Torrent server makes the reads available in both FASTQ and SFF format, which makes life easy for feeding to various pipelines.
I ran it through the excellent PRINSEQ to get some quality reports (I could have equally used FastQC).
There are a total of 522,099 sequences and a total of 49,040,224 bases, giving a mean read length of 93 bases.
Figure 1. Read length distribution.
Figures 2a and 2b. Base qualities
Base qualities start around Q30 and then fall towards the 3′ end of the read, in common with both 454 and Illumina. The tails are reportedly quite low quality (Q10 = 1/10 chance of base being wrong). But it ain’t Pac Bio!
Figure 3. GC distribution, looks pretty good and consistent with E. coli K-12′s ~50% GC.
So the initial reports look fine. I might be inclined to trim some of the low quality sequence but for now I have left all the sequences in.
De novo Assembly
Aligning these reads to a reference is fairly trivial, and Ion Torrent supply some software called TMAP, written by Nils Homer (author of BFAST) to achieve this. Early reports on the Ion Torrent community forum are that it is both fast and accurate.
However Ion Torrent do not have an in-house assembler (yet), so it is an open question which software to use, and how well the process works.
We know E. coli K-12 DH10B has a 5Mb genome, so 49Mb of data means should give approximately 10x coverage, assuming all the reads are incorporated into the assembly. This is a little on the low side – if it was 454 data I’d prefer 15x or even 20x – but it is still sufficient to make it worthwhile proceeding. Many of the original Sanger-sequenced genomes made do with less than this.
Ion Torrent have partnered with CLC Bio and DNAStar as bioinformatics partners and so CLC Genomics Workbench and SeqMan NGen already have Ion Torrent support built-in, so it makes sense to try them.
We expect Ion Torrent reads to have similar homopolymer and carry-forward errors as the 454, so it makes sense to try assemblers that work well on 454 data. So I thought would try Newbler (Roche’s 454 assembler) and MIRA3 (a good option for 454 sequences). I also chucked-in Ray because it has native support for SFF files. CAP3 is an old stalwart assembler. And for giggles I thought I’d try a short-read assembler, in this case Velvet without any real expectation that it would work correctly (I used k=31).
| Assembler | Version | Method | Open source | License |
| CAP3 | 10/15/07 | Overlap-layout-consensus | Yes | Free to non-profit |
| CLC Genomics Workbench | 4.6.1 | Hybrid | No | Commercial |
| MIRA | 3.2.1 | Overlap-layout-consensus | Yes | Free |
| Newbler | 2.5.3 | Overlap-layout-consensus | No | Free for academic use |
| SeqMan NGen | 3.0.4 (10) | Overlap-layout-consensus | No | Commercial |
| Ray | 1.3.0 | OpenAssembler (de Bruijn) | Yes | Free |
| Velvet | 1.1.03 | de Bruijn | Yes | Free |
Table 1. Assemblers used in comparison (some data from Kumar, Blaxter)
I was gratified that getting all these packages installed was both quick and relatively painless. For the desktop programs I ran the analysis on a Samsung RF510 (i5 2.66Ghz, 8Gb RAM) running 64-bit Win7 and for the command-line programs I used a Gentoo 64-bit Linux box with 8 x 3.0Ghz Xeon processors and 32Gb of RAM. In each case I used the default parameters except with SeqMan NGen which I told to expect 10x coverage for repeat resolution. For Newbler, MIRA and Ray I gave it input as if it was 454 data (from SFF file).
| Assembler | Time taken (hh:mm:ss) |
| CLC Genomics Workbench | 00:01:33 (desktop) |
| SeqMan NGen | 00:07:50 (desktop) |
| Velvet | ~00:15:00 (server) |
| Ray | 00:16:03 (server) |
| MIRA | 00:21:00 (server) |
| CAP3 | ~01:00:00 (server) |
| Newbler | 4 days (server) |
Table 2. Run times for assemblers
I was shocked by just how quick the commercial offerings were – particularly CLC – despite running on a laptop less powerful than the server. CLC beat its advertised speed of 2 minutes, quite impressive.
An honourable mention for Newbler which took an incredible 4 days to run. I wasn’t sure it was going to finish in fact. It’s a little unfair as of course Newbler is not designed for Ion Torrent data and is pretty quick when you feed it 454 files for this kind of sized assembly. I assume the short reads triggered some kind of weird brute-force alignment mode. However, it did complete and produce results, so well done!
| Number of contigs | Min | Max | Total | |
| CLC | 899 | 200 | 34562 | 4473313 |
| Newbler | 1041 | 100 | 42786 | 4450039 |
| MIRA 3.2.1 | 2766 | 80 | 17257 | 4527768 |
| Ray | 3222 | 124 | 7703 | 4373659 |
| Seqman Ngen | 3516 | 78 | 11547 | 4569381 |
| CAP3 | 7593 | 42 | 8743 | 4857173 |
Table 2. Contig stats from assembly ordered by number of contigs (best first)
Velvet produced 1.3m contigs, so I had to exclude it from the test.
Quite a range of values there. It is worth noting that adding up the lengths of all the contigs produces values much less than 5mb. This is expected because perfect or near-perfect repeats longer than the read length often end up in a single “consensus contig”. But there is quite a size range there. Note that comparing numbers of contigs is not strictly like-for-like because each assembler uses a different minimum length cut-off.
| Avg | N50 | N75 | N90 | |
| CLC | 4975 | 8629 | 4560 | 2455 |
| Newbler | 5507 | 8207 | 4615 | 2575 |
| MIRA | 1636 | 2900 | 1613 | 797 |
| Seqman Ngen | 1299 | 2016 | 1160 | 619 |
| Ray | 1357 | 1884 | 1164 | 697 |
| CAP3 | 639 | 973 | 559 | 310 |
Table 3. Contig stats from assembly ordered by N50 (best first)
N50 is a much better guide to comparing “contiguity”, and CLC Genomics Workbench and Newbler give the highest values here. MIRA, Ray and Seqman Ngen are in the middle of the pack. CAP3 didn’t do too well at all which is not a great surprise as it doesn’t use the quality values, plus its default parameters may not be suitable for these reads.
Before we go any further – this isn’t intended as an assembly competition and I am not recommending or dissing any of these assemblers. This is just a very basic experiment to give me a feel for how well de novo assembly performs on these data. But I thought it interesting enough to share with you guys.
So – the results themselves. They aren’t great – if you compare them with a 454 Titanium run or an Illumina paired-end run – but they are pretty good if you consider this is the first dataset available from the Ion Torrent, running with the lowest specification chip. In pure sequencing costs, it’s a $500 bacterial whole-genome and the sequencing itself only takes 2 hours. So it’s quite exciting.
And we must remember much of the software is not yet properly optimised for these data.
The assemblies are certainly good enough to predict and detect most of the genes in K-12. What I have not assessed yet is assembly quality, so the assemblers with the longest N50s may or may not be the best choice if you need accuracy (and you probably do).
I intend to take a close look at the extent of misassemblies, erroneous consensus base calls and issues with homopolymeric tracts in a future blog post.
So for now, the take home messages are:
- bacterial de novo assembly is possible with Ion Torrent data, even with data from a single 314 chip
- CLC Genomics Workbench is the fastest assembler and produces some of the longest contigs with default settings (but costs $5000 for a license)
- Newbler also produces long contigs but takes days to run for some reason
- MIRA and Ray are both promising open-source options
Your comments are always appreciated! And do drop me a line if you are at ASM this year as I will be presenting on Sunday’s “Genome Epidemiology” session.


While running PRINSEQ on the reads, and taking a first look at a newbler assembly, I came across your post. Hmmm, great minds think alike?
However, I have a few different metrics for ‘my’ newbler 2.5.3 assembly of the data:
- runtime on a server, allowing for 20 cpus (otherwise default settings): 10 minutes (!)
- 918 contigs (from 100 bases)
- N50 for those was 8899 bp
- longest 58008 bp
- total bases 4448513 bp
So, except for the total bases, I obtained even better metrics. 20 cpus was overkill, but a first try with one cpu (usually more than enough for bacterial genomes) looked going really really slow, so I thought to give it (almost) all I got…
@flxlex Thanks for commenting! Curious the results are slightly different. I realise now I used Newbler via XBASE-NG (our in-house analysis pipeline) which actually sets -rip as default as I prefer this. I wonder if that explains the difference. Your stats would put Newbler as the best performing assembler by N50 and max contig. I will have a bit more of a play with settings as running 20x quicker would still be much longer than 10 minutes, so there must be something else going on here to account for the difference in speed. Could be our NFS mount, I guess, but it was 100% CPU for the duration so it didn’t look I/O bound. Can you post exact settings?
I simply used
runAssembly -o projectname -cpu 20 file.sff
Very interesting data. It really gives me a feel for what it looks like. Would be very interested to see what your machine in your lab can do. Get in there and get that puppy running.
The Ion Torrent produced data set may not be an “average” run. Did they include how they filtered the raw or is it supposed to be unfiltered?
Hi Paul. Thanks for stopping by! I don’t have much in the way of detail about the run, so not sure what filtering if any has gone on. I also suspect this is not an average run – 50Mb is more than some early-access customers are reporting from a 314 chip. So this may represent one of the best runs done at Ion Torrent. We hope to have our own data end of next week to compare.
CLCbio picked up on your story and used it in a press release: http://www.clcbio.com/files/CLCbio_pressrelease_12052011.pdf
A bit cheap, I feel…
Thanks Nick; I’ve been very curious about what the Ion Torrent data look like, and I saw a link to this post through the CLC Bio newsletter. In our lab, we do a lot of high-throughput amplicon sequencing, so error rates are particularly of interest. You saw that the qual scores got low quite quickly — do you have any sense of the real error rate, though? I suppose a scaffolded assembly would be a better way to check. I’ll be really interested to hear what you get in an average run from your machine. Thanks! –Jeff
@Jeff – thanks for the comment. I plan to take a closer look at the error rates of these assemblies soon. Keith Robison did a great analysis which demonstrates that the quality scores are pessimistic for Ion Torrent data (http://omicsomics.blogspot.com/2011/05/ion-torrents-data-quality-is-pretty.html). I’d also recommend looking at the plotted “raw accuracy” on this Ion Torrent note (http://www.iontorrent.com/lib/images/PDFs/performance_overview_application_note_041211.pdf). For what it’s worth, Newbler calculates a 1% error rate for the reads in its output. I’ve not verified that figure. A quick look at the assemblies aligned to a reference suggest most errors are associated with short homopolymeric tracts.
No mention of GENEious? It works with Ion data, would be good to know how it compares.
Good idea, I had forgotten about that. I am also keen find some other good-performing free & open source options.
[...] (check out the excellent analysis by Nick Loman on his blog http://pathogenomics.bham.ac.uk/blog/2011/05/first-look-at-ion-torrent-data-de-novo-assembly/ So, I naturally had a look at the information in this sff file. Here are the header and the first [...]
[...] Loman from the University of Birmingham. He started to analyse the data and posted sequences on his pathogen blog. Within a few days scientists from four continents were joining in with the [...]
Hi!
You forget to set the -large option for the newbler. If your genome is large (found not defined, how large “large” means), and the option is not set, the algorithm stops after a certain base-amount, and gives not any sign, why. If you use the -large, it takes ca. 1-2 mins for a ~3mbp de novo assembly (i7 3770k 8 threads), and it gives pretty long contigs, however, as known, the contiguity isnt correlated to the validity of the assembly.