I wanted to assess the quality of the assemblies that you might expect to get from assembling Ion Torrent reads de novo for bacterial genomes (see my last post for the initial results of assembly).
By good fortune, Aaron Darling of Mauve fame just published a manuscript in Bioinformatics describing Mauve Assembly Metrics.
For those not familiar with Mauve it is both a useful whole-genome alignment algorithm (progressiveMauve) and an attractive genome viewer optimised for n-way genome comparisons. It works pretty well for bacterial genome comparisons and should be in your toolkit if it isn’t already.
One thing that is not immediately intuitive about Mauve is the way it handles repeats – in comparison to ACT/BLAST it won’t align repeat sequences that can’t be placed unambiguously. A typical example is the ribosomal RNA operon which often comes out as an ~8kb contig with much greater depth of coverage than the rest of the genome. Mauve won’t be able to place this contig because it could go in multiple places in the genome. This will affect the results of the assembly metrics as well as making you think your genome is “missing” a piece that is in fact there.
Mauve Assembly Metrics takes your draft genome sequences (in contigs) and aligns it against a reference, reordering the contigs and producing reports in both tab-separated values format and accompanying PDF graphs. This is achieved through an R script.
The software comes with an example document which explains the output better than I could.
Some of the metrics offered by the software that I think may be interesting are:
- Number of miscalls, i.e. SNPs/indels
- Number of break-points, i.e. misassemblies within contigs
- Number of genes in the reference which are broken, e.g. by a contig end
- Gaps in the reference /extra bases
- Missing sequence due to GC
It seems probable that the best results are going to be obtained when aligning against a close reference. And in theory, the test reads from E. coli K-12 DH10B should be very close to the reference genome. It is worth mentioning that the DH10B sequenced by Life Tech may not be exactly the same as the one sequenced by Blattner. We may expect SNPs to accumulate over time, e.g. from passage, and indeed this strain is notable for an increased rate of mutation when compared to K-12 MG1655. This is thought to be due to more frequent insertion sequence transpositions. This goes some way to explaining why “the E. coli genome” is not specific, nor is the “E. coli K-12 genome”. In fact, can you ever sequence “the genome” of anything?
A caveat with Mauve’s approach is that the system cannot readily detect the differences between alignment errors and sequencing errors. This is likely to be more of a problem when aligning against more divergent references (we were debating on Twitter whether it would work across different eukaryotic species).
So, let’s look at some results.
| ContigN50 | DCJ_Distance | NumMisCalls | NumGapsRef | NumGapsAssembly | TotalBasesMissed | BrokenCDS | |
|---|---|---|---|---|---|---|---|
| 10x.fna.fas | 5787 | 1234 | 277 | 751 | 1687 | 386647 | 644 |
| 15x.fna.fas | 27764 | 296 | 108 | 172 | 897 | 373341 | 341 |
| 20x.fna.fas | 45485 | 175 | 26 | 98 | 691 | 367926 | 258 |
| 25x.fna.fas | 57721 | 143 | 88 | 43 | 652 | 367363 | 218 |
| 30x.fna.fas | 57722 | 144 | 233 | 32 | 624 | 366677 | 218 |
| 35x.fna.fas | 58655 | 137 | 239 | 16 | 578 | 364519 | 203 |
| 314long.fna.fas | 21586 | 369 | 420 | 658 | 3599 | 368937 | 948 |
Interesting! A few things stand-out here. There are some expected trends here – for example, the number of gaps in the reference and the number of gaps in the assembly both reduce as we layer on extra coverage. Also the number of breakpoints/misassemblies as gauged by the “double-cut distance” also reduces.
One striking statistic is that even with the best assembly, 203 genes are broken. This is due to contig ends. This means that 5% of the genes in the genome are not enclosed by individual contigs. The worst assembly (at 10x) has almost 15% of coding sequences disrupted by a contig break. This is going to make life a bit difficult when trying to confidently annotate a genome.
Miscalls
Miscalls I think refer to SNPs and short indels.
The number of apparent miscalls goes down as we layer on extra coverage to 20x, but rather curiously we then see an increase in number of miscalls going from 25x to 25x. The 314-long dataset has the most miscalls by far.
Let’s try and figure out what’s happening here. A useful chart maps this across the genome:
A few things stand out here. Firstly that you can see that frequency of miscalls in the 10x genome and the 314-long set are significantly higher than the other datasets and this is across the genome.
However, the high coverage datasets have a sharp peak of miscalls around 800kb into the genome. This is the source of the majority of those SNPs and the reason we get an uptrend with increased coverage.
Looking at this alignment in Mauve it is a very low nucleotide identity alignment – and BLASTing the contig back against the reference shows in fact there is a much better place to map the contig, giving 100% identity. So these are false SNP calls. I am not sure why it didn’t get picked up by Mauve, perhaps tweaking the parameters would fix it.
I decided to strip this region out (by excluding SNPs occurring >5 times in a sliding window 1,000 bp long, shifted by 100bp). I also discarded SNPs near a contig boundary (<100bp) as this is another source of misalignments.
Then you get the following result:
That looks a bit more rational. A quick look at the assembly for the 25x coverage genome doesn’t immediately indicate whether these are genuine errors or simply differences from the reference. They are not obviously associated with long homopolymeric tracts and there is often a good, deep consensus. Something for another blog post.
One other thing that is interesting in this plot:
There is aregion where there are no miscalled bases in the genome, between 500-750kb in the reference. This corresponds to a large 120kb tandem duplication in the DH10B genome. Because we only have a single consensus contig for this repeat Mauve cannot place it correctly and that region ends up being unaligned. It’s also the major contribution to the surprising finding that almost 350kb of the genome is not covered, equating to 8%. The truth is it is covered, but not ambiguously.
So this is an example where misalignment can mess your results up.
GC Biases
One more plot, before I sign off.
Not sure why the background GC plot is so “wiggly”. But let’s assume its a normal distribution.
It seems missing regions are not generally related to GC content, except in the 314-long dataset. I wonder if this reflects some change to the chemistry to coax out longer reads, e.g. the emPCR reaction conditions?
Conclusions
- Mauve Assembly Metrics gives a useful guide to assembly quality, however misalignments may skew results
- The test Ion Torrent 316 data reaches a decent consensus sequence around 20x coverage
- Long repeats are always an issue for mapping de novo assemblies
- The 314-long dataset has plenty of miscalls, further evidence this is not a high-quality data set
![]()
Darling, A., Tritt, A., Eisen, J., & Facciotti, M. (2011). Mauve Assembly Metrics Bioinformatics DOI: 10.1093/bioinformatics/btr451




Hey Nick, it’s great to see someone making use of this software so quickly after we got it online.
I think you’ve done a pretty fair job interpreting the reports generated by the program. There are a few things I should comment on. First, the large tandemly repeated regions are indeed the weak spot in the assembly metrics when calculated on the basis of progressiveMauve alignments. This shouldn’t be a problem in general for the approach of measuring assemblies against a high quality reference, and in fact a group of collaborators at UCSC and I will soon be publishing another alignment-based system as part of the Assemblathon 1 paper that should deal with some of these issues, and diploidy, more gracefully.
Second issue, which is the mixing of assembly and alignment errors. This is a general feature of the approach and something we tried to emphasize in the paper. In theory if one is really resequencing the same genome (a question you rightly raised), then the better the assembly, the easier it should be to align back to the reference. The situation you encountered here with alignment errors in the 30x and 35x assemblies sounds like a pathological (diabolical?) bug in the aligner and is something I’ll have to investigate more deeply. Every aligner has bugs and despite a couple good years of stamping them out from progressiveMauve there are still a few lurking in dark recesses of the codebase…
The wiggly background GC is a curiosity. It should indeed be smooth, and I’ve not seen this before. I can imagine some things that may have gone wrong but none of them seem very likely. Would you be willing to share the assemblies so I can track down what happened? On the other hand it’s really interesting to see this plot for IonTorrent data. Every time I’ve asked one of their sales reps about GC bias I wasn’t able to get a clear answer. E. coli is kind of a straw man for GC bias tests, since it’s got pretty average GC content at around 48%. For the project which inspired Mauve Assembly Metrics we’re sequencing 50 or 60 halophilic archaea with GC typically at 65-70% and that adds some extra challenge for many sequencing chemistries. So far we’ve had good luck with Illumina TruSeq 3 on PCR-free libraries, and also with PacBio.
I would be very interested to hear more about the resolution for the differences you found between the assembly and the reference in regions of high coverage. In addition to them possibly being errors in the IonTorrent assembly, or evolved differences, a third possibility is that they are errors in the reference assembly constructed in Blattner’s lab.
Hey Aaron – thanks for the informative reply!
I would be very keen to know if that alignment issue that crops up with the high-coverage assemblies is fixable. The assemblies are available here: http://static.xbase.ac.uk/files/results/nick/iontorrent_assemblies.tgz – I ran everything with default settings and I used the mauve snapshot dated 2011-08-05.
That wiggly GC thing is very weird, I don’t think it happened the first few times I ran the metrics so not sure exactly what’s going on there.
You are absolutely right to point out that there could be errors in the DH10B reference sequence. I also have some Illumina assemblies from MiSeq data. I will put all that together in another post – if the differences are also seen in that dataset then I guess it is more likely to be errors in the reference.
Aaron, another question I just wondered about. Is the “broken CDS” count inclusive of genes in unaligned regions? I assume not, but just wanted to clarify that.
Just an update to say the alignment error issues that were cropping up on the 30x and 35x assemblies should be largely resolved, thanks again Nick for bringing this matter to light. The usage instructions available here: http://code.google.com/p/ngopt/wiki/How_To_Score_Genome_Assemblies_with_Mauve
have been updated to use the new software revision.
As for CDS in unaligned regions, these should be included in the count of Broken CDS. Did you see something that suggests otherwise?
[...] For this, I used Mauve Assembly Metrics (see also Nick Loman’s post about this program here). (Due to a bug in the Mauve, at least that is what I think caused the crash, I could not get the [...]