5 Responses

  1. Aaron Darling
    August 15, 2011 at 5:57 pm

    Hey Nick, it’s great to see someone making use of this software so quickly after we got it online.

    I think you’ve done a pretty fair job interpreting the reports generated by the program. There are a few things I should comment on. First, the large tandemly repeated regions are indeed the weak spot in the assembly metrics when calculated on the basis of progressiveMauve alignments. This shouldn’t be a problem in general for the approach of measuring assemblies against a high quality reference, and in fact a group of collaborators at UCSC and I will soon be publishing another alignment-based system as part of the Assemblathon 1 paper that should deal with some of these issues, and diploidy, more gracefully.

    The second issue is the mixing of assembly and alignment errors. This is a general feature of the approach and something we tried to emphasize in the paper. In theory, if one is really resequencing the same genome (a question you rightly raised), then the better the assembly, the easier it should be to align back to the reference. The situation you encountered here with alignment errors in the 30x and 35x assemblies sounds like a pathological (diabolical?) bug in the aligner and is something I’ll have to investigate more deeply. Every aligner has bugs, and despite a couple of good years of stamping them out of progressiveMauve there are still a few lurking in dark recesses of the codebase…

    The wiggly background GC is a curiosity. It should indeed be smooth, and I’ve not seen this before. I can imagine some things that may have gone wrong but none of them seem very likely. Would you be willing to share the assemblies so I can track down what happened? On the other hand it’s really interesting to see this plot for IonTorrent data. Every time I’ve asked one of their sales reps about GC bias I wasn’t able to get a clear answer. E. coli is kind of a straw man for GC bias tests, since it’s got pretty average GC content at around 48%. For the project which inspired Mauve Assembly Metrics we’re sequencing 50 or 60 halophilic archaea with GC typically at 65-70% and that adds some extra challenge for many sequencing chemistries. So far we’ve had good luck with Illumina TruSeq 3 on PCR-free libraries, and also with PacBio.

    I would be very interested to hear more about the resolution for the differences you found between the assembly and the reference in regions of high coverage. In addition to them possibly being errors in the IonTorrent assembly, or evolved differences, a third possibility is that they are errors in the reference assembly constructed in Blattner’s lab.

  2. Aaron Darling
    August 31, 2011 at 7:03 pm

    Just an update to say the alignment error issues that were cropping up on the 30x and 35x assemblies should be largely resolved. Thanks again, Nick, for bringing this matter to light. The usage instructions available here: http://code.google.com/p/ngopt/wiki/How_To_Score_Genome_Assemblies_with_Mauve have been updated to use the new software revision.

    As for CDS in unaligned regions, these should be included in the count of Broken CDS. Did you see something that suggests otherwise?

  3. Ion Torrent Mate Pairs and a single scaffold for E coli K12 substr. MG1655 « In between lines of code

    [...] For this, I used Mauve Assembly Metrics (see also Nick Loman’s post about this program here). (Due to a bug in the Mauve, at least that is what I think caused the crash, I could not get the [...]
