Tips for de novo bacterial genome assembly

I have found Velvet to be an excellent option for short-read de novo assembly of bacterial genomes. We are routinely seeing Velvet assemblies of Illumina data producing similar statistics to Newbler assemblis of 454 data. It seems that Illumina run with 75-base pair reads and paired-end libraries can be competitive with Newbler assemblies from 454 fragment libraries. Amazing that just a year or two the dogma was that short-read technologies weren”t useful for de novo assemby.

However, getting the best out of Velvet is a bit trickier than using Newbler. Roche’s informatics is really slick and running a Newbler assembly is pretty much “fire and forget”. However, the cost-effectiveness of Illumina means that this is turning into a popular choice for bacterial genomics. I am frequently asked at presentations for details on our pipeline for handling Illumina data, so I thought I could point people at this blog post in future.

Firstly, de novo assembly with short-reads gives much more effective results if you aggressively pre-filter the reads passed to the assembler (whether it is Velvet or something else). The difference in assembly statistics between passing the “raw” base-call output from Illumina versus passing a stringently filtered set is night-and-day. For an M. tuberculosis genome, the assembly N50 jumped from 1kb to 30kb after filtering the reads. It is also not uncommon to find Illumina datasets with 100-1000x fold-coverage for the genome due to the massively parallel nature of this technology. In this case more is not necessarily a good thing, submitting too many reads will dramatically increase memory requirements and may also result in a worse quality assembly due to the excess of error reads which complicate the de Bruijn graph. Velvet does not take into account quality scores which perhaps explains the utility of such filtering steps.

Filtering reads is a question of trial and error but I have found the following techniques are all useful. You can use them all, or mix and match for your application. The scripts I reference are all Python scripts which require the latest version (>=1.51b) of the invaluable BioPython package.

All the scripts assume you have a single FASTQ file. If there is paired-end information then the forward read should be followed directly by the reverse read in the file. If your pipeline produces a forward and a reverse read file then use Velvet’s bundled shuffleSequences.pl script to produce a single FASTQ file before proceeding.

1) Count per-base quality scores and trim the reads

Run quality_breakdown.py on your FASTQ file. The stdout will produce a tab-separated report showing the mean quality score at each base position. You can plot this in Excel and it usually shows a linearly degrading quality across the read. You have to use your disgression where to trim each read, but I tend to cut it about 20 bases in from the end or where the average quality is 20, whichever gives the longer read. Run trim_reads.py over your file specifying where in the read to trim. I like to have all reads the same length at the end.

2) Eliminate singletons

If you have excess depth of coverage, and particularly if you have at least x-fold coverage where x is the read length, then eliminating singletons is a nice way of dramatically reducing the number of error-prone reads you pass to Velvet. If you don’t have great coverage this step might not be possible. Simply run eliminate_singletons.py passing the FASTQ file as the first argument. It will ensure that the output still correctly pairs the reads.

3) Chuck away reads with Ns in

If you can afford the loss of coverage, you might throw away all reads containing Ns. The Velvet internals use a 2-bit encoding system for nucleotides and so does not make provision for Ns. I believe it converts Ns to a base chosen at random (it used to use Ns). eliminate_n_paired.py will discard any pairs which contain Ns. Be cautious, I once had a Solexa read file in which every read at positions 31, 32 and 33 were N so I had to trim the reads at position 30 before running this script.

4) Throw out reads with low average quality

This step is probably optional. You can discard all reads with an average score below a certain threshold using filter_reads.py, I might use 20 for the sake of argument.

Once you have a filtered read-file you may want to use Simon Gladman’s excellent VelvetOptimiser script (also bundled with Velvet) to perform a parameter scan looking for the assembly with the best ‘contiguity’, i.e. greatest N50. Whilst it is great to have long contigs – this makes life easier when doing downstream analysis – it is worth bearing in mind that a greater N50 is associated with a higher chance of misassemblies. If you are doing true de novo assembly you will have no reference to check the results against. It is therefore imperative you do some sanity checking on your assembly, particularly if you are using the output for variation detection between closely related strains (as opposed to, say, getting a quick and dirty proteome prediction).

I would be keen to hear what other people to do to help validate their assemblies, but one tip I can impart is to use a short-read aligner (I like Bowtie, but BWA, MAQ or NovoAlign would work as well) to map those reads back to the draft assembly produced by Velvet. By searching for high-frequency variants, i.e. SNPs and indels, either using VarScan or the built-in MAQ SNP pipeline you would expect to see few or hopefully no homozygous variant calls.

If you do see variant calls, check the read-depth of that contig. If it is 2 s.d. above the mean coverage for the genomes you are likely looking at a repetitive consensus contig, i.e. a contig produced by collapsing two genomic regions into a single contig. SNP calls in these contigs are unreliable by their nature. If you are seeing lots of variant calls you may be suffering from misassemblies. Consider changing the Velvet settings to make it more sensitive. SNPs occurring near the contig boundaries are likely to be due to spurious alignments and probably can be ignored. SNP calls occurring in all reads in the middle of long contigs should raise alarm bells.

When trying this approach with some paired-end data I came across a curious phenomenemon. I was getting about 60 SNP variant calls when mapping my reads back to my assembled genome. I used the excellent Hawkeye, part of the AMOS package to try and see what was happening in this region. I wanted to see if there were obvious signs of misassemblies, i.e. low-quality alignments in that region. What I found was unexpected. The SNP call was different from Velvet’s consensus, but the pile-up alignment at that base agreed with the SNP call. It looked like the consensus was being called incorrectly. Confused by the result, the ever-helpful Daniel Zerbino nailed the problem immediately – the consensus was being called from the original contig sequence where there was a mixture of alleles at the locus, before the paired-end resolution code was kicking in. Velvet correctly untangles the repeat using the paired-end information, but did not recall the consensus.

A way of fixing this problem easily is to use the AMOS recallConsensus software. This goes base-by-base through the alignment and calls the likeliest base at each position. It’s pretty slow, but it works OK. To produce AMOS files you need to turn the read tracking mode of Velvet on, and then convert the AFG file to a bank file. After you have run recallConsensus you can convert the bank back to a FASTA file and use that for your further analysis.

I’d be interested to hear from readers any of their experiences with using Velvet for de novo bacterial genome sequencing.

17 Responses

  1. Tweets that mention Pathogens: Genes and Genomes » Tips for de novo bacterial genome assembly -- Topsy.com

    […] This post was mentioned on Twitter by I Andrade. I Andrade said: RT @davidjstudholme: #bioinformatics "Tips for de novo bacterial genome assembly" http://tinyurl.com/ya3ohqf from Nick Loman's blog […]

  2. Jeremy Leipzig
    October 1, 2009 at 10:32 pm |

    Very helpful tips. I don’t understand why you would want “all reads the same length at the end” b/c our assemblies look much better with reads trimmed variably based on quality rather than a fixed lowest common length.

    Both the higher kmer and higher cvCut criteria are correlated with more evidence for real contigs. So I would also associate a high N50 obtained with higher kmer/high cvCut settings as having fewer misassemblies, not more, it’s just those assemblies tend to be shorter in total length. I suppose a high N50 achieved by setting a high minContigLength could indicate more misassemblies.

  3. Jeremy Leipzig
    October 6, 2009 at 3:32 pm |

    Yes a higher cvCut reflect higher stringency since it is akin to contig depth, consistently resulting in smaller assemblies, and in our experience results in higher N50’s (up to a point).

    Oddly enough higher cvCuts can also result in higher read usage, as though some of the spurious contigs are holding critical reads hostage, reads which when available for use by deeper contigs seem to in turn recruit more reads into the assembly.

    I have a set of scripts and a Sweave I use to explore the parameter space:
    http://code.google.com/p/standardized-velvet-assembly-report/

    I should probably think about expanding this to explore quality trimming criteria as well.

  4. bioinfosm
    bioinfosm
    October 7, 2009 at 7:22 pm |

    I was very impressed with the sweave report. Jeremy, you should definitely explore with Daniel (Velvet) to spread the word on this

    Nick, is the stuff you mentioned directly usable with single reads as well?

  5. Torst
    October 8, 2009 at 3:46 am |

    My discussions with Daniel Zerbino (author of Velvet) indicate that he replaces “N” with “A” and NOT a random base. My filtering policy is to remove and reads (or pairs) that have an N in them – this has helped de novo assembly with Velvet. However I found quality filtering did not help too much, and sometimes hindered, but my tests were not very thorough. Daniel also admits that giving Velvet very high depth data confuses it, as more coverage just means more reads with errors. In an optimal assembler this would not matter, but for a real-world assembler with trade-offs and heuristics like Velvet, it does matter. The manual recommends depth between 50 and 100 IIRC.

  6. Jeremy Leipzig
    October 8, 2009 at 2:26 pm |

    Yes the exp_cov argument is a bit alarming. Initially I would set it to something constant like 4, but in the new autoconfigure Velvet settings it gets set to 2xcvCut. I have copied this practice blindly as part of my assay but am considering adding it as yet another testable parameter. For some assemblies expCov seems to have little or no effect until you get to very high values.

  7. idoerg
    idoerg
    November 4, 2009 at 3:41 pm |

    Nick, thanks for posting those scripts, very timely for me. Regarding filter_reads.py it seems like the script is eliminating any read that has a single base below the quality threshold, rather than an average low score over the entire read. Is that the intention?

  8. Plate
    Plate
    January 6, 2010 at 7:30 pm |

    Hi Nick, very informative piece – thanks a lot!

    For your information there seems to be small bug in the quality_breakdown.py script. It only evaluates the first sequence due to a little tabbing mistake. I added a little functionality to allow it to deal with files with sequences of differing length, but I’m not sure how to post the edited version on here. I can mail it to you if you like.

    Best,

    – Ulrik

  9. nat
    nat
    February 17, 2010 at 10:08 am |

    Hi Nick,
    Very useful tips, thanks! About velvet’s behavior of not recalling the consensus, which Daniel confirmed to me as still being the case in latest velvets, I was wondering: Does this mean that for a given genomic paired-end assembly , the whole contigs.fa file needs to be updated using the RecalConsensus tip you described (as the file is somewhat ‘wrong’)?
    This also implies converting each contig into a bank, recalling the consensus and getting back to a new fasta? I am saying this because I have fungal genomes to handle, their assemblies are not loadable to amos (for bank conversion) as a whole, only splitting to separate contigs the afg file works.. Maybe this was not the case for your bacterial genomes? Or is there any other tip I am missing?
    This issue, if I get it right, seems quite important, however I ve only found it mentionned in this blog… I am wondering whether the published velvet genomes are recalled Consensus ones or not…
    Kind regards,
    Natassa

  10. Adaptor trim or die: Experiences with Nextera libraries

    […] of the first posts I did on this blog, way back in September 2009 was about my experiences with filtering and trimming Illumina sequences, and it proved rather […]

Leave a Reply

You must be logged in to post a comment.