Amazingly, there is still no fully satisfactory pipeline for assembling combined Illumina and 454 data de novo.
Here are the ways I know about:
1) Assemble 454 data on its own and correct with Illumina data
For example, assemble the 454 data with Newbler, then correct the resulting contigs by mapping the Illumina reads against them with a pipeline like Nesoni.
- Newbler still works best on 454 data
- Newbler scaffolder works pretty well with 454 PE data
- The Illumina correction step fixes homopolymer and indel errors well
- Quite quick
- Newbler 2.6 has a handy gap filling mode (-scaffold on command line)
- Extra Illumina coverage won’t aid assembly contiguity (important if the 454 coverage is low)
- Won’t correct structural misassemblies in 454 assembly (although it may detect them)
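To make the correction step in approach 1 concrete, here is a minimal sketch of a majority-vote consensus fix. It is not how Nesoni works internally; the function name `correct_contig`, the pileup layout and both thresholds are assumptions made up for illustration. Each pileup entry records what the mapped Illumina reads show in place of one contig base: the empty string for a deleted base, or more than one character where the reads carry an insertion.

```python
from collections import Counter


def correct_contig(contig, pileup, min_depth=5, min_fraction=0.8):
    """Return the contig with well-supported Illumina alleles applied.

    pileup maps a 0-based contig position to a Counter of alleles seen in
    the mapped reads at that position ("" means the base is deleted).
    This layout is a hypothetical one chosen for the example.
    """
    corrected = []
    for pos, ref_base in enumerate(contig):
        alleles = pileup.get(pos)
        depth = sum(alleles.values()) if alleles else 0
        if depth < min_depth:
            # Too little evidence: keep the 454 base call.
            corrected.append(ref_base)
            continue
        allele, count = alleles.most_common(1)[0]
        if allele != ref_base and count / depth >= min_fraction:
            corrected.append(allele)
        else:
            corrected.append(ref_base)
    return "".join(corrected)


# A 454 contig with an over-called homopolymer (TTTT instead of TTT).
contig = "ACGTTTTAC"
pileup = {
    6: Counter({"": 9, "T": 1}),  # 9 of 10 reads lack the fourth T
    8: Counter({"G": 2}),         # only 2 reads: below min_depth, ignored
}
fixed = correct_contig(contig, pileup)
print(fixed)
```

Note that a per-position vote like this can only repair small local errors, which is why this approach will not fix structural misassemblies in the 454 contigs.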
2) Perform a hybrid assembly with MIRA
- Gives very reliable output
- Natively supports 454 and Illumina data at overlap stage
- Can view assembly in GAP4 and see 454 and Illumina reads, and quickly find problems
- Quite slow
- Memory hungry with lots of Illumina reads
- Will not scaffold using paired-end 454 data or mate-pair Illumina data; you need a separate scaffolder such as BAMBUS or SSPACE for this
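For anyone unfamiliar with what a separate scaffolding step actually does, the core idea can be sketched as counting read pairs whose two ends map to different contigs. This is a toy illustration only and not the algorithm used by BAMBUS or SSPACE; real scaffolders also use read orientation and insert size. The function name `count_contig_links`, the input layout and the `min_links` threshold are assumptions.

```python
from collections import Counter


def count_contig_links(pair_hits, min_links=3):
    """Count read pairs that link two different contigs.

    pair_hits is an iterable of (contig_of_read1, contig_of_read2) tuples,
    with None for an unmapped read. Returns the contig pairs supported by
    at least min_links read pairs.
    """
    links = Counter()
    for contig_a, contig_b in pair_hits:
        if contig_a is None or contig_b is None or contig_a == contig_b:
            # Unmapped mates and pairs within one contig carry no link.
            continue
        links[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}


pair_hits = (
    [("c1", "c2")] * 4
    + [("c2", "c1")]        # same link seen from the other direction
    + [("c1", "c1")] * 10   # both ends inside one contig
    + [("c2", "c3")]        # a single pair: not enough support
    + [("c3", None)]        # mate unmapped
)
candidate_joins = count_contig_links(pair_hits)
print(candidate_joins)
```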
3) Perform a hybrid assembly with CLC Genomics Workbench
- Very quick
- Native support of SFF and FASTQ formats
- Closed source, closed methods – hard to know what it is doing
- Not many user-configurable parameters
- Does not support paired-end 454 data or mate-pair Illumina data to produce scaffolds
Included for completeness; I have not spent much time with these packages.
5) Assemble Illumina data and 454 data separately and combine with MINIMUS
- Reasonably quick
- Can use “best” assembler for each flavour of data
- Theoretically provides independent confirmation of each assembly
- When there are disagreements, which assembly is correct?
- Coverage not additive so unlikely to result in improved contiguity
- Can propagate misassemblies in either assembly
- Difficult to use with gapped scaffolds
6) Fake Sanger reads from 454 or Illumina assembly and feed to the other assembler
I really don’t like this approach as so much useful information is lost in the resulting assembly, so I haven’t tried it.
7) Local assembly of abundant paired-end data to fill 454 scaffolds
This is a useful complementary approach to the ones above: you can use BGI’s GapCloser or IMAGE to try to fill gaps in scaffolds by local assembly of abundant Illumina paired-end data.
Update: 8) Newbler 2.6, incorporating FASTQ files
I can’t believe I forgot this, thanks to Anthony Underwood for reminding me.
Newbler 2.6 will now accept FASTQ files and so this may be a good option. I am going to have a play around with it and will post back my findings.
I still think it’s surprising there is no definitive assembly solution that can use 454 and Illumina data of all flavours and produce reliable, error-corrected scaffolds. Please correct me if I’m wrong! Similar issues may apply to combining Illumina or SOLiD data with Ion Torrent, PacBio, etc.
Comments, corrections, feedback as always appreciated.
One issue here is that historically you use fundamentally different approaches for the two data types: 454 assemblers use overlap-layout-consensus, whereas Illumina assemblers use de Bruijn graphs. However, with the advent of longer (100-150 bp) and more accurate Illumina reads, particularly if you use a k-mer error correction approach, overlap-layout-consensus may become an option for Illumina data too. Jared Simpson is experimenting with string graphs as an alternative to de Bruijn graphs.
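As a rough illustration of the de Bruijn side of this, the sketch below counts k-mers across a set of reads and builds a graph whose nodes are (k-1)-mers, dropping k-mers seen fewer than `min_count` times as a crude stand-in for k-mer error correction. It is a simplification under stated assumptions: reverse complements are ignored, and the function names and the threshold are invented for the example rather than taken from any real assembler.

```python
from collections import Counter, defaultdict


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_graph(reads, k, min_count=1):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it leads to.

    K-mers seen fewer than min_count times are treated as likely
    sequencing errors and left out of the graph.
    """
    graph = defaultdict(set)
    for kmer, n in kmer_counts(reads, k).items():
        if n >= min_count:
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# The last read carries a single-base error (ACGAA instead of ACGTA).
reads = ["ACGTA", "ACGTA", "CGTAC", "CGTAC", "ACGAA"]

raw = de_bruijn_graph(reads, k=4)
filtered = de_bruijn_graph(reads, k=4, min_count=2)

print(sorted(raw["ACG"]))       # the error creates a branch at ACG
print(sorted(filtered["ACG"]))  # the branch disappears after filtering
```

The point of the comparison is that one erroneous read adds a spurious branch to the graph, and removing rare k-mers takes it out again, provided coverage is high enough that genuine k-mers are not rare too.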