Gene accumulation curves in R
I’ve put a quick guide to demonstrate plotting gene accumulation curves (for bacterial pan-genomes) up on my personal blog.
I’ve put a quick guide to demonstrate plotting gene accumulation curves (for bacterial pan-genomes) up on my personal blog.
The ritual of checking Twitter with my coffee this morning brought much excited chattering about the new Illumina sequencer, the HiSeq 2000 which was announced yesterday. I won’t cover this in detail as there are already great blog posts up from Daniel McArthur, David Dooling and Genomics Lawyer. Bottom-line: $650,000 for the machine, $10,000 for the reagents, 2 x 100 base-pair paired-reads and the clincher, 200Gb of sequence data per run. And a terrible name.
Technology wise this has been done through parallelisation rather than chemistry changes. Illumina have increased the number of flow-cells in each run to 2. Each of these are now double-sided, and the readable area of the flow-cell has been increased. This is a bit like the processor market where Intel have managed to keep pace with Moore’s Law by increasing the number of processor cores on each chip.
The other big news is that Beijing Genomics Institute have ordered 128 of these beasts, which on delivery will hand them the title of largest genome centre in the world by a country mile. According to the map of high-throughput sequencers, the Broad Institute currently hold this title. Although no doubt we will see a bunch of other genome centres placing orders for this new machine soon.
Update: For those with GA2s looking to upgrade, I am told that trading in your old machine will give you a paltry $150,000 off the list price of the HiSeq.
At yesterday’s IDRN meeting, Neil Ward gave an excellent summary of the Illumina sequencing platform, which really does seem to be going from strength to strength. They are pushing throughput hard – 35Gb per run routinely, with 50Gb targeted in the next few months. They think 95Gb per run should be achievable with the current technology, phenomenal stuff. 100 base-pair reads and the new 3kb mate-pairs (annoying distinction from paired-ends – stop confusing me!) means de novo applications will really become much easier on this platform.
But a strange thing happened during the presentation, Neil mentioned an interesting sounding Illumina case-study, the confirmation of the identity of Russian Emperor Czar Nicholas II and his family by sequencing STRs and mtDNA from DNA recovered from grave sites. This case-study is featured prominently on the Illumina publications page. So imagine my surprise when I clicked through to find the article in Forensic Science International had been withdrawn!
What happened I wonder? Some of the printable suggestions from conference delegates included 1) the authors sequenced their own DNA by accident, 2) a problem with informed consent for the samples – perhaps Nicholas’ living relatives were not happy or even 3) they dug up the wrong grave!
Of course it is probably unfair to speculate in such a case, and I don’t want to annoy anyone by posting about it, but its such a tantalising story its difficult not to! It is certainly unusual to see a paper withdrawn (rather than getting addendum after addendum of corrections). The truth of course is likely to be more prosaic than my suggestions. But I would expect it to disappear off Illumina’s website fairly soon.
A previous genomic study was published in PNAS in October 2008.
We were very pleased to have reps from Illumina, Applied Biosystems and Roche present at the IDRN next-gen sequencing meeting at the University of Birmingham today. Tim Jones brought along a GS Junior to give us a look at this intriguing new entrant to the sequencing market and what I think is the first public viewing of the instrument. As he said, Roche are chasing a different market to AB and Illumina and they intend the 454 to be complementary to the short-read systems. The GS Junior is a cute little thing, about the size of a laser printer. The LED looks a bit like a coin-slot and a few people commented that you “put your 50p in there” and out comes the data. Looking inside the box the slot for the PicoTitrePlate is absolutely tiny, reflecting the lower throughput of this instrument (40Mb per run). A few people commented that they’d buy one if they did it in pink. Here are some pics for all you sequencer-nerds to look at.
We’re very pleased to announce the availability of the xBASE Annotation Service. This service is designed to produce annotation files from unfinished genome data, e.g. from 454 or Solexa de novo assemblies. The steps involved will be familiar to anyone who has used similar pipelines. Gene prediction is run on the uploaded sequence using Glimmer, functional annotation is called from a reference using BLAST. Non-coding RNAs are detected with tRNAScan-SE and RNAMMER. The final result is a GenBank file suitable for viewing in Artemis.
Two features set this pipeline apart from others. Firstly, it is fast. We aim to ensure annotations are completed in less than 30 minutes and ideally half that. Secondly, we have made it extremely easy to compare your unfinished sequence to a reference by re-ordering and reverse complementing contigs and producing a file suitable for loading directly into ACT. This works whilst preserving any novel regions that may be present. We’d be very pleased to get your feedback on this new function.
A few months back I blogged about James Hadfield’s excellent Google Map for locating next-generation sequencing facilities. Some time has passed and the map has become very popular with over 100,000 views registered. Keen to create this into a really great community resource, James and I have been collaborating in our spare time on a revamped version. There are a couple of major improvements over the first version;
Perhaps most excitingly for stats-nerds, this is all database-driven so the long awaited league tables of sequencing platforms can be calculated automatically. However, mindful that the information could be made more accurate, I have decided to leave the league tables off until people have had a bit of time to check the map for errors, and add any missing facilities.
We’re really keen to gather feedback so either post a comment on the blog, or get in touch directly if you have ideas you want to discuss. We were also thinking about extending it for other non-genomics utilities, for example next-generation proteomics. If you are working in such a field and would be interested in collaborating, again get in contact.
Well, what are you waiting for? Go check out the map right now!
A press release from 454 was brought to my attention via their Twitter feed yesterday. We’d heard rumours about this already, but Roche have confirmed that they will be supplying a new bench-top sequencing, the “GS Junior” in 2010. So other than a cute name, what will this bring to the party?
“The GS Junior System, which is no bigger than a typical laser printer, has performance and features tuned to fit the needs of small to medium sized laboratories. The platform will launch with long-read GS Junior Titanium chemistry, offering 400 – 500 bp read lengths, and will be shipped with a desktop computer that is optimized for GS Junior Run processing and downstream data analysis.”
A laser printer form factor sounds attractive, but this is offering daisy-wheel performance with a throughput of 35 million bases (c.f. 500 million from the GS FLX and 30Gb from Illumina) per run.
Contrast this strategy to the other big players in the high-throughput game, Illumina and ABI. Their strategy for the last few years has concentrated on inexorably increasing throughput with an eye on the glittering prize of human genetics – the $1000 and then $100 human genome.
Roche are going in quite a difficult direction and shooting for a different market. ABI still dominate sequencing for smaller customers (i.e. non-dedicated genome centres) such as the molecular biology or clinical diagnostics lab with the wildly successful 3700 series. It can be seen from our map of high-throughput sequencers that whilst Illumina are concentrating on volumes and genome centres (80 machines at the Broad, 40 at Sanger), 454 have slowly managed to get small volumes into larger numbers of institutions. Having said that, the machines are typically housed in sequencing facilities if not sequencing centres. They don’t yet sit alongside the battered thermal cycler and pipettes in the average lab.
I can see Roche having some success with this strategy. But it will require a change in mindset from users. PhD candidates sequencing individual gene loci or plasmids in their mutants will start switching to whole-genome resequencing approaches (did my site-directed mutagenesis experiment affect other parts of the genome?). Clinical diagnostics labs will start looking at tens or hundreds of gene loci (say, all genes known to affect cardiovascular risk). Clinical microbiology labs will routinely deep-sequence viral genomes to help treat infected patients. They may even start whole-genome sequencing bacterial isolates for pathogen profiling and help with infection control.
The Roche Junior may be the perfect solution for these new approaches which are surely to come.
But I see two major problems right now that need sorting for this to happen.
1. Speed up and simplify the 454 library construction
It takes us days to weeks (for paired-end libraries) to prepare a 454 Titanium library for sequencing. There are over 20 steps, some of the molecular biology is quite hairy and if you are not careful, you can easily lose your library and have to start again. You need a couple of weeks training before you can touch the system. This won’t be acceptable for the average grad student or post-doc. The website states that the sample prep will come down to just half a day, which sounds very promising.
2. Plates and reagents for 454 are very expensive
You are still looking at about £6000 ($10,000 US) minimum in consumables to run a single plate of 454 sequence and get about a million reads back. How much will the Junior’s consumables cost? I’d estimate they’d need to get this down to way below £1000 a run for this to be adopted in small to medium labs enthusiastically for new applications.
Also buried in the press release is some more detail about the updates to the FLX Titanium kits. As we’d already heard, reads of up to 1000 base-pairs are promised. This will bring immediate benefits to a number of applications, particularly amplicons and transcriptomics. We will certainly be using to for phylogenetic profiling of bacteria, using the extra long reads to capture more 16S variable region and hopefully improve the accuracy of these approaches.
The press release also hints that less input DNA will be required to operate the new kits. This is great news. we are currently asking our customers for between 5 and 10 micrograms of DNA for input which is sometimes difficult to achieve. They don’t specify how much is required, but certainly Illumina are winning on this, requiring much less than a microgram as input for their library construction.
Read the full press release.
Update 14:08 GMT
The guys from 454 (cheers Tim!) have been in touch to give some indicative pricing for the machine.
Pricing information removed.
So there you have it, less than 15 years after the first bacterial genome sequence was completed (the biological equivalent of an Apollo landing journey back then) an average lab can do the equivalent for £x,xxx on a machine no bigger than a laser printer in a single day.
And on the subject of DNA requirements for the new FLX protocols, this has been cut down to a minimum DNA template requirement of 500ng (thanks: genejockey).
Update 15:21 GMT
Woops, seems like 454 guys don’t in fact want the pricing released to the public just yet. Hopefully they will announce in the near future but have taken it down at their request.
A colleague has asked me to post up this interesting sounding post-doc position.
A position is available for a Postdoctoral Research fellow to work in the area of microbial metagenomics. The appointee will analyze data gathered from a large multi-centre study on the relationship between the intestinal microbiota and health in elderly people, specifically by analyzing compositional and functional bacterial microbiome data. The study includes researchers from University College Cork, Cork University Hospital and Teagasc Moorepark and will exploit cutting edge metagenomics and clinical analysis of diverse health indices in a large cohort of elderly subjects, including unique longitudinal measurements. The appointee will be based in UCC and will report to Dr. Paul O’Toole, Department of Microbiology.
The post would be suitable for a bioinformatician with a strong background in genome sequence analysis, sequence alignment, clustering analysis and related programming. The project will involve analysis of microbial community composition datasets (SSU rRNA genes amplicons) generated by HT next-generation sequencing methodologies [1]. There will also be the opportunity to work with shot-gun metagenomic datasets. This is a full-time fixed term contract appointment, at a position on the CHIU Senior Postdoctoral salary scale reflective of experience.
Please send
Deadline for applications is: 17:00, Friday November 20th, 2009.
Only those candidates considered suitable for the post will be informed by return email. Candidates considered suitable for the post must be available for interview. Please send covering letter and CV to Dr. Paul O’Toole, Department of Microbiology UCC (pwotoole@ucc.ie; 021 490 3997).
1. Claesson, M.J., et al., Comparative analysis of pyrosequencing and a phylogenetic microarray for exploring microbial community structures in the human distal intestine. PLoS One, 2009. 4(8): p. e6669.
An interesting report from Bio-IT World on the recent NGS2009 conference in Barcelona quotes Phil Green, legendary developer of phred and phrap. One paragraph that caught my attention related to storage of image data:
Green closed by recommending a change of heart with regard to data preservation. “Virtually all genome centers throw away image files,” he said. “Our philosophy is, you should save the [data files].” Green has a utility that reduces the size of images by 75%. Tape storage for the images of a 36-cycle Illumina run amounts to just $12, said Green. “The cost of storage is so much less than running the flow cell. You might want to go back to it someday and get more information. It’s a lot more expensive to run another flow cell than to retrieve the images and get more data out of them.”
It is virtually dogma these days that sequencing centres might as well chuck away image files because storing the information in the freezer in DNA form and resequencing if necessary would work out cheaper. Phil Green suggests that a combination of image compression and tape storage is economical, and has the added advantage that images can be crunched again if improved base-callers come along.
I am not really sure if that is compatible with the nature of most research projects. Most labs work on the principle that we do the sequencing, analyse the results, write the paper, (hopefully) get published and move on to the next project. I am not convinced that people will be regularly mining their old image files to get a few percentage extra reads out, particularly when the sequencing technologies are outpacing Moore’s Law. Doing the same amount of sequencing just a year later will be drastically cheaper when calculated cost-per-base.
I am also not sure that the storage cost argument is right. Referencing David Dooling’s just-updated next-generation informatics table you are looking at 2.8 terabytes of image data per GA2 run. Phil Green suggests these can be compressed by 75%, I assume this means you can get them down to 700 gigabytes. Conveniently that will fit on a single LTO-4 tape which are about £27 each from Dabs. Of course you will need tape drives and carousels, perhaps another £5,000 worth. Plus, depending on your setup you need to wire up your sequencer to dump the image files to tape, which might involve an investment in high-end networking or SAN gear. But still, £27 is a lot less than the £5-10,000 you would spend on doing an Illumina run.
Does this scale up? For a large genome centre things get complicated. Assuming 30 Illumina GA2x machines running 24/7 you might be able to do 1,000 runs a year. That would be 700 terabytes of compressed image data. You could probably still afford the tapes, about £15,000 worth. But you’d have to employ a person to load and unload the carousel daily. Then you’d need a dedicated storage room for all those tapes and an indexing system to record which data was on each tape. If you were paranoid you’d want to move this offsite and have fire and flood protection. You’d need multiple tape drives and carousels to cope with the output. Then you’d need bandwidth to move the image files around and processing power to compress them. Then factor in a steady increase in image file size. Conservatively this would come to about £50,000 p.a. for storage, or between 5 and 10 Illumina runs. Factor in instrument improvements and that might be the equivalent of 30 Illumina runs in a years time.
So Phil does have an interesting point. But the key question is, how many sequencing runs would you anticipate needing to re-call in any given year in order to get your work done? And how much time and effort do you want to spend addressing this problem versus all the other bioinformatics challenges faced downstream?
(Locally, we have agreed on a 60-day retention policy for image files through direct attached storage.)
I have found Velvet to be an excellent option for short-read de novo assembly of bacterial genomes. We are routinely seeing Velvet assemblies of Illumina data producing similar statistics to Newbler assemblis of 454 data. It seems that Illumina run with 75-base pair reads and paired-end libraries can be competitive with Newbler assemblies from 454 fragment libraries. Amazing that just a year or two the dogma was that short-read technologies weren”t useful for de novo assemby.
However, getting the best out of Velvet is a bit trickier than using Newbler. Roche’s informatics is really slick and running a Newbler assembly is pretty much “fire and forget”. However, the cost-effectiveness of Illumina means that this is turning into a popular choice for bacterial genomics. I am frequently asked at presentations for details on our pipeline for handling Illumina data, so I thought I could point people at this blog post in future.
Firstly, de novo assembly with short-reads gives much more effective results if you aggressively pre-filter the reads passed to the assembler (whether it is Velvet or something else). The difference in assembly statistics between passing the “raw” base-call output from Illumina versus passing a stringently filtered set is night-and-day. For an M. tuberculosis genome, the assembly N50 jumped from 1kb to 30kb after filtering the reads. It is also not uncommon to find Illumina datasets with 100-1000x fold-coverage for the genome due to the massively parallel nature of this technology. In this case more is not necessarily a good thing, submitting too many reads will dramatically increase memory requirements and may also result in a worse quality assembly due to the excess of error reads which complicate the de Bruijn graph. Velvet does not take into account quality scores which perhaps explains the utility of such filtering steps.
Filtering reads is a question of trial and error but I have found the following techniques are all useful. You can use them all, or mix and match for your application. The scripts I reference are all Python scripts which require the latest version (>=1.51b) of the invaluable BioPython package.
All the scripts assume you have a single FASTQ file. If there is paired-end information then the forward read should be followed directly by the reverse read in the file. If your pipeline produces a forward and a reverse read file then use Velvet’s bundled shuffleSequences.pl script to produce a single FASTQ file before proceeding.
1) Count per-base quality scores and trim the reads
Run quality_breakdown.py on your FASTQ file. The stdout will produce a tab-separated report showing the mean quality score at each base position. You can plot this in Excel and it usually shows a linearly degrading quality across the read. You have to use your disgression where to trim each read, but I tend to cut it about 20 bases in from the end or where the average quality is 20, whichever gives the longer read. Run trim_reads.py over your file specifying where in the read to trim. I like to have all reads the same length at the end.
2) Eliminate singletons
If you have excess depth of coverage, and particularly if you have at least x-fold coverage where x is the read length, then eliminating singletons is a nice way of dramatically reducing the number of error-prone reads you pass to Velvet. If you don’t have great coverage this step might not be possible. Simply run eliminate_singletons.py passing the FASTQ file as the first argument. It will ensure that the output still correctly pairs the reads.
3) Chuck away reads with Ns in
If you can afford the loss of coverage, you might throw away all reads containing Ns. The Velvet internals use a 2-bit encoding system for nucleotides and so does not make provision for Ns. I believe it converts Ns to a base chosen at random (it used to use Ns). eliminate_n_paired.py will discard any pairs which contain Ns. Be cautious, I once had a Solexa read file in which every read at positions 31, 32 and 33 were N so I had to trim the reads at position 30 before running this script.
4) Throw out reads with low average quality
This step is probably optional. You can discard all reads with an average score below a certain threshold using filter_reads.py, I might use 20 for the sake of argument.
Once you have a filtered read-file you may want to use Simon Gladman’s excellent VelvetOptimiser script (also bundled with Velvet) to perform a parameter scan looking for the assembly with the best ‘contiguity’, i.e. greatest N50. Whilst it is great to have long contigs – this makes life easier when doing downstream analysis – it is worth bearing in mind that a greater N50 is associated with a higher chance of misassemblies. If you are doing true de novo assembly you will have no reference to check the results against. It is therefore imperative you do some sanity checking on your assembly, particularly if you are using the output for variation detection between closely related strains (as opposed to, say, getting a quick and dirty proteome prediction).
I would be keen to hear what other people to do to help validate their assemblies, but one tip I can impart is to use a short-read aligner (I like Bowtie, but BWA, MAQ or NovoAlign would work as well) to map those reads back to the draft assembly produced by Velvet. By searching for high-frequency variants, i.e. SNPs and indels, either using VarScan or the built-in MAQ SNP pipeline you would expect to see few or hopefully no homozygous variant calls.
If you do see variant calls, check the read-depth of that contig. If it is 2 s.d. above the mean coverage for the genomes you are likely looking at a repetitive consensus contig, i.e. a contig produced by collapsing two genomic regions into a single contig. SNP calls in these contigs are unreliable by their nature. If you are seeing lots of variant calls you may be suffering from misassemblies. Consider changing the Velvet settings to make it more sensitive. SNPs occurring near the contig boundaries are likely to be due to spurious alignments and probably can be ignored. SNP calls occurring in all reads in the middle of long contigs should raise alarm bells.
When trying this approach with some paired-end data I came across a curious phenomenemon. I was getting about 60 SNP variant calls when mapping my reads back to my assembled genome. I used the excellent Hawkeye, part of the AMOS package to try and see what was happening in this region. I wanted to see if there were obvious signs of misassemblies, i.e. low-quality alignments in that region. What I found was unexpected. The SNP call was different from Velvet’s consensus, but the pile-up alignment at that base agreed with the SNP call. It looked like the consensus was being called incorrectly. Confused by the result, the ever-helpful Daniel Zerbino nailed the problem immediately – the consensus was being called from the original contig sequence where there was a mixture of alleles at the locus, before the paired-end resolution code was kicking in. Velvet correctly untangles the repeat using the paired-end information, but did not recall the consensus.
A way of fixing this problem easily is to use the AMOS recallConsensus software. This goes base-by-base through the alignment and calls the likeliest base at each position. It’s pretty slow, but it works OK. To produce AMOS files you need to turn the read tracking mode of Velvet on, and then convert the AFG file to a bank file. After you have run recallConsensus you can convert the bank back to a FASTA file and use that for your further analysis.
I’d be interested to hear from readers any of their experiences with using Velvet for de novo bacterial genome sequencing.
Recent Comments