Sequencing low diversity libraries on Illumina MiSeq

After its launch in 2005, the 454 rapidly became the go-to technology if you wanted to sample diversity in amplicon libraries, whether a cancer panel, a viral quasispecies or microbial community profiling. It is not difficult to see why. Compared to Sanger sequencing, the 454 offered massive throughput, producing over a million reads per run at the relatively modest price of $10,000, an order of magnitude less than the cost of Sanger sequencing. Crucially, combining the instrument's high throughput with barcode multiplexing permitted large numbers of samples to be interrogated on a single run at high coverage depth.

In microbiology and ecology, deep sequencing of 16S amplicon libraries using 454 is now the dominant method for phylogenetic profiling of microbes. Of the 2,210 publications listed on the 454.com website, 839 are in the category “Metagenomics and Microbial Diversity”. The “rare biosphere” in our bodies in health and disease was revealed for the first time. Environmental ecologists used the technology to interrogate hugely diverse environmental niches, uncovering hundreds of new OTUs, many of them representing hitherto uncultured microbes.

Move over 454

Sadly, the pace of development of the 454 platform has stagnated in recent years following the Titanium upgrade in 2008. The long-promised upgrade to GS FLX+ “1kb reads” was late and under-delivered with reads more like 700-800 bases, and some users have reported dissatisfaction with the upgrade. Disappointingly the long read protocol is not supported when running unidirectional Lib-A sequencing, dramatically limiting its potential market. Nor is it available on the benchtop 454 GS Junior, although this may change in future.

But most critical is the apparent blind spot of Roche management to the rapidly dropping costs of sequencing on competitor platforms. The 454 has simply priced itself out of the market by being one to two orders of magnitude more expensive when costed per megabase compared to the Illumina and Life Technologies platforms.

New Platforms for Amplicon Sequencing

So for microbiologists wishing to do 16S sequencing, whether they are driven by cost-cutting, or by a desire to sequence more samples more deeply, it is now time to look around at alternatives. The MiSeq and the PGM are both promising platforms for 16S analysis given their competitive price points, and increasingly long reads (MiSeq 2x150bp, PGM 200bp - going to 2x250bp and 400bp respectively by the end of the year).

Sequencing low diversity libraries on Illumina MiSeq

We are moving to the Illumina MiSeq locally for 16S sequencing. For about £750 we generate over 5 million reads per run. By using paired-end sequencing at 150 bases we can design experiments which generate amplicons a little less than 300 bases and overlap them to generate long pseudo-reads. The error model is favourable compared to 454 as it does not suffer from frequent indel errors, meaning there is less need for expensive denoising steps such as PyroNoise.
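To make the idea of overlapping read pairs into pseudo-reads concrete, here is a minimal Python sketch. It is not the tool we use in production: the function names, the 10-base minimum overlap and the 10% mismatch allowance are all assumptions chosen for illustration, and a real merger would also use base qualities to resolve disagreements in the overlap.

```python
import random


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_fraction=0.1):
    """Merge a read pair into one pseudo-read, or return None if no overlap is found.

    read2 is given as sequenced (from the opposite strand), so it is
    reverse-complemented first. Overlaps are tried from longest to shortest
    and the first acceptable one is used.
    """
    read1 = read1.upper()
    read2_rc = reverse_complement(read2)
    longest = min(len(read1), len(read2_rc))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = read1[-overlap:]
        head = read2_rc[:overlap]
        mismatches = sum(1 for a, b in zip(tail, head) if a != b)
        if mismatches <= max_mismatch_fraction * overlap:
            # Keep read 1 across the overlap (assumption: no quality scores available).
            return read1 + read2_rc[overlap:]
    return None


# Simulated example: a 280-base amplicon sequenced as 2x150, giving a 20-base overlap.
rng = random.Random(0)
amplicon = "".join(rng.choice("ACGT") for _ in range(280))
read1 = amplicon[:150]
read2 = reverse_complement(amplicon[-150:])
merged = merge_pair(read1, read2)
```

With 2x150 reads the overlap shrinks as the amplicon approaches 300 bases, which is why the amplicon needs to be a little shorter than the combined read length.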

However, there is a fly in the ointment. Amplicon sequencing on the Illumina platform has traditionally been problematic when sequencing so-called "low diversity" libraries such as 16S, resulting in low yields and lower per-base quality scores compared to sequencing more random libraries, e.g. from genomic DNA.

The good folks of Seqanswers have discussed this at length, and various work-arounds have been suggested. One commonly used approach is to spike in a genomic, higher-diversity sample, e.g. PhiX. The more PhiX spiked in, the better the results, but at the expense of the number of amplicon sequences generated. A second option is to add a sequence of N bases upstream of the 16S primer, resulting in the generation of random sequences. This however reduces the effective read length.
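The trade-off in both work-arounds is simple arithmetic, sketched below in Python. The function names are made up for illustration, and the calculation assumes that reads divide exactly in proportion to the spike-in fraction, which will only be approximately true in practice.

```python
def amplicon_reads(total_reads, phix_fraction):
    """Estimate reads left for the amplicon library after a PhiX spike-in.

    Assumes reads split in proportion to the spike-in fraction (a simplification).
    """
    if not 0.0 <= phix_fraction < 1.0:
        raise ValueError("phix_fraction must be in the range [0, 1)")
    return int(round(total_reads * (1.0 - phix_fraction)))


def effective_read_length(read_length, spacer_length):
    """Bases of each read left for the amplicon after a leading run of N bases."""
    if spacer_length > read_length:
        raise ValueError("spacer cannot be longer than the read")
    return read_length - spacer_length


# A 5 million read run at two spike-in levels, and a 150-base read with a 4-base spacer.
small_spike = amplicon_reads(5_000_000, 0.05)
large_spike = amplicon_reads(5_000_000, 0.50)
remaining = effective_read_length(150, 4)
```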

Solving the problem

We have been very fortunate in the past few weeks to welcome Josh Quick to our lab. He previously worked as an integration engineer at Illumina but has now decided to hone his skills as a bioinformatician. There's not much he doesn't know about Illumina sequencing, and he quickly introduced me to some tips for improving amplicon sequencing performance that were so impressive I asked him to share them here.

Over to you, Josh ...

There are 3 main areas in which low-diversity samples can cause you problems on MiSeq:

1) Focusing (every cycle) - the MiSeq focuses on the T channel with a fallback to the C channel. In practice, as long as the signal is not all in the G channel you will be fine.  All other issues aside, a very small PhiX spike-in (~5%) is enough to prevent any focusing issues regardless of the composition of your library.

2) Template building (cycles 1 to 4) and registration (every cycle) - RTA uses images from the first 4 cycles to detect the positions of all the clusters.  You need to have some signal present in each channel for RTA to do template generation and registration properly.  Again, a small PhiX spike-in (~5%) is usually enough to prevent problems here, provided density is <=700k.

3) Phasing/matrix estimation (cycles 1 to 12) - RTA estimates the average colour matrix over the first 4 cycles and the phasing over the first 12 cycles.  Low diversity samples can cause problems with both, as the intensity is not evenly distributed across all channels as it is with genomic libraries.  Because these estimates are used to perform corrections, a bad estimate here can cause your quality to start high and then rapidly fall away. In these cases you might need a large PhiX spike-in (~50%) to solve the problem.
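If you already have reads from a similar library, a quick way to see whether it is likely to hit these problems is to look at the base composition in each of the first 12 cycles. The Python sketch below is only a rough diagnostic and says nothing about what RTA does internally. The 5% threshold and the example primer sequence are arbitrary assumptions.

```python
from collections import Counter


def per_cycle_composition(reads, cycles=12):
    """Return one dict per cycle giving the fraction of A, C, G and T calls."""
    composition = []
    for cycle in range(cycles):
        counts = Counter(read[cycle].upper() for read in reads if len(read) > cycle)
        total = sum(counts[base] for base in "ACGT")
        composition.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return composition


def low_diversity_cycles(reads, cycles=12, min_fraction=0.05):
    """Return 1-based cycle numbers where any base falls below min_fraction."""
    flagged = []
    for number, fractions in enumerate(per_cycle_composition(reads, cycles), start=1):
        if min(fractions.values()) < min_fraction:
            flagged.append(number)
    return flagged


# Every read starts with the same (made-up) primer, so every cycle is flagged.
amplicon_like = ["GTGCCAGCAGCCGCGGTAA"] * 100
# An artificially balanced set of reads, where no cycle is flagged.
balanced = ["A" * 12, "C" * 12, "G" * 12, "T" * 12]

flagged_amplicon = low_diversity_cycles(amplicon_like)
flagged_balanced = low_diversity_cycles(balanced)
```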

Control lanes - on the GA/HiSeq, 1) and 2) were still considerations (although each instrument focuses differently), but the use of a PhiX control lane eliminated problem 3).  On the MiSeq, having only a single lane means a control lane isn’t possible, but there is a method for using ‘control’ conditions on MiSeq by modifying the RTA configuration file.

In my experience the most likely thing to go wrong is the phasing estimator: it will give a spuriously high phasing or prephasing number of >1%, which means your quality starts off good then rapidly falls away.  However, you can use a value based on a previous PhiX run instead; for example, ours would be 0.0015/0.003 (phasing/prephasing).
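To get a feel for why a spurious estimate of >1% matters, the Python sketch below uses a deliberately simplified model in which a fixed fraction of strands falls out of phase every cycle and never resynchronises. This is an assumption made for illustration, not a description of RTA's actual correction.

```python
def fraction_in_phase(phasing, prephasing, cycles):
    """Fraction of strands modelled as still in phase after a number of cycles.

    Simplified model (assumption): each cycle a fraction `phasing` falls behind
    and a fraction `prephasing` runs ahead, and no strand ever recovers.
    """
    return (1.0 - phasing - prephasing) ** cycles


# Values from a PhiX control run versus a spurious 1%/1% estimate, over 150 cycles.
control = fraction_in_phase(0.0015, 0.003, 150)
spurious = fraction_in_phase(0.01, 0.01, 150)
```

Under this model the control values leave roughly half of the strands in phase at cycle 150, while the spurious estimate implies only about 5%, so a correction based on the spurious figure would assume far more out-of-phase signal in the later cycles than is really there.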

The way to use ‘control’ matrix/phasing on the MiSeq:

(DISCLAIMER - this is not a configuration supported by Illumina so use it at your own risk)

Our MiSeq is running:

  • MiSeq Control Software 1.2.3

  • RTA 1.14.23


Locate your RTA configuration, ours is at:
C:\Illumina\RTA\Configs\MiSeq.Configuration.xml

Locate your control phasing and matrix files (previous PhiX run is ideal):
D:\Illumina\MiSeqTemp\RunFolder\Data\Intensities\BaseCalls\(Phasing|Matrix)\s_1(phasing|matrix).txt

Use a text editor to put the matrix and phasing values from these files into the MiSeq.Configuration.xml below the other options like this:

<HardCodedPhasing>
  <float>0.0015</float>
</HardCodedPhasing>
<HardCodedPrePhasing>
  <float>0.003</float>
</HardCodedPrePhasing>
<HardCodedColorMatrix>
  <ArrayOfFloat>
    <float>0.9339278</float>
    <float>0.07252103</float>
    <float>0</float>
    <float>0</float>
    <float>1.458246</float>
    <float>1.399187</float>
    <float>0</float>
    <float>0</float>
    <float>0</float>
    <float>0</float>
    <float>0.8679092</float>
    <float>0.03415901</float>
    <float>0</float>
    <float>0</float>
    <float>0.5764247</float>
    <float>0.988043</float>
  </ArrayOfFloat>
</HardCodedColorMatrix>



You need to have one float/ArrayOfFloat per read, so the above would set the phasing, prephasing and matrix for a single-read run, and the example below would set just the phasing for a dual-index paired-end run with four reads:

<HardCodedPhasing>
  <float>0.0015</float>
  <float>0.0015</float>
  <float>0.0015</float>
  <float>0.0015</float>
</HardCodedPhasing>
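If you would rather generate these fragments than type them by hand, something like the following Python sketch would do it. The helper names are invented for this example, and the check for 16 values per matrix is an assumption based on the example matrix above. The output still has to be pasted into MiSeq.Configuration.xml yourself.

```python
import xml.etree.ElementTree as ET


def per_read_floats(tag, values):
    """Build <tag> containing one <float> per read."""
    element = ET.Element(tag)
    for value in values:
        ET.SubElement(element, "float").text = repr(value)
    return element


def per_read_matrices(matrices):
    """Build <HardCodedColorMatrix> containing one <ArrayOfFloat> per read."""
    element = ET.Element("HardCodedColorMatrix")
    for matrix in matrices:
        if len(matrix) != 16:
            # Assumption: 16 values per matrix, as in the example above.
            raise ValueError("expected 16 values per colour matrix")
        array = ET.SubElement(element, "ArrayOfFloat")
        for value in matrix:
            ET.SubElement(array, "float").text = repr(value)
    return element


# Phasing for a dual-index paired-end run: four reads, so four values.
phasing = per_read_floats("HardCodedPhasing", [0.0015] * 4)
phasing_xml = ET.tostring(phasing, encoding="unicode")
```

The string in phasing_xml comes out on a single line, which an XML parser treats the same as the indented version shown above.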



When the run starts check the RTA configuration file in your run folder to make sure it accepted the settings:

D:\Illumina\MiSeqTemp\RunFolder\Data\Intensities\RTAConfiguration.xml
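This check can also be scripted. The Python sketch below assumes that the run folder copy uses the same element names as the fragments above. The root element in the example text is a stand-in, since the real file contains many other settings.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree adds to a tag."""
    return tag.rsplit("}", 1)[-1]


def hard_coded_floats(xml_text, setting):
    """Return the <float> values under the first element named `setting`, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == setting:
            return [
                float(child.text)
                for child in element
                if local_name(child.tag) == "float"
            ]
    return None


def settings_accepted(xml_text, expected, tolerance=1e-9):
    """Check that every expected setting is present with the expected values."""
    for setting, values in expected.items():
        found = hard_coded_floats(xml_text, setting)
        if found is None or len(found) != len(values):
            return False
        if any(abs(a - b) > tolerance for a, b in zip(found, values)):
            return False
    return True


# Stand-in for the contents of RTAConfiguration.xml (hypothetical root element).
EXAMPLE = """<RTAConfiguration>
  <HardCodedPhasing><float>0.0015</float></HardCodedPhasing>
  <HardCodedPrePhasing><float>0.003</float></HardCodedPrePhasing>
</RTAConfiguration>"""

accepted = settings_accepted(
    EXAMPLE, {"HardCodedPhasing": [0.0015], "HardCodedPrePhasing": [0.003]}
)
```

For a real file, read it as bytes (for example with pathlib's read_bytes) and pass that in, so that any encoding declaration at the top of the file is handled by the parser.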


In most cases this will enable you to use a significantly smaller amount of spiked-in PhiX. You will still need a minimum of 5% to prevent problems arising from 1) and 2), and you should not run at high density for amplicon work - 700k is the upper limit for difficult low diversity samples.  It is also possible to save the images for re-running RTA offline, which enables you to try different settings to find what works best.  The MiSeq.Configuration.xml setting for this is:

<CopyImages>true</CopyImages>



Good luck!

Update 6th September 2012: Some of the example values in the original post were wrong and have been corrected. However, these were just illustrative; you should use the values from a test run on your local machine for this approach to be useful.

Update 31st October 2012: In the latest release of RTA (1.16) you no longer need to modify your RTAConfiguration.xml. Instead, save a copy of your control phasing/matrix files described above in the root of the RTA directory as phasing.txt and matrix.txt. RTA will fall back to the values in these files if it detects a low diversity sample.
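Copying the control files into place can be scripted, for example with the Python sketch below. The function name and the backup behaviour are choices made for this example. The demonstration runs against a temporary directory with placeholder files, so you would need to substitute your own run folder and RTA directory.

```python
import shutil
import tempfile
from pathlib import Path


def install_control_files(phasing_source, matrix_source, rta_root):
    """Copy control files into rta_root as phasing.txt and matrix.txt.

    Any existing file is first copied to a .bak file rather than lost.
    """
    rta_root = Path(rta_root)
    installed = []
    for source, name in ((phasing_source, "phasing.txt"), (matrix_source, "matrix.txt")):
        target = rta_root / name
        if target.exists():
            shutil.copy2(target, target.with_name(target.name + ".bak"))
        shutil.copy2(source, target)
        installed.append(target)
    return installed


# Demonstration in a scratch directory with placeholder files (not real RTA output).
with tempfile.TemporaryDirectory() as scratch:
    scratch = Path(scratch)
    run_dir = scratch / "run"
    rta_dir = scratch / "RTA"
    run_dir.mkdir()
    rta_dir.mkdir()
    (run_dir / "control_phasing.txt").write_text("placeholder phasing values\n")
    (run_dir / "control_matrix.txt").write_text("placeholder matrix values\n")
    installed = install_control_files(
        run_dir / "control_phasing.txt", run_dir / "control_matrix.txt", rta_dir
    )
    installed_names = sorted(path.name for path in installed)
    copied_phasing = (rta_dir / "phasing.txt").read_text()
```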

8 Responses

  1. matt
    August 10, 2012 at 8:41 pm

    That’s very interesting and useful, thanks!
    This will be an increasingly common problem as 16S sequencing is switched to Illumina, which seems very attractive.

    Josh and Nick – do you know any similar software tweaks for HiSeq or GAII runs that would allow less extreme over-sequencing of PhiX – a genome no one seems to want sequenced any deeper? I haven’t seen that covered in Seqanswers, and certainly not in the discussion you link to.

    I have heard about doing dark sequencing in read1 into the more random sequence e.g. for the 12bp limit, stripping and resequencing “read1b” from the start.

    Any comments or secret hacks?

  2. mterry
    August 13, 2012 at 11:17 am

    Together with Illumina Sequencing Service we tried to sequence our amplicons back in 2009 and 2010. It is synthetic DNA with regions that are common for all templates, while some regions are variable. We never had much success with Illumina (worked great with 454, and now more recently with PGM), and it’s interesting now to see that we might have an explanation to the problem with monotemplate sequencing. Time and money wasted for us though…

  3. josh
    August 13, 2012 at 12:28 pm

    If you are already using a control lane on the HiSeq/GA then there isn’t much else you can do to reduce the amount of spike-in, as there is a minimum needed to maintain focus. If the focus is fine but your template is poor then dark sequencing is an option, assuming your sample becomes more diverse further into the read. You could also do the index read first and build your template off that if it is more diverse than your read 1!

    Josh

  4. mm9810
    October 16, 2012 at 7:47 pm

    Thanks for the useful and informative discussion.

    You mention the possibility of introducing N’s upstream of the 16S primer, but that this reduces read length. While that’s clearly true, I would have thought that a variable number of 0-4 nucleotides would be sufficient to ensure diversity. Essentially, if you have some variability in the length of your barcodes, assuming you’re multiplexing, you make your sample, base for base, more diverse. That is a decrease in read length, of course, but not a very significant one, and certainly much better than the 50% spike-in. Am I missing something here?

    Thanks for your thoughts!

    Mike

  5. krobison
    October 31, 2012 at 4:47 pm

    It’s not just amplicon libraries that have issues; sequencing genomic fragments from very GC-biased samples also gives the MiSeq fits.

  6. Benny Chain
    November 6, 2012 at 4:59 pm

    Can anyone point me in the direction of how to rerun an analysis offline on MiSeq data? We used RTA V1.16.18. Can one run it stand-alone using saved image files?

  7. rajasereddy
    January 20, 2013 at 4:16 pm

    MiSeq doesn’t save image files; for that matter, none of the Illumina systems do. If you have saved .cif files, OLB 1.9 can be used for offline base calling.
