Generating MLST profiles from short-read data

There are now several available options if you want to call MLST profiles from whole-genome data.

Result from the DTU MLST web server

DTU MLST Server

The web server at the Center for Genomic Epidemiology at the Technical University of Denmark (DTU) is probably the easiest option, with the advantage that it accepts both raw read files and assemblies. It worked well when I tried it, but it was quite slow to return results, and uploading large read datasets takes some time, particularly if you are analysing a large number of samples. It also does not list all of the MLST databases (I wanted to use C. albicans).

BIGSdb

BIGSdb is powerful and flexible web server software that can be installed on your local PC or server. It can call MLST profiles from assembled genome data, and it lets you set up your own typing schemes based on other epidemiologically informative marker genes. Non-bioinformaticians may find it a little tricky to set up, though.

Update: There is also a hosted version of BIGSdb which lets you cut-and-paste your de novo assembly into the sequence query form and get profiles out, available for a certain subset of the MLST databases (more available on request to Keith Jolley).

SRST

SRST comes from Kat Holt’s group in Melbourne. It runs on your local machine and is notable because it calls profiles directly from short-read data, without prior de novo assembly, and gives a confidence score to each assignment. As it has some dependencies (BWA, samtools, BLAST) and runs as a Python script, it is probably best run on a Linux machine or a Mac.

I found it works quite well on the Illumina data I tried. However, there are a few tips for getting it running that are probably worth documenting for other users.

  • The alleles files should be named gene.fas, and gene should match the gene name used in the FASTA header lines in the file, as well as the column names in the STs file.
  • The alleles in the alleles file should be named gene-N, where N is the number of the allele. If you use a separator other than a hyphen you can specify this with the --name-sep argument, but having no separator at all is not allowed (as I think is the case with the Cork E. coli database).
  • You need an older version of samtools to run this properly; I used samtools-0.1.12a. Newer versions don’t work.
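The naming rules above are easy to get wrong, so it can help to check an alleles file before running anything. The following is only a rough sketch and not part of SRST: the function names are made up for illustration, and it assumes each FASTA header holds the allele name as its first whitespace-delimited token.

```python
import re
from pathlib import Path


def check_allele_headers(fasta_text, gene, sep="-"):
    """Return the FASTA header lines that are not named gene<sep>N.

    Hypothetical helper: N must be an integer and sep must be non-empty.
    """
    if not sep:
        raise ValueError("a separator between gene name and allele number is required")
    pattern = re.compile(r"^" + re.escape(gene) + re.escape(sep) + r"\d+$")
    bad = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            if not pattern.match(name):
                bad.append(line)
    return bad


def check_allele_file(path, sep="-"):
    """Check a file named gene.fas, taking the gene name from the file name."""
    path = Path(path)
    return check_allele_headers(path.read_text(), path.stem, sep)
```

For example, check_allele_file("adk.fas") returns an empty list if every header in that file looks like adk-1, adk-2 and so on.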

Roll your own (suggested by Anthony Underwood, HPA)

Of course what many people do is first perform a de novo assembly, perhaps with Velvet, and then BLAST the contigs against the MLST allele database. You can then inspect the results manually, or write a little script to collect the results into a profile. If you have one you’d like to share, please post the link in the comments below. Here’s my Python script for what it’s worth …
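For anyone starting from scratch, here is a minimal sketch of the "collect the results into a profile" step. It is not the script linked above, and the function names are invented for illustration. It assumes the contigs were used as the BLAST query against the allele FASTA files with standard 12-column tabular output (for example blastn with -outfmt 6), that alleles are named gene-N, and that the profile table is tab-separated with an ST column plus one column per gene. Only full-length hits with no mismatches and no gaps are counted as allele calls.

```python
import csv
import io


def allele_lengths(fasta_text):
    """Map each allele name (first token of the header) to its sequence length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def call_alleles(blast_tab, lengths, sep="-"):
    """Return {gene: set of allele numbers} for exact, full-length hits.

    Columns assumed: qseqid sseqid pident length mismatch gapopen ...
    """
    calls = {}
    for row in csv.reader(io.StringIO(blast_tab), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        allele = row[1]
        aln_len, mismatches, gaps = int(row[3]), int(row[4]), int(row[5])
        if mismatches == 0 and gaps == 0 and aln_len == lengths.get(allele):
            gene, number = allele.rsplit(sep, 1)
            calls.setdefault(gene, set()).add(number)
    return calls


def lookup_st(calls, profiles_tsv, genes):
    """Return the ST whose profile matches the calls, or None if there is none.

    A gene with zero or several exact alleles never matches, so such
    samples come back as None and need manual inspection.
    """
    for row in csv.DictReader(io.StringIO(profiles_tsv), delimiter="\t"):
        if all(calls.get(g) == {row[g]} for g in genes):
            return row["ST"]
    return None
```

Requiring an exact full-length match keeps the logic simple, but it means novel alleles and genes split across contigs are reported as missing rather than as a nearest match.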

References

  • Larsen MV, Cosentino S, Rasmussen S, Friis C, Hasman H, Marvig RL, Jelsbak L, Sicheritz-Pontén T, Ussery DW, Aarestrup FM, Lund O. Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol. 2012 Apr;50(4):1355-61. PMID: 22238442.
  • Inouye M, Conway TC, Zobel J, Holt KE. Short read sequence typing (SRST): multi-locus sequence types from short reads. BMC Genomics. 2012 Jul 24;13:338. PMID: 22827703.
  • Jolley KA, Maiden MC. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 2010 Dec 10;11:595. PMID: 21143983.

6 Responses

  1. Anthony Underwood
    October 9, 2012 at 4:40 pm

    Interested to see the mention of SRST. How fast is it? There’s a pipeline developed by the Sanger Institute using ICORN (http://sourceforge.net/projects/icorn/) but it’s slow compared to a de novo assembly and BLAST-based approach.

  2. Keith Jolley
    October 10, 2012 at 7:39 am

    It’s worth noting that many of the databases hosted on PubMLST.org are run on the BIGSdb platform. If the scheme you’re interested in is one of these then you can copy and paste whole genome contigs into the sequence query form within the appropriate database pages. We also have some of the more commonly used schemes that are hosted on other sites available at http://pubmlst.org/mlst/. Schemes can be added here on request (I’ve just added C. albicans).

  3. Keith Jolley
    October 10, 2012 at 8:57 am

    Biallelic data is a bit of a problem. BIGSdb was really designed for haploid data so it doesn’t really know about nucleotide ambiguity codes. Internally it’s just using BLAST to identify the nearest matches and that seems to work fine for exact matches of alleles with ambiguity codes, but may give a slightly misleading answer when it identifies variable positions if it’s not an exact match.

  4. Anthony Underwood
    October 10, 2012 at 3:08 pm

    As far as I can see the ICORN solution is not publicly available, though I believe it is their intention to release it. However, on a recent visit to the WTSI they indicated they are now using a different mapping-based approach, and I am in the process of trying to clarify what this is. Will update when I have more details.
