There are now several available options if you want to call MLST profiles from whole-genome data.
DTU MLST Server
The web server at the Center for Genomic Epidemiology at the Technical University of Denmark (DTU) is probably the easiest option, with the advantage that it will accept both raw read files and assemblies. It worked well when I tried it, but it was quite slow to return results, and uploading large read datasets takes some time, particularly if you are analysing a large number of samples. It also does not list all of the MLST databases (I wanted to use C. albicans).
BIGSdb
BIGSdb is a powerful and flexible web server software that can be installed on your local PC or server. It offers the ability to call MLST profiles from assembled genome data, as well as setting up your own typing schemes based on other epidemiologically informative marker genes. But non-bioinformaticians may find it a little tricky to set up.
Update: There is also a hosted version of BIGSdb which lets you cut-and-paste your de novo assembly into the sequence query form and get profiles out, available for a certain subset of the MLST databases (more available on request to Keith Jolley).
SRST
SRST comes from Kat Holt’s group in Melbourne. It runs on your local machine and is notable because it calls profiles from short-read data without prior de novo assembly. It gives a confidence score to assignments. As it has some dependencies (BWA, samtools, BLAST) and runs as a Python script it is probably best run on a Linux machine or a Mac.
I found it works quite well on the Illumina data I tried, but there are a few tips for getting it running that are probably worth documenting for other users.
- The alleles files should be named gene.fas, where gene matches the gene name used in the FASTA header lines in the file, as well as the column names in the STs file.
- The alleles in the alleles file should be named gene-N, where N is the number of the allele. Note that if you have a different separator than a hyphen you can specify this with the --name-sep argument, but having no separator is not allowed (as I think is the case with the Cork E. coli database).
- You need an older version of samtools to run this properly; I used samtools-0.1.12a. Newer versions don’t work.
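The naming rules above are easy to check before starting a run. The following is a minimal sketch and not part of SRST itself: the function names check_allele_headers and check_st_columns are hypothetical, and it assumes the STs file is tab-delimited with a header row, which you should verify against your own files.

```python
import re


def check_allele_headers(gene, fasta_text, sep="-"):
    """Return a list of problems found in the FASTA headers for one locus.

    Every header is expected to look like gene<sep>N, where N is a number.
    """
    if not sep:
        # The post notes that having no separator is not allowed.
        raise ValueError("a non-empty separator is required")
    pattern = re.compile(re.escape(gene) + re.escape(sep) + r"\d+")
    problems = []
    for line_no, line in enumerate(fasta_text.splitlines(), start=1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        fields = line[1:].split()
        name = fields[0] if fields else ""
        if not pattern.fullmatch(name):
            problems.append(f"line {line_no}: {name!r} is not {gene}{sep}N")
    return problems


def check_st_columns(genes, st_header_line):
    """Return the genes that are missing from the STs file header.

    Assumption: the header is a single tab-delimited line.
    """
    columns = st_header_line.rstrip("\n").split("\t")
    return [gene for gene in genes if gene not in columns]
```

To use it on a real file, read gene.fas with open(...).read(), pass the text to check_allele_headers together with the gene name taken from the file name, and fix anything it reports before running SRST.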
Roll your own (suggested by Anthony Underwood, HPA)
Of course what many people do is first perform a de novo assembly, perhaps with Velvet, and then BLAST the contigs against the MLST allele database. You can then inspect the results manually, or write a little script to collect the results into a profile. If you have one you’d like to share, please post the link in the comments below. Here’s my Python script for what it’s worth …
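For illustration, the collection step could look something like the sketch below. This is not the script linked above. It assumes BLAST+ was run with the contigs as the query against the allele database using -outfmt "6 qseqid sseqid pident length slen", that alleles are named gene-N, and that the profiles table is tab-delimited with an ST column followed by one column per locus. All function names are hypothetical.

```python
import csv
import io


def call_alleles(blast_tsv, sep="-"):
    """Map locus -> set of allele numbers with an exact, full-length hit.

    Expects columns: qseqid, sseqid, pident, length, slen (assumed layout).
    """
    calls = {}
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        _contig, allele, pident, length, slen = row[:5]
        # Keep only hits that are 100% identical over the whole allele.
        if float(pident) == 100.0 and int(length) == int(slen):
            locus, _, number = allele.rpartition(sep)
            if locus:
                calls.setdefault(locus, set()).add(number)
    return calls


def build_profile(calls, loci):
    """Return one allele number per locus, or '?' if missing or ambiguous."""
    profile = []
    for locus in loci:
        numbers = calls.get(locus, set())
        profile.append(next(iter(numbers)) if len(numbers) == 1 else "?")
    return tuple(profile)


def lookup_st(profile, loci, profiles_tsv):
    """Return the ST whose alleles match the profile, or None."""
    reader = csv.DictReader(io.StringIO(profiles_tsv), delimiter="\t")
    for row in reader:
        if tuple(row[locus] for locus in loci) == tuple(profile):
            return row["ST"]
    return None
```

Only exact matches are called here, so a novel allele or a locus split across two contigs shows up as "?" and still needs manual inspection of the BLAST output.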
References
- Larsen MV, Cosentino S, Rasmussen S, Friis C, Hasman H, Marvig RL, Jelsbak L, Sicheritz-Pontén T, Ussery DW, Aarestrup FM, Lund O. Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol. 2012 Apr;50(4):1355-61. PMID: 22238442.
- Inouye M, Conway TC, Zobel J, Holt KE. Short read sequence typing (SRST): multi-locus sequence types from short reads. BMC Genomics. 2012 Jul 24;13:338. PMID: 22827703.
- Jolley KA, Maiden MC. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 2010 Dec 10;11:595. PMID: 21143983.

Interested to see the mention of SRST. How fast is it? There’s a pipeline developed by the Sanger Institute using ICORN (http://sourceforge.net/projects/icorn/) but it’s slow compared to a de novo assembly and BLAST-based approach.
It’s worth noting that many of the databases hosted on PubMLST.org are run on the BIGSdb platform. If the scheme you’re interested in is one of these then you can copy and paste whole genome contigs into the sequence query form within the appropriate database pages. We also have some of the more commonly used schemes that are hosted on other sites available at http://pubmlst.org/mlst/. Schemes can be added here on request (I’ve just added C. albicans).
@Anthony, Keith – thanks for stopping by!
@Anthony- it’s not particularly fast. I would say it’s probably slower than doing an assembly and then BLASTing. But it would depend on how many reads you have. Perhaps I should do some quick benchmarking to improve this post. Thanks for also mentioning the “roll-your-own” solution, I suspect you and I and many others have knocked up a quick script to do this.
@Keith- thanks for that! I will update the post with the information. Do you know if BIGSdb copes nicely with biallelic loci?
@Anthony – is the iCORN based solution available to use/download somewhere on Sanger website?
Biallelic data is a bit of a problem. BIGSdb was really designed for haploid data so it doesn’t really know about nucleotide ambiguity codes. Internally it’s just using BLAST to identify the nearest matches and that seems to work fine for exact matches of alleles with ambiguity codes, but may give a slightly misleading answer when it identifies variable positions if it’s not an exact match.
As far as I can see the ICORN solution is not publicly available, though I believe it is their intention to release it. However, on a recent visit to the WTSI they indicated they are now using a different mapping-based approach, and I am in the process of trying to clarify what this is. Will update when I have more details.