EHEC Genome Assembly

Keep track of the genomic analysis of the EHEC strains on our Github Wiki.

BGI have released 5 runs of Ion Torrent data for the German EHEC/VTEC outbreak strain. I hope it is released with no specific restrictions on use for the benefit of the entire community, but the site doesn’t make that entirely clear. Thanks to the BGI for putting it up!

Shall we crowd-source some analysis? This comes at a very timely moment, as I am currently helping to organise the Applied Bioinformatics & Public Health conference in Hinxton (#ABPH11), where we are discussing the use of whole-genome sequencing in epidemiology. The problem is I don’t have much time to dig into the data.

But I’ve put a first-pass de novo assembly up using MIRA here: 3,057 contigs, total bases 5,491,032, N50 3,675. If you want the alignment files etc., get the big file here (282Mb).

Parameters are: mira --job=denovo,genome,accurate,iontor -GE:not=1
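
For anyone wanting to check the contig count, total bases and N50 quoted above against their own assembly, here is a minimal Python sketch. It is not part of MIRA and not how the figures in this post were produced; the function names are my own, and it assumes a plain FASTA file of contigs with no minimum-length filter.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # stays None until the first '>' header is seen
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def assembly_stats(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Summarise an assembly as contig count, total bases and N50."""
    lengths = contig_lengths(fasta_lines)
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "n50": n50(lengths),
    }
```

To use it, pass an open file handle, for example `assembly_stats(open("contigs.fasta"))`, where `contigs.fasta` is a placeholder for wherever you saved the assembly. Different tools apply different minimum contig lengths before reporting N50, so small disagreements with published figures are to be expected.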

Update 3/6/11 09:15 GMT+1

Marina Manrique has run the assembly through their BG7 bacterial genome annotation pipeline, results are here.

Torsten Seemann and Simon Gladman from the Victorian Bioinformatics Consortium have sent me the results of their in-house annotation pipeline. Results are available: contigs reordered according to E. coli EAEC 55989 and TWEC.

NCBI have also posted a preliminary assembly (of a different isolate – LB226692) – although it is not a true de novo assembly. The approach is a bit different. “Reads were mapped with TMAP against the publicly available E. coli 55989 chromosome (CU928145.2) and the derived consensus was split into contigs at zero-coverage regions. These contigs were used as a ‘backbone’ for mapping of reads, followed by de novo assembly of unmapped reads with the MIRA assembler (v 3.2.1). A small number of de novo and consensus contigs were merged using CAP3.”
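
The step of splitting the derived consensus into contigs at zero-coverage regions can be illustrated with a short Python sketch. This is only my reading of the quoted description, not the code actually used for that assembly; the function name, the `min_length` parameter and the idea of passing per-base depth as a simple list are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_at_zero_coverage(
    consensus: str,
    coverage: Sequence[int],
    min_length: int = 1,
) -> List[Tuple[int, str]]:
    """Split a consensus into contigs wherever read depth drops to zero.

    Returns (start, sequence) pairs, with start as a 0-based offset into
    the consensus. Runs shorter than min_length are discarded.
    """
    if len(consensus) != len(coverage):
        raise ValueError("consensus and coverage must have the same length")

    contigs: List[Tuple[int, str]] = []
    start = None  # start of the covered run currently being extended
    for pos, depth in enumerate(coverage):
        if depth > 0:
            if start is None:
                start = pos
        elif start is not None:
            # Depth fell to zero: close the current run.
            if pos - start >= min_length:
                contigs.append((start, consensus[start:pos]))
            start = None
    # Close a run that extends to the end of the consensus.
    if start is not None and len(consensus) - start >= min_length:
        contigs.append((start, consensus[start:]))
    return contigs
```

In practice the per-base depth would come from the read mapping against the reference, and the resulting contigs would then serve as the backbone for the second round of mapping described in the quote.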

Update 3/6/11 16:50 GMT+1

There are now two O104 isolates sequenced from this outbreak. The first, named TY2482, was done by BGI in collaboration with University Medical Centre Hamburg-Eppendorf; the second, called LB226692, was done by Life Tech in-house in collaboration with the University of Muenster. So opportunities for comparison exist now.

In summary: TY2482 assembly (BGI reads, my assembly), LB226692 assembly (Life Tech reads, assembly).

Mike the Mad Biologist has looked at the TY2482 assembly and concludes it is ST678 (or closely related) which agrees with the original molecular typing release from the Robert Koch Institute.

I’ve heard from another group they are planning on sequencing another isolate. I am going to try and find a place where the latest information can be collated to aid in further crowd-sourcing analysis.

Update 3/6/11 19:50 GMT+1

BGI just released two more 314 chips worth of data and their own assembly of TY2482. I don’t have any details on program used or parameters just yet but I’ve enquired.

Who will take on the challenge of building a whole-genome phylogeny?

Update 4/6/11 16:15 GMT+1

A few notable updates.

Kat Holt has picked up the gauntlet of doing some whole-genome SNP comparisons of the strains. Results here.

David Studholme has looked for strain-specific genes in TY2482 and found some, including a class A beta-lactamase.

BGI have published some more analysis of the genomes and have suggested people use their assembly for further comparison. However, I still don’t have any details on how that assembly was done (I have asked), which seems important.

Some more useful discussion from Phylogeo about the novelty of this strain. I think the consensus is now that this strain has been seen and subsequently typed in the past (hence ST678 – not a new sequence type), but before now we did not have a genome sequence for this particular strain. More discussion over at Aetiology.

Marina Manrique has set up a Github repository and Wiki for this EHEC crowd-sourcing project. I am going to have a play around with this and hopefully we can start keeping all our crowd-sourced data here in a logical format.

Some RAST annotations are available, see the comments thread.

Update 6/6/11 10:54 GMT+1

Keep track of the genomic analysis of the EHEC strains on our Github Wiki.

23 Responses

  1. BioInfo
    June 2, 2011 at 5:38 pm |

    Hi Nick,

    I just completed a quick first-pass analysis of the data, if you are interested. Analysis data is found here for download:

  2. News: Nick Loman draft assembly of the E. Coli sequence from BGI | NGS bioinformatics

    […] can find in the blog of Nick Loman a draft assembly of the sequences from BGI. He has obtained it with […]

  3. kat
    June 4, 2011 at 7:57 am |

    Hi Nick,

    I’ve done some really preliminary analysis to look at SNPs & phylogeny, posted here:

    Kat Holt

  4. My contribution to the ‘HUSEC41-strains-are-not-that-new’ debate « The Alignment Gap

    […] ECOR concatenated MLST sequences (using this scheme) and extracted/concatenated MLST sequences from Nick Loman’s assembly of strain TY2482 genome (the groups I circled are not very accurate, I did it […]

  5. Batterio killer, continuano le analisi: potrebbe essere resistente a 8 antibiotici « my GenomiX

    […] e proponendo assemblaggi diversi che possono essere visualizzati e confrontati seguendo questo link. Nel frattempo, al BGI hanno proseguito le analisi scoprendo che il ceppo in questione porterebbe […]

  6. fangfang
    June 5, 2011 at 2:13 pm |

    Hi Nick,

    We have computed alignments, trees, evolutionary distances, etc for all the proteins between the two RAST annotated genomes and all the E. coli strains we have in the public SEED.

    This is a large, sortable table. You need to log in to the guest account (guest/guest) for the links to the two RAST genomes to work.


  7. EAEC / STEC genomes « bacpathgenomics

    […] a few fimbrial genes annotated (in both the BG7 annotation and Torsten Seeman’s annotation of Nick Loman’s MIRA assembly of BGI’s TY2428 data) but they are currently each in their own contigs, so it’s not really possible to get an idea […]

  8. mshukla
    June 5, 2011 at 5:47 pm |

    Hi Nick,

    RAST annotations for Escherichia coli TY-2482 and Escherichia coli O104:H4 str. LB226692 genomes are available for download in various file formats (genbank, gff3, gtf, faa, fna) from PATRIC website:


  9. Talks, Genomes, Reads and Annotations | the oh no sequences! blog

    […] Loman (@pathogenomenick) published a de novo assembly of the reads with MIRA in his blog (see post here). And then, some hours later (in the morning of the 3rd of June) we published the annotation of the […]

  10. Gerhard Thallinger
    June 6, 2011 at 12:19 pm |

    Hi Nick,

    > BGI just released two more 314 chips worth of data and their own assembly of TY2482.
    > I don’t have any details on program used or parameters just yet but I’ve enquired.

    It seems to be an assembly based on newbler (judging from the contig names) with some
    post processing.

    I performed a newbler assembly (2.5.3) myself using the 7 IonTorrent runs currently available.
    This results in 1,398 contigs (>100 bp) with an N50 of 9,107 and a total of 5,228,942 bases.
    The largest contig is 47,177 bp.

    If there is interest I can upload it to the github repository.

    Does anybody know whether the raw data from LB226692 is available somewhere?


  11. New German STEC/EHEC data from BGI « bacpathgenomics

    […] from BGI implies that all prior assemblies were reference based and not de novo. However the MIRA assembly of runs 1-5 of BGI reads, which has been annotated by ERA7 and analysed extensively, was a de novo […]

  12. john_doe
    June 15, 2011 at 1:24 pm |

    Hi Nick,

    great blog, we always love to stop by!
    The G2L team just released 454 data from two isolates from the German E. coli O104:H4 outbreak. You can find it on our website (which unfortunately is in German):
    The link to the ftp server is
    User name and password are ‘EAHEC_GOS’.
    Would be great if you could post it on the github E. coli O104:H4 Genome Analysis Crowdsourcing (however, we already posted it also to Kat Holt …)
    Thanks and keep up the good work!
    G2L team

  13. Offene EHEC-Forschung |

    […] und niedrigschwellige Kommunikation fördern, lässt sich z.B. an Blogposts von Nick Loman (Pathogens: Genes and Genomes) und Kat Holt (bacpathgenomics) und den dort erwähnten Referenzen […]

  14. | Gobierno Electrónico | Blog | La ciencia 2.0 mató a la bacteria 'E.Coli'

    […] de Sistemas de la Universidad de Birmingham. Como los chinos del BGI, Loman puso en internet su primer ensamblaje preliminar . Segundo […]

  15. 2011: Review of a Remarkable Year
    December 31, 2011 at 9:36 am |

    […] started with Nick Loman filling me in on the crowdsourcing efforts that he had jump-started with his assembly of the BGI’s Ion Torrent data. I was initially dismissive of all this activity on social media, in the absence of any plans for a […]

  16. Some More Thoughts About the German E. coli Outbreak | Mike the Mad Biologist

    […] sequenced E. coli genomes E. coli update: sprouts as the culprit? ehec-outbreak-crowdsourced EHEC Genome Assembly E. coli TY2482: strain-specific genes The reason why this deadly E coli makes doctors shudder Share […]

  17. Dos pepinos espanhóis ao genoma nos blogs: E. coli patogênica na Alemanha | Rainha Vermelha

    […] Pathogens: Genes and Genomes – EHEC Genome Assembly […]

  18. Our Modern Day John Snow teaches us about cool crap | Evolution and Genomics

    […] posted a blog linking to Ion Torrent data and asked crowd sourcing the analysis of […]
