Top 50 sequenced bacteria

Combining the NCBI complete and incomplete genome project databases as of 4 April 2011:

    346 Escherichia coli
    206 Staphylococcus aureus
    183 Helicobacter pylori
    148 Vibrio cholerae
    142 Salmonella enterica
     96 Streptococcus pneumoniae
     94 Yersinia pestis
     83 Mycobacterium tuberculosis
     77 Leptospira interrogans
     73 Propionibacterium acnes
     73 Enterococcus faecalis
     68 Staphylococcus epidermidis
     67 Acinetobacter baumannii
     60 Streptococcus mutans
     53 Bacillus cereus
     50 Chlamydia trachomatis
     42 Brucella melitensis
     38 Pseudomonas syringae
     35 Brucella suis
     32 Listeria monocytogenes
     30 Haemophilus influenzae
     29 Neisseria meningitidis
     29 Enterococcus faecium
     29 Clostridium difficile
     28 Mycobacterium abscessus
     25 Campylobacter jejuni
     25 Burkholderia pseudomallei
     25 Bacillus thuringiensis
     25 Bacillus anthracis
     24 Methanobrevibacter smithii
     24 Clostridium botulinum
     24 Brucella abortus
     24 Bacteroides sp.
     23 Synechococcus sp.
     23 Streptococcus pyogenes
     23 Shigella flexneri
     22 Streptococcus sanguinis
     22 Francisella tularensis
     21 Bacillus subtilis
     20 Lactobacillus crispatus
     18 Pseudomonas aeruginosa
     18 Actinobacillus pleuropneumoniae
     17 Treponema denticola
     17 Neisseria gonorrhoeae
     17 Lachnospiraceae bacterium
     16 Borrelia burgdorferi
     15 Wolbachia endosymbiont
     15 Lactobacillus iners
     15 Lactobacillus gasseri
     15 Fusobacterium sp.

Make your own:

cat complete.txt incomplete.txt | cut -f 4 | cut -d " " -f 1,2 | sort | uniq -c | sort -rn | head -n 50
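
If you want to run the same count on other tables, the pipeline can be wrapped in a small shell function. This is only a sketch under assumptions: the name top_species is made up for illustration, the input is taken to be tab-separated with the organism name in column 4 (as the one-liner assumes), and the sort is numeric so the order does not depend on how uniq pads its counts.

```shell
# Sketch: count projects per "Genus species" across one or more tables.
# Assumes tab-separated input with the organism name in column 4.
# top_species is a hypothetical helper name, not part of any NCBI tooling.
top_species() {
    n="$1"; shift                # first argument: how many rows to keep
    cat "$@" |
        cut -f 4 |               # organism name column
        cut -d " " -f 1,2 |      # keep "Genus species", drop the strain
        sort | uniq -c |         # count identical names
        sort -rn |               # highest count first
        head -n "$n"
}
```

Calling `top_species 50 complete.txt incomplete.txt` should then produce a list in the same shape as the one above.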

11 Responses

  1. flxlex
    April 4, 2011 at 10:44 am |


    I was going to ask you if you could map the data onto a taxonomic tree, but decided to do that myself. So, I made a file importable into MEGAN with this command:

    cat complete.txt incomplete.txt | grep -v 'Organism' | awk 'BEGIN{FS=OFS="\t"}{x[$3]++}END{for (i in x){print i,x[i]}}'

    and then summarized to genus level, the top ten:

    Escherichia 363
    Streptococcus 315
    Staphylococcus 295
    Helicobacter 198
    Vibrio 198
    Bacillus 186
    Mycobacterium 159
    Lactobacillus 158
    Clostridium 145
    Salmonella 145

    Or phylum top 10:
    Proteobacteria 2653
    Firmicutes 1667
    Actinobacteria 577
    Bacteroidetes 272
    Spirochaetes 168
    Euryarchaeota 133
    Cyanobacteria 107
    Tenericutes 92
    Chlamydiae 77
    Crenarchaeota 46
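
A rough genus-level tally can also be sketched in the shell without MEGAN, by treating the first word of the organism name as the genus. This is an approximation, not the method used for the numbers above: it assumes tab-separated input with a header line containing "Organism" and the organism name in column 4, and genus_counts is a made-up helper name. Counts will differ from a taxonomy-based summary (names such as "Candidatus ..." or "uncultured ..." end up under the wrong label), and phylum-level totals cannot be recovered from the name alone.

```shell
# Sketch: count projects per genus, approximating the genus by the first
# word of the organism name. Assumes tab-separated input, a header line
# containing "Organism", and the name in column 4.
# genus_counts is a hypothetical helper name.
genus_counts() {
    cat "$@" |
        grep -v 'Organism' |                 # drop header lines
        awk 'BEGIN { FS = "\t" }
             $4 != "" { split($4, w, " "); n[w[1]]++ }
             END { for (g in n) print g "\t" n[g] }' |
        sort -t "$(printf '\t')" -k2,2nr -k1,1   # by count, then name
}
```

Piping the result through `head -n 10` gives a top ten to compare against the taxonomy-based list.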

  2. nabil
    April 4, 2011 at 1:24 pm |

    Hi Nick,

    So I noticed that you’ve included projects that haven’t published a draft genome sequence. Here’s a modified list that contains the Top 50 sequenced bacteria (with sequences that we can get our hands on, draft or complete):

    173 Escherichia coli
    82 Salmonella enterica
    78 Staphylococcus aureus
    69 Propionibacterium acnes
    56 Streptococcus pneumoniae
    56 Enterococcus faecalis
    45 Bacillus cereus
    42 Mycobacterium tuberculosis
    36 Vibrio cholerae
    29 Pseudomonas syringae
    28 Listeria monocytogenes
    27 Neisseria meningitidis
    27 Helicobacter pylori
    27 Enterococcus faecium
    27 Acinetobacter baumannii
    25 Yersinia pestis
    23 Methanobrevibacter smithii
    23 Clostridium difficile
    23 Burkholderia pseudomallei
    22 Campylobacter jejuni
    21 Haemophilus influenzae
    21 Chlamydia trachomatis
    20 Bacteroides sp.
    20 Bacillus thuringiensis
    19 Bacillus anthracis
    18 Synechococcus sp.
    17 Neisseria gonorrhoeae
    16 Clostridium botulinum
    15 Streptococcus pyogenes
    15 Actinobacillus pleuropneumoniae
    14 Lactobacillus iners
    14 Francisella tularensis
    14 Borrelia burgdorferi
    13 Prochlorococcus marinus
    11 Moraxella catarrhalis
    11 Buchnera aphidicola
    11 Bifidobacterium longum
    10 Ureaplasma urealyticum
    10 Streptomyces sp.
    10 Streptococcus sanguinis
    10 Fusobacterium sp.
    10 Burkholderia mallei
    10 Brucella abortus
    10 Bacillus subtilis
    9 Wolbachia endosymbiont
    9 Sulfolobus islandicus
    9 Streptococcus suis
    9 Streptococcus agalactiae
    9 Rhizobium etli
    9 Pseudomonas aeruginosa

    How on earth is anyone going to handle 173 E. coli genomes? (Or the eventual 363?)

  3. flashton
    April 4, 2011 at 1:30 pm |

    Great post, really thought-provoking as to why the distribution looks this way. E. coli and S. aureus make sense as the top two, but H. pylori as third? More than twice as many P. acnes as C. diff! It kind of falls into place when you see that ~70 P. acnes genomes have been sequenced at one institute as part of the Human Microbiome Project, and a large swathe of the H. pylori seem to have been done at the same time as part of this effort as well.

    It would be interesting to see a breakdown by project, or just by sequencing centre and date of submission, to get an idea of which are widely studied and which have been targeted by large projects such as HMP.

  4. nabil
    April 4, 2011 at 1:46 pm |

    It would be difficult to do a breakdown per project or sequencing centre for genomes that haven’t got a draft sequence. A lot of those genome projects have no information on the isolate or the sequencing centre.

    Why not H. pylori? It’s linked to cancer, which has a lot of money behind it. I’m not sure about the S. aureus, but the E. coli genomes coming out are either enterotoxigenic or non-O157:H7 enterohaemorrhagic E. coli (there must be a big push for an all-encompassing vaccine).

  5. krobison
    April 4, 2011 at 3:00 pm |

    How quickly does NCBI update this? Some recent papers would each top the numbers above:

    PMID: 20368420, April 2010: 30 Clostridium difficile isolates
    PMID: 21273480, Jan 2011: 240 Streptococcus pneumoniae
    PMID: 21383167, Mar 2011: 301 Streptococcus pyogenes (group A strep)

  6. mikethemadbiologist
    April 4, 2011 at 6:54 pm |


    The papers you’re citing weren’t de novo genome assemblies, but SNPs called against a reference, so they wouldn’t be included in that NCBI database.


    I think, as Illumina de novo assemblies become standard (the recent ones I’ve seen with 5 kb jumps are amazing, qualitatively better than previous draft genomes), we’ll see resequencing fade away and de novo sequencing take over.


    Actually, 90 of the currently available E. coli are commensals, which were done in anticipation of many more pathogens (we need context). In the future, given some of the projects I’m aware of, I think we’ll see about 2/3 pathogens, 1/3 commensals in E. coli.

  7. krobison
    April 5, 2011 at 5:18 pm |

    Thanks for the feedback, though the C. difficile paper did actually assemble their data, not just map it back to a reference.

    I didn’t mean this to be a criticism, as one must work with the best data available, but I do suspect that the database will probably get less and less comprehensive very quickly. It

  8. The Most Sequenced Bacterial Genomes | Sphaerula

    […] a bioinformatician in the Mark Pallen Research Group at the University of Birmingham, listed the most sequenced bacterial genomes yesterday on the group’s Pathogens: Genes and Genomes blog. This sparked quite a few interesting […]
