Combining the NCBI complete and incomplete genome project databases as of 4/4/2011:
346 Escherichia coli
206 Staphylococcus aureus
183 Helicobacter pylori
148 Vibrio cholerae
142 Salmonella enterica
96 Streptococcus pneumoniae
94 Yersinia pestis
83 Mycobacterium tuberculosis
77 Leptospira interrogans
73 Propionibacterium acnes
73 Enterococcus faecalis
68 Staphylococcus epidermidis
67 Acinetobacter baumannii
60 Streptococcus mutans
53 Bacillus cereus
50 Chlamydia trachomatis
42 Brucella melitensis
38 Pseudomonas syringae
35 Brucella suis
32 Listeria monocytogenes
30 Haemophilus influenzae
29 Neisseria meningitidis
29 Enterococcus faecium
29 Clostridium difficile
28 Mycobacterium abscessus
25 Campylobacter jejuni
25 Burkholderia pseudomallei
25 Bacillus thuringiensis
25 Bacillus anthracis
24 Methanobrevibacter smithii
24 Clostridium botulinum
24 Brucella abortus
24 Bacteroides sp.
23 Synechococcus sp.
23 Streptococcus pyogenes
23 Shigella flexneri
22 Streptococcus sanguinis
22 Francisella tularensis
21 Bacillus subtilis
20 Lactobacillus crispatus
18 Pseudomonas aeruginosa
18 Actinobacillus pleuropneumoniae
17 Treponema denticola
17 Neisseria gonorrhoeae
17 Lachnospiraceae bacterium
16 Borrelia burgdorferi
15 Wolbachia endosymbiont
15 Lactobacillus iners
15 Lactobacillus gasseri
15 Fusobacterium sp.
Make your own:
cat complete.txt incomplete.txt | cut -f 4 | cut -d " " -f 1,2 | sort | uniq -c | sort -rn | head -50
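The same idea works at genus level by keeping only the first word of the organism name. This is a minimal sketch, not a tested recipe for the NCBI files: it assumes, as the command above does, that the input is tab-separated with the organism name in column 4. The function name `genus_counts` is made up for illustration.

```shell
# Hypothetical helper: count entries per genus.
# Assumption: tab-separated input, organism name in column 4,
# genus is the first space-separated word of that name.
genus_counts() {
    cat "$@" \
      | cut -f 4 \
      | cut -d " " -f 1 \
      | sort \
      | uniq -c \
      | sort -rn    # numeric reverse sort, largest count first
}
```

Usage would look like `genus_counts complete.txt incomplete.txt | head -10`. Note that a header row, if the files have one, is counted like any other line and shows up once in the output.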
Nice!
I was going to ask you if you could map the data onto a taxonomic tree, but decided to do that myself. So, I made a file importable into MEGAN (http://ab.inf.uni-tuebingen.de/software/megan/) with this command:
cat complete.txt incomplete.txt | grep -v 'Organism' | awk 'BEGIN{FS=OFS="\t"}{x[$3]++}END{for (i in x){print i,x[i]}}'
and then summarized to genus level, the top ten:
Escherichia 363
Streptococcus 315
Staphylococcus 295
Helicobacter 198
Vibrio 198
Bacillus 186
Mycobacterium 159
Lactobacillus 158
Clostridium 145
Salmonella 145
Or phylum top 10:
Proteobacteria 2653
Firmicutes 1667
Actinobacteria 577
Bacteroidetes 272
Spirochaetes 168
Euryarchaeota 133
Cyanobacteria 107
Tenericutes 92
Chlamydiae 77
Crenarchaeota 46
Hi Nick,
So I noticed that you’ve included projects that haven’t published a draft genome sequence. Here’s a modified list that contains the Top 50 sequenced bacteria (with sequences that we can get our hands on, draft or complete):
173 Escherichia coli
82 Salmonella enterica
78 Staphylococcus aureus
69 Propionibacterium acnes
56 Streptococcus pneumoniae
56 Enterococcus faecalis
45 Bacillus cereus
42 Mycobacterium tuberculosis
36 Vibrio cholerae
29 Pseudomonas syringae
28 Listeria monocytogenes
27 Neisseria meningitidis
27 Helicobacter pylori
27 Enterococcus faecium
27 Acinetobacter baumannii
25 Yersinia pestis
23 Methanobrevibacter smithii
23 Clostridium difficile
23 Burkholderia pseudomallei
22 Campylobacter jejuni
21 Haemophilus influenzae
21 Chlamydia trachomatis
20 Bacteroides sp.
20 Bacillus thuringiensis
19 Bacillus anthracis
18 Synechococcus sp.
17 Neisseria gonorrhoeae
16 Clostridium botulinum
15 Streptococcus pyogenes
15 Actinobacillus pleuropneumoniae
14 Lactobacillus iners
14 Francisella tularensis
14 Borrelia burgdorferi
13 Prochlorococcus marinus
11 Moraxella catarrhalis
11 Buchnera aphidicola
11 Bifidobacterium longum
10 Ureaplasma urealyticum
10 Streptomyces sp.
10 Streptococcus sanguinis
10 Fusobacterium sp.
10 Burkholderia mallei
10 Brucella abortus
10 Bacillus subtilis
9 Wolbachia endosymbiont
9 Sulfolobus islandicus
9 Streptococcus suis
9 Streptococcus agalactiae
9 Rhizobium etli
9 Pseudomonas aeruginosa
How on earth is anyone going to handle 173 E. coli genomes? (or the eventual 363?)
Thanks guys – I like this collaborative blog post!
Great post, really thought-provoking as to why the distribution looks like this. E. coli and S. aureus make sense as the top two, but H. pylori as third? And more than twice as many P. acnes as C. diff! It kind of falls into place when you see that ~70 P. acnes genomes have been sequenced at one institute as part of the Human Microbiome Project, and a large swathe of the H. pylori seem to have been done at the same time as part of this effort as well.
It would be interesting to see a breakdown by project, or just by sequencing centre and date of submission, to get an idea of which are widely studied and which have been targeted by large projects such as HMP.
It would be difficult to do a breakdown per project or sequencing centre for genomes that haven’t got a draft sequence. A lot of those genome projects have no information on the isolate or the sequencing centre.
Why not H. pylori? It’s linked to cancer, which has a lot of money behind it. I’m not sure about the S. aureus, but the E. coli genomes coming out are either enterotoxigenic or non-O157:H7 enterohaemorrhagic E. coli (there must be a big push for an all-encompassing vaccine).
How quickly does NCBI update this? Some recent papers would each top the numbers above:
PMID: 20368420, April 2010: 30 Clostridium difficile isolates
PMID: 21273480, Jan 2011: 240 Streptococcus pneumoniae
PMID: 21383167, Mar 2011: 301 Streptococcus pyogenes (group A strep)
@krobison
Some of the large bacterial re-sequencing projects don’t deposit a short read assembly into GenBank and consequently haven’t registered each strain at GenBank. I’d expect that situation to become more common as the reward:effort ratio gets progressively smaller.
The list was just intended as a fun little indicator; it shows that bacterial genome sequencing is still generally focused on medically-relevant pathogens and the gamma-proteobacteria.
krobison,
the papers you’re citing weren’t de novo genome assemblies, but SNPs called against a reference so they wouldn’t be included in that NCBI database.
Nick,
I think, as Illumina de novo assemblies become standard (the recent ones I’ve seen with 5 kb jumps are amazing–qualitatively better than previous draft genomes), we’ll see resequencing fade away, and de novo sequencing take over.
nabil,
actually 90 of the currently available E. coli are commensals, which were done anticipating many more pathogens (we need context). In the future, given some of the projects I’m aware of, I think we’ll see about 2/3 pathogens, 1/3 commensals in E. coli.
@madthemikebiologist
I’m sure you are right but that may not mean people will still go to the trouble of submitting their assemblies to GenBank.
I’m curious to hear about your de novo sequencing results. I am guessing these are pure Illumina assemblies with paired-end data combined with mate-pair “jump” data. How many scaffolds do you generally see, and what kind of N50s with E. coli? Also, what assembler do you use? And how severe are misassemblies? We routinely get 1-10 scaffolds when assembling with 454 8 kb paired-end data combined with 454 fragment data. But I haven’t yet tried Illumina mate-pair to compare.
Thanks for the feedback, though the C. difficile paper did actually assemble their data, not just map it back to a reference.
I didn’t mean this to be a criticism, as one must work with the best data available, but I do suspect that the database will probably get less and less comprehensive very quickly. It
[...] a bioinformatician in the Mark Pallen Research Group at the University of Birmingham, listed the most sequenced bacterial genomes yesterday on the group’s Pathogens: Genes and Genomes blog. This sparked quite a few interesting [...]