22 Responses

  1. peterjc
    September 19, 2011 at 3:05 pm |

    I want SAM/BAM to optionally store the reference sequence. Obviously this is a waste of space for a model organism, where all you need to know is the genome build (e.g. hg18 versus hg19), but for non-model organisms a self-contained assembly file is a big plus. And for a bacterium or virus, the size overhead isn’t worth worrying about.

    I’ve discussed getting SAM/BAM output from MIRA with its author Bastien Chevreux, and he felt SAM/BAM is lacking as a de novo assembly format. One issue was that it doesn’t allow for any annotation of the contigs (e.g. with consensus tags in ACE or in MIRA’s own output, a region can be marked as repetitive), which is very useful for manual finishing. However, if SAM/BAM doesn’t get built-in annotation, this can be handled by a companion file, e.g. a GFF3 file.
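    As a rough illustration of the companion-file idea, the sketch below formats one GFF3 line marking a region of a contig as repetitive. The function name, the `source` value and the contig name are made up for the example, and it does not escape special characters in attribute values as a full GFF3 writer would need to.

```python
def gff3_line(seqid, start, end, feature_type="repeat_region",
              source="example", attributes=None):
    """Format one GFF3 feature line (nine tab-separated columns).

    start and end are 1-based and inclusive, as GFF3 expects.
    Score, strand and phase are left undefined (".").
    """
    # Column 9 holds key=value pairs separated by semicolons.
    attrs = ";".join(f"{k}={v}" for k, v in (attributes or {}).items()) or "."
    columns = [seqid, source, feature_type, str(start), str(end),
               ".", ".", ".", attrs]
    return "\t".join(columns)


# Hypothetical contig and coordinates, for illustration only.
line = gff3_line("contig1", 100, 250, attributes={"Note": "repetitive"})
print(line)
```

    A viewer that loads both the BAM and the GFF3 file could then show the repeat tag alongside the reads, much like a consensus tag in ACE.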

    In the meantime, I’m maintaining my MIRA to SAM converter, which is working very nicely for me to visualise my paired-end assemblies: https://github.com/peterjc/maf2sam

  2. casbon
    September 19, 2011 at 3:28 pm |

    Isn’t it the Newbler 2.6 reference mapper that outputs BAM, not the assembler?

    This is slightly more relevant to resequencing, but a standard encoding is only part of the problem; you also need a standard way of using the encoding. For example, should deletions be left or right aligned? Newbler, for instance, outputs substitutions as insertions. This kind of thing should be explicit in the spec.
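    To make the left/right alignment question concrete, here is a minimal sketch that shifts a deletion as far left or right as it can go without changing the resulting sequence. The function names and the toy reference are assumptions for the example; positions are 0-based.

```python
def left_align_deletion(ref, pos, length):
    """Shift a deletion of ref[pos:pos+length] as far left as possible."""
    # Moving one base left leaves the result unchanged only if the base
    # entering the deletion equals the base leaving it on the right.
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos


def right_align_deletion(ref, pos, length):
    """Shift a deletion of ref[pos:pos+length] as far right as possible."""
    while pos + length < len(ref) and ref[pos] == ref[pos + length]:
        pos += 1
    return pos


def apply_deletion(ref, pos, length):
    """Return the sequence with ref[pos:pos+length] removed."""
    return ref[:pos] + ref[pos + length:]


ref = "GGACACACTT"          # toy reference with an AC repeat
left = left_align_deletion(ref, 4, 2)
right = right_align_deletion(ref, 4, 2)
print(left, right)
```

    Both placements describe the same deleted "AC" and give the same sequence, which is why two tools can report one event at different coordinates unless the spec fixes a convention.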

  3. kbradnam
    September 19, 2011 at 3:40 pm |

    It would obviously be great to have a single file format that can represent all genome assemblies currently in circulation and all of those yet to come. There were various discussions of assembly formats at the Genome Assembly Workshop in Santa Cruz earlier this year. Also, Deanna Church at the NCBI has been working towards a new submission format for genome assemblies (see http://www.ncbi.nlm.nih.gov/projects/genome/assembly/model.shtml). I also recall the ALLPATHS group proposing a new (variant) FASTA format in their ALLPATHS paper that would allow you to specify more uncertainty about a genome assembly.

    I think everyone would support the effort to develop a standard, but getting to that point is probably not going to be as straightforward as you might hope. There are always divided opinions on such matters, and I’ve noticed that the genome assembly community seems to have particularly strong opinions on many issues.

    I think the Assemblathon can have a role in ‘brokering a deal’ but that we also shouldn’t be seen to dictate a preference for any particular format. The choice of a standard should be reached with as much agreement from the genome assembly community as possible. After the forthcoming CSHL Genome Informatics meeting, there will immediately be a short Genome assembly ‘mini meeting’ (a half day I think) and this might be another good time to revisit this issue.

    As the Assemblathon 2 submission deadline is just over a week away, I don’t think we can change anything for that, as it would probably mean many groups couldn’t (or wouldn’t) enter. But it might be something to think about before Assemblathon 3 starts (assuming we get funding to continue)… maybe organize a one-day meeting (or virtual meeting) to have a discussion about this.

  4. Jared Simpson
    September 19, 2011 at 3:40 pm |

    A few comments as a developer of assembly tools. Tracking the placement of reads onto contigs throughout the assembly requires a lot of memory, which is the reason most NGS assemblers do not natively output this information. It is particularly difficult for de Bruijn graph assemblers, as some reads map to paths through the graph, not a single vertex/edge. In both assemblers I have been involved with (ABySS and SGA) I have chosen to realign the reads to the contigs to recover their placement. Both projects now use BAM to represent the read placements. Shaun Jackman and I have discussed standardising the rest of the scaffolding process to allow us to swap algorithms – for instance using the new ABySS scaffolder with SGA contigs, or vice versa. There are still a few file formats to settle on (the representation of the assembly graph and the distance estimates between contigs) but making BAM the representation of read alignments is a good start.

  5. peterjc
    September 19, 2011 at 3:45 pm |

    In reply to Jared, I know the CLC Bio assembler takes a similar approach – they tell you to map the reads onto the assembled FASTA files afterwards.

  6. peterjc
    September 19, 2011 at 3:47 pm |

    In reply to kbradnam, I agree it is too late to ask for SAM/BAM in Assemblathon 2, but it is an interesting proposal for Assemblathon 3.

  7. peterjc
    September 19, 2011 at 3:55 pm |

    Thread started on samtools-devel to raise specific proposals for how to improve SAM/BAM for assemblies:
    http://sourceforge.net/mailarchive/message.php?msg_id=28109794

  8. flxlex
    September 19, 2011 at 5:13 pm |

    Hi Nick and others,

    Talking directly to 454 people, they tell me SAM/BAM support for gsAssembler is just around the corner…

  9. Shaun Jackman
    September 19, 2011 at 6:18 pm |

    BAM seems the obvious choice for the assembly file format to record where reads are placed in the assembly.

    ABySS doesn’t track where the reads are placed during the assembly. Since the reads are broken up into k-mers, ABySS knows where the k-mers go, but it doesn’t know where the reads go. I map the reads back to the final assembly using a short-read mapper. The next release of ABySS will include an option to map the reads back to the final assembly and produce a BAM file.
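    A toy sketch of why this happens: once reads are reduced to a set of k-mers, nothing records which read each k-mer came from, so read placement has to be recovered afterwards. The exact-match search below stands in for a real short-read mapper (which would also handle mismatches and the reverse strand); all names and sequences are invented for the example and are not ABySS code.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq; read identity is not kept."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def map_reads_back(contig, reads):
    """Place each read on the contig by exact forward-strand match.

    Returns {read_name: 0-based start, or None if the read is not found}.
    """
    placements = {}
    for name, seq in reads.items():
        pos = contig.find(seq)
        placements[name] = pos if pos != -1 else None
    return placements


contig = "ACGTACGGA"
reads = {"r1": "ACGTAC", "r2": "TACGGA", "r3": "TTTT"}

# The pooled k-mer set is all a de Bruijn graph assembler starts from.
pooled = set()
for seq in reads.values():
    pooled |= kmers(seq, 4)

placements = map_reads_back(contig, reads)
print(placements)
```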

    I’d also like to see a standard format to record which contigs are known to overlap with other contigs, that is, an overlap graph. An overlap graph is two things: one, a graph, and two, a set of overlap alignments. As such, I’d like to see the overlap graph file format build on either an existing graph file format (ABySS uses Graphviz DOT) or an existing alignment file format, such as SAM. There are a lot of existing useful tools to manipulate and visualize Graphviz DOT files.
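    As a minimal sketch of the graph half of that idea, the function below writes contig overlaps as a Graphviz DOT digraph. The edge attribute name `olen`, the graph name and the contig labels are hypothetical choices for the example, not the schema that ABySS actually writes.

```python
def overlaps_to_dot(overlaps, name="overlaps"):
    """Render (source, target, overlap_length) tuples as a DOT digraph.

    Node labels are quoted so that oriented contig names such as
    "contig1+" are valid DOT identifiers.
    """
    lines = [f"digraph {name} {{"]
    for src, dst, length in overlaps:
        # olen is an illustrative attribute holding the overlap length.
        lines.append(f'  "{src}" -> "{dst}" [olen={length}];')
    lines.append("}")
    return "\n".join(lines)


dot = overlaps_to_dot([("contig1+", "contig2-", 31),
                       ("contig2-", "contig3+", 27)])
print(dot)
```

    The overlap alignments themselves are not captured here; they would need a companion alignment record, for example in SAM.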

    There are other assembly file formats to standardize, such as estimated distances between contigs and where contigs are placed in scaffolds.

    Cheers,
    Shaun

  10. dmchurch
    September 19, 2011 at 6:33 pm |

    I have a question about terminology: my understanding is that SAM/BAM is really about defining alignments, not assemblies. Conceivably, you could write code that could take SAM/BAM and generate the files needed to define an assembly (FASTA at a minimum, AGP if you have higher order structures like scaffolds and/or chromosomes). Keith already put a link to the files needed to submit an assembly to GenBank.
    What would be great is to come up with a way to use the information in a BAM to mark up an assembly with high-quality or low-quality regions based on defined metrics. Most users treat an assembly as uniform, when typically it has high-quality and low-quality regions; finding a way to express that to users would be great. I’m open to suggestions!
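    One possible metric is read depth. The sketch below assumes per-base depths have already been extracted from the BAM by some other tool, and flags runs of positions below a threshold; the function name, the depth values and the threshold are all made up for illustration, and depth alone is only one of many signals one might use.

```python
def low_coverage_regions(depths, min_depth):
    """Return (start, end) intervals where depth < min_depth.

    Intervals are 0-based and half-open, i.e. positions start..end-1.
    """
    regions = []
    start = None
    for i, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = i          # a low-coverage run begins here
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:          # run extends to the end of the contig
        regions.append((start, len(depths)))
    return regions


depths = [12, 11, 3, 2, 10, 9, 1]   # hypothetical per-base read depth
regions = low_coverage_regions(depths, min_depth=5)
print(regions)
```

    The resulting intervals could be written out as annotation features so that users see the low-confidence stretches rather than a uniform-looking sequence.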

  11. dakl
    September 22, 2011 at 8:53 am |

    Great idea. What do Heng Li and the rest of the samtools devs say about it? I think that’s an important issue, since they are the ones developing the SAM/BAM formats. Are they involved in your ideas?

  12. peterjc
    September 22, 2011 at 3:23 pm |

    It looks like we may have come to an agreement on how to use SAM/BAM with a padded (gapped) reference sequence, which will be useful for (de novo) assemblies.
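    For readers unfamiliar with padded coordinates, here is a small sketch of how positions in a padded (gapped) consensus relate to positions in the unpadded sequence. It assumes pads are written as `*`, as in ACE files; the function name and the example consensus are invented.

```python
def padded_to_unpadded(padded_consensus, pad="*"):
    """Map each padded position to its unpadded position.

    Returns a list with one entry per padded column: the 0-based
    unpadded coordinate, or None where the column is a pad.
    """
    mapping = []
    unpadded = 0
    for base in padded_consensus:
        if base == pad:
            mapping.append(None)   # pad column has no unpadded coordinate
        else:
            mapping.append(unpadded)
            unpadded += 1
    return mapping


mapping = padded_to_unpadded("AC*GT")
print(mapping)
```

    With a padded reference, a read carrying an extra base at the pad column simply aligns there, while the other reads show a gap in that column.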

  13. peterjc
    September 22, 2011 at 4:57 pm |

    I’ve done a blog post with some example screenshots showing a SAM/BAM file with a padded (gapped) reference/consensus sequence, and how similar this is to the traditional view of insertion sequences as shown in an ACE file or similar:

    http://blastedbio.blogspot.com/2011/09/sambam-with-gapped-reference.html

  14. Links 9/27/11 and Programming Note | Mike the Mad Biologist

    […] could have prevented nearly half of children’s deaths, CDC says Enrolling in the School of Ants SAM/BAM: It’s time for a single standard for assembly output Fall, Flu Shots, and Fear How the refrigerator got its […]

  15. peterjc
    October 4, 2011 at 8:56 am |

    Follow up blog post about how some viewers already try to show inserts as columns using traditional SAM/BAM with unpadded (ungapped) reference/consensus:

    http://blastedbio.blogspot.com/2011/10/sambam-without-gapped-reference.html
