Last updated: Feb. 8, 2017
CRAM format encodes genomic alignments to a reference.
It has better lossless (and optional lossy) compression compared to existing BAM format. It is also
becoming more widely used. We propose the addition of a CRAM Reader to biogo/hts
under hts/cram.
As more projects move to whole-genome sequencing across large cohorts, drive space to store alignments becomes a concern. This is the reason for CRAM which has a much smaller footprint than BAM. This is done largely through encoding difference to a reference rather than saving the full sequence and by integer compression schemes.
The functionality to be implemented will be driven by the specification: https://samtools.github.io/hts-specs/CRAMv3.pdf but limited to the types observed in the wild and implemented and used in htslib
The hts/cram Reader should match the hts/bam API as closely as is reasonable.
import "biogo/hts/cram"
var cr *cram.CRAM
var err error
cr, err = cram.NewReader(rdr, io.Reader, rd int, reference *sam.Reference)
cr.Omit(bam.AllVariableLengthData)
var hdr *cram.Header = cr.Header()
var rec *sam.Record
rec, err = cr.Read()The difference from the hts/bam API is the requirement of the reference argument to the constructor.
Note that extracting the sequence is costly, especially in CRAM. While the Omit method in hts/bam
provides a global level of control over if this cost is incurred, we may wish to add a finer level
Record.Sequence(r *sam.Reference) method so that the user has full control over exactly when
to incur this cost.
If the sam.Reference must be passed to the Sequence() method, then the cram.NewReader function would not
need that value.
We will need to understand how this would work with the regional query method in hts/bam that gets a slice of
[]bgzf.Chunk to be sent to a bam.Iterator.
the cram.Header will extend sam.Header like:
type Header struct {
*sam.Header
...
}TODO: what else is needed in the cram Header.
The CRAM specification lists a number of codecs. However, we will limit to those that are used in htslib.github Namely, those are :
- gzip
- bzip2
- lzma
- rANS
- huffman in single-code-only mode (?)
because at least gzip and bzip2 are already available in the go standard library as io.Readers all of these
should be implemented as io.Reader. This would be one of the first things to be implemented as the
API and the need are clear.
itf8 is a central data type in CRAM. For these, we will define:
type itf8 []uint8we can follow the implementation here
with a signature like:
func (i itf8) get32() (read int, val int32){...}
func (i itf8) get64() (read int, val int64){...}where read indicates how far to advance a pointer in the []byte.