Skip to content

Instantly share code, notes, and snippets.

@lh3
Last active November 7, 2024 21:24
Show Gist options
  • Save lh3/bdd97255c2e39f8b2752883dbbc47c14 to your computer and use it in GitHub Desktop.
Save lh3/bdd97255c2e39f8b2752883dbbc47c14 to your computer and use it in GitHub Desktop.
Test BAM/CRAM file sizes
# download portable samtools binary v1.14 (not the latest version)
curl -L 'https://zenodo.org/records/5731013/files/htstools-1.14_x64-linux.tar.bz2?download=1' | tar -jxf - htstools-1.14_x64-linux/samtools
htstools-1.14_x64-linux/samtools # test run; the following command lines assume "samtools" is on $PATH
# download minimap2 binary; also easy to compile from the source code
curl -L https://github.com/lh3/minimap2/releases/download/v2.28/minimap2-2.28_x64-linux.tar.bz2 | tar -jxf - minimap2-2.28_x64-linux/minimap2
# download T2T-CHM13v2 analysis set
curl -L https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0_maskedY_rCRS.fa.gz | zcat > chm13v2.fa
samtools faidx chm13v2.fa
# download and extract the first million PacBio HiFi reads from an unaligned BAM
samtools view https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_HiFi-Revio_20231031/HG002_PacBio-Revio_m84039_230928_213653_s3.hifi_reads.bam|head -1000006|samtools view -b@4 - > test-1M.raw.bam
# convert unaligned BAM to unaligned CRAM
samtools view -O cram,no_ref -@8 -o test-1M.no_ref.cram test-1M.raw.bam
# align with minimap2 and copy tags to the output -- this took 30 minutes over 8 threads
samtools fastq -T ML,MM,ac,ec,ma,np,rq,sn,we,ws,zm,qs,qe,bc,bq,cx,bl,bt,ql,qt,ls test-1M.no_ref.cram | minimap2 -ax map-hifi -yt8 chm13v2.fa - | samtools view -@4 -bo test-1M.aln.bam -
# generate aligned files
samtools view -@8 -O cram -T chm13v2.fa test-1M.aln.bam > test-1M.aln.ref.cram # reference-based, unsorted
samtools sort -@8 -m2g -o test-1M.aln.srt.bam test-1M.aln.bam # BAM, sorted
samtools view -@8 -O cram,no_ref -o test-1M.aln.srt.no_ref.cram test-1M.aln.srt.bam
samtools view -@8 -O cram,embed_ref -T chm13v2.fa -o test-1M.aln.srt.embed_ref.cram test-1M.aln.srt.bam
samtools view -@8 -O cram -T chm13v2.fa -o test-1M.aln.srt.ref.cram test-1M.aln.srt.bam
# unaligned but in the sorted order
samtools view test-1M.aln.srt.bam|awk -v FS="\t" -v OFS="\t" '$11!="*"{$2=0;$3="*";$4=0;$5=0;$6="*";print}'|perl -pe 's/\t(NM|AS|ms|nn|tp|cm|s1|s2|de|rl):\S:\S+//g'|samtools view -O cram,no_ref -@8 -o test-1M.srt.no_ref.cram
8430935615 test-1M.aln.bam # bam, mapped, unsorted
8206539996 test-1M.raw.bam # bam, unmapped
7817221904 test-1M.aln.srt.bam # bam, mapped, sorted
6429553843 test-1M.no_ref.cram # cram, unmapped
6072214958 test-1M.no_ref1.cram # cram, unmapped (cram 3.1, archive)
4794146915 test-1M.aln.srt.no_ref.cram # cram, mapped, sorted (no reference)
4672641614 test-1M.srt.no_ref.cram # cram, mapping information removed, sorted (no reference)
3352373066 test-1M.aln.srt.embed_ref.cram # cram, mapped, sorted (embedded reference)
2611474583 test-1M.aln.ref.cram # cram, mapped, unsorted (with reference)
2594033487 test-1M.aln.srt.ref.cram # cram, mapped, sorted (with reference)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment