Compare the speed of containerized bwa-mem using various base images

On a Linux VM or workstation with docker installed, fetch the GRCh38 FASTA, its index, and a pair of FASTQs:

wget -P /hot/ref https://storage.googleapis.com/genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa{,.fai}
wget -P /hot/reads/test https://storage.googleapis.com/data.cyri.ac/test_L001_R{1,2}_001.fastq.gz
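
To keep image downloads from skewing the timings below, the three bwa images used in the commands that follow can be pulled ahead of time:

docker pull ghcr.io/ucladx/bwa:0.7.17-clear
docker pull ghcr.io/ucladx/bwa:0.7.17-ubuntu
docker pull ghcr.io/ucladx/bwa:0.7.17-alpine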

If on a Slurm cluster, here is an example of wrapping a docker run command in an sbatch request:

sbatch --chdir=/hot --output=ref/std.out --error=ref/std.err --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --mem=30G --time=4:00:00 --wrap="docker run --help"
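
For example, the 16-thread clearlinux bwa-mem run from further below might be submitted like this (resource requests are guesses sized to that run, not tested values, and the $ are escaped so id runs on the compute node):

sbatch --chdir=/hot --output=reads/test/std.out --error=reads/test/std.err --nodes=1 --ntasks-per-node=1 --cpus-per-task=16 --mem=30G --time=1:00:00 --wrap="docker run --rm -v /hot:/hot -w /hot -u \$(id -u):\$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_clear.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz"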

To record and plot a container's CPU/RAM usage during runtime, install matplotlib and psrecord, then start psrecord a second after your container starts (or longer, if docker first needs to download the image). Here is an example of using psrecord while running bwa index on the GRCh38 FASTA:

docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa index ref/GRCh38_Verily_v1.genome.fa & sleep 1; psrecord $(docker inspect -f '{{.State.Pid}}' $(docker ps -l --format '{{.ID}}')) --include-children --interval 1 --plot /hot/ref/perf_bwa_index.png
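
Both tools install with pip, and naming the container makes the PID lookup sturdier than docker ps -l when other containers are running; this is the same measurement as above, with an arbitrary container name:

pip install matplotlib psrecord
docker run --rm --name bwa_index -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa index ref/GRCh38_Verily_v1.genome.fa & sleep 1; psrecord $(docker inspect -f '{{.State.Pid}}' bwa_index) --include-children --interval 1 --plot /hot/ref/perf_bwa_index.png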

To run bwa-mem using Clear/Ubuntu/Alpine Linux base images using 16 threads and 8Mbp chunk size:

docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_clear.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-ubuntu bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_ubuntu.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-alpine bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_alpine.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
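
Since bwa-mem output is reproducible for a fixed -K, the three SAMs should differ only in their @PG header lines (which embed the differing output filenames). A quick check that the base image changed nothing but speed:

diff <(grep -v '^@' reads/test/test_clear.sam) <(grep -v '^@' reads/test/test_ubuntu.sam) && diff <(grep -v '^@' reads/test/test_ubuntu.sam) <(grep -v '^@' reads/test/test_alpine.sam) && echo "alignments identical"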

These are runtimes/cputimes reported by bwa-mem using various parameters on an 8-core Intel Xeon W-2145 with 208GB RAM:

8 threads and 80Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop)
Runtime: 168.117 sec; CPU: 1324.228 sec; Image: clearlinux
Runtime: 167.644 sec; CPU: 1324.277 sec; Image: ubuntu
Runtime: 168.146 sec; CPU: 1325.243 sec; Image: alpine

16 threads and 80Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop)
Runtime: 128.370 sec; CPU: 1969.654 sec; Image: clearlinux
Runtime: 128.844 sec; CPU: 1975.693 sec; Image: ubuntu
Runtime: 129.009 sec; CPU: 1983.587 sec; Image: alpine

16 threads and 8Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop)
Runtime: 125.481 sec; CPU: 1950.107 sec; Image: clearlinux
Runtime: 125.565 sec; CPU: 1954.458 sec; Image: ubuntu
Runtime: 126.925 sec; CPU: 1959.460 sec; Image: alpine

16 threads and 8Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop with kernel mitigations=off)
Runtime: 122.767 sec; CPU: 1912.037 sec; Image: clearlinux
Runtime: 122.757 sec; CPU: 1914.046 sec; Image: ubuntu
Runtime: 122.927 sec; CPU: 1916.443 sec; Image: alpine

16 threads and 8Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop)
Runtime: 125.662 sec; CPU: 1945.060 sec; Image: clearlinux
Runtime: 135.173 sec; CPU: 2102.124 sec; Image: ubuntu
Runtime: 138.476 sec; CPU: 2121.609 sec; Image: alpine

16 threads and 8Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop with kernel mitigations=off)
Runtime: 123.638 sec; CPU: 1908.889 sec; Image: clearlinux
Runtime: 132.465 sec; CPU: 2065.746 sec; Image: ubuntu
Runtime: 133.132 sec; CPU: 2072.740 sec; Image: alpine

Notes:

  • Speedup from multithreading scales well with core count, and less so with hyperthreading.
  • Default chunk size -K is threads x 10Mbp, but a smaller chunk size seems to slightly reduce the initial delay of loading reads into memory before the CPU can start doing useful work (see the worked example after this list).
  • There is no measurable difference in performance between base images when the host runs kernel 5.12.14.
  • The clearlinux base image runs faster than the others when the host runs kernel 5.8.0.
  • There is negligible benefit to disabling spectre/meltdown mitigations, at least on these Xeon-based platforms.
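
To make the chunk-size note concrete: with 16 threads the default -K is 16 x 10,000,000 = 160,000,000 bp per batch, so the -K 8000000 used above loads 20x less data per batch, which is presumably why the threads start doing useful work sooner.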

These are runtimes/cputimes on a 10-core Intel Xeon W-1290P with 128GB RAM:

16 threads and 8Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop)
Runtime:  97.422 sec; CPU: 1524.188 sec; Image: clearlinux
Runtime: 105.887 sec; CPU: 1663.922 sec; Image: ubuntu
Runtime: 107.245 sec; CPU: 1681.384 sec; Image: alpine

20 threads and 10Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop)
Runtime:  90.679 sec; CPU: 1749.888 sec; Image: clearlinux
Runtime:  97.209 sec; CPU: 1869.157 sec; Image: ubuntu
Runtime:  98.398 sec; CPU: 1890.464 sec; Image: alpine

Try bwa-mem2, which can use AVX2 on the W-1290P and AVX512 on the W-2145:

docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa-mem:2.2.1 bwa-mem2 index ref/GRCh38_Verily_v1.genome.fa
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa-mem:2.2.1 bwa-mem2 mem -t 16 -K 16000000 -Y -D 0.05 -o reads/test/test_bwa-mem2.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
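
Before expecting a speedup, it is worth confirming which AVX extensions the host CPU actually reports (avx2 should appear on the W-1290P, avx512f and friends on the W-2145):

grep -oE 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u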

16 threads and 16Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop on Xeon W-2145 with AVX512)
Runtime: 58.88 sec

16 threads and 16Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop on Xeon W-1290P with AVX2)
Runtime: 55.04 sec

20 threads and 16Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop on Xeon W-1290P with AVX2)
Runtime: 52.45 sec

Notes:

  • bwa-mem2's speedup is significant enough to warrant its use over bwa-mem.
  • bwa-mem2 uses ~20GB of memory during runtime, while bwa-mem uses ~6GB; memory use increases with chunk size (see the sketch after this list).
  • For this workload, AVX512 on the W-2145 is not as beneficial as the faster clock speed of the W-1290P.
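
Given the ~20GB footprint noted above, a hard memory cap on the container is a cheap safeguard on shared hosts; here is the same bwa-mem2 run with an assumed (untested) 24g ceiling:

docker run --rm --memory=24g -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa-mem:2.2.1 bwa-mem2 mem -t 16 -K 16000000 -Y -D 0.05 -o reads/test/test_bwa-mem2.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz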

Try minimap2, which is known to be faster but is not considered the "best practice" aligner for short-read sequencing:

docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/minimap:2.21 minimap2 -x sr -d ref/GRCh38_Verily_v1.genome.fa.mmi ref/GRCh38_Verily_v1.genome.fa
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/minimap:2.21 minimap2 -t 20 -K 16M -x sr -Y -a -o reads/test/test_minimap2.sam ref/GRCh38_Verily_v1.genome.fa.mmi reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
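
For a rough cross-aligner comparison of output volume (alignment record counts only, not accuracy):

for f in reads/test/test_*.sam; do echo -n "$f: "; grep -vc '^@' "$f"; done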

20 threads and 16Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop on Xeon W-1290P)
Runtime: 33.860 sec; CPU: 393.543 sec; Peak RSS: 11.070 GB