On a Linux VM or workstation with Docker installed, fetch the GRCh38 FASTA, its index, and a pair of FASTQs:
wget -P /hot/ref https://storage.googleapis.com/genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa{,.fai}
wget -P /hot/reads/test https://storage.googleapis.com/data.cyri.ac/test_L001_R{1,2}_001.fastq.gz
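After the downloads finish, it's worth confirming the FASTQ pair is intact and in sync; a minimal check (function name is mine) that compares read counts between R1 and R2:

```shell
# check_pair: paired FASTQs must hold the same number of records (4 lines per read)
check_pair() {
    n1=$(zcat "$1" | wc -l)
    n2=$(zcat "$2" | wc -l)
    echo "$1: $((n1 / 4)) reads; $2: $((n2 / 4)) reads"
    [ "$n1" -eq "$n2" ]  # non-zero exit status if the pair is out of sync
}
# e.g. check_pair reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
```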
If on a Slurm cluster, here is an example of wrapping a docker run
command in an sbatch request:
sbatch --chdir=/hot --output=ref/std.out --error=ref/std.err --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --mem=30G --time=4:00:00 --wrap="docker run --help"
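For a real job, the --wrap string would carry the actual alignment command instead of docker run --help; a sketch with the same resource requests as above (the guard lets the script also run on a workstation without Slurm):

```shell
# Build the containerized command first so it can be reviewed before submission;
# single quotes keep $(id -u) unexpanded until the job actually runs
cmd='docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa index ref/GRCh38_Verily_v1.genome.fa'
if command -v sbatch >/dev/null 2>&1; then
    sbatch --chdir=/hot --output=ref/std.out --error=ref/std.err \
        --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --mem=30G \
        --time=4:00:00 --wrap="$cmd"
else
    echo "sbatch not found; would have submitted: $cmd"
fi
```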
To record and plot a container's CPU/RAM usage during runtime, install matplotlib and psrecord; then run psrecord a second after your container starts, or longer if docker needs time to download the image. Here is an example of using psrecord while running bwa index on the GRCh38 FASTA:
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa index ref/GRCh38_Verily_v1.genome.fa & sleep 1; psrecord $(docker inspect -f '{{.State.Pid}}' $(docker ps -l --format '{{.ID}}')) --include-children --interval 1 --plot /hot/ref/perf_bwa_index.png
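The one-liner above can be wrapped into a small helper so any container gets profiled the same way (function name and plot path are my own; assumes psrecord is on PATH):

```shell
# profile_container: start a docker run command in the background, find the
# container's host PID, and attach psrecord to it until the container exits
profile_container() {
    plot="$1"; shift
    "$@" &                       # e.g. docker run --rm ... bwa index ...
    sleep 1                      # give docker a moment to start the container
    pid=$(docker inspect -f '{{.State.Pid}}' "$(docker ps -l --format '{{.ID}}')")
    psrecord "$pid" --include-children --interval 1 --plot "$plot"
}
# Usage:
# profile_container /hot/ref/perf_bwa_index.png docker run --rm -v /hot:/hot \
#     -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear \
#     bwa index ref/GRCh38_Verily_v1.genome.fa
```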
To run bwa-mem with 16 threads and an 8Mbp chunk size on the Clear/Ubuntu/Alpine Linux base images:
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_clear.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-ubuntu bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_ubuntu.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-alpine bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_alpine.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
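Since the three commands differ only in image tag and output name, they can be looped; a sketch where DRY_RUN=1 (the default here) just prints each command so it can be reviewed before running:

```shell
DRY_RUN=${DRY_RUN:-1}
for base in clear ubuntu alpine; do
    # \$(id -u) is escaped so it expands only when the command is eval'd
    cmd="docker run --rm -v /hot:/hot -w /hot -u \$(id -u):\$(id -g) \
ghcr.io/ucladx/bwa:0.7.17-$base bwa mem -t 16 -K 8000000 -Y -D 0.05 \
-o reads/test/test_$base.sam ref/GRCh38_Verily_v1.genome.fa \
reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz"
    if [ "$DRY_RUN" -eq 1 ]; then echo "$cmd"; else eval "$cmd"; fi
done
```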
These are runtimes/cputimes reported by bwa-mem using various parameters on an 8-core Intel Xeon W-2145 with 208GB RAM:
8 threads and 80Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop)
Runtime: 168.117 sec; CPU: 1324.228 sec; Image: clearlinux
Runtime: 167.644 sec; CPU: 1324.277 sec; Image: ubuntu
Runtime: 168.146 sec; CPU: 1325.243 sec; Image: alpine
16 threads and 80Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop)
Runtime: 128.370 sec; CPU: 1969.654 sec; Image: clearlinux
Runtime: 128.844 sec; CPU: 1975.693 sec; Image: ubuntu
Runtime: 129.009 sec; CPU: 1983.587 sec; Image: alpine
16 threads and 8Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop)
Runtime: 125.481 sec; CPU: 1950.107 sec; Image: clearlinux
Runtime: 125.565 sec; CPU: 1954.458 sec; Image: ubuntu
Runtime: 126.925 sec; CPU: 1959.460 sec; Image: alpine
16 threads and 8Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop with kernel mitigations=off)
Runtime: 122.767 sec; CPU: 1912.037 sec; Image: clearlinux
Runtime: 122.757 sec; CPU: 1914.046 sec; Image: ubuntu
Runtime: 122.927 sec; CPU: 1916.443 sec; Image: alpine
16 threads and 8Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop)
Runtime: 125.662 sec; CPU: 1945.060 sec; Image: clearlinux
Runtime: 135.173 sec; CPU: 2102.124 sec; Image: ubuntu
Runtime: 138.476 sec; CPU: 2121.609 sec; Image: alpine
16 threads and 8Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop with kernel mitigations=off)
Runtime: 123.638 sec; CPU: 1908.889 sec; Image: clearlinux
Runtime: 132.465 sec; CPU: 2065.746 sec; Image: ubuntu
Runtime: 133.132 sec; CPU: 2072.740 sec; Image: alpine
Notes:
- Speedup from multithreading scales well with core count, and less so with hyperthreading.
- Default chunk size (-K) is threads x 10Mbp, but a smaller chunk size seems to slightly reduce the initial delay of loading reads into memory before the CPU can start doing useful work.
- There is no difference in base image performance when the host is running kernel 5.12.14.
- The clearlinux base image performs faster than others when the host is running kernel 5.8.0.
- There is negligible benefit in disabling spectre/meltdown mitigations, at least on Xeon-based platforms.
These are runtimes/cputimes on a 10-core Intel Xeon W-1290P with 128GB RAM:
16 threads and 8Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop)
Runtime: 97.422 sec; CPU: 1524.188 sec; Image: clearlinux
Runtime: 105.887 sec; CPU: 1663.922 sec; Image: ubuntu
Runtime: 107.245 sec; CPU: 1681.384 sec; Image: alpine
20 threads and 10Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop)
Runtime: 90.679 sec; CPU: 1749.888 sec; Image: clearlinux
Runtime: 97.209 sec; CPU: 1869.157 sec; Image: ubuntu
Runtime: 98.398 sec; CPU: 1890.464 sec; Image: alpine
Try bwa-mem2, which can use AVX2 on the W-1290P and AVX512 on the W-2145:
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa-mem:2.2.1 bwa-mem2 index ref/GRCh38_Verily_v1.genome.fa
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa-mem:2.2.1 bwa-mem2 mem -t 16 -K 16000000 -Y -D 0.05 -o reads/test/test_bwa-mem2.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
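bwa-mem2 is intended to produce alignments identical to bwa-mem 0.7.17, so its output can be sanity-checked against one of the earlier SAMs; a small helper (name is mine) that compares records while ignoring the @-prefixed headers, which differ at least in their @PG command lines:

```shell
# compare_sams: diff the alignment records of two SAM files, headers excluded;
# prints nothing and exits 0 when the records match
compare_sams() {
    diff <(grep -v '^@' "$1" | sort) <(grep -v '^@' "$2" | sort)
}
# e.g. compare_sams reads/test/test_clear.sam reads/test/test_bwa-mem2.sam
```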
16 threads and 16Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop on Xeon W-2145 with AVX512)
Runtime: 58.88 sec
16 threads and 16Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop on Xeon W-1290P with AVX2)
Runtime: 55.04 sec
20 threads and 16Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop on Xeon W-1290P with AVX2)
Runtime: 52.45 sec
Notes:
- bwa-mem2 has a speedup that is significant enough to warrant its use over bwa-mem.
- bwa-mem2 uses ~20GB of memory during runtime, while bwa-mem uses ~6GB; memory use increases with chunk size.
- For this workload, AVX512 on the W-2145 is not as beneficial as the faster clock speed on the W-1290P.
Try minimap2, which is known to be faster, though not considered best practice for short-read sequencing:
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/minimap:2.21 minimap2 -x sr -d ref/GRCh38_Verily_v1.genome.fa.mmi ref/GRCh38_Verily_v1.genome.fa
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/minimap:2.21 minimap2 -t 20 -K 16M -x sr -Y -a -o reads/test/test_minimap2.sam ref/GRCh38_Verily_v1.genome.fa.mmi reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
20 threads and 16Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop on Xeon W-1290P)
Runtime: 33.860 sec; CPU: 393.543 sec; Peak RSS: 11.070 GB
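The runtime/cputime numbers above come from the aligners' own logs: bwa, bwa-mem2, and minimap2 each print a "Real time: ... sec; CPU: ... sec" line to stderr at exit (minimap2 also reports Peak RSS). A small helper (name is mine; assumes stderr was redirected to .log files) to pull those numbers out:

```shell
# parse_times: extract runtime/cputime from aligner stderr logs
# (the trailing .* drops minimap2's extra "Peak RSS" field)
parse_times() {
    grep -h 'Real time' "$@" | \
        sed -E 's/.*Real time: ([0-9.]+) sec; CPU: ([0-9.]+) sec.*/runtime=\1s cputime=\2s/'
}
# e.g. docker run ... bwa mem ... 2> test_clear.log; parse_times *.log
```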