On a Linux VM or workstation with Docker installed, fetch the GRCh38 FASTA, its index, and a pair of FASTQs:
wget -P /hot/ref https://storage.googleapis.com/genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa{,.fai}
wget -P /hot/reads/test https://storage.googleapis.com/data.cyri.ac/test_L001_R{1,2}_001.fastq.gz
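After the downloads finish, it's worth confirming the FASTQ pair is intact and in sync; a minimal check (function name is mine) that compares read counts between R1 and R2:

```shell
# check_pair: paired FASTQs must hold the same number of records (4 lines per read)
check_pair() {
    n1=$(zcat "$1" | wc -l)
    n2=$(zcat "$2" | wc -l)
    echo "$1: $((n1 / 4)) reads; $2: $((n2 / 4)) reads"
    [ "$n1" -eq "$n2" ]  # non-zero exit status if the pair is out of sync
}
# e.g. check_pair reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
```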
If on a Slurm cluster, here is an example of wrapping a docker run
command in an sbatch request:
sbatch --chdir=/hot --output=ref/std.out --error=ref/std.err --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --mem=30G --time=4:00:00 --wrap="docker run --help"
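For a real job, the --wrap string would carry the actual alignment command instead of docker run --help; a sketch with the same resource requests as above (the guard lets the script also run on a workstation without Slurm):

```shell
# Build the containerized command first so it can be reviewed before submission;
# single quotes keep $(id -u) unexpanded until the job actually runs
cmd='docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa index ref/GRCh38_Verily_v1.genome.fa'
if command -v sbatch >/dev/null 2>&1; then
    sbatch --chdir=/hot --output=ref/std.out --error=ref/std.err \
        --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --mem=30G \
        --time=4:00:00 --wrap="$cmd"
else
    echo "sbatch not found; would have submitted: $cmd"
fi
```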
To record and plot a container's CPU/RAM usage during runtime, install matplotlib and psrecord; then run psrecord a second after your container starts, or longer if docker needs time to download the image. Here is an example of using psrecord while running bwa index on the GRCh38 FASTA:
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa index ref/GRCh38_Verily_v1.genome.fa & sleep 1; psrecord $(docker inspect -f '{{.State.Pid}}' $(docker ps -l --format '{{.ID}}')) --include-children --interval 1 --plot /hot/ref/perf_bwa_index.png
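The one-liner above can be wrapped into a small helper so any container gets profiled the same way (function name and plot path are my own; assumes psrecord is on PATH):

```shell
# profile_container: start a docker run command in the background, find the
# container's host PID, and attach psrecord to it until the container exits
profile_container() {
    plot="$1"; shift
    "$@" &                       # e.g. docker run --rm ... bwa index ...
    sleep 1                      # give docker a moment to start the container
    pid=$(docker inspect -f '{{.State.Pid}}' "$(docker ps -l --format '{{.ID}}')")
    psrecord "$pid" --include-children --interval 1 --plot "$plot"
}
# Usage:
# profile_container /hot/ref/perf_bwa_index.png docker run --rm -v /hot:/hot \
#     -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear \
#     bwa index ref/GRCh38_Verily_v1.genome.fa
```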
To run bwa-mem with 16 threads and an 8Mbp chunk size on the Clear/Ubuntu/Alpine Linux base images:
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-clear bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_clear.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-ubuntu bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_ubuntu.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa:0.7.17-alpine bwa mem -t 16 -K 8000000 -Y -D 0.05 -o reads/test/test_alpine.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
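Since the three commands differ only in image tag and output name, they can be looped; a sketch where DRY_RUN=1 (the default here) just prints each command so it can be reviewed before running:

```shell
DRY_RUN=${DRY_RUN:-1}
for base in clear ubuntu alpine; do
    # \$(id -u) is escaped so it expands only when the command is eval'd
    cmd="docker run --rm -v /hot:/hot -w /hot -u \$(id -u):\$(id -g) \
ghcr.io/ucladx/bwa:0.7.17-$base bwa mem -t 16 -K 8000000 -Y -D 0.05 \
-o reads/test/test_$base.sam ref/GRCh38_Verily_v1.genome.fa \
reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz"
    if [ "$DRY_RUN" -eq 1 ]; then echo "$cmd"; else eval "$cmd"; fi
done
```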
These are runtimes/cputimes reported by bwa-mem using various parameters on an 8-core Intel Xeon W-2145 with 208GB RAM:
8 threads and 80Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop)
Runtime: 168.117 sec; CPU: 1324.228 sec; Image: clearlinux
Runtime: 167.644 sec; CPU: 1324.277 sec; Image: ubuntu
Runtime: 168.146 sec; CPU: 1325.243 sec; Image: alpine
16 threads and 80Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop)
Runtime: 128.370 sec; CPU: 1969.654 sec; Image: clearlinux
Runtime: 128.844 sec; CPU: 1975.693 sec; Image: ubuntu
Runtime: 129.009 sec; CPU: 1983.587 sec; Image: alpine
16 threads and 8Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop)
Runtime: 125.481 sec; CPU: 1950.107 sec; Image: clearlinux
Runtime: 125.565 sec; CPU: 1954.458 sec; Image: ubuntu
Runtime: 126.925 sec; CPU: 1959.460 sec; Image: alpine
16 threads and 8Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop with kernel mitigations=off)
Runtime: 122.767 sec; CPU: 1912.037 sec; Image: clearlinux
Runtime: 122.757 sec; CPU: 1914.046 sec; Image: ubuntu
Runtime: 122.927 sec; CPU: 1916.443 sec; Image: alpine
16 threads and 8Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop)
Runtime: 125.662 sec; CPU: 1945.060 sec; Image: clearlinux
Runtime: 135.173 sec; CPU: 2102.124 sec; Image: ubuntu
Runtime: 138.476 sec; CPU: 2121.609 sec; Image: alpine
16 threads and 8Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop with kernel mitigations=off)
Runtime: 123.638 sec; CPU: 1908.889 sec; Image: clearlinux
Runtime: 132.465 sec; CPU: 2065.746 sec; Image: ubuntu
Runtime: 133.132 sec; CPU: 2072.740 sec; Image: alpine
Notes:
- Speedup from multithreading scales well with core count, and less so with hyperthreading.
- Default chunk size (-K) is threads x 10Mbp, but a smaller chunk size seems to slightly reduce the initial delay of loading reads into memory before the CPU can start doing useful work.
- There is no difference in base image performance when the host is running kernel 5.12.14.
- The clearlinux base image performs faster than others when the host is running kernel 5.8.0.
- There is negligible benefit in disabling spectre/meltdown mitigations, at least on Xeon-based platforms.
These are runtimes/cputimes on a 10-core Intel Xeon W-1290P with 128GB RAM:
16 threads and 8Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop)
Runtime: 97.422 sec; CPU: 1524.188 sec; Image: clearlinux
Runtime: 105.887 sec; CPU: 1663.922 sec; Image: ubuntu
Runtime: 107.245 sec; CPU: 1681.384 sec; Image: alpine
20 threads and 10Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop)
Runtime: 90.679 sec; CPU: 1749.888 sec; Image: clearlinux
Runtime: 97.209 sec; CPU: 1869.157 sec; Image: ubuntu
Runtime: 98.398 sec; CPU: 1890.464 sec; Image: alpine
Try bwa-mem2, which can use AVX2 on the W-1290P and AVX512 on the W-2145:
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa-mem:2.2.1 bwa-mem2 index ref/GRCh38_Verily_v1.genome.fa
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/bwa-mem:2.2.1 bwa-mem2 mem -t 16 -K 16000000 -Y -D 0.05 -o reads/test/test_bwa-mem2.sam ref/GRCh38_Verily_v1.genome.fa reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
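bwa-mem2 is intended to produce alignments identical to bwa-mem 0.7.17, so its output can be sanity-checked against one of the earlier SAMs; a small helper (name is mine) that compares records while ignoring the @-prefixed headers, which differ at least in their @PG command lines:

```shell
# compare_sams: diff the alignment records of two SAM files, headers excluded;
# prints nothing and exits 0 when the records match
compare_sams() {
    diff <(grep -v '^@' "$1" | sort) <(grep -v '^@' "$2" | sort)
}
# e.g. compare_sams reads/test/test_clear.sam reads/test/test_bwa-mem2.sam
```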
16 threads and 16Mbp chunks on Linux kernel 5.12.14 (ClearLinux Desktop on Xeon W-2145 with AVX512)
Runtime: 58.88 sec
16 threads and 16Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop on Xeon W-1290P with AVX2)
Runtime: 55.04 sec
20 threads and 16Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop on Xeon W-1290P with AVX2)
Runtime: 52.45 sec
Notes:
- bwa-mem2 has a speedup that is significant enough to warrant its use over bwa-mem.
- bwa-mem2 uses ~20GB of memory during runtime, while bwa-mem uses ~6GB; memory use increases with chunk size.
- For this workload, AVX512 on the W-2145 is not as beneficial as the faster clock speed on the W-1290P.
Try minimap2, which is known to be faster, though not considered best practice for short-read sequencing:
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/minimap:2.21 minimap2 -x sr -d ref/GRCh38_Verily_v1.genome.fa.mmi ref/GRCh38_Verily_v1.genome.fa
docker run --rm -v /hot:/hot -w /hot -u $(id -u):$(id -g) ghcr.io/ucladx/minimap:2.21 minimap2 -t 20 -K 16M -x sr -Y -a -o reads/test/test_minimap2.sam ref/GRCh38_Verily_v1.genome.fa.mmi reads/test/test_L001_R1_001.fastq.gz reads/test/test_L001_R2_001.fastq.gz
20 threads and 16Mbp chunks on Linux kernel 5.8.0 (Ubuntu Desktop on Xeon W-1290P)
Runtime: 33.860 sec; CPU: 393.543 sec; Peak RSS: 11.070 GB
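The runtime/cputime numbers above come from the aligners' own logs: bwa, bwa-mem2, and minimap2 each print a "Real time: ... sec; CPU: ... sec" line to stderr at exit (minimap2 also reports Peak RSS). A small helper (name is mine; assumes stderr was redirected to .log files) to pull those numbers out:

```shell
# parse_times: extract runtime/cputime from aligner stderr logs
# (the trailing .* drops minimap2's extra "Peak RSS" field)
parse_times() {
    grep -h 'Real time' "$@" | \
        sed -E 's/.*Real time: ([0-9.]+) sec; CPU: ([0-9.]+) sec.*/runtime=\1s cputime=\2s/'
}
# e.g. docker run ... bwa mem ... 2> test_clear.log; parse_times *.log
```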