WORK IN PROGRESS!
We benchmarked the following operations (each operation was timed
with system.time()):
## --- random matrix (dense) ---
m <- 12000
n <- 2500
set.seed(123)
A <- matrix(rnorm(m * n), nrow=m) # A: a 12,000 x 2,500 matrix
system.time(A_cp <- crossprod(A)) # A_cp: a symmetric 2,500 x 2,500 matrix
system.time(A_tcp <- tcrossprod(A)) # A_tcp: a symmetric 12,000 x 12,000 matrix
system.time(MULT <- A_cp %*% A_cp)
system.time(A_svd_30 <- svd(A, nu=30, nv=30))
## RSpectra::svds() can also take advantage of BLAS:
library(RSpectra)
system.time(A_RSsvds_30 <- svds(A, k=30))
system.time(A_RSsvds_100 <- svds(A, k=100))
## --- single cell count (sparse) ---
library(scRNAseq)
sce <- fetchDataset("zilionis-lung-2019", "2023-12-20", path="human")
logcounts(sce) <- log1p(assay(sce))
## Select the top 1000 highest mean genes
rs <- rowSums(logcounts(sce))
sce2 <- sce[rank(-rs) <= 1000, ]
M <- as.matrix(logcounts(sce2)) # dense matrix with 89% zeros!
## base::svd()
system.time(M_svd_30 <- svd(M, nu=30, nv=30))
system.time(M_svd_full <- svd(M))
## RSpectra::svds() # also takes advantage of BLAS
system.time(M_RSsvds_30 <- svds(M, k=30))
system.time(M_RSsvds_100 <- svds(M, k=100))
## Unlike base::svd(), RSpectra::svds() can operate natively on
## a sparse matrix. However, it seems that in the sparse case,
## which BLAS is used makes no difference! (see benchmarks below)
S <- as(M, "dgCMatrix") # sparse matrix
system.time(S_RSsvds_30 <- svds(S, k=30))
system.time(S_RSsvds_100 <- svds(S, k=100))

sessionInfo():
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] scRNAseq_2.23.0 SingleCellExperiment_1.31.1
[3] SummarizedExperiment_1.39.1 Biobase_2.69.0
[5] GenomicRanges_1.61.1 Seqinfo_0.99.2
[7] IRanges_2.43.0 S4Vectors_0.47.0
[9] BiocGenerics_0.55.1 generics_0.1.4
[11] MatrixGenerics_1.21.0 matrixStats_1.5.0
[13] RSpectra_0.16-2
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 alabaster.se_1.9.0 dplyr_1.1.4
[4] blob_1.2.4 filelock_1.0.3 Biostrings_2.77.2
[7] bitops_1.0-9 lazyeval_0.2.2 fastmap_1.2.0
[10] RCurl_1.98-1.17 BiocFileCache_2.99.5 GenomicAlignments_1.45.2
[13] XML_3.99-0.18 lifecycle_1.0.4 ProtGenerics_1.41.0
[16] alabaster.matrix_1.9.0 KEGGREST_1.49.1 alabaster.base_1.9.5
[19] RSQLite_2.4.2 magrittr_2.0.3 compiler_4.5.1
[22] rlang_1.1.6 tools_4.5.1 yaml_2.3.10
[25] rtracklayer_1.69.1 S4Arrays_1.9.1 bit_4.6.0
[28] curl_6.4.0 DelayedArray_0.35.2 abind_1.4-8
[31] BiocParallel_1.43.4 HDF5Array_1.37.0 gypsum_1.5.0
[34] grid_4.5.1 ExperimentHub_2.99.5 Rhdf5lib_1.31.0
[37] cli_3.6.5 crayon_1.5.3 httr_1.4.7
[40] rjson_0.2.23 DBI_1.2.3 cachem_1.1.0
[43] rhdf5_2.53.4 parallel_4.5.1 AnnotationDbi_1.71.1
[46] AnnotationFilter_1.33.0 BiocManager_1.30.26 XVector_0.49.0
[49] restfulr_0.0.16 alabaster.schemas_1.9.0 vctrs_0.6.5
[52] Matrix_1.7-3 jsonlite_2.0.0 bit64_4.6.0-1
[55] ensembldb_2.33.1 alabaster.ranges_1.9.1 GenomicFeatures_1.61.6
[58] h5mread_1.1.1 glue_1.8.0 codetools_0.2-20
[61] BiocVersion_3.22.0 GenomeInfoDb_1.45.9 UCSC.utils_1.5.0
[64] BiocIO_1.19.0 tibble_3.3.0 pillar_1.11.0
[67] rappdirs_0.3.3 rhdf5filters_1.21.0 R6_2.6.1
[70] dbplyr_2.5.0 httr2_1.2.1 alabaster.sce_1.9.0
[73] lattice_0.22-7 AnnotationHub_3.99.6 png_0.1-8
[76] Rsamtools_2.25.2 memoise_2.0.1 Rcpp_1.1.0
[79] SparseArray_1.9.1 pkgconfig_2.0.3
On Linux we compared the following BLAS alternatives with R "built-in" BLAS (a.k.a. Rblas):
- OpenBLAS (-pthread, -openmp, -serial flavors)
- NVBLAS
- Intel MKL (Intel Math Kernel Library) -- todo
- ATLAS -- todo
- BLIS -- todo
For this comparison we used four Jetstream2 m3.quad instances (each with 4 cpus, 15 GB of RAM) with Ubuntu 24.04 and R-4.5.1 + BioC 3.22.
| Operation(*) | Rblas | openblas -pthread | openblas -openmp | openblas -serial |
|---|---|---|---|---|
| A_cp | 38.492 | 0.537 | 0.535 | 1.725 |
| A_tcp | 135.476 | 2.769 | 2.708 | 8.853 |
| MULT | 8.447 | 0.206 | 0.206 | 0.723 |
| A_svd_30 | 203.369 | 8.819 | 7.197 | 17.873 |
| A_RSsvds_30 | 18.127 | 8.162 | 7.791 | 5.880 |
| A_RSsvds_100 | 35.826 | 18.762 | 15.567 | 12.117 |
| M_svd_30 | 361.366 | 49.542 | 43.615 | 59.258 |
| M_svd_full | 384.136 | 49.153 | 43.520 | 56.418 |
| M_RSsvds_30 | 38.777 | 10.872 | 10.255 | 10.837 |
| M_RSsvds_100 | 109.056 | 35.547 | 37.843 | 30.569 |
| S_RSsvds_30 | 4.513 | 4.678 | 4.471 | 4.446 |
| S_RSsvds_100 | 12.904 | 13.222 | 12.825 | 12.763 |
(*) See "1. The operations that we benchmarked" at the top of this document for the details of each operation.
For this comparison we used the Bioconductor GPU machine kakapo1 (96 cpus, 128 GB of RAM, NVIDIA L40S GPU) with Ubuntu 24.04 and R-4.5.1 + BioC 3.22. We made two separate installations of R-4.5.1 on the machine: one linked to openblas-openmp and one linked to nvblas.
Each column shows the BLAS used and the OPENBLAS_NUM_THREADS setting for that run.

| Operation(**) | openblas unset(*) | nvblas unset(*) | openblas 4 | nvblas 4 | openblas 1 | nvblas 1 |
|---|---|---|---|---|---|---|
| A_cp | 0.088 | 0.366 | 0.088 | 0.357 | 0.099 | 0.350 |
| A_tcp | 0.799 | 1.376 | 0.821 | 1.347 | 0.817 | 1.343 |
| MULT | 0.118 | 0.083 | 0.119 | 0.082 | 0.119 | 0.082 |
| A_svd_30 | 4.293 | 11.586 | 4.327 | 11.525 | 4.248 | 11.217 |
| A_RSsvds_30 | 2.204 | 1.564 | 5.158 | 5.288 | 2.590 | 1.688 |
| A_RSsvds_100 | 3.924 | 3.319 | 10.255 | 10.982 | 5.378 | 3.438 |
| M_svd_30 | 22.033 | 32.580 | 24.522 | 30.852 | 24.421 | 31.348 |
| M_svd_full | 21.144 | 30.189 | 25.215 | 31.092 | 24.201 | 31.267 |
| M_RSsvds_30 | 6.741 | 6.740 | 6.563 | 7.035 | 5.803 | 6.766 |
| M_RSsvds_100 | 17.258 | 16.757 | 16.477 | 17.235 | 11.502 | 18.971 |
| S_RSsvds_30 | 4.579 | 4.498 | 4.311 | 4.405 | 4.308 | 4.398 |
| S_RSsvds_100 | 12.677 | 12.632 | 12.205 | 12.343 | 12.172 | 12.291 |
(*) With OPENBLAS_NUM_THREADS unset, openblas-openmp has access to kakapo1's 96 cpus!
(**) See "1. The operations that we benchmarked" at the top of this document for the details of each operation.
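Side note: a minimal sketch of how one can confirm, from within R, what a given
session actually sees (OPENBLAS_NUM_THREADS must be set in the shell before R is
started for it to take effect):
Sys.getenv("OPENBLAS_NUM_THREADS", unset="<unset>")  ## value seen by this session
parallel::detectCores()                              ## 96 on kakapo1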
On macOS we compared vecLib-based BLAS with R "built-in" BLAS (a.k.a. Rblas).
TODO
TODO
Work-in-progress!
The book builds for BioC 3.22 run on nebbiolo2: 72 cpus, 128 GB of RAM, Ubuntu 24.04, R-4.5.1. See https://bioconductor.org/checkResults/3.22/books-LATEST/
The table below reports bookdown::render_book("index.Rmd") times for the various sub-books of the OSCA book.
All times are in seconds. OPENBLAS_NUM_THREADS is set to 4.
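For reference, a minimal sketch of how each sub-book build can be timed (assuming
the current directory is the sub-book's source directory):
library(bookdown)
system.time(render_book("index.Rmd"))  ## elapsed time in seconds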
| Sub-book | Rblas | openblas -pthread | openblas -openmp | openblas -serial |
|---|---|---|---|---|
| OSCA.intro | 449.3 | 426.2 | ??? | 424.7 |
| OSCA.basic | 865.9 | 843.6 | ??? | 846.6 |
| OSCA.advanced | error! | 3109.1 | ??? | 3147.3 |
| OSCA.multisample | 2893.8 | 2813.4 | ??? | 2862.8 |
| OSCA.workflows | error! | 3475.3 | TIMEOUT | 3489.1 |
| Total time | ??? | 10667.6 | NA | 10770.5 |
| Overall speedup (*) | --- | ???% | NA | ???% |
(*) Overall relative speedup is with respect to Rblas.
TODO
When the result is a matrix (A_cp, A_tcp, and MULT), the maximum relative error
can simply be computed with max(abs((Ralt - Rref)/Rref)), where Rref is the result
obtained with Rblas and Ralt the result obtained with the alternative BLAS.
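For example, here is a minimal sketch, assuming the A_cp results were saved to the
(hypothetical) files A_cp_Rblas.rds and A_cp_openblas.rds:
Rref <- readRDS("A_cp_Rblas.rds")      ## A_cp obtained with Rblas
Ralt <- readRDS("A_cp_openblas.rds")   ## A_cp obtained with the alternative BLAS
max(abs((Ralt - Rref) / Rref))         ## maximum relative error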
Primary documentation https://cran.r-project.org/doc/manuals/r-release/R-admin.html#BLAS
R includes its own BLAS (located in src/extra/blas/) which is compiled and
used by default.
When running R's configure script without --with-blas, the script reports
something like:
External libraries: pcre2, readline, LAPACK(generic), curl, libdeflate
After compilation, sessionInfo() reports:
BLAS: <R_HOME>/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
and R CMD config BLAS_LIBS and R CMD config LAPACK_LIBS report:
-L/home/hpages/R/R-4.5.r88102/lib -lRblas # BLAS_LIBS
-llapack # LAPACK_LIBS
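From within R, the BLAS and LAPACK libraries in use can also be queried directly
(a quick sketch; the exact paths will differ from one system to another):
extSoftVersion()[["BLAS"]]  ## e.g. "<R_HOME>/lib/libRblas.so"
La_library()                ## e.g. "/usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0"
La_version()                ## e.g. "3.12.0"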
Not included in our benchmarks because it does not seem to provide any significant performance gains.
On Ubuntu 24.04, generic BLAS and LAPACK are provided by Ubuntu
packages libblas3 and liblapack3. These are usually already
present on a typical Ubuntu 24.04 installation:
sudo apt-get install libblas3 liblapack3
Path to the libraries:
/usr/lib/x86_64-linux-gnu/blas/libblas.so.3
/usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3
Configure with:
configure --with-blas="-L/usr/lib/x86_64-linux-gnu/blas -lblas" \
--with-lapack="-L/usr/lib/x86_64-linux-gnu/lapack -llapack"
This reports:
External libraries: ..., BLAS(generic), LAPACK(generic), ...
sessionInfo() reports:
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
R CMD config BLAS_LIBS/LAPACK_LIBS reports:
-L/usr/lib/x86_64-linux-gnu/blas -lblas
-llapack
Ubuntu provides 3 flavors of OpenBLAS:
- libopenblas-pthread-dev (same as installing libopenblas-dev)
- libopenblas-openmp-dev
- libopenblas-serial-dev

They all contain BLAS and LAPACK libraries.
On Ubuntu 24.04, install OpenBLAS with:
sudo apt-get install libopenblas-pthread-dev
Note that this will also install libopenblas0-pthread.
Configure with configure --with-blas="openblas" --with-lapack. This reports:
External libraries: ..., BLAS(OpenBLAS), LAPACK(in blas), ...
extSoftVersion()[["BLAS"]] reports:
## with libopenblas-pthread-dev:
/usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so
## with libopenblas-openmp-dev:
/usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.26.so
## with libopenblas-serial-dev:
/usr/lib/x86_64-linux-gnu/openblas-serial/libopenblas-r0.3.26.so
R CMD config BLAS_LIBS reports:
-lopenblas
Note that R CMD config LAPACK_LIBS returns a blank string.
Note that any flavor of OpenBLAS can also be used as a drop-in replacement for the reference BLAS with:
sudo update-alternatives --set libblas.so.3-x86_64-linux-gnu \
/usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
In this case R should be configured with only --with-blas. The switch
to an alternative BLAS can happen any time after R has been compiled.
To cancel the manual selection and put the system back in its normal state:
sudo update-alternatives --auto libblas.so-x86_64-linux-gnu
sudo update-alternatives --auto liblapack.so-x86_64-linux-gnu
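Once R is installed, a quick sanity check (a minimal sketch) that the selected
OpenBLAS flavor is actually picked up and that matrix products benefit from it:
extSoftVersion()[["BLAS"]]                  ## should point to an openblas-* library
x <- matrix(rnorm(4000 * 4000), nrow=4000)
system.time(crossprod(x))                   ## much faster than with Rblas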
Before anything else, use nvidia-smi to make sure that your machine has
an NVIDIA GPU and that the NVIDIA drivers are working.
On Ubuntu 24.04 NVBLAS is provided by Ubuntu package libnvblas12:
sudo apt-get install libnvblas12
This will also install libcublas12 and libcublaslt12.
There are a few obstacles that make it a little bit harder to get R to use NVBLAS:
- We can't use the update-alternatives method because NVBLAS is not a registered
  alternative for the reference BLAS.
- Another problem is that installing libnvblas12 does not create the
  /usr/lib/x86_64-linux-gnu/libnvblas.so symlink (to libnvblas.so.12), and also,
  for some reason, sudo ldconfig -v won't create it.
- Also, it seems that NVBLAS does not provide a full BLAS implementation, only a
  partial one, so we'll need to link to NVBLAS and OpenBLAS, in that order.
- Finally, R's configure script does not recognize NVBLAS (like it does for
  OpenBLAS, ATLAS, MKL, and BLIS), so configuring with
  --with-blas="-L/usr/lib/local/x86_64-linux-gnu -lnvblas"
  gets ignored and the script falls back to using Rblas.
Here's how we were able to work around these issues:
- Manually create a symlink to /usr/lib/x86_64-linux-gnu/libnvblas.so.12.0.2.224
  in /usr/local/lib/x86_64-linux-gnu/.
- Run sudo ldconfig -v. This creates the libnvblas.so.12 symlink in
  /usr/local/lib/x86_64-linux-gnu/ (but still no libnvblas.so symlink).
- Manually create the libnvblas.so -> libnvblas.so.12 symlink in
  /usr/local/lib/x86_64-linux-gnu/.
- Create an nvblas.conf file in /usr/local/lib/x86_64-linux-gnu/ with the
  following 2 lines in it:
      NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
      NVBLAS_TILE_DIM 2048
  See https://docs.nvidia.com/cuda/nvblas/#configuration for more information
  about this step.
- Add the following line to /etc/profile:
      export NVBLAS_CONFIG_FILE=/usr/local/lib/x86_64-linux-gnu/nvblas.conf
  Then logout and login again for the change to take effect.
- Configure with: configure --with-blas="openblas" --with-lapack.
- Manually change the value of BLAS_LIBS from -lopenblas to -lnvblas -lopenblas
  in Makeconf and etc/Makeconf.
- Run make && make install as usual...
Finally start R and check that sessionInfo() reports something like:
BLAS: /usr/lib/x86_64-linux-gnu/libnvblas.so.12.0.2.224
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
R CMD config BLAS_LIBS should report:
-lnvblas -lopenblas
Note that R CMD config LAPACK_LIBS should return a blank string.
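As a final sanity check (a minimal sketch), run a large matrix multiplication in R
while watching GPU utilization with nvidia-smi in another terminal; with NVBLAS
correctly set up, the underlying dgemm call should be offloaded to the GPU:
x <- matrix(rnorm(8000 * 8000), nrow=8000)
system.time(y <- x %*% x)  ## GPU utilization should spike while this runs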
We created four Ubuntu 24.04 CPU instances: blas1-4
Flavor: m3.quad (4 cpus, 15 GB of RAM)
Ubuntu packages installed:
sudo apt-get install libxml2-dev libdeflate-dev
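On the R side, a minimal sketch of how the packages used in the benchmarks above
could be installed (assuming BiocManager is pointed at BioC 3.22):
install.packages("BiocManager")
BiocManager::install(version="3.22")
BiocManager::install(c("scRNAseq", "RSpectra"))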
Really nice to see this Hervé! 😍
I've had interest in this as part of the Conda Forge R Team, wondering which BLAS implementations are best for different platforms. Below is some info I can share from the Conda Forge/Bioconda perspective, and running the same benchmarks you provided here.
BLAS in Conda Forge R
When Conda Forge compiles R, it doesn't use Rblas/Rlapack. Rather, it has a modular BLAS system that links against the Netlib reference implementation at compile time and allows implementations to be swapped without recompilation. For example, there is a single R 4.5.1 build for each platform, and the end user can pick whichever BLAS is appropriate for their system (e.g., MKL, OpenBLAS, Accelerate).
For macOS, there are currently four BLAS options: Netlib, OpenBLAS, Accelerate, and "new" Accelerate (macOS 13.3+).
Results on macOS (M3 Pro 5P/6E)
Unfortunately, the Conda ecosystem has some lag time for R and Bioconductor releases, so the latest I can benchmark is Bioconductor 3.20 (R 4.4.3). Hopefully it's not too different; at the very least, one can see from these results how the BLAS implementations compare.
Results
(*) See "1. The operations that we benchmarked" at the top of this document for
the details of each operation.
Environment
Here's an example session info and Conda environment from the runs.
SessionInfo
Input YAML
Solved YAML