Goal: can we efficiently perform a single-cell analysis fully in memory, even for a big dataset like the 1.3M-cell dataset, as long as the machine has enough memory?
Here is some low-hanging fruit that has the potential to significantly improve the current situation.
I believe that the SparseArray package can help achieve that goal.
In particular, the Sparse Vector Tree representation (the SVT_SparseArray class)
provides an efficient in-memory representation of sparse matrices with no limit on
the number of nonzero values.
The 1.3M cells dataset:
- 7% nonzero values
- 24 GB in memory when loaded as an SVT_SparseArray object
A lot of work has already gone into the package, but there are still some bottlenecks:
- Some row summarization functions are not optimized yet. Many of them still rely on transposing the matrix first and calling the corresponding column summarization function on the transposed matrix. This is very inefficient. It can be avoided!
- We need an svd() implementation that works natively on the SVT_SparseArray representation.
- Same for RSpectra::svds(): right now it handles the dgCMatrix representation natively (and very efficiently) but not the SVT_SparseArray representation. It should not be too hard to adapt the current code to also handle the latter natively.
This is extremely inefficient on a big in-memory matrix:
t(t(m) / colSums(m))
because it transposes the matrix twice, and the cost of each transposition is big!
Unfortunately this idiom is used in a lot of R code.
There are about 1000 instances of it in Bioconductor packages.
My proposal is to use something like:
m / as_tile(colSums(m), along=2)
instead.
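To make the proposal concrete, here is a small sketch showing that the two idioms compute the same result (assuming the Ops methods discussed here are in place; as_tile() comes from S4Arrays):

```r
library(SparseArray)  # SVT_SparseMatrix; also loads S4Arrays (as_tile)

m <- matrix(rpois(20, lambda=1), nrow=4)
m[1, ] <- 1L                      # make sure no column sum is zero
svt <- as(m, "SVT_SparseMatrix")

ref <- t(t(m) / colSums(m))                  # classic idiom: 2 transpositions
ans <- svt / as_tile(colSums(svt), along=2)  # proposed idiom: no transposition
stopifnot(all.equal(ref, as.matrix(ans)))
```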
The idea is that if a is an array and x is a vector that we want to recycle
along a particular dimension, then we can do:
a <op> as_tile(x, along=k)
where:
- <op> is an arithmetic (Arith), comparison (Compare), or elementwise boolean (Logic) operation;
- k is the rank of the dimension along which recycling should happen.
Note that as_tile() is a function defined in the S4Arrays package. It returns
a tile object (a class also defined in S4Arrays).
To make a <op> as_tile(...) work, we need to implement a series of methods for the
Ops group generic.
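As a rough illustration of the intended semantics (not the real implementation), here is what such a method could look like for ordinary arrays. The minimal tile stand-in class and its slot names are hypothetical, and this reference version still calls aperm() internally; the whole point of the real SVT_SparseArray methods would be to avoid that:

```r
## Hypothetical minimal stand-in for the tile class (slot names made up):
setClass("tile", slots=c(data="numeric", along="integer"))
as_tile <- function(x, along)
    new("tile", data=as.numeric(x), along=as.integer(along))

setMethod("Ops", signature("array", "tile"), function(e1, e2) {
    k <- e2@along
    ## Move dimension 'k' to the front so that column-major recycling
    ## of e2@data lines up with it, apply the operation, permute back:
    perm <- c(k, seq_along(dim(e1))[-k])
    ans <- callGeneric(aperm(e1, perm), e2@data)
    aperm(ans, order(perm))
})

## With this in place, m / as_tile(colSums(m), along=2) is equivalent
## to t(t(m) / colSums(m)) on an ordinary matrix.
```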
Some preliminary testing shows that m <op> as_tile(x, along=2) can be between
4x and 10x faster than t(t(m) <op> x) on a big SVT_SparseMatrix object. It
also reduces memory usage significantly.
It can also be beneficial to make this work on ordinary (dense) matrices.
In fact, in the long term, this could even make it to base R.
GPU processing could contribute at some point but it's not clear in what form exactly.
For example many operations on SVT_SparseArray objects are parallelized via OpenMP. Should we explore a GPU-based implementation for these operations?
Writing C/C++ code that can take advantage of the GPU is notoriously hard.
We still need a ZarrArray package implementing a Zarr backend for DelayedArray
objects.
Mike Smith already has something in his Rarr package but I think that we should have a dedicated package for that. See Huber-group-EMBL/Rarr#22
Some comparison of HDF5Array, TileDBArray, and ZarrArray in the context of single-cell analyses would be nice. Look at the impact of hardware, chunk size & geometry, compression level, sparseness of the data, etc. on performance.
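One axis of such a comparison could be sketched with HDF5Array as follows (writeHDF5Array() and its chunkdim/level arguments are real; the little harness around them is only illustrative, and the other backends would follow the same pattern):

```r
library(HDF5Array)

## Time colSums() on an HDF5-backed matrix written with a given chunk
## geometry and compression level. 'm' is an ordinary or sparse matrix.
bench_hdf5 <- function(m, chunkdim, level=6) {
    f <- tempfile(fileext=".h5")
    A <- writeHDF5Array(m, filepath=f, name="counts",
                        chunkdim=chunkdim, level=level)
    elapsed <- system.time(colSums(A))[["elapsed"]]
    list(elapsed=elapsed, file_size=file.size(f))
}

## e.g. compare column-shaped vs row-shaped chunks:
## bench_hdf5(m, chunkdim=c(nrow(m), 1))   # one chunk per column
## bench_hdf5(m, chunkdim=c(1, ncol(m)))   # one chunk per row
```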
Does Rarr support zarr v3? We need to take a look at this.