Goal: can we efficiently perform a single-cell analysis fully in memory, even for a big dataset like the 1.3M-cell dataset, as long as the machine has enough memory?
Here is some low-hanging fruit that has the potential to significantly improve the current situation.
I believe that the SparseArray package can help achieve that goal.
In particular, the Sparse Vector Tree representation (the SVT_SparseArray class)
provides an efficient in-memory representation of sparse matrices with no limit on
the number of nonzero values.
The 1.3M cells dataset:
- 7% nonzero values
- 24 GB in memory when loaded as an SVT_SparseArray object
A lot of work has already gone into the package, but there are still some bottlenecks:
- Some row summarization functions are not optimized yet. Many of them still rely on transposing the matrix first and calling the corresponding column summarization function on the transposed matrix. This is very inefficient. It can be avoided!
- We need an svd() implementation that works natively on the SVT_SparseArray representation.
- Same for RSpectra::svds(): right now it handles the dgCMatrix representation natively (and very efficiently) but not the SVT_SparseArray representation. It should not be too hard to adapt the current code to also handle the latter natively.
This is extremely inefficient on a big in-memory matrix:
t(t(m) / colSums(m))
because it transposes the matrix twice, and the cost of each transposition is big!
Unfortunately this idiom is used in a lot of R code.
There are about 1000 instances of it in Bioconductor packages.
My proposal is to use something like:
m / as_tile(colSums(m), along=2)
instead.
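To make the proposal concrete, here is a small sketch showing that the two idioms compute the same result (assuming the Ops methods discussed here are in place; as_tile() comes from S4Arrays):

```r
library(SparseArray)  # SVT_SparseMatrix; also loads S4Arrays (as_tile)

m <- matrix(rpois(20, lambda=1), nrow=4)
m[1, ] <- 1L                      # make sure no column sum is zero
svt <- as(m, "SVT_SparseMatrix")

ref <- t(t(m) / colSums(m))                  # classic idiom: 2 transpositions
ans <- svt / as_tile(colSums(svt), along=2)  # proposed idiom: no transposition
stopifnot(all.equal(ref, as.matrix(ans)))
```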
The idea is that if a is an array and x is a vector that we want to recycle
along a particular dimension, then we can do:
a <op> as_tile(x, along=k)
where:
- <op> is an arithmetic (Arith), comparison (Compare), or elementwise boolean (Logic) operation;
- k is the rank of the dimension along which recycling should happen.
Note that as_tile() is a function defined in the S4Arrays package. It returns
a tile object (a class also defined in S4Arrays).
To make a <op> as_tile(...) work, we need to implement a series of methods for the
Ops group generic.
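As a rough illustration of the intended semantics (not the real implementation), here is what such a method could look like for ordinary arrays. The minimal tile stand-in class and its slot names are hypothetical, and this reference version still calls aperm() internally; the whole point of the real SVT_SparseArray methods would be to avoid that:

```r
## Hypothetical minimal stand-in for the tile class (slot names made up):
setClass("tile", slots=c(data="numeric", along="integer"))
as_tile <- function(x, along)
    new("tile", data=as.numeric(x), along=as.integer(along))

setMethod("Ops", signature("array", "tile"), function(e1, e2) {
    k <- e2@along
    ## Move dimension 'k' to the front so that column-major recycling
    ## of e2@data lines up with it, apply the operation, permute back:
    perm <- c(k, seq_along(dim(e1))[-k])
    ans <- callGeneric(aperm(e1, perm), e2@data)
    aperm(ans, order(perm))
})

## With this in place, m / as_tile(colSums(m), along=2) is equivalent
## to t(t(m) / colSums(m)) on an ordinary matrix.
```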
Some preliminary testing shows that m <op> as_tile(x, along=2) can be between
4x and 10x faster than t(t(m) <op> x) on a big SVT_SparseMatrix object. It
also reduces memory usage significantly.
It can also be beneficial to make this work on ordinary (dense) matrices.
In fact, in the long term, this could even make it to base R.
GPU processing could contribute at some point but it's not clear in what form exactly.
For example many operations on SVT_SparseArray objects are parallelized via OpenMP. Should we explore a GPU-based implementation for these operations?
Writing C/C++ code that can take advantage of the GPU is notoriously hard.
We still need a ZarrArray package implementing a Zarr backend for DelayedArray
objects.
Mike Smith already has something in his Rarr package but I think that we should have a dedicated package for that. See Huber-group-EMBL/Rarr#22
Some comparison of HDF5Array, TileDBArray, and ZarrArray in the context of single-cell analyses would be nice. Look at the impact of hardware, chunk size & geometry, compression level, sparseness of the data, etc. on performance.
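One axis of such a comparison could be sketched with HDF5Array as follows (writeHDF5Array() and its chunkdim/level arguments are real; the little harness around them is only illustrative, and the other backends would follow the same pattern):

```r
library(HDF5Array)

## Time colSums() on an HDF5-backed matrix written with a given chunk
## geometry and compression level. 'm' is an ordinary or sparse matrix.
bench_hdf5 <- function(m, chunkdim, level=6) {
    f <- tempfile(fileext=".h5")
    A <- writeHDF5Array(m, filepath=f, name="counts",
                        chunkdim=chunkdim, level=level)
    elapsed <- system.time(colSums(A))[["elapsed"]]
    list(elapsed=elapsed, file_size=file.size(f))
}

## e.g. compare column-shaped vs row-shaped chunks:
## bench_hdf5(m, chunkdim=c(nrow(m), 1))   # one chunk per column
## bench_hdf5(m, chunkdim=c(1, ncol(m)))   # one chunk per row
```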
Does Rarr support zarr v3? We need to take a look at this.