(carried over from humancellatlas/table-testing#8)
- I am by no means a single-cell (SC) domain expert; this is just my guess about how things will shake out
- there may well be situations where a person will have a dataset where elements ("rows") correspond to genes, which each include a list of per-cell metrics
- doing a transpose of a distributed matrix is possible and will be supported
- the thesis is just that, in "99%" of cases, "rows as cells" will map domain needs to infrastructure-{assumptions,conventions} better than "columns as cells"
Overview of a few common operations as they relate to this question:
It's easier to filter rows from a large dataset, but filtering columns isn't too bad either.
SC analyses will likely filter cells and genes in different situations, so this is somewhat of a tie.
However, in a mature HCA corpus the number of genes is O(1) (a fixed 20-30k) while the number of cells is O(n) and keeps growing.
Additionally, I expect it will be much more common to want all 20-30k per-gene metrics for the cells that match a query (typically a tiny fraction of the billions of cells in the HCA) than to want the value of a given gene for every cell in the HCA.
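To make the two access patterns concrete, here is a minimal NumPy sketch with "rows as cells". The toy sizes, the `tissue` metadata array, and the query itself are assumptions for illustration, not anything from the HCA schema.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 1_000, 50  # toy sizes; a real corpus would be far larger

# rows = cells, columns = genes
counts = rng.poisson(1.0, size=(n_cells, n_genes))

# Hypothetical per-cell metadata used as the query predicate.
tissue = rng.choice(["lung", "liver", "brain"], size=n_cells)

# Common case: keep the cells matching a query; every kept row
# carries all of that cell's per-gene values.
mask = tissue == "lung"
lung_cells = counts[mask]

# Rarer case: one gene's value across every cell in the dataset.
gene_7_all_cells = counts[:, 7]

print(lung_cells.shape, gene_7_all_cells.shape)
```

The row filter returns a small number of complete records, while the column slice has one entry per cell in the whole dataset.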
Similarly, it's probably easier to cluster by rows. I think SC analyses will cluster both by cells and by genes, but clustering by cells will be more common and more in need of distributed compute.
It's much easier to concatenate "rows" than "columns", and I think we'll much more frequently concatenate groups of cells with the same gene-lists than vice-versa.
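A small sketch of why row concatenation is the cheap direction, assuming dense row-major (C-order) storage; the batch contents are made up. With rows as cells, combining two batches that share a gene list amounts to placing one buffer after the other.

```python
import numpy as np

# Two batches of cells measured against the same 3-gene list (toy data).
batch_a = np.arange(6).reshape(2, 3)       # 2 cells x 3 genes
batch_b = np.arange(6, 18).reshape(4, 3)   # 4 cells x 3 genes

# Concatenating "rows": the result has 6 cells and the same 3 genes.
combined = np.concatenate([batch_a, batch_b], axis=0)

# In row-major order the combined buffer is just batch_a's bytes
# followed by batch_b's bytes; no existing value has to move.
same_layout = combined.tobytes() == batch_a.tobytes() + batch_b.tobytes()
print(combined.shape, same_layout)
```

Concatenating along columns would instead interleave the two inputs row by row, so every existing row would have to be rewritten.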
- A "row" / "element" of a distributed dataset should fit easily in memory.
- "20-30k per-gene metrics about a given cell" fits this bill
- "billions of per-cell metrics for a given gene" does not.
- if someone did want to do gene-oriented analyses over billions of cells, they'd likely end up doing some 2-D / rectangular chunking (like we've discussed elsewhere on this PR, with e.g. the 1000x1000 chunks)
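Rough arithmetic behind the "fits in memory" bullets above. The 4-byte dense value and the figure of exactly one billion cells are assumptions chosen for illustration; sparse encodings would change the absolute numbers but not the asymmetry.

```python
BYTES_PER_VALUE = 4          # assumption: dense float32 values
N_GENES = 30_000             # upper end of the "20-30k" range
N_CELLS = 1_000_000_000      # assumption: one billion cells

# One "row" when rows are cells: all per-gene metrics for one cell.
row_bytes = N_GENES * BYTES_PER_VALUE

# One "row" when rows are genes: one metric for every cell.
gene_row_bytes = N_CELLS * BYTES_PER_VALUE

# One 1000x1000 rectangular chunk.
chunk_bytes = 1000 * 1000 * BYTES_PER_VALUE

print(f"cell as row:      {row_bytes / 1e3:.0f} KB")
print(f"gene as row:      {gene_row_bytes / 1e9:.0f} GB")
print(f"1000x1000 chunk:  {chunk_bytes / 1e6:.0f} MB")
```

Under these assumptions a cell-row is on the order of a hundred kilobytes, a gene-row is several gigabytes, and a 1000x1000 chunk sits comfortably in between.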
Here's an attempt to visualize "operations we want to do with each type (genes / cells)" and "how easily those operations can be done on 'rows' and 'columns'":
| | genes | cells | rows | columns |
|---|---|---|---|---|
| filter | ✅ | ✅ | ✅ | ok |
| cluster | ✅ | ✅ | ✅ | |
| concatenate | | ✅ | ✅ | |
| memory-sized | | ✅ | ✅ | |
The point is that "cells" and "rows" correlate better than "genes" and "rows".
Feel free to tell me if any of this seems crazy / off-base 🙂
Doesn't this depend on how we store/chunk the matrix?
For example, if chunks are defined to be columns, then adding new rows will be much harder than adding new columns.
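A toy illustration of that chunking point, in plain Python. The two layouts and the `append_rows` helper are hypothetical and only model "each chunk holds whole rows" versus "each chunk holds one whole column"; they are not any real storage format.

```python
def append_rows(chunks, chunked_by, new_rows):
    """Append rows to a chunked matrix; return how many existing chunks were modified."""
    if chunked_by == "rows":
        # Row layout: the new rows become a new chunk; nothing existing is touched.
        chunks.append(new_rows)
        return 0
    # Column layout: chunk j holds column j, so every chunk must grow.
    for j, column in enumerate(chunks):
        column.extend(row[j] for row in new_rows)
    return len(chunks)


matrix = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]   # 3 rows x 4 columns

row_chunks = [[row] for row in matrix]                   # one row per chunk
col_chunks = [list(col) for col in zip(*matrix)]         # one column per chunk

new_rows = [[12, 13, 14, 15]]
touched_row_layout = append_rows(row_chunks, "rows", new_rows)
touched_col_layout = append_rows(col_chunks, "columns", new_rows)

print(touched_row_layout, touched_col_layout)
```

With the column layout every existing chunk is rewritten on a row append, which is the cost being described; the mirror-image argument applies to appending columns under a row layout.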
Also, I would like to highlight that we are not talking about tabular data here; we are talking about multi-dimensional arrays.