Writeup of pros and cons of using cells vs genes as "rows" in distributed matrices of single-cell expression data

(carried over from humancellatlas/table-testing#8)

Caveats

- I am by no means a (SC) domain expert; this is just my guess about how things will shake out
- there may well be situations where a person has a dataset whose elements ("rows") correspond to genes, each of which includes a list of per-cell metrics
  - doing a transpose of a distributed matrix is possible and will be supported
  - the thesis is just that, in "99%" of cases, "rows as cells" will map domain needs to infrastructure-{assumptions,conventions} better than "columns as cells"

Features

Overview of a few common operations as they relate to this question:

Filtering

It's easier to filter rows from a large dataset, but filtering columns isn't too bad either.

SC analyses will likely filter cells and genes in different situations, so this is somewhat of a tie.

However, thinking about a mature HCA corpus, the number of genes is effectively fixed (O(1), roughly 20-30k) while the number of cells keeps growing with the corpus (O(n)).

Additionally, I feel like it will be much more common to want all 20-30k per-gene metrics for [cells that are responsive to a query] (which will typically be a tiny fraction of the billions of cells in the HCA) than to want to know what every cell in the HCA had for a given gene.
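To make the row-vs-column filtering difference concrete, here is a minimal sketch. It assumes a cells-by-genes matrix stored as a plain Python list of NumPy row partitions; that layout, and the names `partitions`, `filter_cells`, and `filter_genes`, are made up for illustration and are not how any particular distributed framework stores things.

```python
import numpy as np

# Hypothetical layout: a cells-x-genes matrix held as a list of row partitions,
# where each partition is a block of whole cell rows.
rng = np.random.default_rng(0)
n_genes = 6
partitions = [rng.poisson(1.0, size=(4, n_genes)) for _ in range(3)]  # 12 cells total


def filter_cells(parts, min_total):
    # Row filter: each partition decides locally which of its rows to keep.
    # A partition with no matching rows simply comes back empty.
    return [p[p.sum(axis=1) >= min_total] for p in parts]


def filter_genes(parts, gene_idx):
    # Column filter: the same column selection is applied to every row
    # of every partition, so every partition gets rewritten.
    return [p[:, gene_idx] for p in parts]


cells_kept = filter_cells(partitions, min_total=6)
genes_kept = filter_genes(partitions, gene_idx=[0, 2, 5])
```

Both are per-partition operations here, which matches the "isn't too bad either" point: the column filter just has to touch every row, while the row filter can drop whole rows (and potentially whole partitions) outright.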

Clustering

Similarly, it's probably easier to cluster by rows. I think SC analyses will cluster both by cells and by genes, but clustering by cells will be more common and more in need of distributed compute.

Concatenating

It's much easier to concatenate "rows" than "columns", and I think we'll much more frequently concatenate groups of cells with the same gene-lists than vice-versa.
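A rough sketch of why row-concatenation is the cheap direction, again assuming a made-up "list of row partitions" layout (the dataset names and shapes are hypothetical):

```python
import numpy as np

n_genes = 5
# Two hypothetical datasets, each a list of row partitions over the same gene list.
dataset_a = [np.ones((3, n_genes)), np.ones((2, n_genes))]
dataset_b = [np.zeros((4, n_genes))]

# Concatenating cells (rows): just append the partition lists.
# No existing partition is copied or rewritten.
more_cells = dataset_a + dataset_b

# Concatenating genes (columns): every row has to be widened, so the new
# columns must be partitioned row-for-row like the old ones, and every
# partition gets rebuilt.
extra_genes_a = [np.full((3, 2), 7.0), np.full((2, 2), 7.0)]
more_genes = [np.hstack([p, q]) for p, q in zip(dataset_a, extra_genes_a)]
```

In the row case the only requirement is that the gene lists match; in the column case the two sides also have to agree on how rows are split across partitions.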

Memory-Sizes

A "row" / "element" of a distributed dataset should fit easily in memory.
- "20-30k per-gene metrics about a given cell" fits this bill
- "billions of per-cell metrics for a given gene" does not
  - if someone did want to do gene-oriented analyses over billions of cells, they'd likely end up doing some 2-D / rectangular chunking (like we've discussed elsewhere on this PR, with e.g. the 1000x1000 chunks)
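Some back-of-the-envelope numbers for the sizes above. The 4-bytes-per-value figure (dense float32) and the round one billion cells are my assumptions for illustration, not something measured:

```python
BYTES_PER_VALUE = 4        # assumption: dense float32 values
n_genes = 30_000           # upper end of the "20-30k" figure
n_cells = 1_000_000_000    # assumption: one billion cells

row_bytes = n_genes * BYTES_PER_VALUE          # one cell, all genes
col_bytes = n_cells * BYTES_PER_VALUE          # one gene, all cells
chunk_bytes = 1000 * 1000 * BYTES_PER_VALUE    # one 1000x1000 chunk

print(f"one cell row:        {row_bytes / 1e3:.0f} kB")
print(f"one gene column:     {col_bytes / 1e9:.0f} GB")
print(f"one 1000x1000 chunk: {chunk_bytes / 1e6:.0f} MB")
```

Under those assumptions a cell row is about 120 kB and a 1000x1000 chunk is about 4 MB, while a single gene column is about 4 GB per billion cells.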

Recap:

Here's an attempt to visualize "operations we want to do with each type (genes / cells)" and "how easily those operations can be done on 'rows' and 'columns'":

|  | genes | cells | rows | columns |
|---|---|---|---|---|
| filter | ✅ | ✅ | ✅ | 🌗 |
| cluster | ✅ | ✅ | ✅ |  |
| concatenate |  | ✅ | ✅ |  |
| memory-sized |  | ✅ | ✅ |  |

The point is that "cells" and "rows" correlate better than "genes" and "rows".

Feel free to tell me if any of this seems crazy / off-base 😀

ryan-williams/cell-vs-gene-rows.md