@ryan-williams
Last active July 20, 2018 19:56
Writeup of pros and cons of using cells vs genes as "rows" in distributed matrices of single-cell expression data

(carried over from humancellatlas/table-testing#8)

Caveats

  • I am by no means a single-cell (SC) domain expert; this is just my guess about how things will shake out
  • there may well be situations where a person will have a dataset where elements ("rows") correspond to genes, which each include a list of per-cell metrics
    • doing a transpose of a distributed matrix is possible and will be supported
    • the thesis is just that, in "99%" of cases, "rows as cells" will map domain needs to infrastructure-{assumptions,conventions} better than "columns as cells"
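For the transpose caveat, here is a minimal sketch of why transposing a chunked matrix is tractable: each chunk is transposed locally and its block coordinates are swapped. The dict-of-blocks layout and the names `transpose_chunked` and `assemble` are hypothetical, made up for illustration; a real distributed system would also have to move chunks between workers.

```python
import numpy as np

def transpose_chunked(chunks):
    """Transpose a matrix stored as {(block_row, block_col): 2-D array}.

    Each block is transposed on its own and its coordinates are swapped;
    no block needs to see any other block.
    """
    return {(j, i): block.T for (i, j), block in chunks.items()}

def assemble(chunks):
    """Stitch the blocks back into one dense array (only for checking small examples)."""
    n_block_rows = 1 + max(i for i, _ in chunks)
    n_block_cols = 1 + max(j for _, j in chunks)
    return np.block(
        [[chunks[i, j] for j in range(n_block_cols)] for i in range(n_block_rows)]
    )

# Toy 4x6 matrix split into 2x3 chunks (stand-in for the 1000x1000 chunks).
full = np.arange(24).reshape(4, 6)
chunks = {
    (i, j): full[2 * i : 2 * i + 2, 3 * j : 3 * j + 3]
    for i in range(2)
    for j in range(2)
}
transposed = transpose_chunked(chunks)
```

Reassembling `transposed` with `assemble` gives the same array as `full.T`.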

Features

Overview of a few common operations as they relate to this question:

Filtering

It's easier to filter rows from a large dataset, but filtering columns isn't too bad either.

SC analyses will likely filter cells and genes in different situations, so this is somewhat of a tie.

However, thinking about a mature HCA corpus, the number of genes is O(1) (it stays at roughly 20-30k) while the number of cells is O(n) (it grows with the corpus).

Additionally, I expect it will be much more common to want all the 20-30k per-gene metrics for the cells that match a query (which will typically be a tiny fraction of the billions of cells in the HCA) than to want to know what every cell in the HCA had for a given gene.
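A rough sketch of the difference, assuming the matrix is split into row-blocks of cells (the partition layout and the function names `filter_cells` and `filter_genes` are hypothetical, not any particular framework's API): a cell filter can skip whole partitions that contain no matching cells, while a gene filter has to slice every partition.

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES = 6  # tiny stand-in for the 20-30k genes

# Hypothetical layout: three row-partitions of 4 cells each (cells as rows).
partitions = [rng.poisson(1.0, size=(4, N_GENES)) for _ in range(3)]

def filter_cells(partitions, keep_masks):
    """Row filter: each partition is handled independently; empty results are dropped."""
    kept = []
    for block, mask in zip(partitions, keep_masks):
        if mask.any():
            kept.append(block[mask])
    return kept

def filter_genes(partitions, gene_idx):
    """Column filter: every partition is read and sliced, whatever the selection."""
    return [block[:, gene_idx] for block in partitions]

# A query that only matches two cells, both in the middle partition.
keep_masks = [
    np.zeros(4, dtype=bool),
    np.array([True, False, False, True]),
    np.zeros(4, dtype=bool),
]
cells_kept = filter_cells(partitions, keep_masks)  # one partition survives
genes_kept = filter_genes(partitions, [1, 4])      # all three partitions touched
```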

Clustering

Similarly, it's probably easier to cluster by rows. I think SC analyses will cluster both by cells and by genes, but clustering by cells will be more common and more in need of distributed compute.

Concatenating

It's much easier to concatenate "rows" than "columns", and I think we'll much more frequently concatenate groups of cells with the same gene-lists than vice-versa.
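To make that concrete, a small sketch under the assumption that the data is chunked into row-blocks of cells (with column-wise chunks the roles would swap). The names `concat_cells` and `concat_genes` are hypothetical. Adding cells only extends the list of partitions, whereas adding genes requires matching row partitioning and rewrites every block.

```python
import numpy as np

def concat_cells(a_parts, b_parts):
    """Append more cells (rows): just extend the partition list, no block is rewritten.

    Only the gene *count* is checked here; a real system would also compare gene IDs.
    """
    gene_counts = {p.shape[1] for p in a_parts + b_parts}
    if len(gene_counts) != 1:
        raise ValueError("all partitions must share the same gene list")
    return a_parts + b_parts

def concat_genes(a_parts, b_parts):
    """Append more genes (columns): partitions must line up, and each block is rebuilt."""
    if [p.shape[0] for p in a_parts] != [p.shape[0] for p in b_parts]:
        raise ValueError("row partitioning must match exactly")
    return [np.hstack([a, b]) for a, b in zip(a_parts, b_parts)]

a = [np.ones((3, 5)), np.ones((2, 5))]   # 5 cells x 5 genes, in two partitions
b = [np.zeros((4, 5))]                   # 4 more cells, same 5 genes
more_cells = concat_cells(a, b)

c = [np.zeros((3, 2)), np.zeros((2, 2))] # 2 more genes for the same 5 cells
more_genes = concat_genes(a, c)
```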

Memory-Sizes

  • A "row" / "element" of a distributed dataset should fit easily in memory.
  • "20-30k per-gene metrics about a given cell" fits this bill
  • "billions of per-cell metrics for a given gene" does not.
    • if someone did want to do gene-oriented analyses over billions of cells, they'd likely end up doing some 2-D / rectangular chunking (like we've discussed elsewhere on this PR, with e.g. the 1000x1000 chunks)
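Some back-of-the-envelope arithmetic for the sizes above. The assumptions are mine, not from the discussion: dense storage at 4 bytes per value (float32), 30,000 genes, and 1 billion cells as a stand-in for "billions".

```python
BYTES_PER_VALUE = 4        # assumption: dense float32 values
N_GENES = 30_000           # upper end of the 20-30k range
N_CELLS = 1_000_000_000    # assumption: 1e9 as a stand-in for "billions"
CHUNK_SIDE = 1_000         # the 1000x1000 chunks mentioned above

def dense_bytes(n_values, bytes_per_value=BYTES_PER_VALUE):
    """Size in bytes of n_values stored densely."""
    return n_values * bytes_per_value

cell_row_bytes = dense_bytes(N_GENES)                 # one cell, all genes
gene_col_bytes = dense_bytes(N_CELLS)                 # one gene, all cells
chunk_bytes = dense_bytes(CHUNK_SIDE * CHUNK_SIDE)    # one rectangular chunk

print(f"one cell (all genes):  {cell_row_bytes / 1e3:.0f} kB")
print(f"one gene (all cells):  {gene_col_bytes / 1e9:.0f} GB")
print(f"one 1000x1000 chunk:   {chunk_bytes / 1e6:.0f} MB")
```

Under these assumptions a per-cell row is around 120 kB, a per-gene column is around 4 GB, and a 1000x1000 chunk is around 4 MB, which is the gap the bullets above describe.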

Recap:

Here's an attempt to visualize "operations we want to do with each type (genes / cells)", and "how easily those operations can be done on 'rows' and 'columns'":

               genes   cells   rows   columns
filter          ✅      ✅      ✅      🌗
cluster         ✅      ✅      ✅      -
concatenate     -       ✅      ✅      -
memory-sized    -       ✅      ✅      -

The point is that "cells" and "rows" correlate better than "genes" and "rows".

Feel free to tell me if any of this seems crazy / off-base 😀

@samanvp

samanvp commented Jul 20, 2018

It's much easier to concatenate "rows" than "columns"

Doesn't this depend on how we stored/chunked the matrix?
For example if chunks are defined to be columns then adding new rows will be much harder than adding new columns.

Also, I would like to highlight that we are not talking about tabular data here; we are talking about multi-dimensional arrays.
