Ryan Williams ryan-williams

(carried over from humancellatlas/table-testing#8)

Caveats

I am by no means a single-cell (SC) domain expert; this is just my guess about how things will shake out.
There may well be situations where someone has a dataset whose elements ("rows") correspond to genes, each of which includes a list of per-cell metrics:
- transposing a distributed matrix is possible and will be supported
- the thesis is just that, in "99%" of cases, "rows as cells" will map domain needs onto infrastructure assumptions and conventions better than "columns as cells"
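
To make the transpose point concrete, here is a minimal sketch using plain Scala collections rather than any distributed framework. `Entry`, `transpose`, and `rows` are hypothetical names chosen for illustration, and the coordinate-style layout is an assumption, not something the original issue specifies. The idea is that swapping the row and column index of each entry is independent per entry, so in a distributed setting the only real cost should be the regrouping (shuffle) by the new row index.

```scala
object TransposeSketch {
  // Hypothetical coordinate-format entry: one stored value of the matrix.
  final case class Entry(row: Long, col: Long, value: Double)

  // Transposing swaps the row and column index of every entry. Each entry is
  // handled independently of the others.
  def transpose(entries: Seq[Entry]): Seq[Entry] =
    entries.map(e => Entry(e.col, e.row, e.value))

  // Regroup entries by row index, standing in for a shuffle by key.
  def rows(entries: Seq[Entry]): Map[Long, Seq[(Long, Double)]] =
    entries.groupBy(_.row).map { case (r, es) =>
      (r, es.map(e => (e.col, e.value)).sortBy(_._1))
    }

  def main(args: Array[String]): Unit = {
    // "Rows as cells": 2 cells x 3 genes, only non-zero values stored.
    val cellsByGenes = Seq(Entry(0, 0, 5.0), Entry(0, 2, 1.0), Entry(1, 1, 3.0))
    // "Rows as genes": the same data after transposing.
    val genesByCells = transpose(cellsByGenes)
    println(rows(genesByCells))
  }
}
```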

Features

Title	Date	Cells	Size (Bytes)	Size (Human-Readable)
L5_All.loom	2018/04/13 09:33:55	160796	19129203189	17.8G
L6_Neurons.loom	2018/04/04 09:06:19	74539	6194902783	5.8G
L6_Cns_neurons.loom	2018/04/04 09:06:08	70968	5624800899	5.2G
L6_Glia.loom	2018/04/04 09:06:17	66656	3410506583	3.2G
L6_Cns_glia.loom	2018/04/04 09:06:12	52539	1881357834	1.8G
L6_Telencephalon_projecting_neurons.loom	2018/04/04 09:06:13	28858	929955780	887M
L1_Medulla.loom	2018/04/05 23:20:39	65179	480827497	459M
L1_Pons.loom	2018/04/05 23:20:27	62635	455401509	434M

Notes on the integration test, `fastavro_it_test`

Benchmarks

I set it to write 10 million (10MM) synthetic records with both fastavro and avro, read them back in (each side reading what it wrote), and then verify that the two read PCollections are equal (via a CoGroupByKey).
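
The equality check can be sketched roughly as follows. This is not the Beam pipeline itself: it uses plain Scala collections to imitate what a CoGroupByKey step produces (for each key, the values from both sides), and `Record`, `coGroup`, `counts`, and `mismatchedKeys` are hypothetical names introduced only for illustration. Two collections are considered equal when no key has differing values on the two sides.

```scala
object CoGroupEqualitySketch {
  // Hypothetical record type standing in for the synthetic Avro records.
  final case class Record(id: Long, label: String)

  // For every key seen on either side, collect the values from both sides,
  // mimicking the output shape of a co-group-by-key step.
  def coGroup[K, V](left: Seq[(K, V)], right: Seq[(K, V)]): Map[K, (Seq[V], Seq[V])] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).toSeq.map { k =>
      val lv = l.getOrElse(k, Seq.empty[(K, V)]).map(_._2)
      val rv = r.getOrElse(k, Seq.empty[(K, V)]).map(_._2)
      (k, (lv, rv))
    }.toMap
  }

  // Multiset view of a value list, so that ordering does not matter.
  def counts[V](vs: Seq[V]): Map[V, Int] =
    vs.groupBy(identity).map { case (v, group) => (v, group.size) }

  // Keys whose values differ between the two sides; empty means "equal".
  def mismatchedKeys[K, V](left: Seq[(K, V)], right: Seq[(K, V)]): Set[K] =
    coGroup(left, right).iterator.collect {
      case (k, (lv, rv)) if counts(lv) != counts(rv) => k
    }.toSet

  def main(args: Array[String]): Unit = {
    val written = (1L to 5L).map(i => Record(i, s"rec-$i"))
    val sideA = written.map(r => (r.id, r))
    val sideB = written.reverse.map(r => (r.id, r)) // same data, different order
    println(mismatchedKeys(sideA, sideB).isEmpty)   // prints "true"

    // Corrupt the record with id 5 on one side only.
    val corrupted = sideB.updated(0, (5L, Record(5L, "oops")))
    println(mismatchedKeys(sideA, corrupted))       // prints "Set(5)"
  }
}
```

Comparing per-key multisets rather than whole collections keeps the check independent of record order, which matters because the two sides are not guaranteed to read records back in the same order.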

"Write" pipeline: 10MM records

The fastavro side is 3.5x faster:

While doing some dependency upgrades I saw tests failing like:

[error] Uncaught exception when running BugTest: java.lang.IncompatibleClassChangeError: Class spire.math.IntIsNumeric does not implement the requested interface cats.kernel.Eq
[error] sbt.ForkMain$ForkError: java.lang.IncompatibleClassChangeError: Class spire.math.IntIsNumeric does not implement the requested interface cats.kernel.Eq

It turns out that Spark MLlib depends transitively on org.spire-math:spire_2.11:0.13.0 via breeze, and when I had an unrelated exclusion in my build.sbt:

	// Instead of defining a specific Show type-class, a library can define a template for a Show and its companion:
	object ShowTemplates {
	  // implement custom `Show` type-classes as sub-types of this
	  trait ShowTemplate[T] {
	    def apply(t: T): String
	  }
	  // mix this in to your `Show`'s companion object, bringing whichever optional instances you want into the default implicit-search path
	  trait ShowCompanionTemplate[Show[T] <: ShowTemplate[T]] {
	    def make[T](fn: T ⇒ String): Show[T]
	    implicit val string: Show[String] = make(identity) // String instance always available

	{
	  sealed trait obj
	  case object one extends obj
	  case object two extends obj

	  def bad(o: obj) =
	    o match {
	      case one ⇒ 1
	      case two ⇒ 2
	    }
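
The snippet above is cut off, but `bad` presumably illustrates a well-known Scala pitfall: an identifier that starts with a lowercase letter in a pattern is a variable pattern, not a reference to the object of that name. So `case one` binds a fresh variable and matches every input. The sketch below reproduces that behavior and shows one common fix, wrapping the names in backticks to make them stable identifier patterns. The name `good` and the enclosing object are additions for illustration, and the arrows are written as `=>` because the Unicode `⇒` is deprecated in newer Scala versions.

```scala
object LowercasePatternSketch {
  sealed trait obj
  case object one extends obj
  case object two extends obj

  // `one` here is a variable pattern: it matches anything, so the second
  // case is unreachable (the compiler typically warns about this).
  def bad(o: obj): Int =
    o match {
      case one => 1
      case two => 2
    }

  // Backticks turn the names into stable identifier patterns, which compare
  // the scrutinee against the actual case objects.
  def good(o: obj): Int =
    o match {
      case `one` => 1
      case `two` => 2
    }

  def main(args: Array[String]): Unit = {
    println(bad(two))  // prints 1
    println(good(two)) // prints 2
  }
}
```

Naming the case objects with an initial capital (`One`, `Two`) would avoid the problem without backticks, since capitalized identifiers in patterns are treated as references rather than new bindings.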

	import $ivy.`org.http4s::http4s-dsl:0.18.13`
	import $ivy.`org.http4s::http4s-blaze-server:0.18.13`

	import cats.effect._
	import org.http4s._
	import org.http4s.dsl.io._
	import org.http4s.server.blaze._

	val response = "This is the response"
	val port = 8000

	import mill._, mill.scalalib._
	object test1 extends ScalaModule {
	  def scalaVersion = "2.11.12"
	}
	object test2 extends ScalaModule {
	  def scalaVersion = "2.11.12"
	  def ivyDeps = Agg(ivy"org.typelevel::cats-core:1.0.1")
	}
	object test3 extends ScalaModule {
	  def scalaVersion = "2.12.6"

	import sbt.Keys._
	import sbt._
	import sbt.PluginTrigger.AllRequirements
	import sbt.librarymanagement.CrossVersion

	// Put this in project/Plugin.scala to expose `modName` setting which is your project's name and scala-binary-version (e.g. "foo_2.12").
	object Plugin
	  extends AutoPlugin {
	  override def trigger = AllRequirements