Last active
July 17, 2018 21:16
-
-
Save johnynek/91a2a35b4808b98328cdc2ad3f15d74b to your computer and use it in GitHub Desktop.
Scala's path dependent types can be used to prove that a serialized value can be deserialized without having to resort to Try/Either/Option. This puts the serialized value into the type, so we can be sure we won't fail. This is very useful for distributed compute settings such as scalding or spark.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import scala.util.Try | |
object PathSerializer { | |
trait SerDe[A] { | |
// By using a path dependent type, we can be sure can deserialize without wrapping in Try | |
type Serialized | |
def ser(a: A): Serialized | |
def deser(s: Serialized): A | |
// If we convert to a generic type, in this case String, we forget if we can really deserialize | |
def toString(s: Serialized): String | |
def fromString(s: String): Try[A] | |
} | |
val intSer: SerDe[Int] = | |
new SerDe[Int] { | |
type Serialized = String | |
def ser(a: Int) = a.toString | |
def deser(s: Serialized) = s.toInt // since this was serialized with this SerDe this is safe | |
def toString(s: Serialized): String = s | |
def fromString(s: String): Try[Int] = Try(s.toInt /* this can fail, because we only know it is a String */) | |
} | |
def example0 = { | |
val x = 42 | |
val ser: intSer.Serialized = intSer.ser(x) | |
val y = intSer.deser(ser) | |
assert(x == y) | |
} | |
def example1 = { | |
// In a case like spark or scalding, we can remember that deserialization can't fail: | |
// pretend the list below is an RDD or TypedPipe | |
val inputs: List[Int] = (0 to 1000).toList | |
val serialized: List[intSer.Serialized] = inputs.map(intSer.ser(_)) | |
// intSer.Serialized is distinct from String, note the following error | |
// if we pretend they are the same: | |
// | |
// val strings: List[String] = serialized | |
// | |
// ts_ser.scala:41: error: type mismatch; | |
// found : List[PathSerializer.intSer.Serialized] | |
// required: List[String] | |
// Look, ma! no Trys! | |
val deserialized: List[Int] = serialized.map(intSer.deser(_)) | |
assert(inputs == deserialized) | |
} | |
} |
@przemek-pokrywka, that's not quite right. Scalding or Spark jobs can compile down to many map-reduce steps, each of which is separated by a serialization boundary. Usually that happens behind the scenes, and you're forced to serialize some sort of class tag into each object behind the scenes on disk.
With this trick you can explicitly trigger serialization, then jump back out in the same program. This is more efficient, and behind the scenes the mapreduce platform only sees bytes, so no class tags required.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
@johnynek
https://gist.github.com/johnynek/91a2a35b4808b98328cdc2ad3f15d74b#file-scala-path-dependent-serializers-scala-L48 is only possible if you have values of
intSer.Serialized
around. But when any serialization to disk/buffers is involved one doesn't readintSer.Serialized
- one reads raw bytes / String / other low-level representation. So you would need to somehow jump from that low-level representation tointSer.Serialized
directly which I think isn't possible to be made safe, without usingOption
,Try
or equivalent.Update
one can still keep a typed reference to the data in offsite-heap for example, in which case the deserialization is safe.