import shapeless._
import record._
import syntax.singleton._

object ScaldingPoC extends App {
  // map, flatMap
  val birds =
    List(
      "name" ->> "Swallow (European, unladen)" :: "speed" ->> 23 :: "weightLb" ->> 0.2 :: "heightFt" ->> 0.65 :: HNil,
      "name" ->> "African (European, unladen)" :: "speed" ->> 24 :: "weightLb" ->> 0.21 :: "heightFt" ->> 0.6 :: HNil
    )

  val fasterBirds = birds.map(b => b + ("doubleSpeed" ->> b("speed")*2))
  fasterBirds foreach println

  val britishBirds = birds.map(b => b + ("weightKg" ->> b("weightLb")*0.454) + ("heightM" ->> b("heightFt")*0.305))
  britishBirds foreach println

  val items =
    List(
      "author" ->> "Benjamin Pierce" :: "title" ->> "Types and Programming Languages" :: "price" ->> 49.35 :: HNil,
      "author" ->> "Roger Hindley" :: "title" ->> "Basic Simple Type Theory" :: "price" ->> 23.14 :: HNil
    )

  val pricierItems = items.map(i => i + ("price" ->> i("price")*1.1))
  pricierItems foreach println

  val books =
    List(
      "text" ->> "Not everyone knows how I killed old Phillip Mathers" :: HNil,
      "text" ->> "No, no, I can't tell you everything" :: HNil
    )

  val lines = books.flatMap(book => for(word <- book("text").split("\\s+")) yield book + ("word" ->> word))
  lines foreach println
}
I'm not making any ambitious claims for the efficiency of the above. However, the current (shapeless 2.0-M1) representation of records is probably lighter weight than you expect: the keys are encoded as singleton types intersected with the types of the values and have absolutely no runtime footprint. At runtime the record is essentially a cons list of the values, and the keys are completely erased.
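To illustrate the point about erased keys, here is a minimal plain-Scala sketch of the tagging idea, without shapeless. The names `KeyTag`, `Field`, `field`, `Name`, and `Kind` are illustrative assumptions for this sketch and are not taken from the shapeless source; the marker traits stand in for the singleton types of the string keys.

```scala
object ErasedKeys {
  // Phantom tag: carries the key type K at compile time only and is never instantiated.
  trait KeyTag[K, V]
  type Field[K, V] = V with KeyTag[K, V]

  // "Tagging" a value is just a cast, so no wrapper object is allocated.
  def field[K, V](value: V): Field[K, V] = value.asInstanceOf[Field[K, V]]

  // Hypothetical key types standing in for the singleton types of "name" and "kind".
  trait Name
  trait Kind

  val name: Field[Name, String] = field[Name, String]("Swallow")
  val kind: Field[Kind, String] = field[Kind, String]("European")

  // At runtime the "record" is an ordinary list holding only the values.
  val record: List[Any] = List(name, kind)

  def main(args: Array[String]): Unit = {
    println(name.getClass.getName) // the runtime class is plain java.lang.String
    println(record)                // only the values are present, no keys
  }
}
```

Because `Field[Name, String]` is a subtype of `String`, the tagged value can still be used wherever a `String` is expected, while the key lives only in the static type.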
Cool. I need to take a look at the latest implementation.
We will need to look at serialization here because, as Dean notes, we definitely don't want to serialize the keys with each row. We'd have to look at how Kryo does (or can be made to) serialize the records.
@johnynek The keys don't exist at all at runtime.
Here's an example showing Kryo serialization of record types: https://gist.github.com/bsidhom/9798005
The record type takes no more space than its underlying HList, which isn't too bad when registered. Unfortunately, I haven't found a way to register classes more concisely, but it may be possible to remove some boilerplate given an example instance (via getClass) or with the help of macros.
Cool.
One thing you would want is a separation between schema and actual records, for performance. For example, specify that column 2 is the title, but have an efficient data structure (Array or Stream) holding the data, either by column or by row. You might be reading millions of records in a single process.
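As a rough sketch of that separation, assuming a row-oriented layout: a schema object resolves column names to positions once, and each row is a bare array of values. `Schema`, `Table`, and `column` are hypothetical names invented for this illustration, not part of Scalding or shapeless, and the sample data is borrowed from the `items` list in the gist.

```scala
object SchemaSketch {
  // Hypothetical schema: column names are mapped to positions once, up front.
  final case class Schema(columns: Vector[String]) {
    private val index: Map[String, Int] = columns.zipWithIndex.toMap
    def indexOf(name: String): Int = index(name)
  }

  // Rows hold only values; the schema is shared by the table, not stored per row.
  final class Table(val schema: Schema, val rows: Array[Array[Any]]) {
    def column(name: String): Iterator[Any] = {
      val i = schema.indexOf(name) // name lookup happens once, not once per row
      rows.iterator.map(row => row(i))
    }
  }

  val items = new Table(
    Schema(Vector("author", "title", "price")), // column 2 is the title
    Array(
      Array[Any]("Benjamin Pierce", "Types and Programming Languages", 49.35),
      Array[Any]("Roger Hindley", "Basic Simple Type Theory", 23.14)
    )
  )

  def main(args: Array[String]): Unit =
    items.column("title").foreach(println)
}
```

This sketch gives up the static typing of the record approach (values are `Any`); a column-oriented variant would instead keep one typed array per column.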