Scala Days Notes

Scala Days 2015

notes by Steven Gangstead

Monday March 16th

Key Note: Scala - where it came from, where it's going

Martin Odersky @odersky

  • Doesn't feel like talking about the announced topic
  • It's just going to be "where it came from" because it isn't going anywhere
  • "Scala is a gateway drug to Haskell"
  • Phase out name in favor of "Hascalator"

Timeline:

  • 1980s: Modula-2, Oberon
  • 1990-95 Functional Programming, lambda calculus, Haskell, SML
  • 1995-98 Pizza (written by Odersky) - brought FP features into Java, led to Generics
  • 1998-99 GJ, javac (the GJ compiler became javac)
  • 2000-02 Functional Nets, Funnel

Motivations for Scala

  • Grew out of Funnel
  • Wanted to show practical combination of OOP and FP
  • What got dropped
  • Concurrency relegated to libraries
  • No tight connection between language and core calculus
  • What got added
  • Native object and class model, Java interop, XML literals (hides behind podium)

Why?

  • Wanted Scala to have hipster syntax

What makes Scala Scala?

  • functional
  • object-oriented / modular
  • statically typed
  • strict
  • closest predecessor: OCaml
  • Differences: OCaml separates object and module system, Scala unifies them
  • OCaml uses Hindley/Milner, Scala subtyping + local type inference.

1st Invariant: A Scalable Language

  • Instead of providing lots of features in the language, have the right abstractions so that they can be provided in libraries
  • This has worked quite well so far
  • It implicitly trusts programmers and library designers to "do the right thing", or at least the community to sort things out.

2nd Invariant: It's all about the types

  • Scala's core is its type system
  • Most of the advanced type concepts are about flexibility, less so about safety
  • Goals: Safety vs. Flexibility / Ease of Use
  • Scala's initial main goal was to make the type system good enough for people who would otherwise choose a dynamic language (focus on flexibility over safety)
  • The goal is for Scala to catch up in terms of Type Safety with other typed languages

The Present (highlights)

  • Emergent Ecosystem
  • Chart of all the Scala libraries
  • New environment: Scala.js - no longer experimental, beats native JS in some benchmarks, great interop with JS libraries
  • Works well because it plays to the strengths of Scala
  • Libraries instead of primitives
  • Flexible type system
  • Geared for interoperating with a host language
  • Tool improvements:
  • Incremental compiler, available in sbt and IDEs
  • New IDEs
    • Eclipse IDE 4.0
    • IntelliJ 13.0
    • Ensime: make the Scala compiler available to help editing
  • Coursera stats:
  • 400,000 "inscriptions"
  • Success rate of ~10% (higher than the industry average)

Where is Scala going?

  • Emergence of a platform
  • Core libraries
  • Specifications:
    • Futures
    • Reactive Streams
    • Spores
  • Common vocabulary
  • Beginnings of a reactive platform, analogous to Java EE
  • JDK the core of the Java Platform
  • Java source -> Classfiles -> Native Code
  • What are class files good for?
  • Make your software portable across hardware, across OSs, versions
  • What's the equivalent for Scala?
  • Scala piggybacks on the JDK
  • Adds Scala Signatures for the Scala compiler to link the symbol table to the generated class files
  • Challenges for Scala
  • Binary compatibility
    • scalac has way more transformations to do than javac
    • Compilation schemes change
    • Many implementation techniques are non-local, require co-compilation of library and client (eg trait composition)
  • Having to pick a platform
    • Previously the platform was "The JDK"
    • In the future: which JDK? 7, 8, 9, 10? And what about JS?
  • Exploring/proposing: a Scala-specific platform
  • Scalac compiles source into "TASTY" and then a packaging tool/linker generates JS / classfiles
  • The core
  • TASTY file format: Serialized Typed Abstract Syntax Trees
  • <simple statement expanded into complex tree, a TAST>
  • TASTY trees take up ~25% of classfile size (but carry much more information)
  • What can we do with it?
  • Applications: instrumentation, optimization, code analysis, refactoring
  • These things are hard to do by parsing the source code (makes me think he's not interested in an official formatter)
  • Publish once, run anywhere
  • Automated remapping to solve binary compatibility issues

Language and Foundations

  • Connect the Dots
  • DOT: A calculus for (...? big graphic on next slide)
  • Work on developing a fully expressive machine-verified version is still ongoing
  • dotc: a compiler for a cleaned up version of Scala
  • dotc: Cleaning up Scala. Removing these things:
  • XML Literals -> moving into string interpolation
  • Procedure Syntax -> going away ("just write equals signs")
  • Early initializers -> Trait parameters, e.g. trait 2D(x: Double, y: Double)
  • More simplifications
  • Existential Types (List[T] forSome {type T}) becomes List[_]
  • Higher-kinded Types List -> Type with uninstantiated type members List
  • Type parameters as Syntactic Sugar
  • example
  • General higher-kinded types through type lambdas
  • example: type Two[T] = (T, T)
  • Two[String]
  • New concepts
  • Type intersections (T & U) and unions (T | U)
  • Make the type system cleaner and more regular (eg intersection, union are commutative).
  • Pose new challenges for compilation:
class A { def x = 1 }
class B { def x = 2 }
val ab: A | B = ???
ab.x
  • dotc compiler close to completion, hopefully alpha release by ScalaDays Amsterdam
  • Plan to use TASTY for merging dotc and scalac

Plans for Exploration

  • Cleaned up language, new compiler, let's add new stuff, right?
  • Ideas worth exploring:
  1. Implicits that compose
  • Already have implicit lambdas: implicit x => t, e.g. implicit transaction => body
  • What about if we also allow implicit function types? implicit Transaction => Result
  • Then we can abstract over implicits: type Transactional[R] = implicit Transaction => R
  • Types like these compose, eg type TransactionalIO[R] = Transactional[IO[R]] (see the sketch after this list)
  • New Rule: If the expected type of an expression E is an implicit function, E is automatically expanded to an implicit closure.
  • that's all you need and with that you can get implicits that compose
  2. Better treatment of effects
  • So far purity in Scala is by convention, not by coercion. In that sense, Scala is not a pure functional language (for FP extremists)
  • We'd like to explore "scalarly" ways to express effects of functions
  • Effects can be quite varied: Mutation, IO, Exceptions, Null-dereferencing
    • All have two essential properties: they are additive, they propagate along the call-graph
  • Hascalator says "thou shalt use monads for effects"
  • Monads are cool, but for Scala I hope we find something even better
    • Monads don't commute
    • Require monad transformers for composition, but this confuses even Odersky!
  • Use implicits to model effects as capabilities
  • instead of def f: R throws Exc = ...
  • use this: def f(implicit t: CanThrow[Exc]): R = ...
  • or add this type throws[R, Exc] = implicit CanThrow[Exc] => R
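
A minimal sketch (not Odersky's slide code) of the composition idea above, written in the Scala 3 context-function syntax (?=>) that eventually shipped for implicit function types; Transaction, currentId and run are made-up names, and List stands in for IO:

case class Transaction(id: Long)

type Transactional[R] = Transaction ?=> R            // "implicit Transaction => R" in the talk's notation
type TransactionalIO[R] = Transactional[List[R]]     // types like these compose (List stands in for IO)

def currentId: Transactional[Long] =
  summon[Transaction].id                             // the Transaction argument is passed implicitly

def run[R](body: Transactional[R]): R =
  body(using Transaction(id = 42L))                  // the expected type expands the body into an implicit closure

@main def transactionalDemo(): Unit =
  println(run(currentId))                            // prints 42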

In summary

  • Scala established FP for the mainstream
  • showed that a fusion of OOP and FP is both possible and useful
  • promoted the adoption of strong static typing
  • has lots of enthusiastic users, conference attendees included
  • Despite it being 10 years old, it has few close competitors

Our Aims

  • Make platform more powerful
  • make the language simpler
  • work on foundations to get to the essence of Scala

Q&A

Q: Will generics be replaced by dependent types, or what's the interaction? A: They will both be supported going forward. Generics can be mapped to dependent types.

Q: In the future with type trees, will they be distributed as something other than JARs? A: Let's see what's in Java 9 because they are proposing new distribution mechanisms, and he'd like to keep using what the JDK does. The current proposal is that the TASTY parts will be annotations in the classfile.

Scala Days 2015

notes by Steven Gangstead

Note: there are 4 sessions at a time, so my notes are just for the one session I was able to attend.

Tuesday March 17th

Key Note: Why Open Languages Win

Danese Cooper @DivaDanese

  • Back story: worked at Apple, Symantec...
  • Worked at Sun to "open source" Java
  • Sun didn't want to open source Java, not sure how to monetize it
  • SCSL - Sun Community Source License, not actually a good license, not open. Everything had to go through Sun
  • Parlayed that into being the Head of Open Source at Sun. They still didn't want to OS Java, but she got other things out there, like Open Office.
  • C++ & Java were competing in the marketplace; Sun's big play was to open source Tomcat by putting it into the Apache foundation
  • Sun willing to sacrifice a part of Java to keep people from using ASPs.
  • Lessons learned at Sun: Apache started Geronimo => Sun open sourced Glassfish. Apache created Harmony => Sun started the OpenJDK project.
  • Left Sun for Intel. Worked on a project over 3 years to get Linux desktop adoption over Windows adoption.
  • Learned Windows desktop penetration too deep. Had to compete on new devices.
  • "If I were to pick a language to use today other than Java, it would be Scala" - James Gosling, 2011 (inventor of Java)
  • R, language based on S. S came from company Sass, extremely proprietary. R made from academia and is totally open. Everyone knows R now, it's a complete ripoff of S. All quants learn R and all graduate work done in statistics is done in R.
  • Pie Graph of github pushes, all open languages take up the bulk. Javascript, Python, Ruby, PHP, Java then C++.
  • Miguel de Icaza - famous for porting C# (and other .NET languages?) to Mono. Microsoft has since open sourced .NET
  • Node.js: PayPal works on Node and is heavily invested. Initially worried about the io.js fork, but she sees that's where all the engineers from Node.js have gone because Joyent is doing a bad job as benevolent dictator (Joyent not mentioned by name)

Take aways:

  • Open source is now a requirement to drive language adoption
  • Don't try to monetize it
  • Have a permissive license
  • It's quite possible to get it WRONG
  • Listen to your developers
  • They just want stuff to work
  • Open Standards != Open Source
  • Lots of big companies will tell you otherwise
  • Open Standards isn't clearly defined like Open Source
  • Look at Richard Stallman's 10 requirements for open source
  • Open Standards bodies are worried about De Facto standards

Questions

Q: One of the challenges of open source is all the other overhead you have to do in addition to just releasing the source. Are there any good patterns for doing that well? A: See the book Producing Open Source Software (I missed the author); it's free. It is hard, but it's healthy for companies to get a feel for how outside developers do things. So you need to set it up so that the outside world has an equal chance of contributing to how to do things. At Paypal she asks these questions for how/when to open source a project: 1) Does anyone care? 2) Do we still use it (some companies are tempted to OS junk they aren't using anymore)? 3) Is there a resource (so people can research it)? If the first 3 things are yes, can we continue to work on the project after we open source it (modularize the "secret sauce" parts and keep them closed while working on the rest in the open)?

Q: How do you overcome company's resistance to OS because they are afraid of devaluing their product? A: Companies OS for three reasons: for their reputation, to disrupt, (I missed the third thing, fear?). You have to find out which of the reasons is going to motivate a particular company.

Q: How do you finance open source? A: Dual license has been really effective for a few projects. MySQL traditionally and now MongoDB. MongoDB has a license that makes IaaS difficult without going to get a commercial license. Her favorite is the foundation. It keeps the books open, keeps transparency high and gives everyone an equal chance at contributing. Sponsorship is really hard to do well, she mainly sees it going poorly and it creates a lot of forks. She likes the idea of crowdfunding, but there's a problem with fulfillment. Grassroots suffers when it gets big enough for anyone to care, the monetizing wolves tear it to shreds.

Q: How do you feel about Contributor License Agreements? A: For all you Californians: the idea of a pre-invention agreement does not apply to Californians because of a lot of case law. Just document that you did the work on your own resources on your own time. Projects are working on updating CLAs to be more permissive while still protecting the project from Copyright claims, possibly getting rid of CLAs. The debate is not over, but the hip projects are looking at no longer aggregating copyrights and having the "developers attest"

Q: At companies that want to open source stuff someday, just not yet because it's not ready or whatever. What do you think is the tradeoff? A: You have to have at least something ready before releasing. It's too hard to get going otherwise. Outside of that she recommends "release early, release often". That may mean finding parts of code that you want to rewrite and just tagging it as such first and then rewriting it later.

Q: Are open source concepts taking hold in other industries? A: Yes, I wrote a book about that in 2007, go read it. I've seen it in things like science, medicine, extreme sports. Open source is everywhere you want to be.

Scala Collections Performance

Craig Motlin @motlin

Steven's note: Craig flew through his slides, way too much info to type down at speed. Find his slides online.

  • Works at Goldman
  • One library he works on is GS Collections
  • Doesn't see people using Scala at Goldman for general code.
  • Sees people switch back to Java for performance reasons.

Goals:

  • Scala programs ought to perform as well as Java but Scala is a little slower
  • Highlight a few performance problems that matter to me
  • GS Collections (GSC) and Scala collections are similar
  • mutable and immutable interfaces with common parent
  • similar iteration patterns at hierarchy root Traversable and RichIterable
  • Lazy evaluation (view, asLazy)
  • Parallel-lazy evaluation (par, asParallel)
  • GS Collections and Scala Collections are different:
  • Persistent data structures
  • Hash tables
  • (other stuff, I couldn't keep up)

Persistent Data Structures

  • Data Structure that always preserves the previous version of itself when it is modified
  • Examples, List, Sorted Set
  • "Add" by constructing new nodes from leave to a new root
  • Important in purely functional languages
  • All collections are immutable
  • Must have good runtime complexity
  • No one seems to miss them in GS Collections
  • Proposal: mutable, persistent (immutable), and plain immutable
  • Mutable same as always
  • Persistent: use when you want structural sharing
  • Plain immutable
  • not persistent
  • "adding" is much slower
  • speed benefits for everything else
  • huge memory benefits
  • Performance assumptions:
  • Iterating through an array should be faster than a linked list
  • linked lists won't parallelize well with .par
  • no surprises in the results - so we'll skip
  • Immutable array-backed sorted set
  • immutable -> trimmed, sorted array
  • no nodes -> ~1/8 memory
  • array backed -> cache locality
  • Assumptions about contains
  • may be faster when it's a binary search in an array (good cache locality)
  • will be about the same in mutable / immutable tree
  • assumptions not quite correct, immutable sorted set slower but only a tiny bit. He's only interested in 2X or 1/2x differences.
  • Testing serial arrays, about the same performance as Scala.
  • Testing parallel-lazy evaluation
  • Assumption: Scala's tree should parallelize well & GSC's array should parallelize very well.
  • Surprise result: Scala collection much slower in parallel than in serial
  • Scala's immutable.TreeSet doesn't override .par so parallel is slower than serial.
  • Some tree ops like filter are hard to parallelize
  • TreeSet.count should be easy to parallelize using fork/join with some work
  • Persistent Data Structures wrap-up:
  • proposal: mutable, persistent and immutable in the same library

Hash Tables

  • Scala's immutable.HashMap is a hash array mapped trie (pronounced "tree")
  • "achieves almost hash table-like speed while using memory much more economically" - wikipedia
  • Scala's mutable.HashMap is backed by an array of Entry pairs
  • java.util.HashMap.Entry caches the hashcode. Takes more memory, but gets a speed benefit when resizing the array.
  • GSC's UnifiedMap is backed by Object[] flattened. ImmutableUnifiedMap is backed by a UnifiedMap
  • Testing: Scala's immutable hashmap memory size increases linearly, but everyone else's is much lower, with a step as the array doubles.
  • Testing hashmap get: Scala Immutable hashmap is way slower than everything else, mutable map performance is similar to GSC maps.
  • Testing hashmap put: Same results.
  • HashSets:
  • Scala immutable hashset is backed by an array
  • java.util.HashSet is implemented by delegating to a HashMap
  • GSC UnifiedSet is backed by Object[], either elements or arrays of collisions
  • GSC ImmutableUnifiedSet is backed by a UnifiedSet
  • Memory usage of hashsets: Scala immutable hashsets use lots of memory, linearly increasing. Scala mutable hashset has good performance, increasing as the array expands.

Primitive Specialization

  • Boxing is expensive
  • costs for Reference + Header + alignment
  • Scala has specialization, but most of the collections are not specialized (see the sketch after this list)
  • If you cannot afford wrappers you can:
  • use primitive arrays (only for lists)
  • use a java collections library
  • Proposal: Primitive lists, sets and maps in Scala
  • Not Traversable - outside the collections hierarchy
  • fix specialization on functions (lambdas) so that for-comprehensions can work well
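
A minimal sketch (an assumed example, not from the talk) of the @specialized annotation the proposal builds on: it makes the compiler generate primitive variants of a class or method so Int/Long values avoid boxing.

class Cell[@specialized(Int, Long) T](val value: T) {
  def map[@specialized(Int, Long) U](f: T => U): Cell[U] = new Cell(f(value))
}

object CellDemo extends App {
  val c = new Cell(42)          // picks the Int-specialized variant, so 42 is not boxed
  println(c.map(_ + 1).value)   // prints 43
}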

fork - join

  • [test code for scala, java and GS collections]
  • [performance results flashed up on screen too fast to understand]
  • Fork join is general purpose but always requires merge work
  • We can get better performance through specialized data structures meant for combining.

They watch for the gs-collections tag on Stack Overflow. Code on GitHub: https://github.com/goldmansachs/gs-collections

Life Beyond the Illusion of the Present

Jonas Boner @jboner

  • "Time is a device that was invented to keep everything from happening at once" - Graffiti on a wall at Cambridge University
  • Newtonian physics - the simplified model of time is very appealing to us.
  • von Neumann architecture - a single processor running over mutable state, with full control of the present.
  • Concurrency comes along and makes everything difficult
  • Jim Gray gave us transactions to give us the illusion of order within a transaction to give us our linear time back.
  • Distribution comes along and makes life miserable again. Transactions don't distribute well.
  • This is not surprising, the world doesn't work in transactions either. There isn't an absolute single global consistent present
  • You can construct a local present and work with that.
  • "The future is a function of the past" - A J Robertson
  • "The (local) present is a merge function of multiple concurrent pasts" - Boner
  • [joke involving a foldleft in scala]
  • Information is always from the past, the present is relative.
  • The truth is actually closer to Einstein's physics, where everything is relative to the observer
  • Information travels at the speed of light (we all know that). This puts a cap on the speed of information. Information has latency, contrary to Newton's laws.
  • The cost of maintaining this illusion is increased contention and coherency
  • Adding participants eventually slows down the system [someone's law]
  • As latency gets higher, the illusion cracks even more.
  • Classic quote: "If a tree falls in the forrest ..." - Charles Riborg Mann
  • Directly affects computer systems because information can get lost and it will get lost.
  • How do we deal with information loss in real life?
  • We use a simple protocol of confirm or wait / repeat.
  • We don't wait for guaranteed delivery
  • We take educated guesses to fill in the blanks
  • and if we are wrong we take compensating action
  • Can we rewrite the past?
  • Winner writes the history books, the history books even get rewritten
  • We can do this in CS, but should we?
  • Usually a bad idea, but we can add more information
  • There is a path forward
  • Treat time as a first class construct
  • What is time really?
  • It's not wall-clock time: hours, minutes, seconds
  • Time is the succession of causally related events
  • Embrace this and things fall into place
  • How to manage time? Thinking in FACTS
  • Facts have values, they are not variables. They accrue either as new information or derived from previous information.
  • Immutability is a core requirement
  • Not a part of classic traditional object orientation
  • They conflate identity with value
  • There is a time and a place for mutability, but immutable should be the default
  • Do variables have a purpose in life?
  • "The assignment statement is the von Neumann bottleneck of programming languages and keeps us thinking in word-at-a-time terms ..." John Backus (Turing Award lecture 1977)
  • Mutable state needs to be contained, not exposed to the rest of the world. Only expose immutable values.
  • How do we manage facts? Functional Programming
  • "you put facts in and out comes new facts"
  • Dataflow graphs, model time through data dependencies
  • First rule of facts: never delete facts
  • Facts represent the past and the past is the only way to the present
  • Disk is so cheap, there's no reason to delete.
  • [Long Jim Gray Quote about accountants not altering the books, but taking new notes]
  • CRUD becomes CR (no update or delete)
  • "database is a cache of a subset of the log" - Pat Helland (2007)
  • Store facts in an event log. The log is like a database of the past.
  • The log id is like the ticking forward of time
  • The log allows time travel
  • Constructing a sufficiently consistent local present means employing consistency mechanisms
  • an agreement across processes
  • consistency means employing some sort of coordination
  • too little coordination can violate correctness, too much means reduced availability.
  • Inside Data: our current present / Outside Data: blast from the past / Between Services: hope for the future - Pat Helland (2011?)
  • Event sourcing - practical tool to capture state changing events in the log. Replay history to reconstruct present.
  • Queries can be hard
  • Microservices map to consistency boundaries.
  • Decoupling in space / time can give you the isolation to have fault tolerance.
  • In reactive systems this is called Location transparency
  • Strong consistency - the wrong default
  • It has an extremely high price
  • We most often don't need it
  • Eventual consistency
  • Loosen up the guarantees and focus on availability
  • gives us room for scalability
  • has a loose meaning and is not as useful. How eventual? How consistent?
  • Tracking time is tracking causality; don't rely on timestamps
  • they lead to write locks
  • Alternative: Lamport clocks (see the sketch at the end of this list)
  • gives us global causal ordering between events
  • Vector Clocks
  • Partial causal ordering between events.
  • Logical time allows causal consistency
  • What consistency guarantees do you really need, and when?
  • Sometimes events go outside your system and are then causally related
  • Expensive to track all the metadata
  • Mine for confluence
  • Your component produces the same set of outputs regardless of the order of its inputs
  • Powerful property, you don't have to coordinate.
  • ACID 2.0
  • Associative
  • Commutative
  • Idempotent
  • Distributed
  • CRDTs - conflict-free replicated data types
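
A hedged sketch (not from the talk's slides) of a Lamport clock: each local event bumps the counter, and a receive jumps past the sender's timestamp, which is what gives the causal ordering mentioned above.

final case class LamportClock(time: Long = 0L) {
  def tick: LamportClock = copy(time = time + 1)                 // local event
  def receive(senderTime: Long): LamportClock =                  // message arrives
    copy(time = math.max(time, senderTime) + 1)
}

object LamportDemo extends App {
  val a = LamportClock().tick              // event on node A at time 1
  val b = LamportClock().receive(a.time)   // node B receives A's message: time 2
  println((a.time, b.time))                // (1,2): causally related events stay ordered
}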

Experiences Using Scala in Apache Spark

Patrick Wendell, Databricks @pwendell

  • Spark is an execution engine for doing large-scale data analytics on clusters of machines. Written in Scala, with APIs for Java, Python, and R.
  • Most active project in the Apache Foundation
  • Simple example from Spark REPL (fork of the Scala REPL)
  • Databricks founded by Spark creators
  • Databricks Cloud - basically Spark as a Service
  • Internal components written in Scala

Databricks' Overall Impressions of Scala

  • Using a new Programming Language is like Falling in Love
  • honeymoon phase gives way to quirks
  • key to success is investing in the relationship
  • Why we chose Scala
  • Wanted to work with Hadoop, which is JVM-based, and wanted a concise programming interface
  • Compatible with JVM ecosystem (big legacy codebase in big data)
  • DSL support
  • Concise syntax (rapid prototype, but still typesafe)
  • Thinking functionally (encourages immutability and good practices)
  • Perspective of a software platform
  • Users make multi-year investments in Spark
  • large ecosystem of third party libraries
  • hundreds of developers on the project (who come and go)
  • Important to us:
  • backwards compatible and stable APIs
  • A simple and maintainable code base

Concerns with Scala

  • Overview
  • Easy to write dense & complex code
  • some people think minimizing LOC is the end game
  • Certain language concepts can be easily abused
  • E.g. DSLs, operator overloading, Option/Try, chaining
  • Compatibility story is good, but not great
  • Source compatibility is a big step towards improving this
  • Binary compatibility still far off in the future

Announcing the Databricks Style Guide

"Code is written once by the author and modified multiple times by lots of other engineers" it's on github.com/databricks/scala-style-guide

Example: Symbolic Names

  • Symbolic method names make it hard to understand the intent of the functions:
channel ! msg
stream1 >>= stream2

Not as clear as:

channel.send(msg)
stream1.append(stream2)

Example: Monadic Chaining

Example of getting a value from a map with a nested chain of get, flatMap, get, flatMap... Refactored to not be so deep (a hedged sketch of this kind of chain follows below). Question from the audience: What about for comprehensions? Answer: yeah, those are fine, they make it more readable.
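A hedged sketch (not the slide's actual code) of the kind of nested Option chain the rule below discourages, plus the for-comprehension alternative the audience asked about:

object ChainDemo extends App {
  val config: Map[String, Map[String, String]] =
    Map("db" -> Map("host" -> " localhost "))

  // Hard to follow once it nests a few levels deep:
  val hostChained: Option[String] =
    config.get("db").flatMap(_.get("host")).map(_.trim).filter(_.nonEmpty)

  // The for-comprehension version reads more clearly:
  val hostFor: Option[String] = for {
    db      <- config.get("db")
    host    <- db.get("host")
    trimmed  = host.trim
    if trimmed.nonEmpty
  } yield trimmed

  println(hostChained)   // Some(localhost)
  println(hostFor)       // Some(localhost)
}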

Subjective rule re. Monadic Chaining:

  • Do not chain / nest more than 3 operations deep
  • Non-Scala devs in particular have a hard time understanding code more nested than that.

Other topics in maintaining a large codebase

  • [Binary compatibility & Source Compatibility explanation]
  • Spark has committed to binary compatibility
  • Binary Compatibility
  • Can't change or remove function signatures
  • Can't change argument names
  • Need to cross compile for different Scala version
  • Less obvious things that break binary compatibility
  • adding concrete members to traits
trait Person {
  def name: String
}
trait Person {
  def name: String
  def age: Option[Int] = None
}
  • Make this an abstract class and it works instead
  • Might change in future versions of scala where this won't break binary compatibility
  • Return Types
  • Explicitly list return types in public APIs; otherwise type inference can silently change them between releases.
  • This is good practice anyways
  • Verifying binary compatibility
  • Typesafe tool called MiMa, outdated but useful
  • We've built tooling around it to support package private visibility
  • Building a better compatibility checker would be a great community contribution
  • Java APIs
  • Conjecture: the most popular Scala projects in the future will have Java APIs
  • because the user base is so much bigger
  • need to runtime unit test everything using Java
  • with some encouragement, Scala team has helped fix Java compatibility bugs
  • Have to do things like avoid some features (default implementations) and return Java collections instead of Scala collections.
  • Performance in Scala
  • Understand when low-level performance is important (see the sketch after this list)
    • prefer Java collections over Scala collections
    • prefer while loops over for loops
    • prefer private[this] over private
  • IDEs - they prefer IntelliJ
  • Build tools
  • They used to support SBT and Maven builds.
  • Now the "official" build is Maven, but they use sbt-pom-reader plugin to support SBT
  • SBT has improved substantially since they made that decision
  • SBT & Maven differences boil down to: do you prefer Scala over XML for the language? And SBT-Plugins are better than MOJO plugins (Maven)
  • Overall he makes a stronger case for SBT over Maven now.
  • Getting help with Scala
  • Hipster beginner Scala book: Atomic Scala - learning programming in a language of the future. Hipster because it's hard to get hold of a copy. Second edition published online a week ago.
  • Scala issue tracker: https://issues.scala-lang.org/secure/Dashboard.jspa
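
A minimal sketch (assumed, not Spark's code) of the low-level tips above: a while loop instead of a for loop, and private[this] instead of private.

class RunningTotal {
  // private[this] is object-private; the field is accessed directly instead of
  // through the accessor method that plain `private` generates.
  private[this] var total: Long = 0L

  def add(xs: Array[Int]): Long = {
    var i = 0
    while (i < xs.length) {   // avoids the Range/closure overhead of `for (x <- xs)`
      total += xs(i)
      i += 1
    }
    total
  }
}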

Conclusions

  • Scala has a large surface area. For best results, we've constrained our use of Scala
  • Keeping your internals and (especially) API simple is really important
  • Spark is unique in its scale, our conventions may not apply to your project

Q: How is their Scala style different from Typesafe's, and how do they enforce it programmatically? A: They use the Scalastyle tool to enforce it automatically. Databricks' guide fills in some gaps that Typesafe leaves in their style guide.

Type-level Programming in Scala 101

Joe Barnes, Senior Software Architect at Mentor Graphics, @joescii

New name: Type-Level Programming, The Subspace of Scala

Follow along at http://type-prog.herokuapp.com

  • Not an expert on the subject, here to share his Aha! moment
  • Programming in Scala is like Super Mario 2
  • In normal value programming there's a lot of stuff that will kill you; drink a flask of type programming and a door appears into the bizarro world. That world is a subspace.

Basic stuff

  • Value programming
  • val num = 1 + 2 + 3
    • happens at run time
  • lazy val str = "a" + "b" + "c"
  • happens later, when you access it
  • def now = new java.util.Date
  • Even lazier than lazy, happens even later because it happens every time you access it
  • type MyMap = Map[Int, String]
  • Like structs back in C, happens in the compiler
  • Code examples: defining boolean values with traits.
  • Can create it in types: instead of a case object for FalseVal it's now a trait called FalseType.
  • Everything is the same but we're replacing def and val with type. This is moving everything from run time into compile time.
  • How to test?
  • With Values you are using Specs or Scalacheck
  • But how to do it with Types?
object BoolTypeSpecs {
	implicitly[TrueType =:= TrueType]
}
  • If you want to get at a type inside the type: implicitly[TrueType#Not =:= FalseType] (# instead of .) - a fuller sketch follows below

  • The test is done by the compiler

  • How to test the negatives? import shapeless.test.illTyped; it compiles only if the quoted code DOESN'T compile

  • Scalatest has something that will do this check at runtime.
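
A hedged sketch of the encoding and tests described above; TrueType/FalseType follow the talk, the Not type member is an assumption (the member referred to a few bullets up), and illTyped comes from shapeless as mentioned.

import shapeless.test.illTyped

sealed trait BoolType { type Not <: BoolType }
sealed trait TrueType extends BoolType { type Not = FalseType }
sealed trait FalseType extends BoolType { type Not = TrueType }

object BoolTypeSpecs {
  implicitly[TrueType =:= TrueType]               // the "test" runs at compile time
  implicitly[TrueType#Not =:= FalseType]          // select the type member with # instead of .
  illTyped("implicitly[TrueType =:= FalseType]")  // passes only if the quoted code does NOT compile
}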

  • Advanced example with more than two values (true and false). Now we do integers.

  • Start with an IntVal trait, an Int0 base-case object, an IntN for all others, and a recursive plus operation.

  • Now do it with IntType, extended by Int0 and IntN types. The code is isomorphic.

  • Your compilation can now have problems, including never completing. This is a problem that you used to have at run time.

  • You've created a bunch of types, so what's the point?

  • To answer that we'll make a simple list of integers in value programming.

  • While testing it you want to add require(...) statements to it, they will be caught at runtime.

  • This implementation validates the size at runtime, but we know the size at compile time.

  • Now with the IntList types we can validate it at compile time.

  • Your compiler errors in your tests won't be intelligible.

  • There is a compiler flag (possibly "higherkinds") to turn off compiler warnings when you try to do any of these things.

Retrospective

  • Types are not just the shape of my objects
  • Type programming expedites computation to compile time
  • This guarantees I see the errors, rather than my users
  • Type validations propagate through my code base

Presentation: https://github.com/joescii/type-prog-impress This was an awesome presentation. He had it running on a heroku app and it would update as he progressed through the slides.

Happy Paths: Functional Constructs in the Wild

Joe Walp, James Kionka

Scalaz, Cats, Scalactic, Structures -> Abstraction, Category theory Scala map/flatMap -> Explicit

  • Example working with Options, you can put them in for comprehensions, map them.

  • Example in a for comprehension of Options, when one fails, you don't know which one.

  • Scalaz has Maybe[A] which is similar to Option but it is invariant (instead of covariant)

  • Working with Try example. Very similar to Option except you have a Failure case (which contains a throwable) instead of a None.

  • Tip: you can incrementally migrate Options to Trys. Take the piece you are working on and put it in a Try, then at the end put .get and it will throw the thing it was going to throw anyway.

  • Try only catches non-fatal exceptions.

  • Cool kids on scalaz use a Disjunction (Scalaz' version of Either) \/ (note: not a V)

  • Gives you explicit exception types.

  • \/ has a .fromTryCatchThrowable method to catch only specific exceptions (see the sketch below)
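
A hedged sketch (an assumed example) of that Scalaz call: fromTryCatchThrowable catches only the named exception type and returns a disjunction.

import scalaz.\/

object DisjunctionDemo extends App {
  def parse(s: String): NumberFormatException \/ Int =
    \/.fromTryCatchThrowable[Int, NumberFormatException](s.toInt)

  println(parse("42"))    // \/-(42): the right (success) side
  println(parse("42x"))   // -\/(java.lang.NumberFormatException: ...): the left (failure) side
}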

  • [ten slides of category theory, I lost focus]

  • Monadic Laws

  • Left identity

  • Right identity

  • Associativity

  • Lists of Monads. Start with a List of names, map and use a try in the map and you have a List[Try[Person]], but you really wanted Try[List[Person]], so sequence!

  • val people: Try[List[Person]] = peopleList.sequence

  • Sequence will only return the first failure

Futures

  • Example of two futures that take different times to complete.
  • Every operation on a Future takes an execution context; you can't force it to complete, you can only wait (possibly with a timeout). See the sketch after this list.
  • Execution contexts are basically thread pools
  • Example to get the first Success from a list of futures.
  • Scalaz version of this is Task
  • Will attempt to reuse thread and wait until you tell it to run
  • Tasks also have many features defined for you already (like gatherUnordered)
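
A hedged sketch (not the speakers' code) of the Future points in this list: work runs on an execution context (a thread pool), callers can only wait, and Future.firstCompletedOf is the closest standard-library relative of the "first from a list of futures" example (it returns the first completed future, not necessarily the first successful one).

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FutureDemo extends App {
  implicit val ec: ExecutionContext = ExecutionContext.global

  val fast = Future { Thread.sleep(100); "fast" }
  val slow = Future { Thread.sleep(1000); "slow" }

  val first = Future.firstCompletedOf(Seq(fast, slow))
  println(Await.result(first, 2.seconds))   // prints "fast"
}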

Co- / Contra- / In- variance

  • Most collections are covariant
  • Example of how covariance gives you some wonkiness in everyday collections (a sketch follows)
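
A hedged sketch of that wonkiness: because List is covariant in its element type, contains accepts a supertype (effectively Any), so a nonsensical check compiles and silently returns false.

object VarianceDemo extends App {
  val xs: List[Int] = List(1, 2, 3)
  println(xs.contains("three"))   // compiles fine; prints false
}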

Summary We should all strive for better tools, but until they get here, we must also improve the ones we were given.

Akka in Production: Why and How

Evan Chan, Socrata Inc., @Evanfchan

Reactive applications - event driven, scalable, resilient and responsive

  • For most people this means akka and play
  • Lots of companies and frameworks are using Akka

Ingestion Architectures with Akka

Typical Akka stack:

  • spray (http), slf4j / logback, yammer metrics, spray-json, akka 2.0, scala 2.10

Places he's used Akka:

Livelogsd - Akka/Kafka file tailer

  • Project Evan worked on.
  • Maps one actor per file handler

Apache Storm

  • Has spouts and bolts
  • Actors talk to each other within bolts to give a level of concurrency to what was normally single threaded
  • hard to support, very complex with lots of moving parts

Akka cluster-based Pipeline

  • Still too complex -- would we want to get paged for this system in the middle of the night?
  • Akka cluster 2.1 not ready for production, but 2.2.x much more stable
  • Mixture of actors and futures for HTTP requests became hard to grok
  • Actors were much easier for most developers to understand (compared to Futures)

Simplified Ingestion Pipeline

  • Kafka used to partition messages
  • Single process - super simple
  • No distribution of data
  • linear actor pipeline - very easy to understand

Stackable Actor Traits

  • Why?
  • Keep adding monitoring, logging, metrics, tracing and the code gets pretty ugly and repetitive
  • We want some standard behavior around actors -- but we need to wrap the actor Receive block:
  • Start with a base trait: trait ActorStack extends Actor { ... }
  • then wrap your receive block with functionality
  • your actor code is still nice and clean and you get your boilerplate functionality mixed in easily (see the sketch below)
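
A minimal sketch of the stackable-trait pattern described above; the names are illustrative, not the speaker's exact code.

import akka.actor.Actor

trait ActorStack extends Actor {
  // Actors in the stack implement wrappedReceive instead of receive.
  def wrappedReceive: Receive

  def receive: Receive = {
    case msg =>
      if (wrappedReceive.isDefinedAt(msg)) wrappedReceive(msg) else unhandled(msg)
  }
}

// A wrapper trait mixes extra behavior in around every message.
trait MessageLogging extends ActorStack {
  abstract override def receive: Receive = {
    case msg =>
      println(s"received $msg")
      super.receive(msg)          // pass on to the next trait in the stack
  }
}

// The actor's own code stays clean; boilerplate is mixed in at the end.
class Worker extends ActorStack with MessageLogging {
  def wrappedReceive: Receive = {
    case s: String => println(s"working on $s")
  }
}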

Productionizing Akka

Akka Performance Metrics

  • define a trait that adds two metrics for every actor
  • frequency of messages
  • time spent in receive block
  • all metrics exposed via a spray route, /metricz
  • daemon polls /metrics to aggregate data
  • [examples of charts you get with this data]
  • VisualVM and Akka
  • Bounded mailboxes = time spent enqueueing msgs
  • A way to provide backpressure
  • Stack traces don't work for akka apps [example]
  • What we want is a way to track the message flows
  • Trait sends an Edge(source, dest, messageInfo) to a local Collector actor, Trakkar
  • Aggregate edges across nodes then graph
  • Akka service discovery
  • Akka remote - need to know the remote nodes
  • Akka cluster - need to know the seed nodes
  • Use ZooKeeper or etcd
  • Be careful - akka is very picky about IP addresses. Beware of AWS, Docker, etc. Test, test, test.
  • Akka instrumentation libraries
  • kamon.io, uses aspectj to "weave" in instrumentation, metrics, logging, tracing
  • akka-tracing, zipkin distributed tracking for Akka

Backpressure and Reliability

Backpressure - the ability to tell senders to slow down / stop

  • Must look at entire system, individual components having flow control does not mean the system behaves well
  • by default, actor mailboxes are unbounded
  • using bounded mailboxes
  • when the mailbox is full, messages go to DeadLetters
  • mailbox-push-timeout-time: setting for how long to wait when mailbox is full
  • doesn't work for distributed systems
  • Real flow control: pull, push with acks, etc
  • works anywhere but more work.
  • Backpressure in action
  • a working back pressure system causes the rate of all actor components to be in sync
  • witness this message flow rate graph of the start of event processing (they all go at the same rate)
  • Akka streams
  • very conservative - Pull Based
  • consumer must first give permission to publisher to send data
  • Backpressure for fan-in
  • multiple input streams go to a single resource (DB?)
  • may come and go
  • pressure comes from each stream
  • Three messages Register, Ready for data, Data
  • High overhead: lots of streams to notify "Ready"
  • At least once delivery
  • let every message have a unique id; the ack returns with the unique id. What happens when one message doesn't get acked?
  • Resend unacked messages until confirmed == "at least once"
  • Requires keeping message history around
  • unless the source is Kafka, then just replay from the last successful offset + 1
  • Use akka persistence - has at least once semantics
  • Combining fan-in and at-least-once
  • let the client have upper limit of unacked messages
    • Messages are Msg ###, Ack ### and Reject.
  • Use an actor to limit # of outstanding futures
  • [example code] - keeps memory from filling up with futures
  • Good Akka development practices
  • Don't put things that can fail into Actor constructor
  • default supervision strategy stops an actor which cannot initialize itself
  • instead use an initialize message
  • learn akka testkit!

The Scalactic Way

Bill Venners

  • Scalactic grew out of Scalatest
  • Scalatest - quality through tests
  • "The Scalactic Way" - quality through types
  • SuperSafe - quality through static analysis

Scalactic AnyVals: PosInt, PosLong, ... PosInt(1) works, but PosInt(-42) is caught at compile time. val x = 1; PosInt(x) can't be caught at compile time, so you have to do PosInt.from(x) to get an Option[PosInt]. They are called AnyVals because at runtime a PosInt is just an Int. val x: PosInt = -1 uses a macro to do the implicit conversion. (A hedged sketch follows below.)
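
A hedged sketch of that AnyVal behavior (an assumed example, not Bill's slide code):

import org.scalactic.anyvals.PosInt

object PosIntDemo extends App {
  val one: PosInt = PosInt(1)        // the literal is checked by a macro at compile time
  // val bad: PosInt = PosInt(-42)   // would not compile: -42 is not positive

  val x = 1
  // x is not a literal, so the compile-time check is impossible; use the
  // runtime variant that returns an Option instead:
  val maybe: Option[PosInt] = PosInt.from(x)

  println((one, maybe))              // (1,Some(1))
}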

  • PropertyCheckConfig has a bunch of requires to ensure you have a valid PropertyCheckConfig, but they are checked at runtime
  • Changed in Scalatest 2.3 to use AnyVals and there is less code, and it's caught at compile time.
  • This also reduces some other run time requires and assertions
  • Can roll your own Compile-time assertions
  • requires writing a Macro.
  • Macros can be hairy, but they have ways of making that easier for you
  • Example: half function with a require assertion on the input to be even
  • replaced with an EvenInt type
  • Now it's just tested in one place that the type satisfies its requirements, instead of in requires and assertions all over your code.
  • The Scalactic way - Use types where practical to focus and reduce the need for tests (both assert and requires)
  • Sits on comfy chair to tell stories
  • ScalaTest 2.0 -> 3.0: he attempted to add typesafe equality, but it comes at the cost of complexity
  • In the end he decided the added compile time to get better equality type checking was not worth it, and they're not releasing it.
  • This halted effort combined with some forum discussion with Odersky gave way to the solution being done in the SuperSafe project which is doing static analysis instead of type checking, and doesn't even make the compile time predictably larger.
  • This is the Gordian-knot solution where types were not the right answer
  • Type-safe contains was also another tricky problem to do with Scalactic, but was much easier to do with SuperSafe
  • Lesson: the type system does not offer the best solution to every problem
  • The SuperSafe Way
  • Can run all the time
  • Doesn't hurt compile time
  • No warnings, only errors
  • Not a linter; a "scala subset policy enforcer"
  • Free as in free beer, but has premium licensed version of $60 / seat / year

Q: Does SuperSafe affect the runtime binary? A: No, other than the code changes it causes you to make.

Scala Days 2015

notes by Steven Gangstead

Note: there are 4 sessions at a time, so my notes are just for the one session I was able to attend.

Wednesday March 18th

Announcements

The Scalawags (Josh Suereth and Daniel)

CodeMash - family-friendly conference in Ohio, hosted at a water park. http://www.codemash.org/

Keynote: Technical Leadership from wherever you are

Dianne Marsh, Director of Engineering Tools at Netflix, @dmarsh

  • Own It - own the decision making process, as if you own the company, and you own your job satisfaction
  • What is Leadership - Kids on a playground form leadership organically. Writing out the rules and having people sign them isn't leadership. Leadership is understanding the difference between leaders and management.
  • Leaders aren't appointed.
  • Are early adopters leaders? Not necessarily. Early adopters are learning and sharing what they're learning. It's compelling to think of early adopters as leaders, often times they are, but it's not something we want to blindly follow. Assuming EA are always leaders leads to "Shiny Object Syndrome"
  • Leaders emerge from great organizations - the more your opinions are valued the more freely you give them.
  • You Ride - You Decide - as a manager you don't want to make decisions for your team that limit their success. Showed a picture of coworkers mountain biking - they are the ones riding down the mountain, don't want a manager that affects that and doesn't ride.
  • Value courage - leader has to have courage to deliver unpalatable news
  • Netflix Culture: Freedom & Responsibility
  • Responsibility doesn't get as much press as the Freedom part, but it's integral
  • One of the most important things that leaders bring to the table is Vision
  • If you own a company, you'd better understand where the company is going
  • Strategize - Leaders spend a lot of time figuring out how to get that vision to reality
  • Communicate - If you don't communicate an idea it's as if you never had it
  • Great leaders Inspire their team / community. If you don't inspire your team / community they will start following someone who does
  • Remain Flexible - You have to have a plan, but you don't have to act on it. The strategy should evolve as new information is revealed
  • Listening is one of those things that's really hard to do well. Leaders have a proclivity to dominate the conversation and not give the silent people a chance to join the conversation.
  • Story - some people process information differently. They're deep thinkers and need time to process info before they will give their opinion. She started giving them information before the meeting so they could prep ahead of time.
  • Managing Up - leaders job is to make other people look great. Give people under you all the information they need to make good decisions
  • Managing Sideways - similar but with the distinction that when you are talking about your peers you know their strengths and goals. How can you help improve the situation for everyone on the team?
  • Feedback - the biggest gift you can give someone
  • stories about people soliciting "360" reviews from their friends, daughters, etc because they value feedback so much
  • Challenges of Technical Managers - technical managers feel like they aren't contributing if they don't get their hands on some code and that's not always the best thing to do.
  • "I didn't get anything done today, just went to the meetings" - But that's actually your job as a technical manager
  • Technical managers still need to exercise their technical muscle
  • Ways to do that:
  • Internal Hackathons
  • Off critical path projects (not on any timeline, usually something like internal tools)
  • Learn by listening - podcasts
  • Where do you recharge?
  • Winter Tech Forum (formerly Java Posse Roundup)
  • Find your Place to go to immerse yourself and refresh, engage
  • Types of leaders and what they do:
  • Project leaders:
  • project management groups
  • follow through
  • communicate with team
  • People leaders:
  • give honest, frequent feedback
  • protect the company culture
  • help recruit coworkers
  • Idea Leaders:
  • Speak at conferences
  • Write
  • Organize Internal Events
  • Scala Leaders:
  • Be welcoming of newcomers
  • Recognize there's not Just One Way
  • Respect the journey - help new developers
  • "You can't spend half a career as someone else's employee and then, suddenly, one day, start thinking like an owner. Think like an owner from the very first day of the job."

The Unreasonable Effectiveness of Scala for Big Data

Dean Wampler @deanwampler

Claimed "Hadoop is the Enterprise Java Beans of our time"

Hadoop history

  • Started to get traction mainstream around 2008 when Yahoo posted an article with some big numbers for how much data they were processing.
  • HDFS architecture explained
  • Example: Inverted Index
  • Web crawlers index a bunch of pages, map reduce job makes an inverse index where the keys are the words and the values are lists of tuples of the files containing the word and the word count
  • Problems
  • Hard to implement anything more than simple algorithms in map reduce
  • Hadoop API is horrible
  • Example Java code:
  • lots of method calls just to set properties, lots of other ceremony to declare the types and get the code set up. The actual core of the algorithm is only a few lines and even that is more complicated than it needs to be.
  • You get lost in trivial details, you implement everything that matters yourself
  • This is the state of things around 2011/2012
  • Twitter had this problem and was using Scalding, a Scala API, which was based off of Cascading (Java) which was based off of map reduce
  • The same example is now less than half the code
  • Code you write uses basic Scala methods flatMap, groupBy and doesn't have much ceremony.
  • This still has problems since it is using Map Reduce
  • Still uses a bad mapreduce api
  • Only works in "Batch mode"
  • What's next: Event Streams
  • Storm
  • You have to do all your logic and querying twice: once in the batch layer and once in the streaming layer
  • Twitter came up with an API Summingbird that sits on top of Scalding and Storm
  • Spark - answers the problem of having one api for batch and streaming mode
  • very concise, elegant functional api
  • flexible for algorithms
  • composable primitives
  • Efficient: builds a dataflow Directed Acyclic Graph and caches it in memory
  • Process streams in "mini batches"
  • Reuse "batch" code
  • Adds "window" functions - a stream is just a very small batch over a short window of time (down to 1 second)
  • Inverted Index example in Spark:
  • Spark code looks just like Scala collections code with a few additions for big-data conventions like reduceByKey, groupByKey, mapValues (a hedged sketch follows at the end of this list)
  • Shows how great the Spark API is to work with
  • This could have been something that Clojure could have taken over
  • but Clojure didn't really go after the community like Scala did
  • and Scala is an easier transition for Java developers
  • The point is not that Scala is so great, but that Functional Programming is so great
  • Working with Data is Mathematics
  • Numbers come in, you work on them and numbers go out
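
A hedged sketch (not the speaker's slide code) of the inverted-index example in Spark: from (path, text) pairs to word -> list of (path, count).

import org.apache.spark.{SparkConf, SparkContext}

object InvertedIndex {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("inverted-index").setMaster("local[*]"))

    val index = sc.wholeTextFiles(args(0))                         // RDD of (path, contents)
      .flatMap { case (path, text) =>
        text.split("""\W+""").map(word => ((word.toLowerCase, path), 1))
      }
      .reduceByKey(_ + _)                                          // count per (word, path)
      .map { case ((word, path), count) => (word, (path, count)) }
      .groupByKey()                                                // word -> Iterable of (path, count)

    index.take(10).foreach(println)
    sc.stop()
  }
}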

Mathematics libraries in Scala

  • Algebird
  • Addition
  • Associativity explained, important when you are adding billions of numbers
  • Identity explained
  • Generalize addition: Monoid
  • A set of elements, an associative operation, and an identity element (see the sketch after this list)
  • Other monoids
  • Top K
  • Average
  • Max/Min
  • These break down at Twitter scale
  • But approximations are ok
  • Algebird lets you tune performance vs accuracy
  • Monoid approximations:
  • hyperloglog for cardinality (how many items in set)
  • minhash for set similarity
  • bloom filter for set membership
  • ... and more
  • "Hash, don't Sample" - mantra at twitter
  • Hash uses all of the data, unlike sampling
  • Spire
  • Numeric library
  • Some overlap with Algebird
  • SQL is also functional
  • also has functional combinators, order by and group by
  • Unification
  • Want something that combines: Spark Core + Spark SQL + Spark Streaming
  • Example collecting stats on airline flights, taken from a Typesafe training course
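
A hedged sketch of the Monoid idea above (Algebird's Monoid has essentially this shape, but this is not its actual definition):

trait Monoid[T] {
  def zero: T                    // identity element
  def plus(a: T, b: T): T        // associative operation
}

object IntAddition extends Monoid[Int] {
  val zero = 0
  def plus(a: Int, b: Int): Int = a + b
}

object MonoidDemo extends App {
  // Associativity is what lets billions of values be summed in any grouping,
  // e.g. partially on each node and then merged.
  def sum[T](xs: Seq[T], m: Monoid[T]): T = xs.foldLeft(m.zero)(m.plus)

  println(sum(1 to 1000, IntAddition))   // 500500
}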

Conclusions

  • Scala has won in big data

Akka HTTP: The Reactive Web Toolkit

Roland Kuhn, Akka Tech Lead at Typesafe, @rolandkuhn

  • Akka Http is the bridge between actors and an http client.

  • It's the port of the Spray project, which came out of Akka in the first place

  • It's "Spray 2.0" but with quotes because they changed the actor based model to a stream based model.

  • Live demo

  • In addition to an actor system you need an ActorFlowMaterializer()

  • Source is a blueprint for what shall happen (for a file, list, http connection or whatever it's a source of) but doesn't actually run it. It's a source of an event stream.

  • Sink is the end of an event stream.

  • When you call .run() on a stream with a source and a sink it turns them into Actors

  • ActorFlowMaterializer() is what turns a stream with a source and sink into actors; presumably there are other materializers you can turn streams into (see the sketch after the demo notes).

  • Sources can be composed from other sources with an implicit builder function.

  • He makes a composite source by zipping together two other sources

  • It looks like all the actors are under the covers with the ActorFlowMaterializer; you don't actually write any.

  • He had been using Thread.sleep to make it artificially slow. Now he changes map to mapAsync and uses the after pattern in Akka. These are executed in parallel and show up in order, but it buffers in groups of 4.

  • In order to asynchronously have a stream complete one future per second he creates a Flow with OperationAttributes.inputBuffer set to 1 and inserts that flow into the previous stream with a .via(...) stage

  • We can also make Http connections.

  • Create an StreamTCP() and bind it to an address and you get a source of incoming connections

  • To run this source we add a new stage .to(...) and put a Sink in there. In the Sink we join the connection's flow to a ByteString flow and run it with .run()

  • Then he creates an outgoingConnection from the address created by the StreamTCP

  • Then he binds a list to it and when it runs the program sends the string to itself via an http connection and prints out the list as a bytestring

  • End Demo
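
A minimal sketch of the Source/Sink/materializer relationship from the demo, using the experimental pre-1.0 API names from the talk (ActorFlowMaterializer was later renamed ActorMaterializer); this is not the speaker's actual demo code.

import akka.actor.ActorSystem
import akka.stream.ActorFlowMaterializer
import akka.stream.scaladsl.{Sink, Source}

object StreamDemo extends App {
  implicit val system = ActorSystem("demo")
  implicit val materializer = ActorFlowMaterializer()   // turns the blueprints into running actors

  val source = Source(1 to 10)               // blueprint: a source of an event stream
  val sink   = Sink.foreach[Int](println)    // blueprint: the end of the stream

  source.runWith(sink)                       // materializes the graph and runs it
}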

  • API Design

  • goals: no magic, compositionality

  • Why do we add an http module? Akka is about building distributed applications

  • distribution implies integration

  • between internal sub-systems => actors (akka remote)

  • loosely coupled systems => http is the lingua franca

  • Stream pipelines

  • SSL stage is almost but not quite ready

  • HTTP Live demo

  • Defines a route with path directives, looks just like Spray routing

  • Bind that route to an Http() connection. Looks just like Spray-can

  • Adds an upload path to the route and gets the entity as a Source and sends it to a previously defined Flow.

Q: Release schedule? A: RC1 hopefully in 4 weeks

Q: Will HTTP2 be a problem? A: We have a proposal for how to handle some of the new streaming

Q: Will Akka remoting be based on streams? A: Yes, it will simplify the remoting layer a lot
