So the Hacker News post said my comment was too long. Warning, long, opinionated post:
Scala dev for 10+ years here. Spark is weird. Databricks has a style guide that deliberately chooses not to use scala features that the rest of the community uses, and doesn't follow the same best practices around library and scala major version usage that the rest of the community uses [1]. It's no surprise that the project has trouble interoperating with libraries outside of the spark ecosystem, and is therefore a maintenance problem.
Spark's style and compatibility problems
Scala isn't a maintenance nightmare, but it does attract a lot of newcomers who dive in, don't stick within one of its many ecosystems, get confused, and generally leave a mess. That is a direct result of scala being a multi-paradigm language that is relatively expressive compared to the one(s) it competes with and pulls developers from. Those developers, for the most part, don't really want to change, and think that Scala is just a more featured version of the language they came from rather than an entirely different language with smooth one-way interoperability with those languages. This post is an example of what happens when someone gets over the hump of hype and enters the trough of disillusionment.
Scala major versions are not binary compatible with each other; libraries compiled within the same major version are. When a scala version is released, development on the next scala version begins immediately. Most major scala libraries cross-compile against the latest scala version and the previous scala version until the next scala version reaches its Milestone 1 release. At that point, the libraries cross-compile against the current scala version and the next scala version. Projects typically announce when they are ending support for a given version. That tends to mean that, if you choose as a community not to follow that trend and stay on a scala version that is over seven years old, you are going to have issues interoperating with the newest versions of libraries. Spark chose not to upgrade to the previous scala version, 2.12, which is six years old at this point, until last year. Many platforms as a service haven't upgraded their Spark support to a version that supports scala 2.12 yet. That means that you are stuck only using libraries within the Spark ecosystem. That's a deliberate choice by the Spark community. It's a choice that needs to change, but is unlikely to, IMHO.
This happens in binary compatible languages as well. Not all new versions of a library are going to be source compatible, and semantic versioning is used as a convention to guarantee source level compatibility in javascript, for example.
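To make the cross-compilation convention concrete, here is a minimal sketch of how a library build might declare it in sbt (the build tool covered in the next section). The version numbers and the cats-core dependency are assumptions picked for illustration, not taken from any particular project.

```scala
// build.sbt (sketch; all version numbers are illustrative assumptions)
ThisBuild / scalaVersion       := "2.13.12"
ThisBuild / crossScalaVersions := Seq("2.12.18", "2.13.12")

// %% appends the scala binary version to the artifact name, so this
// resolves cats-core_2.13 or cats-core_2.12 depending on the active version.
libraryDependencies += "org.typelevel" %% "cats-core" % "2.10.0"
```

Prefixing a task with `+`, as in `sbt +test`, runs it once for each entry in `crossScalaVersions`. A dependency that was never published for one of those versions fails to resolve, which is the interoperability problem described above.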
Sbt
Scala has a build tool. It is sbt. The vast majority of all scala projects use it [2]. It has design issues [3], and Mill [4] has addressed these but hasn't gained popularity among scala projects. Maven and Gradle don't really grok scala dependencies natively, don't understand binary incompatibility because of that, and so aren't really a viable option for library authors, whose projects make up the vast majority of readable open-source scala examples on the web. So as a newcomer to scala, you are going to have to learn sbt, just like you had to learn maven or gradle. The good news is that the amount of sbt that you will use to manage 80% of scala projects amounts to a manual that is 35 pages, much of which is double-spaced code examples. The rest is very well documented, and is searchable. Getting started with sbt now is so much easier than in the old days:
- install sdkman [5]
- run `sdk install sbt`
- run `sbt new scala/scala-seed.g8`
- `cd` to the produced scala project
- run `sbt`
Sbt configuration is task and setting based. You express dependencies among tasks and settings by giving the setting/task name and appending `.value` to it within the definition of the setting/task. All available settings can be listed with `settings -v`. All available tasks can be listed with `tasks -v`. All available configurations can be listed with `ivyConfigurations`. Of course all of these instructions can be listed with `help`. And individual command help can be listed with `help <commandName>`. Also, all commands are tab-completable. It really is a nice tool, but most people just google 'how do i do x' and end up in a sea of stack overflow comments, many of which are for outdated versions. It is a difficult tool for the maintainers to maintain because of its internal design, but for project builders, not having to publish a plugin to do a custom interactive task and being able to write the build in the language of your code is really nice. Additionally, with its recursive structure, it's possible to add your own abstractions to the build by putting `.scala` files in the `project` directory of the current build level and importing them into the build.sbt of that level. So you can end up with builds that are very clean without resorting to plugins.
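As a rough sketch of the `.value` mechanism, here is a custom setting and a custom task wired together in a build.sbt. The `greeting` and `greet` keys are hypothetical names made up for this example; `name` and `streams` are keys that sbt itself provides.

```scala
// build.sbt (sketch; greeting and greet are hypothetical keys)
lazy val greeting = settingKey[String]("Text printed before the project name")
lazy val greet    = taskKey[Unit]("Logs the greeting followed by the project name")

greeting := "Hello from"

greet := {
  // Each .value call declares a dependency on that setting or task;
  // sbt resolves the dependencies before the task body runs.
  val log = streams.value.log
  log.info(s"${greeting.value} ${name.value}")
}
```

Typing `greet` in the sbt shell runs the task, and it tab-completes like any built-in command.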
Of course, within an organization/enterprise, library dependencies and settings will be common across projects, so creating a plugin to share these settings is pretty common. Additionally, there are some really great compiler and build plugins that do common things across the scala ecosystem, such as code linting (wartremover), assembly and shading (sbt-assembly), packaging (sbt-native-packager), better errors (better errors), and implicit dependency chain compiler display (tek:splain), that pretty much every project needs if you don't want to do it from scratch. The ones above are the best of breed, though tek:splain tends to eat memory, so I enable it only when something doesn't derive (an implicit injection is inexplicably unavailable).
Sbt's code generation capabilities are top-notch, btw, so if you'd rather avoid macros and learning scala-meta (definitely an expert level code generation tool that works at the source level rather than the project level), you can bang out common things with a handlebars template, a custom task definition, some custom settings for the replacements in your template, and by adding the task to `<Compile|Test|IntegrationTest> / sourceGenerators` or `<Compile|Test|IntegrationTest> / resourceGenerators`. Your repetitive source code (Json codecs, elasticsearch queries, table migration statements, swagger api endpoints, anyone?) can then be generated for you and is available for packaging and compilation immediately.
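A minimal sketch of a source generator follows. It swaps the handlebars template for a plain interpolated string to stay dependency-free, and the `generated` package and `BuildInfo` object are hypothetical names chosen for the example.

```scala
// build.sbt (sketch; the generated package and BuildInfo object are hypothetical)
Compile / sourceGenerators += Def.task {
  // Generated sources belong under sourceManaged so sbt compiles and cleans them.
  val out = (Compile / sourceManaged).value / "generated" / "BuildInfo.scala"
  IO.write(
    out,
    s"""package generated
       |
       |object BuildInfo {
       |  val name: String    = "${name.value}"
       |  val version: String = "${version.value}"
       |}
       |""".stripMargin
  )
  Seq(out) // a generator returns the files it produced
}.taskValue
```

After the next `compile`, `generated.BuildInfo` can be referenced from the project's own code like any hand-written object.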
The tool isn't hard to use or understand. Nobody that starts out reads the manual until they're neck-deep in doing something that is actively fighting the way the tool works, and by that time they are already in trouble. Maintaining the internals of sbt is an entirely different issue, and it would be nice if we could keep the sbt interactive interface and familiar scope / command / task hierarchy implemented with classes moving forward, but I really think that Mill is where the community will eventually migrate. It is much easier to understand from a maintenance standpoint. However, the choices made by Li Haoyi in moving away from target directories, and the lack of plugins for common tasks at the moment, preclude switching.
Testing Libraries
Testing frameworks are kind of like json libraries for scala. Scala does DSLs really well, so testing libraries are one of those things that scala library authors love to write. At the moment, I'd recommend munit. It's very similar to junit, and works with mockito-scala, though I'm pretty sure you could integrate ScalaMock easily as well (which is a superior, though opinionated, mocking framework). I haven't tried; mocks are conveniences, and with scala's `???` operator they can even be skipped entirely by relying on the metals ide to generate unimplemented interfaces. Anyway, use scalatest or munit, and stick to one style.
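Here is a small sketch of the `???` approach to stubbing, using only the standard library. The `ScoreStore` trait and its stub are hypothetical; the point is that a hand-written stub implements only what the test touches and leaves the rest as `???`.

```scala
// Hypothetical interface that the code under test depends on.
trait ScoreStore {
  def load(bowler: String): Option[Int]
  def save(bowler: String, score: Int): Unit
}

// Hand-written stub: only load is implemented. Calling save would throw
// scala.NotImplementedError, which flags an unexpected call in a test.
object StubStore extends ScoreStore {
  def load(bowler: String): Option[Int] =
    if (bowler == "alice") Some(200) else None

  def save(bowler: String, score: Int): Unit = ???
}

object ScoreReport {
  // Code under test: it only ever reads from the store.
  def describe(store: ScoreStore, bowler: String): String =
    store.load(bowler).fold(s"$bowler: no score")(s => s"$bowler: $s")
}
```

`ScoreReport.describe(StubStore, "alice")` returns `"alice: 200"` without a mocking library being involved.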
Speaking of style -- dance with the ecosystem that brought you
Scala has six main library/coding style ecosystems. When you are in one ecosystem, avoid bringing in libraries from another or your project will be incohesive.
- Lightbend
The lightbend ecosystem is based around playframework and akka. It is written in the same style as the scala standard library, heavily uses dsls and macros, and is OOP in the large and functional in the small. The main libraries here are playframework/[6], lightbend/config, and akka-[7]. When you are working in this style, you avoid higher-kinded types and shapeless derivation, have to memorize many dsls, and don't worry as much about referential transparency. It's a heavily framework-based world, with many of the frameworks implemented as sbt plugins. Each library kind of has its own universe of discourse, but they tend to work together really well. The issue comes when trying to rely upon third-party libraries. When a major version of a backing framework changes (Akka/play), you can be left hanging, as niche framework add-ons can be slow to upgrade, and since this ecosystem is very much framework based rather than library based, you have to wait to upgrade. It is very much like using spring-boot in the java world.
This is the scala as written in the Scala language overview book, and the scala you will probably be first exposed to as a newcomer. It uses all the features, and java interop tends to be handled by wrapping java apis in a similar interface but with a `Try` or `Future` or `Akka Actor` wrapper around the return to avoid thrown exceptions leaking into your Scala source code. You are going to be reading documentation for a long time, as the base libraries here are pretty big and, although well-written, all have differing domains that have to be learned pretty much in full to limit the boilerplate and repetition that you'd get in a java codebase. It does use implicit injection, but primarily at the method level rather than at an object level, and through the use of imports. You can be very effective here, as long as you don't mix in a library written in another style. Often described as 'better java'. Stick to libraries from the akka, lightbend, and play organizations, or you might end up in dependency hell someday.
- Typelevel[8]
This is the functional/type level programming community. It's where a lot of scala's old-guard lives. It is simpler than the Lightbend style, as everything must be referentially transparent and lots of things depend on shapeless/code derivation using implicits. It is functional in the large, functional in the small, and eschews dsls for the most part. Everything is an object at some level, and uses typeclasses and effect types like `Task` and `IO` to wrap side-effecting or error code. Libraries here include Doobie, shapeless, http4s, circe, cats, cats-effect, Monix, fs2, refined, and scalacheck.
There are about 22 methods that you need to know to use this ecosystem: map, flatMap, delay, sequence, mapN, fold, foldLeft, foldRight, eval, evalMap, catchNonFatal, handleErrorWith, parSeq, and show are the most important, and are all defined within the cats, cats-effect, and fs2 libraries. The basic effect type is IO, though most code is written to one of its implementing effect interfaces: Sync, Async, Effect, or ConcurrentEffect (IO implements all the typeclasses). IO is referentially transparent, unlike the standard library future, but otherwise is nearly identical. fs2 is a pull based streaming library that provides concurrent queues and streams for handling streaming communication. Monix is its push-based counterpart for in-memory event-driven effects. You can pretty much build anything combining cats, fs2/monix, and cats-effect in an entirely referentially transparent way. The ecosystem provides an IOApp wrapper that will wrap your entire application in IO and provide the necessary implicits so that you can write your code without having to worry about what exact effect type you are using, for the most part.
Typelevel scala is pretty much done using the tagless final style [9].
The nice thing about this ecosystem and writing apps in it is that it uses a safe subset of scala features, has a consistent and flexible architecture, uses a small number of patterns (map/flatMap/fold/pattern matching), and uses the scala compiler features to get dependency injection without adding an external dependency. It also often uses shapeless for derivation, meaning you get auto-generated json codecs or test examples. It uses scalacheck for test data generation, meaning you test many different examples in your tests, and it uses discipline for transformation laws, which ensure that if you implement a typeclass instance not provided to you by cats for your special data type, it will still behave within the typeclass interface behavior, so you don't define things that will break. This style is slowly being replaced/augmented/competed with by another ecosystem, ZIO.
- ZIO[10]
ZIO builds on several libraries from the Typelevel ecosystem, but puts them into an easier to understand context[11] and provides a common runtime context to evaluate things inside its type aliases. Rather than the complicated context-bound notation of `tagless final`, ZIO provides the same benefits by adding a few type aliases that require less understanding of the scala type system than tagless final. Though new, this ecosystem is very high quality, and will just work, much like the typelevel ecosystem libraries. Typically, libraries inside the ecosystem have zio in the artifact name. Think of it as a less-magic and user-friendly but just as high-quality counterpart to the typelevel ecosystem. Quite often, ZIO libraries integrate or wrap typelevel libraries under the hood. While the libraries may interop with those from the typelevel ecosystem, your application code and libraries should stick with the ZIO conventions and avoid directly using those libraries, in order to maintain a consistent style. Since the ecosystem is newer, and many contributors also contribute to typelevel, you may have to implement something yourself, but with the guardrails of ZIO core available to you it will likely be simpler and more consistent and require less understanding of the primitives of fp than the raw typelevel ecosystem. You'll still end up using shapeless a lot for things that can be derived.
- Spark
Stay within spark libraries and java or get burned. Slow to update, and geared towards spark coding only. Don't mix in libraries that aren't written in the spark style, or libraries from another ecosystem that you don't maintain, or you might end up in dependency hell.
- Twitter/Finagle
High quality libraries written for everything in the Twitter School of Scala style. Pretty similar to the Lightbend ecosystem, with which it competes. Again, don't mix and match with Typelevel, ZIO, or Spark, or you might get burned. Lightbend scala tends to work pretty well with Finagle stuff, though.
- Scalaz style
A predecessor to cats, scalaz is pretty similar, but not as well-documented. It tends to kind of run in its own circles, and isn't really used that much anymore, due to inter-community infighting. The libraries tend to be really well-written, and close to their haskell counterparts, but it's kind of fallen out of popular usage.
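The Typelevel entry above says IO is referentially transparent while the standard library future is not. The sketch below illustrates the difference using only the standard library. `ToyIO` is a hypothetical stand-in written for this example, not the cats-effect `IO` API; it only models the "description of work, run on demand" idea.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object RefTransparency {
  // Toy effect type (assumption for illustration): nothing runs until unsafeRun().
  final case class ToyIO[A](unsafeRun: () => A) {
    def map[B](f: A => B): ToyIO[B] = ToyIO(() => f(unsafeRun()))
    def flatMap[B](f: A => ToyIO[B]): ToyIO[B] =
      ToyIO(() => f(unsafeRun()).unsafeRun())
  }

  // A Future starts running when it is created, so reusing the value
  // reuses one already-started computation: the side effect happens once.
  def futureCount(): Int = {
    val counter = new AtomicInteger(0)
    val fa      = Future(counter.incrementAndGet())
    val both    = for { _ <- fa; _ <- fa } yield ()
    Await.result(both, 5.seconds)
    counter.get() // 1
  }

  // The lazy value is only a description, so using it twice runs the
  // effect twice, exactly as if the expression had been written out twice.
  def ioCount(): Int = {
    val counter = new AtomicInteger(0)
    val ioa     = ToyIO(() => counter.incrementAndGet())
    val both    = for { _ <- ioa; _ <- ioa } yield ()
    both.unsafeRun()
    counter.get() // 2
  }
}
```

With the lazy type, replacing `ioa` by its definition never changes what the program does, which is what makes refactoring in this style safe. With `Future`, inlining `fa` at both use sites would change the count from 1 to 2.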
What should your company use
For the most part, I recommend you use ZIO or Typelevel, and build small simple projects that asynchronously communicate via a queuing system fronted by a simple REST api powered by http4s, with front ends built out of scalajs-react. You can build almost anything out of this architecture, and it can all live in one monorepo without any organizational sbt plugins. It uses the features of scala that are likely never going to be ported to Java, and makes for much safer programs with the least amount of typing overhead, and the consistency of the interfaces and dependency injection style means that you can get a long way without having to memorize several DSLs. Of course there are more features available in the ecosystem, but you'll use them pretty sparingly outside of sequence and mapN if you are doing it correctly.
IF YOU ARE WORKING IN BATCH DATA ANALYTICS/ML, USE SPARK, and especially use SparkSQL if you can. For that use case you can't beat the support provided by many platform as a service providers. Don't mix the ecosystem streams, and you'll be fine.
For streaming data analytics, use fs2 with typelevel/ZIO style if you can't afford to drop data events.
For streaming data analytics where you can drop data, use Monix with typelevel/ZIO style, since it is more efficient.
For applications with a big GUI/FrontEnd in ScalajsReact and a simple db storage system with lots of templating, use Lightbend + Play + scalajs-react. Even though I don't use it myself (I use scalatags and http4s), having the framework instead of a library really helps to keep large codebases consistent. Don't do this for REST apis--simple http4s is much leaner and easier to reason about than the playframework guice injection and templating stuff. Yeah, you might have to wait for upgrades to use the new hotness, but you'll be plenty productive and there are options to purchase support from lightbend via a license, which comes in really handy on complex applications like the ones you need a full-stack solution for.
Eventually your senior devs are going to want to use http4s+zio/typelevel, because of the guarantees they provide, but your java devs will feel at home in play, and hopefully by that time your codebase will be large enough that rewrites will be prohibitively expensive.
Your front-end devs aren't going to care because they'll be using scalajs-react anyway, and that will all be in the same style.
With all of these ecosystems, you want to leverage as much extant Java library code as possible by wrapping it in scala interfaces that remove the java nastiness: catch errors in Trys/Effects/Eithers, allow no nulls, and declare when something is going to do something asynchronous with a Future/IO/ZIO. Use the refined scala library and your own case classes rather than the java library internals to make BowlingScoreStrings instead of using raw scala primitives or smart constructors, so that you don't put a BowlerName where a BowlerScore should go, and to avoid boxing when unnecessary. Write typeclass converters to and from the java lib's domain, so you can swap it out with a different implementation later if you want, etc.
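A minimal sketch of that wrapping, using only the standard library. It substitutes plain case classes and an `Either`-returning parser for the refined library, and the 0 to 300 score range is an assumption made for the example. `BowlerName` and `BowlerScore` come from the paragraph above; `envVar`, `parseScore`, and `record` are hypothetical helper names.

```scala
import scala.util.Try

object BowlingDomain {
  // Distinct types so a name cannot be passed where a score is expected.
  final case class BowlerName(value: String)
  final case class BowlerScore(value: Int)

  // Java API that returns null when the key is missing: wrap it in Option.
  def envVar(key: String): Option[String] = Option(System.getenv(key))

  // Java API that throws on bad input: catch it and return an Either.
  def parseScore(raw: String): Either[String, BowlerScore] =
    Try(Integer.parseInt(raw.trim)).toEither
      .left.map(e => s"not a number: ${e.getMessage}")
      .filterOrElse(n => n >= 0 && n <= 300, s"score out of range: $raw")
      .map(BowlerScore(_))

  // Swapping the arguments here is a compile error rather than a bug.
  def record(name: BowlerName, score: BowlerScore): String =
    s"${name.value}: ${score.value}"
}
```

Callers then handle `Left` and `None` explicitly instead of catching exceptions or checking for null, and the java types never leak past this object.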
1: https://github.com/databricks/scala-style-guide/blob/master/README.md
2: https://scala-sbt.org
3: https://lihaoyi.com/post/SowhatswrongwithSBT.html
4: https://com-lihaoyi.github.io/mill/
5: https://sdkman.io
6: https://www.playframework.com/
7: https://akka.io/
8: https://typelevel.org/
9: https://gist.github.com/jackcviers/079e67828318c548d9c6112147bce2ba
10: https://zio.dev/
11: https://zio.dev/docs/overview/overview_index