Spark Tips & Tricks

Misc. Tips & Tricks

If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, thus decreasing the size of the saved DataFrame by up to a factor of 8 relative to 8-byte (64-bit) integers.
Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
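
As a rough illustration of the partition-sizing arithmetic in the two tips above, here is a minimal sketch in plain Python. It does not call Spark. The helper names (`partitions_for`, `partitions_after_expansion`) and the row counts are hypothetical, and the per-row size ignores serialization and object overhead, so treat the result as a lower bound rather than a tuned value.

```python
import math

# ~128MB per partition, the empirical target mentioned above.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def partitions_for(total_bytes: int, target_bytes: int = TARGET_PARTITION_BYTES) -> int:
    """Return a partition count that keeps each partition at or below target_bytes.

    The division is rounded up, so the estimate errs toward more partitions.
    """
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_bytes))


def partitions_after_expansion(rows: int, bytes_per_row: int) -> int:
    """Estimate the partitions needed once every row has grown to bytes_per_row."""
    return partitions_for(rows * bytes_per_row)


# Hypothetical workload: flatMap produced 10 million rows of indices, and the
# next step turns each row into a vector of 1,000 doubles (8 bytes each).
n = partitions_after_expansion(rows=10_000_000, bytes_per_row=1_000 * 8)
print(n)

# In PySpark, a count like this would be passed to DataFrame.repartition(n)
# on the flatMap output, before the memory-heavy step runs.
```

The point of sizing against the post-expansion row size, rather than the size of the flatMap output itself, is that the partition count chosen before the expensive step is the one that determines peak memory per partition.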

micahgodbolt / wsl_install_node.md

Last active June 19, 2025 19:41

WSL install Node

The apt-get version of node is incredibly old, and installing a new copy is a bit of a runaround.

So here's how you can use NVM to quickly get a fresh copy of Node on your new Bash on Windows install:

$ touch ~/.bashrc
$ curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.35.3/install.sh | bash
# restart bash
$ nvm install --lts

SteefH / 0 - blog.md

Last active May 11, 2019 05:24

Amorphous: Writing a Scala library for boilerplate-free object mapping

At Infi, we started our first Scala project (link in Dutch) in mid-2016. When it became clear that Scala might be one of the technologies used in the project, I jumped at the chance to be part of it, because I'm always eager to learn new tech, and doing a project in a functional programming language was already near the top of my professional wish list.

As always when learning new technology, I like to push the envelope to see where things start to break down. I think that's a nice way to get to know the limits of that technology. As it turns out, Scala is a powerful language, with a strong type system that lets you use many advanced concepts I won't detail here (e.g. type classes, high-level abstractions like the ones in the Typeclassopedia with the help of scalaz or [Cats](https://github.com

non / seeds.md

Last active July 10, 2024 20:34

Simple example of using seeds with ScalaCheck for deterministic property-based testing.

introduction

ScalaCheck 1.14.0 was just released with support for deterministic testing using seeds. Some folks have asked for examples, so I wanted to produce a Gist to help people use this feature.

simple example

These examples will assume the following imports:

gvolpe / di-in-fp.md

Last active September 16, 2024 07:18

Dependency Injection in Functional Programming

There are several DI frameworks and libraries in the Scala ecosystem. But the more functional code you write, the more you'll realize there's no need to use any of them.

The benefits most often claimed for them are the following:

Dependency Injection.
Life cycle management.
Dependency graph rewriting.