
msummersgill / SparkRError.R
Last active April 7, 2021 21:59
SparkR Error Collecting Large SparkDataFrame
## arrow speeds up serialization when collecting SparkDataFrames back into R
library(arrow)
## Open source Apache Spark downloaded from this archive:
## https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
library(SparkR, lib.loc = "~/DatabricksTesting/spark-3.0.1-bin-hadoop2.7/R/lib/")
## $java -version
## openjdk version "1.8.0_212"
## OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03)
## OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
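As a sketch only (not from the original gist), the session setup below shows how Arrow-backed collection can be enabled in SparkR 3.0.x before calling `collect()` on a large SparkDataFrame. The `local[*]` master and the use of the built-in `faithful` data set are illustrative assumptions; `spark.sql.execution.arrow.sparkr.enabled` is the Spark 3.x configuration key for SparkR Arrow optimization.

```r
## Configuration sketch: enable Arrow-accelerated collect() in SparkR 3.0.x.
## The local master and the tiny faithful data set are illustrative
## assumptions standing in for a real cluster and a large table.
sparkR.session(master = "local[*]",
               sparkConfig = list("spark.sql.execution.arrow.sparkr.enabled" = "true"))

df <- as.DataFrame(faithful)   # stand-in for a large SparkDataFrame
local_df <- collect(df)        # transferred back to an R data.frame, via Arrow when enabled

sparkR.session.stop()
```

With the flag unset, `collect()` falls back to SparkR's row-by-row deserialization, which is the slow path the gist title alludes to for large results.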
dusenberrymw / spark_tips_and_tricks.md
Last active June 28, 2024 12:37
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically use 1-byte unsigned integers for that column, decreasing its stored size by a factor of 8 relative to 8-byte integers.
  • Partition DataFrames into evenly distributed partitions of ~128 MB each (an empirical finding). Always err on the higher side w.r.t. the number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. flatMap usually produces a DataFrame with a [much] larger number of rows, yet the number of partitions stays the same. If a subsequent operation then causes a large expansion in memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In that case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the