Joachim Bargsten jwbargsten

# https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html
[tool.pytest.ini_options]
pythonpath = [ "src", "tests" ]
norecursedirs = [
"tests/testkit"
]
[tool.black]
line-length=111
# ./.flake8
[flake8]
max-line-length=111
exclude=src/dapple/_version.py
ignore=E231,W503,E203,E265,D103,D100,D101,D102,D104,D105,D107,D401,D400,D205
jwbargsten / spark_tips_and_tricks.md
Created January 10, 2025 07:36 — forked from dusenberrymw/spark_tips_and_tricks.md
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by up to a factor of 8 (relative to storing them as 8-byte integers).
  • Partition DataFrames into evenly distributed partitions of ~128MB each (empirical finding). When in doubt, err on the side of more partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually produces a DataFrame with a [much] larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the