Joachim Bargsten jwbargsten

@jwbargsten
jwbargsten / spark_tips_and_tricks.md
Created January 10, 2025 07:36 — forked from dusenberrymw/spark_tips_and_tricks.md
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, thus decreasing the size of the saved DataFrame by a factor of 8 (relative to 8-byte integers).
  • Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions will remain the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
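The partition-sizing advice in the tips above comes down to simple arithmetic, which the following minimal sketch illustrates. The function names, the `expansion_factor` parameter, and the idea of estimating the data size in bytes up front are assumptions made for illustration; they are not part of the original gist or of any Spark API.

```python
import math

# ~128MB per partition, the empirical target quoted in the tips above.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggested_num_partitions(total_bytes: int,
                             target_bytes: int = TARGET_PARTITION_BYTES) -> int:
    """Return a partition count that keeps each partition at or below target_bytes.

    Hypothetical helper: total_bytes is an estimate supplied by the caller.
    """
    if total_bytes <= 0:
        return 1
    # Round up, so that we err on the side of more (smaller) partitions.
    return max(1, math.ceil(total_bytes / target_bytes))


def partitions_after_expansion(input_bytes: int, expansion_factor: float) -> int:
    """Estimate the partition count needed once a flatMap-style op grows the data.

    expansion_factor is an assumed, caller-estimated ratio of output size to input size.
    """
    return suggested_num_partitions(math.ceil(input_bytes * expansion_factor))
```

Under these assumptions, the resulting count could be passed to `DataFrame.repartition(n)` on the output of the flatMap step. The size estimate itself has to come from inspecting the data, so the numbers are only as good as that estimate.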
# ./.flake8
[flake8]
max-line-length=111
exclude=src/dapple/_version.py
ignore=E231,W503,E203,E265,D103,D100,D101,D102,D104,D105,D107,D401,D400,D205
# https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html
[tool.pytest.ini_options]
pythonpath = [ "src", "tests" ]
norecursedirs = [
"tests/testkit"
]
[tool.black]
line-length=111
jwbargsten / Effective Scala Case Class Patterns.md
Created May 26, 2022 09:05 — forked from chaotic3quilibrium/Effective Scala Case Class Patterns.md
Article: Effective Scala Case Class Patterns - The guide I wished I had read years ago when starting my Scala journey

Effective Scala Case Class Patterns

Version: 2022.03.02

Available As

jwbargsten / esm-package.md
Created March 6, 2022 11:10 — forked from sindresorhus/esm-package.md
Pure ESM package

Pure ESM package

The package linked to from here is now pure ESM. It cannot be require()'d from CommonJS.

This means you have the following choices:

  1. Use ESM yourself. (preferred)
    Use import foo from 'foo' instead of const foo = require('foo') to import the package. You also need to put "type": "module" in your package.json and more. Follow the below guide.
  2. If the package is used in an async context, you could use await import(…) from CommonJS instead of require(…).
  3. Stay on the existing version of the package until you can move to ESM.
jwbargsten / deskew.java
Created September 13, 2021 20:49 — forked from witwall/deskew.java
Automatic image deskew in Java (http://anydoby.com/jblog/en/java/1990). Those who have to process scans should know how painful it is to manually deskew images. There are several approaches to do this deskewing automatically. The basis of all the methods is to identify lines following the same direction in an image and then by deviation from horizon…
public double doIt(BufferedImage image) {
    final double skewRadians;
    // Redraw the input as a 1-bit black-and-white image before estimating the skew.
    BufferedImage black = new BufferedImage(image.getWidth(), image.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
    final Graphics2D g = black.createGraphics();
    g.drawImage(image, 0, 0, null);
    g.dispose();
    skewRadians = findSkew(black);
    // 57.2957... is 180/π, so this prints the skew in degrees with the sign flipped.
    System.out.println(-57.295779513082320876798154814105 * skewRadians);
    return skewRadians;
jwbargsten / parseflags.sh
Created April 22, 2021 13:40 — forked from bxparks/parseflags.sh
Simple Bash Shell Command Line Processing Template
#!/bin/bash
#
# Self-contained command line processing in bash that supports the
# minimal, lowest common denominator compatibility of flag parsing.
# -u: referencing an undefined variable is an error
# -e: exit shell on error
set -eu
function usage() {
jwbargsten / setup.cfg
Created May 25, 2020 10:30 — forked from althonos/setup.cfg
A `setup.cfg` template for my Python projects
# https://gist.github.com/althonos/6914b896789d3f2078d1e6237642c35c
[metadata]
name = {name}
version = {version}
author = Martin Larralde
author-email = [email protected]
home-page = https://github.com/althonos/{name}
description = {description}
long-description = file: README.rst, CHANGELOG.rst
jwbargsten / ajping.py
Last active August 29, 2015 14:01
pings a servlet engine with AJP protocol
#!/usr/bin/env python
# source: http://www.joedog.org/pub/AJP/ajping.txt
from struct import unpack
import time
import sys
import socket
acks = set([65, 66, 0, 1, 9])