Dejan Simic (simicd)
simicd / cloudSettings
Last active August 2, 2020 11:29
Visual Studio Code Settings Sync Gist
{"lastUpload":"2020-08-02T11:29:35.550Z","extensionVersion":"v3.4.3"}
simicd / Dockerfile
Last active November 17, 2019 20:34
PySpark dockerfile
# Source: https://hub.docker.com/r/jupyter/pyspark-notebook
# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.
ARG BASE_CONTAINER=jupyter/scipy-notebook
FROM $BASE_CONTAINER
LABEL maintainer="Jupyter Project <[email protected]>"
USER root
simicd / custom.css
Last active December 29, 2019 15:19
Jupyter Notebook styling sheet
/* Main section */
#notebook-container{
box-shadow: none !important; /* Remove box shadows */
max-width: 1000px;
}
.container {
width: 80% !important;
}
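A short usage note (not stated in the gist itself): the classic Jupyter Notebook picks up an override stylesheet like this from ~/.jupyter/custom/custom.css, so the rules above take effect after copying the file there and reloading the notebook page.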
simicd / spark_tips_and_tricks.md
Created February 14, 2020 20:58 — forked from dusenberrymw/spark_tips_and_tricks.md
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8.
  • Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. the number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. A flatMap op usually produces a DataFrame with a [much] larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the expected growth in data size; a sketch follows this list.
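
A minimal PySpark sketch of the repartitioning tip above. The data, the explode-based row expansion (standing in for a flatMap-style op), and the target partition count are illustrative assumptions, not part of the forked gist:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("repartition-after-expansion").getOrCreate()

# Narrow DataFrame of indices; explode() plays the role of a flatMap here:
# it multiplies the row count ~100x but leaves the partition count unchanged
df = spark.range(1_000_000)
expanded = df.select("id", F.explode(F.sequence(F.lit(1), F.lit(100))).alias("copy"))
print(expanded.rdd.getNumPartitions())  # unchanged despite the row expansion

# Repartition explicitly before any memory-hungry follow-up op, aiming for
# evenly distributed partitions of roughly 128MB each
expanded = expanded.repartition(400)  # 400 is an illustrative target, not a rule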
## Read Palmer Station Penguin dataset from GitHub
import os

import pandas as pd
import psutil
import pyarrow as pa

df = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/"
                 "palmerpenguins/47a3476d2147080e7ceccef4cf70105c808f2cbf/"
                 "data-raw/penguins_raw.csv")

# Increase dataset to 1m rows and reset index (sample numbers 0 to 999'999)
df = df.sample(1_000_000, replace=True).reset_index(drop=True)

# Write to csv
df.to_csv("penguin-dataset.csv")

# Write to parquet
df.to_parquet("penguin-dataset.parquet")

# Write to Arrow: convert from pandas to Arrow ...
table = pa.Table.from_pandas(df)
# ... and write out to file
with pa.OSFile("penguin-dataset.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read csv and calculate mean
%%timeit
pd.read_csv("penguin-dataset.csv")["Flipper Length (mm)"].mean()

# Read parquet and calculate mean
%%timeit
pd.read_parquet("penguin-dataset.parquet", columns=["Flipper Length (mm)"]).mean()

# Measure initial memory consumption (resident set size, in MB)
memory_init = psutil.Process(os.getpid()).memory_info().rss >> 20

# Read csv and measure memory consumption afterwards
col_csv = pd.read_csv("penguin-dataset.csv")["Flipper Length (mm)"]
memory_post_csv = psutil.Process(os.getpid()).memory_info().rss >> 20

# Read parquet and measure memory consumption afterwards
col_parquet = pd.read_parquet("penguin-dataset.parquet", columns=["Flipper Length (mm)"])
memory_post_parquet = psutil.Process(os.getpid()).memory_info().rss >> 20
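
The rendered gist breaks off at this point, before the Arrow file written above is read back. A minimal sketch of that missing step, assuming the file and column names used earlier and pyarrow's memory-mapping API:

# Read Arrow with memory-mapping and measure memory consumption afterwards;
# the memory map avoids copying data into the process heap, so rss barely grows
with pa.memory_map("penguin-dataset.arrow") as source:
    col_arrow = pa.ipc.open_file(source).read_all().column("Flipper Length (mm)")
memory_post_arrow = psutil.Process(os.getpid()).memory_info().rss >> 20

# Convert to pandas only when the value is actually needed
mean_arrow = col_arrow.to_pandas().mean()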
simicd / Dog Image API.postman_collection.json
Last active August 14, 2020 10:59
Postman array iteration
{
  "info": {
    "_postman_id": "c18ab42d-2677-4ede-b043-99535f4da9f6",
    "name": "Dog Image API",
    "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
  },
  "item": [
    {
      "name": "Dog API - Loop through breeds",
      "event": [