- If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8 compared with 8-byte integers.
- Partition DataFrames into evenly distributed partitions of ~128MB each (an empirical finding). When in doubt, err on the side of more partitions.
- Pay particular attention to the number of partitions when using `flatMap`, especially if the following operation will result in high memory usage. The `flatMap` op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of `flatMap` to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
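
The partition-sizing advice above can be sketched as simple arithmetic. This is a minimal illustration rather than a definitive recipe: the helper `suggest_num_partitions`, the row count, and the per-row byte estimates are all hypothetical, and the only number taken from the notes is the ~128MB target.

```python
import math

# ~128MB target partition size, taken from the notes above.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggest_num_partitions(num_rows, est_bytes_per_row,
                           target_bytes=TARGET_PARTITION_BYTES):
    """Hypothetical helper: partition count that keeps each partition
    at or below target_bytes, assuming rows are evenly distributed.

    Rounds up, following the advice to err on the higher side.
    """
    if num_rows < 0 or est_bytes_per_row < 0:
        raise ValueError("row count and row size must be non-negative")
    total_bytes = num_rows * est_bytes_per_row
    # Always return at least one partition, even for empty data.
    return max(1, math.ceil(total_bytes / target_bytes))


# Assumed numbers for illustration only: flatMap yields 50 million rows.
rows_after_flatmap = 50_000_000

# While each row is a single 8-byte index, few partitions are needed.
partitions_for_indices = suggest_num_partitions(rows_after_flatmap, 8)

# If the next op turns each index into a dense vector of 1,000 doubles
# (~8,000 bytes per row), the same rows need far more partitions.
partitions_for_vectors = suggest_num_partitions(rows_after_flatmap, 8 * 1_000)

print(partitions_for_indices, partitions_for_vectors)
```

In PySpark, a count like this would typically be passed to `repartition` on the output of `flatMap`, before the memory-heavy operation runs. The per-row size is an estimate, so rounding up further is in keeping with the notes.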
{"lastUpload":"2020-08-02T11:29:35.550Z","extensionVersion":"v3.4.3"}
# Source: https://hub.docker.com/r/jupyter/pyspark-notebook
# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.
ARG BASE_CONTAINER=jupyter/scipy-notebook
FROM $BASE_CONTAINER
LABEL maintainer="Jupyter Project <[email protected]>"
USER root
/* Main section */
#notebook-container {
  box-shadow: none !important; /* Remove box shadows */
  max-width: 1000px;
}
.container {
  width: 80% !important;
}
## Read Palmer Station Penguin dataset from GitHub
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/"
                 "palmerpenguins/47a3476d2147080e7ceccef4cf70105c808f2cbf/"
                 "data-raw/penguins_raw.csv")
# Increase dataset to 1m rows and reset index
df = df.sample(1_000_000, replace=True).reset_index(drop=True)
# Update sample number (0 to 999'999)
# pyarrow provides the `pa` namespace used below; `df` is the DataFrame built earlier
import pyarrow as pa
# Write to csv
df.to_csv("penguin-dataset.csv")
# Write to parquet
df.to_parquet("penguin-dataset.parquet")
# Write to Arrow
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Write out to file
# Read csv and calculate mean
%%timeit
pd.read_csv("penguin-dataset.csv")["Flipper Length (mm)"].mean()
# Read parquet and calculate mean
%%timeit
pd.read_parquet("penguin-dataset.parquet", columns=["Flipper Length (mm)"]).mean()
import os
import pandas as pd
import psutil
# Measure initial memory consumption (RSS in MiB; `>> 20` divides bytes by 2**20)
memory_init = psutil.Process(os.getpid()).memory_info().rss >> 20
# Read csv
col_csv = pd.read_csv("penguin-dataset.csv")["Flipper Length (mm)"]
memory_post_csv = psutil.Process(os.getpid()).memory_info().rss >> 20
# Read parquet
{
  "info": {
    "_postman_id": "c18ab42d-2677-4ede-b043-99535f4da9f6",
    "name": "Dog Image API",
    "schema": "https://schema.getpostman.com/json/collection/v2.1.0/collection.json"
  },
  "item": [
    {
      "name": "Dog API - Loop through breeds",
      "event": [