Pablo San José pablosjv

🏄‍♂️
Data Surfing
View GitHub Profile
@pablosjv
pablosjv / tokens_dataset.py
Created August 27, 2021 11:58
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
from collections import namedtuple

from torch.utils.data import Dataset

Tokens = namedtuple("Tokens", ["input_ids", "attention_mask"])

class TokensDataset(Dataset):
    def __init__(self, iids, amask):
        self.input_ids = iids.to_numpy()
        self.attention_mask = amask.to_numpy()

    # The gist preview truncates here; the remaining methods follow the
    # standard torch Dataset protocol.
    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return Tokens(self.input_ids[idx], self.attention_mask[idx])
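A hedged usage sketch, not from the gist: feeding the dataset to a `DataLoader` for batched iteration. The class is restated here (completed per the standard `Dataset` protocol) so the snippet runs on its own; the sample token values and batch size are illustrative.

```python
from collections import namedtuple

import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset

Tokens = namedtuple("Tokens", ["input_ids", "attention_mask"])

class TokensDataset(Dataset):
    # Minimal restatement of the gist's class, completed with __len__/__getitem__.
    def __init__(self, iids, amask):
        self.input_ids = iids.to_numpy()
        self.attention_mask = amask.to_numpy()

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return Tokens(self.input_ids[idx], self.attention_mask[idx])

# Illustrative tokenized columns; in the real pipeline these come from a tokenizer.
df = pd.DataFrame({
    "input_ids": [[101, 2054, 102], [101, 2003, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
})

dataset = TokensDataset(df["input_ids"], df["attention_mask"])
# collate_fn=list keeps each batch as a plain list of Tokens namedtuples.
loader = DataLoader(dataset, batch_size=2, collate_fn=list)

with torch.no_grad():
    for batch in loader:
        # A model forward pass over the batch would go here.
        assert all(isinstance(t, Tokens) for t in batch)
```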
@pablosjv
pablosjv / spark-submit-example.sh
Last active August 27, 2021 16:10
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
#!/bin/sh
spark-submit \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YOUR_DOCKER_IMAGE} \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YOUR_DOCKER_IMAGE} \
s3://your-bucket/path/to/your/script.py
@pablosjv
pablosjv / dask-vs-spark-best-experiments.csv
Created September 1, 2021 09:54
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Tables
Exp. Name,Instance Type,Instance Count,Instance Memory (GiB),Instance Cores,Machine Cost ($/h),Spot Price ($/h As Today),Worker Memory (GiB),Worker Cores,Worker Count,Batch Size (Rows),Total Rows,Job Time (min),On Demand Price,Spot Price,Price/1000 Rows,On Demand Delta with Current Prod,Spot Delta with Current Prod
Prod Spark,c5d.4xlarge,26,32,16,$0.8880,$0.3233,13,2,64,250,83957,29,$11.1592,$4.0632,$0.1329,-,-
Prod Dask,r5d.4xlarge,10,128,16,$1.3840,$0.3254,16,2,80,150,83957,29,$6.6893,$1.5729,$0.0797,40.00%,61.00%
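A hedged sketch of the arithmetic the table's price columns appear to follow (the formula is my reconstruction, not stated in the gist): total job cost = instance count × hourly rate × job minutes / 60, and price per 1,000 rows divides that by the row count in thousands.

```python
def job_cost(instance_count, hourly_rate, job_minutes):
    # Total cluster cost: instances x hourly rate x fraction of an hour run.
    return instance_count * hourly_rate * job_minutes / 60

spark_on_demand = job_cost(26, 0.8880, 29)  # matches the table's $11.1592
dask_on_demand = job_cost(10, 1.3840, 29)   # matches the table's $6.6893

# Price per 1,000 rows over the 83,957-row dataset.
spark_per_1k = spark_on_demand / 83.957
dask_per_1k = dask_on_demand / 83.957

# Dask's on-demand saving relative to the Spark production run (~40%).
delta = (spark_per_1k - dask_per_1k) / spark_per_1k
```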