Pablo San José (pablosjv) — GitHub Gists
🏄‍♂️ Data Surfing
@pablosjv
pablosjv / dask-vs-spark-best-experiments.csv
Created September 1, 2021 09:54
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Tables
Exp. Name,Instance Type,Instance Count,Instance Memory (GB),Instance Cores,Machine Cost ($/h),Spot Price ($/h; as of today),Worker Memory (GB),Worker Cores,Worker Count,Batch Size (Rows),Total Rows,Job Time (min),On-Demand Price,Spot Price,Price / 1000 Rows,On-Demand Delta vs. Current Prod,Spot Delta vs. Current Prod
Prod Spark,c5d.4xlarge,26,32,16,$0.8880,$0.3233,13,2,64,250,83957,29,$11.1592,$4.0632,$0.1329,-,-
Prod Dask,r5d.4xlarge,10,128,16,$1.3840,$0.3254,16,2,80,150,83957,29,$6.6893,$1.5729,$0.0797,40.00%,61.00%
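The derived columns in the table follow directly from machine cost × instance count × runtime; rounding aside, a quick sketch reproduces them from the raw columns:

```python
def job_cost(cost_per_hour, instance_count, job_minutes):
    """Total cluster cost for one run: hourly machine cost x fleet size x runtime in hours."""
    return cost_per_hour * instance_count * job_minutes / 60

# Prod Spark: 26 x c5d.4xlarge at $0.8880/h for 29 min
spark_on_demand = job_cost(0.8880, 26, 29)   # ~= $11.1592
# Prod Dask: 10 x r5d.4xlarge at $1.3840/h for 29 min
dask_on_demand = job_cost(1.3840, 10, 29)    # ~= $6.6893

# Price per 1000 scored rows (83,957 rows in both runs)
spark_per_1k = spark_on_demand / 83957 * 1000   # ~= $0.1329
dask_per_1k = dask_on_demand / 83957 * 1000     # ~= $0.0797

# Savings of the Dask run vs. the current prod (Spark) run
on_demand_delta = 1 - dask_on_demand / spark_on_demand   # ~= 40%
```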
@pablosjv
pablosjv / spark-submit-example.sh
Last active August 27, 2021 16:10
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
#!/bin/sh
spark-submit \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YOUR_DOCKER_IMAGE} \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YOUR_DOCKER_IMAGE} \
s3://your-bucket/path/to/your/script.py
@pablosjv
pablosjv / tokens_dataset.py
Created August 27, 2021 11:58
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
from collections import namedtuple
from torch.utils.data import Dataset

Tokens = namedtuple("Tokens", ["input_ids", "attention_mask"])

class TokensDataset(Dataset):
    """Dataset wrapping tokenized inputs: input ids plus attention mask."""
    def __init__(self, iids, amask):
        self.input_ids = iids.to_numpy()
        self.attention_mask = amask.to_numpy()

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return Tokens(self.input_ids[idx], self.attention_mask[idx])
@pablosjv
pablosjv / emr.Dockerfile
Created August 27, 2021 11:58
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
FROM amazoncorretto:8
ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3
RUN yum -y update \
    && yum -y groupinstall "Development Tools" \
    && yum -y install yum-utils which hostname python3-devel python-devel python3-pip python3-virtualenv
@pablosjv
pablosjv / predict_spark.py
Created August 27, 2021 11:57
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
"""Main Entrypoint to submit to the Spark Cluster"""
import os
from typing import Tuple
import pandas as pd
import torch
from data_components.io.files.s3 import Client
from pyspark.sql import SparkSession
from pyspark.sql.functions import PandasUDFType, col, pandas_udf
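The imports above suggest the gist scores rows through a scalar pandas UDF. A minimal sketch of the per-batch function such a UDF would wrap is below; `score_batch` and the stub `model_score` are illustrative names, not from the gist, and the stub stands in for the real PyTorch forward pass so the example is self-contained:

```python
import pandas as pd

def model_score(texts):
    # Stand-in for the real model inference; returns one score per input.
    # (Illustrative stub -- the gist loads an actual PyTorch model instead.)
    return [float(len(t)) for t in texts]

def score_batch(texts: pd.Series) -> pd.Series:
    """Body of a scalar pandas UDF: one pandas Series in, one Series out.

    Spark feeds the UDF Arrow-encoded batches of rows, so the model
    runs once per batch rather than once per row.
    """
    return pd.Series(model_score(texts.tolist()), index=texts.index)

# On the Spark side this would be registered roughly as:
#   predict = pandas_udf(score_batch, returnType="double")
#   df = df.withColumn("score", predict(col("text")))
```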
@pablosjv
pablosjv / get_dask_cluster.py
Created August 27, 2021 11:56
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
from enum import Enum
from dask.distributed import Client, LocalCluster, SpecCluster
from dask_yarn import YarnCluster
class ClusterType(Enum):
YARN = 'yarn'
LOCAL = 'local'
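The snippet cuts off after the enum, but the imports imply a factory that maps each `ClusterType` to a Dask cluster constructor. A hedged sketch of that dispatch pattern follows; the two cluster classes here are placeholder stand-ins for `dask_yarn.YarnCluster` and `dask.distributed.LocalCluster` so the example runs without a Dask installation:

```python
from enum import Enum

class ClusterType(Enum):
    YARN = 'yarn'
    LOCAL = 'local'

# Placeholder stand-ins for dask.distributed.LocalCluster / dask_yarn.YarnCluster.
class LocalCluster:
    def __init__(self, n_workers=2):
        self.n_workers = n_workers

class YarnCluster:
    def __init__(self, worker_vcores=2, worker_memory="16GiB"):
        self.worker_vcores = worker_vcores
        self.worker_memory = worker_memory

def get_cluster(cluster_type: ClusterType, **kwargs):
    """Factory: map the enum to the matching cluster constructor."""
    if cluster_type is ClusterType.YARN:
        return YarnCluster(**kwargs)
    if cluster_type is ClusterType.LOCAL:
        return LocalCluster(**kwargs)
    raise ValueError(f"Unsupported cluster type: {cluster_type}")
```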
@pablosjv
pablosjv / dask_predict.py
Created August 27, 2021 11:55
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
import os
import dask
import dask.dataframe as dd
import pandas as pd
import torch
from dask.distributed import Client
from transformers import RobertaForSequenceClassification, RobertaTokenizer
from . import ClusterType, TokensDataset, get_cluster
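Given the `dask.dataframe` import, the scoring step presumably runs per partition via `DataFrame.map_partitions`. A sketch of the partition-level function is below; `predict_partition` and the `model_fn` argument are assumed names, and `model_fn` stands in for tokenizing with `RobertaTokenizer` and running the real model:

```python
import pandas as pd

def predict_partition(pdf: pd.DataFrame, model_fn) -> pd.DataFrame:
    """Score one pandas partition of a Dask DataFrame.

    In the real gist this would build a TokensDataset and run the
    RoBERTa model; model_fn is an illustrative stand-in.
    """
    out = pdf.copy()
    out["score"] = model_fn(pdf["text"].tolist())
    return out

# With a real Dask DataFrame `ddf`, this is applied lazily per partition:
#   scored = ddf.map_partitions(predict_partition, model_fn, meta=...)
```

Because `map_partitions` hands each worker a plain pandas DataFrame, the function can be unit-tested without a cluster.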
@pablosjv
pablosjv / dask-submit-launcher.sh
Last active August 27, 2021 11:59
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
#!/usr/bin/env bash
set -e
check_finish() {
    ID=$1
    # Poll dask-yarn until the application's status column reports FINISHED
    while ! dask-yarn status "${ID}" 2>/dev/null | awk '{print $3}' | grep -q FINISHED; do
        echo "Application ${ID} not finished yet"
        sleep 5
    done
    echo "Application ${ID} has finished"
}
@pablosjv
pablosjv / gist:c3f646901c8df61239ccd1cfd13d7dc5
Created September 15, 2019 16:36 — forked from swenson/gist:cf74cd8e282443b43b8a
Google Interview Study Guide
Author unknown.
1.) Algorithm Complexity: You need to know Big-O. If you struggle with
basic big-O complexity analysis, then you are almost guaranteed not to
get hired.
For more information on Algorithms you can visit:
http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=alg_index
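A toy illustration of the point (not from the guide): the same membership question has different complexity depending on the data structure, which is exactly the kind of trade-off interviewers probe.

```python
# Same question ("is x present?"), different complexity.
items = list(range(100_000))
as_set = set(items)

def in_list(x):
    # O(n): scans the list element by element
    return x in items

def in_set(x):
    # O(1) on average: a single hash lookup
    return x in as_set
```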
2.) Coding: You should know at least one programming language really
well, and it should preferably be C++ or Java. C# is OK too, since
@pablosjv
pablosjv / getopt-boilerplate.sh
Created May 30, 2019 16:14 — forked from runswithd6s/getopt-boilerplate.sh
BASH Script Boilerplate
#!/usr/bin/env bash
################################################################################
# Boilerplate Shell Script with getopt parsing
#
# This script is released to the Public Domain by Chad Walstrom
# Chad Walstrom <[email protected]>.
################################################################################
NOACT=0
NAME=$(basename "$0" | sed 's/\(\..*\)$//')
VERSION="0.1"