Pablo San José (pablosjv) — GitHub Gists
🏄‍♂️ Data Surfing
@pablosjv
pablosjv / dask-vs-spark-best-experiments.csv
Created September 1, 2021 09:54
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Tables
Exp. Name,Instance Type,Instance Count,Instance Memory (GB),Instance Cores,Machine Cost ($/h),Spot Price ($/h; as of today),Worker Memory (GB),Worker Cores,Worker Count,Batch Size (Rows),Total Rows,Job Time (min),On-Demand Price,Spot Price,Price / 1000 Rows,On-Demand Delta vs. Current Prod,Spot Delta vs. Current Prod
Prod Spark,c5d.4xlarge,26,32,16,$0.8880,$0.3233,13,2,64,250,83957,29,$11.1592,$4.0632,$0.1329,-,-
Prod Dask,r5d.4xlarge,10,128,16,$1.3840,$0.3254,16,2,80,150,83957,29,$6.6893,$1.5729,$0.0797,40.00%,61.00%
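The derived columns in the table follow directly from machine cost × instance count × runtime; rounding aside, a quick sketch reproduces them from the raw columns:

```python
def job_cost(cost_per_hour, instance_count, job_minutes):
    """Total cluster cost for one run: hourly machine cost x fleet size x runtime in hours."""
    return cost_per_hour * instance_count * job_minutes / 60

# Prod Spark: 26 x c5d.4xlarge at $0.8880/h for 29 min
spark_on_demand = job_cost(0.8880, 26, 29)   # ~= $11.1592
# Prod Dask: 10 x r5d.4xlarge at $1.3840/h for 29 min
dask_on_demand = job_cost(1.3840, 10, 29)    # ~= $6.6893

# Price per 1000 scored rows (83,957 rows in both runs)
spark_per_1k = spark_on_demand / 83957 * 1000   # ~= $0.1329
dask_per_1k = dask_on_demand / 83957 * 1000     # ~= $0.0797

# Savings of the Dask run vs. the current prod (Spark) run
on_demand_delta = 1 - dask_on_demand / spark_on_demand   # ~= 40%
```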
@pablosjv
pablosjv / spark-submit-example.sh
Last active August 27, 2021 16:10
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
#!/bin/sh
spark-submit \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YOUR_DOCKER_IMAGE} \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YOUR_DOCKER_IMAGE} \
s3://your-bucket/path/to/your/script.py
@pablosjv
pablosjv / tokens_dataset.py
Created August 27, 2021 11:58
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
from collections import namedtuple
from torch.utils.data import Dataset

Tokens = namedtuple("Tokens", ["input_ids", "attention_mask"])

class TokensDataset(Dataset):
    """Dataset wrapping tokenized inputs: input ids plus attention mask."""
    def __init__(self, iids, amask):
        self.input_ids = iids.to_numpy()
        self.attention_mask = amask.to_numpy()

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return Tokens(self.input_ids[idx], self.attention_mask[idx])
@pablosjv
pablosjv / emr.Dockerfile
Created August 27, 2021 11:58
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
FROM amazoncorretto:8
ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3
RUN yum -y update \
    && yum -y groupinstall "Development Tools" \
    && yum -y install yum-utils which hostname python3-devel python-devel python3-pip python3-virtualenv
@pablosjv
pablosjv / predict_spark.py
Created August 27, 2021 11:57
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
"""Main Entrypoint to submit to the Spark Cluster"""
import os
from typing import Tuple
import pandas as pd
import torch
from data_components.io.files.s3 import Client
from pyspark.sql import SparkSession
from pyspark.sql.functions import PandasUDFType, col, pandas_udf
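The imports above suggest the gist scores rows through a scalar pandas UDF. A minimal sketch of the per-batch function such a UDF would wrap is below; `score_batch` and the stub `model_score` are illustrative names, not from the gist, and the stub stands in for the real PyTorch forward pass so the example is self-contained:

```python
import pandas as pd

def model_score(texts):
    # Stand-in for the real model inference; returns one score per input.
    # (Illustrative stub -- the gist loads an actual PyTorch model instead.)
    return [float(len(t)) for t in texts]

def score_batch(texts: pd.Series) -> pd.Series:
    """Body of a scalar pandas UDF: one pandas Series in, one Series out.

    Spark feeds the UDF Arrow-encoded batches of rows, so the model
    runs once per batch rather than once per row.
    """
    return pd.Series(model_score(texts.tolist()), index=texts.index)

# On the Spark side this would be registered roughly as:
#   predict = pandas_udf(score_batch, returnType="double")
#   df = df.withColumn("score", predict(col("text")))
```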
@pablosjv
pablosjv / get_dask_cluster.py
Created August 27, 2021 11:56
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
from enum import Enum
from dask.distributed import Client, LocalCluster, SpecCluster
from dask_yarn import YarnCluster
class ClusterType(Enum):
YARN = 'yarn'
LOCAL = 'local'
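The snippet cuts off after the enum, but the imports imply a factory that maps each `ClusterType` to a Dask cluster constructor. A hedged sketch of that dispatch pattern follows; the two cluster classes here are placeholder stand-ins for `dask_yarn.YarnCluster` and `dask.distributed.LocalCluster` so the example runs without a Dask installation:

```python
from enum import Enum

class ClusterType(Enum):
    YARN = 'yarn'
    LOCAL = 'local'

# Placeholder stand-ins for dask.distributed.LocalCluster / dask_yarn.YarnCluster.
class LocalCluster:
    def __init__(self, n_workers=2):
        self.n_workers = n_workers

class YarnCluster:
    def __init__(self, worker_vcores=2, worker_memory="16GiB"):
        self.worker_vcores = worker_vcores
        self.worker_memory = worker_memory

def get_cluster(cluster_type: ClusterType, **kwargs):
    """Factory: map the enum to the matching cluster constructor."""
    if cluster_type is ClusterType.YARN:
        return YarnCluster(**kwargs)
    if cluster_type is ClusterType.LOCAL:
        return LocalCluster(**kwargs)
    raise ValueError(f"Unsupported cluster type: {cluster_type}")
```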
@pablosjv
pablosjv / dask_predict.py
Created August 27, 2021 11:55
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
import os
import dask
import dask.dataframe as dd
import pandas as pd
import torch
from dask.distributed import Client
from transformers import RobertaForSequenceClassification, RobertaTokenizer
from . import ClusterType, TokensDataset, get_cluster
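Given the `dask.dataframe` import, the scoring step presumably runs per partition via `DataFrame.map_partitions`. A sketch of the partition-level function is below; `predict_partition` and the `model_fn` argument are assumed names, and `model_fn` stands in for tokenizing with `RobertaTokenizer` and running the real model:

```python
import pandas as pd

def predict_partition(pdf: pd.DataFrame, model_fn) -> pd.DataFrame:
    """Score one pandas partition of a Dask DataFrame.

    In the real gist this would build a TokensDataset and run the
    RoBERTa model; model_fn is an illustrative stand-in.
    """
    out = pdf.copy()
    out["score"] = model_fn(pdf["text"].tolist())
    return out

# With a real Dask DataFrame `ddf`, this is applied lazily per partition:
#   scored = ddf.map_partitions(predict_partition, model_fn, meta=...)
```

Because `map_partitions` hands each worker a plain pandas DataFrame, the function can be unit-tested without a cluster.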
@pablosjv
pablosjv / dask-submit-launcher.sh
Last active August 27, 2021 11:59
Large Scale Pytorch Inference Pipeline: Spark vs Dask - Code Examples
#!/usr/bin/env bash
set -e
check_finish() {
    ID=$1
    # Poll dask-yarn until the application's status column reports FINISHED
    while ! dask-yarn status "${ID}" 2>/dev/null | awk '{print $3}' | grep -q FINISHED; do
        echo "Application ${ID} not finished yet"
        sleep 5
    done
    echo "Application ${ID} has finished"
}
@pablosjv
pablosjv / gist:c3f646901c8df61239ccd1cfd13d7dc5
Created September 15, 2019 16:36 — forked from swenson/gist:cf74cd8e282443b43b8a
Google Interview Study Guide
Author unknown.
1.) Algorithm Complexity: You need to know Big-O. If you struggle with
basic big-O complexity analysis, then you are almost guaranteed not to
get hired.
For more information on Algorithms you can visit:
http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=alg_index
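A toy illustration of the point (not from the guide): the same membership question has different complexity depending on the data structure, which is exactly the kind of trade-off interviewers probe.

```python
# Same question ("is x present?"), different complexity.
items = list(range(100_000))
as_set = set(items)

def in_list(x):
    # O(n): scans the list element by element
    return x in items

def in_set(x):
    # O(1) on average: a single hash lookup
    return x in as_set
```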
2.) Coding: You should know at least one programming language really
well, and it should preferably be C++ or Java. C# is OK too, since
@pablosjv
pablosjv / getopt-boilerplate.sh
Created May 30, 2019 16:14 — forked from runswithd6s/getopt-boilerplate.sh
BASH Script Boilerplate
#!/usr/bin/env bash
################################################################################
# Boilerplate Shell Script with getopt parsing
#
# This script is released to the Public Domain by Chad Walstrom
# Chad Walstrom <[email protected]>.
################################################################################
NOACT=0
NAME=$(basename "$0" | sed 's/\(\..*\)$//')
VERSION="0.1"