Exp. Name | Instance Type | Instance Count | Instance Memory (GiB) | Instance Cores | Machine Cost ($/h) | Spot Price ($/h, as of today) | Worker Memory | Worker Cores | Worker Count | Batch Size (Rows) | Total Rows | Job Time (min) | On-Demand Job Cost | Spot Job Cost | On-Demand Cost / 1000 Rows | On-Demand Delta vs. Prod Spark | Spot Delta vs. Prod Spark
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Prod Spark | c5d.4xlarge | 26 | 32 | 16 | $0.8880 | $0.3233 | 13 | 2 | 64 | 250 | 83957 | 29 | $11.1592 | $4.0632 | $0.1329 | - | -
Prod Dask | r5d.4xlarge | 10 | 128 | 16 | $1.3840 | $0.3254 | 16 | 2 | 80 | 150 | 83957 | 29 | $6.6893 | $1.5729 | $0.0797 | 40.00% | 61.00%
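The on-demand figures in the table appear to follow from instance count, hourly price, and job time. The sketch below reproduces that arithmetic under this assumption; the function names are illustrative and not taken from the original benchmark code.

```python
def job_cost(instance_count: int, hourly_price: float, job_minutes: float) -> float:
    """Cluster cost for one job: instances x hourly price x hours of runtime."""
    return instance_count * hourly_price * job_minutes / 60.0


def cost_per_1000_rows(cost: float, total_rows: int) -> float:
    """Normalise a job cost to a price per 1000 processed rows."""
    return cost / total_rows * 1000.0


def saving(baseline_cost: float, candidate_cost: float) -> float:
    """Fraction saved by the candidate relative to the baseline."""
    return 1.0 - candidate_cost / baseline_cost


# Values copied from the table rows (on-demand prices, 29-minute job).
spark_cost = job_cost(26, 0.8880, 29)
dask_cost = job_cost(10, 1.3840, 29)

print(f"Spark: ${spark_cost:.4f}, ${cost_per_1000_rows(spark_cost, 83957):.4f} per 1000 rows")
print(f"Dask:  ${dask_cost:.4f}, ${cost_per_1000_rows(dask_cost, 83957):.4f} per 1000 rows")
print(f"On-demand saving: {saving(spark_cost, dask_cost):.2%}")
```

Because both jobs ran for the same 29 minutes, the saving reduces to the ratio of (instance count x hourly price) between the two clusters.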
#!/bin/sh
spark-submit \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YOUR_DOCKER_IMAGE} \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YOUR_DOCKER_IMAGE} \
  s3://your-bucket/path/to/your/script.py
from collections import namedtuple

from torch.utils.data import Dataset

Tokens = namedtuple("Tokens", ["input_ids", "attention_mask"])


class TokensDataset(Dataset):
    def __init__(self, iids, amask):
        self.input_ids = iids.to_numpy()
FROM amazoncorretto:8
ENV PYSPARK_DRIVER_PYTHON=python3
ENV PYSPARK_PYTHON=python3
RUN yum -y update
RUN yum -y groupinstall development
RUN yum -y update \
    && yum -y group install "Development Tools" development \
    && yum -y install yum-utils which hostname python3-devel python-devel python3-pip python3-virtualenv
"""Main Entrypoint to submit to the Spark Cluster"""
import os
from typing import Tuple

import pandas as pd
import torch
from data_components.io.files.s3 import Client
from pyspark.sql import SparkSession
from pyspark.sql.functions import PandasUDFType, col, pandas_udf
from enum import Enum

from dask.distributed import Client, LocalCluster, SpecCluster
from dask_yarn import YarnCluster


class ClusterType(Enum):
    YARN = 'yarn'
    LOCAL = 'local'
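The preview above stops at the enum, so how it is consumed is not shown. As a rough illustration only, the sketch below shows one common way such an enum can select a cluster factory from a configuration string. The factory functions and `get_cluster_sketch` are hypothetical placeholders that return strings so the example runs without Dask installed; they are not the author's `get_cluster`.

```python
from enum import Enum
from typing import Callable, Dict


class ClusterType(Enum):
    YARN = "yarn"
    LOCAL = "local"


def make_yarn_cluster() -> str:
    # Hypothetical placeholder: real code would presumably construct a YARN-backed cluster here.
    return "yarn-cluster"


def make_local_cluster() -> str:
    # Hypothetical placeholder: real code would presumably construct a local cluster here.
    return "local-cluster"


# Dispatch table from enum member to the factory that builds that kind of cluster.
_FACTORIES: Dict[ClusterType, Callable[[], str]] = {
    ClusterType.YARN: make_yarn_cluster,
    ClusterType.LOCAL: make_local_cluster,
}


def get_cluster_sketch(kind: str) -> str:
    """Look up the enum member by value, then call its factory."""
    cluster_type = ClusterType(kind.lower())  # raises ValueError for unknown values
    return _FACTORIES[cluster_type]()


print(get_cluster_sketch("local"))
```

Looking the member up by value means an unsupported string fails early with a `ValueError` instead of silently falling through to a default.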
import os

import dask
import dask.dataframe as dd
import pandas as pd
import torch
from dask.distributed import Client
from transformers import RobertaForSequenceClassification, RobertaTokenizer

from . import ClusterType, TokensDataset, get_cluster
#!/usr/bin/env bash
set -e

check_finish() {
    ID=$1
    while ! dask-yarn status "${ID}" 2>/dev/null | awk -v col=3 '{print $col}' | grep FINISHED; do
        echo -e "Application ${ID} not finished"
        sleep 5
    done
    echo -e "Application ${ID} has finished"
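The loop above polls a status command every 5 seconds and checks whether the third whitespace-separated column contains FINISHED. The same polling logic can be sketched in Python with the status command injected as a callable, which makes it testable without a cluster. The names `is_finished` and `wait_until_finished`, and the `max_polls` limit, are assumptions added for illustration; the shell version has no such limit.

```python
import time
from typing import Callable, Optional


def is_finished(status_output: str, column: int = 3) -> bool:
    """Return True if any line has 'FINISHED' in the given 1-based column."""
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) >= column and "FINISHED" in fields[column - 1]:
            return True
    return False


def wait_until_finished(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it reports FINISHED; False if max_polls is reached."""
    polls = 0
    while not is_finished(get_status()):
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        sleep(poll_seconds)
    return True
```

In real use, `get_status` could wrap a `subprocess.run` call to the status command and return its standard output; passing a no-op `sleep` keeps unit tests fast.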
Author unknown.

1.) Algorithm Complexity: You need to know Big-O. If you struggle with
basic big-O complexity analysis, then you are almost guaranteed not to
get hired.
For more information on Algorithms you can visit:
http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=alg_index

2.) Coding: You should know at least one programming language really
well, and it should preferably be C++ or Java. C# is OK too, since
#!/usr/bin/env bash
################################################################################
# Boilerplate Shell Script with getopt parsing
#
# This script is released to the Public Domain by Chad Walstrom
# Chad Walstrom <[email protected]>.
################################################################################
NOACT=0
NAME=$(basename "$0" | sed 's/\(\..*\)$//')
VERSION="0.1"