Author: https://github.com/seanorama
Note: This was tested on HDP 3.1. It may not work with other Spark/YARN distributions.

References:
- https://community.cloudera.com/t5/Community-Articles/Using-VirtualEnv-with-PySpark/ta-p/245905
- https://community.cloudera.com/t5/Community-Articles/Running-PySpark-with-Conda-Env/ta-p/247551
- https://stackoverflow.com/questions/48770263/bundling-python3-packages-for-pyspark-results-in-missing-imports
- https://stackoverflow.com/questions/53524696/livy-no-yarn-application-is-found-with-tag-livy-batch-10-hg3po7kp-in-120-seconds
- https://conda.github.io/conda-pack/spark.html
- https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties
- https://spark.apache.org/docs/latest/configuration.html
- https://livy.incubator.apache.org/docs/latest/rest-api.html
- https://github.com/apache/zeppelin/blob/master/docs/interpreter/livy.md#configuration
By default, `pyspark` jobs will use the Python from the local system of each YARN NodeManager host. This means:
- If a different version of Python is needed, it must be installed and maintained on all of the hosts.
- If additional modules (e.g. from `pip`) are needed, they must be installed and maintained on all of the hosts.

This presents a lot of overhead and introduces many risks. Also, the Spark developers typically do not have direct access to the YARN NodeManager hosts.

Further, it is good practice to manage dependencies from the development side, which is only possible if all Python dependencies are "self-contained".
There are 2 methods to have self-contained environments:

- a) Use an archive (i.e. a `tar.gz`) of a Python environment (`virtualenv` or `conda`):
  - Benefits:
    - Faster load time than an empty `virtualenv` since the packages are already present.
    - Uses the YARN Shared Cache: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/SharedCache.html
      - During the 1st use of the environment it will be cached; future uses will load very fast.
    - Not tied to the Python installed on the NodeManager machines, meaning any Python version can be used and nothing ever has to be done on the NodeManagers.
    - Completely self-contained. No external dependencies.
  - Caveats:
    - Can't install additional packages at run-time.
- b) Use `spark.pyspark.virtualenv`, which creates a new virtualenv at Spark runtime:
  - Benefits:
    - Install packages at runtime.
  - Caveats:
    - Not entirely self-contained, since it depends on the interpreter being available on all YARN NodeManager hosts (i.e. at /usr/bin/python3 or wherever you place it), which also makes it harder to change versions.
    - Very SLOW, as every job has to download and build the `pip` packages.
    - Depends on public internet access, or a local network pypi/conda repo.
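For orientation, here is a minimal sketch of the key `spark-submit` configuration each method relies on (paths and `app.py` are placeholders; full working examples follow below):

```bash
## Method a) ship a pre-built archive; "#environment" is the directory name it is unpacked to on YARN
spark-submit \
  --conf spark.yarn.dist.archives=hdfs:///share/python-envs/python3-venv.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --master yarn --deploy-mode cluster app.py

## Method b) build a virtualenv at runtime (requires python/virtualenv on every NodeManager host)
spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
  --master yarn --deploy-mode cluster app.py
```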
Overview:

- Create an environment with `virtualenv` or `conda`.
- Archive the environment to a `.tar.gz` or `.zip`.
- Upload the archive to HDFS.
- Tell Spark (via `spark-submit`, `pyspark`, `livy`, `zeppelin`) to use this environment.
- Repeat for each different virtualenv that is required or when the virtualenv needs updating.
Create a shared location on HDFS. This is only necessary if sharing the environment and if a location doesn't already exist.
```bash
sudo -u hdfs -i
## if kerberos
keytab=/etc/security/keytabs/hdfs.headless.keytab
kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
## endif
hdfs dfs -mkdir -p /share/python-envs
hdfs dfs -chmod -R 775 /share
## replace the group with a user group that will be managing the archives
hdfs dfs -chown -R hdfs:hadoop /share
exit
```
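Optionally, confirm the location exists with the expected ownership and permissions:

```bash
## Optional: verify the shared directory
hdfs dfs -ls -d /share/python-envs
```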
- Install `python3` and `python-virtualenv` on a host with the same Operating System and CPU architecture as the cluster, such as an edge host:
  - Can be in your home-directory without root/sudo access, e.g. by:
    - downloading Python manually
    - using `pyenv` or a similar application
  - Or system-wide:

```bash
sudo yum install python3 python-virtualenv
```
- Create the archive:
```bash
## Create requirements for `pip`
tee requirements.txt > /dev/null << EOF
arrow
jupyter
numpy
pandas
scikit-learn
EOF
## The name of the environment
env="python3-venv"
## Create the environment
python3 -m venv ${env} --copies
source ${env}/bin/activate
pip install -U pip
pip install -r requirements.txt
deactivate
## Archive the environment
cd ${env}
tar -hzcf ../${env}.tar.gz *
cd ..
```
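Optionally, do a quick sanity check that the archive contains the interpreter at the relative path Spark will be pointed at (`./bin/python`):

```bash
## Optional sanity check: the archive should list bin/python at its top level
tar -tzf ${env}.tar.gz | grep -m1 '^bin/python'
```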
- Install Anaconda or Miniconda on a host with the same Operating System and CPU architecture as the cluster, such as an edge host:
  - Can be in your home-directory without root/sudo access:
    - Anaconda: https://www.anaconda.com/distribution/
    - Miniconda: https://docs.conda.io/en/latest/miniconda.html
  - Or system-wide with `yum` or `apt`.
- Create the archive:
```bash
## Create requirements for `conda`
tee requirements.txt > /dev/null << EOF
arrow
jupyter
numpy
pandas
scikit-learn
EOF
## The name of the environment
env="python3-venv"
## Create the environment
conda create -y -n ${env} python=3.7  ## May give a file-system error. If so, simply run again. This is due to conda not being initialized.
conda activate ${env}
conda install -y -n ${env} --file requirements.txt
conda install -y -n ${env} -c conda-forge conda-pack
## Archive the environment
conda pack -f -n ${env} -o ${env}.tar.gz
conda deactivate
```
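Before uploading, an optional local smoke test of the packed environment (a habit, not a required step; assumes the build host can run the packed interpreter directly):

```bash
## Optional: extract to a temp dir and confirm the packaged interpreter can import the packages
mkdir -p /tmp/${env}-check
tar -xzf ${env}.tar.gz -C /tmp/${env}-check
## If imports fail due to hard-coded prefixes, run /tmp/${env}-check/bin/conda-unpack first
/tmp/${env}-check/bin/python -c "import numpy, pandas; print(numpy.__version__)"
rm -rf /tmp/${env}-check
```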
- Put to HDFS:
  - You may need to `kinit` first. You can do this as your own user.

```bash
hdfs dfs -put -f ${env}.tar.gz /share/python-envs/
hdfs dfs -chmod 0664 /share/python-envs/${env}.tar.gz
```
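Optionally verify the upload before pointing Spark at it:

```bash
## Optional: confirm the archive is in place and readable
hdfs dfs -ls /share/python-envs/${env}.tar.gz
```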
- Test it:
```bash
## Create test script
tee test.py > /dev/null << EOF
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
conf.setAppName('pyspark-test')
sc = SparkContext(conf=conf)
import numpy
print("Hello World!")
sc.parallelize(range(1,10)).map(lambda x : numpy.__version__).collect()
EOF
## Submit to Spark
deactivate
conda deactivate
## Note: `--archives` can be used instead of `--conf spark.yarn.dist.archives`. I prefer to see the full conf statement.
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.dist.archives=hdfs:///share/python-envs/${env}.tar.gz#environment \
--master yarn \
--deploy-mode cluster \
test.py
## Check the logs: Update the `id` to the id of your job from above.
id=application_GetTheIdFromOutputOfCommandAbove
yarn logs -applicationId ${id} | grep "Hello World"
```
- Update Zeppelin %livy2.pyspark
  - Note: This only works with YARN `cluster` deploy mode, which is the default in HDP 3.
  - In Zeppelin: click the "top right" menu -> Interpreters -> Add to the livy2 interpreter:

```
livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
livy.spark.yarn.dist.archives=hdfs:///share/python-envs/python3-venv.tar.gz#environment
```
- Test from a notebook:
```
%livy2.pyspark
import numpy
print("Hello World!")
sc.parallelize(range(1,10)).map(lambda x : numpy.__version__).collect()
```
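The same archive can also be used when submitting directly through the Livy REST API (linked in the references above). A minimal sketch, assuming Livy listens on the Apache default port 8998 (HDP may use a different port) and that your application script has been uploaded to HDFS; the script path below is a placeholder:

```bash
## Submit a batch through Livy with the archived environment (add `--negotiate -u :` if Kerberos/SPNEGO is enabled)
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{
        "file": "hdfs:///share/python-envs/test.py",
        "conf": {
          "spark.yarn.dist.archives": "hdfs:///share/python-envs/python3-venv.tar.gz#environment",
          "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./environment/bin/python"
        }
      }' \
  http://livy-host:8998/batches
```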
- Install `python3` and `python-virtualenv`:
  - For this method it must be done on all hosts in the cluster.
  - See the archive instructions above for more details on how to install.
- Example using the `pyspark` shell:
```bash
PYSPARK_PYTHON=/bin/python3 \
pyspark \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
--conf spark.pyspark.virtualenv.python_version=3.6 \
--master yarn \
--deploy-mode client
## then, in the shell, install packages:
sc.install_packages(["numpy"])
```
- Example using `spark-submit`:
  - Note the addition of a `requirements` file.
    - This is optional; you could use the same `sc.install_packages` call inside `test.py` instead.
    - If using it from HDFS, make sure to upload it first (see the sketch after the example below) ;)
    - You can also use it with `pyspark` above, but then the file must be on the localhost that you are executing `pyspark` from.
```bash
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
--conf spark.pyspark.virtualenv.python_version=3.6 \
--conf spark.pyspark.virtualenv.requirements=hdfs:///share/python-envs/python3-venv.requirements.txt \
--master yarn \
--deploy-mode cluster \
test.py
```
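For reference, a minimal sketch of creating and uploading the requirements file referenced by `spark.pyspark.virtualenv.requirements` above (the package list is just an example):

```bash
## Create a requirements file for the runtime virtualenv and upload it to HDFS
tee python3-venv.requirements.txt > /dev/null << EOF
numpy
EOF
hdfs dfs -put -f python3-venv.requirements.txt /share/python-envs/
```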
- Update Zeppelin %livy2.pyspark
  - In Zeppelin: click the "top right" menu -> Interpreters -> Add to the livy2 interpreter:
```
livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3
livy.spark.pyspark.virtualenv.enabled=true
livy.spark.pyspark.virtualenv.type=virtualenv
livy.spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv
livy.spark.pyspark.virtualenv.python_version=3.7
livy.spark.pyspark.virtualenv.requirements=hdfs:///share/python-envs/python3-venv.requirements.txt ## this is optional
```
Note: this last configuration didn't work for me!