Author: https://github.com/seanorama
Note: This was tested on HDP 3.1. It may not work with other Spark/YARN distributions.

References:
- https://community.cloudera.com/t5/Community-Articles/Using-VirtualEnv-with-PySpark/ta-p/245905
- https://community.cloudera.com/t5/Community-Articles/Running-PySpark-with-Conda-Env/ta-p/247551
- https://stackoverflow.com/questions/48770263/bundling-python3-packages-for-pyspark-results-in-missing-imports
- https://stackoverflow.com/questions/53524696/livy-no-yarn-application-is-found-with-tag-livy-batch-10-hg3po7kp-in-120-seconds
- https://conda.github.io/conda-pack/spark.html
- https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties
- https://spark.apache.org/docs/latest/configuration.html
- https://livy.incubator.apache.org/docs/latest/rest-api.html
- https://github.com/apache/zeppelin/blob/master/docs/interpreter/livy.md#configuration
By default, `pyspark` jobs will use the Python from the local system of each YARN NodeManager host. This means:
- If a different version of Python is needed, it must be installed and maintained on all of the hosts.
- If additional modules (e.g. from `pip`) are needed, they must be installed and maintained on all of the hosts.

This presents a lot of overhead and introduces many risks. Also, the Spark developers typically do not have direct access to the YARN NodeManager hosts.

Further, it is good practice to manage dependencies from the development side, which is only possible if all Python dependencies are "self-contained".
There are 2 methods to have self-contained environments:

- a) Use an archive (i.e. a `tar.gz`) of a Python environment (`virtualenv` or `conda`):
  - Benefits:
    - Faster load time than an empty `virtualenv` since the packages are already present.
    - Uses the YARN Shared Cache: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/SharedCache.html
      - During the 1st use of the environment it will be cached; future uses will load very fast.
    - Not tied to the Python installed on the NodeManager machines, meaning any Python version can be used and nothing ever has to be done on the NodeManagers.
    - Completely self-contained. No external dependencies.
  - Caveats:
    - Can't install additional packages at run-time.
- b) Use `spark.pyspark.virtualenv`, which creates a new virtualenv at Spark runtime:
  - Benefits:
    - Install packages at runtime.
  - Caveats:
    - Not entirely self-contained, since it depends on the interpreter being available on all YARN NodeManager hosts (i.e. at /usr/bin/python3 or wherever you place it), which also makes it harder to change versions.
    - Very SLOW, as every job has to download and build the `pip` packages.
    - Depends on public internet access, or a local network pypi/conda repo.
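For orientation, here is a minimal sketch of the key `spark-submit` configuration each method relies on (paths and `app.py` are placeholders; full working examples follow below):

```bash
## Method a) ship a pre-built archive; "#environment" is the directory name it is unpacked to on YARN
spark-submit \
  --conf spark.yarn.dist.archives=hdfs:///share/python-envs/python3-venv.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --master yarn --deploy-mode cluster app.py

## Method b) build a virtualenv at runtime (requires python/virtualenv on every NodeManager host)
spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
  --master yarn --deploy-mode cluster app.py
```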
Overview:

- Create an environment with `virtualenv` or `conda`.
- Archive the environment to a `.tar.gz` or `.zip`.
- Upload the archive to HDFS.
- Tell Spark (via `spark-submit`, `pyspark`, `livy`, `zeppelin`) to use this environment.
- Repeat for each different virtualenv that is required or when the virtualenv needs updating.
Create a shared location on HDFS. This is only necessary if sharing the environment and if a location doesn't already exist.
```bash
sudo -u hdfs -i
## if kerberos
keytab=/etc/security/keytabs/hdfs.headless.keytab
kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
## endif
hdfs dfs -mkdir -p /share/python-envs
hdfs dfs -chmod -R 775 /share
## replace the group with a user group that will be managing the archives
hdfs dfs -chown -R hdfs:hadoop /share
exit
```
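Optionally, confirm the location exists with the expected ownership and permissions:

```bash
## Optional: verify the shared directory
hdfs dfs -ls -d /share/python-envs
```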
- Install `python3` and `python-virtualenv` on a host with the same Operating System and CPU architecture as the cluster, such as an edge host:
  - Can be in your home-directory without root/sudo access, e.g. by:
    - downloading Python manually
    - using `pyenv` or a similar application
  - Or system-wide:

```bash
sudo yum install python3 python-virtualenv
```
- Create the archive:
```bash
## Create requirements for `pip`
tee requirements.txt > /dev/null << EOF
arrow
jupyter
numpy
pandas
scikit-learn
EOF
## The name of the environment
env="python3-venv"
## Create the environment
python3 -m venv ${env} --copies
source ${env}/bin/activate
pip install -U pip
pip install -r requirements.txt
deactivate
## Archive the environment
cd ${env}
tar -hzcf ../${env}.tar.gz *
cd ..
```
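Optionally, do a quick sanity check that the archive contains the interpreter at the relative path Spark will be pointed at (`./bin/python`):

```bash
## Optional sanity check: the archive should list bin/python at its top level
tar -tzf ${env}.tar.gz | grep -m1 '^bin/python'
```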
- Install Anaconda or Miniconda on a host with the same Operating System and CPU architecture as the cluster, such as an edge host:
  - Can be in your home-directory without root/sudo access:
    - Anaconda: https://www.anaconda.com/distribution/
    - Miniconda: https://docs.conda.io/en/latest/miniconda.html
  - Or system-wide with `yum` or `apt`.
- Create the archive:
```bash
## Create requirements for `conda`
tee requirements.txt > /dev/null << EOF
arrow
jupyter
numpy
pandas
scikit-learn
EOF
## The name of the environment
env="python3-venv"
## Create the environment
conda create -y -n ${env} python=3.7  ## May give a file-system error. If so, simply run again. This is due to conda not being initialized.
conda activate ${env}
conda install -y -n ${env} --file requirements.txt
conda install -y -n ${env} -c conda-forge conda-pack
## Archive the environment
conda pack -f -n ${env} -o ${env}.tar.gz
conda deactivate
```
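Before uploading, an optional local smoke test of the packed environment (a habit, not a required step; assumes the build host can run the packed interpreter directly):

```bash
## Optional: extract to a temp dir and confirm the packaged interpreter can import the packages
mkdir -p /tmp/${env}-check
tar -xzf ${env}.tar.gz -C /tmp/${env}-check
## If imports fail due to hard-coded prefixes, run /tmp/${env}-check/bin/conda-unpack first
/tmp/${env}-check/bin/python -c "import numpy, pandas; print(numpy.__version__)"
rm -rf /tmp/${env}-check
```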
- Put to HDFS:
  - You may need to `kinit` first. You can do this as your own user.

```bash
hdfs dfs -put -f ${env}.tar.gz /share/python-envs/
hdfs dfs -chmod 0664 /share/python-envs/${env}.tar.gz
```
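Optionally verify the upload before pointing Spark at it:

```bash
## Optional: confirm the archive is in place and readable
hdfs dfs -ls /share/python-envs/${env}.tar.gz
```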
- Test it:
```bash
## Create test script
tee test.py > /dev/null << EOF
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
conf.setAppName('pyspark-test')
sc = SparkContext(conf=conf)
import numpy
print("Hello World!")
sc.parallelize(range(1,10)).map(lambda x : numpy.__version__).collect()
EOF
## Submit to Spark
deactivate
conda deactivate
## Note: `--archives` can be used instead of `--conf spark.yarn.dist.archives`. I prefer to see the full conf statement.
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.dist.archives=hdfs:///share/python-envs/${env}.tar.gz#environment \
--master yarn \
--deploy-mode cluster \
test.py
## Check the logs: Update the `id` to the id of your job from above.
id=application_GetTheIdFromOutputOfCommandAbove
yarn logs -applicationId ${id} | grep "Hello World"
```
- Update Zeppelin %livy2.pyspark
  - Note: This only works with YARN `cluster` deploy mode, which is the default in HDP 3.
  - In Zeppelin: click the "top right" menu -> Interpreters -> Add to the livy2 interpreter:

```
livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
livy.spark.yarn.dist.archives=hdfs:///share/python-envs/python3-venv.tar.gz#environment
```
- Test from a notebook:
```
%livy2.pyspark
import numpy
print("Hello World!")
sc.parallelize(range(1,10)).map(lambda x : numpy.__version__).collect()
```
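The same archive can also be used when submitting directly through the Livy REST API (linked in the references above). A minimal sketch, assuming Livy listens on the Apache default port 8998 (HDP may use a different port) and that your application script has been uploaded to HDFS; the script path below is a placeholder:

```bash
## Submit a batch through Livy with the archived environment (add `--negotiate -u :` if Kerberos/SPNEGO is enabled)
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{
        "file": "hdfs:///share/python-envs/test.py",
        "conf": {
          "spark.yarn.dist.archives": "hdfs:///share/python-envs/python3-venv.tar.gz#environment",
          "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./environment/bin/python"
        }
      }' \
  http://livy-host:8998/batches
```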
- Install `python3` and `python-virtualenv`:
  - For this method it must be done on all hosts in the cluster.
  - See the archive instructions above for more details on how to install.
- Example using the `pyspark` shell:
```bash
PYSPARK_PYTHON=/bin/python3 \
pyspark \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
--conf spark.pyspark.virtualenv.python_version=3.6 \
--master yarn \
--deploy-mode client
## then, in the shell, install packages:
sc.install_packages(["numpy"])
```
- Example using `spark-submit`:
  - Note the addition of a `requirements` file.
    - This is optional; you could use the same `sc.install_packages` call inside `test.py` instead.
    - If using it from HDFS, make sure to upload it first (see the sketch after the example below) ;)
    - You can also use it with `pyspark` above, but then the file must be on the localhost that you are executing `pyspark` from.
```bash
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
--conf spark.pyspark.virtualenv.python_version=3.6 \
--conf spark.pyspark.virtualenv.requirements=hdfs:///share/python-envs/python3-venv.requirements.txt \
--master yarn \
--deploy-mode cluster \
test.py
```
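For reference, a minimal sketch of creating and uploading the requirements file referenced by `spark.pyspark.virtualenv.requirements` above (the package list is just an example):

```bash
## Create a requirements file for the runtime virtualenv and upload it to HDFS
tee python3-venv.requirements.txt > /dev/null << EOF
numpy
EOF
hdfs dfs -put -f python3-venv.requirements.txt /share/python-envs/
```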
- Update Zeppelin %livy2.pyspark
  - In Zeppelin: click the "top right" menu -> Interpreters -> Add to the livy2 interpreter:
```
livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3
livy.spark.pyspark.virtualenv.enabled=true
livy.spark.pyspark.virtualenv.type=virtualenv
livy.spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv
livy.spark.pyspark.virtualenv.python_version=3.7
livy.spark.pyspark.virtualenv.requirements=hdfs:///share/python-envs/python3-venv.requirements.txt ## this is optional
```
Note: this last configuration didn't work for me!