JupyterLab and Spark

There are two ways to invoke JupyterLab with Spark capabilities. The ad hoc method is to tell pyspark on the command line to use Jupyter as its driver frontend. For instance, starting it with Python 3.6 (the Python version needs to be consistent with the one used by your Spark distribution) and 20 executors with 5 cores each might look like this:

PYSPARK_PYTHON=python3.6 \
PYSPARK_DRIVER_PYTHON="jupyter" \
PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8899" \
/usr/bin/pyspark2 --master yarn --deploy-mode client \
  --num-executors 20 --executor-memory 10g --executor-cores 5 \
  --conf spark.dynamicAllocation.enabled=false
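
Since pyspark exports PYTHONSTARTUP pointing at its shell.py, every kernel started by this Jupyter instance should come up with a ready-made SparkSession. A minimal sketch of a first notebook cell, assuming the YARN resources requested above were actually granted:

# `spark` (SparkSession) and `sc` (SparkContext) are created by pyspark's shell.py at kernel startup
print(spark.version)                        # should report your Spark 2.x version
df = spark.range(1000)                      # tiny DataFrame, just to exercise the cluster
print(df.selectExpr("sum(id)").collect())   # executed as a YARN job on the requested executors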

In order to create notebooks with a dedicated PySpark kernel directly from the JupyterLab launcher, create a file ~/.local/share/jupyter/kernels/pyspark/kernel.json holding:

{
 "display_name": "PySpark",
 "language": "python",
 "argv": [
  "/usr/local/anaconda-py3/bin/python",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "HADOOP_CONF_DIR": "/etc/hadoop/conf",
  "HADOOP_USER_NAME": "username",
  "HADOOP_CONF_LIB_NATIVE_DIR": "/var/lib/cloudera/parcels/CDH/lib/hadoop/lib/native",
  "YARN_CONF_DIR": "/etc/hadoop/conf",
  "SPARK_YARN_QUEUE": "dev",
  "PYTHONPATH": "/usr/local/anaconda-py3/bin/python:/usr/local/anaconda-py3/lib/python3.6/site-packages:/var/lib/cloudera/parcels/SPARK2/lib/spark2/python:/var/lib/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.4-src.zip",
  "SPARK_HOME": "/var/lib/cloudera/parcels/SPARK2/lib/spark2/",
  "PYTHONSTARTUP": "/var/lib/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--queue dev --conf spark.dynamicAllocation.enabled=false --conf spark.scheduler.minRegisteredResourcesRatio=1 --conf spark.sql.autoBroadcastJoinThreshold=-1 --master yarn --num-executors 5 --driver-memory 2g --executor-memory 20g --executor-cores 3 pyspark-shell"
 }
}
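
After saving the file, a "PySpark" tile should appear in the JupyterLab launcher (restart JupyterLab if it does not). As a quick sanity check, here is a sketch of a first cell in such a notebook, assuming the kernel picked up the environment above; the startup script builds the session from PYSPARK_SUBMIT_ARGS, so the master, queue and executor settings are already in effect:

# Verify the session created by the PYTHONSTARTUP script (pyspark/shell.py)
print(sc.master)                                   # expected: 'yarn'
print(sc.getConf().get("spark.executor.memory"))   # expected: '20g', from PYSPARK_SUBMIT_ARGS
print(spark.sql("SELECT 1 AS ok").collect())       # small job submitted to the dev queue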

Source: https://florianwilhelm.info/2018/11/working_efficiently_with_jupyter_lab/
