JupyterLab and Spark

There are two ways to invoke JupyterLab with Spark capabilities. The ad hoc method is to tell pyspark on the command line to use Jupyter as its driver frontend. For instance, starting it with Python 3.6 (the Python version needs to be consistent with the one used by your Spark distribution) and 20 executors with 5 cores each might look like this:

PYSPARK_PYTHON=python3.6 \
PYSPARK_DRIVER_PYTHON="jupyter" \
PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8899" \
/usr/bin/pyspark2 --master yarn --deploy-mode client \
  --num-executors 20 --executor-memory 10g --executor-cores 5 \
  --conf spark.dynamicAllocation.enabled=false
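
Since pyspark exports PYTHONSTARTUP pointing at its shell.py, every kernel started by this Jupyter instance should come up with a ready-made SparkSession. A minimal sketch of a first notebook cell, assuming the YARN resources requested above were actually granted:

# `spark` (SparkSession) and `sc` (SparkContext) are created by pyspark's shell.py at kernel startup
print(spark.version)                        # should report your Spark 2.x version
df = spark.range(1000)                      # tiny DataFrame, just to exercise the cluster
print(df.selectExpr("sum(id)").collect())   # executed as a YARN job on the requested executors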

In order to create notebooks with a dedicated PySpark kernel directly from the JupyterLab launcher, create a file ~/.local/share/jupyter/kernels/pyspark/kernel.json holding:

{
 "display_name": "PySpark",
 "language": "python",
 "argv": [
  "/usr/local/anaconda-py3/bin/python",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "HADOOP_CONF_DIR": "/etc/hadoop/conf",
  "HADOOP_USER_NAME": "username",
  "HADOOP_CONF_LIB_NATIVE_DIR": "/var/lib/cloudera/parcels/CDH/lib/hadoop/lib/native",
  "YARN_CONF_DIR": "/etc/hadoop/conf",
  "SPARK_YARN_QUEUE": "dev",
  "PYTHONPATH": "/usr/local/anaconda-py3/bin/python:/usr/local/anaconda-py3/lib/python3.6/site-packages:/var/lib/cloudera/parcels/SPARK2/lib/spark2/python:/var/lib/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.4-src.zip",
  "SPARK_HOME": "/var/lib/cloudera/parcels/SPARK2/lib/spark2/",
  "PYTHONSTARTUP": "/var/lib/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--queue dev --conf spark.dynamicAllocation.enabled=false --conf spark.scheduler.minRegisteredResourcesRatio=1 --conf spark.sql.autoBroadcastJoinThreshold=-1 --master yarn --num-executors 5 --driver-memory 2g --executor-memory 20g --executor-cores 3 pyspark-shell"
 }
}
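
After saving the file, a "PySpark" tile should appear in the JupyterLab launcher (restart JupyterLab if it does not). As a quick sanity check, here is a sketch of a first cell in such a notebook, assuming the kernel picked up the environment above; the startup script builds the session from PYSPARK_SUBMIT_ARGS, so the master, queue and executor settings are already in effect:

# Verify the session created by the PYTHONSTARTUP script (pyspark/shell.py)
print(sc.master)                                   # expected: 'yarn'
print(sc.getConf().get("spark.executor.memory"))   # expected: '20g', from PYSPARK_SUBMIT_ARGS
print(spark.sql("SELECT 1 AS ok").collect())       # small job submitted to the dev queue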

Source: https://florianwilhelm.info/2018/11/working_efficiently_with_jupyter_lab/
