Steps:
- wget https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83/raw/865fb35e00f21330b5b82aeb7c31941b6c18f649/spark_on_slurm.sh
- wget https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83/raw/865fb35e00f21330b5b82aeb7c31941b6c18f649/worker_spark_on_slurm.sh
- wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz && tar xf spark-3.3.1-bin-hadoop3.tgz
- sbatch spark_on_slurm.sh
- build a venv and install pyspark matching the downloaded Spark version (e.g. pip install pyspark==3.3.1), then run something like this:
(you can download https://huggingface.co/datasets/laion/laion-coco/resolve/main/part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet to use as an example parquet file)
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, rand
spark = (
    SparkSession.builder
    .config("spark.submit.deployMode", "client")
    .config("spark.executor.memory", "16GB")
    .config("spark.executor.memoryOverhead", "8GB")
    .config("spark.task.maxFailures", "100")
    .master("spark://master_node:7077")
    .appName("spark-stats")
    .getOrCreate()
)
df = spark.read.parquet("part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet")
df.count()
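The lit and rand imports above are not used by the count itself; as a quick sanity check that work is really being distributed to the executors, you can run a small aggregation over the dataframe. A minimal sketch (the bucket column name is arbitrary):

from pyspark.sql.functions import rand

# bucket rows randomly and count per bucket; this forces a full scan of the parquet on the workers
df.withColumn("bucket", (rand() * 10).cast("int")).groupBy("bucket").count().show()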
Replace master_node with the hostname of the first node allocated to your Slurm job (the node where the Spark master runs).
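If your driver script runs inside the same Slurm allocation, you can also resolve that hostname programmatically instead of hard-coding it. A minimal sketch, assuming scontrol is on the PATH and SLURM_JOB_NODELIST is set in the environment:

import os
import subprocess

# expand the compact Slurm nodelist (e.g. node[01-04]) into individual hostnames and take the first one
first_node = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    text=True,
).splitlines()[0]
master_url = f"spark://{first_node}:7077"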
You can reach the Spark web UIs by chaining ssh tunnels, along these lines:
ssh -L 4040:localhost:4040 -L 8080:localhost:8080 login_node
then, from the login node:
ssh -L localhost:4040:master_node:4040 -L localhost:8080:master_node:8080 master_node
and open http://localhost:4040 (application UI) and http://localhost:8080 (standalone master UI) in your browser
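The master UI on port 8080 lists the registered workers; with the tunnels above in place you can also count them from a script. A sketch, assuming the standalone master's /json/ endpoint is reachable through the tunnel and returns the usual layout with a workers list:

import json
import urllib.request

# the standalone master serves cluster state (workers, cores, memory) as JSON
with urllib.request.urlopen("http://localhost:8080/json/") as response:
    state = json.load(response)
print(f"{len(state['workers'])} workers registered")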
Does this really start more than 1 worker, since the worker command isn’t launched by an srun?