Based on https://medium.com/sicara/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f
See also: https://www.datacamp.com/tutorial/installation-of-pyspark
Prerequisites:
- Java JDK (e.g. installed via VS Code's Java tooling) - e.g. Oracle_JDK-22
- Spark (prebuilt, from http://spark.apache.org/downloads.html) - e.g. spark-3.5.1-bin-hadoop3
- PySpark (via pip: `pip install pyspark`)
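A quick way to confirm the pip install worked is to import PySpark and check its version. This is a minimal sketch; the minimum version `(3, 5)` below is an assumption chosen to match the spark-3.5.1 build above, so adjust it to your Spark download:

```python
def version_at_least(version: str, minimum: tuple) -> bool:
    """Return True if a dotted version string meets a (major, minor) minimum."""
    parts = tuple(int(p) for p in version.split(".")[:2])
    return parts >= minimum

try:
    import pyspark
    # PySpark's major.minor should match the prebuilt Spark distribution
    status = "OK" if version_at_least(pyspark.__version__, (3, 5)) else "too old"
    print("pyspark", pyspark.__version__, status)
except ImportError:
    print("pyspark is not installed; run: pip install pyspark")
```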
Add to the Windows PATH:
- C:\Users\matthew\repos\Oracle_JDK-22\bin
- C:\Users\matthew\repos\spark-3.5.1-bin-hadoop3\bin
Set the following environment variables:
- HADOOP_HOME = C:\Users\matthew\repos\spark-3.5.1-bin-hadoop3
- SPARK_HOME = C:\Users\matthew\repos\spark-3.5.1-bin-hadoop3
- JAVA_HOME = C:\Users\matthew\repos\Oracle_JDK-22
- PYSPARK_DRIVER_PYTHON = C:\Users\matthew\Anaconda3\envs\main\python.exe
- PYSPARK_PYTHON = C:\Users\matthew\Anaconda3\envs\main\python.exe

Note: on Windows, some Spark operations also expect winutils.exe in %HADOOP_HOME%\bin; if you see errors mentioning winutils, place it there.
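Before launching Spark it can help to confirm the variables above are set and point at real paths. A minimal sketch, assuming the variable names listed above (the helper `missing_vars` is hypothetical, not a PySpark API):

```python
import os

# Assumption: these are the variables listed in the setup notes above
REQUIRED_VARS = ["JAVA_HOME", "SPARK_HOME", "HADOOP_HOME",
                 "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"]

def missing_vars(env: dict) -> list:
    """Return required variable names that are unset or point at a missing path."""
    return [name for name in REQUIRED_VARS
            if name not in env or not os.path.exists(env[name])]

if __name__ == "__main__":
    problems = missing_vars(os.environ)
    print("All set" if not problems else f"Check these variables: {problems}")
```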
Verify the setup with this Pi-estimation example:

```python
import random

import pyspark

# Start a local SparkContext for the Monte Carlo Pi job
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # Sample a random point in the unit square; True if it lies inside the unit circle
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# The fraction of points inside the circle approximates pi/4
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
```
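The sampling logic can be sanity-checked in plain Python before submitting it to Spark. This sketch mirrors the predicate from the Spark job above but runs without Spark (the fixed seed and sample count are assumptions for reproducibility):

```python
import random

def inside(p):
    # Same predicate as the Spark job: is a random point in the unit circle?
    x, y = random.random(), random.random()
    return x * x + y * y < 1

def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of pi from num_samples random points."""
    random.seed(seed)
    count = sum(1 for i in range(num_samples) if inside(i))
    return 4 * count / num_samples

print(estimate_pi(100_000))
```

With 100,000 samples the result should land close to 3.14; the full Spark job simply distributes the same computation over many more samples.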
SEO terms: local, pyspark, spark, sparkly, sparklyr, java, jdk, path