
Apache Spark installation + ipython notebook integration guide for Mac OS X

Tested with Apache Spark 1.3.1, Python 2.7.9 and Java 1.8.0_45, plus a workaround for Spark 1.4.x from @enahwe.

Install Java Development Kit

Download and install the JDK from oracle.com.

Add the following to your shell profile, e.g. .bash_profile:

# For Apache Spark
if which java > /dev/null; then export JAVA_HOME=$(/usr/libexec/java_home); fi

Install Apache Spark

You can use the Mac OS package manager Homebrew (http://brew.sh/):

brew update
brew install scala
brew install apache-spark

Set up env variables

Add the following to your shell profile, e.g. .bash_profile:

# For ipython notebook and pyspark integration
if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/1.3.1_1/libexec/"
  export PYSPARK_SUBMIT_ARGS="--master local[2]"
fi

You can check the SPARK_HOME path using the following brew command:

$ brew info apache-spark
apache-spark: stable 1.3.1, HEAD
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.3.1_1 (361 files, 278M) *
  Built from source
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb
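
Optionally, as a quick sanity check (a minimal sketch, not part of the original setup steps), you can confirm from Python that these variables are visible in your environment:

import os

# Print the environment variables this guide relies on; each should be set
# if the .bash_profile snippet above was sourced in the current shell.
for var in ("JAVA_HOME", "SPARK_HOME", "PYSPARK_SUBMIT_ARGS"):
    print("%s=%s" % (var, os.environ.get(var, "<not set>")))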

Create ipython profile

Run

ipython profile create pyspark

Create a startup file

$ vim ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
UPD: for Spark 1.4.x

You can try the universal 00-pyspark-setup.py script from @enahwe for Spark 1.3.x and 1.4.x:

# Configure the necessary Spark environment
import os
import sys

# Spark home
spark_home = os.environ.get("SPARK_HOME")

# If Spark V1.4.x is detected, then add ' pyspark-shell' to
# the end of the 'PYSPARK_SUBMIT_ARGS' environment variable
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.4" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add the spark python sub-directory to the path
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, "python/pyspark/shell.py"))

Run ipython

ipython notebook --profile=pyspark

The sc variable should now be available:

In [1]: sc
Out[1]: <pyspark.context.SparkContext at 0x10a982b10>
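
To confirm the context actually works end to end, you can run a trivial job (a minimal sketch; any small RDD computation will do, and summing 0..99 should give 4950):

In [2]: sc.parallelize(range(100)).sum()
Out[2]: 4950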


@klinkin

klinkin commented May 20, 2015

The trailing slash is not needed:

export SPARK_HOME="/usr/local/Cellar/apache-spark/1.3.1_1/libexec/"

@enahwe

enahwe commented Jul 1, 2015

For Spark 1.4.x we have to add 'pyspark-shell' at the end of the environment variable "PYSPARK_SUBMIT_ARGS". So I adapted the script '00-pyspark-setup.py' for Spark 1.3.x and Spark 1.4.x as follows, detecting the version of Spark from the RELEASE file.

Here is the code:

# Configure the necessary Spark environment
import os
import sys

# Spark home
spark_home = os.environ.get("SPARK_HOME")

# If Spark V1.4.x is detected, then add ' pyspark-shell' to
# the end of the 'PYSPARK_SUBMIT_ARGS' environment variable
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.4" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add the spark python sub-directory to the path
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, "python/pyspark/shell.py"))

@ololobus
Author

ololobus commented Dec 4, 2015

Thank you @enahwe! I've added it to the text, I hope you've tested it :)
Unfortunately, I missed your comment back in July, and now Spark 1.5.x is already released...

@sri-srinivas

Hi, thanks for your suggestions. I did exactly what you suggested, but when I open a notebook and type sc, I do not get the expected output; I just get ' '. Something is not configured properly. I have a Mac (El Capitan), Spark 1.5.2, and I'm running Jupyter. Any help?

@ololobus
Author

@sri-srinivas I'm not a Spark user currently, so I can't test it, but you can try the following 00-pyspark-setup.py startup file with Spark 1.5.*:

# Configure the necessary Spark environment
import os
import sys

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark
exec(open(os.path.join(spark_home, "python/pyspark/shell.py")).read())

@arendale

@sri-srinivas I got same problem. Did you solve it?

@Nomii5007

Hello. How can I set this PYSPARK_SUBMIT_ARGS environment variable on Windows, and what value should I give it?

@hlin117

hlin117 commented Aug 9, 2016

Nowadays, brew install apache-spark installs spark 2.0.0. (Possibly a higher version in the future.)

The bundled py4j library is also updated; it's not 0.8.2.1 anymore.

@sanjitroy

@sri-srinivas , @arendale
Check this line
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))
The py4j version in this line should match the one you actually have installed.
Change it if it doesn't and run again.

@ololobus
Author

@enahwe @sri-srinivas @arendale @Nomii5007 @hlin117 @sanjitroy
I've updated the text to fit the latest versions of Spark, Java and Python 2 (tested with Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112).

@thismlguy

worked like a charm!!

@acmiyaguchi

acmiyaguchi commented Jul 30, 2017

It might be useful to avoid hardcoding the py4j version by using the following command:

export PYTHONPATH="${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-*-src.zip:${PYTHONPATH}"
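
Note that shells do not expand globs inside double quotes, so depending on your setup the wildcard above may end up in PYTHONPATH literally. A sketch along the same lines (assuming the same py4j layout under SPARK_HOME) is to resolve the zip from Python inside 00-pyspark-setup.py:

import glob
import os
import sys

# Locate whichever py4j zip ships with the current Spark install,
# instead of hardcoding its version number.
spark_home = os.environ["SPARK_HOME"]
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if py4j_zips:
    sys.path.insert(0, py4j_zips[0])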

@vanessaescalante

Where should I write that? I have the same problems, but on an Ubuntu machine working with an AWS cluster.

@airwindow

In case you are using Python 3.x, you may run into the following error:
NameError: name 'execfile' is not defined
Simply replace
execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))
with
exec(open(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py')).read())

@mhasse7441

Hello all. I am experiencing some issues executing a simple Python program:

from pyspark import SparkConf, SparkContext

sc = SparkContext(master="local", appName="Spark Demo")
print(sc.textFile("/Users/mhasse/Desktop/deckofcards.txt").first())

with the following errors:

/Users/mhasse/PycharmProjects/gettingstarted/venv/bin/python /Users/mhasse/PycharmProjects/gettingstarted/sparkdemo.py
Traceback (most recent call last):
File "/Users/mhasse/PycharmProjects/gettingstarted/sparkdemo.py", line 3, in
sc = SparkContext(master="local", appName="Spark Demo")
File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/context.py", line 112, in init
SparkContext._ensure_initialized(self, gateway=gateway)
File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/java_gateway.py", line 48, in launch_gateway
SPARK_HOME = os.environ["SPARK_HOME"]
File "/Users/mhasse/PycharmProjects/gettingstarted/venv/bin/../lib/python2.7/UserDict.py", line 40, in getitem
raise KeyError(key)
KeyError: 'SPARK_HOME'

when I nano .bash_profile my spark is set as below:

export SPARK_HOME=/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:SPARK_HOME/python/lib/py4j-VERSION-src.zip:$PYTHONPATH

# Setting PATH for Python 2.7
# The original version is saved in .bash_profile.pysave

PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}"
export PATH

I am new to Mac OS and am struggling to get this to work. Any comments or feedback would be greatly appreciated.
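
One likely cause (an assumption, since it depends on how PyCharm was launched) is that an IDE started from the GUI does not source .bash_profile, so SPARK_HOME never reaches the interpreter's environment. A minimal sketch of a workaround is to set it from the script itself, before the SparkContext is created (the path below is the one from the question):

import os

# PyCharm launched from the GUI may not inherit variables exported in .bash_profile,
# so set SPARK_HOME explicitly before pyspark starts its gateway (example path).
os.environ.setdefault("SPARK_HOME", "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6")

from pyspark import SparkContext

sc = SparkContext(master="local", appName="Spark Demo")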


ghost commented Jun 9, 2018

Works like a charm...

@suhas22

suhas22 commented Jul 17, 2018

Works amazingly well, thanks a ton for this!

@soheilesm

soheilesm commented May 21, 2021

Hi there,

I followed the installation guide, but for my own job I face a problem similar to the one described here.

Any idea how I can fix it? I have followed the solutions that commonly say the problem could be wrong path specifications, but my paths seem to be fine, and the individual components such as python, pyspark and py4j work fine standalone.
