Install Spark 2.2 and Hadoop 2.7.4 with Jupyter and Zeppelin on macOS Sierra
  • Install Homebrew if you don't have it yet
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

The script will explain what changes it will make and prompt you before the installation begins. Once Homebrew is installed, put the Homebrew directory at the front of your PATH environment variable by adding the following line at the bottom of your ~/.bash_profile file:

export PATH=/usr/local/bin:/usr/local/sbin:$PATH
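
To confirm that Homebrew is installed and first on your PATH, a quick sanity check (standard Homebrew commands, nothing specific to this setup):

# Reload the profile and check that brew resolves to /usr/local/bin
source ~/.bash_profile
which brew       # should print /usr/local/bin/brew
brew --version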
  • Install Python 3:
brew install python3
sudo pip3 install jupyter
# Install Jupyter Nbextensions Configurator
sudo pip3 install jupyter_nbextensions_configurator
# Enabling the extension
jupyter nbextensions_configurator enable --user
# The list of enabled Jupyter extensions will be in ~/.jupyter/nbconfig/notebook.json.
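
To verify that Jupyter and the extensions configurator are set up, something like the following should work (both are standard Jupyter subcommands):

# Print the installed versions and the enabled notebook extensions
jupyter --version
jupyter nbextension list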
  • Install Java 8
brew update
brew tap caskroom/versions
brew cask install java8
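
You can confirm that a 1.8 JDK is now visible to macOS (the same java_home lookup is reused for JAVA_HOME below):

# Print the home directory and version of the installed 1.8 JDK
/usr/libexec/java_home -v 1.8
java -version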

  • Install Scala and sbt
brew install scala sbt wget

Install Hadoop 2.7.4 binaries with YARN

wget -P ~/Downloads http://apache.mesi.com.ar/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz
cd && tar -xvf ~/Downloads/hadoop-2.7.4.tar.gz
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
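
Exporting JAVA_HOME in your shell is not always enough for the Hadoop daemons, which read etc/hadoop/hadoop-env.sh. A common safeguard, assuming Hadoop was extracted into your home directory, is to pin it there as well and then verify the installation:

# Pin JAVA_HOME for the Hadoop daemons (the appended line overrides the template default)
echo "export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)" >> $HOME/hadoop-2.7.4/etc/hadoop/hadoop-env.sh
# Verify the unpacked distribution
$HOME/hadoop-2.7.4/bin/hadoop version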

Edit the core-site.xml file located at $HADOOP_HOME/etc/hadoop/:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
  • Create folders for the namenode and datanode:
mkdir -p $HOME/hadoop2_data/hdfs/namenode
mkdir -p $HOME/hadoop2_data/hdfs/datanode

Edit the hdfs-site.xml file located at $HADOOP_HOME/etc/hadoop/, replacing /Users/hadoopuser with your own home directory:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/Users/hadoopuser/hadoop2_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/Users/hadoopuser/hadoop2_data/hdfs/datanode</value>
  </property>
</configuration>
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

Edit the mapred-site.xml file located at $HADOOP_HOME/etc/hadoop/:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Edit the yarn-site.xml file located at $HADOOP_HOME/etc/hadoop/:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
  • Run the command below to format the namenode directory:
$HADOOP_HOME/bin/hdfs namenode -format
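
If the format succeeded, the namenode directory configured in hdfs-site.xml should now contain metadata; a quick check (adjust the path to your own home directory):

# A successful format creates a VERSION file under the namenode dir
cat $HOME/hadoop2_data/hdfs/namenode/current/VERSION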
  • Start HDFS (NameNode and DataNode)
$HADOOP_HOME/sbin/start-dfs.sh
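
start-dfs.sh logs into localhost over SSH, so on a fresh macOS install it can fail with "Permission denied" or "connection refused". Enabling Remote Login (System Preferences > Sharing) plus passphraseless SSH, a standard step for pseudo-distributed Hadoop, usually fixes it:

# Allow SSH to localhost without a password prompt
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost   # should log in without asking for a password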

That is all. The NameNode web UI is now available at http://localhost:50070
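
Since mapred-site.xml and yarn-site.xml were configured above, you can also bring up YARN and smoke-test HDFS (the -mkdir/-ls calls are just an illustrative check):

# Start the ResourceManager and NodeManager; the YARN UI is then at http://localhost:8088
$HADOOP_HOME/sbin/start-yarn.sh
# Create your HDFS home directory and list the filesystem root
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /user/$(whoami)
$HADOOP_HOME/bin/hdfs dfs -ls /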

Install Spark 2.2.0

wget -P ~/Downloads https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
cd && tar -xvf ~/Downloads/spark-2.2.0-bin-hadoop2.7.tgz

Add the following lines to your ~/.bash_profile file:

export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
export HADOOP_HOME=$HOME/hadoop-2.7.4
export ZEPPELIN_HOME=$HOME/zeppelin-0.7.3-bin-netinst
export SPARK_HOME=$HOME/spark-2.2.0-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter" 
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" 
export SBT_HOME=/usr/local/Cellar/sbt/1.0.2
PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SBT_HOME/bin:$ZEPPELIN_HOME/bin:$PATH
export PATH=/Library/Frameworks/Python.framework/Versions/3.6/bin:${PATH}
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark --master local[2]'

Note: The PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables make the pyspark command launch inside a Jupyter Notebook. The --master parameter sets the master node address; here we launch Spark locally on 2 cores for testing.

source ~/.bash_profile

Now you can test Spark by running pyspark
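
A quick way to check the installation without the notebook is the bundled SparkPi example that ships with every Spark distribution (the trailing 10 is just the number of partitions):

# Approximate Pi on the local master; look for "Pi is roughly 3.14..." in the output
$SPARK_HOME/bin/run-example SparkPi 10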

Install Zeppelin

# Download Zeppelin with the Spark interpreter
wget -P ~/Downloads http://apache.lauf-forum.at/zeppelin/zeppelin-0.7.3/zeppelin-0.7.3-bin-netinst.tgz
cd && tar xzvf ~/Downloads/zeppelin-0.7.3-bin-netinst.tgz
cd zeppelin-0.7.3-bin-netinst/conf/
cp zeppelin-env.sh.template zeppelin-env.sh

Edit the zeppelin-env.sh file by adding this line at the very top of the file:

export SPARK_HOME=$HOME/spark-2.2.0-bin-hadoop2.7
  • Start Zeppelin
zeppelin-daemon.sh start

Zeppelin should now be running at http://localhost:8080. To stop it, run zeppelin-daemon.sh stop
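
You can also check the daemon state with the same script (status is a standard zeppelin-daemon.sh subcommand):

# Report whether the Zeppelin daemon is running
zeppelin-daemon.sh status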


@rizqidarsono27 commented:

I'm still new and learning Spark.

I followed every step of your guide for installing Spark on macOS, but when I try to run spark-shell I get

192:~ aryadarsono$ spark-shell /usr/local/Cellar/apache-spark/2.2.0/libexec/bin/spark-shell: line 57: /Users/aryadarsono/spark-2.2.0-bin-hadoop2.7/bin/spark-submit: No such file or directory

and when I run pyspark I get

/usr/local/Cellar/apache-spark/2.2.0/libexec/bin/pyspark: line 24: /Users/aryadarsono/spark-2.2.0-bin-hadoop2.7/bin/load-spark-env.sh: No such file or directory /usr/local/Cellar/apache-spark/2.2.0/libexec/bin/pyspark: line 77: /Users/aryadarsono/spark-2.2.0-bin-hadoop2.7/bin/spark-submit: No such file or directory /usr/local/Cellar/apache-spark/2.2.0/libexec/bin/pyspark: line 77: exec: /Users/aryadarsono/spark-2.2.0-bin-hadoop2.7/bin/spark-submit: cannot execute: No such file or directory

Is there anything I have to revise?

Thank you
