For those like me who wish to keep learning about ML with the scientific Python stack, check out this video workshop by Jake VanderPlas.
Here is the accompanying code: https://github.com/jakevdp/sklearn_pycon2015/
So, here are the steps I took to get PySpark working correctly with Anaconda (and its 200 libraries) on the course's Vagrant VM.
- Install Anaconda or Miniconda (you should be comfortable with the Linux shell). The Vagrant Spark VM runs 32-bit Ubuntu with Python 2.7, since PySpark for Python 3 has not been released yet. Get the download URL from http://continuum.io/downloads#all — if you only want selected packages, get Miniconda instead.
vagrant ssh
curl -L https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86.sh | bash
Wait until the download and installation complete. Anaconda is installed to /home/vagrant/anaconda/.
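Before going further, it's worth sanity-checking that the install landed where expected. A minimal sketch, assuming the default install location mentioned above:

```shell
# Sanity-check the install (the path below assumes the default
# install location /home/vagrant/anaconda/ noted above)
PYBIN="$HOME/anaconda/bin/python"
if [ -x "$PYBIN" ]; then
    "$PYBIN" --version          # should report Anaconda's Python 2.7.x
else
    echo "Anaconda python not found at $PYBIN" >&2
fi
```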
- Now tweak the notebook's Upstart job config, modifying the PATH env var so it launches the Anaconda distribution from your home directory:
sudo nano /etc/init/notebook.conf
Change the env PATH line to:
env PATH=/home/vagrant/anaconda/bin/:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/local/bin/spark-1.3.1-bin-hadoop2.6/bin
Save and exit, then reload the Upstart config:
sudo initctl reload notebook
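The reason this works is that putting /home/vagrant/anaconda/bin/ first in PATH makes the shell resolve python to Anaconda's interpreter before the system one. A quick illustrative check of that first-match-wins behavior, using a throwaway /tmp/fakebin directory (hypothetical, just for demonstration — it does not touch the real Upstart config):

```shell
# Demonstrate that the first PATH entry wins command resolution
mkdir -p /tmp/fakebin
printf '#!/bin/sh\necho anaconda-python\n' > /tmp/fakebin/python
chmod +x /tmp/fakebin/python
# env re-resolves "python" against the modified PATH
env PATH=/tmp/fakebin:$PATH python    # prints: anaconda-python
```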
Optionally, if you wish to change the IPython notebooks directory:
echo "c.NotebookApp.notebook_dir = u'/vagrant'" >> ~/.ipython/profile_pyspark/ipython_notebook_config.py
Then restart the job:
sudo restart notebook
Check that your installed libraries are on the import path: https://github.com/jakevdp/sklearn_pycon2015/blob/master/notebooks/01-Preliminaries.ipynb
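The linked Preliminaries notebook checks this with Python imports; the same check can be scripted from the shell. A sketch, assuming the package names the notebook covers (the loop just reports anything that fails to import, rather than aborting):

```shell
# Report which of the course's packages import cleanly; versions
# will vary with your Anaconda release (package list assumed from
# the Preliminaries notebook)
PY=$(command -v python || command -v python3)
for pkg in numpy scipy matplotlib sklearn IPython; do
    "$PY" -c "import $pkg; print('$pkg ' + $pkg.__version__)" 2>/dev/null \
        || echo "$pkg missing"
done
```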