For those like me who wish to continue learning about ML using the scientific Python stack, check out this video workshop by Jake VanderPlas.
Here is the code: https://github.com/jakevdp/sklearn_pycon2015/
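As a quick taste of what the workshop covers, here is a minimal scikit-learn example; the dataset and classifier choice are my own illustration, not taken from the workshop itself:

```python
# Minimal scikit-learn workflow: load a dataset, fit a model, predict.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
model = KNeighborsClassifier(n_neighbors=3)
model.fit(iris.data, iris.target)

# Predict the classes of the first three samples
# (all setosa, i.e. class 0, in this dataset)
print(model.predict(iris.data[:3]))
```

The same fit/predict pattern applies across the library's estimators, which is the main idea the workshop notebooks build on.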
Here are the steps I took to set up a correctly working PySpark with Anaconda (and its roughly 200 bundled libraries) on the course's Vagrant VM:
- Install Anaconda or Miniconda (some familiarity with the Linux shell is assumed). The Vagrant Spark VM is 32-bit Ubuntu with Python 2.7, since PySpark for Python 3 has not been released yet. Get the download URL from http://continuum.io/downloads#all. If you only want selected packages, get Miniconda instead.
vagrant ssh
curl -L https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86.sh | bash
Wait until the download and install complete. Anaconda is installed to /home/vagrant/anaconda/.
- Now tweak the notebook upstart job config and modify the PATH env var so it launches the Anaconda distribution from your home directory:
sudo nano /etc/init/notebook.conf
Change env PATH to:
env PATH=/home/vagrant/anaconda/bin/:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/local/bin/spark-1.3.1-bin-hadoop2.6/bin
Save the config and exit, then reload the upstart configuration:
sudo initctl reload notebook
Optionally, if you wish to change the IPython notebooks directory:
echo "c.NotebookApp.notebook_dir = u'/vagrant'" >> ~/.ipython/profile_pyspark/ipython_notebook_config.py
Then restart the job:
sudo restart notebook
Check that your installed libraries are on the import path: https://github.com/jakevdp/sklearn_pycon2015/blob/master/notebooks/01-Preliminaries.ipynb
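To double-check from a notebook cell that the kernel really is the Anaconda python and that the scientific stack is importable, something like this sketch works (the package list is just an example; adjust it to what you need):

```python
# Sanity check: confirm which interpreter the notebook kernel runs,
# and that the core scientific libraries import cleanly.
import importlib
import sys

def check_libraries(names):
    """Map each package name to its version string, or None if not importable."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions

# On the VM this should point into /home/vagrant/anaconda/
print(sys.executable)
print(check_libraries(["numpy", "scipy", "sklearn", "matplotlib"]))
```

If any entry comes back as None, the notebook job is probably still picking up the system python rather than the Anaconda one, so recheck the PATH in the upstart config.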