The goal is to move computation to a more powerful machine. We will set up Anaconda 4.0.0 and XGBoost 0.4 (which is tricky to install).
- Amazon AWS Educate gives $100 to MIPT students.
- GitHub Students Pack additionally gives $15.
- Register on https://aws.amazon.com/ and Sign In to the AWS Console.
- Get your $$$ from Amazon AWS Educate and GitHub Students Pack and activate them in your account settings (top right corner in AWS Console), "Credits" tab.
- Go to the EC2 tab.
- Click the Launch Instance button.
Choose Ubuntu Server 14.04 LTS (HVM), SSD Volume Type. Click Next.
Now you need to choose an Instance Type. The c3, c4, g2 and r3 types seem good for our tasks. You can compare them on http://www.ec2instances.info/ (don't forget to choose your region; you can see and change it in the top right corner of your AWS Console). We choose c4.4xlarge. As a Spot Instance it costs ~20-25 cents per hour.
Click Next.
There are two types of Amazon Instances: On-Demand and Spot Instances. Read about them here: https://aws.amazon.com/ec2/spot/. In short, Spot Instances are an auction for Amazon EC2 capacity.
If you want to pay less (the ~20-25 cents mentioned above), check Request Spot instances. You will see the current prices for this instance type in your region. Set your price 2-5 cents higher than the maximum of the three prices if you want to get an instance as fast as possible.
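The bidding rule above is simple arithmetic, and a small helper makes it explicit. This is only a sketch: the function name suggest_bid, the default margin of 3 cents and the sample prices are assumptions for illustration, not values taken from the AWS Console.

```shell
# Hypothetical helper: suggest a spot bid in dollars per hour.
# $1..$3 are the three current prices shown in the console,
# $4 is the margin to add on top of the highest one (default 0.03).
suggest_bid() {
    awk -v a="$1" -v b="$2" -v c="$3" -v m="${4:-0.03}" 'BEGIN {
        max = a + 0                      # force numeric comparison
        if (b + 0 > max) max = b + 0
        if (c + 0 > max) max = c + 0
        printf "%.2f\n", max + m
    }'
}

# Illustrative prices only; read the real ones from your console.
suggest_bid 0.21 0.23 0.22
```

A margin between 0.02 and 0.05 matches the 2-5 cent range recommended above.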
Click Next.
- Amazon provides 30 GB of SSD storage for free, so replace the default 8 GiB with 30 GiB.
- Uncheck Delete on Termination. This keeps your storage from being deleted when the instance terminates.
Click Next.
No changes here. Press Next.
- Check Create a new security group
- Keep the SSH rule, and add two more rules: (HTTPS/TCP/443/Anywhere) and (Custom TCP Rule/TCP/8888/Anywhere).
Click Review and Launch and then Launch.
Create a new key pair following Amazon's instructions and download it. You need this file to connect to your instance through SSH.
Click Request Spot Instance.
Wait until the Instance State is Running, then click the Connect button.
Following the instructions shown there, restrict the key file to owner read-only permissions:
chmod 400 aws_c4_xlarge.pem
You can follow these instructions and connect to your instance over SSH with a command like this:
ssh -i "aws_c4_xlarge.pem" ubuntu@ec2-52-11-148-133.us-west-2.compute.amazonaws.com
but it is handier to connect via
ssh aws
Here is how to set that up:
- Create an SSH config file (~/.ssh/config) or edit the existing one, adding these lines:
Host aws
	HostName ec2-52-11-148-133.us-west-2.compute.amazonaws.com
	User ubuntu
	IdentityFile ~/.ssh/aws_c4_xlarge.pem
- Indent the option lines with a TAB.
- You can choose your own alias instead of 'aws'.
- Your HostName will be different; check the instructions shown when you click the Connect button in your AWS Console.
Reopen your terminal and connect to the instance via ssh aws.
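The SSH config entry described above can also be appended from the command line. The following is a minimal sketch, assuming the same user (ubuntu) as in the tutorial; the function name add_ssh_host and its optional fourth argument (an alternative config path, handy for trying it out safely) are made up for illustration.

```shell
# Hypothetical helper: append a Host entry to an SSH config file.
# $1 alias, $2 HostName, $3 path to the .pem key,
# $4 config file to edit (defaults to ~/.ssh/config).
add_ssh_host() {
    alias_name="$1"
    host_name="$2"
    key_file="$3"
    config_file="${4:-$HOME/.ssh/config}"

    mkdir -p "$(dirname "$config_file")"
    cat >> "$config_file" <<EOF
Host $alias_name
    HostName $host_name
    User ubuntu
    IdentityFile $key_file
EOF
    # Keep the config private to the current user.
    chmod 600 "$config_file"
}
```

For the instance in this tutorial the call would be add_ssh_host aws ec2-52-11-148-133.us-west-2.compute.amazonaws.com ~/.ssh/aws_c4_xlarge.pem, with your own HostName substituted.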
In this section, we will install Anaconda 4.0.0 and XGBoost 0.4 and set up a Jupyter server.
To install Anaconda, execute the following lines:
wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda2-4.0.0-Linux-x86_64.sh
chmod +x Anaconda2-4.0.0-Linux-x86_64.sh
./Anaconda2-4.0.0-Linux-x86_64.sh
Install Anaconda by pressing Enter and typing 'yes' at every prompt. Then reconnect to the instance:
logout
ssh aws
Create a virtual environment with Python 2.7:
conda create --name venv anaconda
To activate the virtual environment, type source activate venv; to deactivate it, type source deactivate.
Activate your virtual environment.
Install XGBoost:
sudo apt-get install git make g++ python-setuptools
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make -j4
cd python-package
sudo python setup.py install
The last command can raise the exception ImportError: No module named numpy.distutils.core. This does not matter: the XGBoost library itself has already been built by make. The only thing left is to add the package to PYTHONPATH:
echo "export PYTHONPATH=~/xgboost/python-package" >> ~/.bash_profile
source ~/.bash_profile
That's all! Try to import it.
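One way to try the import without opening a notebook is a small shell check. This is a sketch; check_import is a hypothetical helper name, and the interpreter defaults to whatever python is first on your PATH (it should be the one from the activated environment).

```shell
# Hypothetical helper: report whether a Python module can be imported.
# $1 module name, $2 interpreter to use (defaults to "python").
check_import() {
    module="$1"
    python_bin="${2:-python}"
    if "$python_bin" -c "import $module" 2>/dev/null; then
        echo "OK: $module imports"
    else
        echo "FAILED: $module does not import with $python_bin"
        return 1
    fi
}
```

Run check_import xgboost after sourcing ~/.bash_profile. If it fails, print the PYTHONPATH variable with echo "$PYTHONPATH" and confirm that it points at ~/xgboost/python-package.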
You need to generate a Jupyter config to start a remote server. The simplest way is the following:
cd ~
wget https://raw.githubusercontent.com/persiyanov/ml-mipt/master/amazon-howto/jupyter_notebook_ec2.sh
chmod +x jupyter_notebook_ec2.sh
./jupyter_notebook_ec2.sh
Enter the password you want to use when connecting to Jupyter through the browser, then repeat it. After that, press Enter several times. This is my log:
(venv)ubuntu@ip-172-31-12-235:~$ chmod +x jupyter_notebook_ec2.sh
(venv)ubuntu@ip-172-31-12-235:~$ ./jupyter_notebook_ec2.sh
Writing default config to: /home/ubuntu/.jupyter/jupyter_notebook_config.py
Enter password:
Verify password:
Generating a 1024 bit RSA private key
...........................................................................................++++++
...........++++++
writing new private key to 'mycert.key'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:
State or Province Name (full name) [Some-State]:
Locality Name (eg, city) []:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (e.g. server FQDN or YOUR name) []:
Email Address []:
(venv)ubuntu@ip-172-31-12-235:~$
(venv)ubuntu@ip-172-31-12-235:~$ ls
anaconda2 Anaconda2-4.0.0-Linux-x86_64.sh certs jupyter_notebook_ec2.sh xgboost
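The log above shows a private key and certificate being generated with OpenSSL. The sketch below shows how a self-signed pair could be produced by hand. It is NOT the contents of jupyter_notebook_ec2.sh, which are not reproduced here; the function name make_self_signed_cert, the 2048-bit key size (the log shows 1024), the 365-day validity and the default Common Name are all assumptions.

```shell
# Sketch only: not the author's script. make_self_signed_cert is a
# hypothetical helper; file names follow the log above.
# $1 output directory (default ~/certs), $2 Common Name (default localhost).
make_self_signed_cert() {
    cert_dir="${1:-$HOME/certs}"
    common_name="${2:-localhost}"
    mkdir -p "$cert_dir"
    # -x509 writes a self-signed certificate instead of a signing request,
    # -nodes leaves the private key unencrypted so the server can read it,
    # -subj answers the Distinguished Name questions non-interactively.
    openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
        -keyout "$cert_dir/mycert.key" \
        -out "$cert_dir/mycert.pem" \
        -subj "/CN=$common_name"
    # The private key should be readable by its owner only.
    chmod 600 "$cert_dir/mycert.key"
}
```

Passing -subj is what replaces pressing Enter through the Country Name, Organization Name and other prompts seen in the log.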
To start the Jupyter server, either execute the command
jupyter notebook --certfile=~/certs/mycert.pem --keyfile ~/certs/mycert.key
or download my bash script, which executes this line:
wget https://raw.githubusercontent.com/persiyanov/ml-mipt/master/amazon-howto/start-jupyter.sh
chmod +x start-jupyter.sh
./start-jupyter.sh
The server is now running. Connect to it over HTTPS by typing https://<hostname>:8888 or https://<public_ip>:8888 in your browser. The hostname is similar to the one we used when writing the SSH config. You can always find your HostName and Public IP in the AWS Console by clicking on your instance. In my case: https://ec2-52-38-217-74.us-west-2.compute.amazonaws.com:8888.
Now you can connect to Jupyter and run your notebooks on the EC2 instance! But that's not the end. We want to make working with the instance more comfortable. The next two sections are about that.
We want to be able to turn off our computer or disconnect from the Internet while our models keep training on the EC2 instance. As things stand, we lose our SSH session, and the processes started in it, if either of these happens. To solve this problem we use tmux:
tmux new -s <session_name>
./start-jupyter.sh
We have just started the Jupyter server in a tmux session. Now we can close this SSH connection and all processes will keep running.
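The same idea can be wrapped so that the server is started in a detached session and never started twice. This is a sketch: start_in_tmux and the session name jupyter used below are hypothetical, while has-session and new-session -d -s are standard tmux commands.

```shell
# Hypothetical helper: run a command in a detached, named tmux session.
# $1 session name, $2 command to run (passed to tmux as a single string).
start_in_tmux() {
    session="$1"
    cmd="$2"
    if ! command -v tmux >/dev/null 2>&1; then
        echo "tmux is not installed" >&2
        return 2
    fi
    if tmux has-session -t "$session" 2>/dev/null; then
        echo "session $session already exists"
    else
        # -d keeps the session detached, so it survives closing SSH.
        tmux new-session -d -s "$session" "$cmd"
        echo "started session $session"
    fi
}
```

For example, start_in_tmux jupyter ./start-jupyter.sh starts the server. After reconnecting over SSH, tmux ls lists the sessions and tmux attach -t jupyter reattaches; pressing Ctrl-b and then d detaches again.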
We don't want to set up Anaconda, XGBoost, the SSH config and everything else each time we start a new instance. We want to preserve this state. For this purpose, we use an Amazon AMI.
In your Amazon AWS Console, in the tab Instances, select your instance, click Actions -> Image -> Create Image. Name your image and click Create Image.
The next time you want to start an instance, select the My AMIs tab at Step 1 and choose your AMI.
THANK YOU FOR THIS TUTORIAL!
I've been scratching my head for days trying to get XGBoost onto my Ec2 instance. I've followed your syntax to the dot and I got XGboost installed. But when I tried to import xgboost in my Jupyter notebook, I get the following error. I'm a Linux newbie so I have absolutely no idea what to do...Any help will be appreciated!
OSError Traceback (most recent call last)
in ()
1 import pandas as pd
----> 2 import xgboost
/home/ubuntu/xgboost/python-package/xgboost/__init__.py in <module>()
9 import os
10
---> 11 from .core import DMatrix, Booster
12 from .training import train, cv
13 from . import rabit # noqa
/home/ubuntu/xgboost/python-package/xgboost/core.py in <module>()
110
111 # load the XGBoost library globally
--> 112 _LIB = _load_lib()
113
114
/home/ubuntu/xgboost/python-package/xgboost/core.py in _load_lib()
104 if len(lib_path) == 0:
105 return None
--> 106 lib = ctypes.cdll.LoadLibrary(lib_path[0])
107 lib.XGBGetLastError.restype = ctypes.c_char_p
108 return lib
/home/ubuntu/anaconda2/lib/python2.7/ctypes/__init__.pyc in LoadLibrary(self, name)
438
439 def LoadLibrary(self, name):
--> 440 return self._dlltype(name)
441
442 cdll = LibraryLoader(CDLL)
/home/ubuntu/anaconda2/lib/python2.7/ctypes/__init__.pyc in __init__(self, name, mode, handle, use_errno, use_last_error)
360
361 if handle is None:
--> 362 self._handle = _dlopen(self._name, mode)
363 else:
364 self._handle = handle
OSError: /home/ubuntu/anaconda2/lib/python2.7/site-packages/zmq/backend/cython/../../../../.././libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/ubuntu/xgboost/python-package/xgboost/../../lib/libxgboost.so)