Salus build instructions
You have to do the build steps manually, but I think this serves as a good starting point. The image is available on Docker Hub: https://hub.docker.com/r/qi437103/salus/tags/
You will need to start the Docker container with the NVIDIA runtime (see https://github.com/NVIDIA/nvidia-docker).
docker run --runtime nvidia -it qi437103/salus:latest
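On newer Docker releases (19.03+) with the NVIDIA container toolkit installed, the --gpus flag is an alternative to the nvidia runtime; a possible equivalent invocation:
docker run --gpus all -it qi437103/salus:latest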
# After starting the docker container, go to /salus
cd /salus
# get sources
git clone https://github.com/SymbioticLab/tensorflow-salus.git tensorflow
git clone https://github.com/SymbioticLab/Salus.git salus
# install dependencies
## There's currently an issue with the download URL of zeromq, so you need to edit the package first
apt update && apt install -y vim
spack edit zeromq
## Change line 38 from "version('4.2.2', '52499909b29604c1e47a86f1cb6a9115')"
## to "version('4.2.2', '52499909b29604c1e47a86f1cb6a9115', url='https://github.com/zeromq/libzmq/releases/download/v4.2.2/zeromq-4.2.2.tar.gz')"
## Add missing tools
spack view -d false -v add /salus/packages pkgconf
## Then install all dependencies
spack install [email protected] [email protected] [email protected]~shared [email protected] [email protected] [email protected]
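## Optionally list what Spack installed to verify the specs resolved
spack find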
## Install bazel
curl -JOL "https://github.com/bazelbuild/bazel/releases/download/0.5.4/bazel_0.5.4-linux-x86_64.deb"
apt install -y bash-completion
dpkg -i bazel_0.5.4-linux-x86_64.deb
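## Sanity-check that bazel installed and reports version 0.5.4
bazel version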
# create a virtualenv for tensorflow
apt install -y python-virtualenv
virtualenv /salus/tfbuild
source /salus/tfbuild/bin/activate
# build tensorflow
cd tensorflow
## map dependencies into source tree
spack view -d false -v add spack-packages cppzmq libsodium zeromq
## install python dependencies
pip install six numpy wheel mock
## initialize build; don't answer yes when asked whether to edit the file, there's an error that needs to be fixed
inv init
## instead, manually edit the file
## check the following variables are set correctly
## PYTHON_BIN_PATH: /salus/tfbuild/bin/python
## TF_CUDA_VERSION: 9.1
## CUDA_TOOLKIT_PATH: /usr/local/cuda
## TF_CUDNN_VERSION: 7
## CUDNN_INSTALL_PATH: /usr/lib/x86_64-linux-gnu
## GCC_HOST_COMPILER_PATH: /usr/bin/gcc-5
## TF_CUDA_COMPUTE_CAPABILITIES: <set according to your device>
vim invoke.yml
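## Optional sanity check that the keys are present (names taken from the list above):
grep -E 'PYTHON_BIN_PATH|TF_CUDA_VERSION|CUDA_TOOLKIT_PATH|TF_CUDNN_VERSION|CUDNN_INSTALL_PATH|GCC_HOST_COMPILER_PATH|TF_CUDA_COMPUTE_CAPABILITIES' invoke.yml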
## configure the build system; no questions should be asked by the command
## if you set the variables correctly in the previous step
inv cf
## build, install and save the wheel package to ~/downloads
inv bbi --save
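## Optionally confirm the freshly installed wheel imports inside the virtualenv
python -c "import tensorflow as tf; print(tf.__version__)"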
# build salus
cd /salus/salus
git checkout develop
## map dependencies
spack view -d false -v add spack-packages cppzmq zeromq boost nlohmann-json protobuf gperftools
## some python dependencies for testing
pip install -r requirements.txt
## configure & install
mkdir -p build/Release && cd build/Release
export CC=gcc-7 CXX=g++-7
cmake -DCMAKE_BUILD_TYPE=Release -DTENSORFLOW_ROOT=/salus/tensorflow ../..
make -j
If everything goes well, you should have a binary at src/executor. It listens on port 5501 after startup; press Ctrl-C to stop it.
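For example, a minimal smoke test along these lines (assuming ss from iproute2 is present in the image):
# from the build/Release directory
./src/executor &
ss -tln | grep 5501    # should show a LISTEN entry on port 5501
kill %1                # stop the background executor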
Run test workloads | |
Before you run your own workloads, you can try running some test workloads to verify the system compiled correctly. The helper script that I used to run experiments for my paper is also included in the repo.
It expects a certain layout of workload scripts that you can set up as below:
cd /salus | |
git clone https://github.com/Aetf/tf_benchmarks.git
Then you can go back to the salus folder. The script assumes a certain hardware layout of the system, which you can override by setting CUDA_VISIBLE_DEVICES=0.
cd /salus/salus | |
export CUDA_VISIBLE_DEVICES=0 | |
# 25 is the batch size, 20 is the batch num | |
./bc one vgg16 25 20 --force_preset MostEfficient | |
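Other model names from the tf_benchmarks repo can be substituted for vgg16 in the same way, for example (model availability assumed):
./bc one resnet50 25 20 --force_preset MostEfficient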
Run user scripts | |
Using the tfbuild virtualenv, you can now run user scripts. Instead of creating a local session, create a session with the target "zrpc://tcp://127.0.0.1:5501" so that it connects to Salus. For now, you have to mark each training iteration with a no-op operation named "salus-marker", but this requirement will be removed later. Just create the operation before the iterations and then add it to session.run. Something like:
session.run([train_op, marker_op])
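For illustration, a minimal end-to-end sketch; everything except the session target and the "salus-marker" no-op name is a made-up toy model:
import tensorflow as tf

# Toy linear model; the model itself is illustrative only
x = tf.placeholder(tf.float32, shape=[None, 1])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([1, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Create the marker no-op once, before the training iterations
marker_op = tf.no_op(name="salus-marker")

# Connect to the Salus executor instead of using a local session
with tf.Session("zrpc://tcp://127.0.0.1:5501") as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(20):
        sess.run([train_op, marker_op], feed_dict={x: [[1.0]], y: [[2.0]]})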