@ontheklaud
Last active November 23, 2019 13:27
Horovod Installation Guide for Distributed Computing Platform
#
# Horovod Installation Guide
#
0a. Requirements
- CentOS 7.x (tested on 7.1, 7.3, 7.4)
- Python 3.6.x (built from source; see https://www.python.org/downloads/release/python-366/)
- GPU (not a prerequisite, but recommended)
- All workers must follow the same installation process
- DO NOT UPDATE ANY KERNEL PACKAGES (USE THE STOCK KERNEL PROVIDED BY CENTOS)
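The OS and Python requirements above can be checked on each worker before installing anything. The following is a minimal sketch, not part of the original guide: the function names `read_os_release` and `check_requirements` are hypothetical, and it assumes the host exposes the standard `/etc/os-release` file (CentOS 7 reports `ID="centos"` and `VERSION_ID="7"` there).

```python
import sys


def read_os_release(path="/etc/os-release"):
    """Parse KEY=VALUE lines from an os-release file into a dict."""
    info = {}
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                info[key] = value.strip('"')
    except OSError:
        pass  # file missing or unreadable: return an empty dict
    return info


def check_requirements(os_info, python_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    is_centos7 = (os_info.get("ID") == "centos"
                  and os_info.get("VERSION_ID", "").startswith("7"))
    if not is_centos7:
        problems.append("OS is not CentOS 7.x")
    if tuple(python_version[:2]) != (3, 6):
        problems.append("Python is not 3.6.x")
    return problems


if __name__ == "__main__":
    for problem in check_requirements(read_os_release(), sys.version_info):
        print("WARNING:", problem)
```

Running the script on every worker and comparing the output is one way to confirm that all nodes start from the same baseline.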
0b. Bootstrap
yum groupinstall "Development Tools"
yum install epel-release
1. Install MPI (openmpi/ompi)
yum install openmpi3 openmpi3-devel # openmpi 3.x.x
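Because Horovod builds against whatever MPI it finds, it may help to confirm that `mpirun` is on the PATH and reports Open MPI 3.x before continuing. The sketch below is an illustration only: `parse_openmpi_version` and `installed_openmpi_version` are hypothetical helper names, and the parser assumes the usual first line of `mpirun --version`, which looks like `mpirun (Open MPI) 3.1.3`.

```python
import re
import shutil
import subprocess


def parse_openmpi_version(text):
    """Extract (major, minor, patch) from `mpirun --version` output, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_openmpi_version():
    """Return the Open MPI version of the mpirun on PATH, or None if absent."""
    mpirun = shutil.which("mpirun")
    if mpirun is None:
        return None
    result = subprocess.run([mpirun, "--version"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    return parse_openmpi_version(result.stdout)


if __name__ == "__main__":
    version = installed_openmpi_version()
    if version is None:
        print("mpirun not found on PATH (or it is not Open MPI)")
    else:
        print("Open MPI", ".".join(map(str, version)))
```

If `mpirun` is not found right after the yum install, the package may have placed it outside the default PATH; on some CentOS setups it is exposed through an environment module, which is an assumption to verify on your own system.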
2. Install jemalloc
yum install jemalloc jemalloc-devel
3. Install TensorFlow & Horovod (MPI MUST BE INSTALLED BEFORE HOROVOD)
# For CPU-only TensorFlow (use pip3 instead of pip if pip points to Python 2)
pip install --upgrade tensorflow
# For GPU-enabled TensorFlow (the tensorflow-gpu package already includes CPU support; use pip3 if needed)
pip install --upgrade tensorflow-gpu
# Install Horovod (use pip3 if needed)
pip install --upgrade horovod
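After the pip steps, a quick check that both packages are visible to the interpreter you intend to use can save a failed MPI launch later. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and it only tests that the packages can be located, not that they import cleanly.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that the interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["tensorflow", "horovod"])
    if missing:
        print("Not installed for this interpreter:", ", ".join(missing))
    else:
        print("tensorflow and horovod are both installed")
```

If a package shows up as missing even though pip reported success, pip may belong to a different Python; running `python -m pip install --upgrade horovod` with the same `python` ties the install to that interpreter.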
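4. Test Horovod (on CPU-only TensorFlow)
# CUDA_VISIBLE_DEVICES is set to empty to hide all GPUs, so the "failed call to cuInit: CUDA_ERROR_NO_DEVICE" lines below are expected.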
$ mpirun -np 3 -H localhost:3 -bind-to none -map-by slot bash -c 'export CUDA_VISIBLE_DEVICES=;python -c "import tensorflow as tf;tf.Session();"'
2018-07-11 20:38:54.704650: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-11 20:38:54.704808: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: toolbox
2018-07-11 20:38:54.704817: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: toolbox
2018-07-11 20:38:54.704859: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 387.26.0
2018-07-11 20:38:54.705037: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 387.26.0
2018-07-11 20:38:54.705047: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 387.26.0
/home/🤞/anaconda3/envs/mpiworker/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
2018-07-11 20:38:54.718438: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-11 20:38:54.718527: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: toolbox
2018-07-11 20:38:54.718538: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: toolbox
2018-07-11 20:38:54.718598: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 387.26.0
2018-07-11 20:38:54.719134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 387.26.0
2018-07-11 20:38:54.719170: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 387.26.0
/home/🤞/anaconda3/envs/mpiworker/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
2018-07-11 20:38:54.727063: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-11 20:38:54.727142: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: toolbox
2018-07-11 20:38:54.727154: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: toolbox
2018-07-11 20:38:54.727218: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 387.26.0
2018-07-11 20:38:54.727532: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 387.26.0
2018-07-11 20:38:54.727548: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 387.26.0
/home/🤞/anaconda3/envs/mpiworker/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
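The command in step 4 starts one TensorFlow session per MPI process but never imports Horovod itself. As a possible follow-up check, the sketch below builds a similar `mpirun` command whose worker calls `hvd.init()` and prints `hvd.rank()` and `hvd.size()`. The names `WORKER_SNIPPET` and `build_mpirun_command` are hypothetical, the flags simply mirror those used in step 4, and the script only prints the command; running it assumes `mpirun`, TensorFlow and Horovod are all installed.

```python
import shlex

# Code executed by every MPI rank. Uses double quotes only, so the whole
# snippet can be wrapped in single quotes for the shell.
WORKER_SNIPPET = (
    "import horovod.tensorflow as hvd; "
    "hvd.init(); "
    'print("rank", hvd.rank(), "of", hvd.size())'
)


def build_mpirun_command(num_procs, host="localhost", python="python"):
    """Build an mpirun argument list that mirrors the flags from step 4."""
    shell_line = "export CUDA_VISIBLE_DEVICES=; {} -c {}".format(
        python, shlex.quote(WORKER_SNIPPET))
    return [
        "mpirun", "-np", str(num_procs),
        "-H", "{}:{}".format(host, num_procs),
        "-bind-to", "none", "-map-by", "slot",
        "bash", "-c", shell_line,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line instead of launching it.
    command = build_mpirun_command(3)
    print(" ".join(shlex.quote(arg) for arg in command))
```

With three processes, each rank should report a different rank number and the same size if Horovod initialised correctly over MPI.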