Last active
November 23, 2019 13:27
Horovod Installation Guide for Distributed Computing Platform
#
# Horovod Installation Guide
#
0a. Requirements
- CentOS 7.x (tested on 7.1, 7.3, 7.4)
- Python 3.6.x (source build, see https://www.python.org/downloads/release/python-366/)
- GPU (not a prerequisite, but encouraged)
- All workers must follow the same installation process
- DO NOT UPDATE ANY KERNEL PACKAGES (USE THE STOCK KERNEL PROVIDED BY CENTOS)
0b. Bootstrap
yum groupinstall "Development Tools"
yum install epel-release
1. Install MPI (openmpi/ompi)
yum install openmpi3 openmpi3-devel # openmpi 3.x.x
2. Install jemalloc
yum install jemalloc jemalloc-devel
3. Install TensorFlow & Horovod (MPI MUST BE INSTALLED BEFORE HOROVOD)
# For CPU-only TensorFlow:
pip install --upgrade tensorflow  # or: pip3 install --upgrade tensorflow
# For TensorFlow with GPU support:
pip install --upgrade tensorflow tensorflow-gpu  # or: pip3 install --upgrade tensorflow tensorflow-gpu
# Install Horovod:
pip install --upgrade horovod  # or: pip3 install --upgrade horovod
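Once steps 0 to 3 are done, a quick check on each worker can confirm that the pieces are in place. The following is a minimal sketch, not part of the original guide: `check_prerequisites` is a hypothetical helper name, and using `nvidia-smi` as a sign of an NVIDIA driver is an assumption.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(version_info=sys.version_info,
                        which=shutil.which,
                        find_spec=importlib.util.find_spec):
    """Return a dict mapping each prerequisite name to True or False.

    The lookups are injectable so the function can be exercised
    without touching the real environment.
    """
    return {
        # The guide targets Python 3.6.x.
        "python_3_6": tuple(version_info[:2]) == (3, 6),
        # mpirun comes with the MPI installation from step 1.
        "mpirun_on_path": which("mpirun") is not None,
        # Horovod is installed with pip in step 3.
        "horovod_importable": find_spec("horovod") is not None,
        # Optional: nvidia-smi is assumed to exist only on GPU hosts.
        "gpu_driver_found": which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print("{}: {}".format(name, "ok" if ok else "missing"))
```

Running the script on every worker and comparing the output is one way to verify that all workers went through the same installation process.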
4. Test Horovod (on CPU-only TensorFlow)
$ mpirun -np 3 -H localhost:3 -bind-to none -map-by slot bash -c 'export CUDA_VISIBLE_DEVICES=;python -c "import tensorflow as tf;tf.Session();"'
2018-07-11 20:38:54.704650: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-11 20:38:54.704808: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: toolbox
2018-07-11 20:38:54.704817: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: toolbox
2018-07-11 20:38:54.704859: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 387.26.0
2018-07-11 20:38:54.705037: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 387.26.0
2018-07-11 20:38:54.705047: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 387.26.0
/home/🤞/anaconda3/envs/mpiworker/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
2018-07-11 20:38:54.718438: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-11 20:38:54.718527: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: toolbox
2018-07-11 20:38:54.718538: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: toolbox
2018-07-11 20:38:54.718598: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 387.26.0
2018-07-11 20:38:54.719134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 387.26.0
2018-07-11 20:38:54.719170: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 387.26.0
/home/🤞/anaconda3/envs/mpiworker/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
2018-07-11 20:38:54.727063: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-11 20:38:54.727142: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: toolbox
2018-07-11 20:38:54.727154: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: toolbox
2018-07-11 20:38:54.727218: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 387.26.0
2018-07-11 20:38:54.727532: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 387.26.0
2018-07-11 20:38:54.727548: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 387.26.0
/home/🤞/anaconda3/envs/mpiworker/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
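The command in step 4 starts TensorFlow under mpirun but never imports Horovod itself. A variant that also calls `hvd.init()` and prints each process's rank could be assembled as sketched below. This is an illustration only: `build_horovod_smoke_test` is a hypothetical helper, the mpirun flags are copied from the command above, and no output is shown because it depends on the local installation.

```python
import shlex


def build_horovod_smoke_test(num_procs, host="localhost", python="python"):
    """Build an mpirun argument list that mirrors the command in step 4,
    with the Python one-liner swapped for one that initializes Horovod."""
    snippet = (
        "import horovod.tensorflow as hvd; "
        "hvd.init(); "
        "print(hvd.rank(), hvd.size())"
    )
    # An empty CUDA_VISIBLE_DEVICES hides all GPUs, as in the original command.
    shell_cmd = "export CUDA_VISIBLE_DEVICES=; {} -c {}".format(
        python, shlex.quote(snippet))
    return [
        "mpirun",
        "-np", str(num_procs),
        "-H", "{}:{}".format(host, num_procs),
        "-bind-to", "none",
        "-map-by", "slot",
        "bash", "-c", shell_cmd,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for three local processes.
    print(" ".join(shlex.quote(arg) for arg in build_horovod_smoke_test(3)))
```

If Horovod and MPI are wired together correctly, each of the three processes should report a distinct rank and the same size.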