# docker build --pull -t tf/tensorflow-serving --label 1.6 -f Dockerfile .
# export TF_SERVING_PORT=9000
# export TF_SERVING_MODEL_PATH=/tf_models/mymodel
# export CONTAINER_NAME=tf_serving_1_6
# CUDA_VISIBLE_DEVICES=0 docker run --runtime=nvidia -it -p $TF_SERVING_PORT:$TF_SERVING_PORT -v $TF_SERVING_MODEL_PATH:/root/tf_model --name $CONTAINER_NAME tf/tensorflow-serving /usr/local/bin/tensorflow_model_server --port=$TF_SERVING_PORT --enable_batching=true --model_base_path=/root/tf_model/
# docker start -ai $CONTAINER_NAME

FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

# CUDA and CUDNN versions (must match the image source)
ENV TF_CUDA_VERSION=9.0 \
    TF_CUDNN_VERSION=7 \
    TF_SERVING_COMMIT=tags/1.6.0 \
    BAZEL_VERSION=0.11.1

# Set up ubuntu packages
RUN apt-get update && apt-get install -y \
        build-essential \
        curl \
        git \
        libfreetype6-dev \
        libpng12-dev \
        libzmq3-dev \
        mlocate \
        pkg-config \
        python-dev \
        python-numpy \
        python-pip \
        software-properties-common \
        swig \
        zip \
        zlib1g-dev \
        libcurl3-dev \
        openjdk-8-jdk \
        openjdk-8-jre-headless \
        wget \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set up grpc
RUN pip install mock grpcio

# Set up Bazel.
# Running bazel inside a `docker build` command causes trouble, cf: https://github.com/bazelbuild/bazel/issues/134
RUN echo "startup --batch" >>/root/.bazelrc
# Similarly, we need to work around sandboxing issues: https://github.com/bazelbuild/bazel/issues/418
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" >>/root/.bazelrc
ENV BAZELRC /root/.bazelrc
# Install the Bazel release pinned in BAZEL_VERSION.
WORKDIR /bazel
RUN curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
    chmod +x bazel-*.sh && \
    ./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh

# Fix paths so that CUDNN can be found: https://github.com/tensorflow/tensorflow/issues/8264
WORKDIR /
RUN mkdir /usr/lib/x86_64-linux-gnu/include/ && \
    ln -s /usr/lib/x86_64-linux-gnu/include/cudnn.h /usr/lib/x86_64-linux-gnu/include/cudnn.h && \
    ln -s /usr/include/cudnn.h /usr/local/cuda/include/cudnn.h && \
    ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so && \
    ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.$TF_CUDNN_VERSION /usr/local/cuda/lib64/libcudnn.so.$TF_CUDNN_VERSION

# Enable CUDA support
ENV TF_NEED_CUDA=1 \
    TF_CUDA_COMPUTE_CAPABILITIES="3.0,3.5,5.2,6.0,6.1" \
    LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH

# Download TensorFlow Serving
WORKDIR /tensorflow
RUN git clone --recurse-submodules https://github.com/tensorflow/serving
WORKDIR /tensorflow/serving
RUN git checkout $TF_SERVING_COMMIT

# Build TensorFlow Serving
WORKDIR /tensorflow/serving
RUN bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k --verbose_failures --crosstool_top=@local_config_cuda//crosstool:toolchain tensorflow_serving/model_servers:tensorflow_model_server

# Install tensorflow_model_server and clean bazel
RUN cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/local/bin/ && \
    bazel clean --expunge

CMD ["/bin/bash"]
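
The run step from the header comments can be wrapped in a small shell helper. This is only a minimal sketch and not part of the gist itself: the function name `tf_serving_run_cmd` is made up for illustration, and the defaults are simply the example values from the `export` lines in the header. The function prints the command instead of running it, so it can be inspected first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the gist): prints the `docker run` command
# from the header comments, using the same variable names and example values.

tf_serving_run_cmd() {
  local port="${TF_SERVING_PORT:-9000}"
  local model_path="${TF_SERVING_MODEL_PATH:-/tf_models/mymodel}"
  local name="${CONTAINER_NAME:-tf_serving_1_6}"

  # The host model directory is mounted at /root/tf_model inside the container,
  # which is the path passed to --model_base_path.
  echo "docker run --runtime=nvidia -it" \
    "-p ${port}:${port}" \
    "-v ${model_path}:/root/tf_model" \
    "--name ${name}" \
    "tf/tensorflow-serving" \
    "/usr/local/bin/tensorflow_model_server" \
    "--port=${port} --enable_batching=true --model_base_path=/root/tf_model/"
}
```

To actually start the container, the printed command can be passed to `eval`, for example `eval "$(tf_serving_run_cmd)"`.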
There is a dependency between "software-properties-common" in line#33 and "python-pip" in line#32, and it breaks the pip installation. I swapped the two lines and pip installed cleanly. This may need to be changed.
Looks like there is no nvcc 9.0 available anymore, so I've updated CUDA to 9.2, but now it fails to link the binary:
INFO: Analysed target //tensorflow_serving/model_servers:tensorflow_model_server (127 packages loaded).
INFO: Found 1 target...
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 7s local
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 8s local
ERROR: /serving/tensorflow_serving/model_servers/BUILD:270:1: Linking of rule '//tensorflow_serving/model_servers:tensorflow_model_server' failed (Exit 1)
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `std::_Function_handler<void (), tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
nccl_manager.cc:(.text._ZNSt17_Function_handlerIFvvEZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS2_10NcclStreamEEUlvE_E9_M_invokeERKSt9_Any_data+0x141): undefined reference to `ncclGetErrorString'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x23d): undefined reference to `ncclAllReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x31f): undefined reference to `ncclReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x383): undefined reference to `ncclBcast'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `void std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > >::_M_realloc_insert<tensorflow::NcclManager::Communicator*>(__gnu_cxx::__normal_iterator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >*, std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > > >, tensorflow::NcclManager::Communicator*&&)':
nccl_manager.cc:(.text._ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_[_ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_]+0x159): undefined reference to `ncclCommDestroy'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::GetCommunicator(tensorflow::NcclManager::Collective*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x565): undefined reference to `ncclGetUniqueId'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x574): undefined reference to `ncclGroupStart'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x66b): undefined reference to `ncclCommInitRank'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x6ed): undefined reference to `ncclGetErrorString'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xc73): undefined reference to `ncclCommInitAll'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xe46): undefined reference to `ncclGroupEnd'
collect2: error: ld returned 1 exit status
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 48.712s, Critical Path: 19.38s
FAILED: Build did NOT complete successfully
My Dockerfile is pretty much the same as https://raw.githubusercontent.com/tensorflow/serving/c8cc43b/tensorflow_serving/tools/docker/Dockerfile.devel-gpu but based on CUDA 9.2 and using libpng-dev instead of libpng12-dev, because my CUDA 9.2 image is based on Ubuntu 18.04.
I'd test with 9.1 and other Ubuntu and/or Serving versions, but the build is so awfully slow (>2h on my rather beefy laptop).
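
Every unresolved symbol in the log above (`ncclAllReduce`, `ncclBcast`, `ncclCommInitRank`, and so on) belongs to NCCL, so one thing worth checking is whether an NCCL shared library is visible to the linker inside the image at all. Below is a minimal sketch, assuming the library would be registered in the `ldconfig` cache. The function name `has_nccl` is made up for illustration, and a missing entry would only be a hint, not a confirmed cause.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic (not from the thread): reads `ldconfig -p` style
# output on stdin and reports whether any libnccl shared library is listed.

has_nccl() {
  if grep -q 'libnccl\.so'; then
    echo "NCCL library found in linker cache"
  else
    echo "no NCCL library in linker cache"
    return 1
  fi
}

# Usage inside the build image:
#   ldconfig -p | has_nccl
```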
@discordianfish Any change you make to the Dockerfile can break the build. For example, I think the latest Ubuntu version doesn't have CUDA 9.0, which is required to compile TF Serving (because of some driver issues, I think). So if you changed that, it won't work unless you add the necessary workarounds.
Here are the official docker files: https://github.com/tensorflow/serving/tree/master/tensorflow_serving/tools/docker
For some reason this works better than the official Dockerfile in my case. The "latest-devel-gpu" TF Serving image pulled from the Docker repo doesn't recognize my CUDA device, even when I build the official latest-devel-gpu from GitHub myself, whereas this one does.
Thanks @kondrashov-do, I guess I forgot that!