# docker build --pull -t tf/tensorflow-serving --label 1.6 -f Dockerfile .
# export TF_SERVING_PORT=9000
# export TF_SERVING_MODEL_PATH=/tf_models/mymodel
# export CONTAINER_NAME=tf_serving_1_6
# CUDA_VISIBLE_DEVICES=0 docker run --runtime=nvidia -it -p $TF_SERVING_PORT:$TF_SERVING_PORT -v $TF_SERVING_MODEL_PATH:/root/tf_model --name $CONTAINER_NAME tf/tensorflow-serving /usr/local/bin/tensorflow_model_server --port=$TF_SERVING_PORT --enable_batching=true --model_base_path=/root/tf_model/
# docker start -ai $CONTAINER_NAME

FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

# CUDA and CUDNN versions (must match the image source)
ENV TF_CUDA_VERSION=9.0 \
    TF_CUDNN_VERSION=7 \
    TF_SERVING_COMMIT=tags/1.6.0 \
    BAZEL_VERSION=0.11.1

# Set up ubuntu packages
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    git \
    libfreetype6-dev \
    libpng12-dev \
    libzmq3-dev \
    mlocate \
    pkg-config \
    python-dev \
    python-numpy \
    python-pip \
    software-properties-common \
    swig \
    zip \
    zlib1g-dev \
    libcurl3-dev \
    openjdk-8-jdk \
    openjdk-8-jre-headless \
    wget \
    && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set up grpc
RUN pip install mock grpcio

# Set up Bazel.
# Running bazel inside a `docker build` command causes trouble, cf: https://github.com/bazelbuild/bazel/issues/134
RUN echo "startup --batch" >>/root/.bazelrc
# Similarly, we need to workaround sandboxing issues: https://github.com/bazelbuild/bazel/issues/418
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" >>/root/.bazelrc
ENV BAZELRC /root/.bazelrc
# Install the most recent bazel release.
WORKDIR /bazel
RUN curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
    chmod +x bazel-*.sh && \
    ./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh

# Fix paths so that CUDNN can be found: https://github.com/tensorflow/tensorflow/issues/8264
WORKDIR /
RUN mkdir /usr/lib/x86_64-linux-gnu/include/ && \
    ln -s /usr/lib/x86_64-linux-gnu/include/cudnn.h /usr/lib/x86_64-linux-gnu/include/cudnn.h && \
    ln -s /usr/include/cudnn.h /usr/local/cuda/include/cudnn.h && \
    ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so && \
    ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.$TF_CUDNN_VERSION /usr/local/cuda/lib64/libcudnn.so.$TF_CUDNN_VERSION

# Enable CUDA support
ENV TF_NEED_CUDA=1 \
    TF_CUDA_COMPUTE_CAPABILITIES="3.0,3.5,5.2,6.0,6.1" \
    LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH

# Download TensorFlow Serving
WORKDIR /tensorflow
RUN git clone --recurse-submodules https://github.com/tensorflow/serving
WORKDIR /tensorflow/serving
RUN git checkout $TF_SERVING_COMMIT

# Build TensorFlow Serving
WORKDIR /tensorflow/serving
RUN bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k --verbose_failures --crosstool_top=@local_config_cuda//crosstool:toolchain tensorflow_serving/model_servers:tensorflow_model_server

# Install tensorflow_model_server and clean bazel
RUN cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/local/bin/ && \
    bazel clean --expunge

CMD ["/bin/bash"]
Looks like there is no nvcc 9.0 available anymore, so I've updated CUDA to 9.2, but now it fails to link the binary:
INFO: Analysed target //tensorflow_serving/model_servers:tensorflow_model_server (127 packages loaded).
INFO: Found 1 target...
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 7s local
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 8s local
ERROR: /serving/tensorflow_serving/model_servers/BUILD:270:1: Linking of rule '//tensorflow_serving/model_servers:tensorflow_model_server' failed (Exit 1)
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `std::_Function_handler<void (), tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
nccl_manager.cc:(.text._ZNSt17_Function_handlerIFvvEZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS2_10NcclStreamEEUlvE_E9_M_invokeERKSt9_Any_data+0x141): undefined reference to `ncclGetErrorString'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x23d): undefined reference to `ncclAllReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x31f): undefined reference to `ncclReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x383): undefined reference to `ncclBcast'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `void std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > >::_M_realloc_insert<tensorflow::NcclManager::Communicator*>(__gnu_cxx::__normal_iterator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >*, std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > > >, tensorflow::NcclManager::Communicator*&&)':
nccl_manager.cc:(.text._ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_[_ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_]+0x159): undefined reference to `ncclCommDestroy'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::GetCommunicator(tensorflow::NcclManager::Collective*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x565): undefined reference to `ncclGetUniqueId'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x574): undefined reference to `ncclGroupStart'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x66b): undefined reference to `ncclCommInitRank'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x6ed): undefined reference to `ncclGetErrorString'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xc73): undefined reference to `ncclCommInitAll'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xe46): undefined reference to `ncclGroupEnd'
collect2: error: ld returned 1 exit status
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 48.712s, Critical Path: 19.38s
FAILED: Build did NOT complete successfully
My Dockerfile is pretty much the same as https://raw.githubusercontent.com/tensorflow/serving/c8cc43b/tensorflow_serving/tools/docker/Dockerfile.devel-gpu but based on CUDA 9.2 and using libpng-dev instead of libpng12-dev, because my CUDA 9.2 image is based on Ubuntu 18.04.
I'd test with 9.1 and other Ubuntu and/or Serving versions, but the build is awfully slow (>2h on my rather beefy laptop).
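Every undefined reference in the log above is an `nccl*` symbol pulled in by `nccl_manager.o`, which suggests the NCCL library is not being linked, rather than a problem in the model server code itself. A minimal shell sketch for listing the distinct missing symbols from a saved build log follows; the function name `missing_symbols` and the file name `build.log` are made up for illustration, and `grep -o` is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Read a linker log on stdin and print each distinct symbol that was
# reported as "undefined reference to `sym'", one per line, sorted.
missing_symbols() {
    grep -o "undefined reference to \`[^']*'" \
        | sed "s/.*\`//; s/'\$//" \
        | sort -u
}

# Usage (build.log is a hypothetical file holding the bazel output):
#   missing_symbols < build.log
```

If the resulting list contains only NCCL entry points, as it does here, checking whether the base image actually ships an NCCL library is a reasonable first step before rebuilding.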
@discordianfish Any change you make to the Dockerfile can break the build. For example, I think the latest Ubuntu version doesn't have CUDA 9.0, which is required to compile TF Serving (because of some driver issues, I think). So if you changed that, it won't work unless you add the necessary workarounds.
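One mismatch that is easy to introduce when switching base images is between the CUDA version in the `FROM` tag and `TF_CUDA_VERSION`, which the Dockerfile's own comment says must match. The sketch below is a rough check, not part of the gist: the function name `check_cuda_match` is hypothetical, and it assumes the tag has the form `nvidia/cuda:<version>-...` and that `TF_CUDA_VERSION=<version>` appears on a single line.

```shell
#!/bin/sh
# Compare the CUDA version in the FROM tag with TF_CUDA_VERSION.
# Assumes a tag like nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 and an
# ENV entry of the form TF_CUDA_VERSION=9.0 somewhere in the file.
check_cuda_match() {
    dockerfile=$1
    image_cuda=$(sed -n 's/^FROM nvidia\/cuda:\([0-9.]*\)-.*/\1/p' "$dockerfile" | head -n 1)
    env_cuda=$(sed -n 's/.*TF_CUDA_VERSION=\([0-9.]*\).*/\1/p' "$dockerfile" | head -n 1)
    if [ -n "$image_cuda" ] && [ "$image_cuda" = "$env_cuda" ]; then
        echo "ok: CUDA $image_cuda"
    else
        echo "mismatch: image=$image_cuda TF_CUDA_VERSION=$env_cuda"
        return 1
    fi
}

# Usage: check_cuda_match Dockerfile
```

A matching pair does not guarantee a working build (the NCCL link error above happened with other changes in play), but a mismatch is a cheap thing to rule out before a multi-hour build.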
Here are the official docker files: https://github.com/tensorflow/serving/tree/master/tensorflow_serving/tools/docker
For some reason this works better than the official Dockerfile in my case. The "latest-devel-gpu" TF Serving image pulled from the Docker repo doesn't recognize my CUDA device, whereas this one does. The same happens even when I build the official latest-devel-gpu image from GitHub.
There is a dependency between "software-properties-common" on line 33 and "python-pip" on line 32, and it breaks the pip installation. I swapped the two lines and pip installed cleanly. This may need to be changed.