# docker build --pull -t tf/tensorflow-serving --label 1.6 -f Dockerfile .
# export TF_SERVING_PORT=9000
# export TF_SERVING_MODEL_PATH=/tf_models/mymodel
# export CONTAINER_NAME=tf_serving_1_6
# CUDA_VISIBLE_DEVICES=0 docker run --runtime=nvidia -it -p $TF_SERVING_PORT:$TF_SERVING_PORT -v $TF_SERVING_MODEL_PATH:/root/tf_model --name $CONTAINER_NAME tf/tensorflow-serving /usr/local/bin/tensorflow_model_server --port=$TF_SERVING_PORT --enable_batching=true --model_base_path=/root/tf_model/
# docker start -ai $CONTAINER_NAME
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

# CUDA and CUDNN versions (must match the image source)
ENV TF_CUDA_VERSION=9.0 \
    TF_CUDNN_VERSION=7 \
    TF_SERVING_COMMIT=tags/1.6.0 \
    BAZEL_VERSION=0.11.1
# Set up ubuntu packages
RUN apt-get update && apt-get install -y \
        build-essential \
        curl \
        git \
        libfreetype6-dev \
        libpng12-dev \
        libzmq3-dev \
        mlocate \
        pkg-config \
        python-dev \
        python-numpy \
        python-pip \
        software-properties-common \
        swig \
        zip \
        zlib1g-dev \
        libcurl3-dev \
        openjdk-8-jdk \
        openjdk-8-jre-headless \
        wget \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Set up grpc
RUN pip install mock grpcio
# Set up Bazel.
# Running bazel inside a `docker build` command causes trouble, cf: https://github.com/bazelbuild/bazel/issues/134
RUN echo "startup --batch" >>/root/.bazelrc
# Similarly, we need to work around sandboxing issues: https://github.com/bazelbuild/bazel/issues/418
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" >>/root/.bazelrc
ENV BAZELRC /root/.bazelrc
# Install the bazel release pinned in BAZEL_VERSION.
WORKDIR /bazel
RUN curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
    chmod +x bazel-*.sh && \
    ./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
# Fix paths so that CUDNN can be found: https://github.com/tensorflow/tensorflow/issues/8264
WORKDIR /
RUN mkdir /usr/lib/x86_64-linux-gnu/include/ && \
    ln -s /usr/lib/x86_64-linux-gnu/include/cudnn.h /usr/lib/x86_64-linux-gnu/include/cudnn.h && \
    ln -s /usr/include/cudnn.h /usr/local/cuda/include/cudnn.h && \
    ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so && \
    ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.$TF_CUDNN_VERSION /usr/local/cuda/lib64/libcudnn.so.$TF_CUDNN_VERSION
# Enable CUDA support
ENV TF_NEED_CUDA=1 \
    TF_CUDA_COMPUTE_CAPABILITIES="3.0,3.5,5.2,6.0,6.1" \
    LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
# Download TensorFlow Serving
WORKDIR /tensorflow
RUN git clone --recurse-submodules https://github.com/tensorflow/serving
WORKDIR /tensorflow/serving
RUN git checkout $TF_SERVING_COMMIT
# Build TensorFlow Serving
WORKDIR /tensorflow/serving
RUN bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k --verbose_failures --crosstool_top=@local_config_cuda//crosstool:toolchain tensorflow_serving/model_servers:tensorflow_model_server
# Install tensorflow_model_server and clean bazel
RUN cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/local/bin/ && \
    bazel clean --expunge
CMD ["/bin/bash"]
I tried to build TF Serving from your file, but I got this error during the bazel compilation:
ERROR: /root/.cache/bazel/_bazel_root/d9c8385ec38b40593868ab263ecdc773/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:68:1:
Couldn't build file external/org_tensorflow/tensorflow/contrib/nccl/_objs/nccl_kernels/external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_ops.o:
C++ compilation of rule '@org_tensorflow//tensorflow/contrib/nccl:nccl_kernels' failed (Exit 1):
crosstool_wrapper_driver_is_not_gcc failed: error executing command
Any idea how to deal with that, or what could have changed so that nccl no longer builds properly?
I've researched the problem, and one workaround is to comment out the nccl dependency in tensorflow/tensorflow/contrib/BUILD, but that requires changing the @org_tensorflow variable.
My build always fails: it gets to this step and exits. I don't know why, and there is no error message.
external/org_tensorflow/tensorflow/core/kernels/cwise_ops.h(169): warning: __device__ annotation on a defaulted function("scalar_left") is ignored
external/org_tensorflow/tensorflow/core/kernels/cwise_ops.h(199): warning: __host__ annotation on a defaulted function("scalar_right") is ignored
external/org_tensorflow/tensorflow/core/kernels/cwise_ops.h(199): warning: __device__ annotation on a defaulted function("scalar_right") is ignored
[4,435 / 4,442] 3 actions running
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
INFO: Elapsed time: 813.215s, Critical Path: 552.69s
FAILED: Build did NOT complete successfully
The problem has been solved.
The build failed because the latest version is now 1.7.
@jorgemf Thank you for the script.
I managed to build it on AWS EC2 (p2.xlarge, DL AMI with CUDA 9).
The only detail: I checked out the specific tag 1.6.0 of the tensorflow/serving repository:
RUN git clone --recurse-submodules https://github.com/tensorflow/serving
WORKDIR /tensorflow/serving
RUN git checkout tags/1.6.0
Hope it will help somebody!
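To make a build fail fast when the checkout is not on the intended release, a guard like the one below could run right after the `git checkout` step. This is only a sketch: `assert_on_tag` is a hypothetical helper name, not something from the gist or from TF Serving, and it assumes the release is marked by a tag named `1.6.0`, as the `tags/1.6.0` checkout above suggests.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if the checkout in $1 sits exactly on tag $2.
assert_on_tag() {
  dir="$1"
  expected="$2"
  # `git describe --tags --exact-match` fails when HEAD is not on a tag.
  actual="$(git -C "$dir" describe --tags --exact-match 2>/dev/null)" || {
    echo "HEAD in $dir is not on any tag" >&2
    return 1
  }
  if [ "$actual" != "$expected" ]; then
    echo "expected tag $expected, found $actual" >&2
    return 1
  fi
}

# Example use inside the image (path taken from the Dockerfile above):
#   assert_on_tag /tensorflow/serving 1.6.0 || exit 1
```

If several tags point at the same commit, `git describe` reports only one of them, so the check may need adjusting in that case.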
Thanks @kondrashov-do, I guess I forgot that!
There is a dependency between "software-properties-common" on line #33 and "python-pip" on line #32. This breaks the pip installation. I swapped the lines and pip installed cleanly. This may need to be changed.
Looks like there is no nvcc 9.0 available anymore, so I've updated CUDA to 9.2, but now it fails to link the binary:
INFO: Analysed target //tensorflow_serving/model_servers:tensorflow_model_server (127 packages loaded).
INFO: Found 1 target...
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 7s local
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 8s local
ERROR: /serving/tensorflow_serving/model_servers/BUILD:270:1: Linking of rule '//tensorflow_serving/model_servers:tensorflow_model_server' failed (Exit 1)
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `std::_Function_handler<void (), tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
nccl_manager.cc:(.text._ZNSt17_Function_handlerIFvvEZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS2_10NcclStreamEEUlvE_E9_M_invokeERKSt9_Any_data+0x141): undefined reference to `ncclGetErrorString'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x23d): undefined reference to `ncclAllReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x31f): undefined reference to `ncclReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x383): undefined reference to `ncclBcast'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `void std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > >::_M_realloc_insert<tensorflow::NcclManager::Communicator*>(__gnu_cxx::__normal_iterator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >*, std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > > >, tensorflow::NcclManager::Communicator*&&)':
nccl_manager.cc:(.text._ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_[_ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_]+0x159): undefined reference to `ncclCommDestroy'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::GetCommunicator(tensorflow::NcclManager::Collective*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x565): undefined reference to `ncclGetUniqueId'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x574): undefined reference to `ncclGroupStart'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x66b): undefined reference to `ncclCommInitRank'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x6ed): undefined reference to `ncclGetErrorString'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xc73): undefined reference to `ncclCommInitAll'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xe46): undefined reference to `ncclGroupEnd'
collect2: error: ld returned 1 exit status
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 48.712s, Critical Path: 19.38s
FAILED: Build did NOT complete successfully
My Dockerfile is pretty much the same as https://raw.githubusercontent.com/tensorflow/serving/c8cc43b/tensorflow_serving/tools/docker/Dockerfile.devel-gpu but based on CUDA 9.2 and using libpng-dev instead of libpng12-dev, because my CUDA 9.2 image is based on Ubuntu 18.04.
I'd test with 9.1 and other Ubuntu and/or serving versions, but the build is awfully slow (>2h on my rather beefy laptop).
@discordianfish any change you make in the Dockerfile can stop it from working. For example, I think the latest Ubuntu version doesn't have CUDA 9.0, which is required to compile TF Serving (because of some driver issues, I think). So if you changed that, it won't work unless you add the necessary workarounds.
Here are the official docker files: https://github.com/tensorflow/serving/tree/master/tensorflow_serving/tools/docker
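Since the Dockerfile notes that TF_CUDA_VERSION must match the base image, a small check can catch a mismatch before a multi-hour build. The sketch below assumes the image tag begins with the CUDA version (as in `nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04`); the function names are made up for illustration.

```shell
#!/bin/sh
# Extract the CUDA version from an nvidia/cuda image reference.
# Assumes the tag starts with the version, e.g. "9.0-cudnn7-devel-ubuntu16.04".
cuda_version_from_image() {
  tag="${1#*:}"               # drop everything up to the first ':'
  printf '%s\n' "${tag%%-*}"  # keep the part before the first '-'
}

# Return 0 when the image's CUDA version equals the expected TF_CUDA_VERSION.
check_cuda_match() {
  [ "$(cuda_version_from_image "$1")" = "$2" ]
}

if check_cuda_match "nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04" "9.0"; then
  echo "CUDA versions match"
else
  echo "CUDA version mismatch" >&2
fi
```

This only compares version strings. It does not verify that the cuDNN or NCCL libraries inside the image are the ones the build expects.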
For some reason this works better than the official Dockerfile in my case. The "latest-devel-gpu" TF Serving image pulled from the Docker repo doesn't recognize my CUDA device, whereas this one does. The same happens even when I build the official latest-devel-gpu from GitHub.
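One quick way to compare images on this point is to run `nvidia-smi` inside each of them through the nvidia runtime. The helper below is a hypothetical sketch, not part of the gist: it assumes `nvidia-smi` is available inside the container (which depends on the image and on the nvidia runtime setup), and `DRY_RUN` is an illustrative switch that prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: run nvidia-smi inside an image via the nvidia runtime.
# Set DRY_RUN=1 to print the command instead of executing it.
gpu_visible_in() {
  image="$1"
  set -- docker run --rm --runtime=nvidia "$image" nvidia-smi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Example: gpu_visible_in tf/tensorflow-serving
```

If the command lists the GPU for one image but fails for another, the difference is likely in the image or in how it was started, rather than in the host driver.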
Updated to 1.6 with GPU support.
@maorzalt the problem wasn't that line. I added it to fix an issue, but it wasn't enough to make it work with the GPU.