-
-
Save jorgemf/c791841f769bff96718fd54bbdecfd4e to your computer and use it in GitHub Desktop.
# docker build --pull -t tf/tensorflow-serving --label 1.6 -f Dockerfile . | |
# export TF_SERVING_PORT=9000 | |
# export TF_SERVING_MODEL_PATH=/tf_models/mymodel | |
# export CONTAINER_NAME=tf_serving_1_6 | |
# CUDA_VISIBLE_DEVICES=0 docker run --runtime=nvidia -it -p $TF_SERVING_PORT:$TF_SERVING_PORT -v $TF_SERVING_MODEL_PATH:/root/tf_model --name $CONTAINER_NAME tf/tensorflow-serving /usr/local/bin/tensorflow_model_server --port=$TF_SERVING_PORT --enable_batching=true --model_base_path=/root/tf_model/ | |
# docker start -ai $CONTAINER_NAME | |
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 | |
# CUDA and CUDNN versions (must match the image source) | |
ENV TF_CUDA_VERSION=9.0 \ | |
TF_CUDNN_VERSION=7 \ | |
TF_SERVING_COMMIT=tags/1.6.0 \ | |
BAZEL_VERSION=0.11.1 | |
# Set up ubuntu packages | |
RUN apt-get update && apt-get install -y \ | |
build-essential \ | |
curl \ | |
git \ | |
libfreetype6-dev \ | |
libpng12-dev \ | |
libzmq3-dev \ | |
mlocate \ | |
pkg-config \ | |
python-dev \ | |
python-numpy \ | |
python-pip \ | |
software-properties-common \ | |
swig \ | |
zip \ | |
zlib1g-dev \ | |
libcurl3-dev \ | |
openjdk-8-jdk\ | |
openjdk-8-jre-headless \ | |
wget \ | |
&& \ | |
apt-get clean && \ | |
rm -rf /var/lib/apt/lists/* | |
# Set up grpc | |
RUN pip install mock grpcio | |
# Set up Bazel. | |
# Running bazel inside a `docker build` command causes trouble, cf: https://github.com/bazelbuild/bazel/issues/134 | |
RUN echo "startup --batch" >>/root/.bazelrc | |
# Similarly, we need to workaround sandboxing issues: https://github.com/bazelbuild/bazel/issues/418 | |
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" >>/root/.bazelrc | |
ENV BAZELRC /root/.bazelrc | |
# Install the most recent bazel release. | |
WORKDIR /bazel | |
RUN curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \ | |
chmod +x bazel-*.sh && \ | |
./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh | |
# Fix paths so that CUDNN can be found: https://github.com/tensorflow/tensorflow/issues/8264 | |
WORKDIR / | |
RUN mkdir /usr/lib/x86_64-linux-gnu/include/ && \ | |
ln -s /usr/lib/x86_64-linux-gnu/include/cudnn.h /usr/lib/x86_64-linux-gnu/include/cudnn.h && \ | |
ln -s /usr/include/cudnn.h /usr/local/cuda/include/cudnn.h && \ | |
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so && \ | |
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.$TF_CUDNN_VERSION /usr/local/cuda/lib64/libcudnn.so.$TF_CUDNN_VERSION | |
# Enable CUDA support | |
ENV TF_NEED_CUDA=1 \ | |
TF_CUDA_COMPUTE_CAPABILITIES="3.0,3.5,5.2,6.0,6.1" \ | |
LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH | |
# Download TensorFlow Serving | |
WORKDIR /tensorflow | |
RUN git clone --recurse-submodules https://github.com/tensorflow/serving | |
WORKDIR /tensorflow/serving | |
RUN git checkout $TF_SERVING_COMMIT | |
# Build TensorFlow Serving | |
WORKDIR /tensorflow/serving | |
RUN bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k --verbose_failures --crosstool_top=@local_config_cuda//crosstool:toolchain tensorflow_serving/model_servers:tensorflow_model_server | |
# Install tensorflow_model_server and clean bazel | |
RUN cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/local/bin/ && \ | |
bazel clean --expunge | |
CMD ["/bin/bash"] |
@taojian1989 you need the nvidia runtime for docker, take a look at the first lines that describes how to build and run the container.
Thanks for putting this together. I built it also, but I get this error when running serving. host OS is Ubuntu 16.04 with CUDA 9.0 and nvidia-docker2 version 2.0.3+docker18.03.0-1 and docker-ce version 18.03.0ce-0ubuntu. seems to have trouble finding libcuda.so. GPU does not seem to be used.
2018-04-06 18:31:11.489706: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:236] Loading SavedModel from: /root/tf_model/1
2018-04-06 18:31:11.490992: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUresult(-1)
2018-04-06 18:31:11.491030: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: edc550abaa6b
2018-04-06 18:31:11.491045: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: edc550abaa6b
2018-04-06 18:31:11.491143: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2018-04-06 18:31:11.491198: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.111 Tue Dec 19 23:51:45 PST 2017
@samuelswu There is a docker file in the serving repo, they do some changes for the CUDA issue: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/tools/docker/Dockerfile.devel-gpu
I will try to do them and test it.
Remove this line from the dockerfile:
ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/libcuda.so.1
Updated to 1.6 with GPU support.
@maorzalt the problem wasn't that line, I added that one to fix an issue but it wasn't enough to make it work with the GPU
I've tried to build TS Serving from your file, but I got this error during bazel compilation :
ERROR: /root/.cache/bazel/_bazel_root/d9c8385ec38b40593868ab263ecdc773/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:68:1:
Couldn't build file external/org_tensorflow/tensorflow/contrib/nccl/_objs/nccl_kernels/external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_ops.o:
C++ compilation of rule '@org_tensorflow//tensorflow/contrib/nccl:nccl_kernels' failed (Exit 1):
crosstool_wrapper_driver_is_not_gcc failed: error executing command
Any idea how to deal with that or what could have change that nccl is not building properly ?
I've researched that problem and one workaround is to comment out the DEP for nccl in: tensorflow/tensorflow/contrib/BUILD. but it requiers the change @org_tensorflow variable.
I always build failed,build to this step and exit. I don't know why this is so and there is no error message.
external/org_tensorflow/tensorflow/core/kernels/cwise_ops.h(169): warning: __device__ annotation on a defaulted function("scalar_left") is ignored
external/org_tensorflow/tensorflow/core/kernels/cwise_ops.h(199): warning: __host__ annotation on a defaulted function("scalar_right") is ignored
external/org_tensorflow/tensorflow/core/kernels/cwise_ops.h(199): warning: __device__ annotation on a defaulted function("scalar_right") is ignored
[4,435 / 4,442] 3 actions running
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
INFO: Elapsed time: 813.215s, Critical Path: 552.69s
FAILED: Build did NOT complete successfully
The problem has been solved.
Build failed because the latest version is 1.7.
@jorgemf Thank you for the script.
Managed to build it on AWS EC2, p2.xlarge, DL AMI with CUDA 9.
The only detail, I checked out specific tag 1.6 of tf/serving repository
RUN git clone --recurse-submodules https://github.com/tensorflow/serving
WORKDIR /tensorflow/serving
RUN git checkout tags/1.6.0
Hope it will help somebody!
thanks @kondrashov-do I guess I forgot that!
There is a dependency on "software-properties-common" in line#33 to "python-pip" in line#32. This is breaking the pip installation...I swapped the lines and pip installed gracefully. This may need to changed.
Looks like the is no nvcc 9.0 available anymore, so I've updated cuda to 9.2 but now it fails to link the binary:
INFO: Analysed target //tensorflow_serving/model_servers:tensorflow_model_server (127 packages loaded).
INFO: Found 1 target...
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 7s local
[4,598 / 4,599] Linking .../model_servers/tensorflow_model_server; 8s local
ERROR: /serving/tensorflow_serving/model_servers/BUILD:270:1: Linking of rule '//tensorflow_serving/model_servers:tensorflow_model_server' failed (Exit 1)
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `std::_Function_handler<void (), tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
nccl_manager.cc:(.text._ZNSt17_Function_handlerIFvvEZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS2_10NcclStreamEEUlvE_E9_M_invokeERKSt9_Any_data+0x141): undefined reference to `ncclGetErrorString'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::LoopKernelLaunches(tensorflow::NcclManager::NcclStream*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x23d): undefined reference to `ncclAllReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x31f): undefined reference to `ncclReduce'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager18LoopKernelLaunchesEPNS0_10NcclStreamE+0x383): undefined reference to `ncclBcast'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `void std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > >::_M_realloc_insert<tensorflow::NcclManager::Communicator*>(__gnu_cxx::__normal_iterator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >*, std::vector<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> >, std::allocator<std::unique_ptr<tensorflow::NcclManager::Communicator, std::default_delete<tensorflow::NcclManager::Communicator> > > > >, tensorflow::NcclManager::Communicator*&&)':
nccl_manager.cc:(.text._ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_[_ZNSt6vectorISt10unique_ptrIN10tensorflow11NcclManager12CommunicatorESt14default_deleteIS3_EESaIS6_EE17_M_realloc_insertIJPS3_EEEvN9__gnu_cxx17__normal_iteratorIPS6_S8_EEDpOT_]+0x159): undefined reference to `ncclCommDestroy'
bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/libnccl_kernels.lo(nccl_manager.o): In function `tensorflow::NcclManager::GetCommunicator(tensorflow::NcclManager::Collective*)':
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x565): undefined reference to `ncclGetUniqueId'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x574): undefined reference to `ncclGroupStart'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x66b): undefined reference to `ncclCommInitRank'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0x6ed): undefined reference to `ncclGetErrorString'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xc73): undefined reference to `ncclCommInitAll'
nccl_manager.cc:(.text._ZN10tensorflow11NcclManager15GetCommunicatorEPNS0_10CollectiveE+0xe46): undefined reference to `ncclGroupEnd'
collect2: error: ld returned 1 exit status
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 48.712s, Critical Path: 19.38s
FAILED: Build did NOT complete successfully
My Dockerfile is pretty much the same a s https://raw.githubusercontent.com/tensorflow/serving/c8cc43b/tensorflow_serving/tools/docker/Dockerfile.devel-gpu but based on cuda 9.2 and using libpng-dev instead of libpng12-dev because my cuda 9.2 image is based on ubuntu 18.04.
I'd test with 9.1 and other ubuntu and/or serving versions, but the build is so aweful slow (>2h on my rather beefy laptop)..
@discordianfish any change you make in the dockerfile can make it not work. For example, I think latest ubuntu version doesn't have cuda 9.0, which is required to compile TF serving (because of some dirvers issues I think). So if you changed that it wont work unless you add the necessary workarounds.
Here are the official docker files: https://github.com/tensorflow/serving/tree/master/tensorflow_serving/tools/docker
For some reason this works better than the official docker file for my case. the "latest-devel-gpu" TF serving pulled from the docker repo doesn't recognize my CUDA device, whereas this one does. Even when building the official latest-devel-gpu from github.
I successfully build the images.but when I run it, cp the model to running container, and run command "/root/serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name="default" --model_base_path="/data/serving_model/"‘’
there is a error log:
2018-03-01 02:23:39.972222: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUresult(-1)
2018-03-01 02:23:39.972285: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:152] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
it looks like serving can not find the GPU?
ps:my os is ubuntu 16.04, and cuda is 9.0,cudnn is 7.0.