- CPU: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
- GPU: NVIDIA V100
- Memory: 251GiB
- OS: Ubuntu 16.04.6 LTS (Xenial Xerus)
Docker Images:
- tensorflow/tensorflow:latest-gpu
- tensorflow/serving:latest-gpu
- nvcr.io/nvidia/tensorrtserver:19.10-py3
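For reference, the two serving containers were presumably launched along these lines (paths, ports, and the model name `resnet50` are placeholders, not taken from the benchmark itself; adjust flags to your Docker/NVIDIA runtime setup):

```bash
# TensorFlow Serving with GPU support (gRPC on 8500, REST on 8501).
docker run --runtime=nvidia -p 8500:8500 -p 8501:8501 \
  -v /path/to/resnet50_savedmodel:/models/resnet50 \
  -e MODEL_NAME=resnet50 \
  tensorflow/serving:latest-gpu

# TensorRT Inference Server (Triton) 19.10: HTTP 8000, gRPC 8001, metrics 8002.
docker run --runtime=nvidia --shm-size=1g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tensorrtserver:19.10-py3 \
  trtserver --model-repository=/models
```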
| Framework | Model | Model Type | Images | Batch size | Total time (s) |
|---|---|---|---|---|---|
| TensorFlow | ResNet50 | TF SavedModel | 32000 | 32 | 83.189 |
| TensorFlow | ResNet50 | TF SavedModel | 32000 | 10 | 86.897 |
| TensorFlow Serving | ResNet50 | TF SavedModel | 32000 | 32 | 120.496 |
| TensorFlow Serving | ResNet50 | TF SavedModel | 32000 | 10 | 116.887 |
| Triton (TensorRT Inference Server) | ResNet50 | TF SavedModel | 32000 | 32 | 201.855 |
| Triton (TensorRT Inference Server) | ResNet50 | TF SavedModel | 32000 | 10 | 171.056 |
| Falcon + msgpack + TensorFlow | ResNet50 | TF SavedModel | 32000 | 32 | 115.686 |
| Falcon + msgpack + TensorFlow | ResNet50 | TF SavedModel | 32000 | 10 | 115.572 |
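The plain-TensorFlow rows correspond to an in-process timing loop roughly like the sketch below (Keras ResNet50 stands in here for the exported SavedModel; input shape, preprocessing, and batching are assumptions, so the numbers will not match exactly):

```python
import time
import numpy as np
import tensorflow as tf

# Stand-in model; the benchmark presumably loaded an exported ResNet50 SavedModel.
model = tf.keras.applications.ResNet50(weights=None)

num_images = 32000
batch_size = 32
# Synthetic batch just to drive the timing loop.
batch = np.random.rand(batch_size, 224, 224, 3).astype(np.float32)

start = time.time()
for _ in range(num_images // batch_size):
    model(batch, training=False)  # synchronous in-process inference
elapsed = time.time() - start
print(f"{num_images} images, batch={batch_size}: {elapsed:.3f}s")
```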
This suggests TensorFlow Serving has the better web server layer. Have you considered using asynchronous, non-blocking inference requests?
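To illustrate the async idea: against TF Serving's REST predict endpoint, several requests can be kept in flight at once instead of blocking on each response before sending the next batch. A minimal sketch (endpoint, model name, and concurrency level are assumptions):

```python
import asyncio
import aiohttp
import numpy as np

# Hypothetical endpoint; TF Serving's REST predict API is
# http://<host>:8501/v1/models/<model>:predict
URL = "http://localhost:8501/v1/models/resnet50:predict"

async def predict(session, batch):
    # Fire one non-blocking predict request for a single batch.
    async with session.post(URL, json={"instances": batch.tolist()}) as resp:
        return await resp.json()

async def main():
    batch = np.random.rand(10, 224, 224, 3).astype(np.float32)
    async with aiohttp.ClientSession() as session:
        # Keep 8 requests in flight concurrently (concurrency level is arbitrary here).
        tasks = [predict(session, batch) for _ in range(8)]
        results = await asyncio.gather(*tasks)
        print(len(results), "batches done")

asyncio.run(main())
```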