Hardware:
- CPU: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
- GPU: NVIDIA V100
- Memory: 251GiB
- OS: Ubuntu 16.04.6 LTS (Xenial Xerus)

Docker Images:
- tensorflow/tensorflow:latest-gpu
- tensorflow/serving:latest-gpu
- nvcr.io/nvidia/tensorrtserver:19.10-py3

Framework | Model | Model Type | Images | Batch Size | Time (s) |
---|---|---|---|---|---|
TensorFlow | ResNet50 | TF SavedModel | 32000 | 32 | 83.189 |
TensorFlow | ResNet50 | TF SavedModel | 32000 | 10 | 86.897 |
TensorFlow Serving | ResNet50 | TF SavedModel | 32000 | 32 | 120.496 |
TensorFlow Serving | ResNet50 | TF SavedModel | 32000 | 10 | 116.887 |
Triton (TensorRT Inference Server) | ResNet50 | TF SavedModel | 32000 | 32 | 201.855 |
Triton (TensorRT Inference Server) | ResNet50 | TF SavedModel | 32000 | 10 | 171.056 |
Falcon + msgpack + TensorFlow | ResNet50 | TF SavedModel | 32000 | 32 | 115.686 |
Falcon + msgpack + TensorFlow | ResNet50 | TF SavedModel | 32000 | 10 | 115.572 |
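For context, the in-process "TensorFlow" rows boil down to a batched inference loop. Below is a minimal sketch of such a loop; it is not the harness used for the table. It assumes the Keras ResNet50 definition with random weights in place of the exported SavedModel, and random tensors in place of real images, so it only measures raw inference throughput.

```python
# Minimal sketch of a batched timing loop (not the exact benchmark harness).
# Assumptions: Keras ResNet50 with random weights stands in for the exported
# SavedModel, and random 224x224x3 tensors stand in for real images.
import time

import numpy as np
import tensorflow as tf

NUM_IMAGES = 32000
BATCH_SIZE = 32

model = tf.keras.applications.ResNet50(weights=None)
batch = tf.constant(np.random.rand(BATCH_SIZE, 224, 224, 3).astype(np.float32))

# Warm-up run so one-time graph/cuDNN setup is not counted in the measurement.
model(batch, training=False)

start = time.perf_counter()
for _ in range(NUM_IMAGES // BATCH_SIZE):
    model(batch, training=False)
elapsed = time.perf_counter() - start

print(f"{NUM_IMAGES} images, batch {BATCH_SIZE}: {elapsed:.3f}s "
      f"({NUM_IMAGES / elapsed:.1f} images/s)")
```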
I recommend using perf_analyzer (https://github.com/triton-inference-server/server/blob/master/docs/perf_analyzer.md) to continue this study, and plotting the resulting CSV as described in https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/optimization.html#visualizing-latency-vs-throughput to understand Triton's performance. I recently did some perf analysis on Triton with TensorRT (TRT) optimization, and for ResNet50 we are seeing about 3k images/s at ~20 ms latency.
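As a rough starting point, a perf_analyzer run against a Triton-hosted ResNet50 might look like the sketch below. The model name and endpoint are placeholders, and flag names should be checked against the linked docs for your Triton release (older releases ship the tool as perf_client).

```bash
# Hypothetical invocation; model name, endpoint, and ranges are placeholders.
perf_analyzer -m resnet50 -u localhost:8001 -i grpc \
  -b 32 --concurrency-range 1:8 \
  --percentile=95 -f resnet50_perf.csv
```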