I previously wrote about TorchServe as a nice way to serve models.
There are a plethora of ways to optimize LLMs for inference, such as quantization, paged attention, kernel fusion, and other compilation techniques, and the list keeps expanding. There is a groundswell of demand for serving OSS LLMs: many organizations have tried OpenAI, Anthropic, or PaLM, but want a solution they can control.
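To make the first of those concrete, here is a minimal sketch of quantized loading through transformers and bitsandbytes. The model id and settings are illustrative assumptions, not something from the original post, and it assumes a recent transformers release with accelerate and bitsandbytes installed.

```python
# Hedged sketch: loading a causal LM with 4-bit quantization via bitsandbytes
# through the transformers API. The model id is just an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-6.7b"  # example model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # run compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on available GPUs
)

inputs = tokenizer("Serving OSS LLMs is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```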
This is why TorchServe is nice: you can use any framework you want in the custom handler, such as GPTQ, transformers, or vLLM, as long as there is a Python client. With TorchServe, there is enough freedom to leverage SOTA model optimization techniques while still using a production-ready inference server.
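As a sketch of what that freedom looks like, here is a hedged example of a TorchServe custom handler that wraps a Hugging Face transformers model. The class name, dtype, and generation settings are my own illustrative choices; a real handler would be packaged into a model archive with `torch-model-archiver`.

```python
# handler.py -- illustrative TorchServe custom handler wrapping a transformers
# model; any other backend (GPTQ, vLLM, ...) could be swapped in here.
import torch
from ts.torch_handler.base_handler import BaseHandler
from transformers import AutoModelForCausalLM, AutoTokenizer


class LlmHandler(BaseHandler):
    def initialize(self, context):
        # model_dir points at the extracted contents of the .mar archive
        model_dir = context.system_properties.get("model_dir")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = AutoModelForCausalLM.from_pretrained(
            model_dir, torch_dtype=torch.float16
        ).to(self.device)
        self.model.eval()
        self.initialized = True

    def preprocess(self, requests):
        # TorchServe passes a batch of requests; each payload arrives under
        # "data" or "body" and may be raw bytes.
        texts = []
        for req in requests:
            payload = req.get("data") or req.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            texts.append(payload)
        return self.tokenizer(texts, return_tensors="pt", padding=True).to(self.device)

    def inference(self, inputs):
        with torch.no_grad():
            return self.model.generate(**inputs, max_new_tokens=128)

    def postprocess(self, outputs):
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
```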
Some questions remain, such as how to distribute models across many GPUs when using those bespoke optimization techniques (which might be necessary for very large LLMs), but I think it's a good starting point.
@hamelsmu We already have multiple examples for distributed inference for large models in TorchServe. Please check out the different options here. We have a native PyTorch solution using PiPPy for pipeline parallelism, examples with HF Accelerate, DeepSpeed, DeepSpeed MII, AWS Neuron (Inferentia 2) for LLMs and Diffusion models.
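For a flavor of what multi-GPU inference can look like behind a custom handler, here is a hedged sketch using vLLM's tensor parallelism (vLLM was mentioned above as one framework a handler can wrap, though it is not one of the TorchServe examples listed in the reply). The model id, GPU count, and sampling settings are illustrative assumptions.

```python
# Hedged sketch: multi-GPU inference with vLLM's tensor parallelism, the kind
# of backend a TorchServe custom handler could delegate to.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-13b",   # example model id
    tensor_parallel_size=2,     # shard the model across 2 GPUs
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Serving open-source LLMs is"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```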