I previously wrote about TorchServe as a nice way to serve models.
There are a plethora of ways to optimize LLMs for inference, such as quantization, paged attention, kernel fusion, and other compilation techniques, and the list keeps expanding. There is a groundswell of demand for serving OSS LLMs: many organizations have tried OpenAI, Anthropic, or PaLM, but want a solution they can control.
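To make the first of those concrete, here is a minimal sketch of quantized loading through transformers and bitsandbytes. The model id and settings are illustrative assumptions, not something from the original post, and it assumes a recent transformers release with accelerate and bitsandbytes installed.

```python
# Hedged sketch: loading a causal LM with 4-bit quantization via bitsandbytes
# through the transformers API. The model id is just an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-6.7b"  # example model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # run compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on available GPUs
)

inputs = tokenizer("Serving OSS LLMs is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```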
This is why TorchServe is nice: you can use any framework you want in the custom handler, such as GPTQ, transformers, or vLLM, as long as there is a Python client. With TorchServe, there is enough freedom to leverage SOTA model optimization techniques while still using a production-ready inference server.
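As a sketch of what that freedom looks like, here is a hedged example of a TorchServe custom handler that wraps a Hugging Face transformers model. The class name, dtype, and generation settings are my own illustrative choices; a real handler would be packaged into a model archive with `torch-model-archiver`.

```python
# handler.py -- illustrative TorchServe custom handler wrapping a transformers
# model; any other backend (GPTQ, vLLM, ...) could be swapped in here.
import torch
from ts.torch_handler.base_handler import BaseHandler
from transformers import AutoModelForCausalLM, AutoTokenizer


class LlmHandler(BaseHandler):
    def initialize(self, context):
        # model_dir points at the extracted contents of the .mar archive
        model_dir = context.system_properties.get("model_dir")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = AutoModelForCausalLM.from_pretrained(
            model_dir, torch_dtype=torch.float16
        ).to(self.device)
        self.model.eval()
        self.initialized = True

    def preprocess(self, requests):
        # TorchServe passes a batch of requests; each payload arrives under
        # "data" or "body" and may be raw bytes.
        texts = []
        for req in requests:
            payload = req.get("data") or req.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            texts.append(payload)
        return self.tokenizer(texts, return_tensors="pt", padding=True).to(self.device)

    def inference(self, inputs):
        with torch.no_grad():
            return self.model.generate(**inputs, max_new_tokens=128)

    def postprocess(self, outputs):
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
```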
Some questions remain, such as how to distribute models across many GPUs when using those bespoke optimization techniques (which might be necessary for very large LLMs), but I think it's a good starting point.
@hamelsmu We already have multiple examples for distributed inference for large models in TorchServe. Please check out the different options here. We have a native PyTorch solution using PiPPy for pipeline parallelism, examples with HF Accelerate, DeepSpeed, DeepSpeed MII, AWS Neuron (Inferentia 2) for LLMs and Diffusion models.
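For a flavor of what multi-GPU inference can look like behind a custom handler, here is a hedged sketch using vLLM's tensor parallelism (vLLM was mentioned above as one framework a handler can wrap, though it is not one of the TorchServe examples listed in the reply). The model id, GPU count, and sampling settings are illustrative assumptions.

```python
# Hedged sketch: multi-GPU inference with vLLM's tensor parallelism, the kind
# of backend a TorchServe custom handler could delegate to.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-13b",   # example model id
    tensor_parallel_size=2,     # shard the model across 2 GPUs
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Serving open-source LLMs is"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```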