@hamelsmu
Last active August 24, 2024 04:07
Why `torchserve` is great in the age of LLMs

I previously wrote about TorchServe as a nice way to serve models.

There is a plethora of ways to optimize LLMs for inference, such as quantization, PagedAttention, kernel fusion, and other compilation techniques, and the list keeps expanding as demand for serving OSS LLMs grows. There is a groundswell of demand for serving OSS LLMs because many organizations have tried OpenAI/Anthropic/PaLM but want a solution they can control.

This is why TorchServe is nice: you can use any framework you want in the custom handler, such as GPTQ, transformers, vLLM, etc. (as long as there is a Python client). With TorchServe, there is enough freedom to leverage SOTA model optimization techniques while still using a production-ready inference server.
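To make that concrete, here is a minimal sketch of a custom handler that wraps a Hugging Face transformers model; the same structure works if you swap in GPTQ, vLLM, or another Python inference client. The class name and generation settings are illustrative, and the packaging step is omitted.

```python
# handler.py -- a minimal sketch of a TorchServe custom handler wrapping a
# Hugging Face transformers causal LM. Assumes the model and tokenizer files
# are packaged in the model archive (model_dir).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler


class LLMHandler(BaseHandler):
    def initialize(self, context):
        # model_dir points at the extracted contents of the .mar archive
        model_dir = context.system_properties.get("model_dir")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        if self.tokenizer.pad_token is None:
            # Many causal LM tokenizers ship without a pad token
            self.tokenizer.pad_token = self.tokenizer.eos_token

        self.model = AutoModelForCausalLM.from_pretrained(
            model_dir, torch_dtype=torch.float16
        ).to(self.device)
        self.model.eval()
        self.initialized = True

    def preprocess(self, requests):
        # TorchServe hands us a list of requests; payloads sit under "data" or "body"
        texts = []
        for req in requests:
            data = req.get("data") or req.get("body")
            if isinstance(data, (bytes, bytearray)):
                data = data.decode("utf-8")
            texts.append(data)
        return self.tokenizer(texts, return_tensors="pt", padding=True).to(self.device)

    def inference(self, inputs):
        with torch.no_grad():
            return self.model.generate(**inputs, max_new_tokens=128)

    def postprocess(self, outputs):
        # Return one decoded string per request in the batch
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

The handler file is then pointed to when building the model archive with `torch-model-archiver --handler handler.py ...` and served with `torchserve --start`, so the serving infrastructure stays the same no matter which inference library the handler calls into.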

Some questions remain, such as how to distribute models across many GPUs when using those bespoke optimization techniques (which might be necessary for very large LLMs), but I think it's a good starting point.

@chauhang
chauhang commented Aug 1, 2023

@hamelsmu We already have multiple examples of distributed inference for large models in TorchServe. Please check out the different options here. We have a native PyTorch solution using PiPPy for pipeline parallelism, plus examples with HF Accelerate, DeepSpeed, DeepSpeed MII, and AWS Neuron (Inferentia 2) for LLMs and diffusion models.
