FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU : NOTES

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU; Sheng et al. (2023)

  1. Motivated by latency-insensitive tasks and by the heavy dependence of LLM inference on expensive accelerators.
  2. FlexGen can be run on a single commodity system with a CPU, a GPU, and disk.
  3. Solves a linear programming (LP) problem to search for efficient patterns of storing and accessing tensors (see the sketch after this list).
  4. Compresses weights and attention cache to 4 bits with minimal accuracy loss (fine-grained group-wise quantization).
  5. These techniques give FlexGen a larger space of batch-size choices and significantly improve its throughput.
  6. Running OPT-175B on a 16GB GPU, FlexGen achieves 1 token/s generation throughput for the first time.
  7. Runs the HELM benchmark with a 30B model in 21 hours.
  • GPT-175B needs 325GB of GPU memory just to load its weights; this would require at least five A100 (80GB) GPUs.
  • Reducing the resource requirements of LLM inference has recently attracted intense interest.
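
A minimal sketch of what such an LP-based placement search could look like. The tensor sizes, device capacities, and bandwidths below are illustrative assumptions, not FlexGen's actual cost model: the LP chooses what fraction of the weights, KV cache, and activations to keep on the GPU, CPU, and disk so that the estimated time spent moving data to the GPU is minimized, subject to capacity constraints.

```python
# Toy LP in the spirit of FlexGen's policy search (illustrative numbers only).
# Variables: fraction of weights (w), KV cache (c), activations (a)
# placed on GPU, CPU, or disk.
import numpy as np
from scipy.optimize import linprog

SIZES = {"w": 325.0, "c": 144.0, "a": 10.0}      # total GB per tensor class (assumed)
CAP   = {"gpu": 16.0, "cpu": 200.0, "disk": 1e4}  # device capacities in GB (assumed)
BW    = {"cpu": 12.0, "disk": 2.0}                # GB/s transfer bandwidth to GPU (assumed)

# Variable order: [w_gpu, w_cpu, w_disk, c_gpu, c_cpu, c_disk, a_gpu, a_cpu, a_disk]
names = [f"{t}_{d}" for t in ("w", "c", "a") for d in ("gpu", "cpu", "disk")]

# Objective: minimize estimated seconds spent moving data to the GPU.
cost = []
for t in ("w", "c", "a"):
    cost += [0.0, SIZES[t] / BW["cpu"], SIZES[t] / BW["disk"]]

# Each tensor's placement fractions must sum to 1.
A_eq = np.zeros((3, 9))
for i in range(3):
    A_eq[i, 3 * i:3 * i + 3] = 1.0
b_eq = np.ones(3)

# Bytes placed on each device must fit within its capacity.
A_ub = np.zeros((3, 9))
for j, t in enumerate(("w", "c", "a")):
    for k in range(3):                 # k = 0: GPU, 1: CPU, 2: disk
        A_ub[k, 3 * j + k] = SIZES[t]
b_ub = np.array([CAP["gpu"], CAP["cpu"], CAP["disk"]])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(dict(zip(names, res.x.round(3))))
```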

Throughput-oriented generative inference

Used for back-office tasks that process large numbers of tokens and are not sensitive to latency.

  • Benchmarking
  • Information extraction
  • Data wrangling
  • Form processing

Reducing resource requirements for LLM inference
  1. Model compression
  2. Collaborative inference - distribute the model across multiple decentralized devices
  3. Offloading - use memory from CPU and disk

Note

Approaches 1 and 2 assume the model fits into GPU memory.
Approach 3 (offloading), as of 2022, does not achieve acceptable throughput on a single GPU.


During generative inference

The following tensors are used:

  1. Weights
  2. Activations
  3. KV cache (see the size estimate after this list)
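
A back-of-the-envelope estimate of why the KV cache dominates memory at large batch sizes. The OPT-175B dimensions (96 layers, hidden size 12288) are the published configuration; the batch size and sequence length below are illustrative assumptions.

```python
def kv_cache_bytes(num_layers, hidden_dim, batch_size, seq_len, bytes_per_elem=2):
    # Keys and values (hence the factor 2) are cached for every layer,
    # every sequence position, and every hidden dimension.
    return 2 * num_layers * batch_size * seq_len * hidden_dim * bytes_per_elem

# OPT-175B dimensions with fp16 storage; batch/sequence sizes are assumptions.
size = kv_cache_bytes(num_layers=96, hidden_dim=12288, batch_size=64, seq_len=1024)
print(f"KV cache: {size / 1e9:.0f} GB")  # ~309 GB, far beyond a 16 GB GPU
```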

What tensors to offload? Where? When?

  • Batch-by-batch
  • Token-by-token
  • Layer-by-layer (see the prefetch sketch after this list)
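
A minimal sketch of layer-by-layer execution with prefetching, a simplified stand-in for an offloading schedule (not FlexGen's actual scheduler): while the GPU computes layer i, the weights of layer i+1 are loaded from CPU or disk in the background. The `load_weights` and `compute_layer` callables are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layer_by_layer(layers, hidden, load_weights, compute_layer):
    """Compute one batch layer by layer, overlapping weight I/O with compute.

    load_weights(layer) and compute_layer(layer, weights, hidden) are
    hypothetical callables standing in for CPU/disk reads and GPU kernels.
    """
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_weights, layers[0])              # prefetch layer 0
        for i, layer in enumerate(layers):
            weights = pending.result()                            # wait for this layer's weights
            if i + 1 < len(layers):
                pending = io.submit(load_weights, layers[i + 1])  # overlap next load
            hidden = compute_layer(layer, weights, hidden)        # compute on GPU
    return hidden
```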

Both sparsification and quantization have been adopted for inference. Weights alone can be compressed to 3 bits; weights and activations together can be compressed to 8 bits. FlexGen compresses both the weights and the attention (KV) cache to 4 bits.
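
A minimal numpy sketch of fine-grained group-wise 4-bit quantization, using asymmetric min/max quantization per group; the group size of 64 is an assumption for illustration.

```python
import numpy as np

def quantize_groupwise(x, bits=4, group_size=64):
    # Split the flattened tensor into contiguous groups and quantize each
    # group with its own (min, scale) pair - asymmetric min/max quantization.
    groups = x.reshape(-1, group_size)
    mn = groups.min(axis=1, keepdims=True)
    mx = groups.max(axis=1, keepdims=True)
    scale = np.maximum((mx - mn) / (2 ** bits - 1), 1e-12)
    q = np.round((groups - mn) / scale).astype(np.uint8)
    return q, scale, mn

def dequantize_groupwise(q, scale, mn, shape):
    return (q.astype(np.float32) * scale + mn).reshape(shape)

w = np.random.randn(128, 128).astype(np.float32)
q, scale, mn = quantize_groupwise(w)
err = np.abs(dequantize_groupwise(q, scale, mn, w.shape) - w).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```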
