FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU; Sheng et al. (2023)
- Motivated by latency-insensitive, batch-oriented tasks and by the goal of reducing dependence on expensive accelerators.
- FlexGen can be run on a single commodity system with CPU, GPU, and disk.
- Solves a linear programming (LP) problem to search for efficient patterns of storing and accessing tensors across GPU, CPU, and disk (a toy cost-model sketch follows this list).
- Compresses weights and attention cache to 4 bits with minimal accuracy loss (fine-grained group-wise quantization).
- Together, these techniques give FlexGen a much larger space of feasible batch sizes, which significantly improves throughput.
- Running OPT-175B on a 16GB GPU, FlexGen achieves 1 token/s throughput for the first time.
- Runs the HELM benchmark with a 30B model in 21 hours.
- GPT-175B needs 325GB of GPU memory just to load its weights, which would require at least five 80GB A100 GPUs.
- Reducing LLM inference resource requirements has recently attracted intense interest.
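A toy illustration of the LP-style policy search (the tensor sizes, bandwidths, and the restriction to weights and KV cache below are assumptions for illustration; FlexGen's real cost model also covers activations, block size, and micro-batch size):

```python
# Toy sketch of an LP-based placement search (hypothetical numbers; not
# FlexGen's actual cost model).
import numpy as np
from scipy.optimize import linprog

# Sizes in GB (illustrative values for a large model on a small GPU).
W, C = 60.0, 30.0            # weight size, KV-cache size
GPU_MEM, CPU_MEM = 16.0, 200.0
CPU_BW, DISK_BW = 12.0, 1.0  # GB/s transfer bandwidth to the GPU

# Variables: fractions [w_gpu, w_cpu, w_disk, c_gpu, c_cpu, c_disk].
# Objective: minimize estimated per-iteration I/O time for tensors not on GPU.
cost = np.array([
    0.0, W / CPU_BW, W / DISK_BW,   # moving weights from CPU / disk
    0.0, C / CPU_BW, C / DISK_BW,   # moving KV cache from CPU / disk
])

# Each tensor's placement fractions must sum to 1.
A_eq = np.array([
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
])
b_eq = np.array([1.0, 1.0])

# Memory capacity constraints for GPU and CPU.
A_ub = np.array([
    [W, 0, 0, C, 0, 0],   # GPU:  w_gpu*W + c_gpu*C <= GPU_MEM
    [0, W, 0, 0, C, 0],   # CPU:  w_cpu*W + c_cpu*C <= CPU_MEM
])
b_ub = np.array([GPU_MEM, CPU_MEM])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 6, method="highs")
print("placement fractions:", np.round(res.x, 3))
print("estimated I/O time per iteration (s):", round(res.fun, 2))
```

The solver fills the GPU first (zero transfer cost) and then prefers the CPU over disk for whatever does not fit, which is the qualitative behavior a placement search is after.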
Used for back-office tasks with a large number of tokens to process:
- Benchmarking
- Information extraction
- Data wrangling
- Form processing
Three existing directions for reducing resource requirements:
1. Model compression - decrease the total memory footprint
2. Collaborative inference - amortize inference cost via decentralization across multiple machines
3. Offloading - use memory from CPU and disk
Note
1 and 2 assume the model fits into GPU memory.
3 (offloading) did not achieve acceptable throughput on a single GPU as of 2022, largely due to inefficient I/O scheduling and tensor placement.
The following tensors must be handled during inference (a rough footprint estimate is sketched after this list):
- Weights
- Activations
- KV cache
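A rough back-of-the-envelope footprint check (the model hyperparameters are OPT-175B's published values; the batch size and sequence length are arbitrary examples, not FlexGen's settings):

```python
# Rough FP16 memory footprint for OPT-175B weights and KV cache.
BYTES_FP16 = 2
GB = 1 << 30

n_params   = 175e9          # total parameters
n_layers   = 96
hidden_dim = 12288
batch      = 64             # example effective batch size
seq_len    = 512 + 32       # example prompt + generated tokens

weight_bytes = n_params * BYTES_FP16
# Per token and layer, the KV cache stores one key and one value vector.
kv_bytes = 2 * batch * seq_len * n_layers * hidden_dim * BYTES_FP16

print(f"weights:  {weight_bytes / GB:.0f} GiB")   # ~326 GiB (the 325GB figure above)
print(f"KV cache: {kv_bytes / GB:.0f} GiB")
```

This reproduces the ~325GB weight figure and shows that at large batch sizes the KV cache alone becomes a major memory consumer alongside the weights.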
Which tensors to offload? Where (GPU, CPU, or disk)? When? The traversal order of the computation also matters (see the overlap sketch after this list):
- Batch-by-batch
- Token-by-token
- Layer-by-layer
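A minimal sketch of the layer-by-layer traversal with compute/I/O overlap: weights live in pinned CPU memory and the next layer's weights are prefetched on a separate CUDA stream while the current layer computes. It assumes a CUDA GPU and made-up layer sizes, and it is not FlexGen's actual scheduler (which also overlaps KV-cache and activation transfers and uses a zig-zag block order over batches and layers):

```python
# Minimal layer-by-layer weight-offloading sketch with compute/I/O overlap
# (assumes a CUDA GPU; sizes are illustrative, not FlexGen's scheduler).
import torch

hidden, n_layers, batch = 4096, 8, 16

# Weights stay in pinned CPU memory so copies to the GPU can run asynchronously.
cpu_weights = [torch.randn(hidden, hidden).pin_memory() for _ in range(n_layers)]

copy_stream = torch.cuda.Stream()                # dedicated stream for prefetching
x = torch.randn(batch, hidden, device="cuda")

# Prefetch the first layer's weights before the loop starts.
with torch.cuda.stream(copy_stream):
    w_next = cpu_weights[0].to("cuda", non_blocking=True)

for i in range(n_layers):
    # Wait for the prefetch of layer i, then immediately start prefetching
    # layer i+1 on the copy stream while layer i computes on the default stream.
    torch.cuda.current_stream().wait_stream(copy_stream)
    w_cur = w_next
    w_cur.record_stream(torch.cuda.current_stream())   # consumed on this stream
    if i + 1 < n_layers:
        with torch.cuda.stream(copy_stream):
            w_next = cpu_weights[i + 1].to("cuda", non_blocking=True)

    x = torch.relu(x @ w_cur)                    # stand-in for the layer's compute

torch.cuda.synchronize()
print(x.shape)                                   # torch.Size([16, 4096])
```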
Both sparsification and quantization have been adopted for LLM inference: prior work compresses weights alone to 3 bits, or weights and activations to 8 bits. FlexGen compresses both the weights and the attention (KV) cache to 4 bits with fine-grained group-wise quantization and negligible accuracy loss.
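A minimal sketch of fine-grained group-wise quantization (the group size and the asymmetric min/max scheme are illustrative assumptions, not necessarily FlexGen's exact choices, and the 4-bit values are kept in uint8 rather than packed two per byte):

```python
# Minimal group-wise asymmetric 4-bit quantization sketch (group size and the
# min/max scheme are illustrative choices).
import numpy as np

def quantize_groupwise(x, n_bits=4, group_size=64):
    """Quantize a 1-D float array in contiguous groups of `group_size`."""
    levels = 2 ** n_bits - 1
    x = x.reshape(-1, group_size)                     # one row per group
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / levels
    scale[scale == 0] = 1.0                           # avoid division by zero
    q = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo                               # ints + per-group metadata

def dequantize_groupwise(q, scale, lo):
    return (q.astype(np.float32) * scale + lo).reshape(-1)

w = np.random.randn(4096 * 64).astype(np.float32)     # fake weight tensor
q, scale, lo = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, lo)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Each group stores its own scale and zero point, which is what keeps the accuracy loss small compared with quantizing an entire tensor with a single scale.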