FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU : NOTES

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU; Sheng et al. (2023)

  1. Motivated by latency-insensitive, throughput-oriented tasks and by the heavy dependence of LLM inference on scarce accelerators.
  2. FlexGen can be run on a single commodity machine with a GPU, CPU memory, and disk.
  3. Solves a linear programming (LP) problem to search for efficient patterns to store and access tensors (a toy LP sketch follows this list).
  4. Compresses weights and the attention KV cache to 4 bits with minimal accuracy loss (fine-grained group-wise quantization).
  5. Together, these give FlexGen a much larger space of batch-size choices and significantly improve its throughput.
  6. Running OPT-175B on a single 16GB GPU, FlexGen achieves a throughput of 1 token/s for the first time.
  7. Runs the HELM benchmark with a 30B model in 21 hours.
  • GPT-175B needs 325GB of GPU memory just to load its weights, which would require at least 5 A100 (80GB) GPUs.
  • Reducing LLM inference resource requirements has recently attracted intense interest.
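As a rough illustration of the kind of placement problem item 3 refers to, the toy linear program below chooses what fraction of the weights to keep on GPU, CPU, and disk. The cost model, bandwidths, and memory budgets are made-up assumptions for this sketch, not FlexGen's actual formulation or numbers.

```python
# Illustrative only: a toy LP for placing model weights across GPU / CPU / disk,
# in the spirit of FlexGen's policy search. All numbers below are assumptions.
import numpy as np
from scipy.optimize import linprog

weights_gb  = 350.0   # assumed fp16 weight size of a 175B model
gpu_mem_gb  = 16.0    # GPU memory budget available for weights
cpu_mem_gb  = 200.0   # CPU memory budget available for weights
cpu_bw_gbs  = 12.0    # assumed CPU -> GPU bandwidth (GB/s)
disk_bw_gbs = 2.0     # assumed disk -> GPU bandwidth (GB/s)

# Decision variables: fractions of weights stored on [GPU, CPU, disk].
# Objective: minimize time to stream the non-resident weights to the GPU
# (weights already on the GPU cost nothing to load).
c = np.array([0.0,
              weights_gb / cpu_bw_gbs,
              weights_gb / disk_bw_gbs])

A_eq = [[1.0, 1.0, 1.0]]          # fractions must sum to 1
b_eq = [1.0]
A_ub = [[weights_gb, 0.0, 0.0],   # GPU memory constraint
        [0.0, weights_gb, 0.0]]   # CPU memory constraint
b_ub = [gpu_mem_gb, cpu_mem_gb]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, 1.0)] * 3, method="highs")
print("fractions on [GPU, CPU, disk]:", res.x)
print("estimated weight-loading time (s):", res.fun)
```

FlexGen's real policy also optimizes the placement of activations and the KV cache, plus batch sizes; this sketch only shows the flavor of the optimization.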

Throughput-oriented generative inference

Used for back-office tasks with a large number of tokens to process, for example:

  • Benchmarking
  • Information extraction
  • Data wrangling
  • Form processing

Reducing resource requirements for LLM inference
  1. Model compression
  2. Collaborative inference: pool memory and compute from multiple decentralized machines.
  3. Offloading: use memory from the CPU and disk (see the sketch after the note below).

Note

Approaches 1 and 2 assume the model fits into the available GPU memory.
Approach 3 (offloading) did not achieve acceptable throughput as of 2022.
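A minimal sketch of approach 3, assuming PyTorch: the full set of weights stays in pinned CPU memory, and each layer's weights are copied into a reusable GPU buffer only for the duration of its forward pass. This is purely illustrative, not FlexGen's implementation.

```python
# Illustrative layer-by-layer weight offloading with PyTorch (assumed setup):
# weights live in pinned CPU memory; one GPU "compute" layer is refilled per step.
import torch
import torch.nn as nn

hidden, n_layers = 4096, 32

# Full model weights stay in pinned CPU memory.
cpu_layers = [nn.Linear(hidden, hidden) for _ in range(n_layers)]
for layer in cpu_layers:
    for p in layer.parameters():
        p.data = p.data.pin_memory()     # pinned memory enables fast async copies

# One reusable compute layer lives on the GPU.
gpu_layer = nn.Linear(hidden, hidden).to("cuda")

@torch.no_grad()
def forward_offloaded(x: torch.Tensor) -> torch.Tensor:
    for cpu_layer in cpu_layers:
        # Stream this layer's weights CPU -> GPU, overwriting the GPU buffer.
        gpu_layer.weight.copy_(cpu_layer.weight, non_blocking=True)
        gpu_layer.bias.copy_(cpu_layer.bias, non_blocking=True)
        x = gpu_layer(x)
    return x

x = torch.randn(8, hidden, device="cuda")
y = forward_offloaded(x)
```

The cost of this naive scheme is that every layer's weights cross the PCIe bus on every forward pass, which is exactly what FlexGen's schedule and placement policy try to amortize.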


During generative inference

The following tensors are used:

  1. Weights
  2. Activations
  3. KV cache (a size estimate follows this list)
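To see why the KV cache alone can dominate memory, here is a back-of-the-envelope estimate. The OPT-175B dimensions (96 layers, hidden size 12288, fp16) are standard; the batch size and sequence lengths are example settings for this sketch.

```python
# Rough KV-cache size estimate for OPT-175B (96 layers, hidden 12288, fp16).
# Batch size and sequence lengths below are example settings.
n_layers, hidden, bytes_per_elem = 96, 12288, 2
batch, prompt_len, gen_len = 512, 512, 32

seq_len = prompt_len + gen_len
# K and V are each of shape (batch, seq_len, hidden) per layer.
kv_bytes = 2 * n_layers * batch * seq_len * hidden * bytes_per_elem
print(f"KV cache: {kv_bytes / 1e12:.2f} TB")
# On the order of a terabyte: several times larger than the 325GB of weights.
```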

What tensors to offload? Where? When? The computation can be traversed at several granularities (a loop-order sketch follows this list):

  • Batch-by-batch
  • Token-by-token
  • Layer-by-layer
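As a sketch of why the traversal order matters, the two loop orders below differ only in where the batch loop sits. `load_layer` and `run_layer` are hypothetical placeholders for fetching weights from CPU/disk and running one layer; this is a simplification of FlexGen's block schedule, not its actual implementation.

```python
# Illustrative loop orders for offloading-based generation. The work forms a
# grid over (token step, layer, batch); the traversal order determines how
# often weights must be re-fetched from CPU/disk.

def row_by_row(batches, n_tokens, n_layers, load_layer, run_layer):
    """Naive order: finish one batch at a time.
    Each layer's weights are loaded once per token per batch."""
    for batch in batches:
        for t in range(n_tokens):
            for l in range(n_layers):
                w = load_layer(l)          # CPU/disk -> GPU transfer
                run_layer(w, batch, t)

def block_schedule(batches, n_tokens, n_layers, load_layer, run_layer):
    """Block order (FlexGen-style idea): keep a layer's weights resident and
    reuse them across a whole block of batches, amortizing the transfer cost;
    in exchange, the KV cache for the whole block must be kept around."""
    for t in range(n_tokens):
        for l in range(n_layers):
            w = load_layer(l)              # loaded once per (token, layer)
            for batch in batches:          # reused across the block of batches
                run_layer(w, batch, t)
```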

Both sparsification and quantization have been adopted for LLM inference. Weights alone can be compressed to 3 bits; weights and activations together can be compressed to 8 bits. FlexGen compresses both the weights and the attention KV cache to 4 bits using fine-grained group-wise quantization (a sketch follows).
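A minimal sketch of group-wise asymmetric min/max quantization of the kind the paper describes, assuming NumPy and a group size of 64; the exact group dimension and group size used by FlexGen may differ.

```python
# Illustrative fine-grained group-wise quantization to b bits (min/max,
# asymmetric). Group size and the flattening into groups are assumptions.
import numpy as np

def quantize_groupwise(x: np.ndarray, bits: int = 4, group: int = 64):
    flat = x.reshape(-1, group)                 # each row is one group
    mn = flat.min(axis=1, keepdims=True)
    mx = flat.max(axis=1, keepdims=True)
    scale = (mx - mn) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)    # avoid division by zero
    q = np.round((flat - mn) / scale).astype(np.uint8)
    return q, scale, mn

def dequantize_groupwise(q, scale, mn, shape):
    return (q.astype(np.float32) * scale + mn).reshape(shape)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s, m = quantize_groupwise(w, bits=4, group=64)
w_hat = dequantize_groupwise(q, s, m, w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Keeping the groups small bounds the error introduced by outliers within each group, which is what allows 4-bit compression with minimal accuracy loss.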
