FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU : NOTES

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU; Sheng et al. (2023)

  1. Motivated by latency-insensitive tasks and by the heavy dependence of LLM inference on expensive accelerators.
  2. FlexGen can be run on a single commodity system with a CPU, a GPU, and disk.
  3. Solves a linear programming (LP) problem to search for efficient patterns of storing and accessing tensors (see the sketch after this list).
  4. Compresses weights and attention cache to 4 bits with minimal accuracy loss (fine-grained group-wise quantization).
  5. These techniques give FlexGen a larger space of batch-size choices and significantly improve its throughput.
  6. Running OPT-175B on a 16GB GPU, FlexGen achieves 1 token/s generation throughput for the first time.
  7. Runs the HELM benchmark with a 30B model in 21 hours.
  • GPT-175B needs 325GB of GPU memory just to load its weights; this would require at least five A100 (80GB) GPUs.
  • Reducing the resource requirements of LLM inference has recently attracted intense interest.
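
A minimal sketch of what such an LP-based placement search could look like. The tensor sizes, device capacities, and bandwidths below are illustrative assumptions, not FlexGen's actual cost model: the LP chooses what fraction of the weights, KV cache, and activations to keep on the GPU, CPU, and disk so that the estimated time spent moving data to the GPU is minimized, subject to capacity constraints.

```python
# Toy LP in the spirit of FlexGen's policy search (illustrative numbers only).
# Variables: fraction of weights (w), KV cache (c), activations (a)
# placed on GPU, CPU, or disk.
import numpy as np
from scipy.optimize import linprog

SIZES = {"w": 325.0, "c": 144.0, "a": 10.0}      # total GB per tensor class (assumed)
CAP   = {"gpu": 16.0, "cpu": 200.0, "disk": 1e4}  # device capacities in GB (assumed)
BW    = {"cpu": 12.0, "disk": 2.0}                # GB/s transfer bandwidth to GPU (assumed)

# Variable order: [w_gpu, w_cpu, w_disk, c_gpu, c_cpu, c_disk, a_gpu, a_cpu, a_disk]
names = [f"{t}_{d}" for t in ("w", "c", "a") for d in ("gpu", "cpu", "disk")]

# Objective: minimize estimated seconds spent moving data to the GPU.
cost = []
for t in ("w", "c", "a"):
    cost += [0.0, SIZES[t] / BW["cpu"], SIZES[t] / BW["disk"]]

# Each tensor's placement fractions must sum to 1.
A_eq = np.zeros((3, 9))
for i in range(3):
    A_eq[i, 3 * i:3 * i + 3] = 1.0
b_eq = np.ones(3)

# Bytes placed on each device must fit within its capacity.
A_ub = np.zeros((3, 9))
for j, t in enumerate(("w", "c", "a")):
    for k in range(3):                 # k = 0: GPU, 1: CPU, 2: disk
        A_ub[k, 3 * j + k] = SIZES[t]
b_ub = np.array([CAP["gpu"], CAP["cpu"], CAP["disk"]])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(dict(zip(names, res.x.round(3))))
```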

Throughput-oriented generative inference

Used for back-office tasks that process large numbers of tokens and are not sensitive to latency.

  • Benchmarking
  • Information extraction
  • Data wrangling
  • Form processing

Reducing resource requirements for LLM inference
  1. Model compression
  2. Collaborative inference - distribute the model across multiple decentralized devices
  3. Offloading - use memory from CPU and disk

Note

Approaches 1 and 2 assume the model fits into GPU memory.
Approach 3 (offloading), as of 2022, does not achieve acceptable throughput on a single GPU.


During generative inference

The following tensors are used:

  1. Weights
  2. Activations
  3. KV cache (see the size estimate after this list)
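
A back-of-the-envelope estimate of why the KV cache dominates memory at large batch sizes. The OPT-175B dimensions (96 layers, hidden size 12288) are the published configuration; the batch size and sequence length below are illustrative assumptions.

```python
def kv_cache_bytes(num_layers, hidden_dim, batch_size, seq_len, bytes_per_elem=2):
    # Keys and values (hence the factor 2) are cached for every layer,
    # every sequence position, and every hidden dimension.
    return 2 * num_layers * batch_size * seq_len * hidden_dim * bytes_per_elem

# OPT-175B dimensions with fp16 storage; batch/sequence sizes are assumptions.
size = kv_cache_bytes(num_layers=96, hidden_dim=12288, batch_size=64, seq_len=1024)
print(f"KV cache: {size / 1e9:.0f} GB")  # ~309 GB, far beyond a 16 GB GPU
```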

What tensors to offload? Where? When?

  • Batch-by-batch
  • Token-by-token
  • Layer-by-layer (see the prefetch sketch after this list)
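
A minimal sketch of layer-by-layer execution with prefetching, a simplified stand-in for an offloading schedule (not FlexGen's actual scheduler): while the GPU computes layer i, the weights of layer i+1 are loaded from CPU or disk in the background. The `load_weights` and `compute_layer` callables are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layer_by_layer(layers, hidden, load_weights, compute_layer):
    """Compute one batch layer by layer, overlapping weight I/O with compute.

    load_weights(layer) and compute_layer(layer, weights, hidden) are
    hypothetical callables standing in for CPU/disk reads and GPU kernels.
    """
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_weights, layers[0])              # prefetch layer 0
        for i, layer in enumerate(layers):
            weights = pending.result()                            # wait for this layer's weights
            if i + 1 < len(layers):
                pending = io.submit(load_weights, layers[i + 1])  # overlap next load
            hidden = compute_layer(layer, weights, hidden)        # compute on GPU
    return hidden
```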

Both sparsification and quantization have been adopted for inference. Weights alone can be compressed to 3 bits; weights and activations together can be compressed to 8 bits. FlexGen compresses both the weights and the attention (KV) cache to 4 bits.
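
A minimal numpy sketch of fine-grained group-wise 4-bit quantization, using asymmetric min/max quantization per group; the group size of 64 is an assumption for illustration.

```python
import numpy as np

def quantize_groupwise(x, bits=4, group_size=64):
    # Split the flattened tensor into contiguous groups and quantize each
    # group with its own (min, scale) pair - asymmetric min/max quantization.
    groups = x.reshape(-1, group_size)
    mn = groups.min(axis=1, keepdims=True)
    mx = groups.max(axis=1, keepdims=True)
    scale = np.maximum((mx - mn) / (2 ** bits - 1), 1e-12)
    q = np.round((groups - mn) / scale).astype(np.uint8)
    return q, scale, mn

def dequantize_groupwise(q, scale, mn, shape):
    return (q.astype(np.float32) * scale + mn).reshape(shape)

w = np.random.randn(128, 128).astype(np.float32)
q, scale, mn = quantize_groupwise(w)
err = np.abs(dequantize_groupwise(q, scale, mn, w.shape) - w).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```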
