```
├── .github
│   └── pull_request_template.md
├── README.md
├── docs
│   ├── Acknowledgement.md
│   ├── CONTRIBUTING.md
│   └── License.md
├── examples
│   ├── huggingface
│   │   └── README.md
│   ├── lightning
│   │   └── README.md
│   └── medusa
│       └── README.md
└── src
    └── liger_kernel
        └── chunked_loss
            └── README.md
```

## Summary

## Testing Done

- Hardware Type:
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

# Liger Kernel: Efficient Triton Kernels for LLM Training

## Latest News 🔥

- [2024/12/11] We release v0.5.0: 80% more memory-efficient post-training losses (DPO, ORPO, CPO, etc.)!
- [2024/12/5] We release the LinkedIn Engineering Blog - Liger-Kernel: Empowering an open source ecosystem of Triton Kernels for Efficient LLM Training
- [2024/11/6] We release v0.4.0: full AMD support, Tech Report, Modal CI, Llama-3.2-Vision!
- [2024/10/21] We have released the tech report of Liger Kernel on arXiv: https://arxiv.org/pdf/2410.10989
- [2024/9/6] We release v0.2.1 (X post). 2500+ stars, 10+ new contributors, 50+ PRs, 50k downloads in two weeks!
- [2024/8/31] CUDA MODE talk, Liger-Kernel: Real-world Triton kernel for LLM Training, Slides
- [2024/8/23] Official release: check out our X post

Liger Kernel is a collection of Triton kernels designed specifically for LLM training. We have implemented Hugging Face-compatible `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, and more to come. The kernels work out of the box with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed. We welcome contributions from the community to gather the best kernels for LLM training.

We've also added optimized post-training kernels that deliver up to 80% memory savings for alignment and distillation tasks. We support losses like DPO, CPO, ORPO, SimPO, JSD, and many more. Check out how we optimize the memory.

## Supercharge Your Model with Liger Kernel

With one line of code, Liger Kernel can increase throughput by more than 20% and reduce memory usage by 60%, thereby enabling longer context lengths, larger batch sizes, and massive vocabularies.

| Speed Up | Memory Reduction |
|--------------------------|-------------------------|
| (benchmark chart) | (benchmark chart) |

> Note:
> - Benchmark conditions: LLaMA 3-8B, Batch Size = 8, Data Type = `bf16`, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.
> - Hugging Face models start to OOM at a 4K context length, whereas Hugging Face + Liger Kernel scales up to 16K.

## Optimize Post Training with Liger Kernel

We provide optimized post-training kernels like DPO, ORPO, SimPO, and more, which can reduce memory usage by up to 80%. You can easily use them as Python modules.

```python
from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss

orpo_loss = LigerFusedLinearORPOLoss()
y = orpo_loss(lm_head.weight, x, target)
```

## Examples

| Use Case | Description |
|------------------------------------------------|---------------------------------------------------------------------------------------------------|
| Hugging Face Trainer | Train LLaMA 3-8B ~20% faster with over 40% memory reduction on the Alpaca dataset using 4 A100s with FSDP |
| Lightning Trainer | Increase throughput by 15% and reduce memory usage by 40% with LLaMA 3-8B on the MMLU dataset using 8 A100s with DeepSpeed ZeRO3 |
| Medusa Multi-head LLM (Retraining Phase) | Reduce memory usage by 80% with 5 LM heads and improve throughput by 40% using 8 A100s with FSDP |
| Vision-Language Model SFT | Finetune Qwen2-VL on image-text data using 4 A100s with FSDP |
| Liger ORPO Trainer | Align Llama 3.2 using the Liger ORPO Trainer with FSDP, with 50% memory reduction |

## Key Features

- Ease of use: Simply patch your Hugging Face model with one line of code, or compose your own model using our Liger Kernel modules.
- Time- and memory-efficient: In the same spirit as Flash-Attn, but for layers like RMSNorm, RoPE, SwiGLU, and CrossEntropy! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with kernel fusion, in-place replacement, and chunking techniques.
- Exact: Computation is exact, with no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
- Lightweight: Liger Kernel has minimal dependencies, requiring only Torch and Triton. No extra libraries needed! Say goodbye to dependency headaches!
- Multi-GPU supported: Compatible with multi-GPU setups (PyTorch FSDP, DeepSpeed, DDP, etc.).
- Trainer Framework Integration: Axolotl, LLaMa-Factory, SFTTrainer, Hugging Face Trainer, SWIFT

## Installation

### Dependencies

#### CUDA

- `torch >= 2.1.2`
- `triton >= 2.3.0`

#### ROCm

- `torch >= 2.5.0`: Install according to the instructions on the PyTorch official webpage.
- `triton >= 3.0.0`: Install from PyPI (e.g. `pip install triton==3.0.0`).

### Optional Dependencies

- `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working with will dictate the minimum version of transformers.

> Note:
> Our kernels inherit the full spectrum of hardware compatibility offered by Triton.

To install the stable version:

```bash
$ pip install liger-kernel
```

To install the nightly version:

```bash
$ pip install liger-kernel-nightly
```

To install from source:

```bash
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel

# Install Default Dependencies
# Setup.py will detect whether you are using AMD or NVIDIA
pip install -e .

# Setup Development Dependencies
pip install -e ".[dev]"
```

## Getting Started

There are a couple of ways to apply Liger kernels, depending on the level of customization required.

### 1. Use AutoLigerKernelForCausalLM

Using `AutoLigerKernelForCausalLM` is the simplest approach, as you don't have to import a model-specific patching API. If the model type is supported, the modeling code will be automatically patched using the default settings.

```python
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# This AutoModel wrapper class automatically monkey-patches the
# model with the optimized Liger kernels if the model is supported.
model = AutoLigerKernelForCausalLM.from_pretrained("path/to/some/model")
```

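Standard `from_pretrained` keyword arguments are assumed to be forwarded to the underlying Hugging Face model, so the usual loading options should still apply; treat the pass-through behavior and the flags below as assumptions to verify against your installed version.

```python
import torch

from liger_kernel.transformers import AutoLigerKernelForCausalLM

# Hypothetical loading options, forwarded (as assumed) to the underlying Hugging Face from_pretrained
model = AutoLigerKernelForCausalLM.from_pretrained(
    "path/to/some/model",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
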
### 2. Apply Model-Specific Patching APIs

Using the patching APIs, you can swap Hugging Face models with optimized Liger kernels.

```python
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# 1a. Adding this line automatically monkey-patches the model with the optimized Liger kernels
apply_liger_kernel_to_llama()

# 1b. You could alternatively specify exactly which kernels are applied
apply_liger_kernel_to_llama(
    rope=True,
    swiglu=True,
    cross_entropy=True,
    fused_linear_cross_entropy=False,
    rms_norm=False
)

# 2. Instantiate the patched model
model = transformers.AutoModelForCausalLM.from_pretrained("path/to/llama/model")
```

### 3. Compose Your Own Model

You can take individual kernels to compose your own models.

```python
import torch
import torch.nn as nn

from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss

model = nn.Linear(128, 256).cuda()

# Fuses the linear + cross-entropy layers together and performs chunk-by-chunk computation to reduce memory
loss_fn = LigerFusedLinearCrossEntropyLoss()

input = torch.randn(4, 128, requires_grad=True, device="cuda")
target = torch.randint(256, (4, ), device="cuda")

loss = loss_fn(model.weight, input, target)
loss.backward()
```

## High-level APIs

### AutoModel

| AutoModel Variant | API |
|-----------|---------|
| AutoModelForCausalLM | `liger_kernel.transformers.AutoLigerKernelForCausalLM` |

### Patching

| Model | API | Supported Operations |
|-------------|--------------------------------------------------------------|-------------------------------------------------------------------------|
| LLaMA 2 & 3 | `liger_kernel.transformers.apply_liger_kernel_to_llama` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| LLaMA 3.2-Vision | `liger_kernel.transformers.apply_liger_kernel_to_mllama` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Mistral | `liger_kernel.transformers.apply_liger_kernel_to_mistral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Mixtral | `liger_kernel.transformers.apply_liger_kernel_to_mixtral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Gemma1 | `liger_kernel.transformers.apply_liger_kernel_to_gemma` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Gemma2 | `liger_kernel.transformers.apply_liger_kernel_to_gemma2` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Qwen2, Qwen2.5, & QwQ | `liger_kernel.transformers.apply_liger_kernel_to_qwen2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Qwen2-VL | `liger_kernel.transformers.apply_liger_kernel_to_qwen2_vl` | RMSNorm, LayerNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Phi3 & Phi3.5 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |

## Low-level APIs

- `Fused Linear` kernels combine linear layers with losses, reducing memory usage by up to 80% - ideal for HBM-constrained workloads.
- Other kernels use fusion and in-place techniques for memory and performance optimization.

### Model Kernels

| Kernel | API |
|---------------------------------|-------------------------------------------------------------|
| RMSNorm | `liger_kernel.transformers.LigerRMSNorm` |
| LayerNorm | `liger_kernel.transformers.LigerLayerNorm` |
| RoPE | `liger_kernel.transformers.liger_rotary_pos_emb` |
| SwiGLU | `liger_kernel.transformers.LigerSwiGLUMLP` |
| GeGLU | `liger_kernel.transformers.LigerGEGLUMLP` |
| CrossEntropy | `liger_kernel.transformers.LigerCrossEntropyLoss` |
| Fused Linear CrossEntropy | `liger_kernel.transformers.LigerFusedLinearCrossEntropyLoss` |

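Most of these modules are intended as drop-in replacements for their PyTorch or Hugging Face counterparts. The snippet below is a minimal sketch of that pattern; it assumes `LigerCrossEntropyLoss` follows `torch.nn.CrossEntropyLoss`'s interface and that `LigerRMSNorm` takes the hidden size as its first argument, so double-check the signatures for your installed version.

```python
import torch

from liger_kernel.transformers import LigerCrossEntropyLoss, LigerRMSNorm

hidden_size, vocab_size = 64, 128

# Assumed drop-in for torch.nn.CrossEntropyLoss on (N, C) logits
loss_fn = LigerCrossEntropyLoss()
logits = torch.randn(4, vocab_size, device="cuda", requires_grad=True)
labels = torch.randint(vocab_size, (4,), device="cuda")
loss = loss_fn(logits, labels)
loss.backward()

# Assumed constructor: hidden size as the first argument
norm = LigerRMSNorm(hidden_size).cuda()
hidden_states = torch.randn(4, 16, hidden_size, device="cuda")
normed = norm(hidden_states)
```
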
### Alignment Kernels

| Kernel | API |
|---------------------------------|-------------------------------------------------------------|
| Fused Linear CPO Loss | `liger_kernel.chunked_loss.LigerFusedLinearCPOLoss` |
| Fused Linear DPO Loss | `liger_kernel.chunked_loss.LigerFusedLinearDPOLoss` |
| Fused Linear ORPO Loss | `liger_kernel.chunked_loss.LigerFusedLinearORPOLoss` |
| Fused Linear SimPO Loss | `liger_kernel.chunked_loss.LigerFusedLinearSimPOLoss` |

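Building on the `orpo_loss(lm_head.weight, x, target)` call shown earlier, here is a minimal shape sketch. The layout (chosen and rejected completions concatenated along the batch dimension) and the default hyperparameters are assumptions; check the `liger_kernel.chunked_loss` docstrings for the exact contract.

```python
import torch
import torch.nn as nn

from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss

batch, seq_len, hidden, vocab = 2, 16, 64, 128
lm_head = nn.Linear(hidden, vocab, bias=False).cuda()

orpo_loss = LigerFusedLinearORPOLoss()

# Assumed layout: chosen and rejected completions stacked along dim 0 -> (2 * batch, seq_len, hidden)
x = torch.randn(2 * batch, seq_len, hidden, device="cuda", requires_grad=True)
target = torch.randint(vocab, (2 * batch, seq_len), device="cuda")

y = orpo_loss(lm_head.weight, x, target)
```
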
### Distillation Kernels

| Kernel | API |
|---------------------------------|-------------------------------------------------------------|
| KLDivergence | `liger_kernel.transformers.LigerKLDIVLoss` |
| JSD | `liger_kernel.transformers.LigerJSD` |
| Fused Linear JSD | `liger_kernel.transformers.LigerFusedLinearJSD` |

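As a rough sketch of the distillation losses, the snippet below assumes `LigerKLDIVLoss` mirrors `torch.nn.KLDivLoss` (log-probabilities as input, probabilities as target); verify the reduction and `log_target` semantics against the module's docstring before relying on it.

```python
import torch
import torch.nn.functional as F

from liger_kernel.transformers import LigerKLDIVLoss

kl_loss = LigerKLDIVLoss(reduction="batchmean")

student_logits = torch.randn(8, 128, device="cuda", requires_grad=True)
teacher_logits = torch.randn(8, 128, device="cuda")

# Assumed KLDivLoss-style contract: input = student log-probs, target = teacher probs
loss = kl_loss(F.log_softmax(student_logits, dim=-1), F.softmax(teacher_logits, dim=-1))
loss.backward()
```
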
### Experimental Kernels

| Kernel | API |
|---------------------------------|-------------------------------------------------------------|
| Embedding | `liger_kernel.transformers.experimental.LigerEmbedding` |
| Matmul int2xint8 | `liger_kernel.transformers.experimental.matmul` |

## Contributing, Acknowledgements, and License

- Contributing Guidelines
- Acknowledgements
- License Information

## Sponsorship and Collaboration

- AMD: Providing AMD GPUs for our AMD CI.
- Intel: Providing Intel GPUs for our Intel CI.
- Modal: Free 3000 credits from GPU MODE IRL for our NVIDIA CI.
- EmbeddedLLM: Making Liger Kernel run fast and stable on AMD.
- HuggingFace: Integrating Liger Kernel into Hugging Face Transformers and TRL.
- Lightning AI: Integrating Liger Kernel into Lightning Thunder.
- Axolotl: Integrating Liger Kernel into Axolotl.
- Llama-Factory: Integrating Liger Kernel into Llama-Factory.

## Contact

- For issues, create a GitHub ticket in this repository
- For open discussion, join our Discord channel
- For formal collaboration, send an email to [email protected]

## Cite this work

Biblatex entry:

```bibtex
@article{hsu2024ligerkernelefficienttriton,
  title={Liger Kernel: Efficient Triton Kernels for LLM Training},
  author={Pin-Lun Hsu and Yun Dai and Vignesh Kothapalli and Qingquan Song and Shao Tang and Siyu Zhu and Steven Shimizu and Shivam Sahni and Haowen Ning and Yanning Chen},
  year={2024},
  eprint={2410.10989},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.10989},
  journal={arXiv preprint arXiv:2410.10989},
}
```

## Star History

[Star History Chart]

## Acknowledgement

### Design

- @claire_yishan for the LOGO design
- Wave Snippets for generating the animated code snippets

### Code

We referenced or used the following projects:

| # | Project | Description | Location | License |
|---|---------|-------------|----------|---------|
| 1 | Unsloth | `calculate_settings` to determine block size and warp; we reuse it for Norm and MLP | Liger Kernel Utils | Apache |
| 2 | Unsloth | We modified and added dW calculation on top of the Unsloth implementation | Liger Kernel RMS Norm | Apache |
| 3 | Triton tutorial | We modified on top of the Triton tutorials | Liger Kernel RMS Norm | MIT |
| 4 | tiny shakespeare dataset | We use the tiny shakespeare dataset to conduct convergence tests on mini models | Liger Kernel Convergence | N/A |
| 5 | Efficient Cross Entropy | We use the idea of gradient-in-forward and chunking | Liger Kernel Linear Cross Entropy | MIT |
| 6 | Flash attn | We take many optimization ideas from the work, such as tiling and recomputation | | BSD |
| 7 | AutoAWQ | We reference the design of AutoModel | Liger Kernel Auto Model | MIT |
| 8 | llm.c | We reference the design of end-to-end testing | Liger Kernel Convergence Tests | MIT |

Many thanks to the contributors to these projects for their invaluable work that helped make Liger possible.

# Contributing to Liger-Kernel

Thank you for your interest in contributing to Liger-Kernel! This guide will help you set up your development environment, add a new kernel, run tests, and submit a pull request (PR).

## Maintainer

@ByronHsu(admin) @qingquansong @yundai424 @kvignesh1420 @lancerts @JasonZhu1313 @shimizust

## Interested in the ticket?

Leave `#take` in the comment and tag the maintainer.

## Setting Up Your Development Environment

1. Clone the Repository
   ```sh
   git clone https://github.com/linkedin/Liger-Kernel.git
   cd Liger-Kernel
   ```
2. Install Dependencies and Editable Package
   ```sh
   pip install -e ".[dev]"
   ```
   If you encounter the error `no matches found: .[dev]`, please use
   ```sh
   pip install -e .'[dev]'
   ```

## Structure

### Source Code

- `ops/`: Core Triton operations.
- `transformers/`: PyTorch `nn.Module` implementations built on Triton operations, compliant with the `transformers` API.

### Tests

- `transformers/`: Correctness tests for the Triton-based layers.
- `convergence/`: Patches Hugging Face models with all kernels, runs multiple iterations, and compares weights, logits, and loss layer-by-layer.

### Benchmark

- `benchmark/`: Execution time and memory benchmarks compared to Hugging Face layers.

## Adding support for a new model

To get familiar with the folder structure, please refer to https://github.com/linkedin/Liger-Kernel?tab=readme-ov-file#structure.

1. Figure out the kernels that can be monkey-patched
   - Check the `src/liger_kernel/ops` directory to find the kernels that can be monkey-patched.
   - Kernels like Fused Linear Cross Entropy require a custom `lce_forward` function to allow monkey-patching. For adding kernels requiring a similar approach, ensure that you create the corresponding forward function in the `src/liger_kernel/transformers/model` directory.
2. Monkey-patch the HuggingFace model
   - Add the monkey-patching code in the `src/liger_kernel/transformers/monkey_patch.py` file.
   - Ensure that the monkey-patching function is added to the `__init__.py` file in the `src/liger_kernel/transformers/` directory.
3. Add Unit Tests
   - Create unit tests and convergence tests for the monkey-patched model in the tests directory. Ensure that your tests cover all functionalities of the monkey-patched model.

## Adding a New Kernel

To get familiar with the folder structure, please refer to https://github.com/linkedin/Liger-Kernel?tab=readme-ov-file#structure.

1. Create Your Kernel
   Add your kernel implementation in `src/liger_kernel/`.
2. Add Unit Tests
   Create unit tests and convergence tests for your kernel in the tests directory. Ensure that your tests cover all kernel functionalities.
3. Add Benchmark Script
   Add a benchmarking script under `benchmark/scripts` using the naming convention `benchmark_{kernel_name}.py`, showing the performance difference between the Liger kernel and HuggingFace.

## Run tests

### Use Makefile to run full tests

1. Run `make test` to ensure correctness.
2. Run `make checkstyle` to ensure code style.
3. Run `make test-convergence` to ensure convergence.

### Run pytest on a single file

```sh
python -m pytest test_sample.py::test_function_name
```

## Run kernel benchmarks

The `/benchmark` directory contains benchmarking scripts for the individual kernels, demonstrating differences in speed and memory usage between the Liger and HuggingFace module implementations.

1. Run `make run-benchmarks` to run all benchmarking scripts and append data to `benchmark/data/all_benchmark_data.csv`.
   - Existing entries that are the same (based on `kernel_name`, `kernel_provider`, `kernel_operation_mode`, `metric_name`, `x_name`, `x_value`, `extra_benchmark_config_str`, and `gpu_name`) will not be overwritten.
2. Run `make run-benchmarks OVERWRITE=1` to overwrite any existing entries that have the same configuration.
3. Run `python benchmark/scripts/benchmark_{kernel_name}.py` to run an individual benchmark.
4. You can use the `benchmark/benchmarks_visualizer.py` script to generate visualizations from the CSV; these are then saved to the `benchmark/visualizations` directory (note: this directory is not tracked by git).

## Submit PR

Fork the repo, copy and paste the successful test logs into the PR, and submit the PR following the PR template (example PR).

> As a contributor, you represent that the code you submit is your original work or that of your employer (in which case you represent you have the right to bind your employer). By submitting code, you (and, if applicable, your employer) are licensing the submitted code to LinkedIn and the open source community subject to the BSD 2-Clause license.

## Release (maintainer only)

1. Bump the version in pyproject.toml to the desired version (for example, `0.2.0`).
2. Submit a PR and merge.
3. Create a new release based on the current HEAD, with a tag name using `v<version number>`, for example `v0.2.0`. Alternatively, if you want to create a release based on a different commit hash, run `git tag v0.2.0 <commit hash> && git push origin v0.2.0`, and create the release based on this tag.
4. Add release notes: the minimum requirement is to click the `Generate Release Notes` button, which automatically generates 1) changes included and 2) new contributors. It's good to add sections on top to highlight the important changes.
5. A new pip upload will be triggered upon a new release. NOTE: Both pre-releases and official releases will trigger the workflow to build the wheel and publish to PyPI, so please be sure that steps 1-3 are followed correctly!

### Notes on versioning

We follow semantic versioning. Denoting the version as `major.minor.patch`, we increment:

- Major version when there is a backward-incompatible change
- Minor version when there is new backward-compatible functionality
- Patch version for bug fixes

This project is licensed under the BSD 2-Clause License (see `LICENSE` for details).
It also includes components from projects licensed under:

- Apache License 2.0 (see `LICENSE-APACHE-2.0` for details).
- MIT License (see `LICENSE-MIT-AutoAWQ` for details).
- MIT License (see `LICENSE-MIT-Efficient Cross Entropy` for details).
- MIT License (see `LICENSE-MIT-llmc` for details).
- MIT License (see `LICENSE-MIT-triton` for details).

# Liger-Kernel Example with HuggingFace Trainer

## How to Run

### Locally on a GPU machine

You can run the example locally on a GPU machine. The default hyperparameters and configurations work on a single node with 4xA100 80GB GPUs.

```bash
pip install -r requirements.txt
sh run_{MODEL}.sh
```

### Remotely on Modal

If you do not have access to a GPU machine, you can run the example on Modal. Modal is a serverless platform that allows you to run your code on a remote GPU machine. You can sign up for a free account at Modal.

```bash
pip install modal
modal setup  # authenticate with Modal
modal run launch_on_modal.py --script "run_qwen2_vl.sh"
```

Notes

1. This example uses an optional `use_liger` flag. If true, it does a one-line monkey patch to apply the Liger kernel.
2. The example uses the Llama3 model, which requires a community license agreement and HuggingFace Hub login. If you want to use Llama3 in this example, please make sure you have done the following:
   * Agree to the community license agreement: https://huggingface.co/meta-llama/Meta-Llama-3-8B
   * Run `huggingface-cli login` and enter your HuggingFace token
3. The default hyperparameters and configurations work on a single node with 4xA100 80GB GPUs. For running on a device with less GPU RAM, please consider reducing the per-GPU batch size and/or enabling `CPUOffload` in FSDP.

## Benchmark Result

### LLaMA

Benchmark conditions: LLaMA 3-8B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 20%, while GPU memory usage drops by 40%. This allows you to train the model on smaller GPUs, use larger batch sizes, or handle longer sequence lengths without incurring additional costs.

### QWEN

Benchmark conditions: Qwen2-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 10%, while GPU memory usage drops by 50%.

### GEMMA 7B

Benchmark conditions: Gemma-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 24%, while GPU memory usage drops by 33%.

# Liger-Kernel Example with Lightning Trainer

## How to Run

```bash
pip install -r requirements.txt

# For a single L40 48GB GPU
python training.py --model Qwen/Qwen2-0.5B-Instruct --num_gpu 1 --max_length 1024

# For 8xA100 40GB
python training.py --model meta-llama/Meta-Llama-3-8B --strategy deepspeed
```

Notes

1. The example uses the Llama3 model, which requires a community license agreement and HuggingFace Hub login. If you want to use Llama3 in this example, please make sure you have done the following:
   * Agree to the community license agreement: https://huggingface.co/meta-llama/Meta-Llama-3-8B
   * Run `huggingface-cli login` and enter your HuggingFace token
2. The default hyperparameters and configuration for Gemma work on a single L40 48GB GPU, and the config for Llama works on a single node with 8xA100 40GB GPUs. For running on a device with less GPU RAM, please consider reducing the per-GPU batch size and/or enabling `CPUOffload` in FSDP.

# Liger-Kernel Example with Medusa

Medusa is a simple framework that democratizes the acceleration techniques for LLM generation with multiple decoding heads. [repo], [paper]

During training, Medusa requires adding `k` decoding heads to the hidden states right before the regular LM head `h_t`. The `k`-th head is used to predict the token in the `(t + k + 1)`-th position of the next tokens (the original language model head is used to predict the `(t + 1)`-th position).

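Spelled out as a formula, this is just a restatement of the sentence above, with each Medusa head simplified to a single projection `W_k` (the actual Medusa heads add a small feed-forward block, omitted here):

```latex
% Both the original LM head W_0 and the k-th Medusa head W_k read the same hidden state h_t:
p_0(x_{t+1}   \mid x_{\le t}) = \operatorname{softmax}(W_0\, h_t)
p_k(x_{t+k+1} \mid x_{\le t}) = \operatorname{softmax}(W_k\, h_t), \quad k = 1, \dots, K
```
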
The Liger fused CE kernel is highly effective in this scenario, eliminating the need to materialize logits for each head, which usually consumes a large volume of memory due to the extensive vocabulary size (e.g., for LLaMA-3, the vocabulary size is 128k). The introduction of multiple heads can easily lead to OOM (Out of Memory) issues. However, thanks to the efficient Liger fused CE, which calculates the gradient in place and doesn't materialize the logits, we have observed very effective results. This efficiency opens up more opportunities for multi-token prediction research and development.

# Instructions to Run the Training Script

```bash
git clone [email protected]:linkedin/Liger-Kernel.git
cd {PATH_TO_Liger-Kernel}/Liger-Kernel/
pip install -e .
cd {PATH_TO_Liger-Kernel}/Liger-Kernel/examples/medusa
pip install -r requirements.txt
sh scripts/llama3_8b_medusa.sh
```

Notes

1. This example uses an optional `use_liger` flag. If true, it does a monkey patch to apply the Liger kernel with Medusa heads.
2. The example uses the Llama3 model, which requires a community license agreement and HuggingFace Hub login. If you want to use Llama3 in this example, please make sure you have done the following:
   * Agree to the community license agreement: https://huggingface.co/meta-llama/Meta-Llama-3-8B
   * Run `huggingface-cli login` and enter your HuggingFace token
3. The default hyperparameters and configurations work on a single node with 8xA100 GPUs. For running on a device with less GPU RAM, please consider reducing the per-GPU batch size and/or enabling `CPUOffload` in FSDP.
4. We are using a smaller sample of ShareGPT data primarily to benchmark performance. The example requires hyperparameter tuning and dataset selection to work effectively, also ensuring the dataset has the same distribution as the LLaMA pretraining data. Contributions to enhance the example code are welcome.

# Memory Profiling Result

> Note:
> 1. Benchmark conditions: LLaMA 3-8B, Batch Size = 6, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.

## Stage 1

Stage 1 refers to Medusa-1, where the backbone model is frozen and only the weights of the LLM heads are updated.

```bash
# Setting this flag to True in llama3_8b_medusa.sh enables Stage 1
--medusa_only_heads True
```

### num_head = 3

### num_head = 5

## Stage 2

```bash
# Setting this flag to False in llama3_8b_medusa.sh enables Stage 2
--medusa_only_heads False
```

Stage 2 refers to Medusa-2, where all the model weights are updated, including the backbone model and the LLM heads.

### num_head = 3

### num_head = 5

# Liger FlexChunkLoss: Alignment and Distillation loss

Liger FlexChunkLoss offers a versatile interface, delivering up to 80% memory savings and a 10% throughput boost for post-training loss functions, including alignment (DPO, ORPO, CPO) and, very soon, distillation. Its flexible design supports custom losses, ensuring efficiency gains across diverse use cases.

### User interface

FlexChunkLoss offers two flexible usage options:

1. Via `Liger[Custom Loss]Trainer`
   For example, by simply replacing the HuggingFace `ORPOTrainer` with `LigerORPOTrainer` in your code, you can leverage our optimized ORPO implementation and immediately benefit from improved performance (see the sketch after this list).

2. Using `nn.Module` Implementations of Custom Loss Functions
   Explore the `LigerORPOTrainer` implementation to see how the modular design integrates custom loss functions seamlessly.

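A minimal sketch of option 1. Everything here other than the swap itself is an assumption for illustration: the `LigerORPOTrainer` import path, the TRL-style `ORPOConfig`, the dataset name, and the trainer keyword arguments (which vary across `trl` versions). Adapt it to your own script.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig  # hypothetical setup: LigerORPOTrainer used in place of trl's ORPOTrainer

from liger_kernel.transformers.trainer import LigerORPOTrainer  # import path assumed; check your installed version

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# The swap: construct LigerORPOTrainer exactly like trl's ORPOTrainer and train as usual.
trainer = LigerORPOTrainer(
    model=model,
    args=ORPOConfig(output_dir="./orpo-output", logging_steps=10),
    tokenizer=tokenizer,  # newer trl versions use `processing_class=` instead
    train_dataset=train_dataset,
)
trainer.train()
```
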
### What's under the hood?

We employ chunking and fused kernel optimizations to enhance performance. By fusing the final linear layer with the loss computation and calculating backward gradients during the forward pass, we significantly reduce the need to store intermediate activations. All operations are implemented in PyTorch, leveraging `torch.compile` to streamline kernel execution without relying on extensive low-level optimizations. Additionally, we minimize `torch.compile` recompilations to reduce overhead and ensure consistent performance gains.

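To make the chunking and gradient-in-forward ideas concrete, here is a small, illustrative pure-PyTorch sketch. It is not the library's implementation (which fuses more steps, handles ignored tokens, and relies on `torch.compile`); it only shows the core idea of processing tokens chunk by chunk and accumulating the weight gradient as part of the forward computation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def chunked_linear_ce_with_grad(weight, hidden, target, chunk_size=1024):
    """Illustrative only: fuse "linear + cross entropy", processing tokens chunk by chunk
    and accumulating the weight gradient during the forward pass, so the full
    (num_tokens, vocab_size) logits tensor is never materialized."""
    num_tokens = hidden.shape[0]
    total_loss = hidden.new_zeros(())
    grad_weight = torch.zeros_like(weight)
    for start in range(0, num_tokens, chunk_size):
        h = hidden[start:start + chunk_size]   # (c, hidden_size)
        t = target[start:start + chunk_size]   # (c,)
        logits = h @ weight.t()                # (c, vocab_size): only one chunk of logits at a time
        total_loss += F.cross_entropy(logits, t, reduction="sum")
        # d(sum CE)/d(logits) = softmax(logits) - one_hot(target); fold it into dW right away.
        dlogits = torch.softmax(logits, dim=-1)
        dlogits[torch.arange(t.shape[0], device=t.device), t] -= 1.0
        grad_weight += dlogits.t() @ h
    # Mean-reduce the loss and scale the accumulated gradient to match.
    return total_loss / num_tokens, grad_weight / num_tokens

# Example usage with illustrative sizes: 4096 tokens, hidden size 512, vocab 32000
weight = torch.randn(32000, 512, device="cuda")
hidden = torch.randn(4096, 512, device="cuda")
target = torch.randint(32000, (4096,), device="cuda")
loss, grad_w = chunked_linear_ce_with_grad(weight, hidden, target)
```
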
### Extending to custom loss functions

We provide two base classes: `LigerFusedLinearPreferenceBase` for alignment use cases and `LigerFusedLinearDistillationBase` for distillation use cases. These base classes manage chunking, kernel fusion, and Torch compilation.

To implement a custom loss function, you need to create a subclass that defines the custom preference or distillation loss function, capable of processing a given input chunk. The base class will take care of the optimizations, handling most of the heavy lifting for you.

For a working example, refer to the ORPO loss implementation.