```
├── .github
│   └── pull_request_template.md
├── README.md
├── docs
│   ├── Acknowledgement.md
│   ├── CONTRIBUTING.md
│   └── License.md
├── examples
│   ├── huggingface
│   │   └── README.md
│   ├── lightning
│   │   └── README.md
│   └── medusa
│       └── README.md
└── src
    └── liger_kernel
        └── chunked_loss
            └── README.md
```

## Summary

## Testing Done

- Hardware Type:
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

# Liger Kernel: Efficient Triton Kernels for LLM Training

## Latest News 🔥

- [2024/12/11] We release v0.5.0: 80% more memory-efficient post-training losses (DPO, ORPO, CPO, etc.)!
- [2024/12/5] We release the LinkedIn Engineering Blog - Liger-Kernel: Empowering an open source ecosystem of Triton Kernels for Efficient LLM Training
- [2024/11/6] We release v0.4.0: full AMD support, Tech Report, Modal CI, Llama-3.2-Vision!
- [2024/10/21] We have released the tech report of Liger Kernel on arXiv: https://arxiv.org/pdf/2410.10989
- [2024/9/6] We release v0.2.1 (X post). 2500+ stars, 10+ new contributors, 50+ PRs, 50k downloads in two weeks!
- [2024/8/31] CUDA MODE talk, Liger-Kernel: Real-world Triton kernel for LLM Training, Slides
- [2024/8/23] Official release: check out our X post

Liger Kernel is a collection of Triton kernels designed specifically for LLM training. We have implemented Hugging Face-compatible `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, and more to come. The kernels work out of the box with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed. We welcome contributions from the community to gather the best kernels for LLM training.

We've also added optimized post-training kernels that deliver up to 80% memory savings for alignment and distillation tasks. We support losses like DPO, CPO, ORPO, SimPO, JSD, and many more. Check out how we optimize the memory.

## Supercharge Your Model with Liger Kernel

With one line of code, Liger Kernel can increase throughput by more than 20% and reduce memory usage by 60%, thereby enabling longer context lengths, larger batch sizes, and massive vocabularies.

| Speed Up | Memory Reduction |
|--------------------------|-------------------------|
| (benchmark chart) | (benchmark chart) |

> Note:
> - Benchmark conditions: LLaMA 3-8B, Batch Size = 8, Data Type = `bf16`, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.
> - Hugging Face models start to OOM at a 4K context length, whereas Hugging Face + Liger Kernel scales up to 16K.

## Optimize Post Training with Liger Kernel

We provide optimized post-training kernels like DPO, ORPO, SimPO, and more, which can reduce memory usage by up to 80%. You can easily use them as Python modules.

```python
from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss

orpo_loss = LigerFusedLinearORPOLoss()
y = orpo_loss(lm_head.weight, x, target)
```

## Examples

| Use Case | Description |
|------------------------------------------------|---------------------------------------------------------------------------------------------------|
| Hugging Face Trainer | Train LLaMA 3-8B ~20% faster with over 40% memory reduction on the Alpaca dataset using 4 A100s with FSDP |
| Lightning Trainer | Increase throughput by 15% and reduce memory usage by 40% with LLaMA 3-8B on the MMLU dataset using 8 A100s with DeepSpeed ZeRO3 |
| Medusa Multi-head LLM (Retraining Phase) | Reduce memory usage by 80% with 5 LM heads and improve throughput by 40% using 8 A100s with FSDP |
| Vision-Language Model SFT | Finetune Qwen2-VL on image-text data using 4 A100s with FSDP |
| Liger ORPO Trainer | Align Llama 3.2 using the Liger ORPO Trainer with FSDP, with 50% memory reduction |

## Key Features

- Ease of use: Simply patch your Hugging Face model with one line of code, or compose your own model using our Liger Kernel modules.
- Time- and memory-efficient: In the same spirit as Flash-Attn, but for layers like RMSNorm, RoPE, SwiGLU, and CrossEntropy! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with kernel fusion, in-place replacement, and chunking techniques.
- Exact: Computation is exact, with no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
- Lightweight: Liger Kernel has minimal dependencies, requiring only Torch and Triton. No extra libraries needed! Say goodbye to dependency headaches!
- Multi-GPU supported: Compatible with multi-GPU setups (PyTorch FSDP, DeepSpeed, DDP, etc.).
- Trainer Framework Integration: Axolotl, LLaMa-Factory, SFTTrainer, Hugging Face Trainer, SWIFT

## Installation

### Dependencies

#### CUDA

- `torch >= 2.1.2`
- `triton >= 2.3.0`

#### ROCm

- `torch >= 2.5.0`: Install according to the instructions on the PyTorch official webpage.
- `triton >= 3.0.0`: Install from PyPI (e.g. `pip install triton==3.0.0`).

### Optional Dependencies

- `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working with will dictate the minimum version of transformers.

> Note:
> Our kernels inherit the full spectrum of hardware compatibility offered by Triton.

To install the stable version:

```bash
$ pip install liger-kernel
```

To install the nightly version:

```bash
$ pip install liger-kernel-nightly
```

To install from source:

```bash
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel

# Install Default Dependencies
# Setup.py will detect whether you are using AMD or NVIDIA
pip install -e .

# Setup Development Dependencies
pip install -e ".[dev]"
```

## Getting Started

There are a couple of ways to apply Liger kernels, depending on the level of customization required.

### 1. Use AutoLigerKernelForCausalLM

Using `AutoLigerKernelForCausalLM` is the simplest approach, as you don't have to import a model-specific patching API. If the model type is supported, the modeling code will be automatically patched using the default settings.

```python
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# This AutoModel wrapper class automatically monkey-patches the
# model with the optimized Liger kernels if the model is supported.
model = AutoLigerKernelForCausalLM.from_pretrained("path/to/some/model")
```

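Standard `from_pretrained` keyword arguments are assumed to be forwarded to the underlying Hugging Face model, so the usual loading options should still apply; treat the pass-through behavior and the flags below as assumptions to verify against your installed version.

```python
import torch

from liger_kernel.transformers import AutoLigerKernelForCausalLM

# Hypothetical loading options, forwarded (as assumed) to the underlying Hugging Face from_pretrained
model = AutoLigerKernelForCausalLM.from_pretrained(
    "path/to/some/model",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
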
### 2. Apply Model-Specific Patching APIs

Using the patching APIs, you can swap Hugging Face models with optimized Liger kernels.

```python
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# 1a. Adding this line automatically monkey-patches the model with the optimized Liger kernels
apply_liger_kernel_to_llama()

# 1b. You could alternatively specify exactly which kernels are applied
apply_liger_kernel_to_llama(
    rope=True,
    swiglu=True,
    cross_entropy=True,
    fused_linear_cross_entropy=False,
    rms_norm=False
)

# 2. Instantiate the patched model
model = transformers.AutoModelForCausalLM.from_pretrained("path/to/llama/model")
```

### 3. Compose Your Own Model

You can take individual kernels to compose your own models.

```python
import torch
import torch.nn as nn

from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss

model = nn.Linear(128, 256).cuda()

# Fuses the linear + cross-entropy layers together and performs chunk-by-chunk computation to reduce memory
loss_fn = LigerFusedLinearCrossEntropyLoss()

input = torch.randn(4, 128, requires_grad=True, device="cuda")
target = torch.randint(256, (4, ), device="cuda")

loss = loss_fn(model.weight, input, target)
loss.backward()
```

## High-level APIs

### AutoModel

| AutoModel Variant | API |
|-----------|---------|
| AutoModelForCausalLM | `liger_kernel.transformers.AutoLigerKernelForCausalLM` |

### Patching

| Model | API | Supported Operations |
|-------------|--------------------------------------------------------------|-------------------------------------------------------------------------|
| LLaMA 2 & 3 | `liger_kernel.transformers.apply_liger_kernel_to_llama` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| LLaMA 3.2-Vision | `liger_kernel.transformers.apply_liger_kernel_to_mllama` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Mistral | `liger_kernel.transformers.apply_liger_kernel_to_mistral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Mixtral | `liger_kernel.transformers.apply_liger_kernel_to_mixtral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Gemma1 | `liger_kernel.transformers.apply_liger_kernel_to_gemma` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Gemma2 | `liger_kernel.transformers.apply_liger_kernel_to_gemma2` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Qwen2, Qwen2.5, & QwQ | `liger_kernel.transformers.apply_liger_kernel_to_qwen2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Qwen2-VL | `liger_kernel.transformers.apply_liger_kernel_to_qwen2_vl` | RMSNorm, LayerNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Phi3 & Phi3.5 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |

## Low-level APIs

- `Fused Linear` kernels combine linear layers with losses, reducing memory usage by up to 80% - ideal for HBM-constrained workloads.
- Other kernels use fusion and in-place techniques for memory and performance optimization.

### Model Kernels

| Kernel | API |
|---------------------------------|-------------------------------------------------------------|
| RMSNorm | `liger_kernel.transformers.LigerRMSNorm` |
| LayerNorm | `liger_kernel.transformers.LigerLayerNorm` |
| RoPE | `liger_kernel.transformers.liger_rotary_pos_emb` |
| SwiGLU | `liger_kernel.transformers.LigerSwiGLUMLP` |
| GeGLU | `liger_kernel.transformers.LigerGEGLUMLP` |
| CrossEntropy | `liger_kernel.transformers.LigerCrossEntropyLoss` |
| Fused Linear CrossEntropy | `liger_kernel.transformers.LigerFusedLinearCrossEntropyLoss` |

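Most of these modules are intended as drop-in replacements for their PyTorch or Hugging Face counterparts. The snippet below is a minimal sketch of that pattern; it assumes `LigerCrossEntropyLoss` follows `torch.nn.CrossEntropyLoss`'s interface and that `LigerRMSNorm` takes the hidden size as its first argument, so double-check the signatures for your installed version.

```python
import torch

from liger_kernel.transformers import LigerCrossEntropyLoss, LigerRMSNorm

hidden_size, vocab_size = 64, 128

# Assumed drop-in for torch.nn.CrossEntropyLoss on (N, C) logits
loss_fn = LigerCrossEntropyLoss()
logits = torch.randn(4, vocab_size, device="cuda", requires_grad=True)
labels = torch.randint(vocab_size, (4,), device="cuda")
loss = loss_fn(logits, labels)
loss.backward()

# Assumed constructor: hidden size as the first argument
norm = LigerRMSNorm(hidden_size).cuda()
hidden_states = torch.randn(4, 16, hidden_size, device="cuda")
normed = norm(hidden_states)
```
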
### Alignment Kernels

| Kernel | API |
|---------------------------------|-------------------------------------------------------------|
| Fused Linear CPO Loss | `liger_kernel.chunked_loss.LigerFusedLinearCPOLoss` |
| Fused Linear DPO Loss | `liger_kernel.chunked_loss.LigerFusedLinearDPOLoss` |
| Fused Linear ORPO Loss | `liger_kernel.chunked_loss.LigerFusedLinearORPOLoss` |
| Fused Linear SimPO Loss | `liger_kernel.chunked_loss.LigerFusedLinearSimPOLoss` |

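Building on the `orpo_loss(lm_head.weight, x, target)` call shown earlier, here is a minimal shape sketch. The layout (chosen and rejected completions concatenated along the batch dimension) and the default hyperparameters are assumptions; check the `liger_kernel.chunked_loss` docstrings for the exact contract.

```python
import torch
import torch.nn as nn

from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss

batch, seq_len, hidden, vocab = 2, 16, 64, 128
lm_head = nn.Linear(hidden, vocab, bias=False).cuda()

orpo_loss = LigerFusedLinearORPOLoss()

# Assumed layout: chosen and rejected completions stacked along dim 0 -> (2 * batch, seq_len, hidden)
x = torch.randn(2 * batch, seq_len, hidden, device="cuda", requires_grad=True)
target = torch.randint(vocab, (2 * batch, seq_len), device="cuda")

y = orpo_loss(lm_head.weight, x, target)
```
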
### Distillation Kernels

| Kernel | API |
|---------------------------------|-------------------------------------------------------------|
| KLDivergence | `liger_kernel.transformers.LigerKLDIVLoss` |
| JSD | `liger_kernel.transformers.LigerJSD` |
| Fused Linear JSD | `liger_kernel.transformers.LigerFusedLinearJSD` |

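As a rough sketch of the distillation losses, the snippet below assumes `LigerKLDIVLoss` mirrors `torch.nn.KLDivLoss` (log-probabilities as input, probabilities as target); verify the reduction and `log_target` semantics against the module's docstring before relying on it.

```python
import torch
import torch.nn.functional as F

from liger_kernel.transformers import LigerKLDIVLoss

kl_loss = LigerKLDIVLoss(reduction="batchmean")

student_logits = torch.randn(8, 128, device="cuda", requires_grad=True)
teacher_logits = torch.randn(8, 128, device="cuda")

# Assumed KLDivLoss-style contract: input = student log-probs, target = teacher probs
loss = kl_loss(F.log_softmax(student_logits, dim=-1), F.softmax(teacher_logits, dim=-1))
loss.backward()
```
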
### Experimental Kernels

| Kernel | API |
|---------------------------------|-------------------------------------------------------------|
| Embedding | `liger_kernel.transformers.experimental.LigerEmbedding` |
| Matmul int2xint8 | `liger_kernel.transformers.experimental.matmul` |

## Contributing, Acknowledgements, and License

- Contributing Guidelines
- Acknowledgements
- License Information

## Sponsorship and Collaboration

- AMD: Providing AMD GPUs for our AMD CI.
- Intel: Providing Intel GPUs for our Intel CI.
- Modal: Free 3000 credits from GPU MODE IRL for our NVIDIA CI.
- EmbeddedLLM: Making Liger Kernel run fast and stable on AMD.
- HuggingFace: Integrating Liger Kernel into Hugging Face Transformers and TRL.
- Lightning AI: Integrating Liger Kernel into Lightning Thunder.
- Axolotl: Integrating Liger Kernel into Axolotl.
- Llama-Factory: Integrating Liger Kernel into Llama-Factory.

## Contact

- For issues, create a GitHub ticket in this repository
- For open discussion, join our Discord channel
- For formal collaboration, send an email to [email protected]

## Cite this work

Biblatex entry:

```bibtex
@article{hsu2024ligerkernelefficienttriton,
  title={Liger Kernel: Efficient Triton Kernels for LLM Training},
  author={Pin-Lun Hsu and Yun Dai and Vignesh Kothapalli and Qingquan Song and Shao Tang and Siyu Zhu and Steven Shimizu and Shivam Sahni and Haowen Ning and Yanning Chen},
  year={2024},
  eprint={2410.10989},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.10989},
  journal={arXiv preprint arXiv:2410.10989},
}
```

## Star History

[Star History Chart]

## Acknowledgement

### Design

- @claire_yishan for the LOGO design
- Wave Snippets for generating the animated code snippets

### Code

We referenced or used the following projects:

| # | Project | Description | Location | License |
|---|---------|-------------|----------|---------|
| 1 | Unsloth | `calculate_settings` to determine block size and warp; we reuse it for Norm and MLP | Liger Kernel Utils | Apache |
| 2 | Unsloth | We modified and added dW calculation on top of the Unsloth implementation | Liger Kernel RMS Norm | Apache |
| 3 | Triton tutorial | We modified on top of the Triton tutorials | Liger Kernel RMS Norm | MIT |
| 4 | tiny shakespeare dataset | We use the tiny shakespeare dataset to conduct convergence tests on mini models | Liger Kernel Convergence | N/A |
| 5 | Efficient Cross Entropy | We use the idea of gradient-in-forward and chunking | Liger Kernel Linear Cross Entropy | MIT |
| 6 | Flash attn | We take many optimization ideas from the work, such as tiling and recomputation | | BSD |
| 7 | AutoAWQ | We reference the design of AutoModel | Liger Kernel Auto Model | MIT |
| 8 | llm.c | We reference the design of end-to-end testing | Liger Kernel Convergence Tests | MIT |

Many thanks to the contributors to these projects for their invaluable work that helped make Liger possible.

# Contributing to Liger-Kernel

Thank you for your interest in contributing to Liger-Kernel! This guide will help you set up your development environment, add a new kernel, run tests, and submit a pull request (PR).

## Maintainer

@ByronHsu(admin) @qingquansong @yundai424 @kvignesh1420 @lancerts @JasonZhu1313 @shimizust

## Interested in the ticket?

Leave `#take` in the comment and tag the maintainer.

## Setting Up Your Development Environment

1. Clone the Repository
   ```sh
   git clone https://github.com/linkedin/Liger-Kernel.git
   cd Liger-Kernel
   ```
2. Install Dependencies and Editable Package
   ```sh
   pip install -e ".[dev]"
   ```
   If you encounter the error `no matches found: .[dev]`, please use
   ```sh
   pip install -e .'[dev]'
   ```

## Structure

### Source Code

- `ops/`: Core Triton operations.
- `transformers/`: PyTorch `nn.Module` implementations built on Triton operations, compliant with the `transformers` API.

### Tests

- `transformers/`: Correctness tests for the Triton-based layers.
- `convergence/`: Patches Hugging Face models with all kernels, runs multiple iterations, and compares weights, logits, and loss layer-by-layer.

### Benchmark

- `benchmark/`: Execution time and memory benchmarks compared to Hugging Face layers.

## Adding support for a new model

To get familiar with the folder structure, please refer to https://github.com/linkedin/Liger-Kernel?tab=readme-ov-file#structure.

1. Figure out the kernels that can be monkey-patched
   - Check the `src/liger_kernel/ops` directory to find the kernels that can be monkey-patched.
   - Kernels like Fused Linear Cross Entropy require a custom `lce_forward` function to allow monkey-patching. For adding kernels requiring a similar approach, ensure that you create the corresponding forward function in the `src/liger_kernel/transformers/model` directory.
2. Monkey-patch the HuggingFace model
   - Add the monkey-patching code in the `src/liger_kernel/transformers/monkey_patch.py` file.
   - Ensure that the monkey-patching function is added to the `__init__.py` file in the `src/liger_kernel/transformers/` directory.
3. Add Unit Tests
   - Create unit tests and convergence tests for the monkey-patched model in the tests directory. Ensure that your tests cover all functionalities of the monkey-patched model.

## Adding a New Kernel

To get familiar with the folder structure, please refer to https://github.com/linkedin/Liger-Kernel?tab=readme-ov-file#structure.

1. Create Your Kernel
   Add your kernel implementation in `src/liger_kernel/`.
2. Add Unit Tests
   Create unit tests and convergence tests for your kernel in the tests directory. Ensure that your tests cover all kernel functionalities.
3. Add Benchmark Script
   Add a benchmarking script under `benchmark/scripts` using the naming convention `benchmark_{kernel_name}.py`, showing the performance difference between the Liger kernel and HuggingFace.

## Run tests

### Use Makefile to run full tests

1. Run `make test` to ensure correctness.
2. Run `make checkstyle` to ensure code style.
3. Run `make test-convergence` to ensure convergence.

### Run pytest on a single file

```sh
python -m pytest test_sample.py::test_function_name
```

## Run kernel benchmarks

The `/benchmark` directory contains benchmarking scripts for the individual kernels, demonstrating differences in speed and memory usage between the Liger and HuggingFace module implementations.

1. Run `make run-benchmarks` to run all benchmarking scripts and append data to `benchmark/data/all_benchmark_data.csv`.
   - Existing entries that are the same (based on `kernel_name`, `kernel_provider`, `kernel_operation_mode`, `metric_name`, `x_name`, `x_value`, `extra_benchmark_config_str`, and `gpu_name`) will not be overwritten.
2. Run `make run-benchmarks OVERWRITE=1` to overwrite any existing entries that have the same configuration.
3. Run `python benchmark/scripts/benchmark_{kernel_name}.py` to run an individual benchmark.
4. You can use the `benchmark/benchmarks_visualizer.py` script to generate visualizations from the CSV; these are then saved to the `benchmark/visualizations` directory (note: this directory is not tracked by git).

## Submit PR

Fork the repo, copy and paste the successful test logs into the PR, and submit the PR following the PR template (example PR).

> As a contributor, you represent that the code you submit is your original work or that of your employer (in which case you represent you have the right to bind your employer). By submitting code, you (and, if applicable, your employer) are licensing the submitted code to LinkedIn and the open source community subject to the BSD 2-Clause license.

## Release (maintainer only)

1. Bump the version in pyproject.toml to the desired version (for example, `0.2.0`).
2. Submit a PR and merge.
3. Create a new release based on the current HEAD, with a tag name using `v<version number>`, for example `v0.2.0`. Alternatively, if you want to create a release based on a different commit hash, run `git tag v0.2.0 <commit hash> && git push origin v0.2.0`, and create the release based on this tag.
4. Add release notes: the minimum requirement is to click the `Generate Release Notes` button, which automatically generates 1) changes included and 2) new contributors. It's good to add sections on top to highlight the important changes.
5. A new pip upload will be triggered upon a new release. NOTE: Both pre-releases and official releases will trigger the workflow to build the wheel and publish to PyPI, so please be sure that steps 1-3 are followed correctly!

### Notes on versioning

We follow semantic versioning. Denoting the version as `major.minor.patch`, we increment:

- Major version when there is a backward-incompatible change
- Minor version when there is new backward-compatible functionality
- Patch version for bug fixes

This project is licensed under the BSD 2-Clause License (see `LICENSE` for details).
It also includes components from projects licensed under:

- Apache License 2.0 (see `LICENSE-APACHE-2.0` for details).
- MIT License (see `LICENSE-MIT-AutoAWQ` for details).
- MIT License (see `LICENSE-MIT-Efficient Cross Entropy` for details).
- MIT License (see `LICENSE-MIT-llmc` for details).
- MIT License (see `LICENSE-MIT-triton` for details).

# Liger-Kernel Example with HuggingFace Trainer

## How to Run

### Locally on a GPU machine

You can run the example locally on a GPU machine. The default hyperparameters and configurations work on a single node with 4xA100 80GB GPUs.

```bash
pip install -r requirements.txt
sh run_{MODEL}.sh
```

### Remotely on Modal

If you do not have access to a GPU machine, you can run the example on Modal. Modal is a serverless platform that allows you to run your code on a remote GPU machine. You can sign up for a free account at Modal.

```bash
pip install modal
modal setup  # authenticate with Modal
modal run launch_on_modal.py --script "run_qwen2_vl.sh"
```

Notes

1. This example uses an optional `use_liger` flag. If true, it does a one-line monkey patch to apply the Liger kernel.
2. The example uses the Llama3 model, which requires a community license agreement and HuggingFace Hub login. If you want to use Llama3 in this example, please make sure you have done the following:
   * Agree to the community license agreement: https://huggingface.co/meta-llama/Meta-Llama-3-8B
   * Run `huggingface-cli login` and enter your HuggingFace token
3. The default hyperparameters and configurations work on a single node with 4xA100 80GB GPUs. For running on a device with less GPU RAM, please consider reducing the per-GPU batch size and/or enabling `CPUOffload` in FSDP.

## Benchmark Result

### LLaMA

Benchmark conditions: LLaMA 3-8B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 20%, while GPU memory usage drops by 40%. This allows you to train the model on smaller GPUs, use larger batch sizes, or handle longer sequence lengths without incurring additional costs.

### QWEN

Benchmark conditions: Qwen2-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 10%, while GPU memory usage drops by 50%.

### GEMMA 7B

Benchmark conditions: Gemma-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 24%, while GPU memory usage drops by 33%.

# Liger-Kernel Example with Lightning Trainer

## How to Run

```bash
pip install -r requirements.txt

# For a single L40 48GB GPU
python training.py --model Qwen/Qwen2-0.5B-Instruct --num_gpu 1 --max_length 1024

# For 8xA100 40GB
python training.py --model meta-llama/Meta-Llama-3-8B --strategy deepspeed
```

Notes

1. The example uses the Llama3 model, which requires a community license agreement and HuggingFace Hub login. If you want to use Llama3 in this example, please make sure you have done the following:
   * Agree to the community license agreement: https://huggingface.co/meta-llama/Meta-Llama-3-8B
   * Run `huggingface-cli login` and enter your HuggingFace token
2. The default hyperparameters and configuration for Gemma work on a single L40 48GB GPU, and the config for Llama works on a single node with 8xA100 40GB GPUs. For running on a device with less GPU RAM, please consider reducing the per-GPU batch size and/or enabling `CPUOffload` in FSDP.

# Liger-Kernel Example with Medusa

Medusa is a simple framework that democratizes the acceleration techniques for LLM generation with multiple decoding heads. [repo], [paper]

During training, Medusa requires adding `k` decoding heads to the hidden states right before the regular LM head `h_t`. The `k`-th head is used to predict the token in the `(t + k + 1)`-th position of the next tokens (the original language model head is used to predict the `(t + 1)`-th position).

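Spelled out as a formula, this is just a restatement of the sentence above, with each Medusa head simplified to a single projection `W_k` (the actual Medusa heads add a small feed-forward block, omitted here):

```latex
% Both the original LM head W_0 and the k-th Medusa head W_k read the same hidden state h_t:
p_0(x_{t+1}   \mid x_{\le t}) = \operatorname{softmax}(W_0\, h_t)
p_k(x_{t+k+1} \mid x_{\le t}) = \operatorname{softmax}(W_k\, h_t), \quad k = 1, \dots, K
```
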
The Liger fused CE kernel is highly effective in this scenario, eliminating the need to materialize logits for each head, which usually consumes a large volume of memory due to the extensive vocabulary size (e.g., for LLaMA-3, the vocabulary size is 128k). The introduction of multiple heads can easily lead to OOM (Out of Memory) issues. However, thanks to the efficient Liger fused CE, which calculates the gradient in place and doesn't materialize the logits, we have observed very effective results. This efficiency opens up more opportunities for multi-token prediction research and development.

# Instructions to Run the Training Script

```bash
git clone [email protected]:linkedin/Liger-Kernel.git
cd {PATH_TO_Liger-Kernel}/Liger-Kernel/
pip install -e .
cd {PATH_TO_Liger-Kernel}/Liger-Kernel/examples/medusa
pip install -r requirements.txt
sh scripts/llama3_8b_medusa.sh
```

Notes

1. This example uses an optional `use_liger` flag. If true, it does a monkey patch to apply the Liger kernel with Medusa heads.
2. The example uses the Llama3 model, which requires a community license agreement and HuggingFace Hub login. If you want to use Llama3 in this example, please make sure you have done the following:
   * Agree to the community license agreement: https://huggingface.co/meta-llama/Meta-Llama-3-8B
   * Run `huggingface-cli login` and enter your HuggingFace token
3. The default hyperparameters and configurations work on a single node with 8xA100 GPUs. For running on a device with less GPU RAM, please consider reducing the per-GPU batch size and/or enabling `CPUOffload` in FSDP.
4. We are using a smaller sample of ShareGPT data primarily to benchmark performance. The example requires hyperparameter tuning and dataset selection to work effectively, also ensuring the dataset has the same distribution as the LLaMA pretraining data. Contributions to enhance the example code are welcome.

# Memory Profiling Result

> Note:
> 1. Benchmark conditions: LLaMA 3-8B, Batch Size = 6, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.

## Stage 1

Stage 1 refers to Medusa-1, where the backbone model is frozen and only the weights of the LLM heads are updated.

```bash
# Setting this flag to True in llama3_8b_medusa.sh enables Stage 1
--medusa_only_heads True
```

### num_head = 3

### num_head = 5

## Stage 2

```bash
# Setting this flag to False in llama3_8b_medusa.sh enables Stage 2
--medusa_only_heads False
```

Stage 2 refers to Medusa-2, where all the model weights are updated, including the backbone model and the LLM heads.

### num_head = 3

### num_head = 5

# Liger FlexChunkLoss: Alignment and Distillation loss

Liger FlexChunkLoss offers a versatile interface, delivering up to 80% memory savings and a 10% throughput boost for post-training loss functions, including alignment (DPO, ORPO, CPO) and, very soon, distillation. Its flexible design supports custom losses, ensuring efficiency gains across diverse use cases.

### User interface

FlexChunkLoss offers two flexible usage options:

1. Via `Liger[Custom Loss]Trainer`
   For example, by simply replacing the HuggingFace `ORPOTrainer` with `LigerORPOTrainer` in your code, you can leverage our optimized ORPO implementation and immediately benefit from improved performance (see the sketch after this list).

2. Using `nn.Module` Implementations of Custom Loss Functions
   Explore the `LigerORPOTrainer` implementation to see how the modular design integrates custom loss functions seamlessly.

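A minimal sketch of option 1. Everything here other than the swap itself is an assumption for illustration: the `LigerORPOTrainer` import path, the TRL-style `ORPOConfig`, the dataset name, and the trainer keyword arguments (which vary across `trl` versions). Adapt it to your own script.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig  # hypothetical setup: LigerORPOTrainer used in place of trl's ORPOTrainer

from liger_kernel.transformers.trainer import LigerORPOTrainer  # import path assumed; check your installed version

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# The swap: construct LigerORPOTrainer exactly like trl's ORPOTrainer and train as usual.
trainer = LigerORPOTrainer(
    model=model,
    args=ORPOConfig(output_dir="./orpo-output", logging_steps=10),
    tokenizer=tokenizer,  # newer trl versions use `processing_class=` instead
    train_dataset=train_dataset,
)
trainer.train()
```
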
### What's under the hood?

We employ chunking and fused kernel optimizations to enhance performance. By fusing the final linear layer with the loss computation and calculating backward gradients during the forward pass, we significantly reduce the need to store intermediate activations. All operations are implemented in PyTorch, leveraging `torch.compile` to streamline kernel execution without relying on extensive low-level optimizations. Additionally, we minimize `torch.compile` recompilations to reduce overhead and ensure consistent performance gains.

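To make the chunking and gradient-in-forward ideas concrete, here is a small, illustrative pure-PyTorch sketch. It is not the library's implementation (which fuses more steps, handles ignored tokens, and relies on `torch.compile`); it only shows the core idea of processing tokens chunk by chunk and accumulating the weight gradient as part of the forward computation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def chunked_linear_ce_with_grad(weight, hidden, target, chunk_size=1024):
    """Illustrative only: fuse "linear + cross entropy", processing tokens chunk by chunk
    and accumulating the weight gradient during the forward pass, so the full
    (num_tokens, vocab_size) logits tensor is never materialized."""
    num_tokens = hidden.shape[0]
    total_loss = hidden.new_zeros(())
    grad_weight = torch.zeros_like(weight)
    for start in range(0, num_tokens, chunk_size):
        h = hidden[start:start + chunk_size]   # (c, hidden_size)
        t = target[start:start + chunk_size]   # (c,)
        logits = h @ weight.t()                # (c, vocab_size): only one chunk of logits at a time
        total_loss += F.cross_entropy(logits, t, reduction="sum")
        # d(sum CE)/d(logits) = softmax(logits) - one_hot(target); fold it into dW right away.
        dlogits = torch.softmax(logits, dim=-1)
        dlogits[torch.arange(t.shape[0], device=t.device), t] -= 1.0
        grad_weight += dlogits.t() @ h
    # Mean-reduce the loss and scale the accumulated gradient to match.
    return total_loss / num_tokens, grad_weight / num_tokens

# Example usage with illustrative sizes: 4096 tokens, hidden size 512, vocab 32000
weight = torch.randn(32000, 512, device="cuda")
hidden = torch.randn(4096, 512, device="cuda")
target = torch.randint(32000, (4096,), device="cuda")
loss, grad_w = chunked_linear_ce_with_grad(weight, hidden, target)
```
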
### Extending to custom loss functions

We provide two base classes: `LigerFusedLinearPreferenceBase` for alignment use cases and `LigerFusedLinearDistillationBase` for distillation use cases. These base classes manage chunking, kernel fusion, and Torch compilation.

To implement a custom loss function, you need to create a subclass that defines the custom preference or distillation loss function, capable of processing a given input chunk. The base class will take care of the optimizations, handling most of the heavy lifting for you.

For a working example, refer to the ORPO loss implementation.