Task: Fine-tune a base language model to act as a customer support chatbot for a telecom company.
**Full Fine-Tuning**

- Updates all of the model's parameters (potentially billions).
- Requires significant computational resources.
- Produces large model checkpoints (~10–100 GB).
- Higher chance of forgetting prior knowledge.

Pros:

- Can fully adapt the model to a new domain.
- Delivers the best performance when the target domain differs substantially from the base model's training data.

Cons:

- High compute and memory requirements.
- Risk of catastrophic forgetting.
- Hard to share or deploy multiple variants.
**LoRA (Low-Rank Adaptation)**

- Keeps the base model frozen.
- Adds small trainable low-rank matrices to selected layers (e.g., the attention projections).
- Trains only a few million parameters (see the quick calculation after this list).
- LoRA adapters are usually under 100 MB.

Pros:

- Extremely parameter- and memory-efficient.
- Trainable on consumer GPUs.
- Easy to manage and swap different domain-specific adapters.

Cons:

- Less flexible for tasks far from the base model's knowledge.
- Not suited to tasks that need early-layer modification or structural changes.
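To see where "a few million parameters" comes from, here is a back-of-the-envelope count. The dimensions below are illustrative assumptions for a 7B-class model (hidden size 4096, 32 layers), not values taken from any specific checkpoint:

```python
# Rough LoRA trainable-parameter count (illustrative dimensions).
hidden_size = 4096     # assumed width of a 7B-class model
num_layers = 32        # assumed depth
r = 8                  # LoRA rank
modules_per_layer = 2  # q_proj and v_proj

# Each adapted d x d weight W gets two low-rank factors,
# A (r x d) and B (d x r), i.e. 2 * r * d trainable parameters.
params_per_module = 2 * r * hidden_size
total = params_per_module * modules_per_layer * num_layers
print(f"{total:,} trainable parameters")  # 4,194,304 (~4M)

# Targeting four projections (q, k, v, o) roughly doubles this,
# which is where figures around ~8M come from.
```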
**QLoRA (Quantized LoRA)**

- Combines quantization (4-bit or 8-bit) of the base model with LoRA adapters (see the config sketch after this list).
- Reduces memory even further than LoRA alone.
- Enables fine-tuning of large LLMs on a single GPU, or even a laptop.

Pros:

- Lowest memory footprint of the three methods.
- Supports very large models (e.g., 65B+) on limited hardware.
- Ideal for research and rapid iteration.

Cons:

- May introduce quantization noise.
- Some performance drop versus full precision, especially on math-heavy tasks.
- Requires additional libraries (e.g., `bitsandbytes`).
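The quantization side of QLoRA is configured when the base model is loaded. Here is a minimal sketch using Hugging Face's `BitsAndBytesConfig`; the NF4 and double-quantization settings follow the QLoRA paper's recipe, and `"llama-base"` is the placeholder model name used throughout this post:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, per the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls during training
)

model = AutoModelForCausalLM.from_pretrained(
    "llama-base",
    device_map="auto",
    quantization_config=bnb_config,
)
```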
Illustrative comparison for a ~7B-parameter base model:

| Metric | Full Fine-Tune | LoRA | QLoRA |
|---|---|---|---|
| Parameters updated | 6.7 billion | ~8 million (r=8) | ~8 million (r=8) |
| Time to train | ~48 hours | ~3–4 hours | ~3–4 hours |
| GPU needed | 4x A100 | 1x consumer-grade GPU | 1x consumer/laptop GPU |
| File size | ~20–40 GB | ~50 MB adapter | ~50 MB adapter + quantized base model |
| Maintains base model? | ❌ No | ✅ Yes | ✅ Yes |
| Easily swappable? | ❌ No | ✅ Yes | ✅ Yes |
| Memory efficiency | ❌ Low | ✅ Good | ✅✅ Excellent |
| Use Case | Best Method |
|---|---|
| Maximum performance on a very different domain | Full Fine-Tune |
| Cost-effective fine-tuning in a similar domain | LoRA |
| Training large models on low-end hardware | QLoRA |
| Many task variants, modular deployment | LoRA / QLoRA |
| Solo devs, startups, fast iteration | QLoRA |
**Example: Full Fine-Tuning**

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("llama-base")
tokenizer = AutoTokenizer.from_pretrained("llama-base")

# Assume the tokenized dataset is ready
tokenized_dataset = ...

training_args = TrainingArguments(
    output_dir="./full-ft",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Copies input_ids to labels so the causal-LM loss can be computed.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```
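The checkpoint saved after full fine-tuning contains every parameter, which is where the 10–100 GB checkpoint sizes above come from. A sketch of the save step (the output path is illustrative):

```python
# Writes the full parameter set: roughly 13 GB for a 7B model in fp16.
trainer.save_model("./full-ft/final")
tokenizer.save_pretrained("./full-ft/final")
```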
**Example: LoRA Fine-Tuning**

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("llama-base")
tokenizer = AutoTokenizer.from_pretrained("llama-base")

# Inject LoRA adapters into the attention projections
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # which modules get adapters
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

training_args = TrainingArguments(
    output_dir="./lora-ft",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```
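Because the base model stays frozen, only the small adapter needs to be saved, and swapping domains means attaching a different adapter to the same base. A sketch (the adapter path is hypothetical):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Save only the LoRA weights: tens of MB, not tens of GB.
model.save_pretrained("./lora-ft/telecom-adapter")

# Later: attach any domain-specific adapter to the same frozen base.
base = AutoModelForCausalLM.from_pretrained("llama-base")
chatbot = PeftModel.from_pretrained(base, "./lora-ft/telecom-adapter")
```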
**Example: QLoRA Fine-Tuning**

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
    prepare_model_for_kbit_training,
)

# Load the base model in 4-bit (requires bitsandbytes to be installed)
model = AutoModelForCausalLM.from_pretrained(
    "llama-base",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
tokenizer = AutoTokenizer.from_pretrained("llama-base")

# Prepare the quantized model for training (casts norms, enables input grads)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./qlora-ft",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```
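For deployment, the trained adapter can optionally be merged back into the base weights so inference needs no PEFT dependency. Merging into 4-bit weights is lossy, so a common pattern is to reload the base model unquantized first; a sketch, with hypothetical paths:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

model.save_pretrained("./qlora-ft/adapter")  # save the trained adapter

# Reload the base in full/half precision, attach the adapter, and merge.
base = AutoModelForCausalLM.from_pretrained("llama-base", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "./qlora-ft/adapter").merge_and_unload()
merged.save_pretrained("./qlora-ft/merged")
```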
LoRA and QLoRA are excellent choices for efficient model adaptation. QLoRA extends LoRA’s efficiency by adding quantization, making it even more accessible for fine-tuning large models on modest hardware. Full fine-tuning remains relevant for maximum flexibility and performance when resources are not a limitation.