LoRA vs Full Fine-Tuning vs QLoRA

Real-World Scenario

Task: Fine-tune a base language model to act as a customer support chatbot for a telecom company.


Full Fine-Tuning

Description

  • Updates all the model's parameters (can be billions; see the parameter-count sketch after this list).
  • Requires high computational resources.
  • Large model checkpoints (~10–100 GB).
  • Higher chance of forgetting prior knowledge.
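
For a concrete sense of scale, a minimal sketch that counts how many parameters full fine-tuning touches ("llama-base" is the same placeholder model id used in the snippets below; substitute a real checkpoint):

from transformers import AutoModelForCausalLM

# Placeholder model id; swap in the actual base checkpoint.
model = AutoModelForCausalLM.from_pretrained("llama-base")

# In full fine-tuning, every weight tensor receives gradients and gets updated.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total params:     {total / 1e9:.2f}B")
print(f"trainable params: {trainable / 1e9:.2f}B")  # equals total: nothing is frozen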

Pros

  • Can fully adapt model to a new domain.
  • Best performance when base model and target domain are very different.

Cons

  • High compute and memory requirements.
  • Risk of catastrophic forgetting.
  • Hard to share or deploy multiple variants.

LoRA (Low-Rank Adaptation)

Description

  • Keeps the base model frozen.
  • Adds small trainable matrices to selected layers (e.g., attention); see the sketch after this list.
  • Trains only a few million parameters.
  • LoRA adapters are usually under 100 MB.
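
Conceptually, the frozen weight W is left untouched and a low-rank correction B·A, scaled by alpha/r, is learned on top of it. A minimal sketch with illustrative dimensions (hidden size 4096, rank 8, matching the r=8 configuration used later):

import torch

d, r, alpha = 4096, 8, 32            # hidden size, LoRA rank, scaling factor (illustrative)

W = torch.randn(d, d)                # frozen pretrained weight, never updated
A = torch.randn(r, d) * 0.01         # trainable low-rank factor (small random init)
B = torch.zeros(d, r)                # trainable low-rank factor (zero init, so the correction starts at 0)

x = torch.randn(1, d)
y = x @ W.T + (alpha / r) * (x @ A.T @ B.T)   # base projection plus low-rank correction

# Only A and B are trained: r * (d + d) = 65,536 params per adapted matrix,
# versus d * d = ~16.8M for the full weight it sits on.
print(A.numel() + B.numel(), "trainable vs", W.numel(), "frozen")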

Pros

  • Extremely efficient.
  • Trainable on consumer GPUs.
  • Easy to manage and swap different domain-specific adapters.

Cons

  • Less flexible for tasks far from the base model's knowledge.
  • Not suited for tasks needing early layer modification or structural changes.

QLoRA (Quantized LoRA)

Description

  • Combines quantization (4-bit or 8-bit) of the base model with LoRA adapters.
  • Reduces memory even further than LoRA alone.
  • Makes fine-tuning of very large LLMs feasible on a single GPU, sometimes even a laptop-class one (rough memory arithmetic below).
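
Rough back-of-the-envelope arithmetic (assuming a 7B-parameter base model and ignoring activation and optimizer overhead) shows where the savings come from:

params = 7e9                          # assumed 7B-parameter base model

fp16_gb = params * 2 / 1e9            # 16-bit weights: 2 bytes per parameter
int4_gb = params * 0.5 / 1e9          # 4-bit weights: 0.5 bytes per parameter
lora_mb = 8e6 * 2 / 1e6               # ~8M LoRA params kept in 16-bit

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")   # ~14 GB before any training overhead
print(f"4-bit weights: ~{int4_gb:.1f} GB")   # ~3.5 GB, within reach of a single consumer GPU
print(f"LoRA adapter:  ~{lora_mb:.0f} MB")   # the only part trained in full precision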

Pros

  • Lowest memory footprint among all methods.
  • Supports very large models (e.g., 65B+) on limited hardware.
  • Ideal for research and rapid iteration.

Cons

  • May introduce quantization noise.
  • Some performance drop vs full precision, especially for math-heavy tasks.
  • Requires additional libraries (e.g., bitsandbytes).

Quantitative Comparison

| Metric | Full Fine-Tune | LoRA | QLoRA |
| --- | --- | --- | --- |
| Parameters updated | ~6.7 billion (7B-class model) | ~8 million (r=8) | ~8 million (r=8) |
| Time to train | ~48 hours | ~3–4 hours | ~3–4 hours |
| GPU needed | 4x A100 | 1x consumer-grade GPU | 1x consumer/laptop GPU |
| File size | ~20–40 GB | ~50 MB | ~50 MB + quantized base model |
| Maintains base model? | ❌ No | ✅ Yes | ✅ Yes |
| Easily swappable? | ❌ No | ✅ Yes | ✅ Yes |
| Memory efficiency | ❌ Low | ✅ Good | ✅✅ Excellent |

When to Use Each?

| Use Case | Best Method |
| --- | --- |
| Maximum performance on a very different domain | Full Fine-Tune |
| Cost-effective fine-tuning in a similar domain | LoRA |
| Training large models on low-end hardware | QLoRA |
| Many task variants, modular deployment | LoRA / QLoRA |
| Solo devs, startups, fast iteration | QLoRA |

Example Code Snippets

Full Fine-Tuning (HuggingFace)

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("llama-base")
tokenizer = AutoTokenizer.from_pretrained("llama-base")

# Assume dataset is ready
tokenized_dataset = ...

training_args = TrainingArguments(
    output_dir="./full-ft",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

trainer.train()
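
Saving the result makes the checkpoint-size difference from the table above tangible: the entire updated model is written out, not a small adapter (the output path is illustrative):

trainer.save_model("./full-ft/final")  # writes the full model, typically tens of GB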

LoRA Fine-Tuning (Using peft)

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType

model = AutoModelForCausalLM.from_pretrained("llama-base")
tokenizer = AutoTokenizer.from_pretrained("llama-base")

# Inject LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only the LoRA matrices (a few million params) are trainable

# Assume dataset is ready
tokenized_dataset = ...

training_args = TrainingArguments(
    output_dir="./lora-ft",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

trainer.train()
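
Because the base model never changes, the trained adapter can be saved and swapped independently. A short sketch (paths and adapter layout are hypothetical) of saving the adapter from the snippet above and re-attaching it later:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Saves only the adapter weights (tens of MB), not the multi-GB base model.
model.save_pretrained("./lora-ft/telecom-support-adapter")

# Later, or on another machine: reload the frozen base and attach the adapter.
base = AutoModelForCausalLM.from_pretrained("llama-base")
chat_model = PeftModel.from_pretrained(base, "./lora-ft/telecom-support-adapter")

# Other domain-specific adapters can be attached to the same base in the same way,
# e.g. one adapter per product line, chosen at load time.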

QLoRA Fine-Tuning (Using bitsandbytes + peft)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, Trainer, TrainingArguments
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig, TaskType
# bitsandbytes must be installed; transformers uses it as the 4-bit backend

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "llama-base",
    device_map="auto",
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained("llama-base")

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = prepare_model_for_kbit_training(model)  # standard k-bit prep: casts norms to fp32, enables gradient checkpointing
model = get_peft_model(model, lora_config)

# Assume dataset is ready
tokenized_dataset = ...

training_args = TrainingArguments(
    output_dir="./qlora-ft",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

trainer.train()
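
For deployment, one common pattern (sketched below; the model id and adapter path are placeholders) is to reload the base model in 16-bit and merge the trained adapter into it, since merging directly into a 4-bit model is not straightforward. Alternatively, the quantized base can simply be served with the adapter attached.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("llama-base", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "./qlora-ft/adapter").merge_and_unload()
merged.save_pretrained("./qlora-ft/merged-model")  # a plain, adapter-free checkpoint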

Conclusion

LoRA and QLoRA are excellent choices for efficient model adaptation. QLoRA extends LoRA’s efficiency by adding quantization, making it even more accessible for fine-tuning large models on modest hardware. Full fine-tuning remains relevant for maximum flexibility and performance when resources are not a limitation.
