LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) democratised LLM fine-tuning. Instead of retraining all parameters (expensive), LoRA adds small low-rank adapter matrices; QLoRA combines them with quantisation to fit on consumer GPUs. Fine-tune Llama 3 70B on two A100s? With QLoRA, possible. Llama 3 8B on an RTX 3090? Also yes.
The Problem
Traditional fine-tune (Llama 3 8B):
- Parameters: 8B × 4 bytes (FP32) = 32GB for the weights alone.
- Gradients: another 32GB.
- Optimiser state (Adam keeps two moments per weight): 64GB+.
- Total: 128GB+ VRAM, before activations.
Out of reach for most.
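The arithmetic above can be checked with a short sketch (FP32 training with Adam's two moment buffers; activations excluded):

```python
def full_finetune_vram_gb(params_billion, bytes_per_param=4):
    """Rough FP32 training footprint: weights + gradients + Adam moments."""
    weights = params_billion * bytes_per_param  # model weights
    grads = weights                             # one gradient per weight
    adam = 2 * weights                          # Adam stores m and v per weight
    return weights + grads + adam

print(full_finetune_vram_gb(8))  # 8B params -> 128 GB before activations
```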
LoRA Solution
Idea: freeze the base model. Inject a pair of small matrices into chosen layers, so the weight update is ΔW = B × A:
- Rank r: low (e.g. 8, 16, 32) — A and B are tiny relative to W.
- Training: only adapter params (typically under 1% of the total).
- Memory: no gradients or optimiser state for the frozen base — huge savings.
Fine-tuned model = base + adapter. Stored separately; swap adapters per task.
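A minimal sketch of the idea, with dimensions assumed for a Llama-sized attention projection:

```python
import numpy as np

d, k, r = 4096, 4096, 16                  # layer dims and a typical LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))           # frozen base weight
A = rng.standard_normal((r, k)) * 0.01    # trainable, r x k
B = np.zeros((d, r))                      # trainable, d x r (zero-initialised)

W_adapted = W + B @ A                     # effective weight at inference

full = d * k                              # params in the frozen matrix
adapter = r * (d + k)                     # params LoRA actually trains
print(adapter / full)                     # ~0.0078 -> under 1% for this layer
```

Zero-initialising B means training starts from the unmodified base model, which is how LoRA initialises in practice.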
QLoRA Goes Further
QLoRA:
- Quantise the frozen base model to 4-bit (NF4 format).
- Dequantise weight blocks on-the-fly during the forward/backward pass.
- Compute gradients only for the FP16/BF16 adapter weights.
Result: a 70B model's quantised weights occupy ~35GB, so it can train in 48GB of VRAM; a 13B model fits a 24GB RTX 4090.
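Back-of-envelope check (0.5 bytes per weight at 4-bit, plus a small assumed overhead for quantisation constants):

```python
def qlora_base_gb(params_billion, bits=4, overhead=1.03):
    """Memory for the frozen 4-bit base weights; overhead factor is an assumption."""
    return params_billion * bits / 8 * overhead

print(round(qlora_base_gb(70), 1))  # ~36 GB -> fits a 48GB card with room for adapters
print(round(qlora_base_gb(13), 1))  # ~6.7 GB -> comfortable on a 24GB RTX 4090
```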
Hardware Guidance
For fine-tuning:
| Model | LoRA GPU | QLoRA GPU |
|---|---|---|
| Llama 3 8B | 1× A100 40GB | 1× RTX 4090 24GB |
| Llama 3 70B | 4× A100 80GB | 2× A100 80GB |
| Mistral 7B | 1× RTX 3090 24GB | 1× RTX 4060 Ti 16GB |
| Phi-3 Mini | Consumer GPU | Consumer GPU |
QLoRA makes fine-tuning accessible.
Setup with PEFT
Hugging Face PEFT (Parameter-Efficient Fine-Tuning):
```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantisation for the frozen base (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~0.24% of total
```
Then run the usual training loop with Trainer.
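A configuration sketch, assuming `model` from above and a tokenised `dataset` you provide; the hyperparameters are illustrative, not tuned:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="llama3-8b-lora",    # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size 16
    learning_rate=2e-4,             # LoRA tolerates higher LRs than full FT
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```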
Training Time
Real examples:
- Llama 3 8B + 10k examples: ~2-4 hours on A100.
- Llama 3 70B QLoRA + 10k examples: ~12-24 hours on 2× A100 80GB.
- Mistral 7B + 1k examples: ~30 min on A100.
Significantly cheaper than full fine-tune.
Data Requirements
- High-quality examples: 100-1000 typically sufficient.
- Task-specific: SQL generation, support replies, classification.
- Format: prompt-completion pairs, chat format, instruction tuning format.
- Diversity: cover edge cases.
Quality beats quantity: 100 excellent examples > 10,000 mediocre ones.
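A sketch of the prompt-completion format (the record contents are invented for illustration):

```python
import json

def to_jsonl(records):
    """Serialise training examples to JSONL, one record per line."""
    return "\n".join(json.dumps(r) for r in records)

examples = [
    {"prompt": "Translate to SQL: count users who signed up in 2024.",
     "completion": "SELECT COUNT(*) FROM users WHERE signup_year = 2024;"},
    {"prompt": "Classify the sentiment: 'Great support, fast reply.'",
     "completion": "positive"},
]

print(to_jsonl(examples))
```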
Use Cases
Where LoRA/QLoRA helps:
- Domain adaptation: legal, medical, technical jargon.
- Task specialisation: SQL gen, classification, extraction.
- Style/tone: specific voice.
- Language: adapt to specific language.
- Compliance: remove unsafe behaviours.
Where prompt engineering suffices:
- Simple tasks where in-context examples work.
- The base model already performs well enough.
- Limited data (<50 examples).
Vs Full Fine-Tune
Quality:
- LoRA: 95-99% of full fine-tune quality in most cases.
- QLoRA: 93-98% — slight quality loss from quantisation.
Cost:
- LoRA: 10-100x cheaper than full.
- QLoRA: similar to LoRA + quantisation savings.
For most, LoRA/QLoRA suffices.
Serving Fine-Tuned Models
Deploy:
- Merged model: combine base + adapter → standard model file.
- Adapter-only: keep separate, swap per request.
vLLM supports multi-LoRA:
```shell
vllm serve base-model \
  --enable-lora \
  --lora-modules legal-lora=/path/legal medical-lora=/path/medical
```
Serve multiple fine-tunes on same infrastructure.
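With multi-LoRA enabled, a request selects its adapter via the `model` field of vLLM's OpenAI-compatible API (the endpoint and prompt below are illustrative):

```python
import json

payload = {
    "model": "legal-lora",  # adapter name registered with --lora-modules
    "prompt": "Summarise clause 4.2 of the attached contract.",
    "max_tokens": 256,
}

# POST this to http://localhost:8000/v1/completions (assumed default port)
print(json.dumps(payload))
```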
Popular Platforms
Fine-tuning services:
- Hugging Face AutoTrain: UI-based.
- Together.ai: API fine-tune.
- Modal.com: compute-based.
- RunPod: GPU rental.
- AWS SageMaker: enterprise.
- Google Vertex AI: similar.
Self-hosting: axolotl, unsloth, trl.
DPO: Preference-Based
Direct Preference Optimisation — fine-tune with preferences instead of examples:
- Dataset: (prompt, preferred_response, dispreferred_response).
- Model learns what to prefer.
- Often better than supervised fine-tuning alone.
Combine LoRA + DPO for alignment.
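A single preference record looks like this (TRL's DPOTrainer expects `prompt`/`chosen`/`rejected` keys; the text itself is invented):

```python
record = {
    "prompt": "Explain LoRA in one sentence.",
    "chosen": "LoRA freezes the base model and trains small low-rank adapter matrices.",
    "rejected": "LoRA retrains every parameter of the base model.",
}

# The training objective pushes the policy to rank "chosen" above "rejected"
# for the same prompt, relative to a frozen reference model.
print(sorted(record))
```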
Evaluation
After fine-tune:
- Held-out test set: quality metrics.
- Human evaluation: 50-100 samples.
- Benchmark comparison: vs base model.
- Production monitoring: drift detection.
Overfitting is common — evaluate carefully to catch it.
Limitations
- Catastrophic forgetting: fine-tuning can erode base capabilities.
- Domain narrowness: the model overfits to its training distribution.
- Safety regression: alignment can degrade.
- Drift over time: the model goes stale as requirements change.
Careful evaluation needed.
Cost Example
Fine-tune Mistral 7B QLoRA + 1000 examples:
- Compute: ~30 min on A100 ($1-2 rental).
- Storage: ~500MB adapter file.
- Inference: same as base + small adapter overhead.
Total: ~$5-10 for first fine-tune. Cheap iteration.
Conclusion
LoRA and QLoRA democratised fine-tuning. For teams without datacentre GPUs, QLoRA makes it possible. Where prompt engineering doesn’t suffice, a targeted fine-tune is now accessible. Prompt engineering first; fine-tune when justified. Workflow: evaluate → identify gaps → targeted fine-tune → measure improvement. The tooling (PEFT, unsloth, axolotl, TRL) is mature. Iteration is quick and cheap — a skill set engineers should have in 2024 and beyond.
Follow us on jacar.es for more on fine-tuning, LoRA, and LLM optimisation.