LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) have democratized fine-tuning of LLMs. Instead of retraining all parameters (expensive), LoRA adds small low-rank adapter matrices. QLoRA combines this with quantization to fit on consumer GPUs. Fine-tune Llama 3 70B on an A100? Possible. On an RTX 3090? With QLoRA, yes.
The problem
Traditional full fine-tuning (Llama 3 8B):
- Parameters: 8B × 4 bytes = 32GB for the weights alone.
- Gradients: another 32GB.
- Optimizer state (Adam): 64GB+ (two FP32 moment buffers per parameter).
- Total: 128GB+ of VRAM.
Out of reach for most teams.
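The arithmetic above can be sketched in a few lines. This is a rough FP32 back-of-envelope estimate; real trainers use mixed precision and sharding, so treat the helper as illustrative:

```python
# Back-of-envelope VRAM estimate for full fine-tuning with Adam,
# assuming everything is kept in FP32 (4 bytes per value).
def full_finetune_vram_gb(n_params_billions: float, bytes_per_param: int = 4) -> float:
    """Weights + gradients + Adam's two moment buffers, in GB."""
    weights = n_params_billions * bytes_per_param  # model weights
    grads = weights                                 # one gradient per weight
    optimizer = 2 * weights                         # Adam: momentum + variance
    return weights + grads + optimizer

print(full_finetune_vram_gb(8))  # 128.0 GB for an 8B-parameter model
```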
The LoRA solution
The idea: freeze the base model and add two small matrices, A and B, whose product adapts specific layers:
- Rank: low (e.g. 8, 16, or 32), so the matrices are tiny.
- Training: only the adapter parameters (~1% of the total).
- Memory: enormous savings.
Fine-tuned model = base + adapter. Adapters are stored separately, so you can swap one in per task.
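To see why the adapter is so small, count parameters for a single weight matrix. The function below is an illustrative sketch (not part of any library), using a matrix size typical of attention projections:

```python
# LoRA replaces the update to a frozen d_out x d_in matrix W with the
# product B @ A, where A is (r x d_in) and B is (d_out x r).
def lora_params(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (params in W, params in the rank-r adapter)."""
    full = d_out * d_in
    adapter = r * d_in + d_out * r
    return full, adapter

full, adapter = lora_params(4096, 4096, 16)
print(adapter / full)  # 0.0078125 -> well under 1% of the frozen matrix
```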
QLoRA goes further
QLoRA:
- Quantizes the base model to 4-bit (NF4 format).
- Computes gradients on the FP16 adapters, backpropagated through the quantized base.
- Dequantizes weights on the fly during the forward pass.
Result: a 70B model fits in ~48GB of VRAM (a single 48GB GPU such as an A6000), or a 13B model in 24GB (RTX 4090).
Hardware guidance
For fine-tuning:
| Model | LoRA GPU | QLoRA GPU |
|---|---|---|
| Llama 3 8B | 1× A100 40GB | 1× RTX 4090 24GB |
| Llama 3 70B | 4× A100 80GB | 2× A100 80GB |
| Mistral 7B | 1× RTX 3090 24GB | 1× RTX 4070 16GB |
| Phi-3 Mini | Consumer GPU | Consumer GPU |
QLoRA makes fine-tuning accessible.
Setup with PEFT
Hugging Face PEFT (Parameter-Efficient Fine-Tuning):
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# QLoRA: quantize the frozen base model to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~0.24% of total
```
Then run the usual training loop with Trainer.
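A minimal sketch of that loop, assuming `model` is the PEFT-wrapped model from above and `tokenized_dataset` is a pre-tokenized dataset; output paths and hyperparameters are illustrative, not tuned recommendations:

```python
# Sketch only: assumes `model`, `tokenizer`, and `tokenized_dataset` already exist.
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="llama3-lora",          # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,                # LoRA typically tolerates higher LRs than full FT
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama3-lora-adapter")  # saves only the adapter weights
```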
Training time
Real-world examples:
- Llama 3 8B + 10k examples: ~2-4 hours on an A100.
- Llama 3 70B QLoRA + 10k examples: ~12-24 hours on 2× A100 80GB.
- Mistral 7B + 1k examples: ~30 min on an A100.
Significantly cheaper than a full fine-tune.
Data requirements
- High-quality examples: 100-1,000 are typically sufficient.
- Task-specific: SQL generation, support replies, classification.
- Format: prompt-completion pairs, chat format, or instruction-tuning format.
- Diversity: cover the edge cases.
Quality matters more than quantity: 100 perfect examples beat 10,000 mediocre ones.
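One common record layout for instruction tuning is Alpaca-style JSONL (one object per line). Field names vary by trainer, so check what your training script expects; the record below is a made-up example:

```python
# Write a tiny Alpaca-style instruction-tuning dataset as JSONL.
import json

examples = [
    {
        "instruction": "Translate the question into a SQL query.",
        "input": "How many orders were placed in March?",
        "output": "SELECT COUNT(*) FROM orders WHERE strftime('%m', created_at) = '03';",
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # one JSON object per line
```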
Use cases
Where LoRA/QLoRA helps:
- Domain adaptation: legal, medical, technical jargon.
- Task specialization: SQL gen, classification, extraction.
- Style/tone: specific voice.
- Language: adapt the model to a specific language.
- Compliance: remove unsafe behaviors.
Where prompt engineering suffices:
- Simple tasks where in-context examples work.
- A model that already performs well enough.
- Limited data (<50 examples).
Versus full fine-tuning
Quality:
- LoRA: 95-99% of full fine-tune quality in most cases.
- QLoRA: 93-98%; a slight quality loss from quantization.
Cost:
- LoRA: 10-100× cheaper than a full fine-tune.
- QLoRA: similar to LoRA, plus the quantization savings.
For most teams, LoRA or QLoRA suffices.
Serving fine-tuned models
Two deployment options:
- Merged model: combine base + adapter into a standard model file.
- Adapter-only: keep the adapter separate and swap it per request.
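Merging can be done with PEFT's `merge_and_unload()`; model and adapter paths below are illustrative:

```python
# Sketch: fold a trained LoRA adapter into the base weights.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "llama3-lora-adapter")  # hypothetical adapter path
merged = model.merge_and_unload()        # folds B @ A into the frozen weights
merged.save_pretrained("llama3-8b-merged")  # a standard checkpoint; no PEFT needed to load
```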
vLLM supports multi-LoRA serving:

```bash
vllm serve base-model \
  --enable-lora \
  --lora-modules legal-lora=/path/legal medical-lora=/path/medical
```

This serves multiple fine-tunes on the same infrastructure.
Popular platforms
Fine-tuning services:
- Hugging Face AutoTrain: UI-based.
- Together.ai: fine-tuning via API.
- Modal.com: serverless compute for custom training jobs.
- RunPod: GPU rental.
- AWS SageMaker: enterprise.
- Google Vertex AI: similar.
Self-hosting: axolotl, unsloth, trl.
DPO: preference-based fine-tuning
Direct Preference Optimization fine-tunes on preferences instead of examples:
- Dataset: (prompt, preferred_response, dispreferred_response) triples.
- The model learns which responses to prefer.
- Often works better than supervised fine-tuning alone.
Combine LoRA + DPO for alignment.
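A sketch of the DPO step using trl's `DPOTrainer`, assuming `model` is a PEFT-wrapped model, `tokenizer` its tokenizer, and `pref_dataset` a dataset with `prompt`/`chosen`/`rejected` columns; argument names follow recent trl versions and may differ in older releases:

```python
# Sketch only: assumes `model`, `tokenizer`, and `pref_dataset` already exist.
from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    output_dir="llama3-dpo",           # hypothetical output path
    beta=0.1,                          # strength of the KL penalty vs. the reference model
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(
    model=model,                       # with PEFT, the frozen base serves as the reference
    args=config,
    train_dataset=pref_dataset,        # (prompt, chosen, rejected) triples
    processing_class=tokenizer,
)
trainer.train()
```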
Evaluation
After fine-tuning:
- Held-out test set: quality metrics.
- Human evaluation: 50-100 samples.
- Benchmark comparison: against the base model.
- Production monitoring: drift detection.
Overfitting is common; evaluate carefully to catch it.
Limitations
- Catastrophic forgetting: fine-tuning can erode the base model's capabilities.
- Domain narrowness: the model overfits to the fine-tuning domain.
- Safety regression: alignment can degrade.
- Drift over time: the model goes stale as data and requirements change.
Careful evaluation is needed.
Cost example
Fine-tuning Mistral 7B with QLoRA on 1,000 examples:
- Compute: ~30 min on an A100 ($1-2 of rental time).
- Storage: ~500MB adapter file.
- Inference: same cost as the base model plus a small adapter overhead.
Total: ~$5-10 for a first fine-tune. Iteration is cheap.
Conclusion
LoRA and QLoRA have democratized fine-tuning. For teams without datacenter GPUs, QLoRA makes it possible. Where prompt engineering falls short, a task-specific fine-tune is now accessible. Start with prompt engineering; fine-tune when it's justified. The workflow: evaluate → identify gaps → fine-tune for those gaps → measure the improvement. The tooling (PEFT, unsloth, axolotl, TRL) is mature, so you can iterate quickly and cheaply. It's a skill set engineers should have in 2024 and beyond.
Follow us at jacar.es for more on fine-tuning, LoRA, and LLM optimization.