LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) have democratized fine-tuning of LLMs. Instead of retraining all parameters (expensive), LoRA adds small low-rank adapter matrices. QLoRA combines this with quantization to fit on consumer GPUs. Fine-tune Llama 3 70B on an A100? Possible. On an RTX 3090? With QLoRA, yes.
The problem
Traditional full fine-tuning (Llama 3 8B):
- Parameters: 8B × 4 bytes = 32GB for the weights alone.
- Gradients: another 32GB.
- Optimizer state (Adam): 64GB+ (two FP32 moment buffers per parameter).
- Total: 128GB+ of VRAM.
Out of reach for most teams.
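The arithmetic above can be sketched in a few lines. This is a rough FP32 back-of-envelope estimate; real trainers use mixed precision and sharding, so treat the helper as illustrative:

```python
# Back-of-envelope VRAM estimate for full fine-tuning with Adam,
# assuming everything is kept in FP32 (4 bytes per value).
def full_finetune_vram_gb(n_params_billions: float, bytes_per_param: int = 4) -> float:
    """Weights + gradients + Adam's two moment buffers, in GB."""
    weights = n_params_billions * bytes_per_param  # model weights
    grads = weights                                 # one gradient per weight
    optimizer = 2 * weights                         # Adam: momentum + variance
    return weights + grads + optimizer

print(full_finetune_vram_gb(8))  # 128.0 GB for an 8B-parameter model
```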
The LoRA solution
The idea: freeze the base model and add two small matrices, A and B, whose product adapts specific layers:
- Rank: low (e.g. 8, 16, or 32), so the matrices are tiny.
- Training: only the adapter parameters (~1% of the total).
- Memory: enormous savings.
Fine-tuned model = base + adapter. Adapters are stored separately, so you can swap one in per task.
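To see why the adapter is so small, count parameters for a single weight matrix. The function below is an illustrative sketch (not part of any library), using a matrix size typical of attention projections:

```python
# LoRA replaces the update to a frozen d_out x d_in matrix W with the
# product B @ A, where A is (r x d_in) and B is (d_out x r).
def lora_params(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (params in W, params in the rank-r adapter)."""
    full = d_out * d_in
    adapter = r * d_in + d_out * r
    return full, adapter

full, adapter = lora_params(4096, 4096, 16)
print(adapter / full)  # 0.0078125 -> well under 1% of the frozen matrix
```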
QLoRA goes further
QLoRA:
- Quantizes the base model to 4-bit (NF4 format).
- Computes gradients on the FP16 adapters, backpropagated through the quantized base.
- Dequantizes weights on the fly during the forward pass.
Result: a 70B model fits in ~48GB of VRAM (a single 48GB GPU such as an A6000), or a 13B model in 24GB (RTX 4090).
Hardware guidance
For fine-tuning:
| Model | LoRA GPU | QLoRA GPU |
|---|---|---|
| Llama 3 8B | 1× A100 40GB | 1× RTX 4090 24GB |
| Llama 3 70B | 4× A100 80GB | 2× A100 80GB |
| Mistral 7B | 1× RTX 3090 24GB | 1× RTX 4070 16GB |
| Phi-3 Mini | Consumer GPU | Consumer GPU |
QLoRA makes fine-tuning accessible.
Setup with PEFT
Hugging Face PEFT (Parameter-Efficient Fine-Tuning):
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# QLoRA: quantize the frozen base model to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~0.24% of total
```
Then run the usual training loop with Trainer.
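A minimal sketch of that loop, assuming `model` is the PEFT-wrapped model from above and `tokenized_dataset` is a pre-tokenized dataset; output paths and hyperparameters are illustrative, not tuned recommendations:

```python
# Sketch only: assumes `model`, `tokenizer`, and `tokenized_dataset` already exist.
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="llama3-lora",          # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,                # LoRA typically tolerates higher LRs than full FT
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama3-lora-adapter")  # saves only the adapter weights
```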
Training time
Real-world examples:
- Llama 3 8B + 10k examples: ~2-4 hours on an A100.
- Llama 3 70B QLoRA + 10k examples: ~12-24 hours on 2× A100 80GB.
- Mistral 7B + 1k examples: ~30 min on an A100.
Significantly cheaper than a full fine-tune.
Data requirements
- High-quality examples: 100-1,000 are typically sufficient.
- Task-specific: SQL generation, support replies, classification.
- Format: prompt-completion pairs, chat format, or instruction-tuning format.
- Diversity: cover the edge cases.
Quality matters more than quantity: 100 perfect examples beat 10,000 mediocre ones.
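One common record layout for instruction tuning is Alpaca-style JSONL (one object per line). Field names vary by trainer, so check what your training script expects; the record below is a made-up example:

```python
# Write a tiny Alpaca-style instruction-tuning dataset as JSONL.
import json

examples = [
    {
        "instruction": "Translate the question into a SQL query.",
        "input": "How many orders were placed in March?",
        "output": "SELECT COUNT(*) FROM orders WHERE strftime('%m', created_at) = '03';",
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # one JSON object per line
```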
Use cases
Where LoRA/QLoRA helps:
- Domain adaptation: legal, medical, technical jargon.
- Task specialization: SQL gen, classification, extraction.
- Style/tone: specific voice.
- Language: adapt the model to a specific language.
- Compliance: remove unsafe behaviors.
Where prompt engineering suffices:
- Simple tasks where in-context examples work.
- A model that already performs well enough.
- Limited data (<50 examples).
Versus full fine-tuning
Quality:
- LoRA: 95-99% of full fine-tune quality in most cases.
- QLoRA: 93-98%; a slight quality loss from quantization.
Cost:
- LoRA: 10-100× cheaper than a full fine-tune.
- QLoRA: similar to LoRA, plus the quantization savings.
For most teams, LoRA or QLoRA suffices.
Serving fine-tuned models
Two deployment options:
- Merged model: combine base + adapter into a standard model file.
- Adapter-only: keep the adapter separate and swap it per request.
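Merging can be done with PEFT's `merge_and_unload()`; model and adapter paths below are illustrative:

```python
# Sketch: fold a trained LoRA adapter into the base weights.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "llama3-lora-adapter")  # hypothetical adapter path
merged = model.merge_and_unload()        # folds B @ A into the frozen weights
merged.save_pretrained("llama3-8b-merged")  # a standard checkpoint; no PEFT needed to load
```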
vLLM supports multi-LoRA serving:

```bash
vllm serve base-model \
  --enable-lora \
  --lora-modules legal-lora=/path/legal medical-lora=/path/medical
```

This serves multiple fine-tunes on the same infrastructure.
Popular platforms
Fine-tuning services:
- Hugging Face AutoTrain: UI-based.
- Together.ai: fine-tuning via API.
- Modal.com: serverless compute for custom training jobs.
- RunPod: GPU rental.
- AWS SageMaker: enterprise.
- Google Vertex AI: similar.
Self-hosting: axolotl, unsloth, trl.
DPO: preference-based fine-tuning
Direct Preference Optimization fine-tunes on preferences instead of examples:
- Dataset: (prompt, preferred_response, dispreferred_response) triples.
- The model learns which responses to prefer.
- Often works better than supervised fine-tuning alone.
Combine LoRA + DPO for alignment.
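A sketch of the DPO step using trl's `DPOTrainer`, assuming `model` is a PEFT-wrapped model, `tokenizer` its tokenizer, and `pref_dataset` a dataset with `prompt`/`chosen`/`rejected` columns; argument names follow recent trl versions and may differ in older releases:

```python
# Sketch only: assumes `model`, `tokenizer`, and `pref_dataset` already exist.
from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    output_dir="llama3-dpo",           # hypothetical output path
    beta=0.1,                          # strength of the KL penalty vs. the reference model
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(
    model=model,                       # with PEFT, the frozen base serves as the reference
    args=config,
    train_dataset=pref_dataset,        # (prompt, chosen, rejected) triples
    processing_class=tokenizer,
)
trainer.train()
```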
Evaluation
After fine-tuning:
- Held-out test set: quality metrics.
- Human evaluation: 50-100 samples.
- Benchmark comparison: against the base model.
- Production monitoring: drift detection.
Overfitting is common; evaluate carefully to catch it.
Limitations
- Catastrophic forgetting: fine-tuning can erode the base model's capabilities.
- Domain narrowness: the model overfits to the fine-tuning domain.
- Safety regression: alignment can degrade.
- Drift over time: the model goes stale as data and requirements change.
Careful evaluation is needed.
Cost example
Fine-tuning Mistral 7B with QLoRA on 1,000 examples:
- Compute: ~30 min on an A100 ($1-2 of rental time).
- Storage: ~500MB adapter file.
- Inference: same cost as the base model plus a small adapter overhead.
Total: ~$5-10 for a first fine-tune. Iteration is cheap.
Conclusion
LoRA and QLoRA have democratized fine-tuning. For teams without datacenter GPUs, QLoRA makes it possible. Where prompt engineering falls short, a task-specific fine-tune is now accessible. Start with prompt engineering; fine-tune when it's justified. The workflow: evaluate → identify gaps → fine-tune for those gaps → measure the improvement. The tooling (PEFT, unsloth, axolotl, TRL) is mature, so you can iterate quickly and cheaply. It's a skill set engineers should have in 2024 and beyond.
Follow us at jacar.es for more on fine-tuning, LoRA, and LLM optimization.