LoRA and QLoRA: efficient fine-tuning within reach of a single laptop


LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) have democratized fine-tuning of LLMs. Instead of retraining all parameters (expensive), LoRA adds small low-rank adapter matrices. QLoRA combines this with quantization to fit training on consumer GPUs. Fine-tune Llama 3 70B on a couple of A100s? Possible. Fine-tune an 8B model on an RTX 3090? With QLoRA, yes.

The problem

A traditional full fine-tune of Llama 3 8B in FP32 needs roughly:

  • Parameters: 8B × 4 bytes = 32GB for the model alone.
  • Gradients: another 32GB.
  • Optimizer state (Adam, two moments per parameter): 64GB+.
  • Total: 128GB+ of VRAM.

Out of reach for most teams.
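The arithmetic above, as a quick sanity check:

```python
# Back-of-the-envelope VRAM for a full FP32 fine-tune with Adam (illustrative).
params = 8e9                        # 8B parameters
bytes_per = 4                       # FP32
weights   = params * bytes_per      # 32 GB: the model itself
gradients = params * bytes_per      # 32 GB: one gradient per parameter
adam      = params * bytes_per * 2  # 64 GB: Adam's first and second moments

total_gb = (weights + gradients + adam) / 1e9
print(total_gb)  # 128.0
```

Activations and buffers add more on top, which is why the text says 128GB+.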

The LoRA solution

The idea: freeze the base model and add pairs of small matrices (A × B) that adapt specific layers:

  • Rank: low (e.g. 8, 16, or 32), so the adapters are tiny.
  • Training: only the adapter parameters (~1% of the total).
  • Memory: enormous savings, since the frozen base needs no gradients or optimizer state.

The fine-tuned model = base + adapter. Adapters are stored separately, so you can swap one adapter per task.
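The savings for a single weight matrix are easy to check (the sizes are illustrative, matching a 4096×4096 attention projection; `lora_params` is a made-up helper):

```python
# Sketch: parameter savings from a low-rank adapter on one weight matrix.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the adapter pair: A (d_in x r) plus B (r x d_out)."""
    return d_in * rank + rank * d_out

d = 4096                # hidden size of the projection
full = d * d            # parameters in the frozen full matrix
adapter = lora_params(d, d, rank=16)

print(full)                     # 16777216 frozen parameters
print(adapter)                  # 131072 trainable parameters
print(f"{adapter / full:.2%}")  # 0.78%
```

Sum this over every adapted layer and you land in the ~1%-of-total range the list above mentions.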

QLoRA goes further

QLoRA:

  1. Quantize the base model to 4-bit (NF4 format).
  2. Compute gradients for the adapter (in FP16/BF16) on top of the quantized base.
  3. Dequantize weights on the fly during the forward pass.

Result: a ~70B model fits in roughly 48GB (the QLoRA paper fine-tuned a 65B model on a single 48GB GPU), and a 13B model fits in 24GB (RTX 4090).
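The quantize/dequantize round trip can be sketched with plain absmax 4-bit quantization (the real NF4 format uses a nonuniform codebook and block-wise scales; this only shows the mechanics):

```python
# Sketch of 4-bit absmax quantization: store small integer codes plus one scale.

def quantize_4bit(xs):
    """Map floats to integer codes in [-7, 7] via absmax scaling."""
    scale = max(abs(x) for x in xs) / 7.0
    codes = [round(x / scale) for x in xs]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the stored codes and scale."""
    return [c * scale for c in codes]

weights = [0.10, -0.48, 0.33, 0.07]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(codes)    # [1, -7, 5, 1] -- each fits in 4 bits
print(max_err)  # bounded by scale / 2
```

The error per weight is bounded by half the quantization step, which is why a well-scaled 4-bit base plus FP16 adapters loses so little quality.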

Hardware guidance

For fine-tuning:

  Model         LoRA GPU                  QLoRA GPU
  Llama 3 8B    1× A100 40GB              1× RTX 4090 24GB
  Llama 3 70B   4× A100 80GB              2× A100 80GB
  Mistral 7B    1× 24GB GPU (RTX 3090)    1× RTX 4070 16GB
  Phi-3 Mini    Consumer GPU              Consumer GPU

QLoRA makes fine-tuning accessible.

Setup with PEFT

Hugging Face PEFT (Parameter-Efficient Fine-Tuning):

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# QLoRA: load the base model quantized to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # update is scaled by alpha / r
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable vs. total params -- typically well under 1% trainable

Then run the usual training loop with Trainer (or TRL's SFTTrainer).

Training time

Ballpark figures from real runs:

  • Llama 3 8B + 10k examples: ~2-4 hours on an A100.
  • Llama 3 70B QLoRA + 10k examples: ~12-24 hours on 2× A100 80GB.
  • Mistral 7B + 1k examples: ~30 min on an A100.

Significantly cheaper than a full fine-tune.

Data requirements

  • High-quality examples: 100-1,000 is typically sufficient.
  • Task-specific: SQL generation, support replies, classification.
  • Format: prompt-completion pairs, chat format, or instruction-tuning format.
  • Diversity: cover the edge cases.

Quality beats quantity: 100 perfect examples > 10,000 mediocre ones.
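As a sketch, the two common record shapes look like this (field names follow widespread conventions, but check the exact schema your trainer expects):

```python
import json

# Illustrative training records in the two common formats.
prompt_completion = {
    "prompt": "Translate to SQL: customers who ordered in 2024",
    "completion": "SELECT DISTINCT c.* FROM customers c "
                  "JOIN orders o ON o.customer_id = c.id WHERE o.year = 2024;",
}

chat_format = {
    "messages": [
        {"role": "system", "content": "You are a helpful SQL assistant."},
        {"role": "user", "content": "Customers who ordered in 2024?"},
        {"role": "assistant", "content": "SELECT DISTINCT c.* FROM customers ..."},
    ]
}

# One JSON object per line (JSONL) is the de facto interchange format.
jsonl = "\n".join(json.dumps(r) for r in [prompt_completion, chat_format])
print(len(jsonl.splitlines()))  # 2 records
```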

Use cases

Where LoRA/QLoRA helps:

  • Domain adaptation: legal, medical, or technical jargon.
  • Task specialization: SQL generation, classification, extraction.
  • Style/tone: a specific voice.
  • Language: adapting to a specific language.
  • Compliance: reinforcing safe, policy-conformant behavior.

Where prompt engineering suffices:

  • Simple tasks where in-context examples work.
  • A model that already performs acceptably.
  • Limited data (<50 examples).

Versus full fine-tuning

Quality:

  • LoRA: typically reported at 95-99% of full fine-tune quality.
  • QLoRA: 93-98%, with a slight quality loss from quantization.

Cost:

  • LoRA: 10-100× cheaper than a full fine-tune.
  • QLoRA: similar to LoRA, plus the memory savings from quantization.

For most teams, LoRA/QLoRA suffices.

Serving fine-tuned models

Deploy:

  • Merged model: combine base + adapter into a single standard model file.
  • Adapter-only: keep the adapter separate and swap it per request.
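Mathematically, merging just folds the scaled low-rank product into the base weight (PEFT exposes this as merge_and_unload()). A toy sketch with made-up 2×2 matrices:

```python
# Merging an adapter: W' = W + (alpha / r) * (B @ A). Tiny example, pure Python.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[0.5, 0.0]]               # adapter A: r x d_in, with r = 1
B = [[2.0], [0.0]]             # adapter B: d_out x r
alpha, r = 2, 1

delta = matmul(B, A)           # rank-1 update, d_out x d_in
W_merged = [[w + (alpha / r) * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]
print(W_merged)  # [[3.0, 0.0], [0.0, 1.0]]
```

After merging, inference needs no adapter machinery at all; the cost is losing the ability to hot-swap adapters.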

vLLM supports multi-LoRA serving:

vllm serve base-model \
  --enable-lora \
  --lora-modules legal-lora=/path/legal medical-lora=/path/medical

This lets you serve multiple fine-tunes on the same infrastructure.

Fine-tuning services:

  • Hugging Face AutoTrain: UI-based.
  • Together.ai: fine-tuning via API.
  • Modal.com: serverless GPU compute.
  • RunPod: GPU rental.
  • AWS SageMaker: enterprise.
  • Google Vertex AI: similar.

Self-hosting: axolotl, unsloth, trl.

DPO: preference-based fine-tuning

Direct Preference Optimization fine-tunes on preferences instead of demonstrations:

  • Dataset: (prompt, preferred_response, dispreferred_response) triples.
  • The model learns which responses to prefer.
  • Often better than supervised fine-tuning alone.

Combine LoRA + DPO for alignment work.
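The DPO loss itself is compact; a standalone sketch of it for one preference pair (TRL's DPOTrainer implements the real thing; the log-probabilities below are invented):

```python
import math

# DPO loss for one pair: -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))),
# where pi_* / ref_* are summed log-probs of each response under the policy
# and the frozen reference model; beta limits drift from the reference.

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Policy already prefers the chosen response more than the reference does:
low = dpo_loss(pi_chosen=-5.0, pi_rejected=-9.0, ref_chosen=-6.0, ref_rejected=-7.0)
# Policy prefers the rejected response: higher loss.
high = dpo_loss(pi_chosen=-9.0, pi_rejected=-5.0, ref_chosen=-6.0, ref_rejected=-7.0)
print(low < high)  # True
```

Because only relative log-probabilities enter the loss, DPO trains cleanly on top of LoRA adapters with the quantized base frozen.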

Evaluation

After fine-tuning:

  • Held-out test set: quality metrics.
  • Human evaluation: 50-100 samples.
  • Benchmark comparison: vs base model.
  • Production monitoring: drift detection.

Overfitting is common; evaluate carefully to catch it.
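A minimal held-out comparison can be as simple as exact match (the data below is invented; real tasks need task-appropriate metrics):

```python
# Sketch: compare base vs fine-tuned outputs on a held-out set with exact match.

def exact_match(preds, refs):
    """Fraction of predictions that match the reference exactly (whitespace-trimmed)."""
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)

references = ["SELECT 1;", "SELECT 2;", "SELECT 3;", "SELECT 4;"]
base_preds = ["SELECT 1;", "SELECT 9;", "SELECT 3",  "SELECT x;"]
ft_preds   = ["SELECT 1;", "SELECT 2;", "SELECT 3;", "SELECT x;"]

print(exact_match(base_preds, references))  # 0.25
print(exact_match(ft_preds, references))    # 0.75
```

Run the same comparison on prompts outside the fine-tuning domain too, to catch the regressions listed below.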

Limitations

  • Catastrophic forgetting: the fine-tuned model can lose base-model capabilities.
  • Domain narrowness: the model overfits to its niche.
  • Safety regression: alignment can degrade.
  • Drift over time: the model goes stale as the domain evolves.

Careful evaluation is needed.

Cost example

Fine-tuning Mistral 7B with QLoRA on 1,000 examples:

  • Compute: ~30 min on an A100 ($1-2 rental).
  • Storage: ~500MB adapter file.
  • Inference: same as the base model plus a small adapter overhead.

Total: ~$5-10 for a first fine-tune. Iteration is cheap.

Conclusion

LoRA and QLoRA have democratized fine-tuning. For teams without datacenter GPUs, QLoRA makes it possible. For cases where prompt engineering is not enough, a task-specific fine-tune is now accessible. Prompt engineering first, fine-tune when justified. Workflow: evaluation → identify gaps → targeted fine-tune → measure the improvement. The tools (PEFT, unsloth, axolotl, TRL) are mature, so you can iterate quickly and cheaply. It is a skill set engineers should have in 2024 and beyond.

Follow us at jacar.es for more on fine-tuning, LoRA, and LLM optimization.
