LoRA and QLoRA: Efficient Fine-Tuning on a Single Laptop

Circuit board with multicoloured electrical connections, representing an adaptive architecture

LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) democratised LLM fine-tuning. Instead of retraining all parameters (expensive), LoRA freezes the base model and trains small low-rank adapter matrices. QLoRA adds 4-bit quantisation of the base model to fit on consumer GPUs. Fine-tune Llama 3 70B on a couple of A100s? Possible. A 7B model on an RTX 3090? With QLoRA, yes.

The Problem

Traditional fine-tune (Llama 3 8B):

  • Parameters: 8B × 4 bytes (FP32) = 32GB for the weights alone.
  • Gradients: another 32GB.
  • Optimiser state (Adam keeps two moments per parameter): 64GB.
  • Total: 128GB+ of VRAM, before activations.

Out of reach for most.
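The arithmetic above can be sketched as a back-of-the-envelope estimator (a simplification: real usage also depends on dtype, activations, and batch size):

```python
# Rough full fine-tuning memory estimate (FP32 weights, Adam optimiser),
# matching the arithmetic above. Activations are extra.
def full_finetune_vram_gb(n_params: float, bytes_per_param: int = 4) -> float:
    weights = n_params * bytes_per_param        # model weights
    gradients = n_params * bytes_per_param      # one gradient per weight
    optimizer = n_params * bytes_per_param * 2  # Adam: first + second moments
    return (weights + gradients + optimizer) / 1e9

print(full_finetune_vram_gb(8e9))  # 8B params -> 128.0 GB
```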

LoRA Solution

Idea: freeze the base model. For chosen layers, train a pair of small matrices whose product (B × A) is added to the frozen weights:

  • Rank: low (e.g. 8, 16, 32), so the matrices are tiny.
  • Training: only the adapter parameters (~1% of the total, often less).
  • Memory: huge savings, since the frozen base needs no gradients or optimiser state.

Fine-tuned model = base + adapter. Stored separately; swap adapters per task.
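A minimal sketch of the LoRA forward pass, with toy dimensions (d=8, r=2) chosen here for illustration. B is zero-initialised, so the adapter starts as a no-op:

```python
import random

def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

d, r = 8, 2  # model dim and LoRA rank (toy sizes)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, d -> r
B = [[0.0] * r for _ in range(d)]  # trainable, r -> d; zero-init so adapter starts as no-op
alpha = 16                         # scaling factor, applied as alpha / r

x = [random.gauss(0, 1) for _ in range(d)]
# LoRA forward: h = W x + (alpha / r) * B (A x)
h = [w + (alpha / r) * ba for w, ba in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Trainable params: 2*r*d instead of d*d; the savings grow with d
print(2 * r * d, "vs", d * d)
```

At real model scale (d in the thousands, r of 8-32) the adapter is a fraction of a percent of the layer's parameters.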

QLoRA Goes Further

QLoRA:

  1. Quantise the base model to 4-bit (NF4 format).
  2. Dequantise blocks on-the-fly during the forward and backward passes.
  3. Compute gradients only for the adapter weights, which stay in 16-bit.

Result: fit 70B model in 48GB (2× A100 40GB), or 13B in 24GB (RTX 4090).
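To show the mechanism, here is a deliberately simplified 4-bit round-trip using symmetric absmax quantisation (real NF4 instead uses 16 fixed levels placed at quantiles of a normal distribution, per weight block):

```python
def quantize_4bit(weights):
    """Simplified symmetric 4-bit absmax quantisation.
    NF4 differs: it uses 16 levels at normal-distribution quantiles."""
    scale = max(abs(w) for w in weights) / 7  # signed 4-bit range: -7..7
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_4bit(q, scale):
    """On-the-fly dequantisation, as done during the forward pass."""
    return [qi * scale for qi in q]

w = [0.42, -1.3, 0.07, 0.9]
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
# Round-trip error is bounded by half a quantisation step (scale / 2)
print(max(abs(a - b) for a, b in zip(w, w_hat)))
```

The base model pays this small approximation error once; the adapters, trained in 16-bit on top, can partly compensate for it.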

Hardware Guidance

For fine-tuning:

Model         LoRA GPU          QLoRA GPU
Llama 3 8B    1× A100 40GB      1× RTX 4090 24GB
Llama 3 70B   4× A100 80GB      2× A100 80GB
Mistral 7B    1× 24GB GPU       1× 16GB GPU
Phi-3 Mini    Consumer GPU      Consumer GPU

QLoRA makes fine-tuning accessible.

Setup with PEFT

Hugging Face PEFT (Parameter-Efficient Fine-Tuning):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # QLoRA: 4-bit NF4 base model
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor (applied as alpha / r)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Trainable: ~0.24% of total params

Then run the usual training loop with Trainer (or TRL's SFTTrainer).

Training Time

Real examples:

  • Llama 3 8B + 10k examples: ~2-4 hours on one A100.
  • Llama 3 70B QLoRA + 10k examples: ~12-24 hours on 2× A100 80GB.
  • Mistral 7B + 1k examples: ~30 min on one A100.

Significantly cheaper than a full fine-tune.

Data Requirements

  • High-quality examples: 100-1000 typically sufficient.
  • Task-specific: SQL generation, support replies, classification.
  • Format: prompt-completion pairs, chat format, instruction tuning format.
  • Diversity: cover edge cases.

Quantity matters less than quality: 100 excellent examples beat 10,000 mediocre ones.
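The prompt-completion and chat formats above are typically stored as JSONL, one example per line. A minimal formatter (field names here follow the common "messages" schema; check your trainer's expected shape):

```python
import json

def to_chat_record(instruction: str, response: str) -> str:
    """Format one training example in the common chat-messages JSONL shape.
    Field names vary between tools; this follows the widely used schema."""
    record = {"messages": [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": response},
    ]}
    return json.dumps(record, ensure_ascii=False)

line = to_chat_record("Translate to SQL: count users", "SELECT COUNT(*) FROM users;")
print(line)
```

Writing one such line per example gives a dataset most fine-tuning stacks can load directly.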

Use Cases

Where LoRA/QLoRA helps:

  • Domain adaptation: legal, medical, technical jargon.
  • Task specialisation: SQL gen, classification, extraction.
  • Style/tone: specific voice.
  • Language: adapt to specific language.
  • Compliance: remove unsafe behaviours.

Where prompt engineering suffices:

  • Simple tasks where in-context examples work.
  • The base model already performs acceptably.
  • Limited data (<50 examples).

Vs Full Fine-Tune

Quality:

  • LoRA: 95-99% of full fine-tune quality in most cases.
  • QLoRA: 93-98% — slight quality loss from quantisation.

Cost:

  • LoRA: 10-100x cheaper than full.
  • QLoRA: similar to LoRA + quantisation savings.

For most, LoRA/QLoRA suffices.

Serving Fine-Tuned Models

Deploy:

  • Merged model: combine base + adapter (PEFT's merge_and_unload()) → one standard model file.
  • Adapter-only: keep the adapter separate and swap it per request.

vLLM supports multi-LoRA:

vllm serve base-model \
  --enable-lora \
  --lora-modules legal-lora=/path/legal \
                 medical-lora=/path/medical

Serve multiple fine-tunes on same infrastructure.

Fine-tuning services:

  • Hugging Face AutoTrain: UI-based.
  • Together.ai: fine-tuning via API.
  • Modal.com: serverless GPU compute.
  • RunPod: GPU rental.
  • AWS SageMaker: enterprise.
  • Google Vertex AI: enterprise, on Google Cloud.

Self-hosting: axolotl, unsloth, trl.

DPO: Preference-Based

Direct Preference Optimisation — fine-tune with preferences instead of examples:

  • Dataset: (prompt, preferred_response, dispreferred_response).
  • Model learns what to prefer.
  • Often better than supervised fine-tuning alone.

Combine LoRA + DPO for alignment.

Evaluation

After fine-tune:

  • Held-out test set: quality metrics.
  • Human evaluation: 50-100 samples.
  • Benchmark comparison: vs base model.
  • Production monitoring: drift detection.

Overfitting is a common failure mode; watch the held-out metrics.
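For tasks with a single correct answer (SQL generation, classification), the held-out comparison can start as simply as an exact-match rate over the same test set for both models. A sketch with invented predictions:

```python
def exact_match_rate(predictions, references):
    """Share of held-out examples answered exactly right: a crude but
    useful first metric for SQL generation or classification."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical outputs from the base and fine-tuned models on one test set
base_preds = ["SELECT * FROM users", "positive", "negative"]
ft_preds   = ["SELECT COUNT(*) FROM users;", "positive", "negative"]
refs       = ["SELECT COUNT(*) FROM users;", "positive", "neutral"]

print(exact_match_rate(base_preds, refs), exact_match_rate(ft_preds, refs))
```

For open-ended generation, replace exact match with human review or an LLM-as-judge rubric.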

Limitations

  • Catastrophic forgetting: the fine-tuned model can lose base capabilities.
  • Domain narrowness: overfits to the training distribution.
  • Safety regression: alignment can degrade.
  • Drift over time: the model goes stale as data and requirements change.

Careful evaluation needed.

Cost Example

Fine-tune Mistral 7B QLoRA + 1000 examples:

  • Compute: ~30 min on A100 ($1-2 rental).
  • Storage: ~500MB adapter file.
  • Inference: same as base + small adapter overhead.

Total: ~$5-10 for first fine-tune. Cheap iteration.

Conclusion

LoRA and QLoRA democratised fine-tuning. For teams without datacentre GPUs, QLoRA makes it possible. For cases where prompt engineering doesn't suffice, a targeted fine-tune is now accessible. Prompt engineering first, fine-tune when justified. Workflow: evaluate → identify gaps → targeted fine-tune → measure improvement. The tooling (PEFT, unsloth, axolotl, TRL) is mature, so iteration is fast and cheap. A skill set engineers should have in 2024 and beyond.

Follow us on jacar.es for more on fine-tuning, LoRA, and LLM optimisation.
