
LoRA and QLoRA: Efficient Fine-Tuning on a Single Laptop

Updated: 2026-05-03

LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) democratised fine-tuning of large language models by solving the central problem that made it prohibitive: GPU memory. The idea behind LoRA is elegant in its simplicity — instead of updating all model parameters during training, low-rank adaptation matrices are added that learn the delta between the base model’s behaviour and the desired behaviour. QLoRA takes that idea further by combining quantization of the base model with LoRA adapters, making fine-tuning of 70B models viable on consumer hardware.

Key takeaways

  • LoRA adds low-rank matrices (typically rank 4-64) to attention layers; the base model stays frozen.
  • QLoRA quantizes the base model to 4 bits (NF4) and keeps only the LoRA adapters in full precision, reducing required memory by 65-75%.
  • A Llama 3 8B requires 128 GB of VRAM for full fine-tuning; with QLoRA it fits on a 24 GB RTX 3090.
  • Quality loss vs full fine-tuning is 1-3% on standard benchmarks — acceptable for most domain-specific use cases.
  • The decision between LoRA/QLoRA and prompt engineering depends on available data volume and domain specificity.

The problem they solve

Traditional fine-tuning of a Llama 3 8B model requires simultaneously storing:

  • Model parameters: 8B × 4 bytes = 32 GB just for the weights.
  • Training gradients: another 32 GB.
  • Adam optimizer state: 64 GB additional (two moments per parameter).
  • Total: 128 GB of VRAM — equivalent to four 80 GB A100s.

At 70B scale, the cost multiplies by 8.75x and exceeds what any reasonable experimentation budget can buy.

LoRA solves the problem with a key empirical observation: the changes fine-tuning induces in model weights have intrinsically low rank. You do not need to update a dense matrix of millions of parameters; you can approximate that change with the product of two much smaller matrices:

ΔW ≈ A × B
where A ∈ ℝ^(d×r) and B ∈ ℝ^(r×k), with r ≪ d and r ≪ k

If rank r is 8 and the original dimension is 4096, the number of trainable parameters drops from 4096² ≈ 16.8M to 2 × 4096 × 8 = 65,536. That is a ~256x reduction in trainable parameters, implying a proportional reduction in gradients and optimizer state.
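
A few lines of Python (an illustrative back-of-the-envelope, not part of any library API) make the reduction concrete:

python
# Back-of-the-envelope: dense weight update vs. LoRA factorization
d, k = 4096, 4096   # dimensions of the original weight matrix
r = 8               # LoRA rank

dense_params = d * k          # parameters in the full delta matrix ΔW
lora_params = r * (d + k)     # parameters in A (d×r) plus B (r×k)

print(f"dense: {dense_params:,}")                    # 16,777,216
print(f"lora:  {lora_params:,}")                     # 65,536
print(f"ratio: {dense_params / lora_params:.0f}x")   # 256x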

QLoRA: quantization plus adapters

QLoRA adds one more step: it quantizes the base model to NF4 (4-bit NormalFloat), a quantization format that preserves the distribution of LLM weights better than standard integer quantization. On top of that quantized model, the LoRA adapters are trained in BFloat16.

The combination enables:

  • Fine-tuning a 70B model on four 24 GB GPUs (full fine-tuning would need on the order of fourteen 80 GB A100s).
  • Fine-tuning a 7B model on a single 8 GB GPU.
  • Fine-tuning a 13B model on a laptop with a 16 GB RTX 3080.

The quality loss of NF4 vs FP16 is well documented: 1-3% on reasoning and coding benchmarks — acceptable for most domain-specific use cases.

Workflow with PEFT and Hugging Face

The Hugging Face ecosystem makes the workflow surprisingly straightforward. The PEFT (Parameter-Efficient Fine-Tuning) library manages LoRA adapters; transformers loads the base model; trl provides SFTTrainer for supervised fine-tuning:

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# Load quantized model (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,   # QLoRA's double quantization of the quantization constants
    bnb_4bit_compute_dtype="bfloat16",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (casts layer norms to float32
# and enables gradient checkpointing compatibility)
model = prepare_model_for_kbit_training(model)

# Configure LoRA adapters
lora_config = LoraConfig(
    r=16,                     # matrix rank
    lora_alpha=32,            # scaling factor
    target_modules=["q_proj", "v_proj"],  # layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 5,242,880 || all params: 8,035,885,056 || trainable%: 0.0653

The result: fewer than 0.07% of parameters are trainable, with a proportional memory reduction in gradients and optimizer state.
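
The snippet above stops before the training loop itself. A minimal sketch with trl's SFTTrainer closes the gap; the dataset and hyperparameters below are illustrative, and argument names have shifted between trl releases, so check them against your installed version:

python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any instruction dataset with a "text" column works; this one is a common demo choice
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,                         # the PEFT-wrapped model from above
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./llama3-8b-qlora",
        dataset_text_field="text",
        max_seq_length=1024,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,   # effective batch size of 16
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
)

trainer.train()
trainer.save_model("./llama3-8b-qlora")  # writes only the adapter, not the 8B base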

Figure: full Transformer architecture with encoder and decoder blocks, multi-head attention layers, and feed-forward layers — the architecture into which LoRA inserts low-rank adaptation matrices during fine-tuning.

Choosing hyperparameters

The LoRA hyperparameters that most impact results:

  • Rank (r): low values (4-8) for simple style or format tasks; medium values (16-32) for domain adaptation; high values (64-128) for deeper behaviour changes. Higher rank means more trainable parameters and greater capacity, but also greater overfitting risk.
  • lora_alpha: typically 2× the rank. The adapter output is scaled by lora_alpha / r, so alpha and rank should be tuned together.
  • target_modules: query and value projection layers (q_proj, v_proj) are the minimum; adding key (k_proj) and output (o_proj) projections improves quality at the cost of more parameters.
  • Learning rate: higher than for full fine-tuning, typically in the 1e-4 to 3e-4 range (full fine-tuning usually sits around 1e-5 to 5e-5). The configuration sketch below combines these guidelines.
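
As a concrete illustration (values chosen to match the guidance above, not prescriptive defaults), a medium-capacity configuration for domain adaptation might look like this:

python
from peft import LoraConfig

# Domain adaptation: medium rank, alpha = 2× rank, all four attention projections
domain_config = LoraConfig(
    r=32,
    lora_alpha=64,            # effective adapter scale: 64 / 32 = 2.0
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)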

For models you will later serve with vLLM in production, ensure the LoRA adapter format is compatible with the vLLM version you use; Multi-LoRA support lets you serve several adapters over the same base model.
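
As a sketch of what that looks like (the adapter name and path are placeholders, and the exact API surface varies across vLLM versions):

python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora reserves capacity for serving adapters alongside the base weights
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)
sampling = SamplingParams(temperature=0.2, max_tokens=256)

# Each request can target a different adapter over the same base model
outputs = llm.generate(
    ["Summarise this clinical note: ..."],
    sampling,
    lora_request=LoRARequest("medical-adapter", 1, "./llama3-8b-qlora"),
)
print(outputs[0].outputs[0].text)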

LoRA/QLoRA vs prompt engineering

The decision between fine-tuning and prompt engineering depends on three factors:

When fine-tuning pays:

  • You have more than 500-1,000 high-quality training examples.
  • The domain has specific vocabulary, format, or reasoning the base model does not know.
  • The task requires very high format consistency (prompt engineering can give unwanted variability).
  • Token cost in production is significant; a fine-tuned model needs shorter prompts.

When prompt engineering is sufficient:

  • Training data is scarce or hard to obtain.
  • The task is reasonably general and the base model handles it well with instructions.
  • You need to iterate quickly without training cycles.
  • Base model hallucination is acceptable for the use case.

The most common mistake is fine-tuning on insufficient data (<200 examples) and expecting large improvements. With little data, the model learns the format but loses generalisation. LoRA/QLoRA remove most of the technical friction of experimenting, but they do not eliminate the need for quality data.

Deploying LoRA adapters

Once trained, the LoRA adapter is saved as a file independent of the base model — typically 10-200 MB vs the several GBs of the full model. Deployment options:

  • Merge: merge_and_unload() merges the adapter weights into the base model, generating a standard model any runtime can serve (see the sketch after this list).
  • Multi-LoRA with vLLM: serve several adapters over the same base model, switching per request. Very efficient when you have several specialised fine-tunes.
  • Separate adapter with transformers: load the base model plus adapter dynamically; useful for experimentation.
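
A minimal merge-and-save sketch with peft (the paths are placeholders for wherever you stored the adapter):

python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Merge into a full-precision base model; merging into a 4-bit model loses quality
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "./llama3-8b-qlora")  # trained adapter

merged = model.merge_and_unload()             # folds the A×B deltas into the base weights
merged.save_pretrained("./llama3-8b-merged")  # a standard checkpoint any runtime can load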

For production observability, the LoRA-specific challenge is knowing which adapter served each inference: most observability tools do not capture this by default, so it requires explicit instrumentation.
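
One trivial way to add that instrumentation, sketched with a hypothetical wrapper (the function and its logging fields are inventions for illustration, not any tool's API):

python
import logging
import time

logger = logging.getLogger("inference")

def generate_with_adapter(llm, prompt, sampling, lora_request=None):
    """Wrap generation so every call records which adapter produced the output."""
    start = time.perf_counter()
    outputs = llm.generate([prompt], sampling, lora_request=lora_request)
    logger.info(
        "adapter=%s latency_ms=%.1f",
        lora_request.lora_name if lora_request else "base",
        (time.perf_counter() - start) * 1000,
    )
    return outputs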

Conclusion

LoRA and QLoRA eliminated the hardware barrier that made fine-tuning prohibitive for most teams. With QLoRA, a 7B model fits on an 8 GB GPU, and quality loss vs full fine-tuning is marginal in practice. The real bottleneck is no longer hardware or framework — it is quality training data and rigorous evaluation of whether fine-tuning actually improves what matters for your use case.


Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.