In 2023 the question “should we fine-tune our own LLM?” comes up in architecture discussions almost monthly. The short answer, almost always, is: not yet. The long answer is that legitimate cases exist, that costs have come down but remain considerable, and that alternatives like RAG or prompt engineering cover 80% of needs without the operational overhead of training.
The Three Customisation Levels
To frame the problem, there are three layers of LLM customisation, from lowest to highest cost:
- Prompt engineering: tune instructions, few-shot examples, chain-of-thought. Marginal cost, iteration in minutes. Covers the vast majority of well-defined tasks.
- Retrieval-Augmented Generation (RAG): retrieve relevant chunks from a knowledge base and pass them into the model’s context. Medium cost (embeddings + vector store), iteration in days.
- Fine-tuning: modify model weights with your own examples. High cost (data, GPUs, validation), iteration in weeks.
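To make the two cheap levels concrete, here is a minimal sketch of a prompt builder that combines few-shot examples (level 1) with retrieved context chunks (level 2). The template, the example tickets, and the function name are illustrative, not from any particular library:

```python
# Sketch of the two cheap customisation levels: few-shot prompting plus
# RAG-style context injection. Same base model, no weights touched.

FEW_SHOT = [
    ("Ticket: card blocked abroad", "category: payments"),
    ("Ticket: can't reset password", "category: auth"),
]

def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble instructions + retrieved context + few-shot examples."""
    examples = "\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT)
    context = "\n".join(f"- {c}" for c in retrieved_chunks)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Examples:\n{examples}\n\n"
        f"Q: {question}\nA:"
    )

prompt = build_prompt("Ticket: refund not received",
                      ["Refunds take 5-7 days."])
print(prompt.splitlines()[0])  # → "Answer using only the context below."
```

Iterating on this costs minutes: change the template, rerun, compare. That iteration speed is exactly what fine-tuning gives up.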
Jumping straight to fine-tuning is the most common mistake. Most teams that try it could have achieved equivalent or better results with a well-designed RAG pipeline.
When Fine-Tuning Genuinely Makes Sense
Three cases where fine-tuning justifies its cost:
- Very specific style/voice. If you need the model to respond with an exact brand personality — idioms, grammar, a tone you can’t capture in a long system prompt — fine-tuning internalises it.
- Very structured output format. Models fine-tuned to always return a specific JSON shape, or to follow a proprietary markup schema, are more reliable than prompted ones — the format ends up “baked into” the model.
- Cost and latency reduction with small models. A 7B-parameter model fine-tuned on your domain can match or beat GPT-3.5 for that specific task, at 10-20% of the cost per token and with lower latency.
Outside these cases, RAG usually wins.
LoRA and QLoRA: Accessible Fine-Tuning
The big 2022-2023 shift is that fine-tuning went from “you need 8 A100s” to “you can do it on an RTX 4090”. The key technique is LoRA (Low-Rank Adaptation): instead of training all weights, you add low-rank matrices over the frozen model. The result is practically identical to full fine-tuning at 1% of the GPU cost.
QLoRA, published in May 2023, combines LoRA with 4-bit quantisation. It lets you fine-tune 65-billion-parameter models on a single GPU with 48 GB of VRAM. Six months ago this was unthinkable.
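The back-of-the-envelope maths behind that headline is straightforward — quantising the frozen base weights from 16 bits to 4 bits cuts their memory footprint by 4x (the figures below cover weights only; activations, the KV cache, and the LoRA adapter optimiser state add overhead on top):

```python
# VRAM arithmetic behind QLoRA's single-GPU claim for a 65B model.
params = 65e9
bytes_per_param_fp16 = 2     # 16-bit baseline
bytes_per_param_4bit = 0.5   # 4-bit quantisation

fp16_gb = params * bytes_per_param_fp16 / 1e9  # 130 GB: multi-GPU territory
q4_gb = params * bytes_per_param_4bit / 1e9    # 32.5 GB: fits in 48 GB VRAM

print(f"fp16 weights: {fp16_gb:.1f} GB, 4-bit weights: {q4_gb:.1f} GB")
```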
Libraries like PEFT from Hugging Face and axolotl wrap these methods with declarative config. A LoRA pipeline over Llama 2 7B fits in a 30-line YAML.
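The mechanics behind LoRA are simple enough to sketch in a few lines. This is the idea, not the PEFT API: the pretrained weight matrix stays frozen, and only two small low-rank matrices are trained; the shapes and rank below are illustrative:

```python
import numpy as np

# LoRA in miniature: W stays frozen; we train only A (r x d) and B (d x r)
# and apply W_eff = W + (alpha / r) * (B @ A).
d, r, alpha = 4096, 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable
B = np.zeros((d, r))                     # trainable, zero-initialised

W_eff = W + (alpha / r) * (B @ A)        # at init, W_eff == W exactly

trainable = A.size + B.size
full = W.size
print(f"trainable: {trainable} vs full: {full} "
      f"({100 * trainable / full:.2f}%)")  # → ~0.39% of this layer's params
```

Initialising B to zero means the adapted model starts out behaving exactly like the base model, and training only nudges it away from there — which is why the trainable-parameter count, and with it the GPU cost, drops by roughly two orders of magnitude.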
Where the Real Cost Is
The real cost of fine-tuning isn’t GPUs — it’s everything else:
- Preparing the dataset. Between 500 and 5000 quality examples (prompt + ideal response) require substantial manual investment. Poorly designed examples poison the model with biases and failures.
- Iteration and evaluation. A bad fine-tune can look good on the happy path and fail catastrophically on edge cases. You need automated evals before and after.
- Production operation. Your own model means managing inference, updates, drift monitoring. This isn’t just “calling an API” anymore.
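A before/after eval harness doesn’t need to be fancy to catch the failure mode described above. A hedged sketch, where `model` is any prompt-to-answer callable and the toy eval set and slice names are made up for illustration:

```python
# Minimal eval harness: run the same frozen eval set through two models and
# compare exact-match accuracy per slice, including a dedicated edge slice.

EVAL_SET = [
    {"prompt": "2+2?", "expected": "4", "slice": "happy"},
    {"prompt": "", "expected": "(empty input)", "slice": "edge"},
]

def accuracy(model, cases):
    hits = sum(model(c["prompt"]).strip() == c["expected"] for c in cases)
    return hits / len(cases)

def eval_report(model):
    return {
        s: accuracy(model, [c for c in EVAL_SET if c["slice"] == s])
        for s in ("happy", "edge")
    }

# Toy stand-in models for illustration
baseline = lambda p: "4" if p == "2+2?" else "(empty input)"
fine_tuned = lambda p: "4"  # looks fine on the happy path, fails on edges

print(eval_report(baseline))    # → {'happy': 1.0, 'edge': 1.0}
print(eval_report(fine_tuned))  # → {'happy': 1.0, 'edge': 0.0}
```

The point is the per-slice breakdown: aggregate accuracy would hide exactly the edge-case regressions a bad fine-tune introduces.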
Realistic budget for a first serious fine-tune: 2-3 engineering weeks + 1-5k USD in GPU + a basic MLOps pipeline for evaluation.
Alternatives Before Deciding
Before fine-tuning, exhaust these options:
- RAG over your domain. With pgvector or Pinecone plus good reranking, you cover “the model needs to know company-specific data” without training anything.
- Longer prompts with careful examples. GPT-4 with 16 well-chosen few-shot examples often beats a fine-tuned 7B model.
- Function calling with structured response. If you’re after structured output, as we covered in “prompt engineering as a mature discipline”, function calling solves most cases without training.
- Existing specialised models. For common tasks (code, medical, legal) the community already has fine-tuned models: CodeLlama, Med-PaLM, and others.
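The retrieval step at the heart of the first option is conceptually tiny. A minimal sketch with cosine similarity over precomputed embeddings — in production this would be pgvector or Pinecone plus a reranker, and the 3-dimensional vectors below are toy stand-ins for real embedding-model output:

```python
import numpy as np

# Toy RAG retrieval: rank documents by cosine similarity to the query vector.
DOCS = ["Refunds take 5-7 business days.",
        "Password resets expire after 1 hour.",
        "Support is available 24/7."]
DOC_VECS = np.array([[1.0, 0.1, 0.0],
                     [0.0, 1.0, 0.2],
                     [0.1, 0.0, 1.0]])

def top_k(query_vec, k=1):
    q = query_vec / np.linalg.norm(query_vec)
    d = DOC_VECS / np.linalg.norm(DOC_VECS, axis=1, keepdims=True)
    scores = d @ q                        # cosine similarity per document
    idx = np.argsort(scores)[::-1][:k]    # highest-scoring docs first
    return [DOCS[i] for i in idx]

# A query embedded near the refunds document
print(top_k(np.array([0.9, 0.2, 0.1])))  # → ['Refunds take 5-7 business days.']
```

The retrieved chunks then go straight into the prompt — no weights change, and updating the knowledge base is an insert, not a training run.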
See also our vector database comparison — the foundation of the RAG pipeline that almost always solves the specific-knowledge case.
Conclusion
Fine-tuning has become technically democratised thanks to LoRA and QLoRA, but operationally it’s still a serious investment. For the vast majority of teams in 2023, starting with prompt engineering + RAG is the right path; fine-tuning is reserved for problems where the other two have clearly hit a ceiling.
Follow us on jacar.es for more on MLOps, production LLMs, and AI strategy.