Open-weight models in enterprise: one year on
Updated: 2026-05-03
A little over a year ago, when many of us started putting open-weight models into real enterprise environments, the conversation was cautious. We had a decent Llama 2, a brilliant-for-its-size Mistral 7B, and a handful of alternatives that trailed the GPT-4 of the time. A year on, the situation has changed enough to deserve an honest stocktaking.
Key takeaways
- Parity with closed models is per task, not global: classification, extraction, and well-scoped summarization are hard to tell apart blind; complex reasoning and long code still give closed APIs an edge.
- The real production cost is serving with acceptable latencies under reasonable concurrency; training graphs are misleading.
- Three patterns have settled: a multi-model router, RAG with open embeddings, and LoRA fine-tuning reserved for narrow, well-bounded domains.
- Three recurring failures: underestimating operational cost, defaulting to the biggest model, and locking onto a specific model without abstractions.
- Open weights clearly pay off for high sustained volume, regulated data, and edge/offline deployments. Everywhere else, the API remains the rational choice on total operational cost.
The quality jump was real, but uneven
The distance to closed models has clearly shrunk:
- Llama 3 and 3.1 405B: showed that openly released weights could reach GPT-4 Turbo levels on standard evaluations.
- DeepSeek V3 and R1: competitive performance at notably lower training budgets; R1 with explicit chain-of-thought reasoning.
- Qwen 2.5: the strongest option for Asian languages and the most predictable for code.
- Mistral Large 2: European alternative with flexible commercial licensing.
The important nuance is that parity is per task, not global. For classification, extraction, and well-scoped summarization, the difference between a well-served Llama 3.1 70B and a closed API is hard to perceive in a blind test. For most real enterprise work, open weights are already enough. For the top tier (complex autonomous agents, high-quality large-context code generation), it still pays to compare.
What they really cost to serve
The real production cost is serving with acceptable latencies under reasonable concurrency. A Llama 3.1 70B quantized to 4 bits fits on one 80 GB H100 and gives good latencies for one user at a time. Scaling to tens of concurrent requests is the hard part.
Systems that have worked: vLLM and TGI on NVIDIA hardware, and SGLang when aggressive sequence parallelism was needed. Ollama is excellent for local development but not a serious production tool.
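For reference, here is a minimal sketch of offline serving with vLLM's Python API, assuming a 4-bit AWQ checkpoint of Llama 3.1 70B. The model ID below is illustrative; substitute whatever quantized checkpoint you actually deploy.

```python
# Minimal vLLM serving sketch: one quantized Llama 3.1 70B on a single GPU.
# The model ID and quantization method are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # hypothetical 4-bit checkpoint
    quantization="awq",           # 4-bit AWQ fits on a single 80 GB H100
    tensor_parallel_size=1,       # one GPU; raise this to shard across several
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache under concurrency
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this contract clause: ..."], params)
print(outputs[0].outputs[0].text)
```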
Cost versus a closed API depends heavily on volume. At tens of millions of tokens per month, renting A100 or H100 GPUs and serving with vLLM comes in at roughly half the OpenAI or Anthropic price, though with non-trivial operational expenses. At low volumes, the API wins simply because it carries no human operational overhead.
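The underlying arithmetic is a fixed-versus-variable cost comparison, sketched below. Every figure is a placeholder to be replaced with your own GPU quote and API price sheet.

```python
# Break-even sketch for self-hosting vs. a closed API.
# All inputs are placeholder assumptions; substitute your real quotes.

def monthly_self_host_usd(gpu_hourly: float, num_gpus: int,
                          ops_hours: float, eng_hourly: float) -> float:
    """Fixed cost: GPU rental plus the human time that keeps the stack alive."""
    return gpu_hourly * num_gpus * 730 + ops_hours * eng_hourly  # ~730 h/month

def breakeven_tokens(fixed_monthly: float, api_usd_per_mtok: float) -> float:
    """Monthly token volume above which self-hosting beats the API on raw cost."""
    return fixed_monthly / api_usd_per_mtok * 1_000_000

# Hypothetical: 2 GPUs at $2.50/h, 20 engineer-hours/month at $100/h,
# against a blended API price of $10 per million tokens.
fixed = monthly_self_host_usd(gpu_hourly=2.5, num_gpus=2, ops_hours=20, eng_hourly=100)
print(f"break-even at ~{breakeven_tokens(fixed, api_usd_per_mtok=10):,.0f} tokens/month")
```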
Where they fit in real architectures
Patterns that have settled:
The “router” pattern: a thin layer that decides which model handles each request based on objective criteria (input length, data sensitivity, cost budget, required quality).
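A hedged sketch of what such a layer can look like in plain Python; the model names, thresholds, and criteria are illustrative, not a prescription. The point is that routing is a small, testable function, not a framework.

```python
from dataclasses import dataclass

@dataclass
class Request:
    text: str
    sensitive: bool          # must the data stay inside the perimeter?
    max_cost_usd: float      # budget per call
    needs_top_quality: bool  # e.g. long-form code generation

def route(req: Request) -> str:
    """Return the name of the backend that should handle this request."""
    if req.sensitive:
        return "self-hosted-llama-70b"    # regulated data never leaves
    if req.needs_top_quality and req.max_cost_usd > 0.05:
        return "closed-api-frontier"      # pay for the top tier when it matters
    if len(req.text) < 2_000:
        return "self-hosted-llama-8b"     # cheap path for short, simple inputs
    return "self-hosted-llama-70b"

print(route(Request("Classify this ticket...", sensitive=False,
                    max_cost_usd=0.01, needs_top_quality=False)))
```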
RAG with open embeddings: models like BGE-M3, jina-embeddings-v3, and Nomic's embeddings match their closed counterparts and serve in tens of MB of RAM. In many cases, the expensive piece of a RAG system is no longer the LLM but the indexing pipeline and chunk quality.
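A minimal retrieval sketch using BGE-M3 through sentence-transformers; loading it this way is an assumption about your stack, and the documents are toy data.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

docs = ["Invoices are due within 30 days.", "Refunds require a signed form."]
query = "When do I have to pay an invoice?"

# Normalized embeddings make the dot product a cosine similarity.
doc_emb = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)

scores = (q_emb @ doc_emb.T)[0]
best = scores.argmax()
print(docs[best], float(scores[best]))
```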
Domain fine-tuning with LoRA: serious fine-tuning only pays off when you have a narrow, well-bounded domain with distinctive vocabulary and thousands of curated examples. Otherwise, prompt engineering combined with RAG performs better.
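For orientation, a LoRA configuration sketch with the peft library. The base model, rank, and target modules are common defaults, not recommendations for any specific domain.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

config = LoraConfig(
    r=16,                                 # low-rank dimension of the adapters
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```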
Failures worth remembering
- Underestimating operational cost. A self-hosted model needs care: monitoring, software updates, GPU management, capacity reservation, and resilience to node failures.
- Defaulting to the biggest model. The 405B is impressive, but for most enterprise cases it offers no perceptible improvement over a well-quantized 70B.
- Rigidity in the face of evolution. Models improve every two or three months. Teams that came through the year well built abstractions to swap out the underlying model and treat it as an interchangeable component; a minimal sketch follows this list.
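Here is one way that abstraction can look; the Protocol and the stub backend are illustrative, not a prescribed design.

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

class StubBackend:
    """Stand-in for tests; real backends would wrap a vLLM endpoint or a vendor SDK."""
    def complete(self, prompt: str) -> str:
        return f"[stub completion for: {prompt[:40]}...]"

def summarize_ticket(ticket: str, model: ChatModel) -> str:
    # Application code depends only on the interface, never on a vendor SDK,
    # so swapping the backend next quarter is a config change, not a rewrite.
    return model.complete(f"Summarize this support ticket:\n{ticket}")

print(summarize_ticket("Customer cannot log in since the last update.", StubBackend()))
```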
When they pay off
Open weights clearly pay off in three situations:
- High sustained volume where token cost dominates total cost.
- Regulated or sensitive data where leaving the perimeter is friction or impossible.
- Edge/offline: private mobile apps, industrial deployments, offline scenarios.
In other cases (low volume, non-sensitive data, non-critical latency), APIs remain the rational choice. Not for quality, which is already equivalent, but for total operational cost.
For teams deciding today, investing in an abstraction layer and your own evaluations is worth more than the specific choice of which model to use this quarter. Models change; the engineering around them changes far less.