Open-weight models in enterprise: one year on
Updated: 2026-05-03
A little over a year ago, when many of us started putting open-weight models into real enterprise environments, the conversation was cautious. We had a decent Llama 2, a brilliant-for-its-size Mistral 7B, and a handful of alternatives that trailed the GPT-4 of the time. A year on, the situation has changed enough to deserve an honest stocktaking.
Key takeaways
- Parity with closed models is per task, not global: classification, extraction, and well-scoped summarization are hard to tell apart blind; complex reasoning and long code still give closed APIs an edge.
- The real production cost is serving with acceptable latencies under reasonable concurrency; training graphs are misleading.
- Three patterns have settled: a multi-model router, RAG with open embeddings, and LoRA fine-tuning reserved for narrow, well-bounded domains.
- Three recurring failures: underestimating operational cost, defaulting to the biggest model, and locking onto a specific model without abstractions.
- Open weights clearly pay off for high sustained volume, regulated data, and edge/offline deployments. Everywhere else, the API remains the rational choice on total operational cost.
The quality jump was real, but uneven
The distance to closed models has clearly shrunk:
- Llama 3 and 3.1 405B: showed that openly released weights could reach GPT-4 Turbo levels on standard evaluations.
- DeepSeek V3 and R1: competitive performance at notably lower training budgets; R1 with explicit chain-of-thought reasoning.
- Qwen 2.5: the strongest option for Asian languages and the most predictable for code.
- Mistral Large 2: European alternative with flexible commercial licensing.
The important nuance is that parity is per task, not global. For classification, extraction, and well-scoped summarization, the difference between a well-served Llama 3.1 70B and a closed API is hard to perceive in a blind test. For most real enterprise work, open weights are already enough. For the top tier (complex autonomous agents, high-quality large-context code generation), it still pays to compare.
What they really cost to serve
The real production cost is serving with acceptable latencies under reasonable concurrency. A Llama 3.1 70B quantized to 4 bits fits on one 80 GB H100 and gives good latencies for one user at a time. Scaling to tens of concurrent requests is the hard part.
Systems that have worked: vLLM and TGI on NVIDIA hardware, and SGLang when aggressive sequence parallelism was needed. Ollama is excellent for local development but not a serious production tool.
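For reference, here is a minimal sketch of offline serving with vLLM's Python API, assuming a 4-bit AWQ checkpoint of Llama 3.1 70B. The model ID below is illustrative; substitute whatever quantized checkpoint you actually deploy.

```python
# Minimal vLLM serving sketch: one quantized Llama 3.1 70B on a single GPU.
# The model ID and quantization method are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # hypothetical 4-bit checkpoint
    quantization="awq",           # 4-bit AWQ fits on a single 80 GB H100
    tensor_parallel_size=1,       # one GPU; raise this to shard across several
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache under concurrency
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this contract clause: ..."], params)
print(outputs[0].outputs[0].text)
```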
Cost versus a closed API depends heavily on volume. At tens of millions of tokens per month, renting A100 or H100 GPUs and serving with vLLM comes in at roughly half the OpenAI or Anthropic price, though with non-trivial operational expenses. At low volumes, the API wins simply because it carries no human operational overhead.
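The underlying arithmetic is a fixed-versus-variable cost comparison, sketched below. Every figure is a placeholder to be replaced with your own GPU quote and API price sheet.

```python
# Break-even sketch for self-hosting vs. a closed API.
# All inputs are placeholder assumptions; substitute your real quotes.

def monthly_self_host_usd(gpu_hourly: float, num_gpus: int,
                          ops_hours: float, eng_hourly: float) -> float:
    """Fixed cost: GPU rental plus the human time that keeps the stack alive."""
    return gpu_hourly * num_gpus * 730 + ops_hours * eng_hourly  # ~730 h/month

def breakeven_tokens(fixed_monthly: float, api_usd_per_mtok: float) -> float:
    """Monthly token volume above which self-hosting beats the API on raw cost."""
    return fixed_monthly / api_usd_per_mtok * 1_000_000

# Hypothetical: 2 GPUs at $2.50/h, 20 engineer-hours/month at $100/h,
# against a blended API price of $10 per million tokens.
fixed = monthly_self_host_usd(gpu_hourly=2.5, num_gpus=2, ops_hours=20, eng_hourly=100)
print(f"break-even at ~{breakeven_tokens(fixed, api_usd_per_mtok=10):,.0f} tokens/month")
```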
Where they fit in real architectures
Patterns that have settled:
The “router” pattern: a thin layer that decides which model handles each request based on objective criteria (input length, data sensitivity, cost budget, required quality).
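A hedged sketch of what such a layer can look like in plain Python; the model names, thresholds, and criteria are illustrative, not a prescription. The point is that routing is a small, testable function, not a framework.

```python
from dataclasses import dataclass

@dataclass
class Request:
    text: str
    sensitive: bool          # must the data stay inside the perimeter?
    max_cost_usd: float      # budget per call
    needs_top_quality: bool  # e.g. long-form code generation

def route(req: Request) -> str:
    """Return the name of the backend that should handle this request."""
    if req.sensitive:
        return "self-hosted-llama-70b"    # regulated data never leaves
    if req.needs_top_quality and req.max_cost_usd > 0.05:
        return "closed-api-frontier"      # pay for the top tier when it matters
    if len(req.text) < 2_000:
        return "self-hosted-llama-8b"     # cheap path for short, simple inputs
    return "self-hosted-llama-70b"

print(route(Request("Classify this ticket...", sensitive=False,
                    max_cost_usd=0.01, needs_top_quality=False)))
```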
RAG with open embeddings: models like BGE-M3, jina-embeddings-v3, and Nomic's embeddings match their closed counterparts and serve in tens of MB of RAM. In many cases, the expensive piece of a RAG system is no longer the LLM but the indexing pipeline and chunk quality.
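A minimal retrieval sketch using BGE-M3 through sentence-transformers; loading it this way is an assumption about your stack, and the documents are toy data.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

docs = ["Invoices are due within 30 days.", "Refunds require a signed form."]
query = "When do I have to pay an invoice?"

# Normalized embeddings make the dot product a cosine similarity.
doc_emb = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)

scores = (q_emb @ doc_emb.T)[0]
best = scores.argmax()
print(docs[best], float(scores[best]))
```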
Domain fine-tuning with LoRA: serious fine-tuning only pays off when you have a narrow, well-bounded domain with distinctive vocabulary and thousands of curated examples. Otherwise, prompt engineering combined with RAG performs better.
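For orientation, a LoRA configuration sketch with the peft library. The base model, rank, and target modules are common defaults, not recommendations for any specific domain.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

config = LoraConfig(
    r=16,                                 # low-rank dimension of the adapters
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```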
Failures worth remembering
- Underestimating operational cost. A self-hosted model needs care: monitoring, software updates, GPU management, capacity reservation, and resilience to node failures.
- Defaulting to the biggest model. The 405B is impressive, but for most enterprise cases it offers no perceptible improvement over a well-quantized 70B.
- Rigidity in the face of evolution. Models improve every two or three months. Teams that came through the year well built abstractions to swap out the underlying model and treat it as an interchangeable component; a minimal sketch follows this list.
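Here is one way that abstraction can look; the Protocol and the stub backend are illustrative, not a prescribed design.

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

class StubBackend:
    """Stand-in for tests; real backends would wrap a vLLM endpoint or a vendor SDK."""
    def complete(self, prompt: str) -> str:
        return f"[stub completion for: {prompt[:40]}...]"

def summarize_ticket(ticket: str, model: ChatModel) -> str:
    # Application code depends only on the interface, never on a vendor SDK,
    # so swapping the backend next quarter is a config change, not a rewrite.
    return model.complete(f"Summarize this support ticket:\n{ticket}")

print(summarize_ticket("Customer cannot log in since the last update.", StubBackend()))
```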
When they pay off
Open weights clearly pay off in three situations:
- High sustained volume where token cost dominates total cost.
- Regulated or sensitive data where leaving the perimeter is friction or impossible.
- Edge/offline: private mobile apps, industrial deployments, offline scenarios.
In other cases (low volume, non-sensitive data, non-critical latency), APIs remain the rational choice. Not for quality, which is already equivalent, but for total operational cost.
For teams deciding today, investing in an abstraction layer and your own evaluations is worth more than the specific choice of which model to use this quarter. Models change; the engineering around them changes far less.