Inference routers: choosing a model based on the request

Diagram of a load-balanced cluster using network address translation, from Wikimedia Commons; it conceptually illustrates the same idea applied to inference routers, where a front-end node receives requests and distributes them among different language models according to cost and latency rules, keeping a single interface and hiding the heterogeneity of the providers that answer the end user.

An inference router is the piece that decides, for each request arriving at your language-model application, which specific model to send it to. Throughout 2023 and most of 2024 the dominant architecture was an application pointing to a single model, usually GPT-4 or Claude, and that simplicity felt reassuring. In 2025, however, serious teams realized a single model is not optimal for cost, latency or fit to request type, and inference routers became a standard piece of production deployments. It's worth understanding what they do, which patterns work, which caveats matter and when their complexity pays off.

What problem they solve

The starting point is that language-model requests aren’t homogeneous. A simple classification question, a routine support conversation, or entity extraction from a short document can be handled perfectly well by a cheap, fast model. An open question requiring complex reasoning, a long code analysis, or an exhaustive technical write-up benefits from a large model. Sending everything to the big one is expensive; sending everything to the small one is bad on complex cases. The middle solution is routing.

The typical router receives the request, quickly evaluates it, decides which model should handle it, and forwards the request to that model, returning the answer to the client. That decision can use simple heuristics, a trained classifier, a small auxiliary model acting as triage, or a combination. The goal is that the client notices no difference and the bill drops significantly.

In a typical well-designed use case, a sensible router cuts total token cost by thirty to seventy percent while preserving perceived quality. That’s real money in high-volume production deployments, and one of the reasons routers went from curiosity to assumed infrastructure.

Decision patterns

The most basic pattern is length-based routing. If the request is short and the expected answer is short, small model; if long, big model. This heuristic captures quite a bit of value with minimal complexity, because simple tasks are usually short and complex ones usually long. Not perfect but a good starting point.
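A minimal sketch of this heuristic, assuming a character-length threshold as a proxy for task size; the model names and the cutoff value are illustrative, not prescriptive:

```python
def route_by_length(prompt: str, threshold_chars: int = 1500) -> str:
    # Longer prompts tend to carry more complex tasks; send them to the
    # large model and everything else to the cheap one.
    return "large-model" if len(prompt) > threshold_chars else "small-model"
```

Tuning the threshold against a sample of real traffic is the main knob this pattern offers.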

A more sophisticated pattern is task-type routing. The application knows which function is being invoked: summarization, extraction, creative generation, reasoning. Each function has a preconfigured optimal model. This pattern requires code to differentiate request types explicitly, but when the structure allows it, it’s very effective because the decision is deterministic and auditable.
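One way to express this is a static task-to-model table; the task names and model names below are placeholders for whatever your application actually distinguishes:

```python
# Hypothetical mapping from application function to preconfigured model.
TASK_MODELS = {
    "summarization": "small-model",
    "extraction": "small-model",
    "creative": "large-model",
    "reasoning": "large-model",
}

def route_by_task(task: str) -> str:
    # Unknown task types fall back to the large model to stay safe.
    return TASK_MODELS.get(task, "large-model")
```

Because the mapping lives in one place, the decision is trivial to audit and to change.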

An interesting hybrid pattern uses a small model as classifier. The request arrives, a very cheap small model decides whether it’s simple or complex, and based on that decision the router sends it to an executor model. The classifier model can be something like GPT-4o mini or Claude Haiku: it costs a fraction of the executor and adds a few hundred milliseconds of latency. If the classifier is right ninety percent of the time, the savings more than compensate.
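The shape of that pattern, sketched with a stub in place of the real classifier call so the example is self-contained; in production `classify_with_llm` would be an actual request to the cheap model, and the executor names are hypothetical:

```python
def classify_with_llm(prompt: str) -> str:
    # Placeholder for a real call to a cheap triage model via your SDK;
    # a trivial length heuristic stands in so the sketch is runnable.
    return "complex" if len(prompt) > 500 else "simple"

def triage_route(prompt: str) -> str:
    label = classify_with_llm(prompt)
    return "executor-large" if label == "complex" else "executor-small"
```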

The most advanced pattern is learned routing. Production logs of requests and results are collected, each request is labeled according to whether the cheap model answered it well or the big one was needed, and a specialized classifier is trained on those labels. This pattern cuts the most cost but also demands the most operational effort: retrain periodically, watch for drift, maintain data-collection infrastructure.
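A deliberately tiny stand-in for that idea, assuming the only logged feature is prompt length: it searches for the length cutoff that best separates requests the cheap model handled well from those that needed the big one. A real learned router would train a proper classifier on richer features, but the fitting-from-production-labels loop is the same:

```python
def learn_threshold(logged: list[tuple[int, bool]]) -> int:
    # logged: (prompt_length, cheap_model_was_good) pairs from production.
    # Pick the length cutoff that maximizes routing accuracy.
    candidates = sorted({length for length, _ in logged})
    best_cut, best_acc = 0, -1.0
    for cut in candidates:
        correct = sum(
            1 for length, good in logged
            if (length <= cut) == good  # short -> cheap model, long -> big
        )
        acc = correct / len(logged)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut
```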

Providers and options

The 2025 ecosystem offers several options to avoid building a router from scratch. LiteLLM has consolidated as the reference abstraction layer: a local proxy that speaks to most large providers with a uniform interface, allowing switching models without touching application code. Many teams use it as the base on which to build their routing logic.

OpenRouter goes further, offering routing as a service: it connects to dozens of providers and lets you pick policies such as optimizing for cost, for latency or for a balance of the two. It’s useful for teams that want flexibility without operating infrastructure, though it introduces a dependency on one more provider.

Portkey and Helicone occupy an adjacent space with a focus leaning more toward observability and governance: request traces, aggregated metrics, budget control. They are less focused on pure routing decisions, but they cover an important need in serious applications.

For teams wanting total control, building on LiteLLM or directly on the official SDKs is reasonable. A router’s core logic isn’t complex; the complexity lives in the details of observability, graceful failure handling and retry policy.

Common traps

The most frequent mistake is testing the router only with synthetic requests, not with real traffic. Real requests have a different distribution: more diversity, more atypical cases, more unexpected formats. A router tuned against a hundred-example test set can behave very differently with ten thousand real requests, and discovering failures only once they’re in production hurts.

Another mistake is not measuring quality degradation. Routing to the cheap model lowers cost but can drop quality in cases where the heuristic fails. Without a systematic way to measure that degradation, the team might be saving money at the cost of worse user experience without realizing it. Automated evaluations regularly run on production samples are the only reliable way to detect this.

A third mistake is failure asymmetry. When the cheap model fails or returns something inadequate, there must be an escalation flow to the big model. If that flow doesn’t exist, some users get a poor answer and never recover. The correct pattern: try the cheap model; if the answer meets acceptance criteria, use it; otherwise retry with the big model and return that. This has a latency cost but preserves quality.
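The escalation pattern above can be sketched as follows; `call_model` and `is_acceptable` are injected so the example stays provider-agnostic, and the model names are illustrative:

```python
def answer_with_escalation(prompt: str, call_model, is_acceptable) -> str:
    # call_model(model_name, prompt) performs the actual request;
    # is_acceptable(answer) is your quality gate (validator, rubric, etc.).
    draft = call_model("small-model", prompt)
    if is_acceptable(draft):
        return draft
    # Cheap answer failed the quality gate: retry with the big model.
    return call_model("large-model", prompt)
```

The quality gate is the hard part in practice; a format validator or a cheap scoring pass are common choices.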

A fourth mistake is ignoring conversation context. In a chat, deciding which model handles the current turn should consider history. If the user has asked six complex questions, the seventh probably is too, even if it looks brief. Routers deciding turn by turn without context tend to miss more than they should.

Minimal router example

def estimate_tokens(text: str) -> int:
    # Rough heuristic: about four characters per token for English text.
    return len(text) // 4

def route_request(prompt: str, history: list[str]) -> str:
    # Approximate context size from the prompt plus recent history.
    tokens = estimate_tokens(prompt) + sum(
        estimate_tokens(m) for m in history[-6:]
    )
    # Code blocks and "reasoning" keywords are cheap signals of complexity.
    has_code = "```" in prompt or any("```" in m for m in history[-3:])
    complexity_keywords = ["analyze", "explain why", "compare", "design"]
    is_complex = any(k in prompt.lower() for k in complexity_keywords)

    if tokens > 2000 or has_code or is_complex:
        return "sonnet-4-5"
    return "haiku-3-5"

This short router captures useful patterns without complexity: length, presence of code, keywords suggesting complexity. It isn’t optimal, but in real applications I’ve seen it cut cost by around forty percent, a return that justifies the effort of writing and maintaining it. More sophisticated variants are only worth it when the marginal savings justify the work.

Minimal observability

A blind router is dangerous. For each request you must log which model was chosen, by what rule, what latency it took, what cost it incurred and what outcome it produced. That telemetry later allows analyzing whether heuristics were correct, detecting previously invisible patterns and tuning the router with data. Without telemetry the router can be making bad decisions for months without anyone noticing.
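A minimal sketch of that per-request record, assuming structured JSON lines as the log format; the field names are illustrative, not a standard:

```python
import json
import time

def log_routing_decision(model: str, rule: str, latency_ms: float,
                         cost_usd: float, outcome: str) -> str:
    # One structured record per request, ready to append to a log sink.
    record = {
        "ts": time.time(),
        "model": model,        # which model was chosen
        "rule": rule,          # which heuristic fired
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "outcome": outcome,    # e.g. "ok", "escalated", "error"
    }
    return json.dumps(record)
```

Keeping the record flat and machine-readable is what later makes the heuristic analysis cheap.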

Also expose aggregate metrics to the team: distribution of models used, average cost per request, proportion of escalations to the big model, latency distribution. These metrics detect early drifts, like when a change in request types makes the cheap model insufficient and the escalation proportion rises. A simple alert on that proportion catches such drift before it becomes a perceived problem.
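The alert on escalation proportion can be as simple as a rate check over a recent window; the `"escalated"` label and the 20% threshold are illustrative assumptions:

```python
def escalation_alert(outcomes: list[str], threshold: float = 0.2) -> bool:
    # True when the share of requests escalated to the big model
    # exceeds the threshold: a crude but effective drift signal.
    if not outcomes:
        return False
    rate = outcomes.count("escalated") / len(outcomes)
    return rate > threshold
```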

When it pays off

An inference router makes no sense if your application uses a single model for everything and volume is low. Added complexity isn’t paid off by savings. It makes sense when volume is reasonable, requests are heterogeneous in complexity, and the team can invest some time in measuring and tuning. In that combination savings are consistent and degradation risk is kept under control with proper observability.

My reading

Inference routers are one of those pieces that in 2023 looked like over-engineering and in 2025 are assumed practice. Model technology has created a space with much price and capability variation, and treating that space as homogeneous leaves too much money on the table. Well-designed, they’re a massive efficiency lever; poorly designed, a silent source of degradation.

The operational recommendation is to start simple, with clear heuristics and observability from day one. Iterate with real data, resist premature sophistication, and only move to learned approaches when heuristics have been exhausted. A fifty-line router with good telemetry almost always beats a trained classifier without metrics or review, because the first adapts and the second rusts. That discipline is what separates real savings from apparent savings.
