LiteLLM: A Proxy to Unify Model Providers

Fiber router with connected cables, representing traffic orchestration across providers

The first integration with an LLM is always easy: one key, one SDK, three lines and a prompt. The second, six months later, is no longer so. A second provider appears because Claude reasons better on long tasks, or because a self-hosted model is needed for data that cannot leave the perimeter, or because someone discovers that Cohere multilingual embeddings cost a fraction of what OpenAI charges for equivalent work. At that point the application code stops being clean. Each SDK has its own client, its own message format, its own streaming semantics, its own errors, its own rules for function calling. The team starts writing adapters, and every new cross-cutting requirement — rate limiting, observability, per-tenant budget, fallback when a provider is down — has to be implemented twice or three times over.

The pattern that solves this is old and familiar in infrastructure: a proxy. Instead of each application talking directly to each provider, they all talk to a single internal service that talks to the outside world on their behalf. LiteLLM is, as of early 2024, the most serious open-source project for doing this in the LLM space. It offers an OpenAI-compatible API over more than a hundred providers, it can be deployed as a library or as an HTTP server, and it comes with most of the things you would eventually end up writing yourself.

Why Proxy at All

The question is not trivial, because any proxy adds latency, another component to maintain, and another failure point. The justification has to be concrete. There are four reasons, and they usually arrive together.

The first is homogeneity. A single OpenAI-compatible client across all applications, pointing at an internal endpoint, replaces half a dozen SDKs. Switching models becomes a configuration change, not a refactor. Migrating an entire app from GPT-4 to Claude 3 Opus is reduced to repointing an alias.
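In practice the unchanged OpenAI SDK works against the proxy by repointing its base URL. The dependency-free sketch below shows the same request shape using only the standard library; the host, virtual key and model alias are placeholders for illustration, not values the proxy prescribes:

```python
import json
from urllib import request

# Hypothetical internal endpoint and virtual key, for illustration only.
PROXY_URL = "http://litellm.internal:4000/v1/chat/completions"

def build_payload(model: str, prompt: str) -> dict:
    # Plain OpenAI chat format; the proxy translates it for whatever
    # backend the logical model name ("gpt-4", "claude-3-sonnet") maps to.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str, virtual_key: str) -> dict:
    req = request.Request(
        PROXY_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Authorization": f"Bearer {virtual_key}",
                 "Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Migrating the application to another provider then means either changing the model argument or, better, leaving the alias alone and remapping it in the proxy configuration.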

The second is governance. As soon as more than one team uses LLMs, someone asks how much each is spending, and ideally wants to cap it before next month’s invoice surprises finance. A central proxy issues virtual keys per team, per user or per service, with budget and expiry attached. The real provider keys live in exactly one place.
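Minting such a key is an API call against the proxy, authenticated with the master key. The sketch below targets LiteLLM's key-generation endpoint; treat the exact field names and the host as assumptions to verify against the documentation for the deployed version:

```python
import json
from urllib import request

PROXY = "http://litellm.internal:4000"  # hypothetical internal endpoint

def key_request_body(team_id: str, monthly_budget_usd: float) -> dict:
    # Budget and expiry are attached at mint time; field names follow
    # LiteLLM's key-management API (verify against your version).
    return {"team_id": team_id,
            "max_budget": monthly_budget_usd,
            "budget_duration": "30d"}

def mint_virtual_key(master_key: str, team_id: str, budget: float) -> str:
    req = request.Request(
        f"{PROXY}/key/generate",
        data=json.dumps(key_request_body(team_id, budget)).encode(),
        headers={"Authorization": f"Bearer {master_key}",
                 "Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["key"]
```

The returned key is what the team puts in its Authorization header; the real provider keys never leave the proxy.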

The third is resilience. LLM providers go down, rate-limit, and serve degraded responses more often than one would expect from services at their price point. A proxy can declare fallbacks — if GPT-4 returns 429 or 5xx, retry on Claude 3 Sonnet; if Anthropic is saturated, fall back to the self-hosted Mistral — without the applications noticing. This turns provider incidents into silent degradations rather than product outages.

The fourth is observability. Cost, latency and token metrics by model, tenant and route, emitted from a single point to Prometheus or Langfuse, avoid having to instrument every call in every application. It is also the natural place to insert caching, PII redaction, auditing and compliance.

Library or Server

LiteLLM can be used in two modes, and the choice shapes everything else. In library mode you import litellm and call litellm.completion inside the application code, enjoying the unified API without deploying anything new. This is reasonable for monoliths, prototypes or one-off scripts, but it loses almost all the cross-cutting benefits: every instance of the app needs the keys, every team does its own rate limiting, every service emits metrics its own way.
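A minimal sketch of library mode, assuming litellm is installed and the provider keys are in the environment; the prefix in the model string selects the backend:

```python
def build_messages(prompt: str) -> list[dict]:
    # OpenAI chat format, which litellm translates per provider.
    return [{"role": "user", "content": prompt}]

def ask(model: str, prompt: str) -> str:
    """Unified chat call; `model` can be "openai/gpt-4",
    "anthropic/claude-3-sonnet-20240229", "ollama/mistral", etc."""
    import litellm  # deferred so the sketch imports without the dependency
    resp = litellm.completion(model=model, messages=build_messages(prompt))
    return resp.choices[0].message.content
```

The call site stays identical across providers; only the model string changes, which is exactly the property the proxy later centralises.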

In proxy mode you deploy a separate binary — container, pod, systemd unit — and the applications talk to it as if it were OpenAI. This is the default configuration for any serious use. Its cost is an internal network hop of order 5-20 ms, negligible compared with the hundreds or thousands of milliseconds of an actual LLM call. Its benefit is concentrating all cross-cutting logic in one place.

What the Proxy Declares

A typical configuration is a YAML with three blocks. The first, model_list, maps logical names like gpt-4, claude-3-sonnet or mistral-local to concrete provider configurations: the prefix openai/, anthropic/ or ollama/ identifies the backend, the key is read from an environment variable, and api_base can point at an internal Ollama. The second, router_settings, declares routing policy and fallbacks: an ordered list per logical model indicates which others to jump to when the first fails, and a global strategy such as least-busy, lowest-cost or lowest-latency decides the tie-breaker when several candidates qualify. The third, general_settings, sets the master key used by an administrator to mint virtual keys via API, points at a Postgres to persist budgets and usage, and optionally wires a Redis for caching of semantically equivalent responses.

The minimum fragment — the only one worth the space here — captures the three pieces together:

model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-3-sonnet
    litellm_params:
      model: anthropic/claude-3-sonnet-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks:
    - gpt-4: ["claude-3-sonnet"]
  routing_strategy: least-busy

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL

The rest of the surface — per-key budgets, Redis caching with TTL, tagging by environment, Langfuse or Helicone integration, Prometheus metrics — is described in the same file with analogous blocks and applied without touching application code.
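As one illustration, a caching block in the same style; the key names follow LiteLLM's documented litellm_settings section, but they should be checked against the version actually deployed before use:

```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST   # many versions also read REDIS_HOST directly
    port: 6379
    ttl: 600                      # seconds a cached response stays valid
```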

What Not to Expect

LiteLLM translates between different APIs, and the translation is not always perfect. The most provider-specific features — structured output with complex schemas, OpenAI function calling versus Anthropic tool use, the reasoning modes of certain models — sometimes do not map 1:1. It is worth reading the changelog before trusting a critical flow to a non-trivial translation. Added latency is small but not zero, and for high-volume embedding workloads it can be more noticeable than expected. The proxy itself is yet another piece to maintain, with its own database, upgrades and metrics. And if real usage is a single provider with no plan to change, the complexity does not pay for itself: a local abstraction layer in the backend is enough.

A Pattern That Works

The deployment I have seen stabilise in several teams is always similar. Two replicas of the proxy behind an internal service, a shared Postgres for keys and usage, a Redis for semantic cache, virtual keys per team or service with a monthly budget, fallbacks declared for the two or three critical models, a Prometheus scrape with model, tenant and route labels, and alerts on per-provider error rate. Applications see a single OpenAI-compatible endpoint and send their virtual key in a header; everything else happens inside the proxy.
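Under stated assumptions (image tag, port, environment variable names), that pattern compresses into a compose sketch like the following; in the real deployment the litellm service runs with two replicas behind the internal load balancer:

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest  # pin a release tag in production
    command: ["--config", "/etc/litellm/config.yaml"]
    ports: ["4000:4000"]
    environment:
      DATABASE_URL: postgresql://litellm:litellm@postgres:5432/litellm
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
    depends_on: [postgres, redis]
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
      POSTGRES_DB: litellm
  redis:
    image: redis:7
```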

Conclusion

An LLM proxy is not a revolutionary idea; it is the same indirection layer already placed between applications and databases, between applications and queues, between applications and identity. It earns its place for the same reasons: it isolates decisions that change often, it concentrates governance and observability, and it lets the application ignore the details of the provider. LiteLLM is today the most complete open-source implementation, stable enough for production and flexible enough to absorb the changes that will keep arriving in the model stack over the coming quarters. With a single provider and no foreseeable second one, the component is dispensable. From the second model onwards, giving up hand-written adapters and delegating to a proxy stops being a matter of taste and becomes basic hygiene.
