Constrained Decoding for Structured LLM Outputs


Asking an LLM to return valid JSON can fail in many ways: the model forgets closing braces, omits commas, or adds comments. Constrained decoding solves this by changing how tokens are generated: at each step, only tokens compatible with the target grammar are allowed. The result is a mathematical guarantee that the output matches the format.

Outlines, Guidance and jsonformer are the main libraries. This article covers how they work, when they beat OpenAI’s json_mode, and how to integrate them.

The Problem

Prompt: “Respond with JSON: {name, age}”.

The model may:

  • Forget a comma.
  • Wrap the output in ```json fences.
  • Prepend a "here you go:" preamble.
  • Omit the closing brace.

After iterating on ever more careful prompts without ever reaching 100% reliability, you start looking for real robustness.
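In the meantime, many teams patch these failures in post-processing. A sketch of the kind of cleanup helper that tends to accumulate (the function name and heuristics are illustrative, not from any library):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Best-effort cleanup of common LLM JSON failures (illustrative)."""
    text = raw.strip()
    # Strip trailing markdown fences like ``` at the end
    text = re.sub(r"\s*```$", "", text)
    # Drop chatty prefixes (and leading fences) before the first brace
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    text = text[start:]
    # Re-balance missing closing braces
    text += "}" * (text.count("{") - text.count("}"))
    return json.loads(text)

print(parse_llm_json('here you go: ```json\n{"name": "Ana", "age": 25}\n```'))
```

Code like this still breaks on edge cases (braces inside strings, truncated values), which is exactly why constrained decoding is attractive: it prevents the malformed output instead of repairing it.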

How Constrained Decoding Works

At each generation step:

  1. LLM produces probability distribution over vocabulary (~50k tokens).
  2. Mask based on grammar: invalid tokens get probability 0.
  3. Sample or argmax only over valid tokens.

Result: output respects grammar perfectly.

The constraint can be expressed as a JSON Schema, a regex, or a context-free grammar (CFG).
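The three steps above can be sketched in a few lines. This is a toy example over a five-token vocabulary, not any library's real implementation:

```python
import math

def constrained_step(logits, valid_ids):
    """One decoding step: mask invalid tokens, renormalize, pick argmax."""
    # Step 2: grammar-based mask -- invalid tokens get probability 0 (-inf logit)
    masked = [l if i in valid_ids else float("-inf") for i, l in enumerate(logits)]
    # Softmax over the surviving tokens only
    m = max(x for x in masked if x != float("-inf"))
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 3: argmax restricted to valid tokens
    return max(range(len(probs)), key=probs.__getitem__)

# Toy vocabulary: 0='{', 1='}', 2=',', 3='"name"', 4='hello'
logits = [1.0, 2.5, 0.3, 2.0, 3.0]   # the model prefers 'hello'
# The grammar says only '{' may open a JSON object:
print(constrained_step(logits, valid_ids={0}))  # 0, even though token 4 scored highest
```

Real implementations compile the grammar into an automaton so the set of valid token IDs can be looked up cheaply at each step rather than recomputed.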

Outlines

Outlines is a Python library that works with many model backends (HF Transformers, llama.cpp, vLLM):

from pydantic import BaseModel
from outlines import models, generate

model = models.transformers("meta-llama/Meta-Llama-3-8B-Instruct")

# Target structure defined with Pydantic
class User(BaseModel):
    name: str
    age: int

generator = generate.json(model, User)
user = generator("Extract user: Ana is 25 years old")
print(user)  # User(name='Ana', age=25)

Supports:

  • Pydantic models for typed structure.
  • Direct JSON Schema.
  • Regex for specific formats.
  • CFG grammars for custom DSLs.

Guidance: More General

Microsoft’s Guidance allows more complex templates:

import guidance
from guidance import models, gen, select

llama = models.LlamaCpp("path/to/model.gguf")

lm = llama + "Favorite colour is " + select(['red', 'blue', 'green']) + "."

Guidance lets you interleave fixed text with constrained-generated regions, which is ideal for complex templates.
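Conceptually, `select` constrains generation so that every emitted token keeps the output a prefix of at least one option. A toy character-level sketch of that idea (not Guidance's real implementation, which works at the token level):

```python
def select_constrained(options, score):
    """Greedy char-by-char generation restricted to prefixes of `options`.
    `score` ranks candidate next characters (a stand-in for model logits)."""
    out = ""
    while out not in options:
        # Only characters that extend some option are valid
        valid = {o[len(out)] for o in options if o.startswith(out) and len(o) > len(out)}
        out += max(valid, key=score)
    return out

# Toy "model" that prefers characters later in the alphabet:
print(select_constrained(["red", "blue", "green"], score=ord))  # 'red'
```

Once the first character is chosen, the remaining characters are forced: after 'r', only "red" remains reachable, so the rest is emitted deterministically.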

jsonformer: Simple and Focused

It only handles JSON, but it is very simple:

from jsonformer import Jsonformer

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    }
}

# model and tokenizer: a HF Transformers model/tokenizer pair; prompt is your instruction
jsonformer = Jsonformer(model, tokenizer, schema, prompt)
result = jsonformer()

Less flexible but easier to start.

vs OpenAI json_mode

OpenAI added response_format={"type": "json_object"} and later Structured Outputs with JSON Schema. Comparison:

| Aspect            | Local Outlines         | OpenAI Structured Outputs |
|-------------------|------------------------|---------------------------|
| Grammar guarantee | 100%                   | 100% (schema)             |
| Models            | Open (Llama, Mistral)  | GPT-4o and later          |
| Cost              | Local GPU              | API pricing               |
| Privacy           | Stays local            | OpenAI retention policy   |
| Latency           | Depends on your GPU    | API-dependent             |

If you already build on OpenAI's API, Structured Outputs is fine. For self-hosted models, use Outlines or Guidance.

Regex Mode for Specific Formats

With Outlines' regex mode:

from outlines import generate

phone_gen = generate.regex(
    model,
    r"[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}"
)
phone_gen("Call me at ")  # "555-123-4567"

Useful for:

  • Dates in specific format.
  • Serial numbers.
  • Codes (SKU, references).
  • Domain identifiers.
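Because the mask is derived from the pattern's automaton, any string the generator emits must satisfy `re.fullmatch` against the same pattern. A self-contained check of the phone pattern above:

```python
import re

PHONE = r"[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}"

# Strings a constrained generator could emit all fullmatch the pattern...
for s in ["555-123-4567", "(555)123-4567", "+555 123 4567"]:
    assert re.fullmatch(PHONE, s)

# ...while free-form outputs like these could never be produced:
for s in ["call 555-123-4567", "555-12-4567"]:
    assert not re.fullmatch(PHONE, s)
print("pattern checks pass")
```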

CFG Grammars

For custom structured languages: a restricted SQL subset, queries in your own DSL.

from outlines import generate

grammar = """
?start: expression
?expression: NUMBER | expression OP expression | "(" expression ")"
OP: "+" | "-" | "*" | "/"
%import common.NUMBER
"""
calc_gen = generate.cfg(model, grammar)

Useful when you need the LLM to produce output processable by another system.
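What "valid syntax guaranteed" means here: every generated string is a sentence of the grammar. A hand-rolled acceptance check for the arithmetic grammar above, in pure Python (a toy stand-in for the Lark-based parsing Outlines does internally; it recognizes the same language via `term (OP term)*`):

```python
import re

def accepts(s: str) -> bool:
    """True iff s is a sentence of the arithmetic grammar above."""
    # Tokenize; any character that is not a number, operator, or paren is invalid
    toks = re.findall(r"\d+(?:\.\d+)?|[-+*/()]|\S", s)
    if any(not re.fullmatch(r"\d+(?:\.\d+)?|[-+*/()]", t) for t in toks):
        return False
    pos = 0

    def term() -> bool:        # term: NUMBER | "(" expression ")"
        nonlocal pos
        if pos < len(toks) and toks[pos] == "(":
            pos += 1
            if not expr() or pos >= len(toks) or toks[pos] != ")":
                return False
            pos += 1
            return True
        if pos < len(toks) and re.fullmatch(r"\d+(?:\.\d+)?", toks[pos]):
            pos += 1
            return True
        return False

    def expr() -> bool:        # expression: term (OP term)*
        nonlocal pos
        if not term():
            return False
        while pos < len(toks) and toks[pos] in "+-*/":
            pos += 1
            if not term():
                return False
        return True

    return expr() and pos == len(toks)

print(accepts("1 + 2 * (3 - 4)"))   # True
print(accepts("1 + * 2"))           # False
```

With constrained decoding, strings rejected by `accepts` simply cannot be generated: the mask never offers a token that would leave the parser in a dead state.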

Trade-offs

Advantages:

  • Absolute format guarantee.
  • Less application-side post-processing.
  • Reduces format hallucination.
  • Works with small models (sometimes matching what careful prompting achieves with much larger ones).

Disadvantages:

  • Compute overhead: the mask must be computed at every step, typically 10-30% slower.
  • Integration: adding it to an existing stack takes work.
  • Doesn’t help semantic quality: JSON will be valid but content may be bad.

When It’s Worth It

Cases where constrained decoding clearly wins:

  • Function calling / tool use: guarantee valid argument JSON.
  • Massive structured extraction: batch of thousands of docs to DB.
  • Custom-DSL agents: guarantee valid syntax.
  • Data generation: synthetic data with fixed schema.

When not:

  • Conversational chat: prompting suffices.
  • Cases where a large model with detailed prompts already succeeds >99% of the time.

Integration with vLLM and TGI

These runtimes support constrained decoding natively:

  • vLLM integrates Outlines from v0.4+.
  • TGI supports grammar-guided generation.
  • llama.cpp has grammar mode (--grammar).

Self-hosting with constrained decoding no longer requires your own pipeline.

Real Examples

Patterns we've seen in practice:

  • Invoice data extraction: JSON schema with items, totals.
  • SQL generation constrained to your database's schema.
  • Agent tool selection: {"tool": "search", "args": {...}}.
  • Structured classification: {"category": "X", "confidence": 0.85}.
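For the tool-selection pattern, remember that constrained decoding guarantees the shape, not the semantics: you still validate tool names and argument contents yourself. A hedged sketch (the tool registry and function name are illustrative):

```python
import json

ALLOWED_TOOLS = {"search", "calculator"}   # illustrative tool registry

def validate_tool_call(raw: str) -> dict:
    """Schema-constrained decoding guarantees `raw` parses as JSON with the
    right keys; semantic checks on the values are still our job."""
    call = json.loads(raw)
    if call["tool"] not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {call['tool']}")
    if not isinstance(call["args"], dict):
        raise ValueError("args must be an object")
    return call

print(validate_tool_call('{"tool": "search", "args": {"query": "weather Madrid"}}'))
```

In practice the allowed tool names can themselves be baked into the schema (an enum on the `"tool"` field), moving even this check into the decoder.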

Conclusion

Constrained decoding is an underused tool in the LLM ecosystem. For any case that needs guaranteed valid output, it is worth it. Outlines is the most mature open-source option; OpenAI Structured Outputs covers the SaaS case. The 10-30% slowdown is an acceptable trade-off for a robustness guarantee. Adopting it significantly reduces validation and retry code in your application: better to make the output correct at decoding time than to patch it post-hoc.

Follow us on jacar.es for more on LLMs, structured outputs, and advanced decoding.
