Asking an LLM to return valid JSON can fail: the model forgets closing braces, omits commas, or adds comments. Constrained decoding solves this by changing how tokens are generated: at each step, only tokens compatible with the target grammar are allowed. The result is a mathematical guarantee that the output matches the format.
Outlines, Guidance and jsonformer are the main libraries. This article covers how they work, when they beat OpenAI’s json_mode, and how to integrate them.
The Problem
Prompt: “Respond with JSON: {name, age}”.
The model may:
- Forget a comma.
- Add a `` ```json `` fence at the start.
- Add a "here you go:" preamble.
- Omit the closing brace.
After enough rounds of careful prompting that still fall short of 100% reliability, you start looking for real robustness.
How Constrained Decoding Works
At each generation step:
- The LLM produces a probability distribution over the vocabulary (~50k tokens).
- A mask derived from the grammar sets invalid tokens' probability to 0.
- Sampling (or argmax) happens only over the valid tokens.
Result: the output respects the grammar perfectly.
It can be applied with JSON Schema, regular expressions, or context-free grammars (CFGs).
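The masking step can be sketched in a few lines of plain Python. A toy vocabulary and hand-picked logits stand in for a real model step, and `mask_and_normalize` is an illustrative helper, not any library's API:

```python
import math

# Toy vocabulary and unnormalized logits, standing in for one decoding step.
vocab = ["{", "}", '"name"', ":", ",", "hello"]
logits = [1.2, 0.3, 2.5, 0.8, 0.1, 3.0]

def mask_and_normalize(logits, allowed):
    """Zero out grammar-forbidden tokens, then renormalize the rest."""
    exps = [math.exp(l) if tok in allowed else 0.0
            for tok, l in zip(vocab, logits)]
    total = sum(exps)
    return [e / total for e in exps]

# Suppose the grammar says a JSON object must start with "{":
probs = mask_and_normalize(logits, allowed={"{"})
# Every forbidden token now has probability 0; sampling cannot break the format.
```

In real libraries the mask comes from a compiled automaton over the schema or regex, which is what keeps the per-token overhead manageable.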
Outlines: The Popular One
A Python library that works with many backends (HF Transformers, llama.cpp, vLLM):
```python
from pydantic import BaseModel
from outlines import models, generate

model = models.transformers("meta-llama/Llama-3-8B-Instruct")

# JSON constrained by a Pydantic model
class User(BaseModel):
    name: str
    age: int

generator = generate.json(model, User)
user = generator("Extract user: Ana is 25 years old")
print(user)  # User(name='Ana', age=25)
```
Supports:
- Pydantic models for typed structure.
- Direct JSON Schema.
- Regex for specific formats.
- CFG grammars for custom DSLs.
Guidance: More General
Microsoft’s Guidance allows more complex templates:
```python
from guidance import models, gen, select

llama = models.LlamaCpp("path/to/model.gguf")
lm = llama + "Favorite colour is " + select(['red', 'blue', 'green']) + "."
```
Allows mixing fixed text with constrained-generated regions, ideal for complex prompts.
jsonformer: Simple and Focused
Only JSON, but very simple:
```python
from jsonformer import Jsonformer

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
}

# model and tokenizer are HF Transformers objects; prompt is your input text
jsonformer = Jsonformer(model, tokenizer, schema, prompt)
result = jsonformer()
```
Less flexible but easier to start.
vs OpenAI json_mode
OpenAI added `response_format={"type": "json_object"}` and later Structured Outputs with JSON Schema. Comparison:
| Aspect | Local Outlines | OpenAI Structured Outputs |
|---|---|---|
| Grammar guarantee | 100% | 100% (schema) |
| Models | Open (Llama, Mistral) | GPT-4o+ |
| Cost | Local (GPU) | API pricing |
| Privacy | Local | OpenAI retention |
| Latency | Variable GPU | API-dependent |
If you are already on OpenAI's API, Structured Outputs is fine. For self-hosted models, use Outlines or Guidance.
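For comparison, the request body on the OpenAI side looks roughly like this. The field names follow the public Structured Outputs API; the model name and the schema itself are illustrative:

```python
import json

# Sketch of a Structured Outputs request body; you would POST this
# (with auth headers) to /v1/chat/completions.
payload = {
    "model": "gpt-4o-2024-08-06",
    "messages": [
        {"role": "user", "content": "Extract user: Ana is 25 years old"}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "user",
            "strict": True,  # strict mode enforces the schema exactly
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
}
body = json.dumps(payload)
```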
Regex Mode for Specific Formats
Outlines regex:
```python
from outlines import generate

phone_gen = generate.regex(
    model,
    r"[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}"
)
phone_gen("Call me at ")  # "555-123-4567"
```
Useful for:
- Dates in specific format.
- Serial numbers.
- Codes (SKU, references).
- Domain identifiers.
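For dates, for instance, you can check a pattern with the standard library before handing the same string to `generate.regex`. The pattern below is an illustrative ISO-date regex, not something shipped by any library:

```python
import re

# ISO date (YYYY-MM-DD) with basic month/day range checks.
DATE_RE = r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"

assert re.fullmatch(DATE_RE, "2024-03-15")
assert not re.fullmatch(DATE_RE, "2024-13-01")  # month 13 rejected
```

Testing the pattern in isolation like this is cheap insurance: whatever the regex accepts is exactly what the constrained model will be able to emit.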
CFG Grammars
For custom structured languages. Example: limited SQL, your DSL queries.
```python
from outlines import generate

# Lark-style grammar for simple arithmetic expressions
grammar = """
?start: expression
?expression: NUMBER | expression OP expression | "(" expression ")"
OP: "+" | "-" | "*" | "/"
%import common.NUMBER
"""

calc_gen = generate.cfg(model, grammar)
```
Useful when you need the LLM to produce output processable by another system.
Trade-offs
Advantages:
- Absolute format guarantee.
- Less application-side post-processing.
- Reduces format hallucination.
- Works with small models (a constrained small model can beat a prompted large one at format compliance).
Disadvantages:
- Compute overhead: every token needs masking, typically 10-30% slower.
- Integration: adding to existing stack requires work.
- Doesn’t help semantic quality: JSON will be valid but content may be bad.
When It’s Worth It
Cases where constrained decoding clearly wins:
- Function calling / tool use: guarantee valid argument JSON.
- Massive structured extraction: batch of thousands of docs to DB.
- Custom-DSL agents: guarantee valid syntax.
- Data generation: synthetic data with fixed schema.
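For the function-calling case, a minimal dispatch sketch. The tool names and output are hypothetical; the point is that with constrained decoding the `json.loads` step cannot fail, so only semantic checks remain:

```python
import json

# Tools the agent is allowed to call (illustrative registry).
REGISTERED_TOOLS = {"search", "calculator"}

def dispatch(raw: str):
    """Parse a tool-call JSON like {"tool": ..., "args": {...}} and route it."""
    call = json.loads(raw)  # guaranteed to parse under constrained decoding
    if call["tool"] not in REGISTERED_TOOLS:
        raise ValueError(f"unknown tool: {call['tool']}")
    return call["tool"], call["args"]

tool, args = dispatch('{"tool": "search", "args": {"query": "weather Madrid"}}')
```

Constraining the `tool` field to an enum of registered names in the schema removes even the `ValueError` branch from the model-facing path.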
When not:
- Conversational chat: prompting suffices.
- Cases where large model with detailed prompts already works >99%.
Integration with vLLM and TGI
These runtimes support constrained decoding natively:
- vLLM integrates Outlines from v0.4+.
- TGI has the `GuidanceGrammar` feature.
- llama.cpp has a grammar mode (`--grammar`).
Self-hosting with constrained decoding no longer requires your own pipeline.
Real Examples
Patterns we've seen:
- Invoice data extraction: JSON schema with line items and totals.
- SQL generation bounded to your database.
- Agent tool selection: `{"tool": "search", "args": {...}}`.
- Structured classification: `{"category": "X", "confidence": 0.85}`.
Conclusion
Constrained decoding is an underused tool in the LLM ecosystem. For any use case that needs guaranteed valid output, it is worth it. Outlines is the most mature open-source option; OpenAI Structured Outputs covers the SaaS side. The 10-30% slowdown is an acceptable trade-off for the robustness guarantee. Adopting it significantly reduces validation and retry code in your application: better to make it correct at decoding time than to patch it post hoc.
Follow us on jacar.es for more on LLMs, structured outputs, and advanced decoding.