
What Is Constrained Decoding? Guide to Reliable Output

Constrained decoding is a technique that limits which tokens an LLM can generate at each step, enforcing strict adherence to a target grammar or format. Instead of letting the model choose any token in its vocabulary, the system masks out choices that would violate a formal constraint (a JSON schema, regex, or grammar), so the output is guaranteed to match the required format without repair-style post-processing: no parse errors and no fields outside the schema.

Prompt-only approaches (free-form text, or instructions that ask for JSON) often fail silently: a model might emit malformed JSON, skip required fields, or generate undefined enum values. Constrained decoding eliminates these failures by making violations impossible. At inference time, the decoder consults the active grammar state and allows only tokens that keep the output grammatically valid. This approach is especially useful for AI agents that must feed structured data to downstream APIs, and for code generators whose output cannot tolerate syntax errors.

The Problem: Why Unconstrained Generation Fails

When you ask an LLM to output JSON without constraints, several things can go wrong. The model might output valid JSON syntax but with unexpected field names or types—for instance, a response that includes user_id where id was mandatory. Or it might start outputting a JSON array but forget to close it properly. In production systems, even a 1–2% failure rate for structured output becomes catastrophic: a 1000-request batch loses 10–20 results, and retry loops add latency.
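The batch arithmetic above can be checked directly. The sketch below assumes failures are independent across requests, which the article does not state, and the function names are illustrative only.

```python
def expected_failures(n_requests: int, failure_rate: float) -> float:
    """Expected number of failed requests (assumes independent failures)."""
    return n_requests * failure_rate


def prob_at_least_one_failure(n_requests: int, failure_rate: float) -> float:
    """Probability that a batch contains at least one failed request."""
    return 1.0 - (1.0 - failure_rate) ** n_requests


for rate in (0.01, 0.02):
    lost = expected_failures(1000, rate)
    p_any = prob_at_least_one_failure(1000, rate)
    print(f"rate={rate:.0%}: about {lost:.0f} failures expected, "
          f"P(at least one failure)={p_any:.5f}")
```

Under that independence assumption, a 1000-request batch is almost certain to contain at least one failure at either rate, which is why a retry path is needed whenever output is unconstrained.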

A 2025 study by Bastings et al. showed that unconstrained generation produces invalid JSON approximately 2–5% of the time on typical API-schema tasks, even with instruction tuning. That same paper demonstrated constrained decoding reduced failures to zero by making malformed output impossible.

How Constrained Decoding Works: The Mechanism

The core idea is simple: instead of sampling the next token from the full vocabulary distribution, you compute a mask that marks invalid tokens and set their logits to negative infinity. Whatever token is then sampled cannot violate the constraint.

Here's the flow:

  1. Define the constraint. You specify a grammar (GBNF), JSON schema, or regex pattern.
  2. Tokenize the generated prefix. The decoder maintains the sequence of tokens emitted so far.
  3. Compute valid next tokens. Given the prefix and the constraint rule, which tokens are grammatically valid next?
  4. Mask the logits. The model outputs raw logit scores for all vocabulary tokens. Set logits for invalid tokens to −inf.
  5. Sample/generate. Apply softmax to the masked logits and either greedily pick the highest-probability valid token or sample from the valid ones.
  6. Repeat. Feed the new token back into step 2, continuing until the grammar is satisfied (e.g., object closing brace reached).
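The six steps can be sketched end to end with a toy setup. Everything here is an assumption made for illustration: the seven-token vocabulary, the `toy_logits` stand-in for a real model, and a constraint that simply lists the allowed outputs. A real decoder would track a grammar or automaton state instead of scanning strings.

```python
import math

VOCAB = ["app", "rove", "rej", "ect", "ion", "maybe", "<eos>"]
ALLOWED_OUTPUTS = {"approve", "reject"}  # step 1: the constraint (hypothetical)

# Fixed scores standing in for a model; they deliberately favour invalid text.
_SCORES = {"maybe": 3.0, "rej": 2.0, "ion": 1.5, "ect": 1.0,
           "app": 0.5, "rove": 0.2, "<eos>": 0.0}


def toy_logits(prefix: str) -> list[float]:
    """Stand-in for a model forward pass (ignores the prefix for simplicity)."""
    return [_SCORES[tok] for tok in VOCAB]


def is_valid_next(prefix: str, token: str) -> bool:
    """Step 3: a token is valid if the text stays a prefix of an allowed output."""
    if token == "<eos>":
        return prefix in ALLOWED_OUTPUTS
    candidate = prefix + token
    return any(out.startswith(candidate) for out in ALLOWED_OUTPUTS)


def constrained_greedy_decode(max_steps: int = 10) -> str:
    prefix = ""                                    # step 2: text emitted so far
    for _ in range(max_steps):
        logits = toy_logits(prefix)
        masked = [                                 # step 4: invalid tokens get -inf
            logit if is_valid_next(prefix, tok) else -math.inf
            for tok, logit in zip(VOCAB, logits)
        ]
        # Step 5: greedy pick. Softmax is monotonic, so argmax over masked
        # logits selects the same token as argmax over probabilities.
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        if VOCAB[best] == "<eos>":                 # step 6: constraint satisfied
            break
        prefix += VOCAB[best]
    return prefix


print(constrained_greedy_decode())  # prints "reject"
```

Without the mask, the highest-scoring token at the first step would be "maybe". With it, the decoder emits "rej", then "ect", then stops.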

Because invalid tokens have logits of negative infinity, they receive exactly zero probability after softmax. The model cannot generate them, however strongly it scored them before masking.
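A minimal sketch of the masking step, using only the standard library, shows the zero-probability effect. The three-token logits and the `valid` flags are made-up values, not output from any real model.

```python
import math


def masked_softmax(logits: list[float], valid: list[bool]) -> list[float]:
    """Softmax over logits after setting invalid entries to -inf."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid)]
    peak = max(masked)                 # subtract the max for numerical stability
    if peak == -math.inf:
        raise ValueError("constraint allows no token at this step")
    exps = [math.exp(x - peak) for x in masked]   # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]       # the model prefers token 0 ...
valid = [False, True, True]    # ... but the constraint forbids it
probs = masked_softmax(logits, valid)
print(probs)
```

Token 0 ends up with probability exactly 0.0, and the remaining probability mass is renormalized over the two valid tokens, which keep their original ranking.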

Constraint Types and Use Cases

Different constraints suit different problems:

JSON Schema Constraints: You specify a JSON Schema (object properties, required fields, types). The generator enforces valid JSON structure plus the declared field names and types. Best for: APIs, database records, form filling.

Regular Expression Constraints: Output must match a regex pattern (e.g., [A-Z][a-z]+ for names, \d{3}-\d{4} for phone numbers). Best for: text classification labels, data extraction, phone/email formats.

Formal Grammars (GBNF): You write rules like rule ::= "hello" " " name to define complex structures. Best for: code generation, query languages, domain-specific formats.

Finite-State Machines (FSM): States and transitions define valid sequences. Best for: workflow states, multi-step agent actions, state-machine dialogues.
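Regex and FSM constraints are closely related, since a regular expression can be compiled to a finite automaton. As a hand-written sketch, the phone pattern \d{3}-\d{4} from the list above can be encoded as a small state machine that reports which characters are allowed next. The state numbering and function names are illustrative, and digits are restricted to ASCII 0-9 as a simplifying assumption.

```python
import string

DIGITS = set(string.digits)   # assumption: ASCII digits only

# State i means "i characters of \d{3}-\d{4} have been matched so far".
# State 3 expects the hyphen; state 8 is the accepting state.
ACCEPT_STATE = 8


def allowed_chars(state: int) -> set[str]:
    """Characters that keep the output on a path to a full match."""
    if state == ACCEPT_STATE:
        return set()          # pattern complete: nothing more may be emitted
    if state == 3:
        return {"-"}
    return DIGITS


def accepts(text: str) -> bool:
    """Run the machine over `text` and report whether it ends in the accept state."""
    state = 0
    for ch in text:
        if ch not in allowed_chars(state):
            return False
        state += 1            # every valid character advances exactly one state
    return state == ACCEPT_STATE


print(sorted(allowed_chars(3)))                   # prints ['-']
print(accepts("555-0199"), accepts("5550199"))    # prints True False
```

A constrained decoder would call something like `allowed_chars` at every step and mask any token that falls outside the returned set.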

Constrained vs. Unconstrained: A Real Example

Consider a chatbot that must respond with a structured decision: approve or reject a loan application, plus a reason code.

Unconstrained prompt: "Decide: approve or reject. If rejected, give a reason code: INCOME_LOW, CREDIT_SCORE, DEBT_RATIO."

Expected output:
{
"decision": "reject",
"reason_code": "INCOME_LOW"
}

Actual outputs from unconstrained generation (real examples):
{
"decision": "rejection", // Typo: should be "reject"
"reason": "INCOME_LOW" // Wrong key: should be "reason_code"
}

{
"decision": "reject",
"reason_code": "customer income is too low" // Hallucinated value, not one of the 3
}

Constrained with JSON schema:

{
"type": "object",
"properties": {
"decision": {"enum": ["approve", "reject"]},
"reason_code": {"enum": ["INCOME_LOW", "CREDIT_SCORE", "DEBT_RATIO"]}
},
"required": ["decision", "reason_code"]
}

With this constraint, the decoder masks out every token that would break the schema, so only JSON punctuation, the declared property names, and the listed enum values can appear at each position. The output is guaranteed to match the schema: no typos, no missing keys, no undefined enums. If the model leans toward a value outside the enum, such as a free-text reason, the mask forces it to choose the most probable of the allowed values instead.
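To make the failure modes concrete, the sketch below checks the sample outputs from this section against the schema. It is a hand-rolled checker that understands only `required` and `enum`, written so the example needs nothing beyond the standard library. It is not a full JSON Schema validator, and the helper name `violations` is made up for this illustration. The `//` comments are dropped from the samples because JSON does not allow them.

```python
import json

SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"enum": ["approve", "reject"]},
        "reason_code": {"enum": ["INCOME_LOW", "CREDIT_SCORE", "DEBT_RATIO"]},
    },
    "required": ["decision", "reason_code"],
}


def violations(raw: str, schema: dict) -> list[str]:
    """Return readable schema violations (handles only `required` and `enum`)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key in schema["required"]:
        if key not in data:
            problems.append(f"missing required key {key!r}")
    for key, rule in schema["properties"].items():
        if key in data and data[key] not in rule["enum"]:
            problems.append(f"{key!r} has value {data[key]!r}, not in {rule['enum']}")
    return problems


expected = '{"decision": "reject", "reason_code": "INCOME_LOW"}'
wrong_key = '{"decision": "rejection", "reason": "INCOME_LOW"}'
free_text = '{"decision": "reject", "reason_code": "customer income is too low"}'

for sample in (expected, wrong_key, free_text):
    print(violations(sample, SCHEMA))
```

The expected output produces no violations, the first unconstrained sample produces two (a missing `reason_code` and a `decision` outside the enum), and the second produces one. The stray `reason` key is not reported because the schema as written does not set `additionalProperties` to false.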

Key Tradeoffs

Reliability vs. flexibility: Tight constraints (e.g., a 3-value enum) guarantee validity but force the model to pick from fixed options even if a custom reason would be more accurate. Loose constraints (a permissive regex, a large grammar) preserve flexibility, but they admit more outputs that satisfy the constraint and are still wrong for your application.

Speed vs. constraint strength: Complex grammars (e.g., full SQL) require expensive next-token-validation checks at each step, slowing generation. Simple constraints (JSON schema with short enums) add minimal overhead.
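One way to keep the per-step cost low for regular constraints is to compute the set of valid tokens once per automaton state and reuse it, rather than re-checking the whole vocabulary at every step. The sketch below illustrates the idea with a toy multi-character vocabulary and the \d{3}-\d{4} pattern. The vocabulary, state numbering, and function names are assumptions for illustration and are not taken from any particular library.

```python
from functools import lru_cache
from typing import Optional

VOCAB = ["5", "55", "555", "-", "-0", "0199", "01", "99", "abc", "5-"]
ACCEPT_STATE = 8   # states count matched characters of \d{3}-\d{4}


def step(state: int, char: str) -> Optional[int]:
    """One automaton transition; None means the character is rejected."""
    if state >= ACCEPT_STATE:
        return None
    if state == 3:
        return 4 if char == "-" else None
    return state + 1 if char in "0123456789" else None


def walk(state: Optional[int], token: str) -> Optional[int]:
    """Feed a multi-character token through the automaton."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state


@lru_cache(maxsize=None)
def valid_token_ids(state: int) -> frozenset[int]:
    """Token ids allowed in `state`; scanned once per state, then cached."""
    return frozenset(i for i, tok in enumerate(VOCAB) if walk(state, tok) is not None)


print(sorted(VOCAB[i] for i in valid_token_ids(3)))   # prints ['-', '-0']
valid_token_ids(3)                                    # second call hits the cache
print(valid_token_ids.cache_info().hits)              # prints 1
```

With this layout the vocabulary scan happens at most once per state, so the per-token cost during generation is a cache lookup. Grammars with recursion have unbounded state (a stack), which is one reason they are harder to precompute and tend to cost more per step.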

Expressiveness vs. usability: A more expressive grammar formalism can describe more output formats but becomes harder to write and debug. Context-free grammar notations such as GBNF balance power and readability.

Key Takeaways

  • Constrained decoding enforces hard structural guarantees by masking invalid tokens at inference time, eliminating parse failures.
  • The technique works by computing which tokens keep the output grammatically valid and setting invalid tokens' logits to negative infinity.
  • Common constraint types include JSON schemas, regexes, formal grammars (GBNF), and finite-state machines, each suited to different domains.
  • In the study cited above, unconstrained generation produced invalid JSON 2–5% of the time on API-schema tasks; constrained decoding brought that to zero by making violations impossible.
  • Trade-offs exist between constraint tightness (reliability) and flexibility, speed and complexity, and expressiveness and writability.

Frequently Asked Questions

Does constrained decoding change the model's behavior or reasoning?

Not directly. Constrained decoding filters the output space at the token level and leaves the weights untouched, so at each step the forward pass (attention, hidden states) runs as usual and only the final token probabilities are masked. The mask can still steer later text indirectly, because every token it forces becomes part of the prefix the model conditions on. Since nothing is retrained, the technique works across model architectures.

Will my output be shorter or different in meaning if I apply constraints?

Constraints can make output shorter if they restrict what may be emitted (e.g., a 3-value enum vs. free-form text). However, the model remains free to choose which valid token to generate; it's not forced to pick the first or shortest option. If your constraints are well-designed (they capture the semantic intent), meaning is preserved. Poorly designed constraints (e.g., a regex that disallows important characters) can distort intent.

What's the minimum overhead of constrained decoding?

For simple constraints (e.g., a small JSON schema with 10 possible values), overhead is typically 5–15% slower generation compared to unconstrained. For complex grammars (e.g., full SQL with hundreds of productions), overhead can be 2–5x slower due to per-token grammar checking. Most production systems find the trade-off worthwhile: slightly slower but zero failures beats fast and broken.

Can I use constrained decoding with any LLM?

Constrained decoding is applied at decoding time and requires access to the raw logits (the unnormalized score for each token) before sampling. Proprietary APIs (OpenAI, Anthropic) that return only sampled tokens or token probabilities may not support direct logit masking. Open-source models (Llama, Mistral) run through local inference frameworks (llama.cpp, vLLM) typically allow it. Some hosted APIs (Anthropic Claude, Mistral API) now support JSON schema constraints via built-in modes.

How do I choose between regex, JSON schema, and GBNF grammars?

Use regex for simple patterns (phone, email, codes); JSON schema for APIs/database records (standard, widely supported); GBNF for complex nested structures (code, queries, domain languages). Regex and JSON schemas are faster; GBNF is more expressive. Start with the simplest constraint that captures your requirements.

Further Reading