XGrammar and llama.cpp Integration: Fast Constraints
XGrammar is a high-performance framework for grammar-constrained text generation, designed for production deployments where inference speed matters. It optimizes constraint checking via compiled state machines and token filtering, reducing per-token overhead from 30–50% (naive logit masking) to 5–15%. Combined with llama.cpp (a C++ inference engine for quantized models on CPU and GPU), XGrammar enables fast, reliable structured generation without heavyweight server infrastructure.
This article covers deploying constrained generation in production: using XGrammar with llama.cpp, obtaining GGUF models, grammar file handling, and real-world performance optimization.
Background: Why XGrammar?
As models grow larger and deployments scale, constraint-checking overhead becomes critical. Naive logit masking tests every vocabulary token (100K+) at each step, adding latency. XGrammar solves this by:
- Pre-compiling grammars to optimized state machines (instead of on-the-fly checks).
- Vocabulary pruning: building an index of which tokens are reachable from each state, reducing checks from 100K+ to roughly 10–1,000 per step.
- Efficient state-machine simulation: cached token masks keep the common-case lookup close to O(1).
Result: constrained generation with 5–15% overhead instead of 30–50%, making it practical for real-time applications.
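The idea behind pre-compilation and vocabulary pruning can be sketched with a toy example. This is not XGrammar's implementation: the character-level state machine, the ten-token vocabulary, and the helper names (`walk`, `compile_masks`, `allowed_tokens`) are all hypothetical, chosen only to show how per-state token masks are computed once and then looked up at decode time.

```python
# Toy character-level state machine for the language {"yes", "no"}.
# TRANSITIONS[state][char] -> next state; state 4 is accepting.
TRANSITIONS = {
    0: {"y": 1, "n": 2},
    1: {"e": 3},
    2: {"o": 4},
    3: {"s": 4},
    4: {},  # nothing may follow a complete answer
}

# A tiny stand-in vocabulary; real tokenizers have 100K+ entries.
VOCAB = ["y", "ye", "yes", "es", "s", "n", "no", "o", "maybe", " "]


def walk(state, token):
    """Return the state reached after consuming token, or None if it is rejected."""
    for ch in token:
        state = TRANSITIONS[state].get(ch)
        if state is None:
            return None
    return state


def compile_masks(vocab):
    """Precompute, once, the set of allowed token ids for every state."""
    return {
        state: {i for i, tok in enumerate(vocab) if walk(state, tok) is not None}
        for state in TRANSITIONS
    }


MASKS = compile_masks(VOCAB)


def allowed_tokens(state):
    """Decode-time lookup: no per-token grammar checks, just a dictionary read."""
    return sorted(VOCAB[i] for i in MASKS[state])


print(allowed_tokens(0))
print(allowed_tokens(1))
```

The expensive loop over the vocabulary runs inside `compile_masks`, before generation starts; each decoding step then only reads `MASKS[state]`.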
Installation: XGrammar + llama.cpp
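Option 1: Build llama.cpp from source
llama.cpp enforces grammars with its own built-in GBNF engine, exposed through the --grammar and --grammar-file options; XGrammar is a separate library and is installed on its own (see Option 2). Clone and build llama.cpp: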
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CMake, enabling GPU (CUDA) acceleration
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 4
# Verify that the grammar options are available
./build/bin/llama-cli --help | grep -i grammar
Option 2: Python bindings (llama-cpp-python)
For Python integration:
pip install llama-cpp-python xgrammar
Verify:
import llama_cpp
import xgrammar as xg
print(f"llama-cpp-python: {llama_cpp.__version__}")
print(f"XGrammar: {xg.__version__}")
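If one of the packages is missing, the import-based check above stops at the first ImportError. A gentler variant, sketched here with the standard library only, reports each distribution separately; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("llama-cpp-python", "xgrammar"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```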
Getting GGUF Models
llama.cpp loads models in GGUF format, which are typically quantized to reduce memory use and speed up inference. Download one from HuggingFace:
# Example: Mistral 7B quantized
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf
# Or use a tool to download
# pip install huggingface-hub
# huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF Mistral-7B-Instruct-v0.2.Q4_K_M.gguf --local-dir ./models
Store the GGUF file in a models directory for reuse.
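A failed download sometimes leaves an HTML error page or a truncated file under the expected name. A quick sanity check is to read the file header: GGUF files begin with the four ASCII bytes `GGUF` followed by a little-endian 32-bit format version. The sketch below does only that; the function name `read_gguf_version` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..8 hold the format version as an unsigned little-endian integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version("models/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf")` before loading the model gives a clear error early instead of a loader failure later.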
Using XGrammar with llama.cpp Command Line
Basic JSON constraint:
# Define the grammar in a file (GBNF rules use ::= and dashed rule names)
cat > json.gbnf << 'EOF'
root ::= object
object ::= "{" ws "}" | "{" ws member ("," ws member)* ws "}"
member ::= string ws ":" ws json-value
json-value ::= object | array | string | number | "true" | "false" | "null"
array ::= "[" ws "]" | "[" ws json-value ("," ws json-value)* ws "]"
string ::= "\"" [^"]* "\""
number ::= "-"? [0-9]+ ("." [0-9]+)? ([eE] [+-]? [0-9]+)?
ws ::= [ \t\n]*
EOF
# Run inference with the grammar constraint (binary path assumes a CMake build in ./build)
./build/bin/llama-cli -m models/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf \
-p "Extract person info as JSON: Name is Alice, age 30. " \
--grammar-file json.gbnf \
-n 200 \
-t 4 \
--top-p 0.9 \
--temp 0.7
Example output (syntactically valid JSON, provided generation finishes within the -n token limit):
{"name": "Alice", "age": 30}
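A grammar constrains every token that is emitted, but it cannot force the model to finish: if the token limit is reached first, the output is a valid prefix of a JSON document rather than a complete one. A minimal sketch of a defensive parse, using only the standard library (the helper name `parse_json_output` is an assumption):

```python
import json


def parse_json_output(text):
    """Return the parsed object, or None if the text is not a complete JSON document."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


complete = '{"name": "Alice", "age": 30}'
truncated = '{"name": "Alice", "age"'  # what a too-small token limit can leave behind

print(parse_json_output(complete))
print(parse_json_output(truncated))
```

When the result is `None`, retrying with a larger token limit is usually the simplest remedy.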
SQL constraint:
cat > sql_select.gbnf << 'EOF'
root ::= select-stmt
select-stmt ::= "SELECT" ws columns ws "FROM" ws table-name
columns ::= identifier ("," ws? identifier)*
table-name ::= identifier
identifier ::= [a-zA-Z_] [a-zA-Z0-9_]*
ws ::= [ \t\n]+
EOF
./build/bin/llama-cli -m models/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf \
-p "Write a SQL query to fetch users: " \
--grammar-file sql_select.gbnf \
-n 100
Example output (guaranteed to match the SELECT grammar):
SELECT name, email FROM users
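Because this SELECT grammar is regular, it can be mirrored by a regular expression and used to test generated output independently of the model. The sketch below assumes mandatory whitespace between keywords and optional whitespace after commas; `matches_select` is a hypothetical helper name.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# Mirrors: "SELECT" ws columns ws "FROM" ws table-name
SELECT_RE = re.compile(
    rf"SELECT[ \t\n]+{IDENT}(?:,[ \t\n]*{IDENT})*[ \t\n]+FROM[ \t\n]+{IDENT}"
)


def matches_select(text):
    """True if text is exactly one SELECT statement of the constrained shape."""
    return SELECT_RE.fullmatch(text) is not None


print(matches_select("SELECT name, email FROM users"))
print(matches_select("SELECT name FROM users; DROP TABLE users"))
```

A check like this is also a cheap way to confirm that a grammar change has not widened what the model is allowed to produce.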
Python Integration with llama-cpp-python
For Python applications:
from llama_cpp import Llama, LlamaGrammar
# Load model
llm = Llama(
    model_path="models/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf",
    n_gpu_layers=32,  # Offload to GPU
    n_ctx=2048,  # Context window
    verbose=True
)
# Define grammar (GBNF rules use ::= and dashed rule names)
json_grammar = r"""
root ::= object
object ::= "{" ws "}" | "{" ws member ("," ws member)* ws "}"
member ::= string ws ":" ws json-value
json-value ::= object | array | string | number | "true" | "false" | "null"
array ::= "[" ws "]" | "[" ws json-value ("," ws json-value)* ws "]"
string ::= "\"" [^"]* "\""
number ::= "-"? [0-9]+ ("." [0-9]+)? ([eE] [+-]? [0-9]+)?
ws ::= [ \t\n]*
"""
# Compile the grammar; llama-cpp-python's grammar= parameter expects a LlamaGrammar
grammar = LlamaGrammar.from_string(json_grammar)
# Generate with constraint
response = llm(
    "Extract data as JSON: John, age 28, engineer.",
    grammar=grammar,
    max_tokens=200,
    temperature=0.7
)
print(response["choices"][0]["text"])
# Example output: {"name": "John", "age": 28, "job": "engineer"}
Loading Grammar from Files
For reusability, store grammars in files:
from llama_cpp import Llama, LlamaGrammar
llm = Llama(model_path="models/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf")
# Load and compile the grammar from a file
grammar = LlamaGrammar.from_file("grammars/person.gbnf")
response = llm(
    "Generate a person record: Alice, 30, engineer.",
    grammar=grammar,
    max_tokens=200
)
print(response["choices"][0]["text"])
File: grammars/person.gbnf
root ::= object
object ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"age\"" ws ":" ws number ws "," ws "\"role\"" ws ":" ws string ws "}"
string ::= "\"" [^"]* "\""
number ::= [0-9]+
ws ::= [ \t\n]*
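Fixed-key object grammars like this one are repetitive to write by hand, and the escaped quotes around each key are easy to get wrong. A small generator can produce them from a field list. This is a sketch under assumptions: `object_grammar` is a hypothetical helper, it supports only string and non-negative integer fields, and it assumes keys are plain identifiers that need no escaping.

```python
def object_grammar(fields):
    """Build a GBNF grammar for a JSON object with fixed keys in a fixed order.

    fields: list of (key, kind) pairs, where kind is "string" or "number".
    """
    parts = []
    for key, kind in fields:
        if kind not in ("string", "number"):
            raise ValueError(f"unsupported kind: {kind}")
        # The key must appear in the output with its quotes, so the GBNF
        # literal is "\"key\"" (escaped quotes inside a quoted literal).
        parts.append(f'"\\"{key}\\"" ws ":" ws {kind}')
    body = ' ws "," ws '.join(parts)
    return "\n".join([
        "root ::= object",
        f'object ::= "{{" ws {body} ws "}}"',
        r'string ::= "\"" [^"]* "\""',
        "number ::= [0-9]+",
        r"ws ::= [ \t\n]*",
    ])


print(object_grammar([("name", "string"), ("age", "number")]))
```

The returned text can be written to a .gbnf file or compiled directly from the string.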
Performance Tuning
1. Model Quantization
Different quantization levels trade output quality for smaller files and faster inference:
| Format | Size (relative to FP32) | Speed | Quality |
|---|---|---|---|
| FP32 | 100% | Slow | Best |
| FP16 | 50% | Medium | Excellent |
| Q8 | 25% | Fast | Excellent |
| Q5 | 16% | Faster | Very good |
| Q4 | 12% | Very fast | Good |
| Q3 | 10% | Fastest | Fair |
For production, Q4_K_M (a 4-bit K-quant, medium variant) is a good balance.
# Download Q4_K_M quantized model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf
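The size ratios in the table translate into rough file sizes once a parameter count is fixed. The sketch below applies them to a model with 7 billion parameters, assuming 4 bytes per FP32 weight. Treat the results as lower bounds: real GGUF files are somewhat larger, because quantized formats also store scaling factors and keep some tensors at higher precision.

```python
# Size ratios relative to FP32, taken from the quantization table.
RATIO_VS_FP32 = {"FP32": 1.00, "FP16": 0.50, "Q8": 0.25, "Q5": 0.16, "Q4": 0.12, "Q3": 0.10}
BYTES_PER_FP32_PARAM = 4


def estimated_size_gb(n_params, fmt):
    """Rough weight size in GB (10**9 bytes) for a model with n_params parameters."""
    return n_params * BYTES_PER_FP32_PARAM * RATIO_VS_FP32[fmt] / 1e9


for fmt in RATIO_VS_FP32:
    print(f"{fmt:>4}: {estimated_size_gb(7e9, fmt):5.2f} GB")
```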
2. Context and Token Limits
llm = Llama(
model_path="models/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf",
n_ctx=2048, # Context window (larger = more memory)
n_batch=512, # Batch size for processing
n_threads=4 # CPU threads
)
response = llm(
prompt,
max_tokens=150, # Limit output length
grammar=grammar
)
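The "larger = more memory" note on n_ctx can be made concrete: most of the extra memory is the key/value cache, which grows linearly with the context length. The sketch below estimates it under assumed model dimensions for Mistral 7B (32 layers, 8 key/value heads, head size 128) and 16-bit cache entries; the function name `kv_cache_bytes` is hypothetical, and actual usage depends on the runtime's cache type.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for keys and values across all layers (the leading 2 is K plus V)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


# Assumed Mistral 7B shape: 32 layers, 8 KV heads, head size 128.
for n_ctx in (2048, 8192):
    mib = kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 2**20
    print(f"n_ctx={n_ctx}: {mib:.0f} MiB")
```

Under these assumptions, quadrupling the context window from 2048 to 8192 quadruples the cache.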
3. Grammar Complexity
Simpler grammars compile and run faster:
Fast grammar (5 states):
root ::= "yes" | "no"
Slow grammar (50+ states, deep nesting):
root ::= nested-object (nested-object)*
nested-object ::= "{" object-content "}"
object-content ::= ...
4. GPU Offloading
llm = Llama(
model_path="models/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf",
n_gpu_layers=32, # Offload all layers to GPU (for 7B model)
# Use n_gpu_layers=-1 to offload everything
)
On NVIDIA GPUs, full offloading can speed up generation by 5–10x compared with CPU-only inference.
Real-World Example: Form Submission API
Build a service that validates form submissions against a schema:
from llama_cpp import Llama, LlamaGrammar
from pydantic import BaseModel
import json
class FormData(BaseModel):
    name: str
    email: str
    age: int
# Grammar for form output (keys carry escaped quotes so the result is valid JSON)
form_grammar = r"""
root ::= object
object ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"email\"" ws ":" ws string ws "," ws "\"age\"" ws ":" ws number ws "}"
string ::= "\"" [^"]* "\""
number ::= [0-9]+
ws ::= [ \t\n]*
"""
llm = Llama(model_path="models/Mistral-7B-Instruct-v0.2.Q4_K_M.gguf", n_gpu_layers=-1)
# llama-cpp-python's grammar= parameter expects a LlamaGrammar
grammar = LlamaGrammar.from_string(form_grammar)
def submit_form(user_input: str) -> FormData:
    """Extract form data from user input; the grammar fixes the JSON shape."""
    response = llm(
        f"Extract form data from: {user_input}",
        grammar=grammar,
        max_tokens=200,
        temperature=0.5
    )
    data_str = response["choices"][0]["text"]
    data_dict = json.loads(data_str)
    return FormData(**data_dict)
# Usage
result = submit_form("My name is Bob, email bob@example.com, I'm 35 years old.")
print(repr(result))
# Example output: FormData(name='Bob', email='bob@example.com', age=35)
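The grammar fixes the shape of the output, not its meaning: a string field constrained only by `[^"]*` can still hold text that is not an email address, and a digits-only number can still be an implausible age. A service would normally add a semantic check after parsing. The sketch below uses only the standard library so it runs without a model; the `Form` dataclass, the loose email pattern, and the age bounds are illustrative assumptions.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Form:
    name: str
    email: str
    age: int


EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # deliberately loose check


def parse_form(text):
    """Parse model output into a Form, raising ValueError on any problem."""
    try:
        data = json.loads(text)
        form = Form(name=str(data["name"]), email=str(data["email"]), age=int(data["age"]))
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"malformed form output: {exc}") from exc
    if not EMAIL_RE.fullmatch(form.email):
        raise ValueError(f"not an email address: {form.email!r}")
    if not 0 < form.age < 150:
        raise ValueError(f"implausible age: {form.age}")
    return form


print(parse_form('{"name": "Bob", "email": "bob@example.com", "age": 35}'))
```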
Debugging and Monitoring
Check grammar compilation:
import xgrammar as xg
grammar_text = 'root ::= "hello" | "hi"'
try:
    # Grammar.from_ebnf parses the rules and raises if the syntax is invalid
    grammar = xg.Grammar.from_ebnf(grammar_text)
    print("Grammar compiled successfully.")
except Exception as e:
    print(f"Grammar error: {e}")
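Two mistakes account for many grammar errors: a rule written with `:=` instead of `::=`, and a reference to a rule that is never defined. Both can be caught before any model is loaded. The following is a rough lint, not a GBNF parser: it assumes one rule per line, and `undefined_rules` is a hypothetical helper name.

```python
import re

# One pass over the rule body: string literals and character classes are
# consumed whole, so only bare rule names are captured in group 1.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\[(?:\\.|[^\]\\])*\]|([A-Za-z][A-Za-z0-9-]*)')


def undefined_rules(grammar_text):
    """Return rule names that are referenced but never defined (one rule per line)."""
    defined, referenced = set(), set()
    for line in grammar_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "::=" not in line:
            raise ValueError(f"expected '::=' in rule: {line!r}")
        name, body = line.split("::=", 1)
        defined.add(name.strip())
        referenced.update(m.group(1) for m in TOKEN.finditer(body) if m.group(1))
    return sorted(referenced - defined)


sample = r'''
root ::= object
object ::= "{" ws member ws "}"
string ::= "\"" [^"]* "\""
ws ::= [ \t\n]*
'''
print(undefined_rules(sample))
```

Here the lint reports `member`, which the `object` rule uses but the grammar never defines.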
Monitor generation speed:
import time
start = time.perf_counter()
response = llm(prompt, grammar=grammar, max_tokens=100)
elapsed = time.perf_counter() - start
# Use the reported token count; splitting the text on whitespace counts words, not tokens
tokens = response["usage"]["completion_tokens"]
tokens_per_sec = tokens / elapsed
print(f"Generated {tokens} tokens in {elapsed:.2f}s ({tokens_per_sec:.1f} tok/s)")
Expected throughput on GPU: 50–200 tokens/sec (depending on model size and quantization).
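To relate a throughput measurement to the overhead percentages quoted earlier, run the same prompt once with and once without the grammar and compare time per token. A small sketch of that arithmetic follows; the function name and the sample figures (100 and 90 tok/s) are illustrative assumptions, not measurements.

```python
def constraint_overhead_pct(baseline_tok_s, constrained_tok_s):
    """Extra time per token under the grammar, as a percentage of the baseline."""
    baseline_ms = 1000.0 / baseline_tok_s
    constrained_ms = 1000.0 / constrained_tok_s
    return 100.0 * (constrained_ms - baseline_ms) / baseline_ms


# Hypothetical figures: 100 tok/s unconstrained, 90 tok/s with the grammar.
print(f"{constraint_overhead_pct(100.0, 90.0):.1f}% overhead")
```

Note that a 10% drop in tokens per second corresponds to roughly 11% more time per token, so the two ways of quoting overhead are not interchangeable.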
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Grammar won't compile | Syntax error in GBNF (for example := instead of ::=) | Compile the grammar on its own with LlamaGrammar.from_string() or xg.Grammar.from_ebnf() and check the error message |
| Very slow generation | Complex grammar, large context | Simplify grammar, reduce context window, use GPU offloading |
| Model memory error | Model too large for GPU | Use smaller quantization (Q4, Q3) or CPU-only |
| Token masking is incorrect | Grammar bug or state machine error | Test grammar with simple inputs first |
| CUDA not detected | CUDA libraries not installed | Install the CUDA toolkit, then rebuild llama.cpp with -DGGML_CUDA=ON |
Key Takeaways
- XGrammar optimizes constraint checking via compiled state machines and vocabulary pruning, reducing overhead to 5–15%.
- llama.cpp runs GGUF models (quantized, efficient) on CPU and GPU and enforces grammars through its built-in GBNF support.
- Use --grammar-file in the CLI or the grammar= parameter in Python for constraint enforcement.
- Q4_K_M quantization balances speed and quality; GPU offloading can accelerate generation by 5–10x.
- Monitor tokens/sec and adjust batch size, context window, and GPU layers for performance.
Frequently Asked Questions
Can I use XGrammar with non-English languages?
Yes. XGrammar works with any tokenizer and language. However, grammar rules must account for the tokenizer's behavior (e.g., whitespace handling, character sequences). Test with your target language's tokenizer.
What's the maximum grammar complexity supported?
XGrammar handles most practical grammars (100–1000 states). Very large grammars (10,000+ states) may slow compilation and generation. For extreme complexity, break into multiple generations.
Can I update the grammar dynamically during generation?
Not directly. Grammar is fixed at the start of generation. To switch grammars, finish generation and start a new call with a different grammar.
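Switching grammars between calls can be organised as a small pipeline in which each stage has its own prompt and grammar, and later prompts reuse earlier outputs. The sketch below is model-agnostic: `generate` stands for any callable that takes a prompt and a grammar and returns text, and `fake_generate` is a stand-in so the example runs without a model. All names here are hypothetical.

```python
def run_stages(generate, stages):
    """Run one constrained generation per stage and collect the outputs.

    generate: any callable (prompt, grammar) -> str, e.g. a thin wrapper around an LLM call.
    stages: list of (name, prompt_template, grammar) tuples; a template may refer to
            an earlier output by stage name, as in "{decision}".
    """
    outputs = {}
    for name, template, grammar in stages:
        prompt = template.format(**outputs)
        outputs[name] = generate(prompt, grammar)
    return outputs


# Stand-in generator so the sketch runs without a model (hypothetical behaviour).
def fake_generate(prompt, grammar):
    return "yes" if grammar == "yes-no" else '{"reason": "stub"}'


stages = [
    ("decision", "Is the ticket urgent? Answer yes or no.", "yes-no"),
    ("details", "The answer was {decision}. Explain as JSON.", "json"),
]
print(run_stages(fake_generate, stages))
```

Because the templates go through `str.format`, literal braces in a prompt would need to be doubled.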
How does quantization affect constrained generation accuracy?
Quantization reduces model precision but doesn't affect grammar correctness. The grammar still guarantees valid syntax. Model behavior (reasoning, accuracy) may slightly degrade with aggressive quantization (Q3), but Q4–Q5 have minimal impact.
Can I run XGrammar on CPU only?
Yes. llama.cpp runs on CPU; omit n_gpu_layers. Performance will be slower (5–20 tok/s on CPU vs. 50–200 tok/s on GPU), but suitable for non-real-time applications.
Further Reading
- XGrammar GitHub Repository — Source code and detailed docs.
- llama.cpp Grammar Support — GBNF spec and examples.
- GGUF Model Format — How GGUF quantization works.
- Constrained Decoding Performance Benchmarks — XGrammar performance vs. alternatives.