
Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation

Main: 9 pages, 7 figures, 3 tables; Bibliography: 2 pages; Appendix: 4 pages
Abstract

Many LLM applications demand efficient structured generation, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), incurring runtime execution overhead for context-dependent token processing, which is especially inefficient under large inference batches. To address these issues, we propose Pre$^3$, which exploits deterministic pushdown automata (DPDA) to optimize constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into a DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at this https URL.
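To illustrate the idea the abstract describes, the following is a minimal sketch (not the paper's implementation) of a deterministic pushdown automaton for a toy grammar of nested bracket lists such as `[[],[]]`. The grammar, state names, and the `transitions`/`allowed` tables are hypothetical; the point is that every transition is keyed deterministically on `(state, stack_top, token)`, so the set of legal next tokens, the mask applied to LLM logits in constrained decoding, is a precomputed table lookup rather than a runtime path exploration.

```python
BOTTOM = "$"  # stack-bottom marker (hypothetical encoding for this sketch)

# transitions[(state, stack_top)][token] = (next_state, stack_action)
# stack_action is "push", "pop", or "noop" -- deterministic: at most one
# outcome per (state, stack_top, token), which is what makes this a DPDA.
transitions = {
    ("start", BOTTOM): {"[": ("open", "push")},
    ("open", "["): {"[": ("open", "push"),
                    "]": ("close", "pop")},
    ("close", "["): {"]": ("close", "pop"),
                     ",": ("open", "noop")},
    ("close", BOTTOM): {},  # accepting configuration: stack emptied
}

# Precomputed per-configuration token sets: at decode time, masking the
# LLM's logits to legal tokens is a single dictionary lookup.
allowed = {key: set(row) for key, row in transitions.items()}

def step(state, stack, token):
    """Advance the DPDA one token; no backtracking or path search needed."""
    next_state, action = transitions[(state, stack[-1])][token]
    if action == "push":
        stack.append(token)
    elif action == "pop":
        stack.pop()
    return next_state

def accepts(tokens):
    """Check whether a token sequence is a complete, valid bracket list."""
    state, stack = "start", [BOTTOM]
    for tok in tokens:
        if tok not in allowed.get((state, stack[-1]), ()):
            return False  # in decoding, such tokens would be masked out
        state = step(state, stack, tok)
    return state == "close" and stack == [BOTTOM]
```

A nondeterministic PDA built directly from an LR(1) grammar may have several candidate transitions to try per token; collapsing them ahead of time into a single table like the one above is what removes the per-token overhead.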

@article{chen2025_2506.03887,
  title={Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation},
  author={Junyi Chen and Shihao Bai and Zaijun Wang and Siyu Wu and Chuheng Du and Hailong Yang and Ruihao Gong and Shengzhong Liu and Fan Wu and Guihai Chen},
  journal={arXiv preprint arXiv:2506.03887},
  year={2025}
}