
S⁴C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models

Main: 8 pages, 7 figures, 2 tables; bibliography: 2 pages
Abstract

Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S⁴C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S⁴C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S⁴C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.
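To make the draft-then-verify idea in the abstract concrete, below is a minimal sketch of generic speculative sampling (a greedy-acceptance variant), not the S⁴C method itself. The callables `draft_model` and `target_model` are hypothetical: each is assumed to take a token sequence and return greedy next-token predictions, with the target model returning one prediction per drafted position from a single parallel pass.

```python
def speculative_decode_step(target_model, draft_model, tokens, k=4):
    """One step of generic speculative sampling (greedy-acceptance sketch).

    Drafting phase: the cheap draft model proposes k tokens autoregressively.
    Validation phase: one parallel pass of the large target model scores the
    drafted positions; tokens are accepted until the first disagreement.
    """
    # Drafting phase: propose k tokens with the small model.
    draft = []
    context = list(tokens)
    for _ in range(k):
        nxt = draft_model(context)      # hypothetical: greedy next token given context
        draft.append(nxt)
        context.append(nxt)

    # Validation phase: hypothetical API; preds[i] is the target model's greedy
    # token conditioned on tokens + draft[:i], all computed in one forward pass.
    preds = target_model(tokens + draft)

    accepted = []
    for i, d in enumerate(draft):
        if preds[i] == d:
            accepted.append(d)          # target agrees with the drafted token
        else:
            accepted.append(preds[i])   # take the target's token and stop verifying
            break
    else:
        accepted.append(preds[len(draft)])  # all drafts accepted: keep the bonus token
    return accepted
```

In this sketch each step emits at least one target-model token and up to k+1 tokens per large-model pass; S⁴C's multi-head drafting and continuous verification tree are refinements of this basic loop described in the paper.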

@article{he2025_2506.14158,
  title={S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models},
  author={Tao He and Guang Huang and Yu Yang and Tianshi Xu and Sicheng Zhao and Guiguang Ding and Pengyang Wang and Feng Tian},
  journal={arXiv preprint arXiv:2506.14158},
  year={2025}
}