G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

11 March 2026

Jing Peng

Ziyi Chen

Haoyu Li

Yucheng Wang

Duo Ma

Mengtian Li

Yunfan Du

Dezhu Xu

Kai Yu

Shuai Wang

BDL


Main: 4 Pages

1 Figure

Bibliography: 1 Page

3 Tables

Abstract

We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing timestamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritize either local diarization or global labeling, and often fail to capture fine-grained temporal boundaries or to link speaker identities robustly across chunks. We propose G-STAR, an end-to-end system that couples a time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports both component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Experiments analyze cue fusion, local versus long-context trade-offs, and hierarchical objectives.
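To make the cross-chunk identity problem concrete, the following is a minimal sketch of how chunk-local speaker segments might be mapped to meeting-level labels. It is not the G-STAR tracker: the abstract does not specify the linking mechanism, so the embedding-centroid matching, the `GlobalSpeakerRegistry` and `link_chunk` names, the 0.7 cosine threshold, and the segment tuple layout are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class GlobalSpeakerRegistry:
    """Hypothetical registry of meeting-level speakers (illustrative only)."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed similarity cutoff
        self.centroids = []         # one running-mean embedding per global speaker
        self.counts = []            # number of segments merged into each centroid

    def assign(self, embedding):
        """Return a global speaker index, creating a new speaker if none matches."""
        best_id, best_sim = None, -1.0
        for gid, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim > best_sim:
                best_id, best_sim = gid, sim
        if best_id is not None and best_sim >= self.threshold:
            n = self.counts[best_id]
            # Update the running mean so the centroid tracks the speaker.
            self.centroids[best_id] = [
                (c * n + e) / (n + 1)
                for c, e in zip(self.centroids[best_id], embedding)
            ]
            self.counts[best_id] = n + 1
            return best_id
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


def link_chunk(registry, chunk_offset, segments):
    """Convert chunk-local segments to meeting-level, speaker-labeled segments.

    segments: list of (local_start, local_end, embedding, text) tuples,
    with times in seconds relative to the start of the chunk.
    """
    linked = []
    for start, end, embedding, text in segments:
        gid = registry.assign(embedding)
        linked.append((chunk_offset + start, chunk_offset + end, f"spk{gid}", text))
    return linked


registry = GlobalSpeakerRegistry(threshold=0.7)
chunk1 = link_chunk(registry, 0.0, [
    (0.0, 2.0, [1.0, 0.0], "hello"),
    (1.5, 3.0, [0.0, 1.0], "hi"),      # overlaps the first segment in time
])
chunk2 = link_chunk(registry, 30.0, [
    (0.5, 2.0, [0.1, 0.9], "again"),   # close to the second speaker's embedding
])
```

In this toy run the segment in the second chunk is linked to `spk1`, the label created in the first chunk, and its start time is shifted to 30.5 seconds on the meeting timeline. A per-segment greedy match like this can assign two overlapping speakers in one chunk to the same global label, which is one reason a learned, time-aware tracker is attractive.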
