Dual-Strategy-Enhanced ConBiMamba for Neural Speaker Diarization
- Mamba
Conformer and Mamba have achieved strong performance in speech modeling but face limitations in speaker diarization. Mamba is efficient but struggles with local details and nonlinear patterns. Conformer's self-attention incurs high memory overhead for long speech sequences and may cause instability in long-range dependency modeling. These limitations are critical for diarization, which requires both precise modeling of local variations and robust speaker consistency over extended spans. To address these challenges, we first apply ConBiMamba for speaker diarization. We follow the Pyannote pipeline and propose the Dual-Strategy-Enhanced ConBiMamba neural speaker diarization system. ConBiMamba integrates the strengths of Conformer and Mamba, where Conformer's convolutional and feed-forward structures are utilized to improve local feature extraction. By replacing Conformer's self-attention with ExtBiMamba, ConBiMamba efficiently handles long audio sequences while alleviating the high memory cost of self-attention. Furthermore, to address the problem of the higher DER around speaker change points, we introduce the Boundary-Enhanced Transition Loss to enhance the detection of speaker change points. We also propose Layer-wise Feature Aggregation to enhance the utilization of multi-layer representations. The system is evaluated on six diarization datasets and achieves state-of-the-art performance on four of them. The source code of our study is available atthis https URL.
View on arXiv