SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision

31 March 2026

Mingyeong Song

Seoyeon Ko

Junhyug Noh

MDE


Main: 4 Pages

1 Figure

Bibliography: 1 Page

3 Tables

Abstract

Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.
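
To make the idea of confidence-weighted waveform-domain fusion more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the mono signal equals the average of the left and right channels, uses an exponential weight on the mono reconstruction error, and omits the interaural phase consistency term entirely. The function names (`mono_confidence`, `fuse_crops`, `fuse_windows`) and the temperature `tau` are hypothetical, introduced only for illustration.

```python
import numpy as np

def mono_confidence(pred, mono, tau=0.01):
    """Confidence of a (2, T) L/R prediction.

    Assumption: mono == (L + R) / 2, so a good prediction should
    reproduce the input mono signal when its channels are averaged.
    """
    err = np.mean((pred.mean(axis=0) - mono) ** 2)
    return float(np.exp(-err / tau))

def fuse_crops(preds, mono, tau=0.01):
    """Stage 1 (assumed): fuse K crop-level predictions of one window.

    preds has shape (K, 2, T); returns the fused (2, T) waveform and
    a scalar confidence for the fused result.
    """
    w = np.array([mono_confidence(p, mono, tau) for p in preds])
    w_norm = w / (w.sum() + 1e-12)
    fused = np.tensordot(w_norm, preds, axes=1)  # weighted sum over K
    return fused, mono_confidence(fused, mono, tau)

def fuse_windows(windows, confs, hop, total_len):
    """Stage 2 (assumed): confidence-weighted overlap-add of windows.

    windows is a list of (2, T) arrays placed every `hop` samples.
    """
    out = np.zeros((2, total_len))
    norm = np.zeros(total_len)
    for i, (win, c) in enumerate(zip(windows, confs)):
        start = i * hop
        end = start + win.shape[1]
        out[:, start:end] += c * win
        norm[start:end] += c
    return out / np.maximum(norm, 1e-12)

# Toy usage: one mono-consistent crop and one inconsistent crop.
rng = np.random.default_rng(0)
mono = rng.standard_normal(8)
diff = rng.standard_normal(8)
good = np.stack([mono + diff, mono - diff])  # averages back to mono
bad = good + 0.5                             # violates mono consistency
fused, conf = fuse_crops(np.stack([good, bad]), mono)

# Two overlapping constant windows with equal confidence.
stitched = fuse_windows([np.ones((2, 4)), 3 * np.ones((2, 4))],
                        confs=[1.0, 1.0], hop=2, total_len=6)
```

In this toy setup the inconsistent crop receives a near-zero weight, so the fused output stays close to the mono-consistent prediction, and overlapping regions of equally confident windows are simply averaged.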
