Speech-based depression detection tools could aid early screening. Here, we propose an interpretable speech foundation model approach to enhance the clinical applicability of such tools. We introduce a speech-level Audio Spectrogram Transformer (AST) that detects depression from long-duration speech rather than short segments, along with a novel interpretation method that reveals prediction-relevant acoustic features for clinician review. Our experiments show that the proposed model outperforms a segment-level AST, highlighting the impact of segment-level labelling noise and the advantage of leveraging longer speech recordings for more reliable depression detection. Through interpretation, we observe that our model identifies reduced loudness and fundamental frequency (F0) as depression-relevant signals, consistent with documented clinical findings. This interpretability supports a responsible AI approach to speech-based depression detection, rendering such tools more clinically applicable.
@article{deng2025_2406.03138,
  title   = {An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech},
  author  = {Qingkun Deng and Saturnino Luz and Sofia de la Fuente Garcia},
  journal = {arXiv preprint arXiv:2406.03138},
  year    = {2025}
}