Multimodal Conversation Structure Understanding

Conversations are usually structured by roles -- who is speaking, who is being addressed, and who is listening -- and unfold in threads that break with shifts in the speaker floor or topical focus. While large language models (LLMs) have shown impressive capabilities in dialogue and reasoning, their ability to understand fine-grained conversational structure, especially in multimodal, multi-party settings, remains underexplored. To address this gap, we introduce a suite of tasks focused on conversational role attribution (speaker, addressees, side-participants) and conversation threading (utterance linking and clustering), drawing on conversation analysis and sociolinguistics. To support these tasks, we present a human-annotated dataset comprising 4,398 annotations for speakers and reply-to relationships, 5,755 for addressees, and 3,142 for side-participants.
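To make the two task families concrete, the sketch below shows one possible way to represent a single annotated utterance, with role fields (speaker, addressees, side-participants) and threading fields (a reply-to link and a thread cluster label). The class name, field names, and example values are illustrative assumptions for exposition, not the authors' released annotation format.

```python
# Minimal sketch of an annotation record for role attribution and threading.
# Schema and names are hypothetical; they mirror the roles and relations
# described in the abstract, not the dataset's actual file format.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UtteranceAnnotation:
    utterance_id: str                                            # unique ID for this utterance
    speaker: str                                                 # who is speaking
    addressees: List[str] = field(default_factory=list)          # who is being addressed
    side_participants: List[str] = field(default_factory=list)   # ratified listeners not directly addressed
    reply_to: Optional[str] = None                               # ID of the utterance this one responds to
    thread_id: Optional[str] = None                              # cluster label grouping utterances into a thread


# Hypothetical three-party exchange with one reply-to link in thread "t1".
u1 = UtteranceAnnotation("u1", speaker="Alice", addressees=["Bob"],
                         side_participants=["Carol"], thread_id="t1")
u2 = UtteranceAnnotation("u2", speaker="Bob", addressees=["Alice"],
                         side_participants=["Carol"], reply_to="u1", thread_id="t1")
```

Under this framing, role attribution amounts to predicting the speaker, addressee, and side-participant fields for each utterance, while threading amounts to predicting reply-to links and clustering utterances by thread.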
@article{chang2025_2505.17536,
  title   = {Multimodal Conversation Structure Understanding},
  author  = {Kent K. Chang and Mackenzie Hanh Cramer and Anna Ho and Ti Ti Nguyen and Yilin Yuan and David Bamman},
  journal = {arXiv preprint arXiv:2505.17536},
  year    = {2025}
}