OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

29 March 2025
Yuxuan Wang
Yueqian Wang
Bo Chen
Tong Wu
Dongyan Zhao
Zilong Zheng
    VLM
    MLLM
Abstract

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
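
To make the proactive-reasoning setting concrete, below is a minimal, hypothetical sketch of a streaming evaluation loop: a model consumes the video stream step by step and must decide on its own when to answer, and a sample only counts as correct if the answer matches the reference and is not emitted before the evidence appears. The names (`StreamingSample`, `model_step`, `trigger_step`) and the scoring rule are illustrative assumptions, not OmniMMI's actual protocol or metrics.

```python
# Hypothetical sketch of a proactive, streaming evaluation loop.
# All names and the scoring rule are illustrative, not the benchmark's protocol.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StreamingSample:
    frames: List[str]          # placeholder for per-timestep video/audio input
    question: str              # question posed while the stream plays
    trigger_step: int          # earliest step at which an answer is possible
    reference_answer: str


def evaluate_proactive(model_step: Callable[[str, str], Optional[str]],
                       samples: List[StreamingSample]) -> float:
    """model_step(question, frame) returns an answer string when the model
    decides to speak, or None to keep listening. Returns the fraction of
    samples answered correctly and not before the trigger step."""
    correct = 0
    for sample in samples:
        answered_at, answer = None, None
        for t, frame in enumerate(sample.frames):
            out = model_step(sample.question, frame)
            if out is not None:
                answered_at, answer = t, out
                break
        if (answered_at is not None
                and answered_at >= sample.trigger_step
                and answer.strip().lower() == sample.reference_answer.lower()):
            correct += 1
    return correct / max(len(samples), 1)


if __name__ == "__main__":
    # Toy model that stays silent until it "sees" the relevant frame.
    def toy_model(question: str, frame: str) -> Optional[str]:
        return "red" if "red_car" in frame else None

    demo = [StreamingSample(
        frames=["street", "pedestrian", "red_car", "crosswalk"],
        question="What color is the car that appears?",
        trigger_step=2,
        reference_answer="red")]
    print(f"proactive accuracy: {evaluate_proactive(toy_model, demo):.2f}")
```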

View on arXiv
@article{wang2025_2503.22952,
  title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts},
  author={Yuxuan Wang and Yueqian Wang and Bo Chen and Tong Wu and Dongyan Zhao and Zilong Zheng},
  journal={arXiv preprint arXiv:2503.22952},
  year={2025}
}