arXiv:2401.15071 (v2, latest)

From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

26 January 2024
Chaochao Lu, Chao Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqiu Chen, Ming Zhang, Qibing Ren, Si-Yin Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhen-fei Yin, Zhipin Wang
Abstract

Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses to multi-modal content. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectations of the broad public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper strives to enhance understanding of this gap through a qualitative study of the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, ultimately aiming to improve the transparency of MLLMs. We believe these properties are representative factors that define the reliability of MLLMs in supporting various downstream applications. Specifically, we evaluate the closed-source GPT-4 and Gemini as well as 6 open-source LLMs and MLLMs. Overall, we evaluate 230 manually designed cases, and the qualitative results are summarized into 12 scores (i.e., 4 modalities × 3 properties). In total, we uncover 14 empirical findings that are useful for understanding the capabilities and limitations of both proprietary and open-source MLLMs, towards more reliable downstream multi-modal applications.
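The 12-score scheme is simply the cross product of the four modalities and three evaluated properties. As a purely illustrative sketch (the names and data structure below are assumptions, not the authors' code), one score slot per (modality, property) pair could be laid out as:

```python
# Hypothetical illustration of the paper's 12-score grid (4 modalities x 3 properties).
# The identifiers here are assumptions for illustration only.
from itertools import product

MODALITIES = ["text", "code", "image", "video"]                     # 4 modalities
PROPERTIES = ["generalizability", "trustworthiness", "causality"]   # 3 properties

# One score slot per (modality, property) pair, per evaluated model.
score_grid = {pair: None for pair in product(MODALITIES, PROPERTIES)}
assert len(score_grid) == 12  # 4 x 3 = 12 summary scores
```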

View on arXiv