MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

17 February 2025
Yi-Fan Zhang
Tao Yu
Haochen Tian
Chaoyou Fu
Peiyan Li
Jianshu Zeng
Wulin Xie
Yang Shi
Huanyu Zhang
Junkang Wu
Xue Wang
Yibo Hu
Bin Wen
Fan Yang
Zhang Zhang
Tingting Gao
Di Zhang
Liang Wang
Rong Jin
Tieniu Tan
Abstract

Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: this https URL.
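
The abstract names two alignment components: a Critique-Based Reward Model, which writes a textual critique before assigning a score, and Dynamic Reward Scaling, which weights each training pair's loss by its reward signal. The Python/PyTorch sketch below is only an illustrative reading of the scaling idea, not the authors' implementation: the DPO-style preference loss, the sigmoid weighting, and every function and parameter name here are assumptions, with the critique step represented only by the reward-margin input assumed to come from a reward model.

# Minimal sketch (assumptions, not the paper's method) of weighting a
# preference loss by a per-sample reward signal, as in "Dynamic Reward Scaling".
import torch
import torch.nn.functional as F

def dynamic_reward_scaling_loss(
    policy_logratio_chosen: torch.Tensor,    # log pi(y_w|x) - log pi_ref(y_w|x)
    policy_logratio_rejected: torch.Tensor,  # log pi(y_l|x) - log pi_ref(y_l|x)
    reward_margin: torch.Tensor,             # reward-model score gap r(y_w) - r(y_l)
    beta: float = 0.1,
    w_min: float = 0.5,
    w_max: float = 2.0,
) -> torch.Tensor:
    """DPO-style preference loss with a hypothetical per-sample weight
    derived from the reward margin (larger margin -> larger weight)."""
    # Standard DPO objective on the chosen/rejected log-ratios.
    logits = beta * (policy_logratio_chosen - policy_logratio_rejected)
    per_sample_loss = -F.logsigmoid(logits)

    # Assumed weighting: squash the margin with a sigmoid, then rescale
    # into [w_min, w_max].
    weight = w_min + (w_max - w_min) * torch.sigmoid(reward_margin)
    return (weight.detach() * per_sample_loss).mean()

if __name__ == "__main__":
    # Toy batch of 4 preference pairs.
    chosen = torch.randn(4, requires_grad=True)
    rejected = torch.randn(4, requires_grad=True)
    margins = torch.tensor([0.2, 1.5, 3.0, 0.1])  # from a reward model
    loss = dynamic_reward_scaling_loss(chosen, rejected, margins)
    loss.backward()
    print(f"loss = {loss.item():.4f}")

Detaching the weight keeps the reward signal from being optimized directly; only each pair's contribution to the preference loss is rescaled, which is one way to prioritize high-quality comparisons as the abstract describes.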

View on arXiv
@article{zhang2025_2502.10391,
  title={MM-RLHF: The Next Step Forward in Multimodal LLM Alignment},
  author={Yi-Fan Zhang and Tao Yu and Haochen Tian and Chaoyou Fu and Peiyan Li and Jianshu Zeng and Wulin Xie and Yang Shi and Huanyu Zhang and Junkang Wu and Xue Wang and Yibo Hu and Bin Wen and Fan Yang and Zhang Zhang and Tingting Gao and Di Zhang and Liang Wang and Rong Jin and Tieniu Tan},
  journal={arXiv preprint arXiv:2502.10391},
  year={2025}
}