
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Papers citing "HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding"
50 / 99 papers shown
Title |
---|
Emu3: Next-Token Prediction is All You Need. Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan-Sen Sun, Yufeng Cui, ..., Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang |
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs. Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, ..., Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang |
Learning 1D Causal Visual Representation with De-focus Attention Networks. Chenxin Tao, Xizhou Zhu, Shiqian Su, Lewei Lu, Changyao Tian, ..., Gao Huang, Hongsheng Li, Ping Luo, Jie Zhou, Jifeng Dai |
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences. Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, ..., Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang |