Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2311.12793
Cited By
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
21 November 2023
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Conghui He
Jiaqi Wang
Feng Zhao
Dahua Lin
MLLM
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ShareGPT4V: Improving Large Multi-Modal Models with Better Captions"
50 / 472 papers shown
Title
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye
Haotian Zhang
Erik Daxberger
Lin Chen
Zongyu Lin
...
Haoxuan You
Dan Xu
Zhe Gan
Jiasen Lu
Yinfei Yang
EgoV
MLLM
88
12
0
09 Oct 2024
Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
Phu Pham
Phu Pham
Kun Wan
Yu-Jhe Li
Zeliang Zhang
Daniel Miranda
Ajinkya Kale
Ajinkya Kale
Chenliang Xu
35
5
0
08 Oct 2024
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
Yifei Xing
Xiangyuan Lan
Ruiping Wang
D. Jiang
Wenjun Huang
Qingfang Zheng
Yaowei Wang
Mamba
40
0
0
08 Oct 2024
R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?
Chunyi Li
J. Zhang
Zicheng Zhang
H. Wu
Yuan Tian
...
Guo Lu
Xiaohong Liu
Xiongkuo Min
Weisi Lin
Guangtao Zhai
AAML
57
3
0
07 Oct 2024
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
Wei Wu
Kecheng Zheng
Shuailei Ma
Fan Lu
Yuxin Guo
Yifei Zhang
Wei Chen
Qingpei Guo
Yujun Shen
Zheng-Jun Zha
VLM
38
9
0
07 Oct 2024
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Himanshu Gupta
Shreyas Verma
Ujjwala Anantheswaran
Kevin Scaria
Mihir Parmar
Swaroop Mishra
Chitta Baral
ReLM
LRM
32
7
0
06 Oct 2024
Gamified crowd-sourcing of high-quality data for visual fine-tuning
Shashank Yadav
Rohan Tomar
Garvit Jain
Chirag Ahooja
Shubham Chaudhary
Charles Elkan
33
0
0
05 Oct 2024
Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
Minheng Ni
Yutao Fan
Lei Zhang
Wangmeng Zuo
LRM
AI4CE
33
6
0
04 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
89
26
0
04 Oct 2024
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Zhengfeng Lai
Vasileios Saveris
Chen Chen
Hong-You Chen
Haotian Zhang
...
Wenze Hu
Zhe Gan
Peter Grasch
Meng Cao
Yinfei Yang
VLM
41
4
0
03 Oct 2024
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
Wanpeng Zhang
Zilong Xie
Yicheng Feng
Yijiang Li
Xingrun Xing
Sipeng Zheng
Zongqing Lu
MLLM
30
0
0
03 Oct 2024
FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models
Zhipei Xu
Xuanyu Zhang
Runyi Li
Zecheng Tang
Qing Huang
Jian Zhang
AAML
47
18
0
03 Oct 2024
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
Mengzhao Jia
Wenhao Yu
Kaixin Ma
Tianqing Fang
Zhihan Zhang
Siru Ouyang
Hongming Zhang
Meng Jiang
Dong Yu
VLM
42
5
0
02 Oct 2024
VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
Jiapeng Wang
Chengyu Wang
Kunzhe Huang
Jun Huang
Lianwen Jin
CLIP
VLM
45
3
0
01 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM
MLLM
42
33
1
30 Sep 2024
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering
Jiacong Wang
Bohong Wu
Haiyong Jiang
Xun Zhou
Xin Xiao
Haoyuan Guo
Jun Xiao
VLM
VGen
48
4
0
30 Sep 2024
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
Jihai Zhang
Xiaoye Qu
Tong Zhu
Yu Cheng
49
7
0
28 Sep 2024
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou
Tianze Luo
Guiyang Xie
Victor
Zhang
...
Guangcong Wang
Juanyang Chen
Zhuochen Wang
Hansheng Zhang
Huaijian Zhang
VLM
45
6
0
27 Sep 2024
Emu3: Next-Token Prediction is All You Need
Xinlong Wang
Xiaosong Zhang
Zhengxiong Luo
Quan-Sen Sun
Yufeng Cui
...
Xi Yang
Jingjing Liu
Yonghua Lin
Tiejun Huang
Zhongyuan Wang
MLLM
47
166
0
27 Sep 2024
FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation
Yuki Imajuku
Yoko Yamakata
Kiyoharu Aizawa
44
1
0
27 Sep 2024
Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Guokan Shang
Hadi Abdine
Yousef Khoubrane
Amr Mohamed
Yassine Abbahaddou
...
Xuguang Ren
Eric Moulines
Preslav Nakov
Michalis Vazirgiannis
Eric Xing
31
4
0
26 Sep 2024
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
Xun Zhu
Ying Hu
Fanbin Mo
Miao Li
Ji Wu
56
8
0
26 Sep 2024
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen
Yunhao Gou
Runhui Huang
Zhili Liu
Daxin Tan
...
Qun Liu
Jun Yao
Lu Hou
Hang Xu
Hang Xu
AuLLM
MLLM
VLM
82
23
0
26 Sep 2024
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Matt Deitke
Christopher Clark
Sangho Lee
Rohun Tripathi
Yue Yang
...
Noah A. Smith
Hannaneh Hajishirzi
Ross Girshick
Ali Farhadi
Aniruddha Kembhavi
OSLM
VLM
47
9
0
25 Sep 2024
SynChart: Synthesizing Charts from Language Models
Mengchen Liu
Qixiu Li
Dongdong Chen
Dong Chen
Jianmin Bao
Yunsheng Li
MLLM
28
0
0
25 Sep 2024
Phantom of Latent for Large Language and Vision Models
Byung-Kwan Lee
Sangyun Chung
Chae Won Kim
Beomchan Park
Yong Man Ro
VLM
LRM
49
6
0
23 Sep 2024
Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization
Minyi Zhao
Jie Wang
Zerui Li
Jiyuan Zhang
Zhenbang Sun
Shuigeng Zhou
MLLM
VLM
47
0
0
22 Sep 2024
SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
Jiashuo Sun
Jihai Zhang
Yucheng Zhou
Zhaochen Su
Xiaoye Qu
Yu Cheng
56
12
0
21 Sep 2024
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs
Jing Hao
Yuxiang Zhao
Song Chen
Yanpeng Sun
Qiang Chen
Gang Zhang
Kun Yao
Errui Ding
Jingdong Wang
VLM
VGen
MLLM
51
5
0
20 Sep 2024
OmniGen: Unified Image Generation
Shitao Xiao
Yueze Wang
Yueze Wang
Huaying Yuan
Xingrun Xing
Ruiran Yan
Shuting Wang
Tiejun Huang
Zheng Liu
DiffM
VLM
SyDa
64
66
0
17 Sep 2024
TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
Dawei Yan
Pengcheng Li
Yang Li
Hao Chen
Qingguo Chen
Weihua Luo
Wei Dong
Qingsen Yan
Haokui Zhang
Chunhua Shen
3DV
VLM
51
4
0
15 Sep 2024
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
Zhen Yang
Jinhao Chen
Zhengxiao Du
Wenmeng Yu
Weihan Wang
Wenyi Hong
Zhihuan Jiang
Bin Xu
Yuxiao Dong
Jie Tang
VLM
LRM
32
8
0
10 Sep 2024
POINTS: Improving Your Vision-language Model with Affordable Strategies
Yuan Liu
Zhongyin Zhao
Ziyuan Zhuang
Le Tian
Xiao Zhou
Jie Zhou
VLM
43
5
0
07 Sep 2024
UNIT: Unifying Image and Text Recognition in One Vision Encoder
Yi Zhu
Yanpeng Zhou
Chunwei Wang
Yang Cao
Jianhua Han
Lu Hou
Hang Xu
ViT
VLM
36
4
0
06 Sep 2024
Multi-modal Situated Reasoning in 3D Scenes
Xiongkun Linghu
Jiangyong Huang
Xuesong Niu
Xiaojian Ma
Baoxiong Jia
Siyuan Huang
48
12
0
04 Sep 2024
VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition
Zaiwei Zhang
Gregory P. Meyer
Zhichao Lu
Ashish Shrivastava
Avinash Ravichandran
Eric M. Wolff
VLM
51
2
0
29 Aug 2024
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Fangxun Shu
Yue Liao
Le Zhuo
Chenning Xu
Guanghao Zhang
...
Bolin Li
Zhelun Yu
Si Liu
Hongsheng Li
Hao Jiang
VLM
MoE
32
9
0
28 Aug 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Andrew Tao
Zhiding Yu
Guilin Liu
Guilin Liu
MLLM
40
55
0
28 Aug 2024
ParGo: Bridging Vision-Language with Partial and Global Views
An-Lan Wang
Bin Shan
Wei Shi
Kun-Yu Lin
Xiang Fei
Guozhi Tang
Lei Liao
Jingqun Tang
Can Huang
Wei-Shi Zheng
MLLM
VLM
90
15
0
23 Aug 2024
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Yi-Fan Zhang
Huanyu Zhang
Haochen Tian
Chaoyou Fu
Shuangqing Zhang
...
Qingsong Wen
Zhang Zhang
Liwen Wang
Rong Jin
Tieniu Tan
OffRL
69
36
0
23 Aug 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
46
63
0
22 Aug 2024
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie
Weijia Mao
Zechen Bai
David Junhao Zhang
Weihao Wang
Kevin Qinghong Lin
Yuchao Gu
Zhijie Chen
Zhenheng Yang
Mike Zheng Shou
59
166
0
22 Aug 2024
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
Yuanyang Yin
Yaqi Zhao
Yajie Zhang
Ke Lin
Jiahao Wang
Xin Tao
Pengfei Wan
Di Zhang
Baoqun Yin
Wentao Zhang
LRM
41
6
0
21 Aug 2024
EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model
Feipeng Ma
Yizhou Zhou
Hebei Li
Zilong He
Siying Wu
Fengyun Rao
Siying Wu
Fengyun Rao
Yueyi Zhang
Xiaoyan Sun
41
4
0
21 Aug 2024
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
Yassine Ouali
Adrian Bulat
Brais Martínez
Georgios Tzimiropoulos
VLM
MLLM
40
18
0
19 Aug 2024
FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant
Zhengchao Huang
Bin Xia
Zicheng Lin
Zhun Mou
Wenming Yang
CVBM
44
18
0
19 Aug 2024
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue
Manli Shu
Anas Awadalla
Jun Wang
An Yan
...
Zeyuan Chen
Silvio Savarese
Juan Carlos Niebles
Caiming Xiong
Ran Xu
VLM
44
92
0
16 Aug 2024
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
Wenwen Zhuang
Xin Huang
Xiantao Zhang
Jin Zeng
LRM
45
20
0
16 Aug 2024
MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark
Minxuan Zhou
Hao Liang
Tianpeng Li
Zhiyu Wu
Mingan Lin
...
Yujing Qiao
Weipeng Chen
Bin Cui
Wentao Zhang
Guosheng Dong
51
5
0
14 Aug 2024
LLMI3D: MLLM-based 3D Perception from a Single 2D Image
Fan Yang
Sicheng Zhao
Yanhao Zhang
Haoxiang Chen
Hui Chen
Wenbo Tang
Guiguang Ding
38
4
0
14 Aug 2024
Previous
1
2
3
4
5
6
...
8
9
10
Next