ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2311.12793
  4. Cited By
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

21 November 2023
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Conghui He
Jiaqi Wang
Feng Zhao
Dahua Lin
    MLLM
    VLM
ArXivPDFHTML

Papers citing "ShareGPT4V: Improving Large Multi-Modal Models with Better Captions"

50 / 471 papers shown
Title
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding
  with Superior Temporal Localization Ability
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
Shimin Chen
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
MLLM
87
13
0
27 Nov 2024
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang
Jingdi Lei
Junxian Li
Xunzhi Wang
Yong Liu
...
Steve Yang
Jianbo Wu
Peng Ye
Wanli Ouyang
Dongzhan Zhou
OffRL
LRM
107
6
0
27 Nov 2024
A Topic-level Self-Correctional Approach to Mitigate Hallucinations in
  MLLMs
A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs
Lehan He
Zeren Chen
Zhelun Shi
Tianyu Yu
Jing Shao
Lu Sheng
MLLM
113
1
0
26 Nov 2024
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image
  Synthesis
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
Boming Miao
C. Li
Xidong Wang
Andi Zhang
Rui Sun
Zizhe Wang
Yao Zhu
DiffM
81
0
0
25 Nov 2024
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities
  through Tree-Based Image Exploration
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Haozhan Shen
Kangjia Zhao
Tiancheng Zhao
Ruochen Xu
Zilun Zhang
Mingwei Zhu
Jianwei Yin
97
4
0
25 Nov 2024
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual
  Token Compression
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
Yuke Zhu
Chi Xie
Shuang Liang
Bo Zheng
Sheng Guo
83
9
0
21 Nov 2024
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large
  Language Models in Autonomous Driving
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
Xianda Guo
Ruijun Zhang
Yiqun Duan
Yuhang He
Chenming Zhang
Shuai Liu
Long Chen
LRM
91
12
0
20 Nov 2024
PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment
Zhendong Liu
Yuanbi Nie
Yingshui Tan
Xiangyu Yue
Qiushi Cui
Chongjun Wang
Xiaoyong Zhu
Jian Xu
Bo Zheng
78
0
0
18 Nov 2024
Mitigating Hallucination in Multimodal Large Language Model via
  Hallucination-targeted Direct Preference Optimization
Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization
Yuhan Fu
Ruobing Xie
Xingchen Sun
Zhanhui Kang
Xirong Li
MLLM
40
4
0
15 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment
  in Multi-Modal Models
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Zechao Li
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
31
2
0
14 Nov 2024
Multimodal Instruction Tuning with Hybrid State Space Models
Multimodal Instruction Tuning with Hybrid State Space Models
Jianing Zhou
Han Li
Shuai Zhang
Ning Xie
Ruijie Wang
Xiaohan Nie
Sheng Liu
Lingyun Wang
46
0
0
13 Nov 2024
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
Moran Yanuka
Assaf Ben-Kish
Yonatan Bitton
Idan Szpektor
Raja Giryes
VLM
47
2
0
13 Nov 2024
HumanVLM: Foundation for Human-Scene Vision-Language Model
HumanVLM: Foundation for Human-Scene Vision-Language Model
Dawei Dai
Xu Long
Li Yutang
Zhang YuanHui
Shuyin Xia
VLM
MLLM
39
1
0
05 Nov 2024
SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding
SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding
Jian Chen
R. Zhang
Yufan Zhou
Tong Yu
Franck Dernoncourt
J. Gu
Ryan Rossi
Changyou Chen
Tong Sun
39
0
0
02 Nov 2024
PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via
  Existing MLLM Structures
PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures
Tianxiang Wu
Minxin Nie
Ziqiang Cao
MLLM
42
0
0
30 Oct 2024
Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models
Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models
Arash Marioriyad
Parham Rezaei
M. Baghshah
M. Rohban
CoGe
228
0
0
30 Oct 2024
GiVE: Guiding Visual Encoder to Perceive Overlooked Information
GiVE: Guiding Visual Encoder to Perceive Overlooked Information
Junjie Li
Jianghong Ma
Xiaofeng Zhang
Yuhang Li
Jianyang Shi
48
0
0
26 Oct 2024
Rethinking Visual Dependency in Long-Context Reasoning for Large
  Vision-Language Models
Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
Yucheng Zhou
Zhi Rao
Jun Wan
Jianbing Shen
LRM
23
17
0
25 Oct 2024
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
Haocheng Xi
Han Cai
Ligeng Zhu
Yaojie Lu
Kurt Keutzer
Jianfei Chen
Song Han
MQ
75
9
0
25 Oct 2024
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Shuhao Gu
Jialing Zhang
Siyuan Zhou
Kevin Yu
Zhaohu Xing
...
Yufeng Cui
Xinlong Wang
Yaoqi Liu
Fangxiang Feng
Guang Liu
SyDa
VLM
MLLM
32
21
0
24 Oct 2024
WAFFLE: Multi-Modal Model for Automated Front-End Development
WAFFLE: Multi-Modal Model for Automated Front-End Development
Shanchao Liang
Nan Jiang
Shangshu Qian
Lin Tan
19
0
0
24 Oct 2024
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances
Shilin Lu
Zihan Zhou
Jiayou Lu
Yuanzhi Zhu
A. Kong
WIGM
97
11
0
24 Oct 2024
R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric
  Reasoning in Large Multimodal Models
R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models
Linger Deng
Yuliang Liu
Bohan Li
Dongliang Luo
Liang Wu
...
Ziyang Zhang
Gang Zhang
Errui Ding
Yingying Zhu
Xiang Bai
ReLM
3DV
LRM
34
10
0
23 Oct 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing
Qidong Huang
Xiaoyi Dong
Jiajie Lu
Pan Zhang
...
Yuhang Cao
Zeang Sheng
Jiaqi Wang
Feng Wu
Dahua Lin
VLM
53
29
0
22 Oct 2024
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
Shota Onohara
Atsuyuki Miyai
Yuki Imajuku
Kazuki Egashira
Jeonghun Baek
Xiang Yue
Graham Neubig
Kiyoharu Aizawa
OSLM
121
1
0
22 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5%
  Parameters and 90% Performance
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
74
25
0
21 Oct 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM
  Pretraining
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Han Huang
Yuqi Huo
Zijia Zhao
Haoyu Lu
Shu Wu
Bin Wang
Qiang Liu
Weipeng Chen
Liang Wang
VLM
35
1
0
21 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large
  Multimodal Models
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
43
1
0
21 Oct 2024
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension
Yin Xie
Kaicheng Yang
Ninghua Yang
Weimo Deng
Xiangzi Dai
...
Yumeng Wang
Xiang An
Yongle Zhao
Ziyong Feng
Jiankang Deng
MLLM
VLM
48
1
0
18 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li
Zhiqiu Lin
Wenxuan Peng
Jean de Dieu Nyandwi
Daniel Jiang
Zixian Ma
Simran Khanuja
Ranjay Krishna
Graham Neubig
Deva Ramanan
AAML
CoGe
VLM
74
22
0
18 Oct 2024
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
  and Generation
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu
Xiaokang Chen
Z. F. Wu
Yiyang Ma
Xingchao Liu
...
Wen Liu
Zhenda Xie
Xingkai Yu
Chong Ruan
Ping Luo
AI4TS
60
82
0
17 Oct 2024
Harnessing Webpage UIs for Text-Rich Visual Understanding
Harnessing Webpage UIs for Text-Rich Visual Understanding
Junpeng Liu
Tianyue Ou
Yifan Song
Yuxiao Qu
Wai Lam
Chenyan Xiong
Wenhu Chen
Graham Neubig
Xiang Yue
82
6
0
17 Oct 2024
Improving Multi-modal Large Language Model through Boosting Vision
  Capabilities
Improving Multi-modal Large Language Model through Boosting Vision Capabilities
Yanpeng Sun
Han Zhang
Qiang Chen
Xinyu Zhang
Nong Sang
Gang Zhang
Jingdong Wang
Zechao Li
34
5
0
17 Oct 2024
A Survey on Data Synthesis and Augmentation for Large Language Models
A Survey on Data Synthesis and Augmentation for Large Language Models
Ke Wang
Jiahui Zhu
Minjie Ren
Zichen Liu
Shiwei Li
...
Yiming Lei
Xiaoyu Wu
Qiqi Zhan
Qingjie Liu
Yunhong Wang
SyDa
53
18
0
16 Oct 2024
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for
  Embodied AI
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
Sijie Cheng
Kechen Fang
Yangyang Yu
Sicheng Zhou
Yangqiu Song
Ye Tian
Tingguang Li
Lei Han
Yang Liu
56
8
0
15 Oct 2024
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
Bin Shan
Xiang Fei
Wei Shi
An-Lan Wang
Guozhi Tang
Lei Liao
Jingqun Tang
Xiang Bai
Can Huang
VLM
35
6
0
15 Oct 2024
VidCompress: Memory-Enhanced Temporal Compression for Video
  Understanding in Large Language Models
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
54
2
0
15 Oct 2024
Improving Long-Text Alignment for Text-to-Image Diffusion Models
Improving Long-Text Alignment for Text-to-Image Diffusion Models
Luping Liu
Chao Du
Tianyu Pang
Zehan Wang
Chongxuan Li
Dong Xu
VLM
53
5
0
15 Oct 2024
Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection
Adapt-∞\infty∞: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection
A. Maharana
Jaehong Yoon
Tianlong Chen
Joey Tianyi Zhou
34
1
0
14 Oct 2024
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Peng Xia
Siwei Han
Shi Qiu
Yiyang Zhou
Zhaoyang Wang
...
Chenhang Cui
Mingyu Ding
Linjie Li
Lijuan Wang
Huaxiu Yao
67
10
0
14 Oct 2024
SynFER: Towards Boosting Facial Expression Recognition with Synthetic
  Data
SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data
Xilin He
Cheng Luo
Xiaole Xian
Bing Li
Siyang Song
Muhammad Haris Khan
Weicheng Xie
Linlin Shen
Zongyuan Ge
46
3
0
13 Oct 2024
TULIP: Token-length Upgraded CLIP
TULIP: Token-length Upgraded CLIP
Ivona Najdenkoska
Mohammad Mahdi Derakhshani
Yuki M. Asano
Nanne van Noord
Marcel Worring
Cees G. M. Snoek
VLM
53
3
0
13 Oct 2024
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language
  Models Alignment
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
Lei Li
Zhihui Xie
Mukai Li
Shunian Chen
Peiyi Wang
L. Chen
Yazheng Yang
Benyou Wang
Lingpeng Kong
Qiang Liu
VLM
ALM
38
17
0
12 Oct 2024
Unraveling and Mitigating Safety Alignment Degradation of
  Vision-Language Models
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Qin Liu
Chao Shang
Ling Liu
Nikolaos Pappas
Jie Ma
Neha Anna John
Srikanth Doss Kadarundalagi Raghuram Doss
Lluís Marquez
Miguel Ballesteros
Yassine Benajiba
47
4
0
11 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM
MLLM
73
26
0
10 Oct 2024
ElasticTok: Adaptive Tokenization for Image and Video
ElasticTok: Adaptive Tokenization for Image and Video
Wilson Yan
Matei A. Zaharia
Volodymyr Mnih
Pieter Abbeel
Aleksandra Faust
Hao Liu
VGen
54
6
0
10 Oct 2024
Deciphering Cross-Modal Alignment in Large Vision-Language Models with
  Modality Integration Rate
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Qidong Huang
Xiaoyi Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Jiaqi Wang
Dahua Lin
Weiming Zhang
Nenghai Yu
57
6
0
09 Oct 2024
From Pixels to Tokens: Revisiting Object Hallucinations in Large
  Vision-Language Models
From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
Yuying Shang
Xinyi Zeng
Yutao Zhu
Xiao Yang
Zhengwei Fang
Jingyuan Zhang
Jiawei Chen
Zinan Liu
Yu Tian
VLM
MLLM
200
1
0
09 Oct 2024
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric
  Understanding
HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding
Keliang Li
Zaifei Yang
Jiahe Zhao
Hongze Shen
Ruibing Hou
Hong Chang
Shiguang Shan
Xilin Chen
VLM
31
0
0
09 Oct 2024
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye
Haotian Zhang
Erik Daxberger
Lin Chen
Zongyu Lin
...
Haoxuan You
Dan Xu
Zhe Gan
Jiasen Lu
Yinfei Yang
EgoV
MLLM
88
12
0
09 Oct 2024
Previous
12345...8910
Next