2306.15195
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
27 June 2023
Ke Chen
Zhao Zhang
Weili Zeng
Richong Zhang
Feng Zhu
Rui Zhao
ObjD
GitHub (778★)
Papers citing "Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic"
50 / 163 papers shown
Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
Shiu-hong Kao
Yu-Wing Tai
Chi-Keung Tang
MLLM
LRM
283
1
0
01 Jul 2025
ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts
Shiu-hong Kao
Yu-Wing Tai
Chi-Keung Tang
VOS
MLLM
VGen
LRM
105
0
0
01 Jul 2025
Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
Lexiang Tang
Xianwei Zhuang
Bang Yang
Zhiyuan Hu
Hongxiang Li
Lu Ma
Jinghan Ru
Yuexian Zou
32
0
0
14 Jun 2025
Domain-Constrained Diffusion Models to Synthesize Tabular Data: A Case Study in Power Systems
Milad Hoseinpour
Vladimir Dvorkin
DiffM
MedIm
22
0
0
12 Jun 2025
CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation
Zhao Zhang
Yutao Cheng
Dexiang Hong
Maoke Yang
Gonglei Shi
Lei Ma
H. Zhang
Jie Shao
Xinglong Wu
DiffM
118
0
0
12 Jun 2025
Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs
Beomsik Cho
Jaehyung Kim
68
0
0
11 Jun 2025
Vision Generalist Model: A Survey
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
70
0
0
11 Jun 2025
Revolutionizing Clinical Trials: A Manifesto for AI-Driven Transformation
M. Schaar
Richard W. Peck
E. McKinney
Jim Weatherall
Stuart Bailey
...
Rafik Salama
Christina Gunther
Francesca Frau
Antoine Pugeat
Ramon Hernandez
MedIm
69
0
0
10 Jun 2025
Synthetic Visual Genome
J. S. Park
Zixian Ma
Linjie Li
Chenhao Zheng
Cheng-Yu Hsieh
...
Quan Kong
Norimasa Kobori
Ali Farhadi
Yejin Choi
Ranjay Krishna
23
0
0
09 Jun 2025
Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification
Tianyi Bai
Zengjie Hu
Fupeng Sun
Jiantao Qiu
Yizhen Jiang
Guangxin He
Bohan Zeng
Conghui He
Binhang Yuan
Wentao Zhang
OffRL
LRM
19
0
0
08 Jun 2025
CoMemo: LVLMs Need Image Context with Image Memory
Shi-Qi Liu
Weijie Su
Xizhou Zhu
Wenhai Wang
Jifeng Dai
VLM
60
0
0
06 Jun 2025
Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction
Zesheng Ye
C. Cai
Ruijiang Dong
Jianzhong Qi
Lei Feng
Pin-Yu Chen
Feng Liu
226
0
0
05 Jun 2025
Refer to Anything with Vision-Language Prompts
Shengcao Cao
Zijun Wei
Jason Kuen
Kangning Liu
Lingzhi Zhang
Jiuxiang Gu
HyunJoon Jung
Liang-Yan Gui
Yu Wang
VLM
117
0
0
05 Jun 2025
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Qianhui Wu
Kanzhi Cheng
Rui Yang
Chaoyun Zhang
Jianwei Yang
...
Huan Zhang
Tong Zhang
Jianbing Zhang
Dongmei Zhang
J. Gao
LM&Ro
64
0
0
03 Jun 2025
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
Wenhan Yang
Spencer Stice
Ali Payani
Baharan Mirzasoleiman
MLLM
32
0
0
30 May 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Gen Luo
Ganlin Yang
Ziyang Gong
Guanzhou Chen
Haonan Duan
...
Wenhai Wang
Jifeng Dai
Yu Qiao
Rongrong Ji
X. Zhu
LM&Ro
39
1
0
30 May 2025
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang
Kaixin Ma
Tianqing Fang
Wenhao Yu
Hongming Zhang
Zhisong Zhang
Yaqi Xie
Katia Sycara
Haitao Mi
Dong Yu
VLM
100
0
0
28 May 2025
OASIS: Online Sample Selection for Continual Visual Instruction Tuning
Minjae Lee
Minhyuk Seo
Tingyu Qu
Tinne Tuytelaars
Jonghyun Choi
VLM
24
1
0
27 May 2025
Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models
Xinmiao Hu
C. Wang
Ruihe An
ChenYu Shao
Xiaojun Ye
Sheng Zhou
Liangcheng Li
MLLM
LRM
65
0
0
26 May 2025
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang
Changle Zhou
Jiawei Kong
Kuofeng Gao
Bin Chen
Tao Liang
Guojun Ma
Shu-Tao Xia
MLLM
115
0
0
26 May 2025
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
Lorenzo Baraldi
Davide Bucciarelli
Federico Betti
Marcella Cornia
Lorenzo Baraldi
N. Sebe
Rita Cucchiara
231
0
0
26 May 2025
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
Feilong Tang
Chengzhi Liu
Zhongxing Xu
Ming Hu
Zelin Peng
...
Minquan Lin
Yifan Peng
Xuelian Cheng
Imran Razzak
Zongyuan Ge
76
1
0
22 May 2025
Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models
Jiachen Jiang
Jinxin Zhou
Bo Peng
Xia Ning
Zhihui Zhu
105
0
0
22 May 2025
Expanding Zero-Shot Object Counting with Rich Prompts
Huilin Zhu
Senyao Li
Jingling Yuan
Zhengwei Yang
Yu Guo
Wenxuan Liu
Xian Zhong
Shengfeng He
VLM
102
0
0
21 May 2025
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning
Yuqi Liu
Tianyuan Qu
Zhisheng Zhong
Bohao Peng
Shu Liu
Bei Yu
Jiaya Jia
VLM
LRM
132
3
0
17 May 2025
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Yifu Yuan
Haiqin Cui
Yibin Chen
Zibin Dong
Fei Ni
Longxin Kou
Jinyi Liu
Pengyi Li
Yan Zheng
Jianye Hao
155
0
0
13 May 2025
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
84
0
0
11 May 2025
RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Ruiqi Wang
Hao Zhang
VLM
109
1
0
03 May 2025
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Ziqiao Ma
Jing Ding
Xuejun Zhang
Dezhi Luo
Jiahe Ding
Sihan Xu
Yuchen Huang
Run Peng
Joyce Chai
243
0
0
22 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Ziwei Liu
Shenglong Ye
...
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
Wei Wang
MLLM
VLM
223
132
1
14 Apr 2025
On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
Bosung Kim
Kyuhwan Lee
Isu Jeong
Jungmin Cheon
Yeojin Lee
Seulki Lee
VGen
102
1
0
31 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
184
1
0
29 Mar 2025
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng
Ziyuan Huang
Kaixiang Ji
Yichao Yan
VLM
242
1
0
26 Mar 2025
Visual Position Prompt for MLLM based Visual Grounding
Wei Tang
Yanpeng Sun
Qinying Gu
Zechao Li
VLM
95
0
0
19 Mar 2025
ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models
Hao Yin
Guangzong Si
Zilei Wang
418
1
0
17 Mar 2025
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Mingyang Song
Xiaoye Qu
Jiawei Zhou
Yu Cheng
VLM
174
1
0
17 Mar 2025
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
Xinyu Ma
Ziyang Ding
Zhicong Luo
Chong Chen
Zonghao Guo
Derek F. Wong
Xiaoyi Feng
Maosong Sun
VLM
LRM
120
8
0
17 Mar 2025
Large-scale Pre-training for Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
86
0
0
13 Mar 2025
Learning to Inference Adaptively for Multimodal Large Language Models
Zhuoyan Xu
Khoi Duc Nguyen
Preeti Mukherjee
Saurabh Bagchi
Somali Chaterji
Yingyu Liang
Yin Li
LRM
129
2
0
13 Mar 2025
Referring to Any Person
Qing Jiang
Lin Wu
Zhaoyang Zeng
Tianhe Ren
Yuda Xiong
Yihao Chen
Qin Liu
Lei Zhang
502
2
0
11 Mar 2025
Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Bardia Safaei
Faizan Siddiqui
Jiacong Xu
Vishal M. Patel
Shao-Yuan Lo
VLM
478
1
0
10 Mar 2025
Hallucinatory Image Tokens: A Training-free EAZY Approach on Detecting and Mitigating Object Hallucinations in LVLMs
Liwei Che
Tony Qingze Liu
Jing Jia
Weiyi Qin
Ruixiang Tang
Vladimir Pavlovic
MLLM
VLM
204
2
0
10 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
150
3
0
10 Mar 2025
GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
Xiang Lan
Feng Wu
Kai He
Qinghao Zhao
Shenda Hong
Mengling Feng
AI4TS
123
7
0
08 Mar 2025
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Liming Lu
Shuchao Pang
Siyuan Liang
Haotian Zhu
Xiyu Zeng
Aishan Liu
Yunhuai Liu
Yongbin Zhou
AAML
172
5
0
05 Mar 2025
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang
Chenwei Xie
Haiyang Wang
Xiaoyi Bao
Tingyu Weng
Pandeng Li
Yun Zheng
Liwei Wang
ObjD
VLM
134
1
0
03 Mar 2025
New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration
X. J. Yang
Jing Liu
Peng Wang
Guoqing Wang
Yue Yang
Jikang Cheng
ObjD
196
0
0
27 Feb 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-Wei Chang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
220
4
0
26 Feb 2025
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Qianqi Yan
Yue Fan
Hongquan Li
Shan Jiang
Yang Zhao
Xinze Guan
Ching-Chen Kuo
Xinze Wang
VLM
LRM
227
2
0
22 Feb 2025
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Henry Hengyuan Zhao
Pan Zhou
Difei Gao
Zechen Bai
Mike Zheng Shou
165
9
0
21 Feb 2025