ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2304.08485
  4. Cited By
Visual Instruction Tuning

Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM
ArXivPDFHTML

Papers citing "Visual Instruction Tuning"

50 / 3,277 papers shown
Title
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
Chen Tang
Xinzhu Ma
Encheng Su
Xiufeng Song
Xiaohong Liu
Wei-Hong Li
Lei Bai
Wanli Ouyang
Xiangyu Yue
3DGS
AI4TS
74
0
0
26 Mar 2025
OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations
OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations
Christina Kassab
Sacha Morin
Martin Buchner
Matías Mattamala
Kumaraditya Gupta
Abhinav Valada
Liam Paull
Maurice F. Fallon
3DV
ELM
49
0
0
25 Mar 2025
LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration
LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration
Yuyao Zhang
Jinghao Li
Yu-Wing Tai
DiffM
69
0
0
25 Mar 2025
PAVE: Patching and Adapting Video Large Language Models
PAVE: Patching and Adapting Video Large Language Models
Zhuoming Liu
Yiquan Li
Khoi Duc Nguyen
Yiwu Zhong
Yin Li
KELM
LRM
89
0
0
25 Mar 2025
Gemma 3 Technical Report
Gemma 3 Technical Report
Gemma Team
Aishwarya B Kamath
Johan Ferret
Shreya Pathak
Nino Vieillard
...
Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
VLM
93
50
0
25 Mar 2025
Towards Online Multi-Modal Social Interaction Understanding
Towards Online Multi-Modal Social Interaction Understanding
Xuzhao Li
Shijian Deng
Bolin Lai
Weiguo Pian
James M. Rehg
Yapeng Tian
46
0
0
25 Mar 2025
ImageSet2Text: Describing Sets of Images through Text
ImageSet2Text: Describing Sets of Images through Text
Piera Riccio
F. Galati
Kajetan Schweighofer
Noa Garcia
Nuria Oliver
VLM
CoGe
79
0
0
25 Mar 2025
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Carlos Plou
Cesar Borja
Ruben Martinez-Cantin
Ana C. Murillo
66
0
0
25 Mar 2025
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Kexian Tang
Junyao Gao
Yanhong Zeng
Haodong Duan
Yanan Sun
Zhening Xing
Wenran Liu
Kaifeng Lyu
Kai-xiang Chen
ELM
LRM
61
2
0
25 Mar 2025
RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models
RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models
Mehdi Moshtaghi
Siavash H. Khajavi
Joni Pajarinen
VLM
61
0
0
25 Mar 2025
DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data
DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data
Liming Zheng
Feng Yan
Fanfan Liu
C. Feng
Yufeng Zhong
Yiyang Huang
Lin Ma
52
0
0
25 Mar 2025
G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation
G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation
Juntao Jian
Xiuping Liu
Z. Chen
Manyi Li
Jian Liu
Ruizhen Hu
46
1
0
25 Mar 2025
Scaling Vision Pre-Training to 4K Resolution
Scaling Vision Pre-Training to 4K Resolution
Baifeng Shi
Boyi Li
Han Cai
Yaojie Lu
Sifei Liu
...
Jan Kautz
Enze Xie
Trevor Darrell
Pavlo Molchanov
Hongxu Yin
CLIP
211
0
0
25 Mar 2025
Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation
Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation
Zhiyao Ren
Yibing Zhan
B. Yu
Dacheng Tao
DiffM
74
0
0
25 Mar 2025
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu
Diankun Zhang
Zongchuang Zhao
Jianfeng Cui
Dingkang Liang
Chong Zhang
Dingyuan Zhang
Hongwei Xie
Bing Wang
Xiang Bai
53
2
0
25 Mar 2025
Improved Alignment of Modalities in Large Vision Language Models
Improved Alignment of Modalities in Large Vision Language Models
Kartik Jangra
Aman Kumar Singh
Yashwani Mann
Geetanjali Rathee
VLM
55
0
0
25 Mar 2025
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
Fucai Ke
Vijay Kumar B G
Xingjian Leng
Zhixi Cai
Zaid Khan
Weiqing Wang
P. D. Haghighi
H. Rezatofighi
Manmohan Chandraker
51
0
0
25 Mar 2025
LangBridge: Interpreting Image as a Combination of Language Embeddings
LangBridge: Interpreting Image as a Combination of Language Embeddings
Jiaqi Liao
Yuwei Niu
Fanqing Meng
Hao Li
Changyao Tian
...
Dianqi Li
X. Zhu
Li Yuan
Jifeng Dai
Yu Cheng
MLLM
74
0
0
25 Mar 2025
Exploring Textual Semantics Diversity for Image Transmission in Semantic Communication Systems using Visual Language Model
Exploring Textual Semantics Diversity for Image Transmission in Semantic Communication Systems using Visual Language Model
Pengcheng Huang
Dong Li
DiffM
VLM
46
0
0
25 Mar 2025
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Hongcheng Gao
Jiashu Qu
Jingyi Tang
Baolong Bi
Yi Liu
Hongyu Chen
Li Liang
Li Su
Qingming Huang
MLLM
VLM
LRM
88
5
0
25 Mar 2025
FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model
FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model
Zhiqiang Zhang
Jia-Nan Li
Zunnan Xu
Hanhui Li
Yiji Cheng
Fa-Ting Hong
Qin Lin
Qinglin Lu
Xiaodan Liang
DiffM
76
1
0
25 Mar 2025
Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
Yishen Liu
Shengda Liu
Hudan Pan
MedIm
55
0
0
24 Mar 2025
A Simple yet Effective Layout Token in Large Language Models for Document Understanding
A Simple yet Effective Layout Token in Large Language Models for Document Understanding
Zhaoqing Zhu
Chuwei Luo
Zirui Shao
Feiyu Gao
Hangdi Xing
Qi Zheng
Ji Zhang
60
0
0
24 Mar 2025
HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models
HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models
Mingzhen Huang
Fu-Jen Chu
Bugra Tekin
Kevin J Liang
Haoyu Ma
...
Hongfei Xue
Siwei Lyu
Kris Kitani
Matt Feiszli
Hao Tang
VLM
65
0
0
24 Mar 2025
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu
Mingfei Gao
Shiyu Li
Jiasen Lu
Zhe Gan
Zhengfeng Lai
Meng Cao
Kai Kang
Yue Yang
Afshin Dehghan
69
2
0
24 Mar 2025
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang
Yang Sui
Jinqi Xiao
Lingyi Huang
Yu Gong
...
Jinghua Yan
Y. Bai
P. Sadayappan
Xia Hu
Bo Yuan
VLM
64
0
0
24 Mar 2025
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad
Luyao Tang
Yuxuan Yuan
Chen Chen
Zeyu Zhang
Yue Huang
Kun Zhang
62
0
0
24 Mar 2025
VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
Wencheng Zhu
Yuexin Wang
Hongxuan Li
Pengfei Zhu
Q. Hu
CLIP
60
0
0
24 Mar 2025
MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing
MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing
Lingting Zhu
Jingrui Ye
Runze Zhang
Zeyu Hu
Yingda Yin
...
Jinnan Chen
Shengju Qian
Xin Wang
Qingmin Liao
L. Yu
63
2
0
24 Mar 2025
Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Deepayan Das
Davide Talon
Yiming Wang
Massimiliano Mancini
Elisa Ricci
VLM
LRM
52
0
0
24 Mar 2025
Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs
Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs
Chang Gao
Kang Zhao
Jianfei Chen
Liping Jing
49
0
0
24 Mar 2025
Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models
Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models
Zichen Miao
Wei Chen
Qiang Qiu
92
1
0
24 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zhengyang Liang
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
100
1
0
24 Mar 2025
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Jinjin Zhang
Guodong Wang
Yizhou Jin
Di Huang
47
1
0
24 Mar 2025
Compositional Caching for Training-free Open-vocabulary Attribute Detection
Compositional Caching for Training-free Open-vocabulary Attribute Detection
Marco Garosi
Alessandro Conti
Gaowen Liu
Elisa Ricci
Massimiliano Mancini
ObjD
VLM
55
0
0
24 Mar 2025
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
Bin Li
Dehong Gao
Yeyuan Wang
Linbo Jin
Shanqing Yu
Xiaoyan Cai
Libin Yang
VLM
51
0
0
24 Mar 2025
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
Chenyangguang Zhang
Alexandros Delitzas
Fangjinhua Wang
Ruida Zhang
Xiangyang Ji
Marc Pollefeys
Francis Engelmann
3DV
3DPC
52
4
0
24 Mar 2025
MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning
MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning
Dawei Yan
Yangfu Li
Qing-Guo Chen
Weihua Luo
Peng Wang
Han Zhang
Chunhua Shen
VGen
VLM
LRM
72
1
0
24 Mar 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Yuxiao Chen
L. Meng
Wujian Peng
Zuxuan Wu
Yu-Gang Jiang
VLM
58
0
0
24 Mar 2025
LLaVAction: evaluating and training multi-modal large language models for action recognition
LLaVAction: evaluating and training multi-modal large language models for action recognition
Shaokai Ye
Haozhe Qi
Alexander Mathis
Mackenzie W. Mathis
75
1
0
24 Mar 2025
Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
Wei Deng
Mengshi Qi
Huadong Ma
3DV
39
0
0
24 Mar 2025
On the Perception Bottleneck of VLMs for Chart Understanding
On the Perception Bottleneck of VLMs for Chart Understanding
Junteng Liu
Weihao Zeng
Xiwen Zhang
Yijun Wang
Zifei Shan
Junxian He
65
0
0
24 Mar 2025
Bridging Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions
Bridging Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions
Dong Jing
Nanyi Fei
Zhiwu Lu
53
0
0
24 Mar 2025
From Fragment to One Piece: A Survey on AI-Driven Graphic Design
From Fragment to One Piece: A Survey on AI-Driven Graphic Design
Xingxing Zou
Wen Zhang
Nanxuan Zhao
59
0
0
24 Mar 2025
Efficient Continual Adaptation of Pretrained Robotic Policy with Online Meta-Learned Adapters
Efficient Continual Adaptation of Pretrained Robotic Policy with Online Meta-Learned Adapters
Ruiqi Zhu
Endong Sun
Guanhe Huang
Oya Celiktutan
CLL
OnRL
72
0
0
24 Mar 2025
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Ziming Wei
Bingqian Lin
Yunshuang Nie
Jiaqi Chen
Shikui Ma
Hang Xu
Xiaodan Liang
56
0
0
23 Mar 2025
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
Zefeng Zhang
Hengzhu Tang
Shuaiyi Nie
Zhenyu Zhang
Yiming Ren
Zhenyang Li
Dawei Yin
Duohe Ma
Tingwen Liu
55
0
0
23 Mar 2025
SG-Tailor: Inter-Object Commonsense Relationship Reasoning for Scene Graph Manipulation
SG-Tailor: Inter-Object Commonsense Relationship Reasoning for Scene Graph Manipulation
Haoliang Shang
Hanyu Wu
Guangyao Zhai
Boyang Sun
Fangjinhua Wang
F. Tombari
Marc Pollefeys
67
0
0
23 Mar 2025
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Qiao Liang
Yanjiang Liu
Xianpei Han
Yaojie Lu
Hongyu Lin
Jia Zheng
Xianpei Han
Le Sun
Yingfei Sun
44
0
0
23 Mar 2025
PHT-CAD: Efficient CAD Parametric Primitive Analysis with Progressive Hierarchical Tuning
PHT-CAD: Efficient CAD Parametric Primitive Analysis with Progressive Hierarchical Tuning
Ke Niu
Yuwen Chen
Haiyang Yu
Z. Chen
Xianghui Que
Bin Li
Xiangyang Xue
60
0
0
23 Mar 2025
Previous
123...8910...646566
Next