ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2408.12637
  4. Cited By
Building and better understanding vision-language models: insights and
  future directions

Building and better understanding vision-language models: insights and future directions

22 August 2024
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
    VLM
ArXivPDFHTML

Papers citing "Building and better understanding vision-language models: insights and future directions"

23 / 23 papers shown
Title
FG-CLIP: Fine-Grained Visual and Textual Alignment
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie
Bin Wang
Fanjing Kong
Jincheng Li
Dawei Liang
Gengshen Zhang
Dawei Leng
Yuhui Yin
CLIP
VLM
53
0
0
08 May 2025
R^3-VQA: "Read the Room" by Video Social Reasoning
R^3-VQA: "Read the Room" by Video Social Reasoning
Lixing Niu
Jiapeng Li
Xingping Yu
Shu Wang
Ruining Feng
Bo Wu
Ping Wei
Yixuan Wang
Lifeng Fan
51
0
0
07 May 2025
Transferable Adversarial Attacks on Black-Box Vision-Language Models
Transferable Adversarial Attacks on Black-Box Vision-Language Models
Kai Hu
Weichen Yu
L. Zhang
Alexander Robey
Andy Zou
Chengming Xu
Haoqi Hu
Matt Fredrikson
AAML
VLM
64
0
0
02 May 2025
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
Zhikai Wang
Jiashuo Sun
Wenbo Zhang
Zhiqiang Hu
Xin Li
F. Wang
Deli Zhao
VLM
LRM
75
0
0
24 Apr 2025
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Zheng Liu
Mengjie Liu
Jianfei Chen
Jingwei Xu
Bin Cui
Conghui He
Wentao Zhang
MLLM
59
0
0
14 Apr 2025
A Token-level Text Image Foundation Model for Document Understanding
A Token-level Text Image Foundation Model for Document Understanding
Tongkun Guan
Zining Wang
Pei Fu
Zhengtao Guo
Wei-Ming Shen
...
Chen Duan
Hao Sun
Qianyi Jiang
Junfeng Luo
Xiaokang Yang
VLM
45
0
0
04 Mar 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-WeiChang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
87
3
0
26 Feb 2025
CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
Cristiano Patrício
Isabel Rio-Torto
J. S. Cardoso
Luís F. Teixeira
João C. Neves
VLM
218
0
0
21 Jan 2025
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang
Yuchang Su
Yiming Liu
Xiaohan Wang
James Burgess
...
Josiah Aklilu
Alejandro Lozano
Anjiang Wei
Ludwig Schmidt
Serena Yeung-Levy
61
3
0
06 Jan 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Wenqi Zhang
Hang Zhang
Xin Li
Jiashuo Sun
Yongliang Shen
Weiming Lu
Deli Zhao
Yueting Zhuang
Lidong Bing
VLM
43
2
0
01 Jan 2025
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Xin Zhang
Yanzhao Zhang
Wen Xie
Mingxin Li
Ziqi Dai
Dingkun Long
Pengjun Xie
Meishan Zhang
Wenjie Li
M. Zhang
116
7
0
22 Dec 2024
Teaching VLMs to Localize Specific Objects from In-context Examples
Teaching VLMs to Localize Specific Objects from In-context Examples
Sivan Doveh
Nimrod Shabtay
Wei Lin
Eli Schwartz
Hilde Kuehne
...
Leonid Karlinsky
James Glass
Assaf Arbelle
S. Ullman
Muhammad Jehanzeb Mirza
VLM
103
1
0
20 Nov 2024
3DArticCyclists: Generating Synthetic Articulated 8D Pose-Controllable Cyclist Data for Computer Vision Applications
3DArticCyclists: Generating Synthetic Articulated 8D Pose-Controllable Cyclist Data for Computer Vision Applications
Eduardo R. Corral-Soto
Yang Liu
Tongtong Cao
Y. Ren
Liu Bingbing
52
0
0
14 Oct 2024
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
Ziyue Wang
Chi Chen
Fuwen Luo
Yurui Dong
Yuanchi Zhang
Yuzhuang Xu
Xiaolong Wang
Peng Li
Yang Liu
LRM
40
3
0
07 Oct 2024
Unlocking the conversion of Web Screenshots into HTML Code with the
  WebSight Dataset
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Hugo Laurençon
Léo Tronchon
Victor Sanh
VLM
55
36
0
14 Mar 2024
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned
  Language Models
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Siddharth Karamcheti
Suraj Nair
Ashwin Balakrishna
Percy Liang
Thomas Kollar
Dorsa Sadigh
MLLM
VLM
57
98
0
12 Feb 2024
Silkie: Preference Distillation for Large Visual Language Models
Silkie: Preference Distillation for Large Visual Language Models
Lei Li
Zhihui Xie
Mukai Li
Shunian Chen
Peiyi Wang
Liang Chen
Yazheng Yang
Benyou Wang
Lingpeng Kong
MLLM
112
68
0
17 Dec 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
270
4,244
0
30 Jan 2023
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language
  Understanding
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee
Mandar Joshi
Iulia Turc
Hexiang Hu
Fangyu Liu
Julian Martin Eisenschlos
Urvashi Khandelwal
Peter Shaw
Ming-Wei Chang
Kristina Toutanova
CLIP
VLM
166
263
0
07 Oct 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science
  Question Answering
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
A. Kalyan
ELM
ReLM
LRM
211
1,106
0
20 Sep 2022
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
Bryan Wang
Gang Li
Xin Zhou
Zhourong Chen
Tovi Grossman
Yang Li
167
152
0
07 Aug 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize
  Long-Tail Visual Concepts
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
290
1,084
0
17 Feb 2021
Aggregated Residual Transformations for Deep Neural Networks
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie
Ross B. Girshick
Piotr Dollár
Z. Tu
Kaiming He
297
10,220
0
16 Nov 2016
1