Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2412.04424
Cited By
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
5 December 2024
Jiuhai Chen
Jianwei Yang
Haiping Wu
Dianqi Li
Jianfeng Gao
Tianyi Zhou
Bin Xiao
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Github (229★)
Papers citing
"Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion"
19 / 19 papers shown
Title
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Andrew Tao
Zhiding Yu
Guilin Liu
Guilin Liu
MLLM
115
68
0
28 Aug 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM
VLM
108
627
0
25 Apr 2024
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
Zhiyang Xu
Chao Feng
Rulin Shao
Trevor Ashby
Ying Shen
dingnan jin
Yu Cheng
Qifan Wang
Lifu Huang
MLLM
VLM
60
51
0
18 Feb 2024
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM
ELM
VLM
234
943
0
27 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
103
165
0
10 Nov 2023
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu
Hritik Bansal
Tony Xia
Jiacheng Liu
Chun-yue Li
Hannaneh Hajishirzi
Hao Cheng
Kai-Wei Chang
Michel Galley
Jianfeng Gao
LRM
MLLM
122
616
0
03 Oct 2023
Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
560
4,861
0
17 Apr 2023
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELM
ReLM
LRM
281
1,286
0
20 Sep 2022
DaViT: Dual Attention Vision Transformers
Mingyu Ding
Bin Xiao
Noel Codella
Ping Luo
Jingdong Wang
Lu Yuan
ViT
121
250
0
07 Apr 2022
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry
Do Xuan Long
J. Tan
Shafiq Joty
Enamul Hoque
AIMat
132
680
0
19 Mar 2022
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach
A. Blattmann
Dominik Lorenz
Patrick Esser
Bjorn Ommer
3DV
460
15,665
0
20 Dec 2021
RedCaps: web-curated image-text data created by the people, for the people
Karan Desai
Gaurav Kaul
Zubin Aysola
Justin Johnson
122
169
0
22 Nov 2021
InfographicVQA
Minesh Mathew
Viraj Bagal
Rubèn Pérez Tito
Dimosthenis Karatzas
Ernest Valveny
C. V. Jawahar
100
242
0
26 Apr 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
967
29,731
0
26 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
434
1,127
0
17 Feb 2021
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
142
739
0
01 Jul 2020
Towards VQA Models That Can Read
Amanpreet Singh
Vivek Natarajan
Meet Shah
Yu Jiang
Xinlei Chen
Dhruv Batra
Devi Parikh
Marcus Rohrbach
EgoV
97
1,244
0
18 Apr 2019
VizWiz Grand Challenge: Answering Visual Questions from Blind People
Danna Gurari
Qing Li
Abigale Stangl
Anhong Guo
Chi Lin
Kristen Grauman
Jiebo Luo
Jeffrey P. Bigham
CoGe
97
858
0
22 Feb 2018
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
345
3,270
0
02 Dec 2016
1