Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2406.08394
Cited By
v1
v2
v3 (latest)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
3 January 2025
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
Zhe Chen
Wenhai Wang
X. Zhu
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks"
50 / 224 papers shown
Title
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Manas
Pau Rodríguez López
Saba Ahmadi
Aida Nematzadeh
Yash Goyal
Aishwarya Agrawal
VLM
VPVLM
63
51
0
13 Oct 2022
A Generalist Framework for Panoptic Segmentation of Images and Videos
Ting-Li Chen
Lala Li
Saurabh Saxena
Geoffrey E. Hinton
David J. Fleet
VGen
MLLM
121
104
0
12 Oct 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
160
412
0
17 Jun 2022
A Unified Sequence Interface for Vision Tasks
Ting-Li Chen
Saurabh Saxena
Lala Li
Nayeon Lee
David J. Fleet
Geoffrey E. Hinton
VLM
MLLM
81
152
0
15 Jun 2022
APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking
Yuxiang Yang
Junjie Yang
Yufei Xu
Jing Zhang
Long Lan
Dacheng Tao
91
44
0
12 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Hongsheng Li
Xiaogang Wang
Jifeng Dai
MoMe
MoE
93
70
0
09 Jun 2022
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
Feng Li
Hao Zhang
Hu-Sheng Xu
Siyi Liu
Lei Zhang
L. Ni
H. Shum
ISeg
149
392
0
06 Jun 2022
SelfReformer: Self-Refined Network with Transformer for Salient Object Detection
Y. Yun
Weisi Lin
ViT
124
29
0
23 May 2022
OPT: Open Pre-trained Transformer Language Models
Susan Zhang
Stephen Roller
Naman Goyal
Mikel Artetxe
Moya Chen
...
Daniel Simig
Punit Singh Koura
Anjali Sridhar
Tianlu Wang
Luke Zettlemoyer
VLM
OSLM
AI4CE
384
3,707
0
02 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
420
3,617
0
29 Apr 2022
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding
Xun Long Ng
Kian Eng Ong
Qichen Zheng
Yun Ni
S. Yeo
Jing Liu
VGen
79
88
0
18 Apr 2022
Exploring Plain Vision Transformer Backbones for Object Detection
Yanghao Li
Hanzi Mao
Ross B. Girshick
Kaiming He
ViT
106
818
0
30 Mar 2022
Towards End-to-End Unified Scene Text Detection and Layout Analysis
Shangbang Long
Siyang Qin
Dmitry Panteleev
Alessandro Bissacco
Yasuhisa Fujii
Michalis Raptis
97
97
0
28 Mar 2022
High-resolution Iterative Feedback Network for Camouflaged Object Detection
Xiaobin Hu
Deng-Ping Fan
Xuebin Qin
Hang Dai
Wenqi Ren
Ying Tai
Chengjie Wang
Ling Shao
109
121
0
22 Mar 2022
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry
Do Xuan Long
J. Tan
Shafiq Joty
Enamul Hoque
AIMat
134
687
0
19 Mar 2022
Zoom In and Out: A Mixed-scale Triplet Network for Camouflaged Object Detection
Youwei Pang
Xiaoqi Zhao
Tian-Zhu Xiang
Zhang Lihe
Huchuan Lu
ObjD
101
226
0
05 Mar 2022
CM3: A Causal Masked Multimodal Model of the Internet
Armen Aghajanyan
Po-Yao (Bernie) Huang
Candace Ross
Vladimir Karpukhin
Hu Xu
...
Dmytro Okhonko
Mandar Joshi
Gargi Ghosh
M. Lewis
Luke Zettlemoyer
114
158
0
19 Jan 2022
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach
A. Blattmann
Dominik Lorenz
Patrick Esser
Bjorn Ommer
3DV
564
15,835
0
20 Dec 2021
RegionCLIP: Region-based Language-Image Pretraining
Yiwu Zhong
Jianwei Yang
Pengchuan Zhang
Chunyuan Li
Noel Codella
...
Luowei Zhou
Xiyang Dai
Lu Yuan
Yin Li
Jianfeng Gao
VLM
CLIP
153
583
0
16 Dec 2021
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjD
VLM
153
1,070
0
07 Dec 2021
Masked-attention Mask Transformer for Universal Image Segmentation
Bowen Cheng
Ishan Misra
Alex Schwing
Alexander Kirillov
Rohit Girdhar
ISeg
286
2,397
0
02 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Hongsheng Li
Xiaohua Wang
Jifeng Dai
124
133
0
02 Dec 2021
OCR-free Document Understanding Transformer
Geewook Kim
Teakgyu Hong
Moonbin Yim
Jeongyeon Nam
Jinyoung Park
Jinyeong Yim
Wonseok Hwang
Sangdoo Yun
Dongyoon Han
Seunghyun Park
ViT
141
274
0
30 Nov 2021
LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation
Junjue Wang
Zhuo Zheng
A. Ma
Xiaoyan Lu
Yanfei Zhong
109
342
0
17 Oct 2021
AP-10K: A Benchmark for Animal Pose Estimation in the Wild
Hang Yu
Yufei Xu
Jing Zhang
Wei Zhao
Ziyu Guan
Dacheng Tao
103
113
0
28 Aug 2021
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRL
AI4TS
AI4CE
ALM
AIMat
559
10,625
0
17 Jun 2021
Anabranch Network for Camouflaged Object Segmentation
Trung-Nghia Le
Tam V. Nguyen
Zhongliang Nie
M. Tran
Akihiro Sugimoto
115
507
0
20 May 2021
TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
Amanpreet Singh
Guan Pang
Mandy Toh
Jing Huang
Wojciech Galuba
Tal Hassner
77
174
0
12 May 2021
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
203
897
0
26 Apr 2021
InfographicVQA
Minesh Mathew
Viraj Bagal
Rubèn Pérez Tito
Dimosthenis Karatzas
Ernest Valveny
C. V. Jawahar
108
242
0
26 Apr 2021
Visual Saliency Transformer
Nian Liu
Ni Zhang
Kaiyuan Wan
Ling Shao
Junwei Han
ViT
322
363
0
25 Apr 2021
Benchmarking Representation Learning for Natural World Image Collections
Grant Van Horn
Elijah Cole
Sara Beery
Kimberly Wilber
Serge J. Belongie
Oisin Mac Aodha
SSL
VLM
76
179
0
30 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
492
21,752
0
25 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
1.0K
30,029
0
26 Feb 2021
Concealed Object Detection
Deng-Ping Fan
Ge-Peng Ji
Ming-Ming Cheng
Ling Shao
83
436
0
20 Feb 2021
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
297
5,129
0
08 Oct 2020
Label Decoupling Framework for Salient Object Detection
Junhang Wei
Shuhui Wang
Zhe Wu
Chi Su
Qingming Huang
Q. Tian
73
277
0
25 Aug 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
124
382
0
30 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjD
VLM
116
501
0
11 Jun 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
971
42,651
0
28 May 2020
TextCaps: a Dataset for Image Captioning with Reading Comprehension
Oleksii Sidorov
Ronghang Hu
Marcus Rohrbach
Amanpreet Singh
99
418
0
24 Mar 2020
UniPose: Unified Human Pose Estimation in Single Images and Videos
Bruno Artacho
Andreas E. Savakis
199
138
0
22 Jan 2020
Gradient Surgery for Multi-Task Learning
Tianhe Yu
Saurabh Kumar
Abhishek Gupta
Sergey Levine
Karol Hausman
Chelsea Finn
197
1,234
0
19 Jan 2020
ICDAR 2019 Competition on Large-scale Street View Text with Partial Labeling -- RRC-LSVT
Yipeng Sun
Zihan Ni
Chee-Kheng Chng
Yuliang Liu
Canjie Luo
...
Errui Ding
Jingtuo Liu
Dimosthenis Karatzas
Chee Seng Chan
Lianwen Jin
3DV
109
158
0
17 Sep 2019
AnimalWeb: A Large-Scale Hierarchical Dataset of Annotated Animal Faces
M. H. Khan
J. McDonagh
Salman Khan
M. Shahabuddin
Aditya Arora
Fahad Shahbaz Khan
Ling Shao
Georgios Tzimiropoulos
CVBM
67
48
0
11 Sep 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
204
1,671
0
22 Aug 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
216
907
0
16 Aug 2019
LVIS: A Dataset for Large Vocabulary Instance Segmentation
Agrim Gupta
Piotr Dollár
Ross B. Girshick
ISeg
VLM
125
1,379
0
08 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
288
3,711
0
06 Aug 2019
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
Kenneth Marino
Mohammad Rastegari
Ali Farhadi
Roozbeh Mottaghi
139
1,095
0
31 May 2019
Previous
1
2
3
4
5
Next