VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

3 January 2025
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
Zhe Chen
Wenhai Wang
X. Zhu
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
    MLLM, VLM, LRM

Papers citing "VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks"

50 / 224 papers shown
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Manas
Pau Rodríguez López
Saba Ahmadi
Aida Nematzadeh
Yash Goyal
Aishwarya Agrawal
VLM, VPVLM
63
51
0
13 Oct 2022
A Generalist Framework for Panoptic Segmentation of Images and Videos
Ting-Li Chen
Lala Li
Saurabh Saxena
Geoffrey E. Hinton
David J. Fleet
VGen, MLLM
121
104
0
12 Oct 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD, VLM, MLLM
160
412
0
17 Jun 2022
A Unified Sequence Interface for Vision Tasks
Ting-Li Chen
Saurabh Saxena
Lala Li
Nayeon Lee
David J. Fleet
Geoffrey E. Hinton
VLM, MLLM
81
152
0
15 Jun 2022
APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking
Yuxiang Yang
Junjie Yang
Yufei Xu
Jing Zhang
Long Lan
Dacheng Tao
91
44
0
12 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Hongsheng Li
Xiaogang Wang
Jifeng Dai
MoMe, MoE
93
70
0
09 Jun 2022
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
Feng Li
Hao Zhang
Hu-Sheng Xu
Siyi Liu
Lei Zhang
L. Ni
H. Shum
ISeg
149
392
0
06 Jun 2022
SelfReformer: Self-Refined Network with Transformer for Salient Object Detection
Y. Yun
Weisi Lin
ViT
124
29
0
23 May 2022
OPT: Open Pre-trained Transformer Language Models
Susan Zhang
Stephen Roller
Naman Goyal
Mikel Artetxe
Moya Chen
...
Daniel Simig
Punit Singh Koura
Anjali Sridhar
Tianlu Wang
Luke Zettlemoyer
VLM, OSLM, AI4CE
384
3,707
0
02 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM, VLM
420
3,617
0
29 Apr 2022
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding
Xun Long Ng
Kian Eng Ong
Qichen Zheng
Yun Ni
S. Yeo
Jing Liu
VGen
79
88
0
18 Apr 2022
Exploring Plain Vision Transformer Backbones for Object Detection
Yanghao Li
Hanzi Mao
Ross B. Girshick
Kaiming He
ViT
106
818
0
30 Mar 2022
Towards End-to-End Unified Scene Text Detection and Layout Analysis
Shangbang Long
Siyang Qin
Dmitry Panteleev
Alessandro Bissacco
Yasuhisa Fujii
Michalis Raptis
97
97
0
28 Mar 2022
High-resolution Iterative Feedback Network for Camouflaged Object Detection
Xiaobin Hu
Deng-Ping Fan
Xuebin Qin
Hang Dai
Wenqi Ren
Ying Tai
Chengjie Wang
Ling Shao
109
121
0
22 Mar 2022
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry
Do Xuan Long
J. Tan
Shafiq Joty
Enamul Hoque
AIMat
134
687
0
19 Mar 2022
Zoom In and Out: A Mixed-scale Triplet Network for Camouflaged Object Detection
Youwei Pang
Xiaoqi Zhao
Tian-Zhu Xiang
Zhang Lihe
Huchuan Lu
ObjD
101
226
0
05 Mar 2022
CM3: A Causal Masked Multimodal Model of the Internet
Armen Aghajanyan
Po-Yao (Bernie) Huang
Candace Ross
Vladimir Karpukhin
Hu Xu
...
Dmytro Okhonko
Mandar Joshi
Gargi Ghosh
M. Lewis
Luke Zettlemoyer
114
158
0
19 Jan 2022
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach
A. Blattmann
Dominik Lorenz
Patrick Esser
Bjorn Ommer
3DV
564
15,835
0
20 Dec 2021
RegionCLIP: Region-based Language-Image Pretraining
Yiwu Zhong
Jianwei Yang
Pengchuan Zhang
Chunyuan Li
Noel Codella
...
Luowei Zhou
Xiyang Dai
Lu Yuan
Yin Li
Jianfeng Gao
VLM, CLIP
153
583
0
16 Dec 2021
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjD, VLM
153
1,070
0
07 Dec 2021
Masked-attention Mask Transformer for Universal Image Segmentation
Bowen Cheng
Ishan Misra
Alex Schwing
Alexander Kirillov
Rohit Girdhar
ISeg
286
2,397
0
02 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Hongsheng Li
Xiaohua Wang
Jifeng Dai
124
133
0
02 Dec 2021
OCR-free Document Understanding Transformer
Geewook Kim
Teakgyu Hong
Moonbin Yim
Jeongyeon Nam
Jinyoung Park
Jinyeong Yim
Wonseok Hwang
Sangdoo Yun
Dongyoon Han
Seunghyun Park
ViT
141
274
0
30 Nov 2021
LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation
Junjue Wang
Zhuo Zheng
A. Ma
Xiaoyan Lu
Yanfei Zhong
109
342
0
17 Oct 2021
AP-10K: A Benchmark for Animal Pose Estimation in the Wild
Hang Yu
Yufei Xu
Jing Zhang
Wei Zhao
Ziyu Guan
Dacheng Tao
103
113
0
28 Aug 2021
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRL, AI4TS, AI4CE, ALM, AIMat
559
10,625
0
17 Jun 2021
Anabranch Network for Camouflaged Object Segmentation
Trung-Nghia Le
Tam V. Nguyen
Zhongliang Nie
M. Tran
Akihiro Sugimoto
115
507
0
20 May 2021
TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
Amanpreet Singh
Guan Pang
Mandy Toh
Jing Huang
Wojciech Galuba
Tal Hassner
77
174
0
12 May 2021
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD, VLM
203
897
0
26 Apr 2021
InfographicVQA
Minesh Mathew
Viraj Bagal
Rubèn Pérez Tito
Dimosthenis Karatzas
Ernest Valveny
C. V. Jawahar
108
242
0
26 Apr 2021
Visual Saliency Transformer
Nian Liu
Ni Zhang
Kaiyuan Wan
Ling Shao
Junwei Han
ViT
322
363
0
25 Apr 2021
Benchmarking Representation Learning for Natural World Image Collections
Grant Van Horn
Elijah Cole
Sara Beery
Kimberly Wilber
Serge J. Belongie
Oisin Mac Aodha
SSL, VLM
76
179
0
30 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
492
21,752
0
25 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP, VLM
1.0K
30,029
0
26 Feb 2021
Concealed Object Detection
Deng-Ping Fan
Ge-Peng Ji
Ming-Ming Cheng
Ling Shao
83
436
0
20 Feb 2021
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
297
5,129
0
08 Oct 2020
Label Decoupling Framework for Salient Object Detection
Junhang Wei
Shuhui Wang
Zhe Wu
Chi Su
Qingming Huang
Q. Tian
73
277
0
25 Aug 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
124
382
0
30 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjD, VLM
116
501
0
11 Jun 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
971
42,651
0
28 May 2020
TextCaps: a Dataset for Image Captioning with Reading Comprehension
Oleksii Sidorov
Ronghang Hu
Marcus Rohrbach
Amanpreet Singh
99
418
0
24 Mar 2020
UniPose: Unified Human Pose Estimation in Single Images and Videos
Bruno Artacho
Andreas E. Savakis
199
138
0
22 Jan 2020
Gradient Surgery for Multi-Task Learning
Tianhe Yu
Saurabh Kumar
Abhishek Gupta
Sergey Levine
Karol Hausman
Chelsea Finn
197
1,234
0
19 Jan 2020
ICDAR 2019 Competition on Large-scale Street View Text with Partial Labeling -- RRC-LSVT
Yipeng Sun
Zihan Ni
Chee-Kheng Chng
Yuliang Liu
Canjie Luo
...
Errui Ding
Jingtuo Liu
Dimosthenis Karatzas
Chee Seng Chan
Lianwen Jin
3DV
109
158
0
17 Sep 2019
AnimalWeb: A Large-Scale Hierarchical Dataset of Annotated Animal Faces
M. H. Khan
J. McDonagh
Salman Khan
M. Shahabuddin
Aditya Arora
Fahad Shahbaz Khan
Ling Shao
Georgios Tzimiropoulos
CVBM
67
48
0
11 Sep 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM, MLLM, SSL
204
1,671
0
22 Aug 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL, VLM, MLLM
216
907
0
16 Aug 2019
LVIS: A Dataset for Large Vocabulary Instance Segmentation
Agrim Gupta
Piotr Dollár
Ross B. Girshick
ISeg, VLM
125
1,379
0
08 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL, VLM
288
3,711
0
06 Aug 2019
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
Kenneth Marino
Mohammad Rastegari
Ali Farhadi
Roozbeh Mottaghi
139
1,095
0
31 May 2019