2202.03052
Cited By
v1
v2 (latest)
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
Github (2502★)
Papers citing "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
50 / 656 papers shown
Universal Instance Perception as Object Discovery and Retrieval
B. Yan
Yi Jiang
Jiannan Wu
D. Wang
Ping Luo
Zehuan Yuan
Huchuan Lu
VOS
VLM
LRM
148
175
0
12 Mar 2023
HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining
Shixiang Tang
Cheng Chen
Qingsong Xie
Meilin Chen
Yizhou Wang
...
Feng Zhu
Haiyang Yang
Li Yi
Rui Zhao
Wanli Ouyang
VLM
105
36
0
10 Mar 2023
Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors
Kento Kawaharazuka
Yoshiki Obinata
Naoaki Kanazawa
K. Okada
Masayuki Inaba
LM&Ro
64
12
0
10 Mar 2023
Tag2Text: Guiding Vision-Language Model via Image Tagging
Xinyu Huang
Youcai Zhang
Jinyu Ma
Weiwei Tian
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Lei Zhang
CLIP
MLLM
VLM
3DV
143
77
0
10 Mar 2023
Weakly-Supervised HOI Detection from Interaction Labels Only and Language/Vision-Language Priors
Mesut Erhan Unal
Adriana Kovashka
VLM
73
5
0
09 Mar 2023
VQA-based Robotic State Recognition Optimized with Genetic Algorithm
Kento Kawaharazuka
Yoshiki Obinata
Naoaki Kanazawa
K. Okada
Masayuki Inaba
48
16
0
09 Mar 2023
Models See Hallucinations: Evaluating the Factuality in Video Captioning
Hui Liu
Xiaojun Wan
HILM
58
11
0
06 Mar 2023
UniHCP: A Unified Model for Human-Centric Perceptions
Yuanzheng Ci
Yizhou Wang
Meilin Chen
Shixiang Tang
Lei Bai
Feng Zhu
Rui Zhao
F. Yu
Donglian Qi
Wanli Ouyang
135
52
0
06 Mar 2023
Prismer: A Vision-Language Model with Multi-Task Experts
Shikun Liu
Linxi Fan
Edward Johns
Zhiding Yu
Chaowei Xiao
Anima Anandkumar
VLM
MLLM
139
25
0
04 Mar 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Xiaoping Han
Xiatian Zhu
Licheng Yu
Li Zhang
Yi-Zhe Song
Tao Xiang
VLM
78
45
0
04 Mar 2023
Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
Zhou Yu
Xuecheng Ouyang
Zhenwei Shao
Mei Wang
Jun Yu
MLLM
186
11
0
03 Mar 2023
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
Jingjing Jiang
Nanning Zheng
MoE
114
6
0
02 Mar 2023
EVJVQA Challenge: Multilingual Visual Question Answering
Ngan Luu-Thuy Nguyen
Nghia Hieu Nguyen
Duong T.D. Vo
K. Tran
Kiet Van Nguyen
82
7
0
23 Feb 2023
Backdoor Attacks to Pre-trained Unified Foundation Models
Zenghui Yuan
Yixin Liu
Kai Zhang
Pan Zhou
Lichao Sun
AAML
88
11
0
18 Feb 2023
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
Zhihong Chen
Shizhe Diao
Benyou Wang
Guanbin Li
Xiang Wan
MedIm
127
33
0
17 Feb 2023
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
Jiang Liu
Hui Ding
Zhaowei Cai
Yuting Zhang
R. Satzoda
Vijay Mahadevan
R. Manmatha
ObjD
123
133
0
14 Feb 2023
IC3: Image Captioning by Committee Consensus
David M. Chan
Austin Myers
Sudheendra Vijayanarasimhan
David A. Ross
John F. Canny
67
18
0
02 Feb 2023
Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
Alon Albalak
Colin Raffel
William Yang Wang
100
12
0
01 Feb 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM
VLM
MoE
116
171
0
01 Feb 2023
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
104
32
0
01 Feb 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
525
4,670
0
30 Jan 2023
BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution Generalization of VQA Models
Ali Borji
CoGe
52
1
0
28 Jan 2023
Semi-Parametric Video-Grounded Text Generation
Sungdong Kim
Jin-Hwa Kim
Jiyoung Lee
Minjoon Seo
VGen
80
14
0
27 Jan 2023
Characterizing the Entities in Harmful Memes: Who is the Hero, the Villain, the Victim?
Shivam Sharma
Atharva Kulkarni
Tharun Suresh
Himanshi Mathur
Preslav Nakov
Md. Shad Akhtar
Tanmoy Chakraborty
100
17
0
26 Jan 2023
Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering
Chenxi Whitehouse
Tillman Weyde
Pranava Madhyastha
LRM
91
3
0
25 Jan 2023
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
Razvan-George Pasca
Alexey Gavryushin
Muhammad Hamza
Yen-Ling Kuo
Kaichun Mo
Luc Van Gool
Otmar Hilliges
Xi Wang
169
14
0
22 Jan 2023
Towards Models that Can See and Read
Roy Ganz
Oren Nuriel
Aviad Aberdam
Yair Kittenplon
Shai Mazor
Ron Litman
71
13
0
18 Jan 2023
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks
Xinsong Zhang
Yan Zeng
Jipeng Zhang
Hang Li
VLM
AI4CE
LRM
122
17
0
12 Jan 2023
Scaling Laws for Generative Mixed-Modal Language Models
Armen Aghajanyan
L. Yu
Alexis Conneau
Wei-Ning Hsu
Karen Hambardzumyan
Susan Zhang
Stephen Roller
Naman Goyal
Omer Levy
Luke Zettlemoyer
MoE
VLM
100
110
0
10 Jan 2023
SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph
Yuxing Long
Binyuan Hui
Fulong Ye
Yanyang Li
Zhuoxin Han
Caixia Yuan
Yongbin Li
Xiaojie Wang
LLMAG
65
8
0
05 Jan 2023
When are Lemons Purple? The Concept Association Bias of Vision-Language Models
Yutaro Yamada
Yingtian Tang
Yoyo Zhang
Ilker Yildirim
CoGe
64
15
0
22 Dec 2022
Generalized Decoding for Pixel, Image, and Language
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM
MLLM
ObjD
124
259
0
21 Dec 2022
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Zhiyang Xu
Ying Shen
Lifu Huang
MLLM
139
120
0
21 Dec 2022
Planning-oriented Autonomous Driving
Yihan Hu
Jiazhi Yang
Li Chen
Keyu Li
Chonghao Sima
...
Xiaosong Jia
Qiang Liu
Jifeng Dai
Yu Qiao
Hongyang Li
96
663
0
20 Dec 2022
Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization
Chen Ju
Kunhao Zheng
Jinxian Liu
Peisen Zhao
Ya Zhang
Jianlong Chang
Yanfeng Wang
Qi Tian
65
11
0
19 Dec 2022
Transferring General Multimodal Pretrained Models to Text Recognition
Junyang Lin
Xuancheng Ren
Yichang Zhang
Gao Liu
Peng Wang
An Yang
Chang Zhou
69
4
0
19 Dec 2022
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
Qiucheng Wu
Yujian Liu
Handong Zhao
Ajinkya Kale
T. Bui
Tong Yu
Zhe Lin
Yang Zhang
Shiyu Chang
DiffM
CoGe
91
104
0
16 Dec 2022
Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation
Qian Yang
Qian Chen
Wen Wang
Baotian Hu
Min Zhang
103
27
0
16 Dec 2022
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
Jinze Bai
Rui Men
Han Yang
Xuancheng Ren
Kai Dang
...
Wenhang Ge
Jianxin Ma
Junyang Lin
Jingren Zhou
Chang Zhou
88
16
0
08 Dec 2022
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations
Björn Plüster
Jakob Ambsdorf
Lukas Braach
Jae Hee Lee
S. Wermter
76
6
0
08 Dec 2022
Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning
Ukyo Honda
Taro Watanabe
Yuji Matsumoto
63
9
0
06 Dec 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
174
332
0
06 Dec 2022
Adaptive Testing of Computer Vision Models
Irena Gao
Gabriel Ilharco
Scott M. Lundberg
Marco Tulio Ribeiro
VLM
84
43
0
06 Dec 2022
Unifying Vision, Text, and Layout for Universal Document Processing
Zineng Tang
Ziyi Yang
Guoxin Wang
Yuwei Fang
Yang Liu
Chenguang Zhu
Michael Zeng
Chao-Yue Zhang
Joey Tianyi Zhou
VLM
131
115
0
05 Dec 2022
Images Speak in Images: A Generalist Painter for In-Context Visual Learning
Xinlong Wang
Wen Wang
Yue Cao
Chunhua Shen
Tiejun Huang
VLM
MLLM
159
262
0
05 Dec 2022
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
Dave Zhenyu Chen
Ronghang Hu
Xinlei Chen
Matthias Nießner
Angel X. Chang
120
54
0
01 Dec 2022
What do you MEME? Generating Explanations for Visual Semantic Role Labelling in Memes
Shivam Sharma
Siddhant Agarwal
Tharun Suresh
Preslav Nakov
Md. Shad Akhtar
Tanmoy Chakraborty
VLM
100
22
0
01 Dec 2022
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
Zhuowan Li
Cihang Xie
Benjamin Van Durme
Alan Yuille
VLM
SSL
56
2
0
01 Dec 2022
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
Runyu Ding
Jihan Yang
Chuhui Xue
Wenqing Zhang
Song Bai
Xiaojuan Qi
VLM
80
154
0
29 Nov 2022
MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition
Xiaohuan Zhou
Jiaming Wang
Zeyu Cui
Shiliang Zhang
Zhijie Yan
Jingren Zhou
Chang Zhou
93
12
0
29 Nov 2022