Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2202.03052
Cited By
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
Re-assign community
ArXiv
PDF
HTML
Papers citing
"OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
48 / 648 papers shown
Title
VIMA: General Robot Manipulation with Multimodal Prompts
Yunfan Jiang
Agrim Gupta
Zichen Zhang
Guanzhi Wang
Yongqiang Dou
Yanjun Chen
Li Fei-Fei
Anima Anandkumar
Yuke Zhu
Linxi Fan
LM&Ro
22
334
0
06 Oct 2022
Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors
Mohammad Reza Taesiri
Finlay Macklon
Yihe Wang
Hengshuo Shen
C. Bezemer
ELM
LLMAG
MLLM
29
13
0
05 Oct 2022
DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics
Ivan Kapelyukh
Vitalis Vosylius
Edward Johns
LM&Ro
DiffM
110
144
0
05 Oct 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
17
23
0
28 Sep 2022
VIPHY: Probing "Visible" Physical Commonsense Knowledge
Shikhar Singh
Ehsan Qasemi
Muhao Chen
35
6
0
15 Sep 2022
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
Jingjing Jiang
Zi-yi Liu
Nanning Zheng
24
8
0
14 Sep 2022
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen
Xiao Wang
Soravit Changpinyo
A. Piergiovanni
Piotr Padlewski
...
Andreas Steiner
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
MLLM
VLM
26
682
0
14 Sep 2022
Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest
Jack Hessel
Ana Marasović
Jena D. Hwang
Lillian Lee
Jeff Da
Rowan Zellers
Robert Mankoff
Yejin Choi
VLM
8
85
0
13 Sep 2022
MaXM: Towards Multilingual Visual Question Answering
Soravit Changpinyo
Linting Xue
Michal Yarom
Ashish V. Thapliyal
Idan Szpektor
J. Amelot
Xi Chen
Radu Soricut
31
8
0
12 Sep 2022
Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe
Hongyang Li
Chonghao Sima
Jifeng Dai
Wenhai Wang
Lewei Lu
...
Xiaosong Jia
Siqian Liu
Jianping Shi
Dahua Lin
Yu Qiao
88
138
0
12 Sep 2022
How good are deep models in understanding the generated images?
Ali Borji
OOD
19
6
0
23 Aug 2022
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Manuel Brack
P. Schramowski
Bjorn Deiseroth
Kristian Kersting
VLM
MLLM
19
3
0
17 Aug 2022
Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning
J. Hu
Roberto Cavicchioli
Alessandro Capotondi
21
21
0
13 Aug 2022
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection
Zhuo Chen
Yufen Huang
Jiaoyan Chen
Yuxia Geng
Yin Fang
Jeff Z. Pan
Ningyu Zhang
Wen Zhang
15
35
0
26 Jul 2022
Is GPT-3 all you need for Visual Question Answering in Cultural Heritage?
P. Bongini
Federico Becattini
A. Bimbo
10
13
0
25 Jul 2022
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations
Qian Yang
Yunxin Li
Baotian Hu
Lin Ma
Yuxin Ding
Min Zhang
20
10
0
23 Jul 2022
Towards the Human Global Context: Does the Vision-Language Model Really Judge Like a Human Being?
Sangmyeong Woh
Jaemin Lee
Hoki Kim
Jinsuk Lee
16
0
0
18 Jul 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
18
81
0
14 Jun 2022
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
Yujia Xie
Luowei Zhou
Xiyang Dai
Lu Yuan
Nguyen Bach
Ce Liu
Michael Zeng
VLM
MLLM
29
28
0
03 Jun 2022
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
Yan Zeng
Wangchunshu Zhou
Ao Luo
Ziming Cheng
Xinsong Zhang
VLM
17
30
0
01 Jun 2022
An Efficient Modern Baseline for FloodNet VQA
Aditya Kane
Sahil Khose
19
4
0
30 May 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
27
527
0
27 May 2022
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar
Akshita Bhagia
Yonatan Bisk
Florian Metze
A. Black
Ana Marasović
16
4
0
24 May 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM
MLLM
21
38
0
23 May 2022
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Zhenhailong Wang
Manling Li
Ruochen Xu
Luowei Zhou
Jie Lei
...
Chenguang Zhu
Derek Hoiem
Shih-Fu Chang
Mohit Bansal
Heng Ji
MLLM
VLM
170
137
0
22 May 2022
A Survey on Unsupervised Anomaly Detection Algorithms for Industrial Images
Yajie Cui
Zhaoxiang Liu
Shiguo Lian
OOD
DRL
30
42
0
24 Apr 2022
A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension
Gen Luo
Yiyi Zhou
Jiamu Sun
Xiaoshuai Sun
Rongrong Ji
ObjD
10
10
0
17 Apr 2022
Multimodal Quasi-AutoRegression: Forecasting the visual popularity of new fashion products
Stefanos-Iordanis Papadopoulos
C. Koutlis
Symeon Papadopoulos
Y. Kompatsiaris
19
20
0
08 Apr 2022
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Zaid Khan
B. Vijaykumar
Xiang Yu
S. Schulter
Manmohan Chandraker
Y. Fu
CLIP
VLM
20
16
0
27 Mar 2022
Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)
Shijie Geng
Shuchang Liu
Zuohui Fu
Yingqiang Ge
Yongfeng Zhang
VLM
AI4TS
23
444
0
24 Mar 2022
Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
Zhou Yu
Zitian Jin
Jun Yu
Mingliang Xu
Hongbo Wang
Jianping Fan
25
4
0
24 Mar 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey
Xiangru Zhu
Zhixu Li
Xiaodan Wang
Xueyao Jiang
Penglei Sun
Xuwu Wang
Yanghua Xiao
N. Yuan
26
154
0
11 Feb 2022
Survey of Hallucination in Natural Language Generation
Ziwei Ji
Nayeon Lee
Rita Frieske
Tiezheng Yu
D. Su
...
Delong Chen
Wenliang Dai
Ho Shu Chan
Andrea Madotto
Pascale Fung
HILM
LRM
38
2,232
0
08 Feb 2022
A Survey of Natural Language Generation
Chenhe Dong
Yinghui Li
Haifan Gong
M. Chen
Junxin Li
Ying Shen
Min Yang
3DV
19
43
0
22 Dec 2021
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
302
7,434
0
11 Nov 2021
The Curious Layperson: Fine-Grained Image Recognition without Expert Labels
Subhabrata Choudhury
Iro Laina
Christian Rupprecht
Andrea Vedaldi
VLM
24
9
0
05 Nov 2021
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
213
1,656
0
15 Oct 2021
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining
Junyang Lin
An Yang
Jinze Bai
Chang Zhou
Le Jiang
...
Jie M. Zhang
Yong Li
Wei Lin
Jingren Zhou
Hongxia Yang
MoE
84
43
0
08 Oct 2021
Pix2seq: A Language Modeling Framework for Object Detection
Ting-Li Chen
Saurabh Saxena
Lala Li
David J. Fleet
Geoffrey E. Hinton
MLLM
ViT
VLM
233
344
0
22 Sep 2021
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron
Hugo Touvron
Ishan Misra
Hervé Jégou
Julien Mairal
Piotr Bojanowski
Armand Joulin
303
5,773
0
29 Apr 2021
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester
Rami Al-Rfou
Noah Constant
VPVLM
280
3,844
0
18 Apr 2021
Zero-Shot Text-to-Image Generation
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
253
4,774
0
24 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
273
1,081
0
17 Feb 2021
Unifying Vision-and-Language Tasks via Text Generation
Jaemin Cho
Jie Lei
Hao Tan
Mohit Bansal
MLLM
249
525
0
04 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
248
1,986
0
31 Dec 2020
Improved Baselines with Momentum Contrastive Learning
Xinlei Chen
Haoqi Fan
Ross B. Girshick
Kaiming He
SSL
246
3,369
0
09 Mar 2020
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
250
927
0
24 Sep 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,950
0
20 Apr 2018
Previous
1
2
3
...
11
12
13