2202.03052
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
Papers citing
"OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
50 / 648 papers shown
Title
Zero-shot Visual Question Answering with Language Model Feedback
Yifan Du
Junyi Li
Tianyi Tang
Wayne Xin Zhao
Ji-Rong Wen
21
13
0
26 May 2023
Learning to Imagine: Visually-Augmented Natural Language Generation
Tianyi Tang
Yushuo Chen
Yifan Du
Junyi Li
Wayne Xin Zhao
Ji-Rong Wen
DiffM
16
9
0
26 May 2023
Multimodal Recommendation Dialog with Subjective Preference: A New Challenge and Benchmark
Yuxing Long
Binyuan Hui
Caixia Yuan
Fei Huang
Yongbin Li
Xiaojie Wang
29
4
0
26 May 2023
AlignScore: Evaluating Factual Consistency with a Unified Alignment Function
Yuheng Zha
Yichi Yang
Ruichen Li
Zhiting Hu
HILM
21
180
0
26 May 2023
Weakly Supervised Vision-and-Language Pre-training with Relative Representations
Chi Chen
Peng Li
Maosong Sun
Yang Liu
27
1
0
24 May 2023
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models
Geewook Kim
Hodong Lee
D. Kim
Haeji Jung
S. Park
Yoon Kim
Sangdoo Yun
Taeho Kil
Bado Lee
Seunghyun Park
VLM
37
4
0
24 May 2023
SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models
Zekun Wang
Jingchang Chen
Wangchunshu Zhou
Haichao Zhu
Jiafeng Liang
Liping Shan
Ming Liu
Dongliang Xu
Qing Yang
Bing Qin
VLM
26
4
0
24 May 2023
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
Yunshui Li
Binyuan Hui
Zhichao Yin
Min Yang
Fei Huang
Yongbin Li
MoE
35
19
0
24 May 2023
I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors
Tuhin Chakrabarty
Arkadiy Saakyan
Olivia Winn
Artemis Panagopoulou
Yue Yang
Marianna Apidianaki
Smaranda Muresan
DiffM
33
41
0
24 May 2023
Vision + Language Applications: A Survey
Yutong Zhou
N. Shimada
VLM
30
6
0
24 May 2023
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
Haoxuan You
Rui Sun
Zhecan Wang
Long Chen
Gengyu Wang
Hammad A. Ayyubi
Kai-Wei Chang
Shih-Fu Chang
VLM
MLLM
LRM
52
43
0
24 May 2023
VIP5: Towards Multimodal Foundation Models for Recommendation
Shijie Geng
Juntao Tan
Shuchang Liu
Zuohui Fu
Yongfeng Zhang
32
70
0
23 May 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
SSL
30
5
0
23 May 2023
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Shuai Zhao
Xiaohan Wang
Linchao Zhu
Yezhou Yang
CLIP
VLM
23
25
0
23 May 2023
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
Vaishnavi Himakunthala
Andy Ouyang
Daniel Philip Rose
Ryan He
Alex Mei
Yujie Lu
Chinmay Sonar
Michael Stephen Saxon
William Y. Wang
MLLM
LRM
35
2
0
23 May 2023
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans
Taiki Miyanishi
Daichi Azuma
Shuhei Kurita
M. Kawanabe
36
2
0
23 May 2023
Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks
Sherzod Hakimov
David Schlangen
VLM
36
5
0
23 May 2023
UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning
Hao Yang
Can Gao
Hao Liu
Xinyan Xiao
Yanyan Zhao
Bing Qin
31
2
0
23 May 2023
Preconditioned Visual Language Inference with Weak Supervision
Ehsan Qasemi
Amani Maina-Kilaas
Devadutta Dash
Khalid Alsaggaf
Muhao Chen
25
0
0
22 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Jiaheng Liu
Jiashi Feng
VLM
CLIP
23
17
0
22 May 2023
Text-based Person Search without Parallel Image-Text Data
Yang Bai
Wenwen Qiang
Min Cao
Cheng Chen
Ziqiang Cao
Liqiang Nie
Min Zhang
38
13
0
22 May 2023
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang
Mahmoud Khademi
Yichong Xu
Reid Pryzant
Yuwei Fang
...
Yu Shi
Lu Yuan
Takuya Yoshioka
Michael Zeng
Xuedong Huang
17
2
0
21 May 2023
A request for clarity over the End of Sequence token in the Self-Critical Sequence Training
J. Hu
Roberto Cavicchioli
Alessandro Capotondi
32
6
0
20 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
Jiaheng Liu
15
1
0
19 May 2023
Generating Visual Spatial Description via Holistic 3D Scene Understanding
Yu Zhao
Hao Fei
Wei Ji
Jianguo Wei
Meishan Zhang
Hao Fei
Tat-Seng Chua
28
33
0
19 May 2023
A Topic-aware Summarization Framework with Different Modal Side Information
Xiuying Chen
Mingzhe Li
Shen Gao
Xin Cheng
Qiang Yang
Qishen Zhang
Xin Gao
Xiangliang Zhang
31
13
0
19 May 2023
LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation
Suhyeon Lee
Won Jun Kim
Jinho Chang
Jong Chul Ye
MedIm
32
48
0
19 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
48
115
0
18 May 2023
Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners
Xuehai He
Weixi Feng
Tsu-jui Fu
Varun Jampani
Arjun Reddy Akula
P. Narayana
Sugato Basu
William Yang Wang
Qing Guo
DiffM
54
7
0
18 May 2023
Rethinking Multimodal Content Moderation from an Asymmetric Angle with Mixed-modality
Jialing Yuan
Ye Yu
Gaurav Mittal
Matthew Hall
Sandra Sajeev
Mei Chen
24
9
0
17 May 2023
IMAD: IMage-Augmented multi-modal Dialogue
Viktor Moskvoretskii
Anton Frolov
Denis Kuznetsov
24
4
0
17 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation
Michal Yarom
Yonatan Bitton
Soravit Changpinyo
Roee Aharoni
Jonathan Herzig
Oran Lang
E. Ofek
Idan Szpektor
EGVM
59
74
0
17 May 2023
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li
Yifan Du
Kun Zhou
Jinpeng Wang
Wayne Xin Zhao
Ji-Rong Wen
MLLM
LRM
122
702
0
17 May 2023
Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding
ShuWei Feng
Tianyang Zhan
Zhanming Jie
Trung Quoc Luong
Xiaoran Jin
19
1
0
16 May 2023
Simple Token-Level Confidence Improves Caption Correctness
Suzanne Petryk
Spencer Whitehead
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
Marcus Rohrbach
31
7
0
11 May 2023
Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts
Zhaoyang Zhang
Yantao Shen
Kunyu Shi
Zhaowei Cai
Jun Fang
Siqi Deng
Hao Yang
Davide Modolo
Z. Tu
Stefano Soatto
VLM
28
2
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
54
130
0
11 May 2023
Combo of Thinking and Observing for Outside-Knowledge VQA
Q. Si
Yuchen Mo
Zheng Lin
Huishan Ji
Weiping Wang
46
13
0
10 May 2023
A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues
Yunxin Li
Baotian Hu
Xinyu Chen
Yuxin Ding
Lin Ma
Min Zhang
LRM
48
14
0
08 May 2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Xilun Chen
L. Yu
Wenhan Xiong
Barlas Oğuz
Yashar Mehdad
Wen-tau Yih
VGen
26
3
0
04 May 2023
Multi-Modality Deep Network for JPEG Artifacts Reduction
Xuhao Jiang
Weimin Tan
Qing Lin
Chenxi Ma
Bo Yan
Liquan Shen
43
2
0
04 May 2023
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings
Daniel Philip Rose
Vaishnavi Himakunthala
Andy Ouyang
Ryan He
Alex Mei
Yujie Lu
Michael Stephen Saxon
Chinmay Sonar
Diba Mirza
William Yang Wang
LRM
72
38
0
03 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLM
VPVLM
47
2
0
03 May 2023
A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text
Yunxin Li
Baotian Hu
Yuxin Ding
Lin Ma
Hao Fei
25
5
0
03 May 2023
Multimodal Procedural Planning via Dual Text-Image Prompting
Yujie Lu
Pan Lu
Zhiyu Zoey Chen
Wanrong Zhu
Qing Guo
William Yang Wang
LM&Ro
62
43
0
02 May 2023
Multimodal Neural Databases
Giovanni Trappolini
Andrea Santilli
Emanuele Rodolà
A. Halevy
Fabrizio Silvestri
50
10
0
02 May 2023
π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation
Chengyue Wu
Teng Wang
Yixiao Ge
Zeyu Lu
Rui-Zhi Zhou
Ying Shan
Ping Luo
MoMe
88
35
0
27 Apr 2023
Retrieval-based Knowledge Augmented Vision Language Pre-training
Jiahua Rao
Zifei Shan
Long Liu
Yao Zhou
Yuedong Yang
VLM
88
13
0
27 Apr 2023
Understand the Dynamic World: An End-to-End Knowledge Informed Framework for Open Domain Entity State Tracking
Mingchen Li
Lifu Huang
54
9
0
26 Apr 2023
Multi-Modality Deep Network for Extreme Learned Image Compression
Xuhao Jiang
Weimin Tan
Tian Tan
Bo Yan
Liquan Shen
19
17
0
26 Apr 2023