2202.03052
Cited By
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
Github (2502★)
Papers citing
"OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
50 / 656 papers shown
Title
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
197
59
0
07 Feb 2024
Multimodal Rationales for Explainable Visual Question Answering
Kun Li
G. Vosselman
Michael Ying Yang
132
2
0
06 Feb 2024
CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning
Ji Qi
Ming Ding
Weihan Wang
Yushi Bai
Qingsong Lv
...
Bin Xu
Lei Hou
Juanzi Li
Yuxiao Dong
Jie Tang
VLM
LRM
50
17
0
06 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
79
3
0
04 Feb 2024
LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement
Renyuan Peng
Xinyue Cai
Hang Xu
Jiachen Lu
Feng Wen
Wei Zhang
Li Zhang
87
4
0
31 Jan 2024
Towards Unified Interactive Visual Grounding in The Wild
Jie Xu
Hanbo Zhang
Qingyi Si
Yifeng Li
Xuguang Lan
Tao Kong
LM&Ro
66
5
0
30 Jan 2024
Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation
Fatma Shalabi
H. Nguyen
Hichem Felouat
Ching-Chun Chang
Isao Echizen
102
5
0
29 Jan 2024
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren
Shilong Liu
Ailing Zeng
Jing Lin
Kunchang Li
...
Feng Li
Jie Yang
Hongyang Li
Qing Jiang
Lei Zhang
VLM
148
449
0
25 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL
LRM
164
217
0
24 Jan 2024
Small Language Model Meets with Reinforced Vision Vocabulary
Haoran Wei
Lingyu Kong
Jinyue Chen
Liang Zhao
Zheng Ge
En Yu
Jian‐Yuan Sun
Chunrui Han
Xiangyu Zhang
VLM
121
41
0
23 Jan 2024
UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation
Qingdong He
Jinlong Peng
Zhengkai Jiang
Kai Wu
Xiaozhong Ji
Jiangning Zhang
Yabiao Wang
Chengjie Wang
Mingang Chen
Yunsheng Wu
3DPC
62
8
0
21 Jan 2024
Prompting Large Vision-Language Models for Compositional Reasoning
Timothy Ossowski
Ming Jiang
Junjie Hu
CoGe
VLM
LRM
102
3
0
20 Jan 2024
Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually
Mazal Bethany
Brandon Wherry
Nishant Vishwamitra
Peyman Najafirad
DiffM
58
4
0
19 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Hongsheng Li
Yu Qiao
Jifeng Dai
AuLLM
145
49
0
18 Jan 2024
Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model
Taehee Kim
Yeongjae Cho
Heejun Shin
Yohan Jo
Dongmyung Shin
103
4
0
12 Jan 2024
ModaVerse: Efficiently Transforming Modalities with LLMs
Xinyu Wang
Bohan Zhuang
Qi Wu
66
12
0
12 Jan 2024
AffordanceLLM: Grounding Affordance from Vision Language Models
Shengyi Qian
Weifeng Chen
Min Bai
Xiong Zhou
Zhuowen Tu
Li Erran Li
112
24
0
12 Jan 2024
CaMML: Context-Aware Multimodal Learner for Large Models
Yixin Chen
Shuai Zhang
Boran Han
Tong He
Bo Li
VLM
117
4
0
06 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
94
10
0
03 Jan 2024
Social Media Ready Caption Generation for Brands
Himanshu Maheshwari
Koustava Goswami
Apoorv Saxena
Balaji Vasan Srinivasan
51
1
0
03 Jan 2024
DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever
Zhichao Yin
Binyuan Hui
Min Yang
Fei Huang
Yongbin Li
VLM
76
3
0
02 Jan 2024
Generating Enhanced Negatives for Training Language-Based Object Detectors
Shiyu Zhao
Long Zhao
Vijay Kumar B.G
Yumin Suh
Dimitris N. Metaxas
Manmohan Chandraker
S. Schulter
ObjD
VLM
117
6
0
29 Dec 2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM
MLLM
100
175
0
28 Dec 2023
Cloud-Device Collaborative Learning for Multimodal Large Language Models
Guanqun Wang
Jiaming Liu
Chenxuan Li
Junpeng Ma
Yuan Zhang
...
Kevin Zhang
Maurice Chong
Ray Zhang
Yijiang Liu
Shanghang Zhang
109
8
0
26 Dec 2023
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu
Yi Jiang
Bin Yan
Huchuan Lu
Zehuan Yuan
Ping Luo
VOS
106
18
0
25 Dec 2023
Voila-A: Aligning Vision-Language Models with User's Gaze Attention
Kun Yan
Lei Ji
Zeyu Wang
Yuntao Wang
Nan Duan
Shuai Ma
122
10
0
22 Dec 2023
Generative Multimodal Models are In-Context Learners
Quan-Sen Sun
Yufeng Cui
Xiaosong Zhang
Fan Zhang
Qiying Yu
...
Yueze Wang
Yongming Rao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
LRM
155
291
0
20 Dec 2023
Object Attribute Matters in Visual Question Answering
Peize Li
Q. Si
Peng Fu
Zheng Lin
Yan Wang
78
0
0
20 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
167
36
0
19 Dec 2023
Context Disentangling and Prototype Inheriting for Robust Visual Grounding
Wei Tang
Liang Li
Xuejing Liu
Lu Jin
Jinhui Tang
Zechao Li
101
26
0
19 Dec 2023
MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising
Bingyuan Wang
Hengyu Meng
Zeyu Cai
Lanjiong Li
Yue Ma
Qifeng Chen
Zeyu Wang
DiffM
89
3
0
18 Dec 2023
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
Haoyuan Wu
Xinyun Zhang
Peng Xu
Peiyu Liao
Xufeng Yao
Bei Yu
VLM
37
0
0
17 Dec 2023
SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models
Lee Hyun
Kim Sung-Bin
Seungju Han
Youngjae Yu
Tae-Hyun Oh
100
15
0
15 Dec 2023
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
Jinguo Zhu
Xiaohan Ding
Yixiao Ge
Yuying Ge
Sijie Zhao
Hengshuang Zhao
Xiaohua Wang
Ying Shan
ViT
VLM
82
37
0
14 Dec 2023
Pixel Aligned Language Models
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLM
VLM
132
15
0
14 Dec 2023
General Object Foundation Model for Images and Videos at Scale
Junfeng Wu
Yi Jiang
Qihao Liu
Zehuan Yuan
Xiang Bai
Song Bai
VOS
VLM
111
41
0
14 Dec 2023
ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
Xinpeng Wang
Xiaoyuan Yi
Han Jiang
Shanlin Zhou
Zhihua Wei
Xing Xie
73
15
0
13 Dec 2023
DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models
Shengguang Wu
Mei Yuan
Qi Su
DiffM
59
0
0
12 Dec 2023
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
Henry Hengyuan Zhao
Pan Zhou
Mike Zheng Shou
MLLM
SyDa
107
7
0
11 Dec 2023
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
MLLM
VLM
96
17
0
08 Dec 2023
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Jinho Park
Jack Hessel
Khyathi Chandu
Paul Pu Liang
Ximing Lu
...
Youngjae Yu
Qiuyuan Huang
Jianfeng Gao
Ali Farhadi
Yejin Choi
VLM
77
13
0
08 Dec 2023
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Yushi Hu
Otilia Stretcu
Chun-Ta Lu
Krishnamurthy Viswanathan
Kenji Hata
Enming Luo
Ranjay Krishna
Ariel Fuxman
VLM
LRM
MLLM
128
37
0
05 Dec 2023
UPOCR: Towards Unified Pixel-Level OCR Interface
Dezhi Peng
Zhenhua Yang
Jiaxin Zhang
Chongyu Liu
Yongxin Shi
Kai Ding
Fengjun Guo
Lianwen Jin
127
11
0
05 Dec 2023
Uni3DL: Unified Model for 3D and Language Understanding
Xiang Li
Jian Ding
Zhaoyang Chen
Mohamed Elhoseiny
117
5
0
05 Dec 2023
Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
Zilin Du
Haoxin Li
Xu Guo
Boyang Li
91
1
0
05 Dec 2023
Lenna: Language Enhanced Reasoning Detection Assistant
Fei Wei
Xinyu Zhang
Ailing Zhang
Bo Zhang
Xiangxiang Chu
MLLM
LRM
99
25
0
05 Dec 2023
Aligning and Prompting Everything All at Once for Universal Visual Perception
Yunhang Shen
Chaoyou Fu
Peixian Chen
Mengdan Zhang
Ke Li
Xing Sun
Yunsheng Wu
Shaohui Lin
Rongrong Ji
VLM
ObjD
116
39
0
04 Dec 2023
Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario
Yimin Sun
Chao Wang
Yan Peng
85
9
0
04 Dec 2023
Hulk: A Universal Knowledge Translator for Human-Centric Tasks
Yizhou Wang
YiXuan Wu
Shixiang Tang
Weizhen He
Xun Guo
...
Lei Bai
Rui Zhao
Jian Wu
Tong He
Wanli Ouyang
VLM
202
14
0
04 Dec 2023
PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren
Zhicheng Huang
Yunchao Wei
Yao-Min Zhao
Dongmei Fu
Jiashi Feng
Xiaojie Jin
VLM
MLLM
LRM
118
109
0
04 Dec 2023