Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2202.03052
Cited By
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
Re-assign community
ArXiv
PDF
HTML
Papers citing
"OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
50 / 648 papers shown
Title
An Empirical Study of Parameter Efficient Fine-tuning on Vision-Language Pre-train Model
Yuxin Tian
Mouxing Yang
Yunfan Li
Dayiheng Liu
Xingzhang Ren
Xiaocui Peng
Jiancheng Lv
VLM
45
0
0
13 Mar 2024
Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization
Kento Kawaharazuka
Naoaki Kanazawa
Yoshiki Obinata
K. Okada
Masayuki Inaba
32
5
0
13 Mar 2024
Masked AutoDecoder is Effective Multi-Task Vision Generalist
Han Qiu
Jiaxing Huang
Peng Gao
Lewei Lu
Xiaoqin Zhang
Shijian Lu
51
4
0
12 Mar 2024
Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
Vu Minh Hieu Phan
Yutong Xie
Yuankai Qi
Lingqiao Liu
Liyang Liu
Bowen Zhang
Zhibin Liao
Qi Wu
Minh Nguyen Nhat To
Johan W. Verjans
70
11
0
12 Mar 2024
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
Yang Jiao
Shaoxiang Chen
Zequn Jie
Wenke Huang
Lin Ma
Yueping Jiang
MLLM
45
18
0
12 Mar 2024
Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation
Xinyao Li
Yuke Li
Zhekai Du
Fengling Li
Ke Lu
Jingjing Li
VLM
54
4
0
11 Mar 2024
Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts
Jiawen Zhu
Guansong Pang
VLM
61
35
0
11 Mar 2024
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing
Zhiyuan Chang
Mingyang Li
Junjie Wang
Cheng Li
Qing Wang
22
0
0
05 Mar 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM
CLIP
67
12
0
05 Mar 2024
Non-autoregressive Sequence-to-Sequence Vision-Language Models
Kunyu Shi
Qi Dong
Luis Goncalves
Zhuowen Tu
Stefano Soatto
VLM
47
3
0
04 Mar 2024
Adversarial Testing for Visual Grounding via Image-Aware Property Reduction
Zhiyuan Chang
Mingyang Li
Junjie Wang
Cheng Li
Boyu Wu
Fanjiang Xu
Qing Wang
AAML
36
0
0
02 Mar 2024
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
Ander Salaberria
Gorka Azkune
Oier López de Lacalle
A. Soroa
Eneko Agirre
Frank Keller
EGVM
29
2
0
01 Mar 2024
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
Haowei Liu
Yaya Shi
Haiyang Xu
Chunfen Yuan
Qinghao Ye
...
Mingshi Yan
Ji Zhang
Fei Huang
Bing Li
Weiming Hu
VLM
35
0
0
01 Mar 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
52
47
0
29 Feb 2024
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Yuiga Wada
Kanta Kaneda
Daichi Saito
Komei Sugiura
34
24
0
28 Feb 2024
VCD: Knowledge Base Guided Visual Commonsense Discovery in Images
Xiangqing Shen
Yurun Song
Siwei Wu
Rui Xia
33
6
0
27 Feb 2024
Measuring Vision-Language STEM Skills of Neural Models
Jianhao Shen
Ye Yuan
Srbuhi Mirzoyan
Ming Zhang
Chenguang Wang
VLM
33
8
0
27 Feb 2024
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang
Ziqiao Ma
Xiaofeng Gao
Suhaila Shakiah
Qiaozi Gao
Joyce Chai
MLLM
VLM
48
39
0
26 Feb 2024
Multimodal Transformer With a Low-Computational-Cost Guarantee
Sungjin Park
Edward Choi
52
1
0
23 Feb 2024
Towards Robust Instruction Tuning on Multimodal Large Language Models
Wei Han
Hui Chen
Soujanya Poria
MLLM
46
0
0
22 Feb 2024
Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning
Antoine Chaffin
Ewa Kijak
Vincent Claveau
42
0
0
21 Feb 2024
SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction
Jie Xu
Hanbo Zhang
Xinghang Li
Huaping Liu
Xuguang Lan
Tao Kong
LM&Ro
38
3
0
19 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
35
2
0
18 Feb 2024
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions
Wenxuan Wang
Yisi Zhang
Xingjian He
Yichen Yan
Zijia Zhao
Xinlong Wang
Jing Liu
LM&Ro
27
4
0
17 Feb 2024
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition
Jinyuan Li
Han Li
Di Sun
Jiahao Wang
Wenkun Zhang
Zan Wang
Gang Pan
42
7
0
15 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
40
5
0
14 Feb 2024
Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays
Yeongjae Cho
Taehee Kim
Heejun Shin
Sungzoon Cho
Dongmyung Shin
15
1
0
14 Feb 2024
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
Michael Dorkenwald
Nimrod Barazani
Cees G. M. Snoek
Yuki M. Asano
VLM
MLLM
27
12
0
13 Feb 2024
SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers
Zheng Ning
Brianna L Wimer
Kaiwen Jiang
Keyi Chen
Jerrick Ban
Yapeng Tian
Yuhang Zhao
T. Li
44
15
0
11 Feb 2024
CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps
Shigemichi Matsuzaki
Takuma Sugino
Kazuhito Tanaka
Zijun Sha
Shintaro Nakaoka
Shintaro Yoshizawa
Kazuhiro Shintani
VLM
36
5
0
08 Feb 2024
Real-World Robot Applications of Foundation Models: A Review
Kento Kawaharazuka
T. Matsushima
Andrew Gambardella
Jiaxian Guo
Chris Paxton
Andy Zeng
OffRL
VLM
LM&Ro
51
45
0
08 Feb 2024
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz
Yair Kittenplon
Aviad Aberdam
Elad Ben Avraham
Oren Nuriel
Shai Mazor
Ron Litman
44
20
0
08 Feb 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
123
47
0
07 Feb 2024
Convincing Rationales for Visual Question Answering Reasoning
Kun Li
G. Vosselman
Michael Ying Yang
44
1
0
06 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
34
2
0
04 Feb 2024
LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement
Renyuan Peng
Xinyue Cai
Hang Xu
Jiachen Lu
Feng Wen
Wei Zhang
Li Zhang
30
4
0
31 Jan 2024
Towards Unified Interactive Visual Grounding in The Wild
Jie Xu
Hanbo Zhang
Qingyi Si
Yifeng Li
Xuguang Lan
Tao Kong
LM&Ro
30
5
0
30 Jan 2024
Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation
Fatma Shalabi
H. Nguyen
Hichem Felouat
Ching-Chun Chang
Isao Echizen
40
5
0
29 Jan 2024
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren
Shilong Liu
Ailing Zeng
Jing Lin
Kunchang Li
...
Feng Li
Jie-jin Yang
Hongyang Li
Qing Jiang
Lei Zhang
VLM
46
380
0
25 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL
LRM
52
181
0
24 Jan 2024
Small Language Model Meets with Reinforced Vision Vocabulary
Haoran Wei
Lingyu Kong
Jinyue Chen
Liang Zhao
Zheng Ge
En Yu
Jian‐Yuan Sun
Chunrui Han
Xiangyu Zhang
VLM
57
40
0
23 Jan 2024
UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation
Qingdong He
Jinlong Peng
Zhengkai Jiang
Kai Wu
Xiaozhong Ji
Jiangning Zhang
Yabiao Wang
Chengjie Wang
Mingang Chen
Yunsheng Wu
3DPC
34
8
0
21 Jan 2024
Prompting Large Vision-Language Models for Compositional Reasoning
Timothy Ossowski
Ming Jiang
Junjie Hu
CoGe
VLM
LRM
43
3
0
20 Jan 2024
Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually
Mazal Bethany
Brandon Wherry
Nishant Vishwamitra
Peyman Najafirad
DiffM
31
4
0
19 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Hongsheng Li
Yu Qiao
Jifeng Dai
AuLLM
85
42
0
18 Jan 2024
Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model
Taehee Kim
Yeongjae Cho
Heejun Shin
Yohan Jo
Dongmyung Shin
37
4
0
12 Jan 2024
ModaVerse: Efficiently Transforming Modalities with LLMs
Xinyu Wang
Bohan Zhuang
Qi Wu
14
11
0
12 Jan 2024
AffordanceLLM: Grounding Affordance from Vision Language Models
Shengyi Qian
Weifeng Chen
Min Bai
Xiong Zhou
Zhuowen Tu
Li Erran Li
23
21
0
12 Jan 2024
CaMML: Context-Aware Multimodal Learner for Large Models
Yixin Chen
Shuai Zhang
Boran Han
Tong He
Bo Li
VLM
32
4
0
06 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
27
9
0
03 Jan 2024
Previous
1
2
3
4
5
6
...
11
12
13
Next