Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2202.03052
Cited By
v1
v2 (latest)
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
Re-assign community
ArXiv (abs)
PDF
HTML
Github (2502★)
Papers citing
"OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
50 / 656 papers shown
Title
Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
Sucheng Ren
Xiaoke Huang
Xianhang Li
Junfei Xiao
Jieru Mei
Zeyu Wang
Alan Yuille
Yuyin Zhou
MedIm
89
9
0
08 Jun 2024
Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities
Sai Munikoti
Ian Stewart
Sameera Horawalavithana
Henry Kvinge
Tegan H. Emerson
Sandra E Thompson
Karl Pazdernik
107
2
0
08 Jun 2024
Flexible and Adaptable Summarization via Expertise Separation
Preslav Nakov
Mingzhe Li
Shen Gao
Xin Cheng
Qingqing Zhu
Rui Yan
Xin Gao
Xiangliang Zhang
MoE
72
5
0
08 Jun 2024
ChatSR: Multimodal Large Language Models for Scientific Formula Discovery
Yanjie Li
Weijun Li
Lina Yu
Min Wu
Jingyi Liu
Wenqiang Li
Shu Wei
Yusong Deng
OffRL
96
3
0
08 Jun 2024
LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model
Dongkai Wang
Shiyu Xuan
Shiliang Zhang
LRM
68
6
0
07 Jun 2024
Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
Qiaomu Miao
Alexandros Graikos
Jingwei Zhang
Sounak Mondal
Minh Hoai
Dimitris Samaras
149
0
0
04 Jun 2024
Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges
Daniel A. P. Oliveira
Eugénio Ribeiro
David Martins de Matos
VGen
53
3
0
04 Jun 2024
ED-SAM: An Efficient Diffusion Sampling Approach to Domain Generalization in Vision-Language Foundation Models
Thanh-Dat Truong
Xin Li
Bhiksha Raj
Jackson Cothren
Khoa Luu
DiffM
VLM
98
1
0
03 Jun 2024
Image Captioning via Dynamic Path Customization
Yiwei Ma
Jiayi Ji
Xiaoshuai Sun
Yiyi Zhou
Xiaopeng Hong
Yongjian Wu
Rongrong Ji
77
1
0
01 Jun 2024
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
Cheng Tan
Jingxuan Wei
Linzhuang Sun
Zhangyang Gao
Siyuan Li
Bihui Yu
Ruifeng Guo
Stan Z. Li
ReLM
LRM
3DV
112
7
0
31 May 2024
ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions
Honglin Lin
Siyu Li
Gu Nan
Chaoyue Tang
Xueting Wang
...
Yankai Rong
Zhili Zhou
Yutong Gao
Qimei Cui
Xiaofeng Tao
50
0
0
29 May 2024
MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification
Laura Fieback
Jakob Spiegelberg
Hanno Gottschalk
MLLM
232
5
0
29 May 2024
ARC: A Generalist Graph Anomaly Detector with In-Context Learning
Yixin Liu
Shiyuan Li
Yu Zheng
Qingfeng Chen
Chengqi Zhang
Shirui Pan
85
14
0
27 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
141
5
0
26 May 2024
ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text
Han Yu
Peikun Guo
Akane Sano
80
19
0
26 May 2024
Planted: a dataset for planted forest identification from multi-satellite time series
L. M. Pazos-Outón
Cristina Nader Vasconcelos
Anton Raichuk
Anurag Arnab
Dan Morris
Maxim Neumann
80
5
0
24 May 2024
Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation
Se-eun Yoon
Hyunsik Jeon
Julian McAuley
109
0
0
23 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
335
54
0
23 May 2024
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
Angeline Pouget
Lucas Beyer
Emanuele Bugliarello
Xiao Wang
Andreas Steiner
Xiao-Qi Zhai
Ibrahim Alabdulmohsin
VLM
91
9
0
22 May 2024
Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions
Junzhang Liu
Zhecan Wang
Hammad A. Ayyubi
Haoxuan You
Chris Thomas
Rui Sun
Shih-Fu Chang
Kai-Wei Chang
155
0
0
18 May 2024
FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models
Adrian Bulat
Yassine Ouali
Georgios Tzimiropoulos
VLM
104
5
0
16 May 2024
G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios
Zeyu Wang
Yuanchun Shi
Yuntao wang
Yuchen Yao
Kun Yan
Yuhan Wang
Lei Ji
Xuhai Xu
Chun Yu
90
9
0
13 May 2024
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
Shibo Jie
Yehui Tang
Ning Ding
Zhi-Hong Deng
Kai Han
Yunhe Wang
VLM
111
11
0
09 May 2024
DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
Jiaxin Zhang
Dezhi Peng
Chongyu Liu
Peirong Zhang
Lianwen Jin
VLM
85
13
0
07 May 2024
Language-Image Models with 3D Understanding
Jang Hyun Cho
Boris Ivanovic
Yulong Cao
Edward Schmerling
Yue Wang
...
Boyi Li
Yurong You
Philipp Krahenbuhl
Yan Wang
Marco Pavone
LRM
72
19
0
06 May 2024
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Yunhao Ge
Fangyin Wei
Siddharth Gururani
Nayeon Lee
Xuan Li
Huayu Chen
CoGe
DiffM
72
17
0
30 Apr 2024
UniFS: Universal Few-shot Instance Perception with Point Representations
Sheng Jin
Ruijie Yao
Lumin Xu
Wentao Liu
Chao Qian
Ji Wu
Ping Luo
119
2
0
30 Apr 2024
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie
Polina Kirichenko
Mark Ibrahim
Mahmoud Assran
Andrew Gordon Wilson
Aaron Courville
Nicolas Ballas
CLIP
VLM
175
23
0
30 Apr 2024
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
73
3
0
26 Apr 2024
What Makes Multimodal In-Context Learning Work?
Folco Bertini Baldassini
Mustafa Shukor
Matthieu Cord
Laure Soulier
Benjamin Piwowarski
138
23
0
24 Apr 2024
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua
Jing Shi
Kushal Kafle
Simon Jenni
Daoan Zhang
John Collomosse
Scott D. Cohen
Jiebo Luo
CoGe
VLM
92
12
0
23 Apr 2024
Graphic Design with Large Multimodal Model
Yutao Cheng
Zhao Zhang
Maoke Yang
Hui Nie
Chunyuan Li
Xinglong Wu
Jie Shao
98
15
0
22 Apr 2024
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Dongze Hao
Qunbo Wang
Longteng Guo
Jie Jiang
Jing Liu
65
1
0
22 Apr 2024
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Chuofan Ma
Yi Jiang
Jiannan Wu
Zehuan Yuan
Xiaojuan Qi
VLM
ObjD
113
65
0
19 Apr 2024
The Solution for the CVPR2024 NICE Image Captioning Challenge
Longfei Huang
Shupeng Zhong
Xiangyu Wu
Ruoxuan Li
62
0
0
19 Apr 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
Gangyi Ding
124
8
0
16 Apr 2024
In-Context Translation: Towards Unifying Image Recognition, Processing, and Generation
Han Xue
Qianru Sun
Li Song
Wenjun Zhang
Zhiwu Huang
MLLM
72
0
0
15 Apr 2024
Connecting NeRFs, Images, and Text
Francesco Ballerini
Pierluigi Zama Ramirez
Roberto Mirabella
Samuele Salti
Luigi Di Stefano
112
5
0
11 Apr 2024
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
Moreno DÍncà
E. Peruzzo
Massimiliano Mancini
Dejia Xu
Vidit Goel
Xingqian Xu
Zhangyang Wang
Humphrey Shi
N. Sebe
117
37
0
11 Apr 2024
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Haotian Zhang
Haoxuan You
Philipp Dufter
Bowen Zhang
Chen Chen
...
Tsu-Jui Fu
William Y. Wang
Shih-Fu Chang
Zhe Gan
Yinfei Yang
ObjD
MLLM
155
51
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Ouguzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
80
32
0
10 Apr 2024
Monocular 3D lane detection for Autonomous Driving: Recent Achievements, Challenges, and Outlooks
Fulong Ma
Weiqing Qi
Guoyang Zhao
Linwei Zheng
Sheng Wang
Yuxuan Liu
Ming-Yuan Liu
135
10
0
10 Apr 2024
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan
Jinfa Huang
Yujun Shi
Yongqi Xu
Ruijie Zhu
Bin Lin
Xinhua Cheng
Li-xin Yuan
Jiebo Luo
VGen
169
36
0
07 Apr 2024
Mixed-Query Transformer: A Unified Image Segmentation Architecture
Pei Wang
Zhaowei Cai
Hao Yang
Ashwin Swaminathan
R. Manmatha
Stefano Soatto
121
2
0
06 Apr 2024
ALOHa: A New Measure for Hallucination in Captioning Models
Suzanne Petryk
David M. Chan
Anish Kachinthaya
Haodi Zou
John F. Canny
Joseph E. Gonzalez
Trevor Darrell
HILM
112
17
0
03 Apr 2024
GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
Yunsong Wang
Hanlin Chen
Gim Hee Lee
124
6
0
01 Apr 2024
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Rongjie Li
Yu Wu
Xuming He
MLLM
LRM
VLM
40
2
0
01 Apr 2024
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
Rongjie Li
Songyang Zhang
Dahua Lin
Kai-xiang Chen
Xuming He
VLM
121
19
0
01 Apr 2024
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues
Shuang Li
Jiahua Wang
Lijie Wen
LRM
53
0
0
29 Mar 2024
Semantic Map-based Generation of Navigation Instructions
Chengzu Li
Chao Zhang
Simone Teufel
R. Doddipatla
Svetlana Stoyanchev
73
2
0
28 Mar 2024
Previous
1
2
3
4
5
...
12
13
14
Next