Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2202.03052
Cited By
v1
v2 (latest)
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
Re-assign community
ArXiv (abs)
PDF
HTML
Github (2502★)
Papers citing
"OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
50 / 656 papers shown
Title
Multi-encoder nnU-Net outperforms transformer models with self-supervised pretraining
Seyedeh Sahar Taheri Otaghsara
Reza Rahmanzadeh
ViT
73
0
0
01 Jul 2025
Understanding GUI Agent Localization Biases through Logit Sharpness
Xingjian Tao
Yiwei Wang
Yujun Cai
Zhicheng YANG
Jing Tang
LLMAG
15
0
0
18 Jun 2025
Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Zhiyang Xu
Jiuhai Chen
Zhaojiang Lin
Xichen Pan
Lifu Huang
...
Di Jin
Michihiro Yasunaga
Lili Yu
Xi Lin
Shaoliang Nie
121
1
0
12 Jun 2025
Vision Generalist Model: A Survey
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
68
0
0
11 Jun 2025
Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
Yibo Cui
Liang Xie
Yu Zhao
Jiawei Sun
Erwei Yin
17
0
0
10 Jun 2025
EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li
Yutong Chen
Yiqian Wu
Kaifeng Zhao
Marc Pollefeys
Siyu Tang
EgoV
VLM
38
0
0
09 Jun 2025
FREE: Fast and Robust Vision Language Models with Early Exits
Divya J. Bajpai
M. Hanawal
VLM
17
0
0
07 Jun 2025
Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques
Jisu An
Junseok Lee
Jeoungeun Lee
Yongseok Son
149
0
0
05 Jun 2025
Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
77
0
0
04 Jun 2025
Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?
Jiwan Chung
Janghan Yoon
J. S. Park
Sangeyl Lee
Joowon Yang
Sooyeon Park
Youngjae Yu
40
0
0
30 May 2025
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Yunze Man
De-An Huang
Guilin Liu
Shiwei Sheng
Shilong Liu
Liang-Yan Gui
Jan Kautz
Yu Wang
Zhiding Yu
MLLM
LRM
76
0
0
29 May 2025
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Minheng Ni
Zhengyuan Yang
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
W. Zuo
Lijuan Wang
ReLM
LRM
85
1
0
26 May 2025
RemoteSAM: Towards Segment Anything for Earth Observation
Liang Yao
Fan Liu
Delong Chen
Chuanyi Zhang
Yijun Wang
Ziyun Chen
Wei Xu
Shimin Di
Yuhui Zheng
236
0
0
23 May 2025
Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning
Cheng Peng
Kai Zhang
Mengxian Lyu
Hongfang Liu
Lichao Sun
Yonghui Wu
LM&MA
MedIm
VLM
278
0
0
23 May 2025
Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts
Taewon Kang
Ming C. Lin
DiffM
VGen
83
0
0
22 May 2025
Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports
Francesco Dalla Serra
Patrick Schrempf
Chaoyang Wang
Zaiqiao Meng
Fani Deligianni
Alison Q. OÑeil
43
0
0
22 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yiran Chen
Hao Peng
Tong Zhang
Heng Ji
VLM
81
0
0
13 May 2025
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
84
0
0
11 May 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
164
1
0
28 Apr 2025
TextTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation
Shintaro Ozaki
Kazuki Hayashi
Yusuke Sakai
Jingun Kwon
Hidetaka Kamigaito
Katsuhiko Hayashi
Manabu Okumura
Taro Watanabe
VLM
132
0
0
25 Apr 2025
Symbolic Representation for Any-to-Any Generative Tasks
Jianfei Chen
Xiaoye Zhu
Yanjie Wang
Tianyang Liu
Xinhui Chen
...
Yifei Ke
Qingbin Liu
Yiwen Yuan
Julian McAuley
Li Li
DiffM
78
0
0
24 Apr 2025
Generalized Visual Relation Detection with Diffusion Models
Kaifeng Gao
Siqi Chen
Hanwang Zhang
Jun Xiao
Yueting Zhuang
Qianru Sun
95
0
0
16 Apr 2025
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering
Qi Zhi Lim
C. Lee
K. Lim
Kalaiarasi Sonai Muthu Anbananthen
75
0
0
11 Apr 2025
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
En Yu
Kangheng Lin
Liang Zhao
Jisheng Yin
Yana Wei
...
Zheng Ge
Xiangyu Zhang
Daxin Jiang
Jingyu Wang
Wenbing Tao
VLM
OffRL
LRM
109
18
0
10 Apr 2025
Resource-efficient Inference with Foundation Model Programs
Lunyiu Nie
Zhimin Ding
Kevin Yu
Marco Cheung
C. Jermaine
S. Chaudhuri
73
0
0
09 Apr 2025
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
Ruotian Peng
Haiying He
Yake Wei
Yandong Wen
D. Hu
VLM
72
0
0
09 Apr 2025
Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li
Ruiyi Zhang
Jian Chen
Jiuxiang Gu
Yufan Zhou
Franck Dernoncourt
Wanrong Zhu
Dinesh Manocha
Tong Sun
107
3
0
07 Apr 2025
Efficient Adaptation For Remote Sensing Visual Grounding
Hasan Moughnieh
Mohamad Chalhoub
Hasan Nasrallah
Cristiano Nattero
Paolo Campanella
Giovanni Nico
A. Ghandour
113
0
0
29 Mar 2025
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
Chen Tang
Xinzhu Ma
Encheng Su
Xiufeng Song
Xiaohong Liu
Wei-Hong Li
Lei Bai
Wanli Ouyang
Xiangyu Yue
3DGS
AI4TS
102
0
0
26 Mar 2025
Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model
Zhaochong An
Guolei Sun
Yun Liu
Runjia Li
Junlin Han
Ender Konukoglu
Serge Belongie
VLM
171
2
0
20 Mar 2025
Visual Position Prompt for MLLM based Visual Grounding
Wei Tang
Yanpeng Sun
Qinying Gu
Zechao Li
VLM
95
0
0
19 Mar 2025
SPNeRF: Open Vocabulary 3D Neural Scene Segmentation with Superpoints
Weiwen Hu
Niccolò Parodi
Marcus Zepp
I. Feldmann
O. Schreer
Peter Eisert
VLM
469
0
0
19 Mar 2025
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Tao Wang
Changxu Cheng
Lingfeng Wang
Senda Chen
Wuyue Zhao
VLM
100
1
0
17 Mar 2025
Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding
Shunqi Mao
Chaoyi Zhang
Weidong Cai
MLLM
456
1
0
13 Mar 2025
Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis
N. Islam
Dongao Ma
Jiaxuan Pang
Shivasakthi Senthil Velan
Michael B. Gotway
Jianming Liang
95
0
0
12 Mar 2025
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Yan Tai
Luhao Zhu
Zhiqiang Chen
Ynan Ding
Yiying Dong
Xiaohong Liu
Guodong Guo
MLLM
ObjD
97
0
0
10 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
150
3
0
10 Mar 2025
Secure On-Device Video OOD Detection Without Backpropagation
Li Li
Peilin Cai
Yuxiao Zhou
Zhiyu Ni
Renjie Liang
You Qin
Yi Nian
Zhuowen Tu
Xiyang Hu
Yue Zhao
OODD
FedML
123
4
0
08 Mar 2025
Generative Artificial Intelligence in Robotic Manipulation: A Survey
Kun Zhang
Peng Yun
Jun Cen
Junhao Cai
DiDi Zhu
...
Qifeng Chen
Jia Pan
Wei Zhang
Bo Yang
Hua Chen
177
1
0
05 Mar 2025
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
Rui Zhao
Weijia Mao
Mike Zheng Shou
107
1
0
05 Mar 2025
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang
Chenwei Xie
Haiyang Wang
Xiaoyi Bao
Tingyu Weng
Pandeng Li
Yun Zheng
Liwei Wang
ObjD
VLM
134
1
0
03 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
124
4
0
01 Mar 2025
Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models
Bushi Xiao
Michael Bennie
Jayetri Bardhan
Daisy Zhe Wang
113
0
0
24 Feb 2025
Can Hallucination Correction Improve Video-Language Alignment?
Lingjun Zhao
Mingyang Xie
Paola Cascante-Bonilla
Hal Daumé III
Kwonjoon Lee
HILM
VLM
130
0
0
20 Feb 2025
Megrez-Omni Technical Report
Boxun Li
Yadong Li
Zehan Li
Congyi Liu
Weilin Liu
...
Dong Zhou
Yueqing Zhuang
Shengen Yan
Guohao Dai
Yansen Wang
83
0
0
19 Feb 2025
Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition
Jielong Tang
Zhenxing Wang
Ziyang Gong
Jianxing Yu
Shuang Wang
Jian Yin
156
0
0
28 Jan 2025
PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures
Shivalika Singh
Nakul Sharma
Manish Gupta
Anand Mishra
143
1
0
28 Jan 2025
RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering
Yang Bai
Christan Earl Grant
Daisy Zhe Wang
RALM
124
1
0
23 Jan 2025
A Survey on Memory-Efficient Large-Scale Model Training in AI for Science
Kaiyuan Tian
Linbo Qiao
Baihui Liu
Gongqingjian Jiang
Dongsheng Li
106
0
0
21 Jan 2025
MASS: Overcoming Language Bias in Image-Text Matching
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
85
0
0
20 Jan 2025
1
2
3
4
...
12
13
14
Next