OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM, ObjD
ArXiv (abs) | PDF | HTML | GitHub (2502★)

Papers citing "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"

50 / 656 papers shown
Title
DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability
Runhu Huang
Jianhua Han
Guansong Lu
Xiaodan Liang
Yihan Zeng
Wei Zhang
Hang Xu
DiffM
62
2
0
18 Aug 2023
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
Fawaz Sammani
Nikos Deligiannis
50
5
0
17 Aug 2023
Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment
Qi Chen
Chaorui Deng
Zixiong Huang
Bowen Zhang
Mingkui Tan
Qi Wu
EGVM
105
0
0
16 Aug 2023
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
Kaicheng Yang
Jiankang Deng
Xiang An
Jiawei Li
Ziyong Feng
Jia Guo
Jing Yang
Tongliang Liu
VLM, CLIP
87
52
0
16 Aug 2023
Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models
K. Poudel
Manish Dhakal
Prasiddha Bhandari
Rabin Adhikari
Safal Thapaliya
Bishesh Khanal
VLM
146
20
0
15 Aug 2023
Foundation Model is Efficient Multimodal Multitask Model Selector
Fanqing Meng
Wenqi Shao
Zhanglin Peng
Chong Jiang
Kaipeng Zhang
Yu Qiao
Ping Luo
67
17
0
11 Aug 2023
OpenProteinSet: Training data for structural biology at scale
Gustaf Ahdritz
N. Bouatta
S. Kadyan
Lukas Jarosch
Daniel Berenberg
Ian Fisk
Andrew Watkins
Stephen Ra
Richard Bonneau
Mohammed AlQuraishi
AI4CE
115
12
0
10 Aug 2023
Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On
Daiheng Gao
Xu Chen
Xindi Zhang
Qi Wang
Ke Sun
Bang Zhang
Liefeng Bo
Qi-Xing Huang
DiffM
76
5
0
08 Aug 2023
Distributionally Robust Classification on a Data Budget
Ben Feuer
Ameya Joshi
Minh Pham
Chinmay Hegde
OOD
75
2
0
07 Aug 2023
Foundation Model based Open Vocabulary Task Planning and Executive System for General Purpose Service Robots
Yoshiki Obinata
Naoaki Kanazawa
Kento Kawaharazuka
Iori Yanokura
Soon-Hyeob Kim
K. Okada
Masayuki Inaba
LM&Ro
40
7
0
07 Aug 2023
Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models
Zheng Ma
Mianzhi Pan
Wenhan Wu
Ka Leong Cheng
Jianbing Zhang
Shujian Huang
Jiajun Chen
VLM, CoGe
76
5
0
06 Aug 2023
A Comprehensive Analysis of Real-World Image Captioning and Scene Identification
Sai Suprabhanu Nallapaneni
Subrahmanyam Konakanchi
66
2
0
05 Aug 2023
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension
Qiang-feng Zhou
Chaohui Yu
Shaofeng Zhang
Sitong Wu
Zhibin Wang
Fan Wang
82
27
0
03 Aug 2023
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
Ka Leong Cheng
Wenpo Song
Zheng Ma
Wenhao Zhu
Zi-Yue Zhu
Jianbing Zhang
CLIP, VLM
65
11
0
02 Aug 2023
Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding
Runyu Ding
Jihan Yang
Chuhui Xue
Wenqing Zhang
Song Bai
Xiaojuan Qi
3DV, VLM
84
29
0
01 Aug 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe, MLLM
126
46
0
30 Jul 2023
Described Object Detection: Liberating Object Detection with Flexible Expressions
Chi Xie
Zhao Zhang
YiXuan Wu
Feng Zhu
Rui Zhao
Shuang Liang
ObjD
89
35
0
24 Jul 2023
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Jie Ma
Pinghui Wang
Dechen Kong
Zewei Wang
Jun Liu
Hongbin Pei
Junzhou Zhao
OOD
126
23
0
21 Jul 2023
FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback
Ashish Singh
Ashutosh Singh
Prateek R. Agarwal
Zixuan Huang
Arpita Singh
...
Ryan Rossi
Puneet Mathur
Erik Learned-Miller
Franck Dernoncourt
Ryan Rossi
104
8
0
20 Jul 2023
Findings of Factify 2: Multimodal Fake News Detection
S. Suryavardan
Shreyash Mishra
Megha Chakraborty
Parth Patwa
Anku Rani
...
Amitava Das
Amit P. Sheth
Manoj Kumar Chinnakotla
Asif Ekbal
Srijan Kumar
78
14
0
19 Jul 2023
Improving Multimodal Datasets with Image Captioning
Thao Nguyen
S. Gadre
Gabriel Ilharco
Sewoong Oh
Ludwig Schmidt
VLM
99
77
0
19 Jul 2023
Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions
Yui Iioka
Y. Yoshida
Yuiga Wada
Shumpei Hatanaka
K. Sugiura
DiffM
116
6
0
17 Jul 2023
Tangent Model Composition for Ensembling and Continual Fine-tuning
Tianlin Liu
Stefano Soatto
LRM, MoMe, CLL
84
17
0
16 Jul 2023
Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks
Ryosuke Korekata
Motonari Kambara
Yusuke Yoshida
Shintaro Ishikawa
Yosuke Kawasaki
Masaki Takahashi
K. Sugiura
LM&Ro
81
5
0
14 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM, MLLM
100
31
0
13 Jul 2023
Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks
Denis Coquenet
Clément Rambour
Emanuele Dalsasso
Nicolas Thome
MLLM, CLIP, VLM
49
1
0
13 Jul 2023
GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation
Junghyun Kim
Gi-Cheon Kang
Jaein Kim
Suyeon Shin
Byoung-Tak Zhang
LM&Ro
82
7
0
12 Jul 2023
Prototypical Contrastive Transfer Learning for Multimodal Language Understanding
Seitaro Otsuki
Shintaro Ishikawa
K. Sugiura
79
1
0
12 Jul 2023
DRMC: A Generalist Model with Dynamic Routing for Multi-Center PET Image Synthesis
Zhiwen Yang
Yang Zhou
Hui Zhang
Bingzheng Wei
Yubo Fan
Yan Xu
MedIm
57
3
0
11 Jul 2023
Emu: Generative Pretraining in Multimodality
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
129
138
0
11 Jul 2023
KU-DMIS-MSRA at RadSum23: Pre-trained Vision-Language Model for Radiology Report Summarization
Gangwoo Kim
Hajung Kim
Lei Ji
Seongsu Bae
Chanhwi Kim
Mujeen Sung
Hyunjae Kim
Kun Yan
E. Chang
Jaewoo Kang
VLM
46
2
0
10 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
53
5
0
06 Jul 2023
AVSegFormer: Audio-Visual Segmentation with Transformer
Sheng Gao
Zhe Chen
Guo Chen
Wenhai Wang
Tong Lu
VOS
113
52
0
03 Jul 2023
Visual Instruction Tuning with Polite Flamingo
Delong Chen
Jianfeng Liu
Wenliang Dai
Baoyuan Wang
MLLM
106
48
0
03 Jul 2023
JourneyDB: A Benchmark for Generative Image Understanding
Keqiang Sun
Junting Pan
Yuying Ge
Hao Li
Haodong Duan
...
Yi Wang
Jifeng Dai
Yu Qiao
Limin Wang
Hongsheng Li
129
120
0
03 Jul 2023
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
Rui Sun
Zhecan Wang
Haoxuan You
Noel Codella
Kai-Wei Chang
Shih-Fu Chang
CLIP
108
4
0
03 Jul 2023
CLIPAG: Towards Generator-Free Text-to-Image Generation
Roy Ganz
Michael Elad
VLM
82
8
0
29 Jun 2023
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Ke Chen
Zhao Zhang
Weili Zeng
Richong Zhang
Feng Zhu
Rui Zhao
ObjD
133
652
0
27 Jun 2023
A Survey on Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Xu
Enhong Chen
MLLM, LRM
138
613
0
23 Jun 2023
Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation
Zihao Yue
Anwen Hu
Liang Zhang
Qin Jin
101
2
0
23 Jun 2023
AudioPaLM: A Large Language Model That Can Speak and Listen
Paul Kishan Rubenstein
Chulayuth Asawaroengchai
D. Nguyen
Ankur Bapna
Zalan Borsos
...
Neil Zeghidour
Yu Zhang
Zhishuai Zhang
Lukáš Zilka
Christian Frank
LM&MA, AuLLM, VLM
138
295
0
22 Jun 2023
Generative Multimodal Entity Linking
Senbao Shi
Zhenran Xu
Baotian Hu
Hao Fei
MLLM, VLM
62
6
0
22 Jun 2023
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Hugo Laurençon
Lucile Saulnier
Léo Tronchon
Stas Bekman
Amanpreet Singh
...
Siddharth Karamcheti
Alexander M. Rush
Douwe Kiela
Matthieu Cord
Victor Sanh
161
246
0
21 Jun 2023
ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining
Dezhi Peng
Chongyu Liu
Yuliang Liu
Lianwen Jin
DiffM
79
10
0
21 Jun 2023
Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
Simone Bianco
Luigi Celona
Marco Donzella
Paolo Napoletano
75
20
0
20 Jun 2023
Listener Model for the PhotoBook Referential Game with CLIPScores as Implicit Reference Chain
Shih-Lun Wu
Yi-Hui Chou
Liang Li
61
0
0
16 Jun 2023
Tell Me Where to Go: A Composable Framework for Context-Aware Embodied Robot Navigation
Harel Biggie
Ajay Narasimha Mopidevi
Dusty Woods
Christoffer Heckman
LM&Ro
67
11
0
15 Jun 2023
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Chenyang Lyu
Minghao Wu
Longyue Wang
Xinting Huang
Bingshuai Liu
Zefeng Du
Shuming Shi
Zhaopeng Tu
MLLM, AuLLM
86
173
0
15 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Qingbin Liu
VLM, CLIP
83
9
0
15 Jun 2023
Training Multimedia Event Extraction With Generated Images and Captions
Zilin Du
Yunxin Li
Xu Guo
Yidan Sun
Boyang Albert Li
DiffM
88
8
0
15 Jun 2023