ResearchTrend.AI

© 2025 ResearchTrend.AI, All rights reserved.

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
arXiv 2202.03052 (versions v1 and v2; v2 is the latest)

7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM, ObjD
arXiv (abs), PDF, HTML, GitHub (2502★)

Papers citing "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"

50 / 656 papers shown
Title
Ferret: Refer and Ground Anything Anywhere at Any Granularity
Haoxuan You
Haotian Zhang
Zhe Gan
Xianzhi Du
Bowen Zhang
Zirui Wang
Liangliang Cao
Shih-Fu Chang
Yinfei Yang
ObjD, MLLM, VLM
139
328
0
11 Oct 2023
Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023
Xiangyu Wu
Yang Yang
Shengdong Xu
Yifeng Wu
Qingguo Chen
Jianfeng Lu
54
1
0
10 Oct 2023
The Solution for the CVPR2023 NICE Image Captioning Challenge
Xiangyu Wu
Yi Gao
Hailiang Zhang
Yang Yang
Weili Guo
Jianfeng Lu
58
1
0
10 Oct 2023
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
KAI-QING Zhou
Kwonjoon Lee
Teruhisa Misu
Xin Eric Wang
LRM
102
4
0
09 Oct 2023
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
Holy Lovenia
Wenliang Dai
Samuel Cahyawijaya
Ziwei Ji
Pascale Fung
MLLM
105
53
0
09 Oct 2023
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
65
2
0
08 Oct 2023
Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling
Haogeng Liu
Qihang Fan
Tingkai Liu
Linjie Yang
Yunzhe Tao
Huaibo Huang
Ran He
Hongxia Yang
VGen
55
12
0
08 Oct 2023
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Avinash Madasu
Anahita Bhiwandiwalla
Vasudev Lal
VLM
74
0
0
07 Oct 2023
VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
Ziyi Yin
Muchao Ye
Tianrong Zhang
Tianyu Du
Jinguo Zhu
Han Liu
Jinghui Chen
Ting Wang
Fenglong Ma
AAML, VLM, CoGe
89
44
0
07 Oct 2023
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
Yiren Jian
Tingkai Liu
Yunzhe Tao
Chunhui Zhang
Soroush Vosoughi
HX Yang
VLM
87
12
0
05 Oct 2023
TWIZ-v2: The Wizard of Multimodal Conversational-Stimulus
Rafael Ferreira
Diogo Tavares
Diogo Glória-Silva
Rodrigo Valerio
João Bordalo
Ines Simoes
Vasco Ramos
David Semedo
João Magalhães
43
4
0
03 Oct 2023
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Shiyu Xuan
Qingpei Guo
Ming Yang
Shiliang Zhang
MLLM, ObjD
81
40
0
01 Oct 2023
InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists
Yulu Gan
Sungwoo Park
Alexander Schubert
Anthony Philippakis
Ahmed Alaa
VLM
109
25
0
30 Sep 2023
AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ
Jonas Belouadi
Anne Lauscher
Steffen Eger
75
31
0
30 Sep 2023
Semantic Scene Difference Detection in Daily Life Patroling by Mobile Robots using Pre-Trained Large-Scale Vision-Language Model
Yoshiki Obinata
Kento Kawaharazuka
Naoaki Kanazawa
N. Yamaguchi
Naoto Tsukamoto
Iori Yanokura
Shingo Kitagawa
Koki Shinjo
K. Okada
Masayuki Inaba
LM&Ro
65
6
0
28 Sep 2023
Toloka Visual Question Answering Benchmark
Mert Pilanci
Nikita Pavlichenko
Sergey Koshelev
Daniil Likhobaba
Alisa Smirnova
81
4
0
28 Sep 2023
Teaching Text-to-Image Models to Communicate in Dialog
Xiaowen Sun
Jiazhan Feng
Yuxuan Wang
Yuxuan Lai
Xingyu Shen
Dongyan Zhao
DiffM
66
1
0
27 Sep 2023
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
Z. Yao
Xiaoxia Wu
Conglong Li
Minjia Zhang
Heyang Qi
Olatunji Ruwase
A. A. Awan
Samyam Rajbhandari
Yuxiong He
93
11
0
25 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
128
7
0
23 Sep 2023
Synthetic Boost: Leveraging Synthetic Data for Enhanced Vision-Language Segmentation in Echocardiography
Rabin Adhikari
Manish Dhakal
Safal Thapaliya
K. Poudel
Prasiddha Bhandari
Bishesh Khanal
82
8
0
22 Sep 2023
Does My Dog "Speak" Like Me? The Acoustic Correlation between Pet Dogs and Their Human Owners
Jieyi Huang
Chunhao Zhang
Yufei Wang
Mengyue Wu
Ke Zhu
38
0
0
21 Sep 2023
DreamLLM: Synergistic Multimodal Comprehension and Creation
Runpei Dong
Chunrui Han
Yuang Peng
Zekun Qi
Zheng Ge
...
Hao-Ran Wei
Xiangwen Kong
Xiangyu Zhang
Kaisheng Ma
Li Yi
MLLM
111
199
0
20 Sep 2023
Grasp-Anything: Large-scale Grasp Dataset from Foundation Models
An Vuong
Minh Nhat Vu
Hieu Le
Baoru Huang
B. Huynh
T. Vo
Andreas Kugi
Anh Nguyen
VLM
98
32
0
18 Sep 2023
PromptST: Prompt-Enhanced Spatio-Temporal Multi-Attribute Prediction
Zijian Zhang
Xiangyu Zhao
Qidong Liu
Chunxu Zhang
Qian Ma
Wanyu Wang
Hongwei Zhao
Yiqi Wang
Zitao Liu
AI4TS
140
21
0
18 Sep 2023
PROGrasp: Pragmatic Human-Robot Communication for Object Grasping
Gi-Cheon Kang
Junghyun Kim
Jaein Kim
Byoung-Tak Zhang
105
5
0
14 Sep 2023
Cognitive Mirage: A Review of Hallucinations in Large Language Models
Hongbin Ye
Tong Liu
Aijia Zhang
Wei Hua
Weiqiang Jia
HILM
122
81
0
13 Sep 2023
Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals
Ran Liu
Ellen L. Zippi
Hadi Pouransari
Chris Sandino
Jingping Nie
Hanlin Goh
Erdrin Azemi
Ali Moin
98
12
0
12 Sep 2023
NExT-GPT: Any-to-Any Multimodal LLM
Shengqiong Wu
Hao Fei
Leigang Qu
Wei Ji
Tat-Seng Chua
MLLM
117
507
0
11 Sep 2023
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
LRM
99
27
0
08 Sep 2023
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Zigang Geng
Binxin Yang
Tiankai Hang
Chen Li
Shuyang Gu
...
Jianmin Bao
Zheng Zhang
Han Hu
DongDong Chen
Baining Guo
DiffM, VLM
123
107
0
07 Sep 2023
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners
Clarence Lee
M Ganesh Kumar
Cheston Tan
76
3
0
07 Sep 2023
NICE: CVPR 2023 Challenge on Zero-shot Image Captioning
Taehoon Kim
Pyunghwan Ahn
Sangyun Kim
Sihaeng Lee
Mark A Marsden
...
Yujin Wang
Yimu Wang
Tiancheng Gu
Xingchang Lv
Mingmao Sun
VLM
132
6
0
05 Sep 2023
Automatic Diary Generation System including Information on Joint Experiences between Humans and Robots
Aiko Ichikura
Kento Kawaharazuka
Yoshiki Obinata
Koki Shinjo
K. Okada
Masayuki Inaba
24
3
0
05 Sep 2023
Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot
Naoaki Kanazawa
Kento Kawaharazuka
Yoshiki Obinata
K. Okada
Masayuki Inaba
LM&Ro
45
6
0
04 Sep 2023
MAGMA: Music Aligned Generative Motion Autodecoder
Sohan Anisetty
Amit Raj
James Hays
56
0
0
03 Sep 2023
A Fine-Grained Image Description Generation Method Based on Joint Objectives
Yifan Zhang
Chunzhen Lin
Donglin Cao
Dazhen Lin
EGVM
37
0
0
02 Sep 2023
Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding
Joshua Forster Feinglass
Yezhou Yang
53
2
0
01 Sep 2023
TouchStone: Evaluating Vision-Language Models by Language Models
Shuai Bai
Shusheng Yang
Jinze Bai
Peng Wang
Xing Zhang
Junyang Lin
Xinggang Wang
Chang Zhou
Jingren Zhou
MLLM
119
48
0
31 Aug 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhiyong Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
96
8
0
31 Aug 2023
Towards Unified Token Learning for Vision-Language Tracking
Yaozong Zheng
Bineng Zhong
Qihua Liang
Guorong Li
Rongrong Ji
Xianxian Li
132
36
0
27 Aug 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM, VLM, ObjD
196
945
0
24 Aug 2023
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks
Zichao Dong
Weikun Zhang
Xufeng Huang
Hang Ji
Xin Zhan
Junbo Chen
VLM
47
4
0
24 Aug 2023
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Jinyi Hu
Yuan Yao
Chong Wang
Shanonan Wang
Yinxu Pan
...
Yankai Lin
Jiao Xue
Dahai Li
Zhiyuan Liu
Maosong Sun
MLLM, VLM
120
56
0
23 Aug 2023
RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D
Shuhei Kurita
Naoki Katsura
Eri Onami
EgoV
89
14
0
23 Aug 2023
Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models
Zengxiang Li
Zhaoxiang Hou
Hui Liu
Ying Wang
Tongzhi Li
...
Chao Shi
Che-Sheng Yang
Weishan Zhang
Zelei Liu
Liang Xu
FedML
49
2
0
22 Aug 2023
Instruction Tuning for Large Language Models: A Survey
Shengyu Zhang
Linfeng Dong
Xiaoya Li
Sen Zhang
Xiaofei Sun
...
Jiwei Li
Runyi Hu
Tianwei Zhang
Leilei Gan
Guoyin Wang
LM&MA
113
610
0
21 Aug 2023
Whether you can locate or not? Interactive Referring Expression Generation
Fulong Ye
Yuxing Long
Fangxiang Feng
Xiaojie Wang
74
4
0
19 Aug 2023
Tackling Vision Language Tasks Through Learning Inner Monologues
Diji Yang
Kezhen Chen
Jinmeng Rao
Xiaoyuan Guo
Yawen Zhang
Jie Yang
Yize Zhang
MLLM
99
11
0
19 Aug 2023
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
Wenbo Hu
Y. Xu
Yuante Li
W. Li
Zhe Chen
Zhuowen Tu
MLLM, VLM
109
133
0
19 Aug 2023
A tailored Handwritten-Text-Recognition System for Medieval Latin
Philipp Koch
Gilary Vera Nunez
Esteban Garces Arias
C. Heumann
Matthias Schoffel
Alexander Haberlin
Yi Men
68
2
0
18 Aug 2023