ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2202.03052
  4. Cited By
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple
  Sequence-to-Sequence Learning Framework
v1v2 (latest)

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

7 February 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
    MLLMObjD
ArXiv (abs)PDFHTMLGithub (2502★)

Papers citing "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"

50 / 656 papers shown
Title
Reflex-Based Open-Vocabulary Navigation without Prior Knowledge Using
  Omnidirectional Camera and Multiple Vision-Language Models
Reflex-Based Open-Vocabulary Navigation without Prior Knowledge Using Omnidirectional Camera and Multiple Vision-Language Models
Kento Kawaharazuka
Yoshiki Obinata
Naoaki Kanazawa
Naoto Tsukamoto
Kei Okada
Masayuki Inaba
LM&Ro
65
0
0
21 Aug 2024
UniFashion: A Unified Vision-Language Model for Multimodal Fashion
  Retrieval and Generation
UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
Xiangyu Zhao
Yuehan Zhang
Wenlong Zhang
X. Wu
89
6
0
21 Aug 2024
Towards Flexible Visual Relationship Segmentation
Towards Flexible Visual Relationship Segmentation
Fangrui Zhu
Jianwei Yang
Huaizu Jiang
VOS
100
2
0
15 Aug 2024
LLMI3D: MLLM-based 3D Perception from a Single 2D Image
LLMI3D: MLLM-based 3D Perception from a Single 2D Image
Fan Yang
Sicheng Zhao
Yanhao Zhang
Haoxiang Chen
Hui Chen
Wenbo Tang
Guiguang Ding
85
3
0
14 Aug 2024
CROME: Cross-Modal Adapters for Efficient Multimodal LLM
CROME: Cross-Modal Adapters for Efficient Multimodal LLM
Sayna Ebrahimi
Sercan O. Arik
Tejas Nama
Tomas Pfister
79
1
0
13 Aug 2024
SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning
SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning
Yuze Zhao
Jintao Huang
Jinghan Hu
Xingjun Wang
Yunlin Mao
...
Zhikai Wu
Baole Ai
Ang Wang
Wenmeng Zhou
Yingda Chen
124
47
0
10 Aug 2024
Enhancing Journalism with AI: A Study of Contextualized Image Captioning
  for News Articles using LLMs and LMMs
Enhancing Journalism with AI: A Study of Contextualized Image Captioning for News Articles using LLMs and LMMs
Aliki Anagnostopoulou
Thiago S. Gouvêa
Daniel Sonntag
83
2
0
08 Aug 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu
Shitian Zhao
Le Zhuo
Weifeng Lin
Ping Luo
Xinyue Li
Qi Qin
Yu Qiao
Hongsheng Li
Peng Gao
MLLM
168
59
0
05 Aug 2024
An Efficient and Effective Transformer Decoder-Based Framework for
  Multi-Task Visual Grounding
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
Wei Chen
Mahdieh Hatamian
Yu Wu
102
5
0
02 Aug 2024
Deep Learning based Visually Rich Document Content Understanding: A Survey
Deep Learning based Visually Rich Document Content Understanding: A Survey
Muhammad Ali
Jean Lee
Salman Khan
Eduard Hovy
111
6
0
02 Aug 2024
Look Hear: Gaze Prediction for Speech-directed Human Attention
Look Hear: Gaze Prediction for Speech-directed Human Attention
Sounak Mondal
Seoyoung Ahn
Zhibo Yang
Niranjan Balasubramanian
Dimitris Samaras
G. Zelinsky
Minh Hoai
90
2
0
28 Jul 2024
Answerability Fields: Answerable Location Estimation via Diffusion
  Models
Answerability Fields: Answerable Location Estimation via Diffusion Models
Daich Azuma
Taiki Miyanishi
Shuhei Kurita
Koya Sakamoto
M. Kawanabe
DiffM
81
0
0
26 Jul 2024
An Efficient Inference Framework for Early-exit Large Language Models
An Efficient Inference Framework for Early-exit Large Language Models
Ruijie Miao
Yihan Yan
Xinshuo Yao
Tong Yang
59
0
0
25 Jul 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li
Junfeng Wu
Weizhi Zhao
Song Bai
Xiang Bai
81
3
0
23 Jul 2024
UniMEL: A Unified Framework for Multimodal Entity Linking with Large
  Language Models
UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models
Liu Qi
Yongyi He
Lian Defu
Zhi Zheng
Tong Xu
Liu Che
Chen Enhong
MLLM
76
2
0
23 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
70
2
0
22 Jul 2024
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal
  Reasoning
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
Zhecan Wang
Garrett Bingham
Adams Wei Yu
Quoc V. Le
Thang Luong
Golnaz Ghiasi
MLLMLRM
137
13
0
22 Jul 2024
Influencer: Empowering Everyday Users in Creating Promotional Posts via
  AI-infused Exploration and Customization
Influencer: Empowering Everyday Users in Creating Promotional Posts via AI-infused Exploration and Customization
Xuye Liu
Annie Sun
Pengcheng An
Tengfei Ma
Jian Zhao
68
0
0
20 Jul 2024
Can VLMs be used on videos for action recognition? LLMs are Visual
  Reasoning Coordinators
Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators
Harsh Lunia
50
1
0
20 Jul 2024
I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models
  Through 3D Reconstruction
I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction
Zaiqiao Meng
Hao Zhou
Yifang Chen
68
4
0
19 Jul 2024
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
S. Swetha
Jinyu Yang
T. Neiman
Mamshad Nayeem Rizve
Son Tran
Benjamin Z. Yao
Trishul Chilimbi
Mubarak Shah
112
2
0
18 Jul 2024
Open Vocabulary 3D Scene Understanding via Geometry Guided
  Self-Distillation
Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation
Pengfei Wang
Yuxi Wang
Shuai Li
Zhaoxiang Zhang
Zhen Lei
Lei Zhang
105
3
0
18 Jul 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOSLRM
178
5
0
18 Jul 2024
Controllable Contextualized Image Captioning: Directing the Visual
  Narrative through User-Defined Highlights
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights
Shunqi Mao
Chaoyi Zhang
Hang Su
Hwanjun Song
Igor Shalyminov
Weidong Cai
76
1
0
16 Jul 2024
Mutual Learning for Acoustic Matching and Dereverberation via Visual
  Scene-driven Diffusion
Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
Jian Ma
Wenguan Wang
Yi Yang
Feng Zheng
DiffM
89
0
0
15 Jul 2024
ActionVOS: Actions as Prompts for Video Object Segmentation
ActionVOS: Actions as Prompts for Video Object Segmentation
Liangyang Ouyang
Ruicong Liu
Yifei Huang
Ryosuke Furuta
Yoichi Sato
VOS
79
2
0
10 Jul 2024
A Single Transformer for Scalable Vision-Language Modeling
A Single Transformer for Scalable Vision-Language Modeling
Yangyi Chen
Xingyao Wang
Hao Peng
Heng Ji
LRM
107
17
0
08 Jul 2024
Second Place Solution of WSDM2023 Toloka Visual Question Answering
  Challenge
Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge
Xiangyu Wu
Zhouyang Chi
Yang Yang
Jianfeng Lu
69
0
0
05 Jul 2024
Uncertainty-Guided Optimization on Large Language Model Search Trees
Uncertainty-Guided Optimization on Large Language Model Search Trees
Julia Grosse
Ruotian Wu
Ahmad Rashid
Philipp Hennig
Pascal Poupart
Agustinus Kristiadi
107
3
0
04 Jul 2024
VCHAR:Variance-Driven Complex Human Activity Recognition framework with
  Generative Representation
VCHAR:Variance-Driven Complex Human Activity Recognition framework with Generative Representation
Yuan Sun
Navid Salami Pargoo
Taqiya Ehsan
Zhao Zhang
Jorge Ortiz
HAI
39
3
0
03 Jul 2024
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring
  Expression Segmentation
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag
Koustava Goswami
Srikrishna Karanam
107
4
0
02 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and
  Time
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLMMLLM
98
15
0
01 Jul 2024
Toward a Diffusion-Based Generalist for Dense Vision Tasks
Toward a Diffusion-Based Generalist for Dense Vision Tasks
Yue Fan
Yongqin Xian
Xiaohua Zhai
Alexander Kolesnikov
Muhammad Ferjad Naeem
Bernt Schiele
Federico Tombari
VLMMDEDiffM
60
1
0
29 Jun 2024
RAVEN: Multitask Retrieval Augmented Vision-Language Learning
RAVEN: Multitask Retrieval Augmented Vision-Language Learning
Varun Nagaraj Rao
Siddharth Choudhary
Aditya Deshpande
R. Satzoda
Srikar Appalaraju
RALMVLM
92
5
0
27 Jun 2024
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Wei Su
Peihan Miao
Huanzhang Dou
Xi Li
ObjD
105
9
0
26 Jun 2024
Chrono: A Simple Blueprint for Representing Time in MLLMs
Chrono: A Simple Blueprint for Representing Time in MLLMs
Meinardus Boris
Batra Anil
Rohrbach Anna
Rohrbach Marcus
Marcus Rohrbach
MLLMVLM
97
4
0
26 Jun 2024
Revisiting Referring Expression Comprehension Evaluation in the Era of
  Large Multimodal Models
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
Jierun Chen
Fangyun Wei
Jinjing Zhao
Sizhe Song
Bohuai Wu
Zhuoxuan Peng
S.-H. Gary Chan
Hongyang R. Zhang
103
9
0
24 Jun 2024
Does Object Grounding Really Reduce Hallucination of Large
  Vision-Language Models?
Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?
Gregor Geigle
Radu Timofte
Goran Glavaš
83
0
0
20 Jun 2024
ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension
ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension
Tianren Ma
Lingxi Xie
Yunjie Tian
Boyu Yang
Yuan Zhang
80
0
0
17 Jun 2024
WildVision: Evaluating Vision-Language Models in the Wild with Human
  Preferences
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
Yujie Lu
Dongfu Jiang
Wenhu Chen
William Yang Wang
Yejin Choi
Bill Yuchen Lin
VLM
112
33
0
16 Jun 2024
Precision Empowers, Excess Distracts: Visual Question Answering With
  Dynamically Infused Knowledge In Language Models
Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models
Manas Jhalani
Annervaz K M
Pushpak Bhattacharyya
38
0
0
14 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLMLRM
87
1
0
13 Jun 2024
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Roman Bachmann
Oğuzhan Fatih Kar
David Mizrahi
Ali Garjani
Mingfei Gao
David Griffiths
Jiaming Hu
Afshin Dehghan
Amir Zamir
MoEVLMMLLM
113
17
0
13 Jun 2024
Language-driven Grasp Detection
Language-driven Grasp Detection
An Dinh Vuong
Minh Nhat Vu
Baoru Huang
Nghia Nguyen
Hieu Le
T. Vo
Anh Nguyen
VLM
116
19
0
13 Jun 2024
Multimodal Table Understanding
Multimodal Table Understanding
Mingyu Zheng
Xinwei Feng
Q. Si
Qiaoqiao She
Zheng Lin
Wenbin Jiang
Weiping Wang
LMTDVLM
145
20
0
12 Jun 2024
Image Textualization: An Automatic Framework for Creating Accurate and
  Detailed Image Descriptions
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
Renjie Pi
Jianshu Zhang
Jipeng Zhang
Boyao Wang
Zhekai Chen
Tong Zhang
3DV
87
24
0
11 Jun 2024
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based
  Reformulation and Box-Based Segmentation
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation
Jinyuan Li
Ziyan Li
Han Li
Jianfei Yu
Rui Xia
Di Sun
Gang Pan
69
2
0
11 Jun 2024
RWKV-CLIP: A Robust Vision-Language Representation Learner
RWKV-CLIP: A Robust Vision-Language Representation Learner
Tiancheng Gu
Kaicheng Yang
Xiang An
Ziyong Feng
Dongnan Liu
Weidong Cai
Jiankang Deng
VLMCLIP
105
14
0
11 Jun 2024
BrainChat: Decoding Semantic Information from fMRI using Vision-language
  Pretrained Models
BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models
Wanaiu Huang
61
2
0
10 Jun 2024
Open-Vocabulary Part-Based Grasping
Open-Vocabulary Part-Based Grasping
Tjeard van Oort
Dimity Miller
Will N. Browne
Nicolas Marticorena
Jesse Haviland
Niko Suenderhauf
3DPC
75
2
0
10 Jun 2024
Previous
123456...121314
Next