ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2206.08916
  4. Cited By
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

17 June 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
    ObjD
    VLM
    MLLM
ArXivPDFHTML

Papers citing "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks"

50 / 327 papers shown
Title
Locality Alignment Improves Vision-Language Models
Locality Alignment Improves Vision-Language Models
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
72
4
0
14 Oct 2024
Skipping Computations in Multimodal LLMs
Skipping Computations in Multimodal LLMs
Mustafa Shukor
Matthieu Cord
26
2
0
12 Oct 2024
CAR: Controllable Autoregressive Modeling for Visual Generation
CAR: Controllable Autoregressive Modeling for Visual Generation
Ziyu Yao
Jialin Li
Yifeng Zhou
Yong Liu
Xi Jiang
Chengjie Wang
Feng Zheng
Yuexian Zou
Lei Li
DiffM
37
13
0
07 Oct 2024
Visual-O1: Understanding Ambiguous Instructions via Multi-modal
  Multi-turn Chain-of-thoughts Reasoning
Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
Minheng Ni
Yutao Fan
Lei Zhang
Wangmeng Zuo
LRM
AI4CE
31
6
0
04 Oct 2024
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive
  Transformer for Efficient Finegrained Image Generation
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation
Liang Chen
Sinan Tan
Zefan Cai
Weichu Xie
Haozhe Zhao
Yichi Zhang
Junyang Lin
Jinze Bai
Tianyu Liu
Baobao Chang
ViT
58
3
0
02 Oct 2024
Universal Medical Image Representation Learning with Compositional
  Decoders
Universal Medical Image Representation Learning with Compositional Decoders
Kaini Wang
Ling Yang
Siping Zhou
Guangquan Zhou
Wentao Zhang
Bin Cui
Shuo Li
SSL
MedIm
36
0
0
30 Sep 2024
Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
Min Yang
Zichen Zhang
Limin Wang
AI4TS
39
0
0
27 Sep 2024
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task
  Learning Via Connector-MoE
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
Xun Zhu
Ying Hu
Fanbin Mo
Miao Li
Ji Wu
52
8
0
26 Sep 2024
ChatCam: Empowering Camera Control through Conversational AI
ChatCam: Empowering Camera Control through Conversational AI
Xinhang Liu
Yu-Wing Tai
Chi-Keung Tang
VGen
33
2
0
25 Sep 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Weifeng Lin
Xinyu Wei
Renrui Zhang
Le Zhuo
Shitian Zhao
...
Junlin Xie
Junlin Xie
Yu Qiao
Peng Gao
Hongsheng Li
MLLM
DiffM
60
10
0
23 Sep 2024
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive
  Technology
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology
Xin Jiang
Junwei Zheng
Ruiping Liu
Jiahang Li
Jiaming Zhang
Sven Matthiesen
Rainer Stiefelhagen
VLM
28
0
0
21 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal
  Reasoning with Large Language Models
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
39
1
0
19 Sep 2024
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Junjie Wen
Yichen Zhu
Jinming Li
Minjie Zhu
Kun Wu
...
Ran Cheng
Chaomin Shen
Yaxin Peng
Feifei Feng
Jian Tang
LM&Ro
74
48
0
19 Sep 2024
DETECLAP: Enhancing Audio-Visual Representation Learning with Object
  Information
DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
Shota Nakada
Taichi Nishimura
Hokuto Munakata
Masayoshi Kondo
Tatsuya Komatsu
CLIP
VLM
36
0
0
18 Sep 2024
What to align in multimodal contrastive learning?
What to align in multimodal contrastive learning?
Benoit Dufumier
J. Castillo-Navarro
D. Tuia
Jean-Philippe Thiran
29
3
0
11 Sep 2024
AWRaCLe: All-Weather Image Restoration using Visual In-Context Learning
AWRaCLe: All-Weather Image Restoration using Visual In-Context Learning
Sudarshan Rajagopalan
Vishal M. Patel
27
3
0
30 Aug 2024
A Simple and Generalist Approach for Panoptic Segmentation
A Simple and Generalist Approach for Panoptic Segmentation
Nedyalko Prisadnikov
Wouter Van Gansbeke
Danda Pani Paudel
Luc Van Gool
VLM
51
0
0
29 Aug 2024
Surprisingly Fragile: Assessing and Addressing Prompt Instability in
  Multimodal Foundation Models
Surprisingly Fragile: Assessing and Addressing Prompt Instability in Multimodal Foundation Models
Ian Stewart
Sameera Horawalavithana
Brendan Kennedy
Sai Munikoti
Karl Pazdernik
AAML
42
2
0
26 Aug 2024
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework
  for Multimodal Large Language Model
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
Chaoya Jiang
Jia Hongrui
Haiyang Xu
Wei Ye
Mengfan Dong
Ming Yan
Ji Zhang
Fei Huang
Shikun Zhang
VLM
48
1
0
22 Aug 2024
Universal Novelty Detection Through Adaptive Contrastive Learning
Universal Novelty Detection Through Adaptive Contrastive Learning
Hossein Mirzaei
Mojtaba Nafez
Mohammad Jafari
Mohammad Bagher Soltani
Mohammad Azizmalayeri
Jafar Habibi
Mohammad Sabokrou
M. Rohban
32
4
0
20 Aug 2024
DIVE: Towards Descriptive and Diverse Visual Commonsense Generation
DIVE: Towards Descriptive and Diverse Visual Commonsense Generation
Jun-Hyung Park
Hyuntae Park
Youjin Kang
Eojin Jeon
SangKeun Lee
35
0
0
15 Aug 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu
Shitian Zhao
Le Zhuo
Weifeng Lin
Ping Luo
Xinyue Li
Qi Qin
Yu Qiao
Hongsheng Li
Peng Gao
MLLM
70
48
0
05 Aug 2024
XMeCap: Meme Caption Generation with Sub-Image Adaptability
XMeCap: Meme Caption Generation with Sub-Image Adaptability
Yuyan Chen
Songzhou Yan
Zhihong Zhu
Zhixu Li
Yanghua Xiao
VLM
49
10
0
24 Jul 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li
Junfeng Wu
Weizhi Zhao
Song Bai
Xiang Bai
41
1
0
23 Jul 2024
Knowledge Acquisition Disentanglement for Knowledge-based Visual
  Question Answering with Large Language Models
Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
Wenbin An
Feng Tian
Jiahao Nie
Wenkai Shi
Haonan Lin
Yan Chen
Qianying Wang
Y. Wu
Guang Dai
Ping Chen
VLM
53
4
0
22 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
36
5
0
18 Jul 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOS
LRM
77
2
0
18 Jul 2024
Compositional Structures in Neural Embedding and Interaction
  Decompositions
Compositional Structures in Neural Embedding and Interaction Decompositions
Matthew Trager
Alessandro Achille
Pramuditha Perera
L. Zancato
Stefano Soatto
CoGe
37
0
0
12 Jul 2024
SoupLM: Model Integration in Large Language and Multi-Modal Models
SoupLM: Model Integration in Large Language and Multi-Modal Models
Yue Bai
Zichen Zhang
Jiasen Lu
Yun Fu
MoMe
35
1
0
11 Jul 2024
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language
  Model
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Yatai Ji
Shilong Zhang
Jie Wu
Peize Sun
Weifeng Chen
Xuefeng Xiao
Sidi Yang
Yanting Yang
Ping Luo
VLM
48
3
0
10 Jul 2024
Multi-Object Hallucination in Vision-Language Models
Multi-Object Hallucination in Vision-Language Models
Xuweiyi Chen
Ziqiao Ma
Xuejun Zhang
Sihan Xu
Shengyi Qian
Jianing Yang
David Fouhey
Joyce Chai
49
16
0
08 Jul 2024
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for
  Interleaved Image-Text Generation
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
Ethan Chern
Jiadi Su
Yan Ma
Pengfei Liu
MLLM
29
29
0
08 Jul 2024
VCHAR:Variance-Driven Complex Human Activity Recognition framework with
  Generative Representation
VCHAR:Variance-Driven Complex Human Activity Recognition framework with Generative Representation
Yuan Sun
Navid Salami Pargoo
Taqiya Ehsan
Zhao Zhang
Jorge Ortiz
HAI
32
3
0
03 Jul 2024
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring
  Expression Segmentation
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag
Koustava Goswami
Srikrishna Karanam
47
2
0
02 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and
  Time
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLM
MLLM
44
9
0
01 Jul 2024
Toward a Diffusion-Based Generalist for Dense Vision Tasks
Toward a Diffusion-Based Generalist for Dense Vision Tasks
Yue Fan
Yongqin Xian
Xiaohua Zhai
Alexander Kolesnikov
Muhammad Ferjad Naeem
Bernt Schiele
Federico Tombari
VLM
MDE
DiffM
47
1
0
29 Jun 2024
MACAROON: Training Vision-Language Models To Be Your Engaged Partners
MACAROON: Training Vision-Language Models To Be Your Engaged Partners
Shujin Wu
Yi R. Fung
Sha Li
Yixin Wan
Kai-Wei Chang
Heng Ji
47
5
0
20 Jun 2024
Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of
  99%
Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%
Lei Zhu
Fangyun Wei
Yanye Lu
Dong Chen
VLM
43
34
0
17 Jun 2024
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Roman Bachmann
Oğuzhan Fatih Kar
David Mizrahi
Ali Garjani
Mingfei Gao
David Griffiths
Jiaming Hu
Afshin Dehghan
Amir Zamir
MoE
VLM
MLLM
41
14
0
13 Jun 2024
Autoregressive Model Beats Diffusion: Llama for Scalable Image
  Generation
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun
Yi Jiang
Shoufa Chen
Shilong Zhang
Bingyue Peng
Ping Luo
Zehuan Yuan
VLM
66
229
0
10 Jun 2024
Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
Sucheng Ren
Xiaoke Huang
Xianhang Li
Junfei Xiao
Jieru Mei
Zeyu Wang
Alan Yuille
Yuyin Zhou
MedIm
48
7
0
08 Jun 2024
Generalist Multimodal AI: A Review of Architectures, Challenges and
  Opportunities
Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities
Sai Munikoti
Ian Stewart
Sameera Horawalavithana
Henry Kvinge
Tegan H. Emerson
Sandra E Thompson
Karl Pazdernik
38
2
0
08 Jun 2024
CODE: Contrasting Self-generated Description to Combat Hallucination in
  Large Multi-modal Models
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Junho Kim
Hyunjun Kim
Yeonju Kim
Yong Man Ro
MLLM
55
10
0
04 Jun 2024
X-VILA: Cross-Modality Alignment for Large Language Model
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Ming-Yu Liu
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM
VLM
45
30
0
29 May 2024
Multi-Modal Generative Embedding Model
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
39
3
0
29 May 2024
Multi-modal Generation via Cross-Modal In-Context Learning
Multi-modal Generation via Cross-Modal In-Context Learning
Amandeep Kumar
Muzammal Naseer
Sanath Narayan
Rao Muhammad Anwer
Salman Khan
Hisham Cholakkal
MLLM
56
0
0
28 May 2024
The Evolution of Multimodal Model Architectures
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Aman Chadha
Eugenio Culurciello
43
15
0
28 May 2024
TrojFM: Resource-efficient Backdoor Attacks against Very Large
  Foundation Models
TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models
Yuzhou Nie
Yanting Wang
Jinyuan Jia
Michael J. De Lucia
Nathaniel D. Bastian
Wenbo Guo
Dawn Song
SILM
AAML
36
5
0
27 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to
  Multimodal Inputs
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
71
5
0
26 May 2024
Activator: GLU Activation Function as the Core Component of a Vision
  Transformer
Activator: GLU Activation Function as the Core Component of a Vision Transformer
Abdullah Nazhat Abdullah
Tarkan Aydin
ViT
43
0
0
24 May 2024
Previous
1234567
Next