ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2102.10772
  4. Cited By
UniT: Multimodal Multitask Learning with a Unified Transformer

UniT: Multimodal Multitask Learning with a Unified Transformer

22 February 2021
Ronghang Hu
Amanpreet Singh
    ViT
ArXivPDFHTML

Papers citing "UniT: Multimodal Multitask Learning with a Unified Transformer"

50 / 60 papers shown
Title
ALFEE: Adaptive Large Foundation Model for EEG Representation
ALFEE: Adaptive Large Foundation Model for EEG Representation
Wei Xiong
Junming Lin
Jiangtong Li
Jie Li
Changjun Jiang
33
0
0
07 May 2025
Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices
Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices
Kilian Pfeiffer
Mohamed Aboelenien Ahmed
R. Khalili
J. Henkel
40
0
0
12 Nov 2024
Towards Attention-based Contrastive Learning for Audio Spoof Detection
Towards Attention-based Contrastive Learning for Audio Spoof Detection
C. Goel
Surya Koppisetti
Ben Colman
Ali Shahriyari
Gaurav Bharaj
60
5
0
03 Jul 2024
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion
Zizhao Hu
Mohammad Rostami
34
0
0
25 May 2024
Contextual Chart Generation for Cyber Deception
Contextual Chart Generation for Cyber Deception
David D. Nguyen
David Liebowitz
Surya Nepal
S. Kanhere
Sharif Abuadbba
49
0
0
07 Apr 2024
Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition
Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition
Yash Jain
David M. Chan
Pranav Dheram
Aparna Khare
Olabanji Shonibare
Venkatesh Ravichandran
Shalini Ghosh
40
2
0
28 Mar 2024
Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation
Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation
Zhekai Du
Xinyao Li
Fengling Li
Ke Lu
Lei Zhu
Jingjing Li
43
15
0
05 Mar 2024
Convincing Rationales for Visual Question Answering Reasoning
Convincing Rationales for Visual Question Answering Reasoning
Kun Li
G. Vosselman
Michael Ying Yang
44
1
0
06 Feb 2024
4M: Massively Multimodal Masked Modeling
4M: Massively Multimodal Masked Modeling
David Mizrahi
Roman Bachmann
Ouguzhan Fatih Kar
Teresa Yeo
Mingfei Gao
Afshin Dehghan
Amir Zamir
MLLM
50
63
0
11 Dec 2023
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video
  Grounding with Multimodal Large Language Model
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model
Guozhang Li
Xinpeng Ding
De-Chun Cheng
Jie Li
Nannan Wang
Xinbo Gao
34
1
0
05 Dec 2023
OmniVec: Learning robust representations with cross modal sharing
OmniVec: Learning robust representations with cross modal sharing
Siddharth Srivastava
Gaurav Sharma
SSL
29
64
0
07 Nov 2023
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Chia-Wen Kuo
Z. Kira
34
21
0
25 May 2023
i-Code Studio: A Configurable and Composable Framework for Integrative
  AI
i-Code Studio: A Configurable and Composable Framework for Integrative AI
Yuwei Fang
Mahmoud Khademi
Chenguang Zhu
Ziyi Yang
Reid Pryzant
...
Yao Qian
Takuya Yoshioka
Lu Yuan
Michael Zeng
Xuedong Huang
35
2
0
23 May 2023
MTLSegFormer: Multi-task Learning with Transformers for Semantic
  Segmentation in Precision Agriculture
MTLSegFormer: Multi-task Learning with Transformers for Semantic Segmentation in Precision Agriculture
D. Gonçalves
J. M. Junior
Pedro Zamboni
H. Pistori
Jonathan Li
Keiller Nogueira
W. Gonçalves
37
5
0
04 May 2023
Multimodal Data Integration for Oncology in the Era of Deep Neural
  Networks: A Review
Multimodal Data Integration for Oncology in the Era of Deep Neural Networks: A Review
Asim Waqas
Aakash Tripathi
Ravichandran Ramachandran
Paul Stewart
Ghulam Rasool
AI4CE
37
31
0
11 Mar 2023
Semantics-Aware Dynamic Localization and Refinement for Referring Image
  Segmentation
Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation
Zhao Yang
Jiaqi Wang
Yansong Tang
Kai-xiang Chen
Hengshuang Zhao
Philip Torr
48
23
0
11 Mar 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion
  Tasks
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Xiaoping Han
Xiatian Zhu
Licheng Yu
Li Zhang
Yi-Zhe Song
Tao Xiang
VLM
24
38
0
04 Mar 2023
Few-shot Multimodal Multitask Multilingual Learning
Few-shot Multimodal Multitask Multilingual Learning
Aman Chadha
Vinija Jain
53
0
0
19 Feb 2023
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers
  using Synthetic Scene Data
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data
Roei Herzig
Ofir Abramovich
Elad Ben-Avraham
Assaf Arbelle
Leonid Karlinsky
Ariel Shamir
Trevor Darrell
Amir Globerson
41
16
0
08 Dec 2022
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual
  Grounding
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
Dave Zhenyu Chen
Ronghang Hu
Xinlei Chen
Matthias Nießner
Angel X. Chang
29
52
0
01 Dec 2022
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose
  Visual Representation
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation
Jiangyong Huang
William Zhu
Baoxiong Jia
Zan Wang
Xiaojian Ma
Qing Li
Siyuan Huang
37
5
0
28 Nov 2022
A Transformer Framework for Data Fusion and Multi-Task Learning in Smart
  Cities
A Transformer Framework for Data Fusion and Multi-Task Learning in Smart Cities
Alexander C. DeRieux
Walid Saad
W. Zuo
R. Budiarto
M. D. Koerniawan
D. Novitasari
20
1
0
18 Nov 2022
Multimodal Transformer for Parallel Concatenated Variational
  Autoencoders
Multimodal Transformer for Parallel Concatenated Variational Autoencoders
Stephen D. Liang
J. Mendel
ViT
27
5
0
28 Oct 2022
M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task
  Learning with Model-Accelerator Co-design
M3^33ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design
Hanxue Liang
Zhiwen Fan
Rishov Sarkar
Ziyu Jiang
Tianlong Chen
Kai Zou
Yu Cheng
Cong Hao
Zhangyang Wang
MoE
36
81
0
26 Oct 2022
Learning More May Not Be Better: Knowledge Transferability in Vision and
  Language Tasks
Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Tianwei Chen
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Hajime Nagahara
VLM
41
0
0
23 Aug 2022
Making the Best of Both Worlds: A Domain-Oriented Transformer for
  Unsupervised Domain Adaptation
Making the Best of Both Worlds: A Domain-Oriented Transformer for Unsupervised Domain Adaptation
Wen-hui Ma
Jinming Zhang
Shuang Li
Chi Harold Liu
Yulin Wang
Wei Li
26
14
0
02 Aug 2022
Learning Visual Representation from Modality-Shared Contrastive
  Language-Image Pre-training
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIP
VLM
24
48
0
26 Jul 2022
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer
  to Unlabeled Modality
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Wei-Ning Hsu
Bowen Shi
SSL
VLM
27
41
0
14 Jul 2022
Automatic Generation of Product-Image Sequence in E-commerce
Automatic Generation of Product-Image Sequence in E-commerce
Xiaochuan Fan
Chi Zhang
Yong-Jie Yang
Yue Shang
Xueying Zhang
Zhen He
Yun Xiao
Bo Long
Lingfei Wu
23
4
0
26 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
56
392
0
17 Jun 2022
IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering
  in Indoor Scenes
IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes
Rui Zhu
Zhengqin Li
J. Matai
Fatih Porikli
Manmohan Chandraker
ViT
43
45
0
16 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
30
124
0
15 Jun 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language
  Modeling
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
20
81
0
14 Jun 2022
Multi-Task Learning with Multi-Query Transformer for Dense Prediction
Multi-Task Learning with Multi-Query Transformer for Dense Prediction
Yangyang Xu
Xiangtai Li
Haobo Yuan
Yibo Yang
Lefei Zhang
ViT
28
45
0
28 May 2022
MulT: An End-to-End Multitask Learning Transformer
MulT: An End-to-End Multitask Learning Transformer
Deblina Bhattacharjee
Tong Zhang
Sabine Süsstrunk
Mathieu Salzmann
ViT
39
62
0
17 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
17
16
0
02 May 2022
Where in the World is this Image? Transformer-based Geo-localization in
  the Wild
Where in the World is this Image? Transformer-based Geo-localization in the Wild
Shraman Pramanick
E. Nowara
Joshua Gleason
Carlos D. Castillo
Rama Chellappa
ViT
21
30
0
29 Apr 2022
Modeling Motion with Multi-Modal Features for Text-Based Video
  Segmentation
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Wangbo Zhao
Kai Wang
Xiangxiang Chu
Fuzhao Xue
Xinchao Wang
Yang You
29
21
0
06 Apr 2022
MultiMAE: Multi-modal Multi-task Masked Autoencoders
MultiMAE: Multi-modal Multi-task Masked Autoencoders
Roman Bachmann
David Mizrahi
Andrei Atanov
Amir Zamir
35
265
0
04 Apr 2022
Multiscale Sensor Fusion and Continuous Control with Neural CDEs
Multiscale Sensor Fusion and Continuous Control with Neural CDEs
Sumeet Singh
Francis McCann Ramirez
Jacob Varley
Andy Zeng
Vikas Sindhwani
49
3
0
16 Mar 2022
Cross-modal Map Learning for Vision and Language Navigation
Cross-modal Map Learning for Vision and Language Navigation
G. Georgakis
Karl Schmeckpeper
Karan Wanchoo
Soham Dan
E. Miltsakaki
Dan Roth
Kostas Daniilidis
22
64
0
10 Mar 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple
  Sequence-to-Sequence Learning Framework
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
53
850
0
07 Feb 2022
Multimodal data matters: language model pre-training over structured and
  unstructured electronic health records
Multimodal data matters: language model pre-training over structured and unstructured electronic health records
Sicen Liu
Xiaolong Wang
Yongshuai Hou
Ge Li
Hui Wang
Huiqin Xu
Yang Xiang
Buzhou Tang
52
30
0
25 Jan 2022
FLAVA: A Foundational Language And Vision Alignment Model
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP
VLM
40
687
0
08 Dec 2021
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Zhao Yang
Jiaqi Wang
Yansong Tang
Kai-xiang Chen
Hengshuang Zhao
Philip Torr
148
306
0
04 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception
  for Zero-shot and Few-shot Tasks
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Hongsheng Li
Xiaohua Wang
Jifeng Dai
56
129
0
02 Dec 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
35
73
0
25 Nov 2021
Exploiting Both Domain-specific and Invariant Knowledge via a Win-win
  Transformer for Unsupervised Domain Adaptation
Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation
Wen-hui Ma
Jinming Zhang
Shuang Li
Chi Harold Liu
Yulin Wang
Wei Li
ViT
27
11
0
25 Nov 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language
  Modeling
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
27
111
0
23 Nov 2021
Building Goal-Oriented Dialogue Systems with Situated Visual Context
Building Goal-Oriented Dialogue Systems with Situated Visual Context
Sanchit Agarwal
Jan Jezabek
Arijit Biswas
Emre Barut
Shuyang Gao
Tagyoung Chung
20
1
0
22 Nov 2021
12
Next