ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2111.02387
  4. Cited By
An Empirical Study of Training End-to-End Vision-and-Language
  Transformers

An Empirical Study of Training End-to-End Vision-and-Language Transformers

3 November 2021
Zi-Yi Dou
Yichong Xu
Zhe Gan
Jianfeng Wang
Shuohang Wang
Lijuan Wang
Chenguang Zhu
Pengchuan Zhang
Lu Yuan
Nanyun Peng
Zicheng Liu
Michael Zeng
    VLM
ArXivPDFHTML

Papers citing "An Empirical Study of Training End-to-End Vision-and-Language Transformers"

50 / 91 papers shown
Title
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
Thong Nguyen
Zhiyuan Hu
Xu Lin
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
14
0
0
19 May 2025
What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift
What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift
Jiamin Chang
Yiming Li
Hammond Pearce
Ruoxi Sun
Bo-wen Li
Minhui Xue
38
0
0
28 Apr 2025
KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking
KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking
Juyeon Kim
Geon Lee
Taeuk Kim
Kijung Shin
29
0
0
21 Apr 2025
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
Sheng Zhang
Yanbo Xu
Naoto Usuyama
Hanwen Xu
J. Bagga
...
Carlo Bifulco
M. Lungren
Tristan Naumann
Sheng Wang
Hoifung Poon
LM&MA
MedIm
154
205
0
10 Jan 2025
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye
Honglin Lin
Leyan Ou
Dairong Chen
Zihao Wang
Zeang Sheng
Weijia Li
Weijia Li
76
0
0
22 Dec 2024
Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering
Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering
Zhilin Zhang
Jie Wang
Zhanghao Qin
Ruiqi Zhu
Xiaoliang Gong
MedIm
54
0
0
28 Oct 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for
  Vision-Language Pre-Training
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
56
9
0
16 Oct 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
Ke Xu
AAML
VLM
40
13
0
08 Jun 2024
Accelerating Transformers with Spectrum-Preserving Token Merging
Accelerating Transformers with Spectrum-Preserving Token Merging
Hoai-Chau Tran
D. M. Nguyen
Duy M. Nguyen
Trung Thanh Nguyen
Ngan Le
Pengtao Xie
Daniel Sonntag
James Y. Zou
Binh T. Nguyen
Mathias Niepert
44
8
0
25 May 2024
Dialogue with Robots: Proposals for Broadening Participation and
  Research in the SLIVAR Community
Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community
Casey Kennington
Malihe Alikhani
Heather Pon-Barry
Katherine Atwell
Yonatan Bisk
...
Jivko Sinapov
Angela Stewart
Matthew Stone
Stefanie Tellex
Tom Williams
61
0
0
01 Apr 2024
Boosting Transferability in Vision-Language Attacks via Diversification
  along the Intersection Region of Adversarial Trajectory
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
Sensen Gao
Xiaojun Jia
Xuhong Ren
Ivor Tsang
Qing Guo
AAML
38
14
0
19 Mar 2024
Non-autoregressive Sequence-to-Sequence Vision-Language Models
Non-autoregressive Sequence-to-Sequence Vision-Language Models
Kunyu Shi
Qi Dong
Luis Goncalves
Zhuowen Tu
Stefano Soatto
VLM
47
3
0
04 Mar 2024
Video Anomaly Detection and Explanation via Large Language Models
Video Anomaly Detection and Explanation via Large Language Models
Hui Lv
Qianru Sun
33
21
0
11 Jan 2024
Jack of All Tasks, Master of Many: Designing General-purpose
  Coarse-to-Fine Vision-Language Model
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
48
29
0
19 Dec 2023
UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic
  Cross-modal Learnable Prompts
UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic Cross-modal Learnable Prompts
Chenlu Zhan
Yufei Zhang
Yu Lin
Gaoang Wang
Hongwei Wang
VLM
MedIm
40
5
0
18 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language
  Pre-training
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
27
4
0
14 Dec 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network
  for VL Representation
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
30
0
0
20 Oct 2023
Beyond Task Performance: Evaluating and Reducing the Flaws of Large
  Multimodal Models with In-Context Learning
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Mustafa Shukor
Alexandre Ramé
Corentin Dancette
Matthieu Cord
LRM
MLLM
46
20
0
01 Oct 2023
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language
  Models
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
LRM
36
25
0
08 Sep 2023
EVE: Efficient Vision-Language Pre-training with Masked Prediction and
  Modality-Aware MoE
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
Junyi Chen
Longteng Guo
Jianxiang Sun
Shuai Shao
Zehuan Yuan
Liang Lin
Dongyu Zhang
MLLM
VLM
MoE
60
9
0
23 Aug 2023
Lip Reading for Low-resource Languages by Learning and Combining General
  Speech Knowledge and Language-specific Knowledge
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
Minsu Kim
Jeong Hun Yeo
J. Choi
Y. Ro
34
16
0
18 Aug 2023
Set-level Guidance Attack: Boosting Adversarial Transferability of
  Vision-Language Pre-training Models
Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models
Dong Lu
Zhiqiang Wang
Teng Wang
Weili Guan
Hongchang Gao
Feng Zheng
AAML
53
65
0
26 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
Fahad Shahbaz Khan
VLM
38
118
0
25 Jul 2023
Towards a Visual-Language Foundation Model for Computational Pathology
Towards a Visual-Language Foundation Model for Computational Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Ivy Liang
...
Andrew Zhang
L. Le
Georg Gerber
Anil V. Parwani
Faisal Mahmood
VLM
MedIm
42
46
0
24 Jul 2023
Vision Language Transformers: A Survey
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
28
5
0
06 Jul 2023
Joint Adaptive Representations for Image-Language Learning
Joint Adaptive Representations for Image-Language Learning
A. Piergiovanni
A. Angelova
VLM
34
0
0
31 May 2023
ConaCLIP: Exploring Distillation of Fully-Connected Knowledge
  Interaction Graph for Lightweight Text-Image Retrieval
ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval
Jiapeng Wang
Chengyu Wang
Xiaodan Wang
Jun Huang
Lianwen Jin
VLM
37
4
0
28 May 2023
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
Qingqing Cao
Bhargavi Paranjape
Hannaneh Hajishirzi
MLLM
VLM
13
21
0
27 May 2023
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Chia-Wen Kuo
Z. Kira
37
21
0
25 May 2023
EDIS: Entity-Driven Image Search over Multimodal Web Content
EDIS: Entity-Driven Image Search over Multimodal Web Content
Siqi Liu
Weixi Feng
Tsu-jui Fu
Wenhu Chen
Wei Wang
VLM
48
9
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and
  Blending
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM
CLIP
23
17
0
22 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited
  Modalities
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
48
115
0
18 May 2023
Vision Language Pre-training by Contrastive Learning with Cross-Modal
  Similarity Regulation
Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
Chaoya Jiang
Wei Ye
Haiyang Xu
Miang yan
Shikun Zhang
Jie Zhang
Fei Huang
VLM
34
15
0
08 May 2023
On Robustness in Multimodal Learning
On Robustness in Multimodal Learning
Brandon McKinzie
Joseph Cheng
Vaishaal Shankar
Yinfei Yang
Jonathon Shlens
Alexander Toshev
40
2
0
10 Apr 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
34
23
0
29 Mar 2023
Top-Down Visual Attention from Analysis by Synthesis
Top-Down Visual Attention from Analysis by Synthesis
Baifeng Shi
Trevor Darrell
Xin Eric Wang
25
29
0
23 Mar 2023
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image
  Person Retrieval
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Ding Jiang
Mang Ye
35
139
0
22 Mar 2023
Prismer: A Vision-Language Model with Multi-Task Experts
Prismer: A Vision-Language Model with Multi-Task Experts
Shikun Liu
Linxi Fan
Edward Johns
Zhiding Yu
Chaowei Xiao
Anima Anandkumar
VLM
MLLM
44
21
0
04 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense
  Video Captioning
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
39
221
0
27 Feb 2023
Towards Unifying Medical Vision-and-Language Pre-training via Soft
  Prompts
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
Zhihong Chen
Shizhe Diao
Benyou Wang
Guanbin Li
Xiang Wan
MedIm
27
29
0
17 Feb 2023
AutoFraudNet: A Multimodal Network to Detect Fraud in the Auto Insurance
  Industry
AutoFraudNet: A Multimodal Network to Detect Fraud in the Auto Insurance Industry
Azin Asgarian
Rohit Saha
Daniel Jakubovitz
Julia Peyre
32
2
0
15 Jan 2023
Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A
  Reproducibility Study
Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
Mariya Hendriksen
Svitlana Vakulenko
E. Kuiper
Maarten de Rijke
39
5
0
12 Jan 2023
Learning to Exploit Temporal Structure for Biomedical Vision-Language
  Processing
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
Shruthi Bannur
Stephanie L. Hyland
Qianchu Liu
Fernando Pérez-García
Maximilian Ilse
...
Maria T. A. Wetscherek
M. Lungren
A. Nori
Javier Alvarez-Valle
Ozan Oktay
36
113
0
11 Jan 2023
GIVL: Improving Geographical Inclusivity of Vision-Language Models with
  Pre-Training Methods
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Da Yin
Feng Gao
Govind Thattai
Michael F. Johnston
Kai-Wei Chang
VLM
34
15
0
05 Jan 2023
Exploring Vision Transformers as Diffusion Learners
Exploring Vision Transformers as Diffusion Learners
He Cao
Jianan Wang
Tianhe Ren
Xianbiao Qi
Yihao Chen
Yuan Yao
Lefei Zhang
44
10
0
28 Dec 2022
Attentive Mask CLIP
Attentive Mask CLIP
Yifan Yang
Weiquan Huang
Yixuan Wei
Houwen Peng
Xinyang Jiang
...
Fangyun Wei
Yin Wang
Han Hu
Lili Qiu
Yuqing Yang
CLIP
VLM
42
27
0
16 Dec 2022
Uniform Masking Prevails in Vision-Language Pretraining
Uniform Masking Prevails in Vision-Language Pretraining
Siddharth Verma
Yuchen Lu
Rui Hou
Hanchao Yu
Nicolas Ballas
Madian Khabsa
Amjad Almahairi
VLM
21
0
0
10 Dec 2022
InternVideo: General Video Foundation Models via Generative and
  Discriminative Learning
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
57
311
0
06 Dec 2022
Controllable Image Captioning via Prompting
Controllable Image Captioning via Prompting
Ning Wang
Jiahao Xie
Jihao Wu
Mingbo Jia
Linlin Li
22
23
0
04 Dec 2022
Scaling Language-Image Pre-training via Masking
Scaling Language-Image Pre-training via Masking
Yanghao Li
Haoqi Fan
Ronghang Hu
Christoph Feichtenhofer
Kaiming He
CLIP
VLM
27
318
0
01 Dec 2022
12
Next