ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.07490
  4. Cited By
LXMERT: Learning Cross-Modality Encoder Representations from
  Transformers

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

20 August 2019
Hao Hao Tan
Joey Tianyi Zhou
    VLM
    MLLM
ArXivPDFHTML

Papers citing "LXMERT: Learning Cross-Modality Encoder Representations from Transformers"

50 / 1,507 papers shown
Title
Enhancing Large Vision Language Models with Self-Training on Image
  Comprehension
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng
Pan Lu
Fan Yin
Ziniu Hu
Sheng Shen
James Zou
Kai-Wei Chang
Wei Wang
SyDa
VLM
LRM
44
36
0
30 May 2024
WIDIn: Wording Image for Domain-Invariant Representation in
  Single-Source Domain Generalization
WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization
Jiawei Ma
Yulei Niu
Shiyuan Huang
G. Han
Shih-Fu Chang
VLM
42
1
0
28 May 2024
FinEmbedDiff: A Cost-Effective Approach of Classifying Financial
  Documents with Vector Sampling using Multi-modal Embedding Models
FinEmbedDiff: A Cost-Effective Approach of Classifying Financial Documents with Vector Sampling using Multi-modal Embedding Models
Anjanava Biswas
Wrick Talukdar
16
1
0
28 May 2024
Diagnosing the Compositional Knowledge of Vision Language Models from a
  Game-Theoretic View
Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View
Jin Wang
Shichao Dong
Yapeng Zhu
Kelu Yao
Weidong Zhao
Chao Li
Ping Luo
CoGe
LRM
48
2
0
27 May 2024
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical
  Study of VCR
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Xiaolin Chen
Liqiang Nie
Mohan S. Kankanhalli
LRM
25
8
0
27 May 2024
LAM3D: Large Image-Point-Cloud Alignment Model for 3D Reconstruction
  from Single Image
LAM3D: Large Image-Point-Cloud Alignment Model for 3D Reconstruction from Single Image
Ruikai Cui
Xibin Song
Weixuan Sun
Senbo Wang
Weizhe Liu
...
Taizhang Shang
Yang Li
Nick Barnes
Hongdong Li
Pan Ji
3DV
53
5
0
24 May 2024
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
Abdelrahman Abdelhamed
Mahmoud Afifi
Alec Go
MLLM
VLM
36
3
0
24 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
82
42
0
23 May 2024
A Novel Fusion Architecture for PD Detection Using Semi-Supervised
  Speech Embeddings
A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings
Tariq Adnan
Abdelrahman Abdelkader
Zipei Liu
Ekram Hossain
Sooyong Park
Md. Saiful Islam
Ehsan Hoque
33
2
0
21 May 2024
Text-Video Retrieval with Global-Local Semantic Consistent Learning
Text-Video Retrieval with Global-Local Semantic Consistent Learning
Haonan Zhang
Pengpeng Zeng
Lianli Gao
Jingkuan Song
Yihang Duan
Xinyu Lyu
Hengtao Shen
VLM
CLIP
40
2
0
21 May 2024
Enhancing Fine-Grained Image Classifications via Cascaded Vision
  Language Models
Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models
Canshi Wei
VLM
32
0
0
18 May 2024
MemeMQA: Multimodal Question Answering for Memes via Rationale-Based
  Inferencing
MemeMQA: Multimodal Question Answering for Memes via Rationale-Based Inferencing
Siddhant Agarwal
Shivam Sharma
Preslav Nakov
Tanmoy Chakraborty
24
4
0
18 May 2024
Efficient Vision-Language Pre-training by Cluster Masking
Efficient Vision-Language Pre-training by Cluster Masking
Zihao Wei
Zixuan Pan
Andrew Owens
VLM
29
8
0
14 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
35
2
0
12 May 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location
  Prediction in Social Media
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
34
0
0
09 May 2024
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
Shibo Jie
Yehui Tang
Ning Ding
Zhi-Hong Deng
Kai Han
Yunhe Wang
VLM
33
6
0
09 May 2024
Interpretable Tensor Fusion
Interpretable Tensor Fusion
Saurabh Varshneya
Antoine Ledent
Philipp Liznerski
Andriy Balinskyy
Purvanshi Mehta
Waleed Mustafa
Marius Kloft
21
1
0
07 May 2024
POV Learning: Individual Alignment of Multimodal Models using Human
  Perception
POV Learning: Individual Alignment of Multimodal Models using Human Perception
Simon Werner
Katharina Christ
Laura Bernardy
Marion G. Müller
Achim Rettinger
26
0
0
07 May 2024
Visual Language Model based Cross-modal Semantic Communication Systems
Visual Language Model based Cross-modal Semantic Communication Systems
Feibo Jiang
Chuanguo Tang
Li Dong
Kezhi Wang
Kun Yang
Cunhua Pan
VLM
36
2
0
06 May 2024
Transitive Vision-Language Prompt Learning for Domain Generalization
Transitive Vision-Language Prompt Learning for Domain Generalization
Liyuan Wang
Yan Jin
Zhen Chen
Jinlin Wu
Mengke Li
Yang Lu
Hanzi Wang
VLM
VPVLM
LRM
55
0
0
29 Apr 2024
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question
  Answering by Understanding Vietnamese Text in Images
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham
Thang Kien-Bao Nguyen
Quan Van Nguyen
Dan Quang Tran
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
35
3
0
29 Apr 2024
Efficient Remote Sensing with Harmonized Transfer Learning and Modality
  Alignment
Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment
Tengjun Huang
41
0
0
28 Apr 2024
Spatio-Temporal Side Tuning Pre-trained Foundation Models for
  Video-based Pedestrian Attribute Recognition
Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition
Tianlin Li
Qian Zhu
Jiandong Jin
Jun Zhu
Futian Wang
Bowei Jiang
Yaowei Wang
Yonghong Tian
ViT
36
3
0
27 Apr 2024
Medical Vision-Language Pre-Training for Brain Abnormalities
Medical Vision-Language Pre-Training for Brain Abnormalities
Masoud Monajatipoor
Zi-Yi Dou
Aichi Chien
Nanyun Peng
Kai-Wei Chang
VLM
32
0
0
27 Apr 2024
A review of deep learning-based information fusion techniques for
  multimodal medical image classification
A review of deep learning-based information fusion techniques for multimodal medical image classification
Yi-Hsuan Li
Mostafa EL HABIB DAHO
Pierre-Henri Conze
Rachid Zeghlache
Hugo Le Boité
R. Tadayoni
B. Cochener
M. Lamard
G. Quellec
35
31
0
23 Apr 2024
Leveraging Speech for Gesture Detection in Multimodal Communication
Leveraging Speech for Gesture Detection in Multimodal Communication
E. Ghaleb
I. Burenko
Marlou Rasenberg
Wim Pouw
Ivan Toni
Peter Uhrig
Anna Wilson
Judith Holler
Asli Ozyurek
Raquel Fernández
SLR
30
4
0
23 Apr 2024
Self-Bootstrapped Visual-Language Model for Knowledge Selection and
  Question Answering
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Dongze Hao
Qunbo Wang
Longteng Guo
Jie Jiang
Jing Liu
36
0
0
22 Apr 2024
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based
  Visual Question Answering
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding
Kaixuan Ren
Jiabin Huang
Siwen Luo
S. Han
43
1
0
19 Apr 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
35
4
0
18 Apr 2024
Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation
Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation
Jingmin Sun
Yuxuan Liu
Zecheng Zhang
Hayden Schaeffer
AI4CE
30
15
0
18 Apr 2024
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI
  Agent
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
Wei Chen
Zhiyuan Li
LLMAG
30
3
0
17 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
CoGe
39
3
0
16 Apr 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
Gangyi Ding
40
7
0
16 Apr 2024
AIGeN: An Adversarial Approach for Instruction Generation in VLN
AIGeN: An Adversarial Approach for Instruction Generation in VLN
Niyati Rawal
Roberto Bigazzi
Lorenzo Baraldi
Rita Cucchiara
GAN
52
4
0
15 Apr 2024
Evolving Interpretable Visual Classifiers with Large Language Models
Evolving Interpretable Visual Classifiers with Large Language Models
Mia Chiquier
Utkarsh Mall
Carl Vondrick
VLM
30
10
0
15 Apr 2024
Bridging Vision and Language Spaces with Assignment Prediction
Bridging Vision and Language Spaces with Assignment Prediction
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
VLM
37
7
0
15 Apr 2024
Multimodal Cross-Document Event Coreference Resolution Using Linear
  Semantic Transfer and Mixed-Modality Ensembles
Multimodal Cross-Document Event Coreference Resolution Using Linear Semantic Transfer and Mixed-Modality Ensembles
Abhijnan Nath
Huma Jamil
Shafiuddin Rehan Ahmed
George Baker
Rahul Ghosh
James H. Martin
Nathaniel Blanchard
Nikhil Krishnaswamy
34
2
0
13 Apr 2024
Enhancing Visual Question Answering through Question-Driven Image
  Captions as Prompts
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
Övgü Özdemir
Erdem Akagündüz
41
10
0
12 Apr 2024
FLoRA: Enhancing Vision-Language Models with Parameter-Efficient
  Federated Learning
FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning
Duy Phuong Nguyen
J. P. Muñoz
Ali Jannesari
VLM
31
6
0
12 Apr 2024
MedRG: Medical Report Grounding with Multi-modal Large Language Model
MedRG: Medical Report Grounding with Multi-modal Large Language Model
K. Zou
Yang Bai
Zhihao Chen
Yang Zhou
Yidi Chen
Kai Ren
Meng Wang
Xuedong Yuan
Xiaojing Shen
Huazhu Fu
MedIm
42
4
0
10 Apr 2024
What is Your Favorite Gender, MLM? Gender Bias Evaluation in
  Multilingual Masked Language Models
What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models
Emily M. Bender
Solon Barocas
Robert Sim
Hanna Wallach. 2021
29
3
0
09 Apr 2024
GUIDE: Graphical User Interface Data for Execution
GUIDE: Graphical User Interface Data for Execution
Rajat Chawla
Adarsh Jha
Muskaan Kumar
NS Mukunda
Ishaan Bhola
LLMAG
27
3
0
09 Apr 2024
Contextual Chart Generation for Cyber Deception
Contextual Chart Generation for Cyber Deception
David D. Nguyen
David Liebowitz
Surya Nepal
S. Kanhere
Sharif Abuadbba
49
0
0
07 Apr 2024
Vision Transformers in Domain Adaptation and Generalization: A Study of
  Robustness
Vision Transformers in Domain Adaptation and Generalization: A Study of Robustness
Shadi Alijani
Jamil Fayyad
H. Najjaran
OOD
32
1
0
05 Apr 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept
  Matching
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
32
20
0
04 Apr 2024
Cross-Modality Gait Recognition: Bridging LiDAR and Camera Modalities
  for Human Identification
Cross-Modality Gait Recognition: Bridging LiDAR and Camera Modalities for Human Identification
Rui Wang
Chuanfu Shen
M. Marín-Jiménez
George Q. Huang
Shiqi Yu
CVBM
53
4
0
04 Apr 2024
DELAN: Dual-Level Alignment for Vision-and-Language Navigation by
  Cross-Modal Contrastive Learning
DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning
Mengfei Du
Binhao Wu
Jiwen Zhang
Zhihao Fan
Zejun Li
Ruipu Luo
Xuanjing Huang
Zhongyu Wei
33
3
0
02 Apr 2024
SyncMask: Synchronized Attentional Masking for Fashion-centric
  Vision-Language Pretraining
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
Chull Hwan Song
Taebaek Hwang
Jooyoung Yoon
Shunghyun Choi
Yeong Hyeon Gu
23
4
0
01 Apr 2024
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative
  Vision-Language Reasoning
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Rongjie Li
Yu Wu
Xuming He
MLLM
LRM
VLM
30
2
0
01 Apr 2024
LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End
  Autonomous Driving
LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End Autonomous Driving
Pranjal Paul
Anant Garg
Tushar Choudhary
Arun Kumar Singh
K. M. Krishna
52
3
0
29 Mar 2024
Previous
12345...293031
Next