ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.02265
  4. Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSL
    VLM
ArXivPDFHTML

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,093 papers shown
Title
Attribute-Aware Implicit Modality Alignment for Text Attribute Person
  Search
Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search
Xin Wang
Fangfang Liu
Zheng Li
Caili Guo
46
1
0
06 Jun 2024
MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
Stefan Gerd Fritsch
Cennet Oğuz
Vitor Fortes Rey
L. Ray
Maximilian Kiefer-Emmanouilidis
Paul Lukowicz
HAI
53
0
0
06 Jun 2024
FILS: Self-Supervised Video Feature Prediction In Semantic Language
  Space
FILS: Self-Supervised Video Feature Prediction In Semantic Language Space
Mona Ahmadian
Frank Guerin
Andrew Gilbert
44
1
0
05 Jun 2024
Multimodal Reasoning with Multimodal Knowledge Graph
Multimodal Reasoning with Multimodal Knowledge Graph
Junlin Lee
Yequan Wang
Jing Li
Min Zhang
44
15
0
04 Jun 2024
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Yuxuan Wang
Feng Dong
Jinchao Zhu
Shuyue Zhu
VOS
56
0
0
04 Jun 2024
Augmented Commonsense Knowledge for Remote Object Grounding
Augmented Commonsense Knowledge for Remote Object Grounding
Bahram Mohammadi
Yicong Hong
Yuankai Qi
Qi Wu
Shirui Pan
Javen Qinfeng Shi
48
7
0
03 Jun 2024
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision
  Transformer
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
Ding Jia
Jianyuan Guo
Kai Han
Han Wu
Chao Zhang
Chang Xu
Xinghao Chen
ViT
48
16
0
03 Jun 2024
Hard Cases Detection in Motion Prediction by Vision-Language Foundation
  Models
Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models
Yi Yang
Qingwen Zhang
Kei Ikemura
Nazre Batool
John Folkesson
VLM
35
1
0
31 May 2024
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits
  Multimodal Reasoning
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
Cheng Tan
Jingxuan Wei
Linzhuang Sun
Zhangyang Gao
Siyuan Li
Bihui Yu
Ruifeng Guo
Stan Z. Li
ReLM
LRM
3DV
72
6
0
31 May 2024
Can't make an Omelette without Breaking some Eggs: Plausible Action
  Anticipation using Large Video-Language Models
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal
Nakul Agarwal
Shao-Yuan Lo
Kwonjoon Lee
44
14
0
30 May 2024
ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval
  from Linguistically Complex Descriptions
ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions
Honglin Lin
Siyu Li
Gu Nan
Chaoyue Tang
Xueting Wang
...
Yankai Rong
Zhili Zhou
Yutong Gao
Qimei Cui
Xiaofeng Tao
33
0
0
29 May 2024
Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing
  Image-Text Retrieval
Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval
Rui Yang
Shuang Wang
Yi Han
Yuanheng Li
Dong Zhao
Dou Quan
Yanhe Guo
Licheng Jiao
68
3
0
29 May 2024
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical
  Study of VCR
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Xiaolin Chen
Liqiang Nie
Mohan S. Kankanhalli
LRM
25
8
0
27 May 2024
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion
Zizhao Hu
Mohammad Rostami
34
0
0
25 May 2024
Planted: a dataset for planted forest identification from
  multi-satellite time series
Planted: a dataset for planted forest identification from multi-satellite time series
L. M. Pazos-Outón
Cristina Nader Vasconcelos
Anton Raichuk
Anurag Arnab
Dan Morris
Maxim Neumann
47
4
0
24 May 2024
Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip
  Transformer
Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer
Zichen Geng
Caren Han
Zeeshan Hayder
Jian Liu
Mubarak Shah
Ajmal Mian
27
3
0
24 May 2024
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
Abdelrahman Abdelhamed
Mahmoud Afifi
Alec Go
MLLM
VLM
39
3
0
24 May 2024
Distilling Vision-Language Pretraining for Efficient Cross-Modal
  Retrieval
Distilling Vision-Language Pretraining for Efficient Cross-Modal Retrieval
Young Kyun Jang
Donghyun Kim
Ser-nam Lim
VLM
21
0
0
23 May 2024
Boosting Medical Image-based Cancer Detection via Text-guided
  Supervision from Reports
Boosting Medical Image-based Cancer Detection via Text-guided Supervision from Reports
Guangyu Guo
Jiawen Yao
Yingda Xia
Tony C. W. Mok
Zhilin Zheng
Junwei Han
Le Lu
Dingwen Zhang
Jian Zhou
Ling Zhang
37
1
0
23 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
82
45
0
23 May 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A
  Survey
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh
Syed Mohammed Shamsul Islam
Douglas Chai
Naveed Akhtar
35
9
0
22 May 2024
Comprehensive Multimodal Deep Learning Survival Prediction Enabled by a
  Transformer Architecture: A Multicenter Study in Glioblastoma
Comprehensive Multimodal Deep Learning Survival Prediction Enabled by a Transformer Architecture: A Multicenter Study in Glioblastoma
A. Gomaa
Yixing Huang
Amr Hagag
Charlotte Schmitter
Daniel Höfler
...
U. Gaipl
S. Semrau
Christoph Bert
R. Fietkau
F. Putz
21
8
0
21 May 2024
A Novel Fusion Architecture for PD Detection Using Semi-Supervised
  Speech Embeddings
A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings
Tariq Adnan
Abdelrahman Abdelkader
Zipei Liu
Ekram Hossain
Sooyong Park
Md. Saiful Islam
Ehsan Hoque
38
2
0
21 May 2024
ColorFoil: Investigating Color Blindness in Large Vision and Language Models
ColorFoil: Investigating Color Blindness in Large Vision and Language Models
Ahnaf Mozib Samin
M. F. Ahmed
Md. Mushtaq Shahriyar Rafee
VLM
30
2
0
19 May 2024
MemeMQA: Multimodal Question Answering for Memes via Rationale-Based
  Inferencing
MemeMQA: Multimodal Question Answering for Memes via Rationale-Based Inferencing
Siddhant Agarwal
Shivam Sharma
Preslav Nakov
Tanmoy Chakraborty
24
4
0
18 May 2024
Self-supervised vision-langage alignment of deep learning
  representations for bone X-rays analysis
Self-supervised vision-langage alignment of deep learning representations for bone X-rays analysis
A. Englebert
Anne-Sophie Collin
O. Cornu
Christophe De Vleeschouwer
34
1
0
14 May 2024
Alignment Helps Make the Most of Multimodal Data
Alignment Helps Make the Most of Multimodal Data
Christian Arnold
Andreas Küpfer
46
2
0
14 May 2024
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual
  Question Answering
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang
Jianqin Yin
45
1
0
13 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
35
2
0
12 May 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location
  Prediction in Social Media
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
37
0
0
09 May 2024
Exploring Vision Transformers for 3D Human Motion-Language Models with
  Motion Patches
Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches
Qing Yu
Mikihiro Tanaka
Kent Fujiwara
ViT
47
2
0
08 May 2024
POV Learning: Individual Alignment of Multimodal Models using Human
  Perception
POV Learning: Individual Alignment of Multimodal Models using Human Perception
Simon Werner
Katharina Christ
Laura Bernardy
Marion G. Müller
Achim Rettinger
26
0
0
07 May 2024
Language-Image Models with 3D Understanding
Language-Image Models with 3D Understanding
Jang Hyun Cho
Boris Ivanovic
Yulong Cao
Edward Schmerling
Yue Wang
...
Boyi Li
Yurong You
Philipp Krahenbuhl
Yan Wang
Marco Pavone
LRM
42
17
0
06 May 2024
Language-Enhanced Latent Representations for Out-of-Distribution
  Detection in Autonomous Driving
Language-Enhanced Latent Representations for Out-of-Distribution Detection in Autonomous Driving
Zhenjiang Mao
Dong-You Jhong
Ao Wang
Ivan Ruchkin
OODD
44
2
0
02 May 2024
Transitive Vision-Language Prompt Learning for Domain Generalization
Transitive Vision-Language Prompt Learning for Domain Generalization
Liyuan Wang
Yan Jin
Zhen Chen
Jinlin Wu
Mengke Li
Yang Lu
Hanzi Wang
VLM
VPVLM
LRM
58
0
0
29 Apr 2024
Enhancing Interactive Image Retrieval With Query Rewriting Using Large
  Language Models and Vision Language Models
Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
Hongyi Zhu
Jia-Hong Huang
S. Rudinac
Evangelos Kanoulas
44
7
0
29 Apr 2024
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question
  Answering by Understanding Vietnamese Text in Images
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham
Thang Kien-Bao Nguyen
Quan Van Nguyen
Dan Quang Tran
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
41
3
0
29 Apr 2024
Spatio-Temporal Side Tuning Pre-trained Foundation Models for
  Video-based Pedestrian Attribute Recognition
Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition
Tianlin Li
Qian Zhu
Jiandong Jin
Jun Zhu
Futian Wang
Bowei Jiang
Yaowei Wang
Yonghong Tian
ViT
39
3
0
27 Apr 2024
Medical Vision-Language Pre-Training for Brain Abnormalities
Medical Vision-Language Pre-Training for Brain Abnormalities
Masoud Monajatipoor
Zi-Yi Dou
Aichi Chien
Nanyun Peng
Kai-Wei Chang
VLM
40
0
0
27 Apr 2024
A review of deep learning-based information fusion techniques for
  multimodal medical image classification
A review of deep learning-based information fusion techniques for multimodal medical image classification
Yi-Hsuan Li
Mostafa EL HABIB DAHO
Pierre-Henri Conze
Rachid Zeghlache
Hugo Le Boité
R. Tadayoni
B. Cochener
M. Lamard
G. Quellec
38
31
0
23 Apr 2024
Self-Bootstrapped Visual-Language Model for Knowledge Selection and
  Question Answering
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Dongze Hao
Qunbo Wang
Longteng Guo
Jie Jiang
Jing Liu
36
0
0
22 Apr 2024
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking
  Enhances Visual Commonsense Reasoning
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning
Mingjie Ma
Zhihuan Yu
Yichao Ma
Guohui Li
LRM
41
1
0
22 Apr 2024
General Item Representation Learning for Cold-start Content
  Recommendations
General Item Representation Learning for Cold-start Content Recommendations
Jooeun Kim
Jinri Kim
Kwangeun Yeo
Eungi Kim
Kyoung-Woon On
Jonghwan Mun
Joonseok Lee
VLM
32
1
0
22 Apr 2024
Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models
Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models
Konstantinos Vilouras
Pedro Sanchez
Alison Q. OÑeil
Sotirios A. Tsaftaris
MedIm
45
2
0
19 Apr 2024
Pre-trained Vision-Language Models Learn Discoverable Visual Concepts
Pre-trained Vision-Language Models Learn Discoverable Visual Concepts
Yuan Zang
Tian Yun
Hao Tan
Trung Bui
Chen Sun
VLM
CoGe
58
9
0
19 Apr 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
39
4
0
18 Apr 2024
Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation
Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation
Jingmin Sun
Yuxuan Liu
Zecheng Zhang
Hayden Schaeffer
AI4CE
30
17
0
18 Apr 2024
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI
  Agent
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
Wei Chen
Zhiyuan Li
LLMAG
30
3
0
17 Apr 2024
Improving Composed Image Retrieval via Contrastive Learning with Scaling
  Positives and Negatives
Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives
Zhangchi Feng
Richong Zhang
Zhijie Nie
41
7
0
17 Apr 2024
Spatial Context-based Self-Supervised Learning for Handwritten Text Recognition
Spatial Context-based Self-Supervised Learning for Handwritten Text Recognition
Carlos Peñarrubia
Carlos Garrido-Munoz
J. J. Valero-Mas
Jorge Calvo-Zaragoza
39
1
0
17 Apr 2024
Previous
123...567...404142
Next