1908.02265
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,119 papers shown
Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation
Atoosa Malemir Chegini
Soheil Feizi
VLM
69
4
0
09 Dec 2023
Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models
Hongzhan Lin
Ziyang Luo
Jing Ma
Long Chen
62
12
0
09 Dec 2023
Cross-BERT for Point Cloud Pretraining
Xin Li
Peng Li
Zeyong Wei
Zhe Zhu
Mingqiang Wei
Junhui Hou
Liangliang Nan
J. Qin
H. Xie
F. Wang
SSL
3DPC
82
0
0
08 Dec 2023
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Jinho Park
Jack Hessel
Khyathi Chandu
Paul Pu Liang
Ximing Lu
...
Youngjae Yu
Qiuyuan Huang
Jianfeng Gao
Ali Farhadi
Yejin Choi
VLM
77
13
0
08 Dec 2023
Visual Grounding of Whole Radiology Reports for 3D CT Images
Akimichi Ichinose
Taro Hatsutani
Keigo Nakamura
Yoshiro Kitamura
S. Iizuka
E. Simo-Serra
Shoji Kido
Noriyuki Tomiyama
82
9
0
08 Dec 2023
Improved Visual Grounding through Self-Consistent Explanations
Ruozhen He
Paola Cascante-Bonilla
Ziyan Yang
Alexander C. Berg
Vicente Ordonez
ReLM
ObjD
LRM
FAtt
93
12
0
07 Dec 2023
Adventures of Trustworthy Vision-Language Models: A Survey
Mayank Vatsa
Anubhooti Jain
Richa Singh
99
4
0
07 Dec 2023
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Yong Liu
Sule Bai
Guanbin Li
Yitong Wang
Yansong Tang
VLM
97
33
0
07 Dec 2023
SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm
Jiandong Jin
Tianlin Li
Chenglong Li
Lili Huang
Jin Tang
AI4TS
67
7
0
04 Dec 2023
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment
Cong-Duy Nguyen
The-Anh Vu-Le
Thong Nguyen
Tho Quan
Anh Tuan Luu
100
6
0
04 Dec 2023
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li
Jiawei Peng
Huiyi Chen
Chongyang Gao
Xu Yang
MLLM
108
22
0
04 Dec 2023
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
Cong Yang
Zuchao Li
Lefei Zhang
77
27
0
02 Dec 2023
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai
Haotian Liu
Dennis Park
Siva Karthik Mustikovela
Gregory P. Meyer
Yuning Chai
Yong Jae Lee
VLM
LRM
MLLM
130
99
0
01 Dec 2023
Which way is 'right'?: Uncovering limitations of Vision-and-Language Navigation model
Meera Hahn
Amit Raj
James M. Rehg
95
3
0
30 Nov 2023
A Lightweight Clustering Framework for Unsupervised Semantic Segmentation
Yau Shing Jonathan Cheung
Xi Chen
Lihe Yang
Hengshuang Zhao
80
1
0
30 Nov 2023
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner
Lizhao Liu
Xinyu Sun
Tianhang Xiang
Zhuangwei Zhuang
Liuren Yin
Mingkui Tan
VLM
62
3
0
29 Nov 2023
PALM: Predicting Actions through Language Models
Sanghwan Kim
Daoji Huang
Yongqin Xian
Otmar Hilliges
Luc Van Gool
Xi Wang
VLM
87
14
0
29 Nov 2023
Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?
Wang Zhu
Ishika Singh
Yuan Huang
Robin Jia
Jesse Thomason
131
2
0
28 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM
ELM
VLM
471
960
0
27 Nov 2023
InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery
He Cao
Zijing Liu
Xingyu Lu
Yuan Yao
Yu-Feng Li
114
68
0
27 Nov 2023
C-SAW: Self-Supervised Prompt Learning for Image Generalization in Remote Sensing
Avigyan Bhattacharya
Mainak Singha
Ankit Jha
Biplab Banerjee
SSL
VLM
85
6
0
27 Nov 2023
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
Bin Xie
Jiale Cao
Jin Xie
Fahad Shahbaz Khan
Yanwei Pang
VLM
125
48
0
27 Nov 2023
Generalized Graph Prompt: Toward a Unification of Pre-Training and Downstream Tasks on Graphs
Xingtong Yu
Zhenghao Liu
Yuan Fang
Zemin Liu
Sihong Chen
Xinming Zhang
127
31
0
26 Nov 2023
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training
Cheng Tan
Jingxuan Wei
Zhangyang Gao
Linzhuang Sun
Siyuan Li
Ruifeng Guo
Xihong Yang
Stan Z. Li
LRM
100
10
0
23 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
85
12
0
22 Nov 2023
A Survey on Multimodal Large Language Models for Autonomous Driving
Can Cui
Yunsheng Ma
Xu Cao
Wenqian Ye
Yang Zhou
...
Xinrui Yan
Shuqi Mei
Jianguo Cao
Ziran Wang
Chao Zheng
172
291
0
21 Nov 2023
Active Prompt Learning in Vision Language Models
Jihwan Bang
Sumyeong Ahn
Jae-Gil Lee
VLM
66
14
0
18 Nov 2023
Fuse It or Lose It: Deep Fusion for Multimodal Simulation-Based Inference
Marvin Schmitt
Stefan T. Radev
Paul-Christian Bürkner
155
5
0
17 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
145
72
0
16 Nov 2023
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search
Hefeng Wu
Weifeng Chen
Zhibin Liu
Tianshui Chen
Zhiguang Chen
Liang Lin
86
13
0
15 Nov 2023
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
Yating Xu
Conghui Hu
Gim Hee Lee
55
2
0
14 Nov 2023
Learning Mutually Informed Representations for Characters and Subwords
Yilin Wang
Xinyi Hu
Matthew R. Gormley
70
0
0
14 Nov 2023
Interaction is all You Need? A Study of Robots Ability to Understand and Execute
Kushal Koshti
Nidhir Bhavsar
103
1
0
13 Nov 2023
TTMFN: Two-stream Transformer-based Multimodal Fusion Network for Survival Prediction
Ruiquan Ge
Xiangyang Hu
Rungen Huang
Gangyong Jia
Yaqi Wang
...
Changmiao Wang
Elazab Ahmed
Linyan Wang
Juan Ye
Ye Li
ViT
13
1
0
13 Nov 2023
Detecting and Correcting Hate Speech in Multimodal Memes with Large Visual Language Model
Minh-Hao Van
Xintao Wu
VLM
MLLM
65
11
0
12 Nov 2023
Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding
Chancharik Mitra
Abrar Anwar
Rodolfo Corona
Dan Klein
Trevor Darrell
Jesse Thomason
72
1
0
12 Nov 2023
High-Performance Transformers for Table Structure Recognition Need Early Convolutions
Sheng-Hsuan Peng
Seongmin Lee
Xiaojing Wang
Rajarajeswari Balasubramaniyan
Duen Horng Chau
ViT
LMTD
50
3
0
09 Nov 2023
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
Cheng Yang
Rui Xu
Ye Guo
Peixiang Huang
Yiru Chen
Wenkui Ding
Zhongyuan Wang
Hong Zhou
LRM
64
6
0
09 Nov 2023
Multitask Multimodal Prompted Training for Interactive Embodied Task Completion
Georgios Pantazopoulos
Malvina Nikandrou
Amit Parekh
Bhathiya Hemanthage
Arash Eshghi
Ioannis Konstas
Verena Rieser
Oliver Lemon
Alessandro Suglia
LM&Ro
82
7
0
07 Nov 2023
Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI
Yaoxian Song
Penglei Sun
Haoyu Liu
Li Zhixu
Wei Song
Yanghua Xiao
Xiaofang Zhou
LM&Ro
131
16
0
07 Nov 2023
CLIP-Motion: Learning Reward Functions for Robotic Actions Using Consecutive Observations
Xuzhe Dang
Stefan Edelkamp
176
4
0
06 Nov 2023
Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols
Iqra Qasim
Alexander Horsch
Dilip K. Prasad
96
9
0
05 Nov 2023
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models
Jingru Yi
Burak Uzkent
Oana Ignat
Zili Li
Amanmeet Garg
Xiang Yu
Linda Liu
VLM
85
1
0
05 Nov 2023
LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds
Anqi Joyce Yang
Sergio Casas
Nikita Dvornik
Sean Segal
Yuwen Xiong
Jordan Sir Kwang Hu
Carter Fang
R. Urtasun
86
6
0
02 Nov 2023
Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation
Xue-mei Hu
Ce Zhang
Yi Zhang
Bowen Hai
Ke Yu
Zhihai He
MDE
VLM
100
18
0
02 Nov 2023
Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection
Sungjune Park
Hyunjun Kim
Y. Ro
82
12
0
02 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
156
44
0
01 Nov 2023
Re-Scoring Using Image-Language Similarity for Few-Shot Object Detection
Min Jae Jung
S. Han
Joohee Kim
132
14
0
01 Nov 2023
Object-centric Video Representation for Long-term Action Anticipation
Ce Zhang
Changcheng Fu
Shijie Wang
Nakul Agarwal
Kwonjoon Lee
Chiho Choi
Chen Sun
127
17
0
31 Oct 2023
The Expressibility of Polynomial based Attention Scheme
Zhao Song
Guangyi Xu
Junze Yin
95
5
0
30 Oct 2023