Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2004.06165
Cited By
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
13 April 2020
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
Lei Zhang
Lijuan Wang
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks"
50 / 478 papers shown
Title
Contrastive Vision-Language Pre-training with Limited Resources
Quan Cui
Boyan Zhou
Yu Guo
Weidong Yin
Hao Wu
Osamu Yoshie
Yubo Chen
VLM
CLIP
19
33
0
17 Dec 2021
RegionCLIP: Region-based Language-Image Pretraining
Yiwu Zhong
Jianwei Yang
Pengchuan Zhang
Chunyuan Li
Noel Codella
...
Luowei Zhou
Xiyang Dai
Lu Yuan
Yin Li
Jianfeng Gao
VLM
CLIP
40
557
0
16 Dec 2021
SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Zhecan Wang
Haoxuan You
Liunian Harold Li
Alireza Zareian
Suji Park
Yiqing Liang
Kai-Wei Chang
Shih-Fu Chang
ReLM
LRM
15
30
0
16 Dec 2021
ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition
Xinyu Wang
Min Gui
Yong-jia Jiang
Zixia Jia
Nguyen Bach
Tao Wang
Zhongqiang Huang
Fei Huang
Kewei Tu
44
52
0
13 Dec 2021
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
Tianyi Liu
Zuxuan Wu
Wenhan Xiong
Jingjing Chen
Yu-Gang Jiang
VLM
MLLM
32
10
0
10 Dec 2021
Injecting Semantic Concepts into End-to-End Image Captioning
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lin Liang
Zhe Gan
Lijuan Wang
Yezhou Yang
Zicheng Liu
ViT
VLM
29
86
0
09 Dec 2021
CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions
Rameen Abdal
Peihao Zhu
John C. Femiani
Niloy J. Mitra
Peter Wonka
CLIP
39
103
0
09 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP
VLM
40
690
0
08 Dec 2021
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjD
VLM
31
1,023
0
07 Dec 2021
Embedding Arithmetic of Multimodal Queries for Image Retrieval
Guillaume Couairon
Matthieu Cord
Matthijs Douze
Holger Schwenk
37
23
0
06 Dec 2021
General Facial Representation Learning in a Visual-Linguistic Manner
Yinglin Zheng
Hao Yang
Ting Zhang
Jianmin Bao
Dongdong Chen
Yangyu Huang
Lu Yuan
Dong Chen
Ming Zeng
Fang Wen
CVBM
146
164
0
06 Dec 2021
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
Longtian Qiu
Renrui Zhang
Ziyu Guo
Wei Zhang
Zilu Guo
Ziyao Zeng
Guangnan Zhang
VLM
CLIP
30
45
0
04 Dec 2021
Video-Text Pre-training with Learned Regions
Rui Yan
Mike Zheng Shou
Yixiao Ge
Alex Jinpeng Wang
Xudong Lin
Guanyu Cai
Jinhui Tang
33
23
0
02 Dec 2021
Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
Christopher Clark
Jordi Salvador
Dustin Schwenk
Derrick Bonafilia
Mark Yatskar
...
Aaron Sarnat
Hannaneh Hajishirzi
Aniruddha Kembhavi
Oren Etzioni
Ali Farhadi
MLLM
25
3
0
01 Dec 2021
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang
Yixiao Ge
Guanyu Cai
Rui Yan
Xudong Lin
Ying Shan
Xiaohu Qie
Mike Zheng Shou
ViT
VLM
25
79
0
01 Dec 2021
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
Bang-ju Yang
Tong Zhang
Yuexian Zou
CLIP
25
20
0
30 Nov 2021
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
Yoad Tewel
Yoav Shalev
Idan Schwartz
Lior Wolf
VLM
34
192
0
29 Nov 2021
LAFITE: Towards Language-Free Training for Text-to-Image Generation
Yufan Zhou
Ruiyi Zhang
Changyou Chen
Chunyuan Li
Chris Tensmeyer
Tong Yu
Jiuxiang Gu
Jinhui Xu
Tong Sun
VLM
35
163
0
27 Nov 2021
VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition
Changyao Tian
Wenhai Wang
Xizhou Zhu
Jifeng Dai
Yu Qiao
VLM
32
70
0
26 Nov 2021
Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
Zipeng Xu
Tianwei Lin
Hao Tang
Fu Li
Dongliang He
N. Sebe
Radu Timofte
Luc Van Gool
Errui Ding
EGVM
38
41
0
26 Nov 2021
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Marcella Cornia
Lorenzo Baraldi
G. Fiameni
Rita Cucchiara
20
12
0
24 Nov 2021
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLM
VLM
34
246
0
24 Nov 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
29
111
0
23 Nov 2021
Building Goal-Oriented Dialogue Systems with Situated Visual Context
Sanchit Agarwal
Jan Jezabek
Arijit Biswas
Emre Barut
Shuyang Gao
Tagyoung Chung
23
1
0
22 Nov 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
57
882
0
22 Nov 2021
RedCaps: web-curated image-text data created by the people, for the people
Karan Desai
Gaurav Kaul
Zubin Aysola
Justin Johnson
31
162
0
22 Nov 2021
Class-agnostic Object Detection with Multi-modal Transformer
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
Rao Muhammad Anwer
Ming Yang
23
91
0
22 Nov 2021
L-Verse: Bidirectional Generation Between Image and Text
Taehoon Kim
Gwangmo Song
Sihaeng Lee
Sangyun Kim
Yewon Seo
Soonyoung Lee
S. Kim
Honglak Lee
Kyunghoon Bae
28
25
0
22 Nov 2021
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue
Tiankai Hang
Yanhong Zeng
Yuchong Sun
Bei Liu
Huan Yang
Jianlong Fu
B. Guo
AI4TS
VLM
31
189
0
19 Nov 2021
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Jianfeng Wang
Xiaowei Hu
Zhe Gan
Zhengyuan Yang
Xiyang Dai
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
29
57
0
19 Nov 2021
Transparent Human Evaluation for Image Captioning
Jungo Kasai
Keisuke Sakaguchi
Lavinia Dunagan
Jacob Morrison
Ronan Le Bras
Yejin Choi
Noah A. Smith
33
47
0
17 Nov 2021
Achieving Human Parity on Visual Question Answering
Ming Yan
Haiyang Xu
Chenliang Li
Junfeng Tian
Bin Bi
...
Ji Zhang
Songfang Huang
Fei Huang
Luo Si
Rong Jin
32
12
0
17 Nov 2021
A Survey of Visual Transformers
Yang Liu
Yao Zhang
Yixin Wang
Feng Hou
Jin Yuan
Jiang Tian
Yang Zhang
Zhongchao Shi
Jianping Fan
Zhiqiang He
3DGS
ViT
77
332
0
11 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
30
619
0
09 Nov 2021
An Empirical Study of Training End-to-End Vision-and-Language Transformers
Zi-Yi Dou
Yichong Xu
Zhe Gan
Jianfeng Wang
Shuohang Wang
...
Pengchuan Zhang
Lu Yuan
Nanyun Peng
Zicheng Liu
Michael Zeng
VLM
38
369
0
03 Nov 2021
SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
A. Moudgil
Arjun Majumdar
Harsh Agrawal
Stefan Lee
Dhruv Batra
LM&Ro
27
57
0
27 Oct 2021
History Aware Multimodal Transformer for Vision-and-Language Navigation
Shizhe Chen
Pierre-Louis Guhur
Cordelia Schmid
Ivan Laptev
LM&Ro
28
226
0
25 Oct 2021
Text-Based Person Search with Limited Data
Xiaoping Han
Sen He
Li Zhang
Tao Xiang
21
89
0
20 Oct 2021
A Self-Explainable Stylish Image Captioning Framework via Multi-References
Chengxi Li
Brent Harrison
26
0
0
20 Oct 2021
A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation
Yupan Huang
Bei Liu
Jianlong Fu
Yutong Lu
DiffM
27
5
0
19 Oct 2021
Unifying Multimodal Transformer for Bi-directional Image and Text Generation
Yupan Huang
Hongwei Xue
Bei Liu
Yutong Lu
19
57
0
19 Oct 2021
Guiding Visual Question Generation
Nihir Vedd
Zixu Wang
Marek Rei
Yishu Miao
Lucia Specia
89
23
0
15 Oct 2021
Object-Region Video Transformers
Roei Herzig
Elad Ben-Avraham
K. Mangalam
Amir Bar
Gal Chechik
Anna Rohrbach
Trevor Darrell
Amir Globerson
ViT
30
82
0
13 Oct 2021
Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese
ZhuoSheng Zhang
Hanqing Zhang
Keming Chen
Yuhang Guo
Jingyun Hua
Yulong Wang
Ming Zhou
VLM
55
70
0
13 Oct 2021
CLIP4Caption: CLIP for Video Caption
Mingkang Tang
Zhanyu Wang
Zhenhua Liu
Fengyun Rao
Dian Li
Xiu Li
CLIP
VLM
35
150
0
13 Oct 2021
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Yangguang Li
Feng Liang
Lichen Zhao
Yufeng Cui
Wanli Ouyang
Jing Shao
F. Yu
Junjie Yan
VLM
CLIP
50
448
0
11 Oct 2021
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Peng Gao
Shijie Geng
Renrui Zhang
Teli Ma
Rongyao Fang
Yongfeng Zhang
Hongsheng Li
Yu Qiao
VLM
CLIP
101
991
0
09 Oct 2021
Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering
Long Chen
Yuhang Zheng
Yulei Niu
Hanwang Zhang
Jun Xiao
AAML
OOD
21
36
0
03 Oct 2021
CIDEr-R: Robust Consensus-based Image Description Evaluation
G. O. D. Santos
Esther Luna Colombini
Sandra Avila
47
30
0
28 Sep 2021
Multimodal Integration of Human-Like Attention in Visual Question Answering
Ekta Sood
Fabian Kögel
Philippe Muller
Dominike Thomas
Mihai Bâce
Andreas Bulling
41
16
0
27 Sep 2021
Previous
1
2
3
...
10
7
8
9
Next