Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2111.15162
Cited By
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
30 November 2021
Bang-ju Yang
Tong Zhang
Yuexian Zou
CLIP
Re-assign community
ArXiv
PDF
HTML
Papers citing
"CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter"
31 / 31 papers shown
Title
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
95
12
0
28 Oct 2022
O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning
Fenglin Liu
Xuancheng Ren
Xian Wu
Bang-ju Yang
Shen Ge
Yuexian Zou
Xu Sun
57
32
0
05 Aug 2021
How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen
Liunian Harold Li
Hao Tan
Joey Tianyi Zhou
Anna Rohrbach
Kai-Wei Chang
Z. Yao
Kurt Keutzer
CLIP
VLM
MLLM
253
408
0
13 Jul 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
391
801
0
18 Apr 2021
Open-book Video Captioning with Retrieve-Copy-Generate Network
Ziqi Zhang
Zhongang Qi
Chun Yuan
Ying Shan
Bing Li
Ying Deng
Weiming Hu
52
93
0
09 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
806
29,167
0
26 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
112
1,735
0
05 Feb 2021
Learning Visual Representations with Caption Annotations
Mert Bulent Sariyildiz
J. Perez
Diane Larlus
VLM
SSL
80
159
0
04 Aug 2020
VirTex: Learning Visual Representations from Textual Annotations
Karan Desai
Justin Johnson
SSL
VLM
144
433
0
11 Jun 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
101
502
0
01 May 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
88
1,934
0
13 Apr 2020
Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
Boxiao Pan
Haoye Cai
De-An Huang
Kuan-Hui Lee
Adrien Gaidon
Ehsan Adeli
Juan Carlos Niebles
56
236
0
31 Mar 2020
Normalized and Geometry-Aware Self-Attention Network for Image Captioning
Longteng Guo
Jing Liu
Xinxin Zhu
Peng Yao
Shichen Lu
Hanqing Lu
ViT
174
190
0
19 Mar 2020
Object Relational Graph with Teacher-Recommended Learning for Video Captioning
Ziqi Zhang
Yaya Shi
Chunfen Yuan
Bing Li
Peijin Wang
Weiming Hu
Zhengjun Zha
VLM
78
272
0
26 Feb 2020
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Xin Eric Wang
Jiawei Wu
Junkun Chen
Lei Li
Yuan-fang Wang
William Yang Wang
91
549
0
06 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
69
1,243
0
03 Apr 2019
Rethinking ImageNet Pre-training
Kaiming He
Ross B. Girshick
Piotr Dollár
VLM
SSeg
123
1,083
0
21 Nov 2018
End-to-End Video Captioning with Multitask Reinforcement Learning
Lijun Li
Boqing Gong
46
56
0
21 Mar 2018
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Kensho Hara
Hirokatsu Kataoka
Y. Satoh
3DPC
118
1,931
0
27 Nov 2017
Video Captioning with Guidance of Multimodal Latent Topics
Shizhe Chen
Jia Chen
Qin Jin
Alexander G. Hauptmann
85
67
0
31 Aug 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
622
130,942
0
12 Jun 2017
Multi-Task Video Captioning with Video and Entailment Generation
Ramakanth Pasunuru
Joey Tianyi Zhou
54
117
0
24 Apr 2017
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
Youngjae Yu
Hyungjin Ko
Jongwook Choi
Gunhee Kim
115
231
0
10 Oct 2016
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju
Michael Cogswell
Abhishek Das
Ramakrishna Vedantam
Devi Parikh
Dhruv Batra
FAtt
248
19,929
0
07 Oct 2016
CNN Architectures for Large-Scale Audio Classification
Shawn Hershey
Sourish Chaudhuri
D. Ellis
J. Gemmeke
A. Jansen
...
Rif A. Saurous
Bryan Seybold
M. Slaney
Ron J. Weiss
K. Wilson
111
2,494
0
29 Sep 2016
Layer Normalization
Jimmy Lei Ba
J. Kiros
Geoffrey E. Hinton
332
10,467
0
21 Jul 2016
Deep Residual Learning for Image Recognition
Kaiming He
Xinming Zhang
Shaoqing Ren
Jian Sun
MedIm
1.8K
193,426
0
10 Dec 2015
Sequence to Sequence -- Video to Text
Subhashini Venugopalan
Marcus Rohrbach
Jeff Donahue
Raymond J. Mooney
Trevor Darrell
Kate Saenko
108
1,417
0
03 May 2015
Adam: A Method for Stochastic Optimization
Diederik P. Kingma
Jimmy Ba
ODL
1.4K
149,842
0
22 Dec 2014
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
Subhashini Venugopalan
Huijuan Xu
Jeff Donahue
Marcus Rohrbach
Raymond J. Mooney
Kate Saenko
106
952
0
15 Dec 2014
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
248
4,471
0
20 Nov 2014
1