arXiv:2111.13196
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
25 November 2021
Kevin Lin
Linjie Li
Chung-Ching Lin
Faisal Ahmed
Zhe Gan
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
Papers citing "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"
50 / 58 papers shown
Title
FocusedAD: Character-centric Movie Audio Description
Xiaojun Ye
C. Wang
Yiren Song
Sheng Zhou
Liangcheng Li
Jiajun Bu
VGen
67
0
0
16 Apr 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Yogesh S Rawat
VLM
319
2
0
11 Mar 2025
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
Peng Jin
Haoyang Li
Li Yuan
Shuicheng Yan
Jie Chen
83
1
0
31 Dec 2024
TechCoach: Towards Technical-Point-Aware Descriptive Action Coaching
Yuan-Ming Li
An-Lan Wang
Kun-Yu Lin
Yu-Ming Tang
Ling-an Zeng
Jian-Fang Hu
Wei-Shi Zheng
132
6
0
26 Nov 2024
Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion
Dexuan Ding
Lei Wang
Liyun Zhu
Tom Gedeon
Piotr Koniusz
78
8
0
02 Oct 2024
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
Asmar Nadeem
Faegheh Sardari
R. Dawes
Syed Sameed Husain
Adrian Hilton
Armin Mustafa
74
4
0
10 Jun 2024
Contextual AD Narration with Interleaved Multimodal Sequence
Hanlin Wang
Zhan Tong
Kecheng Zheng
Yujun Shen
Limin Wang
VGen
68
4
0
19 Mar 2024
Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog
Haoyu Zhang
Meng Liu
Yaowei Wang
Da Cao
Weili Guan
Liqiang Nie
54
0
0
11 Oct 2023
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
Yifan Xu
Zhijie Zhang
Mengdan Zhang
Kekai Sheng
Ke Li
Weiming Dong
Liqing Zhang
Changsheng Xu
Xing Sun
ViT
50
207
0
03 Aug 2021
Video Swin Transformer
Ze Liu
Jia Ning
Yue Cao
Yixuan Wei
Zheng Zhang
Stephen Lin
Han Hu
ViT
62
1,458
0
24 Jun 2021
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
Linjie Li
Jie Lei
Zhe Gan
Licheng Yu
Yen-Chun Chen
...
Tamara L. Berg
Mohit Bansal
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
45
100
0
08 Jun 2021
MERLOT: Multimodal Neural Script Knowledge Models
Rowan Zellers
Ximing Lu
Jack Hessel
Youngjae Yu
J. S. Park
Jize Cao
Ali Farhadi
Yejin Choi
VLM
LRM
62
377
0
04 Jun 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
353
796
0
18 Apr 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
89
1,154
0
01 Apr 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
84
2,119
0
29 Mar 2021
Open-book Video Captioning with Retrieve-Copy-Generate Network
Ziqi Zhang
Zhongang Qi
Chunfeng Yuan
Ying Shan
Bing Li
Ying Deng
Weiming Hu
39
93
0
09 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
507
28,659
0
26 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Mohit Bansal
Jingjing Liu
CLIP
82
651
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
301
2,016
0
09 Feb 2021
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
86
419
0
14 Nov 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
157
40,217
0
22 Oct 2020
Support-set bottlenecks for video-text representation learning
Mandela Patrick
Po-Yao (Bernie) Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João Henriques
Andrea Vedaldi
31
246
0
06 Oct 2020
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
Xiaowei Hu
Xi Yin
Kevin Lin
Lijuan Wang
Lei Zhang
Jianfeng Gao
Zicheng Liu
VLM
39
56
0
28 Sep 2020
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos
Shaoxiang Chen
Wenhao Jiang
Wei Liu
Yu-Gang Jiang
35
101
0
28 Jul 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
76
496
0
01 May 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
58
1,927
0
13 Apr 2020
Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
Boxiao Pan
Haoye Cai
De-An Huang
Kuan-Hui Lee
Adrien Gaidon
Ehsan Adeli
Juan Carlos Niebles
47
234
0
31 Mar 2020
Object Relational Graph with Teacher-Recommended Learning for Video Captioning
Ziqi Zhang
Yaya Shi
Chunfeng Yuan
Bing Li
Peijin Wang
Weiming Hu
Zhengjun Zha
VLM
46
271
0
26 Feb 2020
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Huaishao Luo
Lei Ji
Botian Shi
Haoyang Huang
Nan Duan
Tianrui Li
Jason Li
Xilin Chen
Ming Zhou
VLM
65
442
0
15 Feb 2020
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
Jie Lei
Licheng Yu
Tamara L. Berg
Mohit Bansal
170
279
0
24 Jan 2020
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Sivic
Andrew Zisserman
VGen
SSL
84
705
0
13 Dec 2019
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke
Sam Gross
Francisco Massa
Adam Lerer
James Bradbury
...
Sasank Chilamkurthy
Benoit Steiner
Lu Fang
Junjie Bai
Soumith Chintala
ODL
106
42,038
0
03 Dec 2019
Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network
Bairui Wang
Lin Ma
Wei Zhang
Wenhao Jiang
Jingwen Wang
Wei Liu
81
163
0
27 Aug 2019
AI and Accessibility: A Discussion of Ethical Considerations
M. R. Morris
14
59
0
21 Aug 2019
Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
Junchao Zhang
Yuxin Peng
42
170
0
11 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
81
1,186
0
07 Jun 2019
Memory-Attended Recurrent Network for Video Captioning
Wenjie Pei
Jiyuan Zhang
Xiangrong Wang
Lei Ke
Xiaoyong Shen
Yu-Wing Tai
90
199
0
10 May 2019
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Xin Eric Wang
Jiawei Wu
Junkun Chen
Lei Li
Yuan-fang Wang
William Yang Wang
55
544
0
06 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
26
1,238
0
03 Apr 2019
Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning
Nayyer Aafaq
Naveed Akhtar
Wei Liu
Syed Zulqarnain Gilani
Ajmal Mian
38
204
0
27 Feb 2019
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
130
3,244
0
10 Dec 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
751
93,936
0
11 Oct 2018
TVQA: Localized, Compositional Video Question Answering
Jie Lei
Licheng Yu
Mohit Bansal
Tamara L. Berg
47
619
0
05 Sep 2018
A Short Note about Kinetics-600
João Carreira
Eric Noland
Andras Banki-Horvath
Chloe Hillier
Andrew Zisserman
50
520
0
03 Aug 2018
End-to-End Dense Video Captioning with Masked Transformer
Luowei Zhou
Yingbo Zhou
Jason J. Corso
R. Socher
Caiming Xiong
59
527
0
03 Apr 2018
Less Is More: Picking Informative Frames for Video Captioning
Yangyu Chen
Shuhui Wang
Weigang Zhang
Qingming Huang
35
200
0
05 Mar 2018
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Patrick Murphy
3DH
119
1,317
0
13 Dec 2017
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Kensho Hara
Hirokatsu Kataoka
Y. Satoh
3DPC
89
1,926
0
27 Nov 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
178
7,961
0
22 May 2017
The Kinetics Human Action Video Dataset
W. Kay
João Carreira
Karen Simonyan
Brian Zhang
Chloe Hillier
...
Tim Green
T. Back
Apostol Natsev
Mustafa Suleyman
Andrew Zisserman
182
3,771
0
19 May 2017