ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.11746
  4. Cited By
VidTr: Video Transformer Without Convolutions
v1v2 (latest)

VidTr: Video Transformer Without Convolutions

23 April 2021
Yanyi Zhang
Xinyu Li
Chunhui Liu
Bing Shuai
Yi Zhu
Biagio Brattoli
Hao Chen
I. Marsic
Joseph Tighe
    ViT
ArXiv (abs)PDFHTML

Papers citing "VidTr: Video Transformer Without Convolutions"

50 / 52 papers shown
Title
Interpretability-Aware Vision Transformer
Interpretability-Aware Vision Transformer
Yao Qiang
Chengyin Li
Prashant Khanduri
D. Zhu
ViT
221
8
0
14 Sep 2023
Video Swin Transformer
Video Swin Transformer
Ze Liu
Jia Ning
Yue Cao
Yixuan Wei
Zheng Zhang
Stephen Lin
Han Hu
ViT
103
1,482
0
24 Jun 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Ishan Misra Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
86
279
0
09 Jun 2021
Multiscale Vision Transformers
Multiscale Vision Transformers
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
132
1,259
0
22 Apr 2021
An Empirical Study of Training Self-Supervised Vision Transformers
An Empirical Study of Training Self-Supervised Vision Transformers
Xinlei Chen
Saining Xie
Kaiming He
ViT
154
1,864
0
05 Apr 2021
CvT: Introducing Convolutions to Vision Transformers
CvT: Introducing Convolutions to Vision Transformers
Haiping Wu
Bin Xiao
Noel Codella
Mengchen Liu
Xiyang Dai
Lu Yuan
Lei Zhang
ViT
152
1,910
0
29 Mar 2021
ViViT: A Video Vision Transformer
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
222
2,150
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
450
21,439
0
25 Mar 2021
Is Space-Time Attention All You Need for Video Understanding?
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
381
2,053
0
09 Feb 2021
Video Transformer Network
Video Transformer Network
Daniel Neimark
Omri Bar
Maya Zohar
Dotan Asselmann
ViT
264
432
0
01 Feb 2021
Tokens-to-Token ViT: Training Vision Transformers from Scratch on
  ImageNet
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Li-xin Yuan
Yunpeng Chen
Tao Wang
Weihao Yu
Yujun Shi
Zihang Jiang
Francis E. H. Tay
Jiashi Feng
Shuicheng Yan
ViT
133
1,939
0
28 Jan 2021
SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation
SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation
Brendan Duke
Abdalla Ahmed
Christian Wolf
P. Aarabi
Graham W. Taylor
VOS
61
166
0
21 Jan 2021
Training data-efficient image transformers & distillation through
  attention
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
387
6,768
0
23 Dec 2020
NUTA: Non-uniform Temporal Aggregation for Action Recognition
NUTA: Non-uniform Temporal Aggregation for Action Recognition
Xinyu Li
Chunhui Liu
Bing Shuai
Yi Zhu
Hao Chen
Joseph Tighe
ViT
40
16
0
15 Dec 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at
  Scale
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
657
41,103
0
22 Oct 2020
Directional Temporal Modeling for Action Recognition
Directional Temporal Modeling for Action Recognition
Xinyu Li
Bing Shuai
Joseph Tighe
48
41
0
21 Jul 2020
Multi-modal Transformer for Video Retrieval
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
533
610
0
21 Jul 2020
End-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
Nicolas Carion
Francisco Massa
Gabriel Synnaeve
Nicolas Usunier
Alexander Kirillov
Sergey Zagoruyko
ViT3DVPINN
415
13,048
0
26 May 2020
Quantifying Attention Flow in Transformers
Quantifying Attention Flow in Transformers
Samira Abnar
Willem H. Zuidema
157
796
0
02 May 2020
Longformer: The Long-Document Transformer
Longformer: The Long-Document Transformer
Iz Beltagy
Matthew E. Peters
Arman Cohan
RALMVLM
171
4,071
0
10 Apr 2020
X3D: Expanding Architectures for Efficient Video Recognition
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
134
1,020
0
09 Apr 2020
Temporal Pyramid Network for Action Recognition
Temporal Pyramid Network for Action Recognition
Ceyuan Yang
Yinghao Xu
Jianping Shi
Bo Dai
Bolei Zhou
47
372
0
07 Apr 2020
TEA: Temporal Excitation and Aggregation for Action Recognition
TEA: Temporal Excitation and Aggregation for Action Recognition
Yan-Ran Li
Bin Ji
Xintian Shi
Jianguo Zhang
Bin Kang
Limin Wang
ViT
87
447
0
03 Apr 2020
Bridging Text and Video: A Universal Multimodal Transformer for
  Video-Audio Scene-Aware Dialog
Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog
Zekang Li
Zongjia Li
Jinchao Zhang
Yang Feng
Cheng Niu
Jie Zhou
123
37
0
01 Feb 2020
Temporal Interlacing Network
Temporal Interlacing Network
Hao Shao
Shengju Qian
Yu Liu
60
93
0
17 Jan 2020
More Is Less: Learning Efficient Video Representations by Big-Little
  Network and Depthwise Temporal Aggregation
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
Quanfu Fan
Chun-Fu Chen
Hilde Kuehne
Marco Pistoia
David D. Cox
74
126
0
02 Dec 2019
TEINet: Towards an Efficient Architecture for Video Recognition
TEINet: Towards an Efficient Architecture for Video Recognition
Zhaoyang Liu
Donghao Luo
Yabiao Wang
Limin Wang
Ying Tai
Chengjie Wang
Jilin Li
Feiyue Huang
Tong Lu
ViT
77
240
0
21 Nov 2019
Tiny Video Networks
Tiny Video Networks
A. Piergiovanni
A. Angelova
Michael S. Ryoo
112
46
0
15 Oct 2019
STM: SpatioTemporal and Motion Encoding for Action Recognition
STM: SpatioTemporal and Motion Encoding for Action Recognition
Boyuan Jiang
Mengmeng Wang
Weihao Gan
Wei Wu
Junjie Yan
81
382
0
07 Aug 2019
A Short Note on the Kinetics-700 Human Action Dataset
A Short Note on the Kinetics-700 Human Action Dataset
João Carreira
Eric Noland
Chloe Hillier
Andrew Zisserman
76
453
0
15 Jul 2019
Video Modeling with Correlation Networks
Video Modeling with Correlation Networks
Heng Wang
Du Tran
Lorenzo Torresani
Matt Feiszli
67
129
0
07 Jun 2019
Video Classification with Channel-Separated Convolutional Networks
Video Classification with Channel-Separated Convolutional Networks
Du Tran
Heng Wang
Lorenzo Torresani
Matt Feiszli
3DV
72
587
0
04 Apr 2019
Long-Term Feature Banks for Detailed Video Understanding
Long-Term Feature Banks for Detailed Video Understanding
Chao-Yuan Wu
Christoph Feichtenhofer
Haoqi Fan
Kaiming He
Philipp Krahenbuhl
Ross B. Girshick
174
480
0
12 Dec 2018
SlowFast Networks for Video Recognition
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
166
3,274
0
10 Dec 2018
Video Action Transformer Network
Video Action Transformer Network
Rohit Girdhar
João Carreira
Carl Doersch
Andrew Zisserman
ViT
126
709
0
06 Dec 2018
TSM: Temporal Shift Module for Efficient Video Understanding
TSM: Temporal Shift Module for Efficient Video Understanding
Ji Lin
Chuang Gan
Song Han
98
1,691
0
20 Nov 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language
  Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLMSSLSSeg
1.8K
94,891
0
11 Oct 2018
Multi-Fiber Networks for Video Recognition
Multi-Fiber Networks for Video Recognition
Yunpeng Chen
Yannis Kalantidis
Jianshu Li
Shuicheng Yan
Jiashi Feng
CVBM
105
218
0
30 Jul 2018
Videos as Space-Time Region Graphs
Videos as Space-Time Region Graphs
Xinyu Wang
Abhinav Gupta
104
756
0
05 Jun 2018
End-to-End Dense Video Captioning with Masked Transformer
End-to-End Dense Video Captioning with Masked Transformer
Luowei Zhou
Yingbo Zhou
Jason J. Corso
R. Socher
Caiming Xiong
92
528
0
03 Apr 2018
A Closer Look at Spatiotemporal Convolutions for Action Recognition
A Closer Look at Spatiotemporal Convolutions for Action Recognition
Du Tran
Heng Wang
Lorenzo Torresani
Jamie Ray
Yann LeCun
Manohar Paluri
218
3,030
0
30 Nov 2017
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Kensho Hara
Hirokatsu Kataoka
Y. Satoh
3DPC
126
1,934
0
27 Nov 2017
Temporal Relational Reasoning in Videos
Temporal Relational Reasoning in Videos
Bolei Zhou
A. Andonian
Aude Oliva
Antonio Torralba
NAI
96
1,039
0
22 Nov 2017
Non-local Neural Networks
Non-local Neural Networks
Xinyu Wang
Ross B. Girshick
Abhinav Gupta
Kaiming He
OffRL
289
8,906
0
21 Nov 2017
The "something something" video database for learning and evaluating
  visual common sense
The "something something" video database for learning and evaluating visual common sense
Raghav Goyal
Samira Ebrahimi Kahou
Vincent Michalski
Joanna Materzynska
S. Westphal
...
Moritz Mueller-Freitag
F. Hoppe
Christian Thurau
Ingo Bax
Roland Memisevic
VLM
87
1,535
0
13 Jun 2017
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
701
131,652
0
12 Jun 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
232
8,019
0
22 May 2017
ActionVLAD: Learning spatio-temporal aggregation for action
  classification
ActionVLAD: Learning spatio-temporal aggregation for action classification
Rohit Girdhar
Deva Ramanan
Abhinav Gupta
Josef Sivic
Bryan C. Russell
AI4TS
72
451
0
10 Apr 2017
Aggregated Residual Transformations for Deep Neural Networks
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie
Ross B. Girshick
Piotr Dollár
Zhuowen Tu
Kaiming He
517
10,330
0
16 Nov 2016
Temporal Segment Networks: Towards Good Practices for Deep Action
  Recognition
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Limin Wang
Yuanjun Xiong
Zhe Wang
Yu Qiao
Dahua Lin
Xiaoou Tang
Luc Van Gool
ViT
105
3,835
0
02 Aug 2016
12
Next