2111.11591
Cited By
Efficient Video Transformers with Spatial-Temporal Token Selection
23 November 2021
Junke Wang
Xitong Yang
Hengduo Li
Li Liu
Zuxuan Wu
Yu-Gang Jiang
ViT
Papers citing
"Efficient Video Transformers with Spatial-Temporal Token Selection"
50 / 71 papers shown
Title
SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer
Wenxi Li
Yuchen Guo
Jilai Zheng
Haozhe Lin
Chao Ma
Lu Fang
Xiaokang Yang
ViT
98
4
0
11 Feb 2025
Principles of Visual Tokens for Efficient Video Understanding
Xinyue Hao
Gen Li
Shreyank N. Gowda
Robert B Fisher
Jonathan Huang
Anurag Arnab
Laura Sevilla-Lara
134
0
0
20 Nov 2024
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
Leqi Shen
Tianxiang Hao
Tao He
Sicheng Zhao
Pengzhang Liu
Yongjun Bao
Guiguang Ding
213
14
0
02 Sep 2024
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
Kunchang Li
Yali Wang
Peng Gao
Guanglu Song
Yu Liu
Hongsheng Li
Yu Qiao
ViT
118
249
0
12 Jan 2022
AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition
Yulin Wang
Yang Yue
Yuanze Lin
Haojun Jiang
Zihang Lai
V. Kulikov
Nikita Orlov
Humphrey Shi
Gao Huang
58
50
0
28 Dec 2021
BEVT: BERT Pretraining of Video Transformers
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Yu-Gang Jiang
Luowei Zhou
Lu Yuan
ViT
81
208
0
02 Dec 2021
Focal Self-attention for Local-Global Interactions in Vision Transformers
Jianwei Yang
Chunyuan Li
Pengchuan Zhang
Xiyang Dai
Bin Xiao
Lu Yuan
Jianfeng Gao
ViT
78
435
0
01 Jul 2021
Video Swin Transformer
Ze Liu
Jia Ning
Yue Cao
Yixuan Wei
Zheng Zhang
Stephen Lin
Han Hu
ViT
94
1,481
0
24 Jun 2021
IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers
Bowen Pan
Yikang Shen
Yi Ding
Zhangyang Wang
Rogerio Feris
A. Oliva
VLM
ViT
91
160
0
23 Jun 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
80
279
0
09 Jun 2021
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Yongming Rao
Wenliang Zhao
Benlin Liu
Jiwen Lu
Jie Zhou
Cho-Jui Hsieh
ViT
78
697
0
03 Jun 2021
Intriguing Properties of Vision Transformers
Muzammal Naseer
Kanchana Ranasinghe
Salman Khan
Munawar Hayat
Fahad Shahbaz Khan
Ming-Hsuan Yang
ViT
313
647
0
21 May 2021
Adaptive Focus for Efficient Video Recognition
Yulin Wang
Zhaoxi Chen
Haojun Jiang
Shiji Song
Yizeng Han
Gao Huang
66
99
0
07 May 2021
Multiscale Vision Transformers
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
127
1,259
0
22 Apr 2021
Differentiable Patch Selection for Image Recognition
Jean-Baptiste Cordonnier
Aravindh Mahendran
Alexey Dosovitskiy
Dirk Weissenborn
Jakob Uszkoreit
Thomas Unterthiner
63
95
0
07 Apr 2021
UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles
Tianjiao Li
Jun Liu
Wei Emma Zhang
Yun Ni
Wenqian Wang
Zhiheng Li
AI4TS
56
191
0
02 Apr 2021
Rethinking Spatial Dimensions of Vision Transformers
Byeongho Heo
Sangdoo Yun
Dongyoon Han
Sanghyuk Chun
Junsuk Choe
Seong Joon Oh
ViT
494
581
0
30 Mar 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
215
2,149
0
29 Mar 2021
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
Li Xu
He Huang
Jun Liu
ViT
LRM
70
86
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
441
21,418
0
25 Mar 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
365
2,048
0
09 Feb 2021
Video Transformer Network
Daniel Neimark
Omri Bar
Maya Zohar
Dotan Asselmann
ViT
264
432
0
01 Feb 2021
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Sixiao Zheng
Jiachen Lu
Hengshuang Zhao
Xiatian Zhu
Zekun Luo
...
Yanwei Fu
Jianfeng Feng
Tao Xiang
Philip Torr
Li Zhang
ViT
194
2,897
0
31 Dec 2020
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
377
6,762
0
23 Dec 2020
Human Action Recognition from Various Data Modalities: A Review
Zehua Sun
Qiuhong Ke
Hossein Rahmani
Mohammed Bennamoun
Gang Wang
Jun Liu
MU
124
523
0
22 Dec 2020
TDN: Temporal Difference Networks for Efficient Action Recognition
Limin Wang
Zhan Tong
Bin Ji
Gangshan Wu
78
397
0
18 Dec 2020
GTA: Global Temporal Attention for Video Action Understanding
Bo He
Xitong Yang
Zuxuan Wu
Hao Chen
Ser-Nam Lim
Abhinav Shrivastava
ViT
60
27
0
15 Dec 2020
End-to-End Video Instance Segmentation with Transformers
Yuqing Wang
Zhaoliang Xu
Xinlong Wang
Chunhua Shen
Baoshan Cheng
Hao Shen
Huaxia Xia
ViT
72
690
0
30 Nov 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
637
41,003
0
22 Oct 2020
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
216
5,073
0
08 Oct 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
531
608
0
21 Jul 2020
MotionSqueeze: Neural Motion Feature Learning for Video Understanding
Heeseung Kwon
Manjin Kim
Suha Kwak
Minsu Cho
FAtt
75
128
0
20 Jul 2020
Feature Pyramid Transformer
Dong Zhang
Hanwang Zhang
Jinhui Tang
Meng Wang
Xiansheng Hua
Qianru Sun
ViT
53
254
0
18 Jul 2020
Dynamic Sampling Networks for Efficient Action Recognition in Videos
Yin-Dong Zheng
Zhaoyang Liu
Tong Lu
Limin Wang
46
77
0
28 Jun 2020
Linformer: Self-Attention with Linear Complexity
Sinong Wang
Belinda Z. Li
Madian Khabsa
Han Fang
Hao Ma
210
1,702
0
08 Jun 2020
End-to-End Object Detection with Transformers
Nicolas Carion
Francisco Massa
Gabriel Synnaeve
Nicolas Usunier
Alexander Kirillov
Sergey Zagoruyko
ViT
3DV
PINN
385
13,035
0
26 May 2020
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
128
1,019
0
09 Apr 2020
TEA: Temporal Excitation and Aggregation for Action Recognition
Yan-Ran Li
Bin Ji
Xintian Shi
Jianguo Zhang
Bin Kang
Limin Wang
ViT
84
447
0
03 Apr 2020
Learning with Differentiable Perturbed Optimizers
Quentin Berthet
Mathieu Blondel
O. Teboul
Marco Cuturi
Jean-Philippe Vert
Francis R. Bach
61
109
0
20 Feb 2020
Reformer: The Efficient Transformer
Nikita Kitaev
Lukasz Kaiser
Anselm Levskaya
VLM
186
2,313
0
13 Jan 2020
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke
Sam Gross
Francisco Massa
Adam Lerer
James Bradbury
...
Sasank Chilamkurthy
Benoit Steiner
Lu Fang
Junjie Bai
Soumith Chintala
ODL
493
42,407
0
03 Dec 2019
LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition
Zuxuan Wu
Caiming Xiong
Yu-Gang Jiang
L. Davis
75
109
0
03 Dec 2019
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
Quanfu Fan
Chun-Fu Chen
Hilde Kuehne
Marco Pistoia
David D. Cox
74
126
0
02 Dec 2019
TEINet: Towards an Efficient Architecture for Video Recognition
Zhaoyang Liu
Donghao Luo
Yabiao Wang
Limin Wang
Ying Tai
Chengjie Wang
Jilin Li
Feiyue Huang
Tong Lu
ViT
77
240
0
21 Nov 2019
STM: SpatioTemporal and Motion Encoding for Action Recognition
Boyuan Jiang
Mengmeng Wang
Weihao Gan
Wei Wu
Junjie Yan
79
382
0
07 Aug 2019
Central Similarity Quantization for Efficient Image and Video Retrieval
Li-xin Yuan
Tao Wang
Xiaopeng Zhang
Francis E. H. Tay
Zequn Jie
Wei Liu
Jiashi Feng
73
283
0
01 Aug 2019
Video Modeling with Correlation Networks
Heng Wang
Du Tran
Lorenzo Torresani
Matt Feiszli
59
129
0
07 Jun 2019
Differentiable Ranks and Sorting using Optimal Transport
Marco Cuturi
O. Teboul
Jean-Philippe Vert
OT
81
158
0
28 May 2019
SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition
Bruno Korbar
Du Tran
Lorenzo Torresani
66
224
0
08 Apr 2019
Video Classification with Channel-Separated Convolutional Networks
Du Tran
Heng Wang
Lorenzo Torresani
Matt Feiszli
3DV
64
587
0
04 Apr 2019