ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2411.13626
  4. Cited By
Principles of Visual Tokens for Efficient Video Understanding

Principles of Visual Tokens for Efficient Video Understanding

20 November 2024
Xinyue Hao
Gen Li
Shreyank N. Gowda
Robert B Fisher
Jonathan Huang
Anurag Arnab
Laura Sevilla-Lara
ArXivPDFHTML

Papers citing "Principles of Visual Tokens for Efficient Video Understanding"

46 / 46 papers shown
Title
LookupViT: Compressing visual information to a limited number of tokens
LookupViT: Compressing visual information to a limited number of tokens
Rajat Koner
Gagan Jain
Prateek Jain
Volker Tresp
Sujoy Paul
23
11
0
17 Jul 2024
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
Joonmyung Choi
Sanghyeok Lee
Jaewon Chu
Minhyuk Choi
Hyunwoo J. Kim
MoMe
ViT
69
12
0
20 Mar 2024
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference
  Acceleration for Large Vision-Language Models
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen
Haozhe Zhao
Tianyu Liu
Shuai Bai
Junyang Lin
Chang Zhou
Baobao Chang
MLLM
VLM
84
131
0
11 Mar 2024
How does the primate brain combine generative and discriminative
  computations in vision?
How does the primate brain combine generative and discriminative computations in vision?
Benjamin Peters
J. DiCarlo
Todd Gureckis
Ralf Haefner
Leyla Isik
...
Kimberly Stachenfeld
Zenna Tavares
Doris Y. Tsao
Ilker Yildirim
N. Kriegeskorte
44
3
0
11 Jan 2024
Watt For What: Rethinking Deep Learning's Energy-Performance
  Relationship
Watt For What: Rethinking Deep Learning's Energy-Performance Relationship
Shreyank N. Gowda
Xinyue Hao
Gen Li
Laura Sevilla-Lara
Shashank Narayana Gowda
HAI
38
11
0
10 Oct 2023
Can I Trust Your Answer? Visually Grounded Video Question Answering
Can I Trust Your Answer? Visually Grounded Video Question Answering
Junbin Xiao
Angela Yao
Yicong Li
Tat-Seng Chua
79
51
0
04 Sep 2023
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
Shuangrui Ding
Peisen Zhao
Xiaopeng Zhang
Rui Qian
H. Xiong
Qi Tian
ViT
43
16
0
08 Aug 2023
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
  Recognition
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Syed Talal Wasim
Muhammad Uzair Khattak
Muzammal Naseer
Salman Khan
M. Shah
Fahad Shahbaz Khan
ViT
79
19
0
13 Jul 2023
How can objects help action recognition?
How can objects help action recognition?
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
62
14
0
20 Jun 2023
Revisiting Token Pruning for Object Detection and Instance Segmentation
Revisiting Token Pruning for Object Detection and Instance Segmentation
Yifei Liu
Mathias Gehrig
Nico Messikommer
Marco Cannici
Davide Scaramuzza
ViT
VLM
60
27
0
12 Jun 2023
Self-Chained Image-Language Model for Video Localization and Question
  Answering
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
102
135
0
11 May 2023
AIM: Adapting Image Models for Efficient Video Action Recognition
AIM: Adapting Image Models for Efficient Video Action Recognition
Taojiannan Yang
Yi Zhu
Yusheng Xie
Aston Zhang
Chong Chen
Mu Li
ViT
89
146
0
06 Feb 2023
Token Merging: Your ViT But Faster
Token Merging: Your ViT But Faster
Daniel Bolya
Cheng-Yang Fu
Xiaoliang Dai
Peizhao Zhang
Christoph Feichtenhofer
Judy Hoffman
MoMe
66
446
0
17 Oct 2022
Expanding Language-Image Pretrained Models for General Video Recognition
Expanding Language-Image Pretrained Models for General Video Recognition
Bolin Ni
Houwen Peng
Minghao Chen
Songyang Zhang
Gaofeng Meng
Jianlong Fu
Shiming Xiang
Haibin Ling
VLM
CLIP
ViT
82
319
0
04 Aug 2022
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
Junting Pan
Ziyi Lin
Xiatian Zhu
Jing Shao
Hongsheng Li
47
200
0
27 Jun 2022
Revisiting the "Video" in Video-Language Understanding
Revisiting the "Video" in Video-Language Understanding
S. Buch
Cristobal Eyzaguirre
Adrien Gaidon
Jiajun Wu
L. Fei-Fei
Juan Carlos Niebles
64
159
0
03 Jun 2022
The Carbon Footprint of Machine Learning Training Will Plateau, Then
  Shrink
The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink
David A. Patterson
Joseph E. Gonzalez
Urs Holzle
Quoc V. Le
Chen Liang
Lluís-Miquel Munguía
D. Rothchild
David R. So
Maud Texier
J. Dean
AI4CE
62
239
0
11 Apr 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for
  Self-Supervised Video Pre-Training
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
195
1,156
0
23 Mar 2022
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient
  Long-Term Video Recognition
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
Chao-Yuan Wu
Yanghao Li
K. Mangalam
Haoqi Fan
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
74
199
0
20 Jan 2022
AdaViT: Adaptive Tokens for Efficient Vision Transformer
AdaViT: Adaptive Tokens for Efficient Vision Transformer
Hongxu Yin
Arash Vahdat
J. Álvarez
Arun Mallya
Jan Kautz
Pavlo Molchanov
ViT
65
327
0
14 Dec 2021
Efficient Video Transformers with Spatial-Temporal Token Selection
Efficient Video Transformers with Spatial-Temporal Token Selection
Junke Wang
Xitong Yang
Hengduo Li
Li Liu
Zuxuan Wu
Yu-Gang Jiang
ViT
38
64
0
23 Nov 2021
Video Swin Transformer
Video Swin Transformer
Ze Liu
Jia Ning
Yue Cao
Yixuan Wei
Zheng Zhang
Stephen Lin
Han Hu
ViT
80
1,458
0
24 Jun 2021
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
Michael S. Ryoo
A. Piergiovanni
Anurag Arnab
Mostafa Dehghani
A. Angelova
ViT
95
128
0
21 Jun 2021
Space-time Mixing Attention for Video Transformer
Space-time Mixing Attention for Video Transformer
Adrian Bulat
Juan-Manuel Perez-Rua
Swathikiran Sudhakaran
Brais Martínez
Georgios Tzimiropoulos
ViT
71
125
0
10 Jun 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Ishan Misra Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
67
278
0
09 Jun 2021
DynamicViT: Efficient Vision Transformers with Dynamic Token
  Sparsification
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Yongming Rao
Wenliang Zhao
Benlin Liu
Jiwen Lu
Jie Zhou
Cho-Jui Hsieh
ViT
65
685
0
03 Jun 2021
Multiscale Vision Transformers
Multiscale Vision Transformers
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
123
1,248
0
22 Apr 2021
ViViT: A Video Vision Transformer
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
158
2,119
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
351
21,175
0
25 Mar 2021
Is Space-Time Attention All You Need for Video Understanding?
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
332
2,016
0
09 Feb 2021
Video Transformer Network
Video Transformer Network
Daniel Neimark
Omri Bar
Maya Zohar
Dotan Asselmann
ViT
249
430
0
01 Feb 2021
Training data-efficient image transformers & distillation through
  attention
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
317
6,657
0
23 Dec 2020
SMART Frame Selection for Action Recognition
SMART Frame Selection for Action Recognition
Shreyank N. Gowda
Marcus Rohrbach
Laura Sevilla-Lara
53
145
0
19 Dec 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at
  Scale
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
425
40,217
0
22 Oct 2020
X3D: Expanding Architectures for Efficient Video Recognition
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
121
1,013
0
09 Apr 2020
Only Time Can Tell: Discovering Temporal Data for Temporal Modeling
Only Time Can Tell: Discovering Temporal Data for Temporal Modeling
Laura Sevilla-Lara
Shengxin Cindy Zha
Zhicheng Yan
Vedanuj Goswami
Matt Feiszli
Lorenzo Torresani
67
75
0
19 Jul 2019
Pyramid Feature Attention Network for Saliency detection
Pyramid Feature Attention Network for Saliency detection
Ting Zhao
Xiangqian Wu
58
609
0
01 Mar 2019
SlowFast Networks for Video Recognition
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
146
3,244
0
10 Dec 2018
ECO: Efficient Convolutional Network for Online Video Understanding
ECO: Efficient Convolutional Network for Online Video Understanding
Mohammadreza Zolfaghari
Kamaljeet Singh
Thomas Brox
175
498
0
24 Apr 2018
Compressed Video Action Recognition
Compressed Video Action Recognition
Chao-Yuan Wu
Manzil Zaheer
Hexiang Hu
R. Manmatha
Alex Smola
Philipp Krahenbuhl
129
325
0
02 Dec 2017
The "something something" video database for learning and evaluating
  visual common sense
The "something something" video database for learning and evaluating visual common sense
Raghav Goyal
Samira Ebrahimi Kahou
Vincent Michalski
Joanna Materzynska
S. Westphal
...
Moritz Mueller-Freitag
F. Hoppe
Christian Thurau
Ingo Bax
Roland Memisevic
VLM
76
1,516
0
13 Jun 2017
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
501
129,831
0
12 Jun 2017
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual
  Actions
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu
Chen Sun
David A. Ross
Carl Vondrick
C. Pantofaru
...
G. Toderici
Susanna Ricco
Rahul Sukthankar
Cordelia Schmid
Jitendra Malik
VGen
89
1,021
0
23 May 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
206
7,961
0
22 May 2017
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based
  Localization
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju
Michael Cogswell
Abhishek Das
Ramakrishna Vedantam
Devi Parikh
Dhruv Batra
FAtt
225
19,796
0
07 Oct 2016
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
K. Soomro
Amir Zamir
M. Shah
CLIP
VGen
96
6,100
0
03 Dec 2012
1