Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2106.11297
Cited By
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
21 June 2021
Michael S. Ryoo
A. Piergiovanni
Anurag Arnab
Mostafa Dehghani
A. Angelova
ViT
Re-assign community
ArXiv
PDF
HTML
Papers citing
"TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?"
50 / 94 papers shown
Title
OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging
Sifan Song
Siyeop Yoon
Pengfei Jin
Sekeun Kim
Matthew Tivnan
...
Zhiliang Lyu
Dufan Wu
Ning Guo
Xiang Li
Quanzheng Li
OOD
ViT
64
0
0
08 May 2025
Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning
Yuanbing Ouyang
Yizhuo Liang
Qingpeng Li
Xinfei Guo
Yiming Luo
Di Wu
Hao Wang
Yushan Pan
ViT
VLM
73
0
0
25 Apr 2025
ACT360: An Efficient 360-Degree Action Detection and Summarization Framework for Mission-Critical Training and Debriefing
Aditi Tiwari
Klara Nahrstedt
41
1
0
17 Mar 2025
Transformers with Joint Tokens and Local-Global Attention for Efficient Human Pose Estimation
K. A. Kinfu
René Vidal
ViT
26
0
0
28 Feb 2025
MaskGaussian: Adaptive 3D Gaussian Representation from Probabilistic Masks
Yifei Liu
Zhihang Zhong
Yifan Zhan
Sheng Xu
Xiao Sun
3DGS
51
3
0
29 Dec 2024
Extending Video Masked Autoencoders to 128 frames
N. B. Gundavarapu
Luke Friedman
Raghav Goyal
Chaitra Hegde
Eirikur Agustsson
...
Mikhail Sirotenko
Ming Yang
Tobias Weyand
Boqing Gong
Leonid Sigal
77
1
0
20 Nov 2024
Principles of Visual Tokens for Efficient Video Understanding
Xinyue Hao
Gen Li
Shreyank N. Gowda
Robert B Fisher
Jonathan Huang
Anurag Arnab
Laura Sevilla-Lara
98
0
0
20 Nov 2024
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Rohan Choudhury
Guanglei Zhu
Sihan Liu
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
26
9
0
07 Nov 2024
Situational Scene Graph for Structured Human-centric Situation Understanding
Chinthani Sugandhika
Chen Li
Deepu Rajan
Basura Fernando
146
1
0
30 Oct 2024
Robust Imitation Learning for Mobile Manipulator Focusing on Task-Related Viewpoints and Regions
Yutaro Ishida
Yuki Noguchi
Takayuki Kanai
Kazuhiro Shintani
Hiroshi Bito
24
1
0
02 Oct 2024
Token Turing Machines are Efficient Vision Models
Purvish Jajal
Nick Eliopoulos
Benjamin Shiue-Hal Chou
George K. Thiravathukal
James C. Davis
Yung-Hsiang Lu
90
0
0
11 Sep 2024
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Gagan Jain
Nidhi Hegde
Aditya Kusupati
Arsha Nagrani
Shyamal Buch
Prateek Jain
Anurag Arnab
Sujoy Paul
MoE
35
7
0
29 Jul 2024
ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
Narges Norouzi
Svetlana Orlova
Daan de Geus
Gijs Dubbelman
ViT
FedML
46
3
0
14 Jun 2024
Mamba-in-Mamba: Centralized Mamba-Cross-Scan in Tokenized Mamba Model for Hyperspectral Image Classification
Weilian Zhou
Sei-ichiro Kamata
Haipeng Wang
Man-Sing Wong
Huiying
H. Hou
Mamba
27
30
0
20 May 2024
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
31
31
0
01 Apr 2024
Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding
Run Shao
Zhaoyang Zhang
Chao Tao
Yunsheng Zhang
Chengli Peng
Haifeng Li
VLM
35
4
0
27 Mar 2024
Understanding Neural Network Binarization with Forward and Backward Proximal Quantizers
Yiwei Lu
Yaoliang Yu
Xinlin Li
Vahid Partovi Nia
MQ
30
3
0
27 Feb 2024
ResoNet: Robust and Explainable ENSO Forecasts with Hybrid Convolution and Transformer Networks
Pumeng Lyu
Tao Tang
Fenghua Ling
Jing-Jia Luo
Niklas Boers
Wanli Ouyang
Lei Bai
12
5
0
16 Dec 2023
Rejuvenating image-GPT as Strong Visual Representation Learners
Sucheng Ren
Zeyu Wang
Hongru Zhu
Junfei Xiao
Alan L. Yuille
Cihang Xie
VLM
49
7
0
04 Dec 2023
Merlin:Empowering Multimodal LLMs with Foresight Minds
En Yu
Liang Zhao
Yana Wei
Jinrong Yang
Dongming Wu
...
Haoran Wei
Tiancai Wang
Zheng Ge
Xiangyu Zhang
Wenbing Tao
LRM
13
25
0
30 Nov 2023
AiluRus: A Scalable ViT Framework for Dense Prediction
Jin Li
Yaoming Wang
Xiaopeng Zhang
Bowen Shi
Dongsheng Jiang
Chenglin Li
Wenrui Dai
Hongkai Xiong
Qi Tian
57
5
0
02 Nov 2023
CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs
Ao Wang
Hui Chen
Zijia Lin
Sicheng Zhao
J. Han
Guiguang Ding
ViT
26
6
0
27 Sep 2023
Hierarchical Attention and Graph Neural Networks: Toward Drift-Free Pose Estimation
Kathia Melbouci
F. Nashashibi
25
0
0
18 Sep 2023
Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
Matthew Dutson
Yin Li
M. Gupta
ViT
30
8
0
25 Aug 2023
Patch Is Not All You Need
Chang-bo Li
Jie M. Zhang
Yang Wei
Zhilong Ji
Jinfeng Bai
Shiguang Shan
ViT
46
1
0
21 Aug 2023
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers
Tobias Christian Nauen
Sebastián M. Palacio
Federico Raue
Andreas Dengel
42
3
0
18 Aug 2023
From Sparse to Soft Mixtures of Experts
J. Puigcerver
C. Riquelme
Basil Mustafa
N. Houlsby
MoE
121
114
0
02 Aug 2023
Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition
Cheng Da
P. Wang
Cong Yao
13
8
0
25 Jul 2023
What Can Simple Arithmetic Operations Do for Temporal Modeling?
Wenhao Wu
Yuxin Song
Zhun Sun
Jingdong Wang
Chang Xu
Wanli Ouyang
40
8
0
18 Jul 2023
MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
Jakob Drachmann Havtorn
Amelie Royer
Tijmen Blankevoort
B. Bejnordi
25
8
0
05 Jul 2023
Make A Long Image Short: Adaptive Token Length for Vision Transformers
Yuqin Zhu
Yichen Zhu
ViT
59
17
0
05 Jul 2023
How can objects help action recognition?
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
35
14
0
20 Jun 2023
Revisiting Token Pruning for Object Detection and Instance Segmentation
Yifei Liu
Mathias Gehrig
Nico Messikommer
Marco Cannici
Davide Scaramuzza
ViT
VLM
37
24
0
12 Jun 2023
Efficient Vision Transformer for Human Pose Estimation via Patch Selection
K. A. Kinfu
René Vidal
ViT
31
4
0
07 Jun 2023
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
Shubin Huang
Qiong Wu
Yiyi Zhou
Weijie Chen
Rongsheng Zhang
Xiaoshuai Sun
Rongrong Ji
VLM
VPVLM
LRM
16
0
0
01 Jun 2023
DiffRate : Differentiable Compression Rate for Efficient Vision Transformers
Mengzhao Chen
Wenqi Shao
Peng Xu
Mingbao Lin
Kaipeng Zhang
Fei Chao
Rongrong Ji
Yu Qiao
Ping Luo
ViT
34
43
0
29 May 2023
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
Qingqing Cao
Bhargavi Paranjape
Hannaneh Hajishirzi
MLLM
VLM
8
20
0
27 May 2023
FIT: Far-reaching Interleaved Transformers
Ting-Li Chen
Lala Li
21
12
0
22 May 2023
Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding
Mingliang Zhai
Yulin Li
Xiameng Qin
Chen Yi
Qunyi Xie
Chengquan Zhang
Kun Yao
Yuwei Wu
Yunde Jia
13
8
0
19 May 2023
AutoFocusFormer: Image Segmentation off the Grid
Chen Ziwen
K. Patnaik
Shuangfei Zhai
Alvin Wan
Zhile Ren
A. Schwing
Alex Colburn
Li Fuxin
17
9
0
24 Apr 2023
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention
Mingyu Ding
Yikang Shen
Lijie Fan
Zhenfang Chen
Z. Chen
Ping Luo
J. Tenenbaum
Chuang Gan
ViT
79
14
0
06 Apr 2023
SVT: Supertoken Video Transformer for Efficient Video Understanding
Chen-Ming Pan
Rui Hou
Hanchao Yu
Qifan Wang
Senem Velipasalar
Madian Khabsa
ViT
21
0
0
01 Apr 2023
DOAD: Decoupled One Stage Action Detection Network
Shuning Chang
Pichao Wang
Fan Wang
Jiashi Feng
Mike Zheng Show
13
4
0
01 Apr 2023
APPT : Asymmetric Parallel Point Transformer for 3D Point Cloud Understanding
Hengjia Li
Tu Zheng
Zhihao Chi
Zheng Yang
Wenxiao Wang
Boxi Wu
Binbin Lin
Deng Cai
3DPC
38
1
0
31 Mar 2023
PaLM-E: An Embodied Multimodal Language Model
Danny Driess
F. Xia
Mehdi S. M. Sajjadi
Corey Lynch
Aakanksha Chowdhery
...
Marc Toussaint
Klaus Greff
Andy Zeng
Igor Mordatch
Peter R. Florence
LM&Ro
22
1,562
0
06 Mar 2023
Open-World Object Manipulation using Pre-trained Vision-Language Models
Austin Stone
Ted Xiao
Yao Lu
K. Gopalakrishnan
Kuang-Huei Lee
...
Sean Kirmani
Brianna Zitkovich
F. Xia
Chelsea Finn
Karol Hausman
LM&Ro
142
144
0
02 Mar 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM
VLM
MoE
40
160
0
01 Feb 2023
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Wenhao Wu
Xiaohan Wang
Haipeng Luo
Jingdong Wang
Yi Yang
Wanli Ouyang
98
48
0
31 Dec 2022
Transformers in Action Recognition: A Review on Temporal Modeling
Elham Shabaninia
Hossein Nezamabadi-pour
Fatemeh Shafizadegan
ViT
21
8
0
29 Dec 2022
What Makes for Good Tokenizers in Vision Transformer?
Shengju Qian
Yi Zhu
Wenbo Li
Mu Li
Jiaya Jia
ViT
31
13
0
21 Dec 2022
1
2
Next