Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2103.15691
Cited By
ViViT: A Video Vision Transformer
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ViViT: A Video Vision Transformer"
50 / 127 papers shown
Title
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization
Yunxin Li
Xinyu Chen
Zitao Li
Zhenyu Liu
L. Wang
Wenhan Luo
Baotian Hu
Min Zhang
OffRL
LRM
49
0
0
25 May 2025
Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition
Ping Li
Jianan Ni
Bo Pang
AAML
141
0
0
23 May 2025
TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks
Kutay Ertürk
Furkan Altınışık
İrem Sarıaltın
Ömer Nezih Gerek
SLR
126
0
0
11 May 2025
Federated Learning with LoRA Optimized DeiT and Multiscale Patch Embedding for Secure Eye Disease Recognition
Md. Naimur Asif Borno
Md Sakib Hossain Shovon
MD Hanif Sikder
Iffat Firozy Rimi
Tahani Jaser Alahmadi
Mohammad Ali Moni
MedIm
35
0
0
11 May 2025
Let Humanoids Hike! Integrative Skill Development on Complex Trails
Kwan-Yee Lin
Stella X.Yu
65
0
0
09 May 2025
Active Perception for Tactile Sensing: A Task-Agnostic Attention-Based Approach
Tim Schneider
Cristiana de Farias
Roberto Calandra
Lawrence Yunliang Chen
Jan Peters
357
0
0
09 May 2025
seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
Hafez Ghaemi
Eilif Muller
Shahab Bakhtiari
97
0
0
06 May 2025
Token Coordinated Prompt Attention is Needed for Visual Prompting
Zichen Liu
Xu Zou
Gang Hua
Jiahuan Zhou
130
0
0
05 May 2025
Advancing Video Anomaly Detection: A Bi-Directional Hybrid Framework for Enhanced Single- and Multi-Task Approaches
Guodong Shen
Yuqi Ouyang
Junru Lu
Yixuan Yang
Victor Sanchez
143
1
0
20 Apr 2025
MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
Trung Thanh Nguyen
Yasutomo Kawanishi
Vijay John
Takahiro Komamizu
Ichiro Ide
66
0
0
03 Apr 2025
A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
Shuyu Li
Shulei Ji
Zihao Wang
Songruoyao Wu
Jiaxing Yu
Kai Zhang
MGen
VGen
153
1
0
01 Apr 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee
Joohyun Chang
Dongho Lee
Jinwoo Choi
144
0
0
30 Mar 2025
Action tube generation by person query matching for spatio-temporal action detection
Kazuki Omi
Jion Oshima
Toru Tamaki
105
0
0
17 Mar 2025
STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks
Tianqing Zhang
Kairong Yu
Xian Zhong
Hongwei Wang
Qi Xu
Qiang Zhang
105
1
0
04 Mar 2025
Spectral-Enhanced Transformers: Leveraging Large-Scale Pretrained Models for Hyperspectral Object Tracking
Shaheer Mohamed
Tharindu Fernando
Sridha Sridharan
Peyman Moghadam
Clinton Fookes
ViT
117
1
0
26 Feb 2025
Looped ReLU MLPs May Be All You Need as Practical Programmable Computers
Yingyu Liang
Zhizhou Sha
Zhenmei Shi
Zhao Song
Yufa Zhou
114
18
0
21 Feb 2025
MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching
Yen-Siang Wu
Chi-Pin Huang
Fu-En Yang
Yu-Jie Wang
DiffM
VGen
87
1
0
18 Feb 2025
Conformal Predictions for Human Action Recognition with Vision-Language Models
Bary Tim
Fuchs Clément
Macq Benoît
VLM
101
0
0
10 Feb 2025
SecPE: Secure Prompt Ensembling for Private and Robust Large Language Models
Jiawen Zhang
Kejia Chen
Zunlei Feng
Jian Lou
Mingli Song
Qingbin Liu
Xiaoyu Yang
AAML
SILM
FedML
84
1
0
02 Feb 2025
Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
J. P. Muñoz
Jinjie Yuan
Nilesh Jain
Mamba
99
1
0
28 Jan 2025
Can masking background and object reduce static bias for zero-shot action recognition?
Takumi Fukuzawa
Kensho Hara
Hirokatsu Kataoka
Toru Tamaki
87
1
0
22 Jan 2025
Slot-BERT: Self-supervised Object Discovery in Surgical Video
Guiqiu Liao
M. Jogan
Marcel Hussing
Kenta Nakahashi
Kazuhiro Yasufuku
Amin Madani
Eric Eaton
Daniel A. Hashimoto
364
0
0
21 Jan 2025
A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan
Seowung Leem
Kyle B. See
Joshua K. Wong
Shaoting Zhang
R. Fang
AI4CE
LM&MA
VLM
192
26
0
17 Jan 2025
DynST: Dynamic Sparse Training for Resource-Constrained Spatio-Temporal Forecasting
Hao Wu
Haomin Wen
Guibin Zhang
Yutong Xia
Kai Wang
Yuxuan Liang
Yu Zheng
Kun Wang
116
2
0
17 Jan 2025
Scaling Up ESM2 Architectures for Long Protein Sequences Analysis: Long and Quantized Approaches
Gabriel Bianchin de Oliveira
Hélio Pedrini
Z. Dias
MQ
47
0
0
13 Jan 2025
Measuring Error Alignment for Decision-Making Systems
Binxia Xu
Antonis Bikakis
Daniel Onah
A. Vlachidis
Luke Dickens
67
0
0
03 Jan 2025
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzdeh
VLM
91
26
0
31 Dec 2024
Predicting Chess Puzzle Difficulty with Transformers
Szymon Miłosz
Paweł Kapusta
55
0
0
31 Dec 2024
Semantics Prompting Data-Free Quantization for Low-Bit Vision Transformers
Mingliang Xu
Yuyao Zhou
Yuxin Zhang
Shen Li
Yong Li
Chia-Wen Lin
Zhanpeng Zeng
Rongrong Ji
MQ
204
0
0
31 Dec 2024
DRDM: A Disentangled Representations Diffusion Model for Synthesizing Realistic Person Images
Enbo Huang
Yuan Zhang
Faliang Huang
Guangyu Zhang
Yang Liu
DiffM
55
0
0
25 Dec 2024
Hierarchical Vector Quantization for Unsupervised Action Segmentation
Federico Spurio
Emad Bahrami
Gianpiero Francesca
Juergen Gall
72
0
0
23 Dec 2024
TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition
Yilong Wang
Zilin Gao
Qilong Wang
Zhaofeng Chen
P. Li
Q. Hu
127
1
0
28 Nov 2024
Principles of Visual Tokens for Efficient Video Understanding
Xinyue Hao
Gen Li
Shreyank N. Gowda
Robert B Fisher
Jonathan Huang
Anurag Arnab
Laura Sevilla-Lara
123
0
0
20 Nov 2024
On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
Xiufeng Song
Xiao Guo
Junxuan Zhang
Qirui Li
Lei Bai
Xiaoming Liu
Guangtao Zhai
Xiaohong Liu
DiffM
VGen
103
9
0
31 Oct 2024
MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion
Onkar Susladkar
Jishu Sen Gupta
Chirag Sehgal
Sparsh Mittal
Rekha Singhal
DiffM
VGen
54
0
0
10 Oct 2024
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu
Sangkyung Kwak
Huiwon Jang
Jongheon Jeong
Jonathan Huang
Jinwoo Shin
Saining Xie
OCL
114
77
0
09 Oct 2024
Spacewalker: Traversing Representation Spaces for Fast Interactive Exploration and Annotation of Unstructured Data
Lukas Heine
Fabian Horst
Jana Fragemann
Gijs Luijten
M. Balzer
Jan Egger
F. Bahnsen
M. Sarfraz
Jens Kleesiek
56
0
0
25 Sep 2024
A Comprehensive Review of Few-shot Action Recognition
Yuyang Wanyan
Xiaoshan Yang
Weiming Dong
Changsheng Xu
VLM
118
3
0
20 Jul 2024
Video-to-Audio Generation with Hidden Alignment
Manjie Xu
Chenxing Li
Yong Ren
Rilin Chen
Yu Gu
Yu Gu
Dong Yu
Dong Yu
DiffM
VGen
62
12
0
10 Jul 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Xu Tan
Qifeng Chen
Yu Guo
VGen
167
16
0
06 Jun 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
66
24
0
09 Apr 2024
Edit3K: Universal Representation Learning for Video Editing Components
Xin Gu
Libo Zhang
Fan Chen
Longyin Wen
Yufei Wang
Tiejian Luo
Sijie Zhu
78
4
0
24 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
75
6
0
21 Mar 2024
Collaboratively Self-supervised Video Representation Learning for Action Recognition
Jie Zhang
Zhifan Wan
Lanqing Hu
Stephen Lin
Shuzhe Wu
Shiguang Shan
TTA
89
1
0
15 Jan 2024
Latte: Latent Diffusion Transformer for Video Generation
Xin Ma
Yaohui Wang
Gengyun Jia
Xinyuan Chen
Ziqiang Liu
Yuan-Fang Li
Cunjian Chen
Yu Qiao
DiffM
VGen
163
252
0
05 Jan 2024
Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
Arun V. Reddy
William Paul
Corban Rivera
Ketul Shah
Celso M. de Melo
Rama Chellappa
70
4
0
05 Dec 2023
Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach
André O. Françani
Marcos R. O. A. Máximo
52
8
0
10 May 2023
A vector quantized masked autoencoder for audiovisual speech emotion recognition
Samir Sadok
Simon Leglaive
Renaud Séguier
SSL
96
6
0
05 May 2023
Transformadores: Fundamentos teoricos y Aplicaciones
J. D. L. Torre
110
0
0
18 Feb 2023
Transfer-learning for video classification: Video Swin Transformer on multiple domains
Daniel de Oliveira
David Martins de Matos
ViT
47
0
0
18 Oct 2022
1
2
3
Next