ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2504.00527
  4. Cited By
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

1 April 2025
Fida Mohammad Thoker
Letian Jiang
Chen Zhao
Bernard Ghanem
ArXiv (abs)PDFHTML

Papers citing "SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning"

50 / 53 papers shown
Title
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection
Shuming Liu
Chen Zhao
Fatimah Zohra
Mattia Soldan
Alejandro Pardo
...
Juan Carlos León Alcázar
A. Cioppa
Silvio Giancola
Carlos Hinojosa
Bernard Ghanem
86
3
0
27 Feb 2025
SIGMA:Sinkhorn-Guided Masked Video Modeling
SIGMA:Sinkhorn-Guided Masked Video Modeling
Mohammadreza Salehi
Michael Dorkenwald
Fida Mohammad Thoker
E. Gavves
Cees G. M. Snoek
Yuki M. Asano
78
4
0
22 Jul 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
87
34
0
20 Feb 2024
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
Michael Dorkenwald
Nimrod Barazani
Cees G. M. Snoek
Yuki M. Asano
VLMMLLM
48
12
0
13 Feb 2024
Motion-Guided Masking for Spatiotemporal Representation Learning
Motion-Guided Masking for Spatiotemporal Representation Learning
D. Fan
Jue Wang
Shuai Liao
Yi Zhu
Vimal Bhat
H. Santos-Villalobos
M. Rohith
Xinyu Li
VGen
74
21
0
24 Aug 2023
Implicit Temporal Modeling with Learnable Alignment for Video
  Recognition
Implicit Temporal Modeling with Learnable Alignment for Video Recognition
S. Tu
Qi Dai
Zuxuan Wu
Zhi-Qi Cheng
Hang-Rui Hu
Yu-Gang Jiang
75
37
0
20 Apr 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
100
166
0
28 Mar 2023
Dual-path Adaptation from Image to Video Transformers
Dual-path Adaptation from Image to Video Transformers
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
ViT
70
38
0
17 Mar 2023
Masked Video Distillation: Rethinking Masked Feature Modeling for
  Self-supervised Video Representation Learning
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Lu Yuan
Yu-Gang Jiang
VGen
87
92
0
08 Dec 2022
Expanding Language-Image Pretrained Models for General Video Recognition
Expanding Language-Image Pretrained Models for General Video Recognition
Bolin Ni
Houwen Peng
Minghao Chen
Songyang Zhang
Gaofeng Meng
Jianlong Fu
Shiming Xiang
Haibin Ling
VLMCLIPViT
106
326
0
04 Aug 2022
ActionFormer: Localizing Moments of Actions with Transformers
ActionFormer: Localizing Moments of Actions with Transformers
Chen-Da Liu-Zhang
Jianxin Wu
Yin Li
ViT
70
342
0
16 Feb 2022
UniFormer: Unified Transformer for Efficient Spatiotemporal
  Representation Learning
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
Kunchang Li
Yali Wang
Peng Gao
Guanglu Song
Yu Liu
Hongsheng Li
Yu Qiao
ViT
123
249
0
12 Jan 2022
High-Resolution Image Synthesis with Latent Diffusion Models
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach
A. Blattmann
Dominik Lorenz
Patrick Esser
Bjorn Ommer
3DV
458
15,665
0
20 Dec 2021
MViTv2: Improved Multiscale Vision Transformers for Classification and
  Detection
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Yanghao Li
Chaoxia Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
148
690
0
02 Dec 2021
Self-supervised Video Transformer
Self-supervised Video Transformer
Kanchana Ranasinghe
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
Michael S. Ryoo
ViT
113
88
0
02 Dec 2021
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViTTPM
467
7,757
0
11 Nov 2021
Motion-aware Contrastive Video Representation Learning via
  Foreground-background Merging
Motion-aware Contrastive Video Representation Learning via Foreground-background Merging
Shuangrui Ding
Maomao Li
Tianyu Yang
Rui Qian
Haohang Xu
Qingyi Chen
Jue Wang
Hongkai Xiong
SSL
88
51
0
30 Sep 2021
Skeleton-Contrastive 3D Action Representation Learning
Skeleton-Contrastive 3D Action Representation Learning
Fida Mohammad Thoker
Hazel Doughty
Cees G. M. Snoek
SSL
68
133
0
08 Aug 2021
Video Swin Transformer
Video Swin Transformer
Ze Liu
Jia Ning
Yue Cao
Yixuan Wei
Zheng Zhang
Stephen Lin
Han Hu
ViT
106
1,482
0
24 Jun 2021
MaCLR: Motion-aware Contrastive Learning of Representations for Videos
MaCLR: Motion-aware Contrastive Learning of Representations for Videos
Fanyi Xiao
Joseph Tighe
Davide Modolo
SSL
59
14
0
17 Jun 2021
BEiT: BERT Pre-Training of Image Transformers
BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao
Li Dong
Songhao Piao
Furu Wei
ViT
274
2,826
0
15 Jun 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Ishan Misra Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
86
279
0
09 Jun 2021
A Large-Scale Study on Unsupervised Spatiotemporal Representation
  Learning
A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
Christoph Feichtenhofer
Haoqi Fan
Bo Xiong
Ross B. Girshick
Kaiming He
SSLAI4TS
99
262
0
29 Apr 2021
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron
Hugo Touvron
Ishan Misra
Hervé Jégou
Julien Mairal
Piotr Bojanowski
Armand Joulin
694
6,079
0
29 Apr 2021
Multiscale Vision Transformers
Multiscale Vision Transformers
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
132
1,259
0
22 Apr 2021
Self-supervised Motion Learning from Static Images
Self-supervised Motion Learning from Static Images
Ziyuan Huang
Shiwei Zhang
Jianwen Jiang
Mingqian Tang
Rong Jin
M. Ang
SSL
46
29
0
01 Apr 2021
VideoMoCo: Contrastive Video Representation Learning with Temporally
  Adversarial Examples
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples
Tian Pan
Yibing Song
Tianyu Yang
Wenhao Jiang
Wei Liu
81
225
0
10 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIPVLM
931
29,436
0
26 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
387
2,053
0
09 Feb 2021
Self-supervised Co-training for Video Representation Learning
Self-supervised Co-training for Video Representation Learning
Tengda Han
Weidi Xie
Andrew Zisserman
SSL
242
319
0
19 Oct 2020
Removing the Background by Adding the Background: Towards Background
  Robust Self-supervised Video Representation Learning
Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning
Jinpeng Wang
Yuting Gao
Ke Li
Yiqi Lin
A. J. Ma
Hao Cheng
Pai Peng
Feiyue Huang
Rongrong Ji
Xing Sun
SSL
95
97
0
12 Sep 2020
Rescaling Egocentric Vision
Rescaling Egocentric Vision
Dima Damen
Hazel Doughty
G. Farinella
Antonino Furnari
Evangelos Kazakos
...
Davide Moltisanti
Jonathan Munro
Toby Perrett
Will Price
Michael Wray
EgoV
78
460
0
23 Jun 2020
Context-aware and Scale-insensitive Temporal Repetition Counting
Context-aware and Scale-insensitive Temporal Repetition Counting
Huaidong Zhang
Xuemiao Xu
Guoqiang Han
Shengfeng He
41
49
0
18 May 2020
Audio-Visual Instance Discrimination with Cross-Modal Agreement
Audio-Visual Instance Discrimination with Cross-Modal Agreement
Pedro Morgado
Nuno Vasconcelos
Ishan Misra
SSL
80
273
0
27 Apr 2020
FineGym: A Hierarchical Video Dataset for Fine-grained Action
  Understanding
FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
Dian Shao
Yue Zhao
Bo Dai
Dahua Lin
65
329
0
14 Apr 2020
SpeedNet: Learning the Speediness in Videos
SpeedNet: Learning the Speediness in Videos
Sagie Benaim
Ariel Ephrat
Oran Lang
Inbar Mosseri
William T. Freeman
Michael Rubinstein
Michal Irani
Tali Dekel
69
260
0
13 Apr 2020
Evolving Losses for Unsupervised Video Representation Learning
Evolving Losses for Unsupervised Video Representation Learning
A. Piergiovanni
A. Angelova
Michael S. Ryoo
SSL
67
140
0
26 Feb 2020
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning
Dezhao Luo
Chang-rui Liu
Yu Zhou
Dongbao Yang
Can Ma
QiXiang Ye
Weiping Wang
SSL
61
161
0
02 Jan 2020
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Humam Alwassel
D. Mahajan
Bruno Korbar
Lorenzo Torresani
Guohao Li
Du Tran
SSL
95
431
0
28 Nov 2019
You Only Watch Once: A Unified CNN Architecture for Real-Time
  Spatiotemporal Action Localization
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
Okan Kopuklu
Xiangyu Wei
Gerhard Rigoll
79
144
0
15 Nov 2019
RandAugment: Practical automated data augmentation with a reduced search
  space
RandAugment: Practical automated data augmentation with a reduced search space
E. D. Cubuk
Barret Zoph
Jonathon Shlens
Quoc V. Le
MQ
234
3,490
0
30 Sep 2019
LVIS: A Dataset for Large Vocabulary Instance Segmentation
LVIS: A Dataset for Large Vocabulary Instance Segmentation
Agrim Gupta
Piotr Dollár
Ross B. Girshick
ISegVLM
103
1,371
0
08 Aug 2019
A Short Note on the Kinetics-700 Human Action Dataset
A Short Note on the Kinetics-700 Human Action Dataset
João Carreira
Eric Noland
Chloe Hillier
Andrew Zisserman
76
453
0
15 Jul 2019
CutMix: Regularization Strategy to Train Strong Classifiers with
  Localizable Features
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Sangdoo Yun
Dongyoon Han
Seong Joon Oh
Sanghyuk Chun
Junsuk Choe
Y. Yoo
OOD
619
4,780
0
13 May 2019
Self-Supervised Spatiotemporal Feature Learning via Video Rotation
  Prediction
Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction
Longlong Jing
Xiaodong Yang
Jingen Liu
Yingli Tian
68
156
0
28 Nov 2018
Self-Supervised Video Representation Learning with Space-Time Cubic
  Puzzles
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles
Dahun Kim
Donghyeon Cho
In So Kweon
SSL
77
348
0
24 Nov 2018
YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
N. Xu
L. Yang
Yuchen Fan
Dingcheng Yue
Yuchen Liang
Jianchao Yang
Thomas Huang
VOS
147
530
0
06 Sep 2018
Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video
  Action Recognition
Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition
Unaiza Ahsan
Rishi Madhok
Irfan Essa
SSL
56
109
0
22 Aug 2018
mixup: Beyond Empirical Risk Minimization
mixup: Beyond Empirical Risk Minimization
Hongyi Zhang
Moustapha Cissé
Yann N. Dauphin
David Lopez-Paz
NoLa
280
9,764
0
25 Oct 2017
The 2017 DAVIS Challenge on Video Object Segmentation
The 2017 DAVIS Challenge on Video Object Segmentation
Jordi Pont-Tuset
Federico Perazzi
Sergi Caelles
Pablo Arbeláez
A. Sorkine-Hornung
Luc Van Gool
VGenVOS
84
1,208
0
03 Apr 2017
12
Next