2103.10211
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
18 March 2021
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
Papers citing
"Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning"
35 / 35 papers shown
Title
Self-Supervised Audio-Visual Soundscape Stylization
Tingle Li
Renhao Wang
Po-Yao Huang
Andrew Owens
Gopala Anumanchipalli
DiffM
SSL
38
4
0
22 Sep 2024
Human-AI Collaborative Multi-modal Multi-rater Learning for Endometriosis Diagnosis
Hu Wang
David Butler
Yuan Zhang
Jodie C Avery
Steven Knox
Congbo Ma
Louise Hull
Gustavo Carneiro
20
2
0
03 Sep 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh
Syed Mohammed Shamsul Islam
Douglas Chai
Naveed Akhtar
35
9
0
22 May 2024
VideoSAGE: Video Summarization with Graph Representation Learning
Jose M. Rojas Chaves
Subarna Tripathi
26
3
0
14 Apr 2024
A-JEPA: Joint-Embedding Predictive Architecture Can Listen
Zhengcong Fei
Mingyuan Fan
Junshi Huang
25
17
0
27 Nov 2023
FLAP: Fast Language-Audio Pre-training
Ching-Feng Yeh
Po-Yao Huang
Vasu Sharma
Shang-Wen Li
Gargi Ghosh
CLIP
VLM
36
8
0
02 Nov 2023
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Yuan Tseng
Layne Berry
Yi-Ting Chen
I-Hsiang Chiu
Hsuan-Hao Lin
...
Yu Tsao
Shinji Watanabe
Abdel-rahman Mohamed
Chi-Luen Feng
Hung-yi Lee
VLM
SSL
50
14
0
19 Sep 2023
Multi-modal Learning with Missing Modality via Shared-Specific Feature Modelling
Hu Wang
Yuanhong Chen
Congbo Ma
Jodie Avery
Louise Hull
G. Carneiro
18
79
0
26 Jul 2023
Self-Supervised Video Representation Learning via Latent Time Navigation
Di Yang
Yaohui Wang
Quan Kong
A. Dantcheva
Lorenzo Garattoni
Gianpiero Francesca
F. Brémond
SSL
AI4TS
46
10
0
10 May 2023
MAViL: Masked Audio-Video Learners
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer
19
51
0
15 Dec 2022
Spatio-Temporal Crop Aggregation for Video Representation Learning
Sepehr Sameni
Simon Jenni
Paolo Favaro
18
3
0
30 Nov 2022
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning
Pritam Sarkar
Ali Etemad
19
21
0
25 Nov 2022
Compressed Vision for Efficient Video Understanding
Olivia Wiles
João Carreira
Iain Barr
Andrew Zisserman
Mateusz Malinowski
14
7
0
06 Oct 2022
Motion Sensitive Contrastive Learning for Self-supervised Video Representation
Jingcheng Ni
Nana Zhou
Jie Qin
Qianrun Wu
Junqi Liu
Boxun Li
Di Huang
SSL
34
16
0
12 Aug 2022
Uncertainty-aware Multi-modal Learning via Cross-modal Random Network Prediction
Hu Wang
Jianpeng Zhang
Yuanhong Chen
Congbo Ma
Jodie Avery
Louise Hull
G. Carneiro
UQCV
14
18
0
22 Jul 2022
LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training
Sumanth Gurram
An Fang
David M. Chan
John F. Canny
VLM
AI4TS
33
1
0
16 Jul 2022
Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection
Kyle Min
Sourya Roy
Subarna Tripathi
T. Guha
Somdeb Majumdar
19
36
0
15 Jul 2022
Semi-Supervised Temporal Action Detection with Proposal-Free Masking
Sauradip Nag
Xiatian Zhu
Yi-Zhe Song
Tao Xiang
19
17
0
14 Jul 2022
Masked Autoencoders that Listen
Po-Yao (Bernie) Huang
Hu Xu
Juncheng Billy Li
Alexei Baevski
Michael Auli
Wojciech Galuba
Florian Metze
Christoph Feichtenhofer
15
268
0
13 Jul 2022
iBoot: Image-bootstrapped Self-Supervised Video Representation Learning
F. Saleh
Fuwen Tan
Adrian Bulat
Georgios Tzimiropoulos
Brais Martínez
SSL
34
1
0
16 Jun 2022
On Negative Sampling for Audio-Visual Contrastive Learning from Movies
Mahdi M. Kalayeh
Shervin Ardeshir
Lingyi Liu
Nagendra Kamath
Ashok Chandrashekar
SSL
22
3
0
29 Apr 2022
Less than Few: Self-Shot Video Instance Segmentation
Pengwan Yang
Yuki M. Asano
Pascal Mettes
Cees G. M. Snoek
SSL
21
2
0
19 Apr 2022
Probabilistic Representations for Video Contrastive Learning
Jungin Park
Jiyoung Lee
Ig-Jae Kim
K. Sohn
SSL
26
43
0
08 Apr 2022
Visual Acoustic Matching
Changan Chen
Ruohan Gao
P. Calamia
Kristen Grauman
16
55
0
14 Feb 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
22
103
0
16 Jan 2022
Learning Spatial-Temporal Graphs for Active Speaker Detection
Sourya Roy
Kyle Min
Subarna Tripathi
T. Guha
Somdeb Majumdar
27
3
0
02 Dec 2021
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
Pritam Sarkar
Ali Etemad
SSL
23
11
0
09 Nov 2021
Self-supervised Learning for Semi-supervised Temporal Language Grounding
Fan Luo
Shaoxiang Chen
Jingjing Chen
Zuxuan Wu
Yu-Gang Jiang
VLM
51
11
0
23 Sep 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
8
274
0
09 Jun 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
280
1,981
0
09 Feb 2021
Self-supervised Co-training for Video Representation Learning
Tengda Han
Weidi Xie
Andrew Zisserman
SSL
215
309
0
19 Oct 2020
CrossTransformers: spatially-aware few-shot transfer
Carl Doersch
Ankush Gupta
Andrew Zisserman
ViT
206
330
0
22 Jul 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
415
596
0
21 Jul 2020
Audiovisual SlowFast Networks for Video Recognition
Fanyi Xiao
Yong Jae Lee
Kristen Grauman
Jitendra Malik
Christoph Feichtenhofer
194
205
0
23 Jan 2020
Lip Reading Sentences in the Wild
Joon Son Chung
A. Senior
Oriol Vinyals
Andrew Zisserman
162
784
0
16 Nov 2016