What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

29 March 2023

Andrew Rouditchenko

Papers citing "What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions"

Title | Authors | Tags | Counts | Date
VideoGEM: Training-free Action Grounding in Videos | Felix Vogel, Walid Bousselham, Anna Kukleva, Nina Shvetsova, Hilde Kuehne | LM&Ro, VLM | 120 0 0 | 26 Mar 2025
Large-scale Pre-training for Grounded Video Caption Generation | Evangelos Kazakos, Cordelia Schmid, Josef Sivic | | 59 0 0 | 13 Mar 2025
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living | Ramanathan Rajendiran, Debaditya Roy, Basura Fernando | VGen | 41 0 0 | 03 Mar 2025
Grounded Video Caption Generation | Evangelos Kazakos, Cordelia Schmid, Josef Sivic | | 30 0 0 | 12 Nov 2024
Described Spatial-Temporal Video Detection | Wei Ji, Xiangyan Liu, Yingfei Sun, Jiajun Deng, You Qin, Ammar Nuwanna, Mengyao Qiu, Lina Wei, Roger Zimmermann | | 32 2 0 | 08 Jul 2024
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos | Kumaranage Ravindu Yasas Nagasinghe, Honglu Zhou, Malitha Gunawardhana, Martin Renqiang Min, Daniel Harari, Muhammad Haris Khan | | 32 7 0 | 05 Mar 2024
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan | | 18 12 0 | 31 Dec 2023
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding | Yang Jin, Yongzhi Li, Zehuan Yuan, Yadong Mu | | 29 32 0 | 27 Sep 2022
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, Hang Xu | CLIP, VLM | 115 152 0 | 20 Sep 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video | Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, ..., Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik | EgoV | 224 1,018 0 | 13 Oct 2021
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions | Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan C. Russell | | 54 15 0 | 07 Oct 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong | ViT | 248 577 0 | 22 Apr 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li | CLIP, VLM | 314 780 0 | 18 Apr 2021
Efficient Estimation of Word Representations in Vector Space | Tomáš Mikolov, Kai Chen, G. Corrado, J. Dean | 3DV | 233 31,253 0 | 16 Jan 2013