2303.16990
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
29 March 2023
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
D. Kondermann
Samuel Thomas
Shih-Fu Chang
Rogerio Feris
James R. Glass
Hilde Kuehne
Papers citing
"What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions"
14 / 14 papers shown
Title
VideoGEM: Training-free Action Grounding in Videos
Felix Vogel
Walid Bousselham
Anna Kukleva
Nina Shvetsova
Hilde Kuehne
LM&Ro
VLM
120
0
0
26 Mar 2025
Large-scale Pre-training for Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
59
0
0
13 Mar 2025
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living
Ramanathan Rajendiran
Debaditya Roy
Basura Fernando
VGen
41
0
0
03 Mar 2025
Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
30
0
0
12 Nov 2024
Described Spatial-Temporal Video Detection
Wei Ji
Xiangyan Liu
Yingfei Sun
Jiajun Deng
You Qin
Ammar Nuwanna
Mengyao Qiu
Lina Wei
Roger Zimmermann
32
2
0
08 Jul 2024
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos
Kumaranage Ravindu Yasas Nagasinghe
Honglu Zhou
Malitha Gunawardhana
Martin Renqiang Min
Daniel Harari
Muhammad Haris Khan
32
7
0
05 Mar 2024
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
Syed Talal Wasim
Muzammal Naseer
Salman Khan
Ming-Hsuan Yang
Fahad Shahbaz Khan
18
12
0
31 Dec 2023
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
29
32
0
27 Sep 2022
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
Lewei Yao
Jianhua Han
Youpeng Wen
Xiaodan Liang
Dan Xu
Wei Zhang
Zhenguo Li
Chunjing Xu
Hang Xu
CLIP
VLM
115
152
0
20 Sep 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
224
1,018
0
13 Oct 2021
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions
Shuang Li
Yilun Du
Antonio Torralba
Josef Sivic
Bryan C. Russell
54
15
0
07 Oct 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
248
577
0
22 Apr 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
314
780
0
18 Apr 2021
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
233
31,253
0
16 Jan 2013