Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2411.19941
Cited By
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark
29 November 2024
Joseph Heyward
João Carreira
Dima Damen
Andrew Zisserman
Viorica Patraucean
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark"
16 / 16 papers shown
Title
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzdeh
VLM
111
26
0
31 Dec 2024
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Aitor Ormazabal
Che Zheng
Cyprien de Masson dÁutume
Dani Yogatama
Deyu Fu
...
Yazheng Yang
Yi Tay
Yuqi Wang
Zhongkai Zhu
Zhihui Xie
LRM
VLM
ReLM
78
51
0
18 Apr 2024
Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
Liting Lin
Heng Fan
Zhipeng Zhang
Yaowei Wang
Yong-mei Xu
Haibin Ling
127
32
0
08 Mar 2024
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
109
2
0
30 Nov 2023
Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video
Shashanka Venkataramanan
Mamshad Nayeem Rizve
João Carreira
Yuki M. Asano
Yannis Avrithis
SSL
60
20
0
12 Oct 2023
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
Viorica Puatruaucean
Lucas Smaira
Ankush Gupta
Adrià Recasens Continente
L. Markeeva
...
Y. Aytar
Simon Osindero
Dima Damen
Andrew Zisserman
João Carreira
VLM
173
179
0
23 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
130
140
0
11 May 2023
TAP-Vid: A Benchmark for Tracking Any Point in a Video
Carl Doersch
Ankush Gupta
L. Markeeva
Adrià Recasens
Lucas Smaira
Y. Aytar
João Carreira
Andrew Zisserman
Yezhou Yang
83
164
0
07 Nov 2022
Contrastive Audio-Visual Masked Autoencoder
Yuan Gong
Andrew Rouditchenko
Alexander H. Liu
David Harwath
Leonid Karlinsky
Hilde Kuehne
James R. Glass
78
128
0
02 Oct 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
418
3,602
0
29 Apr 2022
ActionFormer: Localizing Moments of Actions with Transformers
Chen-Da Liu-Zhang
Jianxin Wu
Yin Li
ViT
70
342
0
16 Feb 2022
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
182
889
0
26 Apr 2021
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks
Humam Alwassel
Silvio Giancola
Guohao Li
72
124
0
23 Nov 2020
HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking
Jonathon Luiten
Aljosa Osep
Patrick Dendorfer
Philip Torr
Andreas Geiger
Laura Leal-Taixe
Bastian Leibe
VOT
82
918
0
16 Sep 2020
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
139
375
0
29 Jun 2020
Rescaling Egocentric Vision
Dima Damen
Hazel Doughty
G. Farinella
Antonino Furnari
Evangelos Kazakos
...
Davide Moltisanti
Jonathan Munro
Toby Perrett
Will Price
Michael Wray
EgoV
91
464
0
23 Jun 2020
1