Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2310.04900
Cited By
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
7 October 2023
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
Re-assign community
ArXiv
PDF
HTML
Papers citing
"HowToCaption: Prompting LLMs to Transform Video Annotations at Scale"
13 / 13 papers shown
Title
Large-scale Pre-training for Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
59
0
0
13 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Y. S. Rawat
VLM
131
1
0
11 Mar 2025
VideoAuteur: Towards Long Narrative Video Generation
Junfei Xiao
Feng Cheng
Lu Qi
Liangke Gui
Jiepeng Cen
Zhibei Ma
Alan L. Yuille
Lu Jiang
VGen
58
2
0
10 Jan 2025
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
115
1
0
04 Sep 2024
A Strong Baseline for Temporal Video-Text Alignment
Zeqian Li
Qirui Chen
Tengda Han
Ya-Qin Zhang
Yanfeng Wang
Weidi Xie
AI4TS
VGen
35
5
0
21 Dec 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
270
4,229
0
30 Jan 2023
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
392
4,125
0
28 Jan 2022
Text and Code Embeddings by Contrastive Pre-Training
Arvind Neelakantan
Tao Xu
Raul Puri
Alec Radford
Jesse Michael Han
...
Tabarak Khan
Toki Sherbakov
Joanne Jang
Peter Welinder
Lilian Weng
SSL
AI4TS
218
421
0
24 Jan 2022
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
317
780
0
18 Apr 2021
A Straightforward Framework For Video Retrieval Using CLIP
Jesús Andrés Portillo-Quintero
J. C. Ortíz-Bayliss
Hugo Terashima-Marín
CLIP
318
117
0
24 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
275
1,082
0
17 Feb 2021
Unifying Vision-and-Language Tasks via Text Generation
Jaemin Cho
Jie Lei
Hao Tan
Mohit Bansal
MLLM
256
525
0
04 Feb 2021
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
418
596
0
21 Jul 2020
1