2210.11929
Cited By
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
21 October 2022
Dongsheng Chen
Chaofan Tao
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
VLM
Papers citing
"LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling"
36 / 36 papers shown
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
86
0
0
04 Apr 2025
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzadeh
VLM
91
26
0
31 Dec 2024
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
M. Volkovs
Animesh Garg
Guangwei Yu
67
156
0
28 Mar 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Steven C.H. Hoi
MLLM
BDL
VLM
CLIP
446
4,283
0
28 Jan 2022
LaMDA: Language Models for Dialog Applications
R. Thoppilan
Daniel De Freitas
Jamie Hall
Noam M. Shazeer
Apoorv Kulshreshtha
...
Blaise Aguera-Arcas
Claire Cui
M. Croak
Ed H. Chi
Quoc Le
ALM
98
1,577
0
20 Jan 2022
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Dongxu Li
Junnan Li
Hongdong Li
Juan Carlos Niebles
Steven C.H. Hoi
69
192
0
17 Dec 2021
Ethical and social risks of harm from Language Models
Laura Weidinger
John F. J. Mellor
Maribeth Rauh
Conor Griffin
J. Uesato
...
Lisa Anne Hendricks
William S. Isaac
Sean Legassick
G. Irving
Iason Gabriel
PILM
64
1,009
0
08 Dec 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Christoph Feichtenhofer
CLIP
VLM
293
567
0
28 Sep 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Steven C.H. Hoi
FaML
157
1,915
0
16 Jul 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
361
796
0
18 Apr 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
117
1,154
0
01 Apr 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
137
2,119
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
324
21,175
0
25 Mar 2021
Transformer in Transformer
Kai Han
An Xiao
Enhua Wu
Jianyuan Guo
Chunjing Xu
Yunhe Wang
ViT
362
1,544
0
27 Feb 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
681
28,659
0
26 Feb 2021
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Wenhai Wang
Enze Xie
Xiang Li
Deng-Ping Fan
Kaitao Song
Ding Liang
Tong Lu
Ping Luo
Ling Shao
ViT
450
3,678
0
24 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Mohit Bansal
Jingjing Liu
CLIP
96
651
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
327
2,016
0
09 Feb 2021
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
300
6,657
0
23 Dec 2020
Look Before you Speak: Visually Contextualized Utterances
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
36
67
0
10 Dec 2020
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
101
419
0
14 Nov 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
397
40,217
0
22 Oct 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Karteek Alahari
Cordelia Schmid
ViT
504
602
0
21 Jul 2020
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
Elad Amrani
Rami Ben-Ari
Daniel Rotman
A. Bronstein
50
122
0
06 Mar 2020
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Huaishao Luo
Lei Ji
Botian Shi
Haoyang Huang
Nan Duan
Tianrui Li
Jason Li
Xilin Chen
Ming Zhou
VLM
78
442
0
15 Feb 2020
RandAugment: Practical automated data augmentation with a reduced search space
E. D. Cubuk
Barret Zoph
Jonathon Shlens
Quoc V. Le
MQ
188
3,458
0
30 Sep 2019
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
Yang Liu
Samuel Albanie
Arsha Nagrani
Andrew Zisserman
61
387
0
31 Jul 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
91
1,192
0
07 Jun 2019
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering
Chenyou Fan
Xiaofan Zhang
Shu Zhang
Wensheng Wang
Chi Zhang
Heng Huang
35
276
0
08 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
44
1,238
0
03 Apr 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
961
93,936
0
11 Oct 2018
Localizing Moments in Video with Natural Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
91
940
0
04 Aug 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
453
129,831
0
12 Jun 2017
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju
Michael Cogswell
Abhishek Das
Ramakrishna Vedantam
Devi Parikh
Dhruv Batra
FAtt
216
19,796
0
07 Oct 2016
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
...
Yannis Kalantidis
Li-Jia Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
170
5,706
0
23 Feb 2016
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin
Michael Maire
Serge J. Belongie
Lubomir Bourdev
Ross B. Girshick
James Hays
Pietro Perona
Deva Ramanan
C. L. Zitnick
Piotr Dollár
ObjD
257
43,290
0
01 May 2014