MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

11 May 2020

Papers citing "MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning"

Showing 50 of 86 citing papers

Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation Lakshita Agarwal Bindu Verma ViT 24 0 0 23 Apr 2025
EgoLife: Towards Egocentric Life Assistant Jingkang Yang Shuai Liu Hongming Guo Yuhao Dong X. Zhang ... Joerg Widmer Francesco Gringoli Lei Yang Bo Li Z. Liu EgoV 51 2 0 05 Mar 2025
Parameter-free Video Segmentation for Vision and Language Understanding Louis Mahon Mirella Lapata VLM 41 1 0 03 Mar 2025
VideoA11y: Method and Dataset for Accessible Video Description Chaoyu Li Sid Padmanabhuni Maryam Cheema H. Seifi Pooyan Fazli VGen 65 0 0 27 Feb 2025
Natural Language Generation from Visual Sequences: Challenges and Future Directions Aditya K Surikuchi Raquel Fernández Sandro Pezzelle EGVM 210 0 0 18 Feb 2025
ScreenWriter: Automatic Screenplay Generation and Movie Summarisation Louis Mahon Mirella Lapata 26 2 0 17 Oct 2024
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning Eileen Wang Caren Han Josiah Poon 34 0 0 12 Oct 2024
MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning Tieyuan Chen Huabin Liu Tianyao He Yihang Chen Chaofan Gan ... Cheng Zhong Yang Zhang Yingxue Wang Hui Lin Weiyao Lin VGen CML 39 5 0 26 Sep 2024
Box2Flow: Instance-based Action Flow Graphs from Videos Jiatong Li Kalliopi Basioti Vladimir Pavlovic 38 0 0 30 Aug 2024
Audio Description Customization Rosiana Natalie Ruei-Che Chang Smitha Sheshadri Anhong Guo Kotaro Hara 26 4 0 21 Aug 2024
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark Koki Maeda Tosho Hirasawa Atsushi Hashimoto Jun Harashima Leszek Rybicki Yusuke Fukasawa Yoshitaka Ushiku 45 0 0 05 Aug 2024
Learning Video Context as Interleaved Multimodal Sequences S. Shao Pengchuan Zhang Y. Li Xide Xia A. Meso Ziteng Gao Jinheng Xie N. Holliman Mike Zheng Shou 43 5 0 31 Jul 2024
Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks Takumi Komatsu Motonari Kambara Shumpei Hatanaka Haruka Matsuo Tsubasa Hirakawa Takayoshi Yamashita H. Fujiyoshi Komei Sugiura 43 0 0 18 Jul 2024
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack Yuri Kuratov Aydar Bulatov Petr Anokhin Ivan Rodkin Dmitry Sorokin Artyom Sorokin Mikhail Burtsev RALM ALM LRM ReLM ELM 49 59 0 14 Jun 2024
From Text to Life: On the Reciprocal Relationship between Artificial Life and Large Language Models Eleni Nisioti Claire Glanois Elias Najarro Andrew Dai Elliot Meyerson J. Pedersen Laetitia Teodorescu Conor F. Hayes Shyam Sudhakaran Sebastian Risi AI4CE LM&Ro 48 3 0 14 Jun 2024
Decoding Radiologists' Intentions: A Novel System for Accurate Region Identification in Chest X-ray Image Analysis Akash Awasthi Safwan Ahmad Bryant Le Hien Nguyen 23 0 0 29 Apr 2024
Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality Sishuo Chen Lei Li Shuhuai Ren Rundong Gao Yuanxin Liu Xiaohan Bi Xu Sun Lu Hou 42 3 0 28 Mar 2024
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos Ali Zare Yulei Niu Hammad A. Ayyubi Shih-Fu Chang 50 1 0 27 Mar 2024
MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage Hao Hao Tan K. Cheuk Taemin Cho Wei-Hsiang Liao Yuki Mitsufuji 31 0 0 15 Mar 2024
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation Joseph Cho Fachrina Dewi Puspitasari Sheng Zheng Jingyao Zheng Lik-Hang Lee Tae-Ho Kim Choong Seon Hong Chaoning Zhang EGVM VGen 36 40 0 08 Mar 2024
A Modular Approach for Multimodal Summarization of TV Shows Louis Mahon Mirella Lapata 26 9 0 06 Mar 2024
Video ReCap: Recursive Captioning of Hour-Long Videos Md. Mohaiminul Islam Ngan Ho Xitong Yang Tushar Nagarajan Lorenzo Torresani Gedas Bertasius VGen VLM 35 44 0 20 Feb 2024
In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss Yuri Kuratov Aydar Bulatov Petr Anokhin Dmitry Sorokin Artyom Sorokin Mikhail Burtsev RALM 119 33 0 16 Feb 2024
Investigating Recurrent Transformers with Dynamic Halt Jishnu Ray Chowdhury Cornelia Caragea 39 1 0 01 Feb 2024
Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing Zi Yang Nan Hua RALM 34 4 0 10 Jan 2024
CLearViD: Curriculum Learning for Video Description Cheng-Yu Chuang Pooyan Fazli 38 1 0 08 Nov 2023
Collaborative Three-Stream Transformers for Video Captioning Hao Wang Libo Zhang Hengrui Fan Tiejian Luo 36 6 0 18 Sep 2023
Story Visualization by Online Text Augmentation with Context Memory Daechul Ahn Daneul Kim Gwangmo Song Seung Wook Kim Honglak Lee Dongyeop Kang Jonghyun Choi DiffM 19 4 0 15 Aug 2023
Recurrent Action Transformer with Memory A. Staroverov A. Bessonov Dmitry A. Yudin A. Kovalev Aleksandr I. Panov OffRL 33 4 0 15 Jun 2023
AWESOME: GPU Memory-constrained Long Document Summarization using Memory Mechanism and Global Salient Content Shuyang Cao Lu Wang 24 5 0 24 May 2023
A Review of Deep Learning for Video Captioning Moloud Abdar Meenakshi Kollati Swaraja Kuraparthi Farhad Pourpanah Daniel J. McDuff ... Shuicheng Yan Abduallah A. Mohamed Abbas Khosravi Erik Cambria Fatih Porikli 3DV 32 21 0 22 Apr 2023
Scaling Transformer to 1M tokens and beyond with RMT Aydar Bulatov Yuri Kuratov Yermek Kapushev Mikhail Burtsev LRM 19 87 0 19 Apr 2023
Text with Knowledge Graph Augmented Transformer for Video Captioning Xin Gu G. Chen Yufei Wang Libo Zhang Tiejian Luo Longyin Wen 27 47 0 22 Mar 2023
Implicit and Explicit Commonsense for Multi-sentence Video Captioning Shih-Han Chou James J. Little Leonid Sigal 21 2 0 14 Mar 2023
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos Teng Wang Jinrui Zhang Feng Zheng Wenhao Jiang Ran Cheng Ping Luo VLM 33 11 0 11 Mar 2023
DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training Wei Li Linchao Zhu Longyin Wen Yi Yang VLM 45 86 0 06 Mar 2023
Models See Hallucinations: Evaluating the Factuality in Video Captioning Hui Liu Xiaojun Wan HILM 34 10 0 06 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning Antoine Yang Arsha Nagrani Paul Hongsuck Seo Antoine Miech Jordi Pont-Tuset Ivan Laptev Josef Sivic Cordelia Schmid AI4TS VLM 39 221 0 27 Feb 2023
Contextual Explainable Video Representation: Human Perception-based Understanding Khoa T. Vo Kashu Yamazaki Phong H. Nguyen Pha Nguyen Khoa Luu Ngan Le 13 9 0 12 Dec 2022
CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection Kevin Hyekang Joo Khoa T. Vo Kashu Yamazaki Ngan Le 24 38 0 09 Dec 2022
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners Shen Yan Tao Zhu Zirui Wang Yuan Cao Mi Zhang Soham Ghosh Yonghui Wu Jiahui Yu VLM VGen 32 46 0 09 Dec 2022
VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning Kashu Yamazaki Khoa T. Vo Sang Truong Bhiksha Raj Ngan Le 29 35 0 28 Nov 2022
Event and Entity Extraction from Generated Video Captions Johannes Scherer A. Scherp Deepayan Bhowmik 26 0 0 05 Nov 2022
Hierarchical3D Adapters for Long Video-to-text Summarization Pinelopi Papalampidi Mirella Lapata VGen 29 12 0 10 Oct 2022
Recipe Generation from Unsegmented Cooking Videos Taichi Nishimura Atsushi Hashimoto Yoshitaka Ushiku Hirotaka Kameko Shinsuke Mori 25 3 0 21 Sep 2022
Real-time Online Video Detection with Temporal Smoothing Transformers Yue Zhao Philipp Krahenbuhl ViT 69 57 0 19 Sep 2022
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation A. Maharana Darryl Hannan Joey Tianyi Zhou DiffM 29 77 0 13 Sep 2022
Explain My Surprise: Learning Efficient Long-Term Memory by Predicting Uncertain Outcomes A. Sorokin N. Buzun Leonid Pugachev Mikhail Burtsev 23 8 0 27 Jul 2022
AutoTransition: Learning to Recommend Video Transition Effects Yaojie Shen Libo Zhang Kai Xu Xiaojie Jin VGen 17 13 0 27 Jul 2022
Relational Future Captioning Model for Explaining Likely Collisions in Daily Tasks Motonari Kambara K. Sugiura 22 6 0 19 Jul 2022