ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2005.05402
  4. Cited By
MART: Memory-Augmented Recurrent Transformer for Coherent Video
  Paragraph Captioning

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

11 May 2020
Jie Lei
Liwei Wang
Yelong Shen
Dong Yu
Tamara L. Berg
Joey Tianyi Zhou
ArXivPDFHTML

Papers citing "MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning"

50 / 86 papers shown
Title
Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation
Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation
Lakshita Agarwal
Bindu Verma
ViT
24
0
0
23 Apr 2025
EgoLife: Towards Egocentric Life Assistant
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
X. Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Z. Liu
EgoV
51
2
0
05 Mar 2025
Parameter-free Video Segmentation for Vision and Language Understanding
Louis Mahon
Mirella Lapata
VLM
41
1
0
03 Mar 2025
VideoA11y: Method and Dataset for Accessible Video Description
VideoA11y: Method and Dataset for Accessible Video Description
Chaoyu Li
Sid Padmanabhuni
Maryam Cheema
H. Seifi
Pooyan Fazli
VGen
65
0
0
27 Feb 2025
Natural Language Generation from Visual Sequences: Challenges and Future Directions
Natural Language Generation from Visual Sequences: Challenges and Future Directions
Aditya K Surikuchi
Raquel Fernández
Sandro Pezzelle
EGVM
210
0
0
18 Feb 2025
ScreenWriter: Automatic Screenplay Generation and Movie Summarisation
ScreenWriter: Automatic Screenplay Generation and Movie Summarisation
Louis Mahon
Mirella Lapata
26
2
0
17 Oct 2024
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video
  Paragraph Captioning
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
Eileen Wang
Caren Han
Josiah Poon
34
0
0
12 Oct 2024
MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning
MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning
Tieyuan Chen
Huabin Liu
Tianyao He
Yihang Chen
Chaofan Gan
...
Cheng Zhong
Yang Zhang
Yingxue Wang
Hui Lin
Weiyao Lin
VGen
CML
39
5
0
26 Sep 2024
Box2Flow: Instance-based Action Flow Graphs from Videos
Box2Flow: Instance-based Action Flow Graphs from Videos
Jiatong Li
Kalliopi Basioti
Vladimir Pavlovic
38
0
0
30 Aug 2024
Audio Description Customization
Audio Description Customization
Rosiana Natalie
Ruei-Che Chang
Smitha Sheshadri
Anhong Guo
Kotaro Hara
26
4
0
21 Aug 2024
COM Kitchens: An Unedited Overhead-view Video Dataset as a
  Vision-Language Benchmark
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
Koki Maeda
Tosho Hirasawa
Atsushi Hashimoto
Jun Harashima
Leszek Rybicki
Yusuke Fukasawa
Yoshitaka Ushiku
45
0
0
05 Aug 2024
Learning Video Context as Interleaved Multimodal Sequences
Learning Video Context as Interleaved Multimodal Sequences
S. Shao
Pengchuan Zhang
Y. Li
Xide Xia
A. Meso
Ziteng Gao
Jinheng Xie
N. Holliman
Mike Zheng Shou
43
5
0
31 Jul 2024
Nearest Neighbor Future Captioning: Generating Descriptions for Possible
  Collisions in Object Placement Tasks
Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks
Takumi Komatsu
Motonari Kambara
Shumpei Hatanaka
Haruka Matsuo
Tsubasa Hirakawa
Takayoshi Yamashita
H. Fujiyoshi
Komei Sugiura
43
0
0
18 Jul 2024
BABILong: Testing the Limits of LLMs with Long Context
  Reasoning-in-a-Haystack
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
Yuri Kuratov
Aydar Bulatov
Petr Anokhin
Ivan Rodkin
Dmitry Sorokin
Artyom Sorokin
Mikhail Burtsev
RALM
ALM
LRM
ReLM
ELM
49
59
0
14 Jun 2024
From Text to Life: On the Reciprocal Relationship between Artificial
  Life and Large Language Models
From Text to Life: On the Reciprocal Relationship between Artificial Life and Large Language Models
Eleni Nisioti
Claire Glanois
Elias Najarro
Andrew Dai
Elliot Meyerson
J. Pedersen
Laetitia Teodorescu
Conor F. Hayes
Shyam Sudhakaran
Sebastian Risi
AI4CE
LM&Ro
48
3
0
14 Jun 2024
Decoding Radiologists' Intentions: A Novel System for Accurate Region
  Identification in Chest X-ray Image Analysis
Decoding Radiologists' Intentions: A Novel System for Accurate Region Identification in Chest X-ray Image Analysis
Akash Awasthi
Safwan Ahmad
Bryant Le
Hien Nguyen
23
0
0
29 Apr 2024
Towards Multimodal Video Paragraph Captioning Models Robust to Missing
  Modality
Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality
Sishuo Chen
Lei Li
Shuhuai Ren
Rundong Gao
Yuanxin Liu
Xiaohan Bi
Xu Sun
Lu Hou
42
3
0
28 Mar 2024
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in
  Instructional Videos
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Ali Zare
Yulei Niu
Hammad A. Ayyubi
Shih-Fu Chang
50
1
0
27 Mar 2024
MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate
  Instrument Leakage
MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage
Hao Hao Tan
K. Cheuk
Taemin Cho
Wei-Hsiang Liao
Yuki Mitsufuji
31
0
0
15 Mar 2024
Sora as an AGI World Model? A Complete Survey on Text-to-Video
  Generation
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation
Joseph Cho
Fachrina Dewi Puspitasari
Sheng Zheng
Jingyao Zheng
Lik-Hang Lee
Tae-Ho Kim
Choong Seon Hong
Chaoning Zhang
EGVM
VGen
36
40
0
08 Mar 2024
A Modular Approach for Multimodal Summarization of TV Shows
A Modular Approach for Multimodal Summarization of TV Shows
Louis Mahon
Mirella Lapata
26
9
0
06 Mar 2024
Video ReCap: Recursive Captioning of Hour-Long Videos
Video ReCap: Recursive Captioning of Hour-Long Videos
Md. Mohaiminul Islam
Ngan Ho
Xitong Yang
Tushar Nagarajan
Lorenzo Torresani
Gedas Bertasius
VGen
VLM
35
44
0
20 Feb 2024
In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs
  Miss
In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss
Yuri Kuratov
Aydar Bulatov
Petr Anokhin
Dmitry Sorokin
Artyom Sorokin
Mikhail Burtsev
RALM
119
33
0
16 Feb 2024
Investigating Recurrent Transformers with Dynamic Halt
Investigating Recurrent Transformers with Dynamic Halt
Jishnu Ray Chowdhury
Cornelia Caragea
39
1
0
01 Feb 2024
Attendre: Wait To Attend By Retrieval With Evicted Queries in
  Memory-Based Transformers for Long Context Processing
Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing
Zi Yang
Nan Hua
RALM
34
4
0
10 Jan 2024
CLearViD: Curriculum Learning for Video Description
CLearViD: Curriculum Learning for Video Description
Cheng-Yu Chuang
Pooyan Fazli
38
1
0
08 Nov 2023
Collaborative Three-Stream Transformers for Video Captioning
Collaborative Three-Stream Transformers for Video Captioning
Hao Wang
Libo Zhang
Hengrui Fan
Tiejian Luo
36
6
0
18 Sep 2023
Story Visualization by Online Text Augmentation with Context Memory
Story Visualization by Online Text Augmentation with Context Memory
Daechul Ahn
Daneul Kim
Gwangmo Song
Seung Wook Kim
Honglak Lee
Dongyeop Kang
Jonghyun Choi
DiffM
19
4
0
15 Aug 2023
Recurrent Action Transformer with Memory
Recurrent Action Transformer with Memory
A. Staroverov
A. Bessonov
Dmitry A. Yudin
A. Kovalev
Aleksandr I. Panov
OffRL
33
4
0
15 Jun 2023
AWESOME: GPU Memory-constrained Long Document Summarization using Memory
  Mechanism and Global Salient Content
AWESOME: GPU Memory-constrained Long Document Summarization using Memory Mechanism and Global Salient Content
Shuyang Cao
Lu Wang
24
5
0
24 May 2023
A Review of Deep Learning for Video Captioning
A Review of Deep Learning for Video Captioning
Moloud Abdar
Meenakshi Kollati
Swaraja Kuraparthi
Farhad Pourpanah
Daniel J. McDuff
...
Shuicheng Yan
Abduallah A. Mohamed
Abbas Khosravi
Erik Cambria
Fatih Porikli
3DV
32
21
0
22 Apr 2023
Scaling Transformer to 1M tokens and beyond with RMT
Scaling Transformer to 1M tokens and beyond with RMT
Aydar Bulatov
Yuri Kuratov
Yermek Kapushev
Mikhail Burtsev
LRM
19
87
0
19 Apr 2023
Text with Knowledge Graph Augmented Transformer for Video Captioning
Text with Knowledge Graph Augmented Transformer for Video Captioning
Xin Gu
G. Chen
Yufei Wang
Libo Zhang
Tiejian Luo
Longyin Wen
27
47
0
22 Mar 2023
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Shih-Han Chou
James J. Little
Leonid Sigal
21
2
0
14 Mar 2023
Learning Grounded Vision-Language Representation for Versatile
  Understanding in Untrimmed Videos
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang
Jinrui Zhang
Feng Zheng
Wenhao Jiang
Ran Cheng
Ping Luo
VLM
33
11
0
11 Mar 2023
DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only
  Training
DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training
Wei Li
Linchao Zhu
Longyin Wen
Yi Yang
VLM
45
86
0
06 Mar 2023
Models See Hallucinations: Evaluating the Factuality in Video Captioning
Models See Hallucinations: Evaluating the Factuality in Video Captioning
Hui Liu
Xiaojun Wan
HILM
34
10
0
06 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense
  Video Captioning
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
39
221
0
27 Feb 2023
Contextual Explainable Video Representation: Human Perception-based
  Understanding
Contextual Explainable Video Representation: Human Perception-based Understanding
Khoa T. Vo
Kashu Yamazaki
Phong H. Nguyen
Pha Nguyen
Khoa Luu
Ngan Le
13
9
0
12 Dec 2022
CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised
  Video Anomaly Detection
CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection
Kevin Hyekang Joo
Khoa T. Vo
Kashu Yamazaki
Ngan Le
24
38
0
09 Dec 2022
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive
  Captioners
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Shen Yan
Tao Zhu
Zirui Wang
Yuan Cao
Mi Zhang
Soham Ghosh
Yonghui Wu
Jiahui Yu
VLM
VGen
32
46
0
09 Dec 2022
VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video
  Paragraph Captioning
VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
Kashu Yamazaki
Khoa T. Vo
Sang Truong
Bhiksha Raj
Ngan Le
29
35
0
28 Nov 2022
Event and Entity Extraction from Generated Video Captions
Event and Entity Extraction from Generated Video Captions
Johannes Scherer
A. Scherp
Deepayan Bhowmik
26
0
0
05 Nov 2022
Hierarchical3D Adapters for Long Video-to-text Summarization
Hierarchical3D Adapters for Long Video-to-text Summarization
Pinelopi Papalampidi
Mirella Lapata
VGen
29
12
0
10 Oct 2022
Recipe Generation from Unsegmented Cooking Videos
Recipe Generation from Unsegmented Cooking Videos
Taichi Nishimura
Atsushi Hashimoto
Yoshitaka Ushiku
Hirotaka Kameko
Shinsuke Mori
25
3
0
21 Sep 2022
Real-time Online Video Detection with Temporal Smoothing Transformers
Real-time Online Video Detection with Temporal Smoothing Transformers
Yue Zhao
Philipp Krahenbuhl
ViT
69
57
0
19 Sep 2022
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story
  Continuation
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
A. Maharana
Darryl Hannan
Joey Tianyi Zhou
DiffM
29
77
0
13 Sep 2022
Explain My Surprise: Learning Efficient Long-Term Memory by Predicting
  Uncertain Outcomes
Explain My Surprise: Learning Efficient Long-Term Memory by Predicting Uncertain Outcomes
A. Sorokin
N. Buzun
Leonid Pugachev
Mikhail Burtsev
23
8
0
27 Jul 2022
AutoTransition: Learning to Recommend Video Transition Effects
AutoTransition: Learning to Recommend Video Transition Effects
Yaojie Shen
Libo Zhang
Kai Xu
Xiaojie Jin
VGen
17
13
0
27 Jul 2022
Relational Future Captioning Model for Explaining Likely Collisions in
  Daily Tasks
Relational Future Captioning Model for Explaining Likely Collisions in Daily Tasks
Motonari Kambara
K. Sugiura
22
6
0
19 Jul 2022
12
Next