ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1804.00819
  4. Cited By
End-to-End Dense Video Captioning with Masked Transformer

End-to-End Dense Video Captioning with Masked Transformer

3 April 2018
Luowei Zhou
Yingbo Zhou
Jason J. Corso
R. Socher
Caiming Xiong
ArXivPDFHTML

Papers citing "End-to-End Dense Video Captioning with Masked Transformer"

50 / 114 papers shown
Title
Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point
  Modeling
Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling
Xumin Yu
Lulu Tang
Yongming Rao
Tiejun Huang
Jie Zhou
Jiwen Lu
3DPC
51
655
0
29 Nov 2021
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
Xu Yan
Zhengcong Fei
Shuhui Wang
Qingming Huang
Qi Tian
VGen
40
4
0
19 Nov 2021
Masking Modalities for Cross-modal Video Retrieval
Masking Modalities for Cross-modal Video Retrieval
Valentin Gabeur
Arsha Nagrani
Chen Sun
Alahari Karteek
Cordelia Schmid
19
29
0
01 Nov 2021
Revitalizing CNN Attentions via Transformers in Self-Supervised Visual
  Representation Learning
Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning
Chongjian Ge
Youwei Liang
Yibing Song
Jianbo Jiao
Jue Wang
Ping Luo
ViT
24
36
0
11 Oct 2021
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video
  Representations
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations
Mohammadreza Zolfaghari
Yi Zhu
Peter V. Gehler
Thomas Brox
137
127
0
30 Sep 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text
  Understanding
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Florian Metze Luke Zettlemoyer Christoph Feichtenhofer
CLIP
VLM
259
561
0
28 Sep 2021
Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and
  Benchmark
Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
Xun Gao
Yin Zhao
Jie Zhang
Longjun Cai
27
6
0
23 Sep 2021
Survey: Transformer based Video-Language Pre-training
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM
ViT
72
44
0
21 Sep 2021
EVOQUER: Enhancing Temporal Grounding with Video-Pivoted BackQuery
  Generation
EVOQUER: Enhancing Temporal Grounding with Video-Pivoted BackQuery Generation
Yanjun Gao
Lulu Liu
Jason Wang
Xin Chen
Huayan Wang
Rui Zhang
31
1
0
10 Sep 2021
Video2Skill: Adapting Events in Demonstration Videos to Skills in an
  Environment using Cyclic MDP Homomorphisms
Video2Skill: Adapting Events in Demonstration Videos to Skills in an Environment using Cyclic MDP Homomorphisms
Sumedh Anand Sontakke
Sumegh Roychowdhury
Mausoom Sarkar
Nikaash Puri
Balaji Krishnamurthy
Laurent Itti
34
1
0
08 Sep 2021
FuseFormer: Fusing Fine-Grained Information in Transformers for Video
  Inpainting
FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting
R. Liu
Hanming Deng
Yangyi Huang
Xiaoyu Shi
Lewei Lu
Wenxiu Sun
Xiaogang Wang
Jifeng Dai
Hongsheng Li
ViT
30
124
0
07 Sep 2021
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal
  Attention
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention
Katsuyuki Nakamura
Hiroki Ohashi
Mitsuhiro Okada
EgoV
31
12
0
07 Sep 2021
Zero-shot Natural Language Video Localization
Zero-shot Natural Language Video Localization
Jinwoo Nam
Daechul Ahn
Dongyeop Kang
S. Ha
Jonghyun Choi
94
43
0
29 Aug 2021
Support-Set Based Cross-Supervision for Video Grounding
Support-Set Based Cross-Supervision for Video Grounding
Xinpeng Ding
N. Wang
Shiwei Zhang
De Cheng
Xiaomeng Li
Ziyuan Huang
Mingqian Tang
Xinbo Gao
33
42
0
24 Aug 2021
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
Jianwei Yang
Yonatan Bisk
Jianfeng Gao
27
137
0
23 Aug 2021
End-to-End Dense Video Captioning with Parallel Decoding
End-to-End Dense Video Captioning with Parallel Decoding
Teng Wang
Ruimao Zhang
Zhichao Lu
Feng Zheng
Ran Cheng
Ping Luo
3DV
47
180
0
17 Aug 2021
Hybrid Reasoning Network for Video-based Commonsense Captioning
Hybrid Reasoning Network for Video-based Commonsense Captioning
Weijiang Yu
Jian Liang
Lei Ji
Lu Li
Yuejian Fang
Nong Xiao
Nan Duan
19
10
0
05 Aug 2021
Optimizing Latency for Online Video CaptioningUsing Audio-Visual
  Transformers
Optimizing Latency for Online Video CaptioningUsing Audio-Visual Transformers
Chiori Hori
Takaaki Hori
Jonathan Le Roux
25
4
0
04 Aug 2021
Armour: Generalizable Compact Self-Attention for Vision Transformers
Armour: Generalizable Compact Self-Attention for Vision Transformers
Lingchuan Meng
ViT
21
3
0
03 Aug 2021
HiFT: Hierarchical Feature Transformer for Aerial Tracking
HiFT: Hierarchical Feature Transformer for Aerial Tracking
Ziang Cao
Changhong Fu
Junjie Ye
Bowen Li
Yiming Li
26
196
0
31 Jul 2021
CLIP-It! Language-Guided Video Summarization
CLIP-It! Language-Guided Video Summarization
Medhini Narasimhan
Anna Rohrbach
Trevor Darrell
CLIP
20
113
0
01 Jul 2021
Trust It or Not: Confidence-Guided Automatic Radiology Report Generation
Trust It or Not: Confidence-Guided Automatic Radiology Report Generation
Yixin Wang
Zihao Lin
Zhe Xu
Haoyu Dong
Jiang Tian
Jie Luo
Zhongchao Shi
Yang Zhang
Jianping Fan
Zhiqiang He
UQCV
MedIm
36
12
0
21 Jun 2021
Towards Diverse Paragraph Captioning for Untrimmed Videos
Towards Diverse Paragraph Captioning for Untrimmed Videos
Yuqing Song
Shizhe Chen
Qin Jin
21
37
0
30 May 2021
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions
Junbin Xiao
Xindi Shang
Angela Yao
Tat-Seng Chua
45
446
0
18 May 2021
VidTr: Video Transformer Without Convolutions
VidTr: Video Transformer Without Convolutions
Yanyi Zhang
Xinyu Li
Chunhui Liu
Bing Shuai
Yi Zhu
Biagio Brattoli
Hao Chen
I. Marsic
Joseph Tighe
ViT
148
193
0
23 Apr 2021
All Tokens Matter: Token Labeling for Training Better Vision
  Transformers
All Tokens Matter: Token Labeling for Training Better Vision Transformers
Zihang Jiang
Qibin Hou
Li-xin Yuan
Daquan Zhou
Yujun Shi
Xiaojie Jin
Anran Wang
Jiashi Feng
ViT
27
203
0
22 Apr 2021
Understanding Chinese Video and Language via Contrastive Multimodal
  Pre-Training
Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training
Chenyi Lei
Shixian Luo
Yong-jin Liu
Wanggui He
Jiamang Wang
Guoxin Wang
Haihong Tang
Chunyan Miao
Houqiang Li
30
41
0
19 Apr 2021
CvT: Introducing Convolutions to Vision Transformers
CvT: Introducing Convolutions to Vision Transformers
Haiping Wu
Bin Xiao
Noel Codella
Mengchen Liu
Xiyang Dai
Lu Yuan
Lei Zhang
ViT
81
1,878
0
29 Mar 2021
Multi-Scale Vision Longformer: A New Vision Transformer for
  High-Resolution Image Encoding
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
Pengchuan Zhang
Xiyang Dai
Jianwei Yang
Bin Xiao
Lu Yuan
Lei Zhang
Jianfeng Gao
ViT
29
330
0
29 Mar 2021
Incorporating Convolution Designs into Visual Transformers
Incorporating Convolution Designs into Visual Transformers
Kun Yuan
Shaopeng Guo
Ziwei Liu
Aojun Zhou
F. Yu
Wei Wu
ViT
56
467
0
22 Mar 2021
Continuous 3D Multi-Channel Sign Language Production via Progressive
  Transformers and Mixture Density Networks
Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks
Ben Saunders
Necati Cihan Camgöz
Richard Bowden
SLR
33
77
0
11 Mar 2021
Is Space-Time Attention All You Need for Video Understanding?
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
283
1,989
0
09 Feb 2021
End-to-End Video Question-Answer Generation with Generator-Pretester
  Network
End-to-End Video Question-Answer Generation with Generator-Pretester Network
Hung-Ting Su
Chen-Hsi Chang
Po-Wei Shen
Yu-Siang Wang
Ya-Liang Chang
Yu-Cheng Chang
Pu-Jen Cheng
Winston H. Hsu
35
31
0
05 Jan 2021
Transformers in Vision: A Survey
Transformers in Vision: A Survey
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
227
2,434
0
04 Jan 2021
Understanding Action Sequences based on Video Captioning for
  Learning-from-Observation
Understanding Action Sequences based on Video Captioning for Learning-from-Observation
Iori Yanokura
Naoki Wake
Kazuhiro Sasabuchi
Katsushi Ikeuchi
Masayuki Inaba
30
4
0
09 Dec 2020
A Comprehensive Review on Recent Methods and Challenges of Video
  Description
A Comprehensive Review on Recent Methods and Challenges of Video Description
Ashutosh Kumar Singh
Thoudam Doren Singh
Sivaji Bandyopadhyay
3DV
VLM
19
5
0
30 Nov 2020
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization
  Tasks
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks
Humam Alwassel
Silvio Giancola
Guohao Li
33
123
0
23 Nov 2020
DORB: Dynamically Optimizing Multiple Rewards with Bandits
DORB: Dynamically Optimizing Multiple Rewards with Bandits
Ramakanth Pasunuru
Han Guo
Joey Tianyi Zhou
OffRL
32
6
0
15 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
49
417
0
14 Nov 2020
Multimodal Pretraining for Dense Video Captioning
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang
Bo Pang
Zhenhai Zhu
Clara E. Rivera
Radu Soricut
21
81
0
10 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation
  Learning
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
31
168
0
01 Nov 2020
Learning Modality Interaction for Temporal Sentence Localization and
  Event Captioning in Videos
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos
Shaoxiang Chen
Wenhao Jiang
Wei Liu
Yu-Gang Jiang
25
101
0
28 Jul 2020
Active Learning for Video Description With Cluster-Regularized Ensemble
  Ranking
Active Learning for Video Description With Cluster-Regularized Ensemble Ranking
David M. Chan
Sudheendra Vijayanarasimhan
David A. Ross
John F. Canny
VLM
14
6
0
27 Jul 2020
SBAT: Video Captioning with Sparse Boundary-Aware Transformer
SBAT: Video Captioning with Sparse Boundary-Aware Transformer
Tao Jin
Siyu Huang
Ming Chen
Yingming Li
Zhongfei Zhang
32
52
0
23 Jul 2020
Knowledge-Based Video Question Answering with Unsupervised Scene
  Descriptions
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions
Noa Garcia
Yuta Nakashima
23
32
0
17 Jul 2020
A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks
A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks
Angela S. Lin
Sudha Rao
Asli Celikyilmaz
E. Nouri
Chris Brockett
Debadeepta Dey
Bill Dolan
26
24
0
19 May 2020
MART: Memory-Augmented Recurrent Transformer for Coherent Video
  Paragraph Captioning
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
Jie Lei
Liwei Wang
Yelong Shen
Dong Yu
Tamara L. Berg
Joey Tianyi Zhou
27
186
0
11 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation
  Pre-training
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
48
494
0
01 May 2020
Progressive Transformers for End-to-End Sign Language Production
Progressive Transformers for End-to-End Sign Language Production
Ben Saunders
Necati Cihan Camgöz
Richard Bowden
SLR
22
128
0
30 Apr 2020
Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
Boxiao Pan
Haoye Cai
De-An Huang
Kuan-Hui Lee
Adrien Gaidon
Ehsan Adeli
Juan Carlos Niebles
31
235
0
31 Mar 2020
Previous
123
Next