1610.02947
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
10 October 2016
Youngjae Yu
Hyungjin Ko
Jongwook Choi
Gunhee Kim
Papers citing
"End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering"
43 / 43 papers shown
Title
Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning
Caihua Liu
Xu Li
Wenjing Xue
Wei Tang
Xia Feng
56
0
0
20 Feb 2025
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
64
10
1
09 Jun 2024
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
55
1
0
01 Apr 2024
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Tanveer Hannan
Md. Mohaiminul Islam
Thomas Seidl
Gedas Bertasius
33
4
0
11 Dec 2023
Hierarchical Video-Moment Retrieval and Step-Captioning
Abhaysinh Zala
Jaemin Cho
Satwik Kottur
Xilun Chen
Barlas Oğuz
Yashar Mehdad
Mohit Bansal
3DV
22
51
0
29 Mar 2023
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
Fenglin Liu
Xian Wu
Chenyu You
Shen Ge
Yuexian Zou
Xu Sun
29
23
0
22 Nov 2022
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Peng Jin
Jinfa Huang
Fenglin Liu
Xian Wu
Shen Ge
Guoli Song
David Clifton
Jing Chen
VLM
52
64
0
21 Nov 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
36
272
0
15 Jul 2022
Automatic Concept Extraction for Concept Bottleneck-based Video Classification
J. Jeyakumar
Luke Dickens
L. Garcia
Yu Cheng
Diego Ramirez Echavarria
Joseph Noor
Alessandra Russo
Lance M. Kaplan
Erik P. Blasch
Mani B. Srivastava
24
8
0
21 Jun 2022
Structured Two-stream Attention Network for Video Question Answering
Lianli Gao
Pengpeng Zeng
Jingkuan Song
Yuan-Fang Li
Wu Liu
Tao Mei
Heng Tao Shen
43
68
0
02 Jun 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Mohit Bansal
Gedas Bertasius
56
39
0
06 Apr 2022
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization
Alexander Kunitsyn
M. Kalashnikov
Maksim Dzabraev
Andrei Ivaniuta
30
16
0
14 Mar 2022
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
Bang-ju Yang
Tong Zhang
Yuexian Zou
CLIP
25
20
0
30 Nov 2021
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue
Tiankai Hang
Yanhong Zeng
Yuchong Sun
Bei Liu
Huan Yang
Jianlong Fu
B. Guo
AI4TS
VLM
31
189
0
19 Nov 2021
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations
Mohammadreza Zolfaghari
Yi Zhu
Peter V. Gehler
Thomas Brox
137
127
0
30 Sep 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Christoph Feichtenhofer
CLIP
VLM
259
561
0
28 Sep 2021
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
Jianwei Yang
Yonatan Bisk
Jianfeng Gao
27
137
0
23 Aug 2021
HANet: Hierarchical Alignment Networks for Video-Text Retrieval
Peng Wu
Xiangteng He
Mingqian Tang
Yiliang Lv
Jing Liu
44
52
0
26 Jul 2021
Connecting Language and Vision for Natural Language-Based Vehicle Retrieval
Shuai Bai
Zhedong Zheng
Xiaohan Wang
Junyang Lin
Zhu Zhang
Chang Zhou
Yi Yang
Hongxia Yang
26
27
0
31 May 2021
Towards Diverse Paragraph Captioning for Untrimmed Videos
Yuqing Song
Shizhe Chen
Qin Jin
21
37
0
30 May 2021
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
Xiaohan Wang
Linchao Zhu
Yi Yang
170
170
0
20 Apr 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
332
782
0
18 Apr 2021
VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks
Hung Le
Nancy F. Chen
Guosheng Lin
MLLM
30
19
0
16 Apr 2021
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
Li Xu
He Huang
Jun Liu
ViT
LRM
17
83
0
29 Mar 2021
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
Maksim Dzabraev
M. Kalashnikov
Stepan Alekseevich Komkov
Aleksandr Petiushko
24
128
0
19 Mar 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Mohit Bansal
Jingjing Liu
CLIP
46
648
0
11 Feb 2021
Recent Advances in Video Question Answering: A Review of Datasets and Methods
Devshree Patel
Ratnam Parikh
Yesha Shastri
15
18
0
15 Jan 2021
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
49
417
0
14 Nov 2020
BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues
Hung Le
Doyen Sahoo
Nancy F. Chen
Guosheng Lin
55
30
0
20 Oct 2020
Dual Encoding for Video Retrieval by Text
Jianfeng Dong
Xirong Li
Chaoxi Xu
Xun Yang
Gang Yang
Xun Wang
Meng Wang
26
2
0
10 Sep 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Karteek Alahari
Cordelia Schmid
ViT
436
596
0
21 Jul 2020
Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval
Xun Yang
Jianfeng Dong
Yixin Cao
Xun Wang
Meng Wang
Tat-Seng Chua
33
137
0
06 Jul 2020
Action Modifiers: Learning from Adverbs in Instructional Videos
Hazel Doughty
Ivan Laptev
W. Mayol-Cuevas
Dima Damen
27
30
0
13 Dec 2019
A Graph-Based Framework to Bridge Movies and Synopses
Yu Xiong
Chengyi Zhang
Lingfeng Guo
Hang Zhou
Bolei Zhou
Dahua Lin
34
62
0
24 Oct 2019
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
Aditya Mogadala
M. Kalimuthu
Dietrich Klakow
VLM
29
133
0
22 Jul 2019
Dual Encoding for Zero-Example Video Retrieval
Jianfeng Dong
Xirong Li
Chaoxi Xu
S. Ji
Yuan He
Gang Yang
Xun Wang
30
268
0
17 Sep 2018
Getting the subtext without the text: Scalable multimodal sentiment classification from visual and acoustic modalities
Nathaniel Blanchard
Daniel Moreira
Aparna Bharati
Walter J. Scheirer
27
16
0
03 Jul 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
Antoine Miech
Ivan Laptev
Josef Sivic
24
233
0
07 Apr 2018
Motion-Appearance Co-Memory Networks for Video Question Answering
J. Gao
Runzhou Ge
Kan Chen
Ram Nevatia
41
240
0
29 Mar 2018
A Read-Write Memory Network for Movie Story Understanding
Seil Na
Sangho Lee
Jisung Kim
Gunhee Kim
AIMat
24
98
0
27 Sep 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
Y. Jang
Yale Song
Youngjae Yu
Youngjin Kim
Gunhee Kim
34
547
0
14 Apr 2017
A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering
Tegan Maharaj
Nicolas Ballas
Anna Rohrbach
Aaron Courville
C. Pal
VGen
15
107
0
23 Nov 2016
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Akira Fukui
Dong Huk Park
Daylen Yang
Anna Rohrbach
Trevor Darrell
Marcus Rohrbach
167
1,465
0
06 Jun 2016