Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2106.14069
Cited By
Saying the Unseen: Video Descriptions via Dialog Agents
26 June 2021
Ye Zhu
Yu Wu
Yi Yang
Yan Yan
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Saying the Unseen: Video Descriptions via Dialog Agents"
30 / 30 papers shown
Title
Learning Audio-Visual Correlations from Variational Cross-Modal Generation
Ye Zhu
Yu Wu
Hugo Latapie
Yi Yang
Yan Yan
SSL
73
20
0
05 Feb 2021
History for Visual Dialog: Do we really need it?
Shubham Agarwal
Trung Bui
Joon-Young Lee
Ioannis Konstas
Verena Rieser
VLM
36
71
0
08 May 2020
Counterfactual Samples Synthesizing for Robust Visual Question Answering
Long Chen
Xin Yan
Jun Xiao
Hanwang Zhang
Shiliang Pu
Yueting Zhuang
OOD
AAML
187
292
0
14 Mar 2020
Meshed-Memory Transformer for Image Captioning
Marcella Cornia
Matteo Stefanini
Lorenzo Baraldi
Rita Cucchiara
51
873
0
17 Dec 2019
Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing
Vedika Agarwal
Rakshith Shetty
Mario Fritz
CML
AAML
53
157
0
16 Dec 2019
Listen to Look: Action Recognition by Previewing Audio
Ruohan Gao
Tae-Hyun Oh
Kristen Grauman
Lorenzo Torresani
VLM
65
251
0
10 Dec 2019
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
Vishvak Murahari
Dhruv Batra
Devi Parikh
Abhishek Das
VLM
56
115
0
05 Dec 2019
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
327
933
0
24 Sep 2019
Self-Supervised Audio-Visual Co-Segmentation
Andrew Rouditchenko
Hang Zhao
Chuang Gan
Josh H. McDermott
Antonio Torralba
VLM
SSL
57
104
0
18 Apr 2019
Factor Graph Attention
Idan Schwartz
Seunghak Yu
Tamir Hazan
Alex Schwing
49
110
0
11 Apr 2019
A Simple Baseline for Audio-Visual Scene-Aware Dialog
Idan Schwartz
Alex Schwing
Tamir Hazan
53
69
0
11 Apr 2019
Revisiting EmbodiedQA: A Simple Baseline and Beyond
Yuehua Wu
Lu Jiang
Yi Yang
LM&Ro
59
30
0
08 Apr 2019
The Sound of Pixels
Hang Zhao
Chuang Gan
Andrew Rouditchenko
Carl Vondrick
Josh H. McDermott
Antonio Torralba
VLM
77
535
0
09 Apr 2018
End-to-End Dense Video Captioning with Masked Transformer
Luowei Zhou
Yingbo Zhou
Jason J. Corso
R. Socher
Caiming Xiong
88
527
0
03 Apr 2018
Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering
Unnat Jain
Svetlana Lazebnik
Alex Schwing
MLLM
60
81
0
29 Mar 2018
Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog
Sang-Woo Lee
Y. Heo
Byoung-Tak Zhang
55
31
0
12 Feb 2018
Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning
Andrew Owens
Jiajun Wu
Josh H. McDermott
William T. Freeman
Antonio Torralba
SSL
65
177
0
20 Dec 2017
Objects that Sound
Relja Arandjelović
Andrew Zisserman
ObjD
VOS
87
529
0
18 Dec 2017
Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space
Liwei Wang
Alex Schwing
Svetlana Lazebnik
CoGe
83
175
0
19 Nov 2017
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
109
4,201
0
25 Jul 2017
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
Abhishek Das
Satwik Kottur
J. M. F. Moura
Stefan Lee
Dhruv Batra
OffRL
105
425
0
20 Mar 2017
GuessWhat?! Visual object discovery through multi-modal dialogue
H. D. Vries
Florian Strub
A. Chandar
Olivier Pietquin
Hugo Larochelle
Aaron Courville
VLM
88
428
0
23 Nov 2016
Video Captioning with Transferred Semantic Attributes
Yingwei Pan
Ting Yao
Houqiang Li
Tao Mei
57
329
0
23 Nov 2016
Boosting Image Captioning with Attributes
Ting Yao
Yingwei Pan
Yehao Li
Zhaofan Qiu
Tao Mei
VLM
80
621
0
05 Nov 2016
CNN Architectures for Large-Scale Audio Classification
Shawn Hershey
Sourish Chaudhuri
D. Ellis
J. Gemmeke
A. Jansen
...
Rif A. Saurous
Bryan Seybold
M. Slaney
Ron J. Weiss
K. Wilson
101
2,488
0
29 Sep 2016
Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?
Abhishek Das
Harsh Agrawal
C. L. Zitnick
Devi Parikh
Dhruv Batra
87
465
0
11 Jun 2016
Adversarial Feature Learning
Jiasen Lu
Philipp Krahenbuhl
Trevor Darrell
GAN
92
1,608
0
31 May 2016
Stacked Attention Networks for Image Question Answering
Zichao Yang
Xiaodong He
Jianfeng Gao
Li Deng
Alex Smola
BDL
101
1,875
0
07 Nov 2015
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Ke Xu
Jimmy Ba
Ryan Kiros
Kyunghyun Cho
Aaron Courville
Ruslan Salakhutdinov
R. Zemel
Yoshua Bengio
DiffM
298
10,034
0
10 Feb 2015
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
241
4,451
0
20 Nov 2014
1