Saying the Unseen: Video Descriptions via Dialog Agents

26 June 2021

Ye Zhu

Yu Wu

Yi Yang

Yan Yan

Papers cited by "Saying the Unseen: Video Descriptions via Dialog Agents"

Title
Learning Audio-Visual Correlations from Variational Cross-Modal Generation Ye Zhu Yu Wu Hugo Latapie Yi Yang Yan Yan SSL 73 20 0 05 Feb 2021
History for Visual Dialog: Do we really need it? Shubham Agarwal Trung Bui Joon-Young Lee Ioannis Konstas Verena Rieser VLM 36 71 0 08 May 2020
Counterfactual Samples Synthesizing for Robust Visual Question Answering Long Chen Xin Yan Jun Xiao Hanwang Zhang Shiliang Pu Yueting Zhuang OOD AAML 187 292 0 14 Mar 2020
Meshed-Memory Transformer for Image Captioning Marcella Cornia Matteo Stefanini Lorenzo Baraldi Rita Cucchiara 51 873 0 17 Dec 2019
Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing Vedika Agarwal Rakshith Shetty Mario Fritz CML AAML 53 157 0 16 Dec 2019
Listen to Look: Action Recognition by Previewing Audio Ruohan Gao Tae-Hyun Oh Kristen Grauman Lorenzo Torresani VLM 65 251 0 10 Dec 2019
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline Vishvak Murahari Dhruv Batra Devi Parikh Abhishek Das VLM 56 115 0 05 Dec 2019
Unified Vision-Language Pre-Training for Image Captioning and VQA Luowei Zhou Hamid Palangi Lei Zhang Houdong Hu Jason J. Corso Jianfeng Gao MLLM VLM 327 933 0 24 Sep 2019
Self-Supervised Audio-Visual Co-Segmentation Andrew Rouditchenko Hang Zhao Chuang Gan Josh H. McDermott Antonio Torralba VLM SSL 57 104 0 18 Apr 2019
Factor Graph Attention Idan Schwartz Seunghak Yu Tamir Hazan Alex Schwing 49 110 0 11 Apr 2019
A Simple Baseline for Audio-Visual Scene-Aware Dialog Idan Schwartz Alex Schwing Tamir Hazan 53 69 0 11 Apr 2019
Revisiting EmbodiedQA: A Simple Baseline and Beyond Yuehua Wu Lu Jiang Yi Yang LM&Ro 59 30 0 08 Apr 2019
The Sound of Pixels Hang Zhao Chuang Gan Andrew Rouditchenko Carl Vondrick Josh H. McDermott Antonio Torralba VLM 77 535 0 09 Apr 2018
End-to-End Dense Video Captioning with Masked Transformer Luowei Zhou Yingbo Zhou Jason J. Corso R. Socher Caiming Xiong 88 527 0 03 Apr 2018
Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering Unnat Jain Svetlana Lazebnik Alex Schwing MLLM 60 81 0 29 Mar 2018
Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog Sang-Woo Lee Y. Heo Byoung-Tak Zhang 55 31 0 12 Feb 2018
Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning Andrew Owens Jiajun Wu Josh H. McDermott William T. Freeman Antonio Torralba SSL 65 177 0 20 Dec 2017
Objects that Sound Relja Arandjelović Andrew Zisserman ObjD VOS 87 529 0 18 Dec 2017
Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space Liwei Wang Alex Schwing Svetlana Lazebnik CoGe 83 175 0 19 Nov 2017
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould Lei Zhang AIMat 109 4,201 0 25 Jul 2017
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning Abhishek Das Satwik Kottur J. M. F. Moura Stefan Lee Dhruv Batra OffRL 105 425 0 20 Mar 2017
GuessWhat?! Visual object discovery through multi-modal dialogue H. D. Vries Florian Strub A. Chandar Olivier Pietquin Hugo Larochelle Aaron Courville VLM 88 428 0 23 Nov 2016
Video Captioning with Transferred Semantic Attributes Yingwei Pan Ting Yao Houqiang Li Tao Mei 57 329 0 23 Nov 2016
Boosting Image Captioning with Attributes Ting Yao Yingwei Pan Yehao Li Zhaofan Qiu Tao Mei VLM 80 621 0 05 Nov 2016
CNN Architectures for Large-Scale Audio Classification Shawn Hershey Sourish Chaudhuri D. Ellis J. Gemmeke A. Jansen ... Rif A. Saurous Bryan Seybold M. Slaney Ron J. Weiss K. Wilson 101 2,488 0 29 Sep 2016
Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? Abhishek Das Harsh Agrawal C. L. Zitnick Devi Parikh Dhruv Batra 87 465 0 11 Jun 2016
Adversarial Feature Learning Jeff Donahue Philipp Krahenbuhl Trevor Darrell GAN 92 1,608 0 31 May 2016
Stacked Attention Networks for Image Question Answering Zichao Yang Xiaodong He Jianfeng Gao Li Deng Alex Smola BDL 101 1,875 0 07 Nov 2015
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Ke Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron Courville Ruslan Salakhutdinov R. Zemel Yoshua Bengio DiffM 298 10,034 0 10 Feb 2015
CIDEr: Consensus-based Image Description Evaluation Ramakrishna Vedantam C. L. Zitnick Devi Parikh 241 4,451 0 20 Nov 2014