ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2106.14069
  4. Cited By
Saying the Unseen: Video Descriptions via Dialog Agents

Saying the Unseen: Video Descriptions via Dialog Agents

26 June 2021
Ye Zhu
Yu Wu
Yi Yang
Yan Yan
ArXivPDFHTML

Papers citing "Saying the Unseen: Video Descriptions via Dialog Agents"

30 / 30 papers shown
Title
Learning Audio-Visual Correlations from Variational Cross-Modal
  Generation
Learning Audio-Visual Correlations from Variational Cross-Modal Generation
Ye Zhu
Yu Wu
Hugo Latapie
Yi Yang
Yan Yan
SSL
73
20
0
05 Feb 2021
History for Visual Dialog: Do we really need it?
History for Visual Dialog: Do we really need it?
Shubham Agarwal
Trung Bui
Joon-Young Lee
Ioannis Konstas
Verena Rieser
VLM
36
71
0
08 May 2020
Counterfactual Samples Synthesizing for Robust Visual Question Answering
Counterfactual Samples Synthesizing for Robust Visual Question Answering
Long Chen
Xin Yan
Jun Xiao
Hanwang Zhang
Shiliang Pu
Yueting Zhuang
OOD
AAML
187
292
0
14 Mar 2020
Meshed-Memory Transformer for Image Captioning
Meshed-Memory Transformer for Image Captioning
Marcella Cornia
Matteo Stefanini
Lorenzo Baraldi
Rita Cucchiara
51
873
0
17 Dec 2019
Towards Causal VQA: Revealing and Reducing Spurious Correlations by
  Invariant and Covariant Semantic Editing
Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing
Vedika Agarwal
Rakshith Shetty
Mario Fritz
CML
AAML
53
157
0
16 Dec 2019
Listen to Look: Action Recognition by Previewing Audio
Listen to Look: Action Recognition by Previewing Audio
Ruohan Gao
Tae-Hyun Oh
Kristen Grauman
Lorenzo Torresani
VLM
65
251
0
10 Dec 2019
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art
  Baseline
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
Vishvak Murahari
Dhruv Batra
Devi Parikh
Abhishek Das
VLM
56
115
0
05 Dec 2019
Unified Vision-Language Pre-Training for Image Captioning and VQA
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
327
933
0
24 Sep 2019
Self-Supervised Audio-Visual Co-Segmentation
Self-Supervised Audio-Visual Co-Segmentation
Andrew Rouditchenko
Hang Zhao
Chuang Gan
Josh H. McDermott
Antonio Torralba
VLM
SSL
57
104
0
18 Apr 2019
Factor Graph Attention
Factor Graph Attention
Idan Schwartz
Seunghak Yu
Tamir Hazan
Alex Schwing
49
110
0
11 Apr 2019
A Simple Baseline for Audio-Visual Scene-Aware Dialog
A Simple Baseline for Audio-Visual Scene-Aware Dialog
Idan Schwartz
Alex Schwing
Tamir Hazan
53
69
0
11 Apr 2019
Revisiting EmbodiedQA: A Simple Baseline and Beyond
Revisiting EmbodiedQA: A Simple Baseline and Beyond
Yuehua Wu
Lu Jiang
Yi Yang
LM&Ro
59
30
0
08 Apr 2019
The Sound of Pixels
The Sound of Pixels
Hang Zhao
Chuang Gan
Andrew Rouditchenko
Carl Vondrick
Josh H. McDermott
Antonio Torralba
VLM
77
535
0
09 Apr 2018
End-to-End Dense Video Captioning with Masked Transformer
End-to-End Dense Video Captioning with Masked Transformer
Luowei Zhou
Yingbo Zhou
Jason J. Corso
R. Socher
Caiming Xiong
88
527
0
03 Apr 2018
Two can play this Game: Visual Dialog with Discriminative Question
  Generation and Answering
Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering
Unnat Jain
Svetlana Lazebnik
Alex Schwing
MLLM
60
81
0
29 Mar 2018
Answerer in Questioner's Mind: Information Theoretic Approach to
  Goal-Oriented Visual Dialog
Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog
Sang-Woo Lee
Y. Heo
Byoung-Tak Zhang
55
31
0
12 Feb 2018
Learning Sight from Sound: Ambient Sound Provides Supervision for Visual
  Learning
Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning
Andrew Owens
Jiajun Wu
Josh H. McDermott
William T. Freeman
Antonio Torralba
SSL
65
177
0
20 Dec 2017
Objects that Sound
Objects that Sound
Relja Arandjelović
Andrew Zisserman
ObjD
VOS
87
529
0
18 Dec 2017
Diverse and Accurate Image Description Using a Variational Auto-Encoder
  with an Additive Gaussian Encoding Space
Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space
Liwei Wang
Alex Schwing
Svetlana Lazebnik
CoGe
83
175
0
19 Nov 2017
Bottom-Up and Top-Down Attention for Image Captioning and Visual
  Question Answering
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
109
4,201
0
25 Jul 2017
Learning Cooperative Visual Dialog Agents with Deep Reinforcement
  Learning
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
Abhishek Das
Satwik Kottur
J. M. F. Moura
Stefan Lee
Dhruv Batra
OffRL
105
425
0
20 Mar 2017
GuessWhat?! Visual object discovery through multi-modal dialogue
GuessWhat?! Visual object discovery through multi-modal dialogue
H. D. Vries
Florian Strub
A. Chandar
Olivier Pietquin
Hugo Larochelle
Aaron Courville
VLM
88
428
0
23 Nov 2016
Video Captioning with Transferred Semantic Attributes
Video Captioning with Transferred Semantic Attributes
Yingwei Pan
Ting Yao
Houqiang Li
Tao Mei
57
329
0
23 Nov 2016
Boosting Image Captioning with Attributes
Boosting Image Captioning with Attributes
Ting Yao
Yingwei Pan
Yehao Li
Zhaofan Qiu
Tao Mei
VLM
80
621
0
05 Nov 2016
CNN Architectures for Large-Scale Audio Classification
CNN Architectures for Large-Scale Audio Classification
Shawn Hershey
Sourish Chaudhuri
D. Ellis
J. Gemmeke
A. Jansen
...
Rif A. Saurous
Bryan Seybold
M. Slaney
Ron J. Weiss
K. Wilson
101
2,488
0
29 Sep 2016
Human Attention in Visual Question Answering: Do Humans and Deep
  Networks Look at the Same Regions?
Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?
Abhishek Das
Harsh Agrawal
C. L. Zitnick
Devi Parikh
Dhruv Batra
87
465
0
11 Jun 2016
Adversarial Feature Learning
Adversarial Feature Learning
Jiasen Lu
Philipp Krahenbuhl
Trevor Darrell
GAN
92
1,608
0
31 May 2016
Stacked Attention Networks for Image Question Answering
Stacked Attention Networks for Image Question Answering
Zichao Yang
Xiaodong He
Jianfeng Gao
Li Deng
Alex Smola
BDL
101
1,875
0
07 Nov 2015
Show, Attend and Tell: Neural Image Caption Generation with Visual
  Attention
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Ke Xu
Jimmy Ba
Ryan Kiros
Kyunghyun Cho
Aaron Courville
Ruslan Salakhutdinov
R. Zemel
Yoshua Bengio
DiffM
298
10,034
0
10 Feb 2015
CIDEr: Consensus-based Image Description Evaluation
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
241
4,451
0
20 Nov 2014
1