ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2110.05122
  4. Cited By
Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$
  Videos

Pano-AVQA: Grounded Audio-Visual Question Answering on 360∘^\circ∘ Videos

11 October 2021
Heeseung Yun
Youngjae Yu
Wonsuk Yang
Kangil Lee
Gunhee Kim
ArXiv (abs)PDFHTML

Papers citing "Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos"

50 / 59 papers shown
Title
Learning Musical Representations for Music Performance Question Answering
Xingjian Diao
Chunhui Zhang
Tingxuan Wu
Ming Cheng
Z. Ouyang
Weiyi Wu
Jiang Gui
110
12
0
10 Feb 2025
Towards Open-Vocabulary Audio-Visual Event Localization
Jinxing Zhou
Dan Guo
Ruohao Guo
Yuxin Mao
Jingjing Hu
Yiran Zhong
Xiaojun Chang
Ming Wang
VLM
101
5
0
18 Nov 2024
AV-PedAware: Self-Supervised Audio-Visual Fusion for Dynamic Pedestrian Awareness
AV-PedAware: Self-Supervised Audio-Visual Fusion for Dynamic Pedestrian Awareness
Yizhuo Yang
Shenghai Yuan
Muqing Cao
Jianfei Yang
Lihua Xie
236
8
0
11 Nov 2024
OmniBench: Towards The Future of Universal Omni-Language Models
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li
Ge Zhang
Yinghao Ma
Ruibin Yuan
Kang Zhu
...
Zhaoxiang Zhang
Zachary Liu
Emmanouil Benetos
Wenhao Huang
Chenghua Lin
LRM
104
19
0
23 Sep 2024
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
Asmar Nadeem
Faegheh Sardari
R. Dawes
Syed Sameed Husain
Adrian Hilton
Armin Mustafa
92
4
0
10 Jun 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
95
6
0
18 Apr 2024
BAT: Learning to Reason about Spatial Sounds with Large Language Models
BAT: Learning to Reason about Spatial Sounds with Large Language Models
Zhisheng Zheng
Puyuan Peng
Ziyang Ma
Xie Chen
Eunsol Choi
David Harwath
LRM
107
19
0
02 Feb 2024
Audio-Visual Instance Segmentation
Audio-Visual Instance Segmentation
Ruohao Guo
Yaru Chen
Yanyu Qi
Wenzhen Yue
Dantong Niu
...
Wenzhen Yue
Ji Shi
Qixun Wang
Peiliang Zhang
Buwen Liang
VLMVOS
83
2
0
28 Oct 2023
Transformadores: Fundamentos teoricos y Aplicaciones
Transformadores: Fundamentos teoricos y Aplicaciones
J. D. L. Torre
144
0
0
18 Feb 2023
Multimodal Deep Learning
Multimodal Deep Learning
Cem Akkus
Jiquan Ngiam
Vladana Djakovic
Steffen Jauch-Walser
A. Khosla
...
Jann Goschenhofer
Honglak Lee
A. Ng
Daniel Schalk
Matthias Aßenmacher
120
3,176
0
12 Jan 2023
Self-supervised Neural Audio-Visual Sound Source Localization via
  Probabilistic Spatial Modeling
Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
Yoshiki Masuyama
Yoshiaki Bando
Kohei Yatabe
Y. Sasaki
Masaki Onishi
Yasuhiro Oikawa
SSL
81
13
0
28 Jul 2020
Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio
  Representation
Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation
Po-Han Chi
Pei-Hung Chung
Tsung-Han Wu
Chun-Cheng Hsieh
Yen-Hao Chen
Shang-Wen Li
Hung-yi Lee
SSL
65
148
0
18 May 2020
Semantic Object Prediction and Spatial Sound Super-Resolution with
  Binaural Sounds
Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds
A. Vasudevan
Dengxin Dai
Luc Van Gool
ObjD
119
45
0
09 Mar 2020
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern
  Recognition
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
Qiuqiang Kong
Yin Cao
Turab Iqbal
Yuxuan Wang
Wenwu Wang
Mark D. Plumbley
VLMSSL
192
1,084
0
21 Dec 2019
Temporal Reasoning via Audio Question Answering
Temporal Reasoning via Audio Question Answering
Haytham M. Fayek
Justin Johnson
49
54
0
21 Nov 2019
Self-supervised Moving Vehicle Tracking with Stereo Sound
Self-supervised Moving Vehicle Tracking with Stereo Sound
Chuang Gan
Hang Zhao
Peihao Chen
David D. Cox
Antonio Torralba
48
147
0
25 Oct 2019
KnowIT VQA: Answering Knowledge-Based Questions about Videos
KnowIT VQA: Answering Knowledge-Based Questions about Videos
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
130
80
0
23 Oct 2019
Clotho: An Audio Captioning Dataset
Clotho: An Audio Captioning Dataset
Konstantinos Drossos
Samuel Lipping
Tuomas Virtanen
98
393
0
21 Oct 2019
Unified Vision-Language Pre-Training for Image Captioning and VQA
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLMVLM
355
942
0
24 Sep 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLMMLLMSSL
169
1,666
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from
  Transformers
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Hao Tan
Joey Tianyi Zhou
VLMMLLM
250
2,488
0
20 Aug 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language
VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
153
1,963
0
09 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSLVLM
240
3,695
0
06 Aug 2019
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via
  Question Answering
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
Zhou Yu
D. Xu
Jun-chen Yu
Ting Yu
Zhou Zhao
Yueting Zhuang
Dacheng Tao
112
474
0
06 Jun 2019
WoodScape: A multi-task, multi-camera fisheye dataset for autonomous
  driving
WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving
S. Yogamani
Ciarán Hughes
Jonathan Horgan
Ganesh Sistu
P. Varley
...
Sumanth Chennupati
Sanjaya Nayak
Saquib Mansoor
Xavier Perroton
P. Pérez
HAI
64
266
0
04 May 2019
The Sound of Motions
The Sound of Motions
Hang Zhao
Chuang Gan
Wei-Chiu Ma
Antonio Torralba
83
254
0
11 Apr 2019
Embodied Question Answering in Photorealistic Environments with Point
  Cloud Perception
Embodied Question Answering in Photorealistic Environments with Point Cloud Perception
Erik Wijmans
Samyak Datta
Oleksandr Maksymets
Abhishek Das
Georgia Gkioxari
Stefan Lee
Irfan Essa
Devi Parikh
Dhruv Batra
3DPCLM&Ro
75
169
0
06 Apr 2019
Habitat: A Platform for Embodied AI Research
Habitat: A Platform for Embodied AI Research
Manolis Savva
Abhishek Kadian
Oleksandr Maksymets
Yili Zhao
Erik Wijmans
...
Jia-Wei Liu
V. Koltun
Jitendra Malik
Devi Parikh
Dhruv Batra
LM&Ro
120
1,420
0
02 Apr 2019
nuScenes: A multimodal dataset for autonomous driving
nuScenes: A multimodal dataset for autonomous driving
Holger Caesar
Varun Bankiti
Alex H. Lang
Sourabh Vora
Venice Erin Liong
Qiang Xu
Anush Krishnan
Yuxin Pan
G. Baldan
Oscar Beijbom
3DPC
301
5,779
0
26 Mar 2019
Audio-Visual Scene-Aware Dialog
Audio-Visual Scene-Aware Dialog
Huda AlAmri
Vincent Cartillier
Abhishek Das
Jue Wang
A. Cherian
...
Tim K. Marks
Chiori Hori
Peter Anderson
Stefan Lee
Devi Parikh
VGen
54
194
0
25 Jan 2019
2.5D Visual Sound
2.5D Visual Sound
Ruohan Gao
Kristen Grauman
VGen
111
131
0
11 Dec 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language
  Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLMSSLSSeg
1.8K
95,175
0
11 Oct 2018
Self-Supervised Generation of Spatial Audio for 360 Video
Self-Supervised Generation of Spatial Audio for 360 Video
Pedro Morgado
Nuno Vasconcelos
Timothy R. Langlois
Oliver Wang
MDE
62
174
0
07 Sep 2018
TVQA: Localized, Compositional Video Question Answering
TVQA: Localized, Compositional Video Question Answering
Muhammad Abdul Wahab
Licheng Yu
Mounir Nasr Allah
Tamara L. Berg
97
642
0
05 Sep 2018
ODSQA: Open-domain Spoken Question Answering Dataset
ODSQA: Open-domain Spoken Question Answering Dataset
Chia-Hsuan Lee
Shang-Ming Wang
Huan-Cheng Chang
Hung-yi Lee
RALM
59
54
0
07 Aug 2018
Learning Conditioned Graph Structures for Interpretable Visual Question
  Answering
Learning Conditioned Graph Structures for Interpretable Visual Question Answering
Will Norcliffe-Brown
Efstathios Vafeias
Sarah Parisot
GNN
66
237
0
19 Jun 2018
Cube Padding for Weakly-Supervised Saliency Prediction in 360°
  Videos
Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos
Hsien-Tzu Cheng
Chun-Hung Chao
Jin-Dong Dong
Hao Wen
Tyng-Luh Liu
Min Sun
65
193
0
04 Jun 2018
A Memory Network Approach for Story-based Temporal Summarization of
  360° Videos
A Memory Network Approach for Story-based Temporal Summarization of 360° Videos
Sangho Lee
Jinyoung Sung
Youngjae Yu
Gunhee Kim
93
68
0
08 May 2018
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Andrew Owens
Alexei A. Efros
SSL
98
753
0
10 Apr 2018
The Sound of Pixels
The Sound of Pixels
Hang Zhao
Chuang Gan
Andrew Rouditchenko
Carl Vondrick
Josh H. McDermott
Antonio Torralba
VLM
102
536
0
09 Apr 2018
Learning to Navigate in Cities Without a Map
Learning to Navigate in Cities Without a Map
Piotr Wojciech Mirowski
Matthew Koichi Grimes
Mateusz Malinowski
Karl Moritz Hermann
Keith Anderson
Denis Teplyashin
Karen Simonyan
Koray Kavukcuoglu
Andrew Zisserman
R. Hadsell
SSLHAI
101
320
0
31 Mar 2018
A Deep Ranking Model for Spatio-Temporal Highlight Detection from a 360
  Video
A Deep Ranking Model for Spatio-Temporal Highlight Detection from a 360 Video
Youngjae Yu
Sangho Lee
Joonil Na
Jaeyun Kang
Gunhee Kim
41
44
0
31 Jan 2018
Building Generalizable Agents with a Realistic and Rich 3D Environment
Building Generalizable Agents with a Realistic and Rich 3D Environment
Yi Wu
Yuxin Wu
Georgia Gkioxari
Yuandong Tian
3DV
135
339
0
07 Jan 2018
Objects that Sound
Objects that Sound
Relja Arandjelović
Andrew Zisserman
ObjDVOS
113
530
0
18 Dec 2017
IQA: Visual Question Answering in Interactive Environments
IQA: Visual Question Answering in Interactive Environments
Daniel Gordon
Aniruddha Kembhavi
Mohammad Rastegari
Joseph Redmon
Dieter Fox
Ali Farhadi
LM&Ro
93
391
0
09 Dec 2017
Embodied Question Answering
Embodied Question Answering
Abhishek Das
Samyak Datta
Georgia Gkioxari
Stefan Lee
Devi Parikh
Dhruv Batra
LM&Ro
100
651
0
30 Nov 2017
Video Question Answering via Attribute-Augmented Attention Network
  Learning
Video Question Answering via Attribute-Augmented Attention Network Learning
Yunan Ye
Zhou Zhao
Yimeng Li
Long Chen
Jun Xiao
Yueting Zhuang
54
109
0
20 Jul 2017
DeepStory: Video Story QA by Deep Embedded Memory Networks
DeepStory: Video Story QA by Deep Embedded Memory Networks
Kyung-Min Kim
Min-Oh Heo
Seongho Choi
Byoung-Tak Zhang
77
175
0
04 Jul 2017
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
730
132,363
0
12 Jun 2017
Deep 360 Pilot: Learning a Deep Agent for Piloting through 360°
  Sports Video
Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Video
Hou-Ning Hu
Yen-Chen Lin
Ming-Yuan Liu
Hsien-Tzu Cheng
Yung-Ju Chang
Min Sun
70
177
0
04 May 2017
12
Next