Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2012.05710
Cited By
Look Before you Speak: Visually Contextualized Utterances
10 December 2020
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Look Before you Speak: Visually Contextualized Utterances"
50 / 66 papers shown
Title
MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
Stefan Gerd Fritsch
Cennet Oğuz
Vitor Fortes Rey
L. Ray
Maximilian Kiefer-Emmanouilidis
Paul Lukowicz
HAI
72
0
0
06 Jun 2024
Support-set bottlenecks for video-text representation learning
Mandela Patrick
Po-Yao (Bernie) Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João Henriques
Andrea Vedaldi
78
248
0
06 Oct 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
531
608
0
21 Jul 2020
Temporal Aggregate Representations for Long-Range Video Understanding
Fadime Sener
Dipika Singhania
Angela Yao
AI4TS
43
7
0
01 Jun 2020
Condensed Movies: Story Based Retrieval with Contextual Embeddings
Max Bain
Arsha Nagrani
A. Brown
Andrew Zisserman
93
101
0
08 May 2020
Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video
Antonino Furnari
G. Farinella
EgoV
45
141
0
04 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
106
503
0
01 May 2020
Local-Global Video-Text Interactions for Temporal Grounding
Jonghwan Mun
Minsu Cho
Bohyung Han
66
269
0
16 Apr 2020
Deep Multimodal Feature Encoding for Video Ordering
Vivek Sharma
Makarand Tapaswi
Rainer Stiefelhagen
58
10
0
05 Apr 2020
DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator
Hwanhee Lee
Seunghyun Yoon
Franck Dernoncourt
Doo Soon Kim
Trung Bui
Kyomin Jung
53
15
0
01 Apr 2020
Speech2Action: Cross-modal Supervision for Action Recognition
Arsha Nagrani
Chen Sun
David A. Ross
Rahul Sukthankar
Cordelia Schmid
Andrew Zisserman
59
54
0
30 Mar 2020
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
Elad Amrani
Rami Ben-Ari
Daniel Rotman
A. Bronstein
72
123
0
06 Mar 2020
Hierarchical Conditional Relation Networks for Video Question Answering
T. Le
Vuong Le
Svetha Venkatesh
T. Tran
77
259
0
25 Feb 2020
Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge
Hung Le
Nancy F. Chen
40
9
0
25 Feb 2020
Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog
Zekang Li
Zongjia Li
Jinchao Zhang
Yang Feng
Cheng Niu
Jie Zhou
119
37
0
01 Feb 2020
Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System
Yun-Wei Chu
Kuan-Yen Lin
Chao-Chun Hsu
Lun-Wei Ku
100
22
0
17 Jan 2020
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Sivic
Andrew Zisserman
VGen
SSL
116
711
0
13 Dec 2019
Reinforcing an Image Caption Generator Using Off-Line Human Feedback
Paul Hongsuck Seo
Piyush Sharma
Tomer Levinboim
Bohyung Han
Radu Soricut
OffRL
52
22
0
21 Nov 2019
The Eighth Dialog System Technology Challenge
Seokhwan Kim
Michel Galley
Chulaka Gunasekara
Sungjin Lee
Adam Atkinson
...
Tim K. Marks
Abhinav Rastogi
Xiaoxue Zang
Srinivas Sunkara
Raghav Gupta
VLM
60
65
0
14 Nov 2019
Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis
Zhongkai Sun
P. Sarma
W. Sethares
Yingyu Liang
52
325
0
13 Nov 2019
UNITER: UNiversal Image-TExt Representation Learning
Yen-Chun Chen
Linjie Li
Licheng Yu
Ahmed El Kholy
Faisal Ahmed
Zhe Gan
Yu Cheng
Jingjing Liu
VLM
OT
107
447
0
25 Sep 2019
Re-ID Driven Localization Refinement for Person Search
Chuchu Han
Jiacheng Ye
Mingliang Xu
Xin Tan
Chi Zhang
Changxin Gao
Nong Sang
43
121
0
18 Sep 2019
Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering
Soravit Changpinyo
Bo Pang
Piyush Sharma
Radu Soricut
ObjD
51
20
0
04 Sep 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
226
3,678
0
06 Aug 2019
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
Yang Liu
Samuel Albanie
Arsha Nagrani
Andrew Zisserman
76
389
0
31 Jul 2019
Identifying Visible Actions in Lifestyle Vlogs
Oana Ignat
Laura Burdick
Jia Deng
Rada Mihalcea
32
14
0
10 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
110
1,200
0
07 Jun 2019
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
Zhou Yu
D. Xu
Jun-chen Yu
Ting Yu
Zhou Zhao
Yueting Zhuang
Dacheng Tao
107
463
0
06 Jun 2019
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering
Chenyou Fan
Xiaofan Zhang
Shu Zhang
Wensheng Wang
Chi Zhang
Heng-Chiao Huang
49
278
0
08 Apr 2019
Streamlined Dense Video Captioning
Jonghwan Mun
L. Yang
Zhou Ren
N. Xu
Bohyung Han
51
140
0
08 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
77
1,246
0
03 Apr 2019
Cross-task weakly supervised learning from instructional videos
Dimitri Zhukov
Jean-Baptiste Alayrac
R. G. Cinbis
David Fouhey
Ivan Laptev
Josef Sivic
SSL
115
249
0
19 Mar 2019
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
Yansong Tang
Dajun Ding
Yongming Rao
Yu Zheng
Danyang Zhang
Lili Zhao
Jiwen Lu
Jie Zhou
119
315
0
07 Mar 2019
Graph-RISE: Graph-Regularized Image Semantic Embedding
Da-Cheng Juan
Chun-Ta Lu
Zerui Li
Futang Peng
Aleksei Timofeev
Yi-Ting Chen
Yaxi Gao
Tom Duerig
Andrew Tomkins
Sujith Ravi
71
40
0
14 Feb 2019
Audio-Visual Scene-Aware Dialog
Huda AlAmri
Vincent Cartillier
Abhishek Das
Jue Wang
A. Cherian
...
Tim K. Marks
Chiori Hori
Peter Anderson
Stefan Lee
Devi Parikh
VGen
52
192
0
25 Jan 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
1.7K
94,770
0
11 Oct 2018
MultiWOZ -- A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling
Paweł Budzianowski
Tsung-Hsien Wen
Bo-Hsiang Tseng
I. Casanueva
Stefan Ultes
Osman Ramadan
Milica Gasic
160
1,315
0
29 Sep 2018
Neural Approaches to Conversational AI
Jianfeng Gao
Michel Galley
Lihong Li
80
673
0
21 Sep 2018
Visual Coreference Resolution in Visual Dialog using Neural Module Networks
Satwik Kottur
José M. F. Moura
Devi Parikh
Dhruv Batra
Marcus Rohrbach
54
165
0
06 Sep 2018
End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features
Chiori Hori
Huda AlAmri
Jue Wang
Gordon Wichern
Takaaki Hori
...
Raphael Gontijo-Lopes
Abhishek Das
Irfan Essa
Dhruv Batra
Devi Parikh
VGen
64
125
0
21 Jun 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
Antoine Miech
Ivan Laptev
Josef Sivic
65
234
0
07 Apr 2018
Motion-Appearance Co-Memory Networks for Video Question Answering
J. Gao
Runzhou Ge
Kan Chen
Ram Nevatia
113
241
0
29 Mar 2018
Neural Baby Talk
Jiasen Lu
Jianwei Yang
Dhruv Batra
Devi Parikh
VLM
230
435
0
27 Mar 2018
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Patrick Murphy
3DH
142
1,329
0
13 Dec 2017
From Lifestyle Vlogs to Everyday Interactions
David Fouhey
Weicheng Kuo
Alexei A. Efros
Jitendra Malik
60
125
0
06 Dec 2017
Visual Reference Resolution using Attention Memory for Visual Dialog
Paul Hongsuck Seo
Andreas M. Lehrmann
Bohyung Han
Leonid Sigal
59
123
0
23 Sep 2017
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
121
4,215
0
25 Jul 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
692
131,526
0
12 Jun 2017
Weakly-supervised Visual Grounding of Phrases with Linguistic Structures
Fanyi Xiao
Leonid Sigal
Yong Jae Lee
63
139
0
03 May 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
Y. Jang
Yale Song
Youngjae Yu
Youngjin Kim
Gunhee Kim
75
555
0
14 Apr 2017
1
2
Next