Look Before you Speak: Visually Contextualized Utterances

10 December 2020

Papers citing "Look Before you Speak: Visually Contextualized Utterances"

MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition Stefan Gerd Fritsch Cennet Oğuz Vitor Fortes Rey L. Ray Maximilian Kiefer-Emmanouilidis Paul Lukowicz HAI 72 0 0 06 Jun 2024
Support-set bottlenecks for video-text representation learning Mandela Patrick Po-Yao (Bernie) Huang Yuki M. Asano Florian Metze Alexander G. Hauptmann João Henriques Andrea Vedaldi 78 248 0 06 Oct 2020
Multi-modal Transformer for Video Retrieval Valentin Gabeur Chen Sun Alahari Karteek Cordelia Schmid ViT 531 608 0 21 Jul 2020
Temporal Aggregate Representations for Long-Range Video Understanding Fadime Sener Dipika Singhania Angela Yao AI4TS 43 7 0 01 Jun 2020
Condensed Movies: Story Based Retrieval with Contextual Embeddings Max Bain Arsha Nagrani A. Brown Andrew Zisserman 93 101 0 08 May 2020
Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video Antonino Furnari G. Farinella EgoV 45 141 0 04 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training Linjie Li Yen-Chun Chen Yu Cheng Zhe Gan Licheng Yu Jingjing Liu MLLM VLM OffRL AI4TS 106 503 0 01 May 2020
Local-Global Video-Text Interactions for Temporal Grounding Jonghwan Mun Minsu Cho Bohyung Han 66 269 0 16 Apr 2020
Deep Multimodal Feature Encoding for Video Ordering Vivek Sharma Makarand Tapaswi Rainer Stiefelhagen 58 10 0 05 Apr 2020
DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator Hwanhee Lee Seunghyun Yoon Franck Dernoncourt Doo Soon Kim Trung Bui Kyomin Jung 53 15 0 01 Apr 2020
Speech2Action: Cross-modal Supervision for Action Recognition Arsha Nagrani Chen Sun David A. Ross Rahul Sukthankar Cordelia Schmid Andrew Zisserman 59 54 0 30 Mar 2020
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning Elad Amrani Rami Ben-Ari Daniel Rotman A. Bronstein 72 123 0 06 Mar 2020
Hierarchical Conditional Relation Networks for Video Question Answering T. Le Vuong Le Svetha Venkatesh T. Tran 77 259 0 25 Feb 2020
Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge Hung Le Nancy F. Chen 40 9 0 25 Feb 2020
Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog Zekang Li Zongjia Li Jinchao Zhang Yang Feng Cheng Niu Jie Zhou 119 37 0 01 Feb 2020
Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System Yun-Wei Chu Kuan-Yen Lin Chao-Chun Hsu Lun-Wei Ku 100 22 0 17 Jan 2020
End-to-End Learning of Visual Representations from Uncurated Instructional Videos Antoine Miech Jean-Baptiste Alayrac Lucas Smaira Ivan Laptev Josef Sivic Andrew Zisserman VGen SSL 116 711 0 13 Dec 2019
Reinforcing an Image Caption Generator Using Off-Line Human Feedback Paul Hongsuck Seo Piyush Sharma Tomer Levinboim Bohyung Han Radu Soricut OffRL 52 22 0 21 Nov 2019
The Eighth Dialog System Technology Challenge Seokhwan Kim Michel Galley Chulaka Gunasekara Sungjin Lee Adam Atkinson ... Tim K. Marks Abhinav Rastogi Xiaoxue Zang Srinivas Sunkara Raghav Gupta VLM 60 65 0 14 Nov 2019
Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis Zhongkai Sun P. Sarma W. Sethares Yingyu Liang 52 325 0 13 Nov 2019
UNITER: UNiversal Image-TExt Representation Learning Yen-Chun Chen Linjie Li Licheng Yu Ahmed El Kholy Faisal Ahmed Zhe Gan Yu Cheng Jingjing Liu VLM OT 107 447 0 25 Sep 2019
Re-ID Driven Localization Refinement for Person Search Chuchu Han Jiacheng Ye Mingliang Xu Xin Tan Chi Zhang Changxin Gao Nong Sang 43 121 0 18 Sep 2019
Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering Soravit Changpinyo Bo Pang Piyush Sharma Radu Soricut ObjD 51 20 0 04 Sep 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks Jiasen Lu Dhruv Batra Devi Parikh Stefan Lee SSL VLM 226 3,678 0 06 Aug 2019
Use What You Have: Video Retrieval Using Representations From Collaborative Experts Yang Liu Samuel Albanie Arsha Nagrani Andrew Zisserman 76 389 0 31 Jul 2019
Identifying Visible Actions in Lifestyle Vlogs Oana Ignat Laura Burdick Jia Deng Rada Mihalcea 32 14 0 10 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips Antoine Miech Dimitri Zhukov Jean-Baptiste Alayrac Makarand Tapaswi Ivan Laptev Josef Sivic VGen 110 1,200 0 07 Jun 2019
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering Zhou Yu D. Xu Jun-chen Yu Ting Yu Zhou Zhao Yueting Zhuang Dacheng Tao 107 463 0 06 Jun 2019
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering Chenyou Fan Xiaofan Zhang Shu Zhang Wensheng Wang Chi Zhang Heng-Chiao Huang 49 278 0 08 Apr 2019
Streamlined Dense Video Captioning Jonghwan Mun L. Yang Zhou Ren N. Xu Bohyung Han 51 140 0 08 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning Chen Sun Austin Myers Carl Vondrick Kevin Patrick Murphy Cordelia Schmid VLM SSL 77 1,246 0 03 Apr 2019
Cross-task weakly supervised learning from instructional videos Dimitri Zhukov Jean-Baptiste Alayrac R. G. Cinbis David Fouhey Ivan Laptev Josef Sivic SSL 115 249 0 19 Mar 2019
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis Yansong Tang Dajun Ding Yongming Rao Yu Zheng Danyang Zhang Lili Zhao Jiwen Lu Jie Zhou 119 315 0 07 Mar 2019
Graph-RISE: Graph-Regularized Image Semantic Embedding Da-Cheng Juan Chun-Ta Lu Zerui Li Futang Peng Aleksei Timofeev Yi-Ting Chen Yaxi Gao Tom Duerig Andrew Tomkins Sujith Ravi 71 40 0 14 Feb 2019
Audio-Visual Scene-Aware Dialog Huda AlAmri Vincent Cartillier Abhishek Das Jue Wang A. Cherian ... Tim K. Marks Chiori Hori Peter Anderson Stefan Lee Devi Parikh VGen 52 192 0 25 Jan 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova VLM SSL SSeg 1.7K 94,770 0 11 Oct 2018
MultiWOZ -- A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling Paweł Budzianowski Tsung-Hsien Wen Bo-Hsiang Tseng I. Casanueva Stefan Ultes Osman Ramadan Milica Gasic 160 1,315 0 29 Sep 2018
Neural Approaches to Conversational AI Jianfeng Gao Michel Galley Lihong Li 80 673 0 21 Sep 2018
Visual Coreference Resolution in Visual Dialog using Neural Module Networks Satwik Kottur José M. F. Moura Devi Parikh Dhruv Batra Marcus Rohrbach 54 165 0 06 Sep 2018
End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features Chiori Hori Huda AlAmri Jue Wang Gordon Wichern Takaaki Hori ... Raphael Gontijo-Lopes Abhishek Das Irfan Essa Dhruv Batra Devi Parikh VGen 64 125 0 21 Jun 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data Antoine Miech Ivan Laptev Josef Sivic 65 234 0 07 Apr 2018
Motion-Appearance Co-Memory Networks for Video Question Answering J. Gao Runzhou Ge Kan Chen Ram Nevatia 113 241 0 29 Mar 2018
Neural Baby Talk Jiasen Lu Jianwei Yang Dhruv Batra Devi Parikh VLM 230 435 0 27 Mar 2018
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification Saining Xie Chen Sun Jonathan Huang Zhuowen Tu Kevin Patrick Murphy 3DH 142 1,329 0 13 Dec 2017
From Lifestyle Vlogs to Everyday Interactions David Fouhey Weicheng Kuo Alexei A. Efros Jitendra Malik 60 125 0 06 Dec 2017
Visual Reference Resolution using Attention Memory for Visual Dialog Paul Hongsuck Seo Andreas M. Lehrmann Bohyung Han Leonid Sigal 59 123 0 23 Sep 2017
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould Lei Zhang AIMat 121 4,215 0 25 Jul 2017
Attention Is All You Need Ashish Vaswani Noam M. Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan Gomez Lukasz Kaiser Illia Polosukhin 3DV 692 131,526 0 12 Jun 2017
Weakly-supervised Visual Grounding of Phrases with Linguistic Structures Fanyi Xiao Leonid Sigal Yong Jae Lee 63 139 0 03 May 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering Y. Jang Yale Song Youngjae Yu Youngjin Kim Gunhee Kim 75 555 0 14 Apr 2017