Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop O. Scharenborg Laurent Besacier A. Black M. Hasegawa-Johnson Florian Metze ... Elin Larsen Danny Merkx Rachid Riad Liming Wang Emmanuel Dupoux 53 33 0 14 Feb 2018
Objects that Sound Relja Arandjelović Andrew Zisserman ObjD VOS 70 529 0 18 Dec 2017
Learning Modality-Invariant Representations for Speech and Images K. Leidal David Harwath James R. Glass SSL 41 29 0 11 Dec 2017
Visual to Sound: Generating Natural Sound for Videos in the Wild Yipin Zhou Zhaowen Wang Chen Fang Trung Bui Tamara L. Berg VGen 49 206 0 04 Dec 2017
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? Kensho Hara Hirokatsu Kataoka Y. Satoh 3DPC 109 1,926 0 27 Nov 2017
Semantic speech retrieval with a visually grounded model of untranscribed speech Herman Kamper Gregory Shakhnarovich Karen Livescu 55 53 0 05 Oct 2017
Unsupervised Representation Learning by Sorting Sequences Hsin-Ying Lee Jia-Bin Huang Maneesh Kumar Singh Ming-Hsuan Yang SSL DRL 52 534 0 03 Aug 2017
SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set William N. Havard Laurent Besacier O. Rosec 32 28 0 26 Jul 2017
Learnable pooling with Context Gating for video classification Antoine Miech Ivan Laptev Josef Sivic 47 327 0 21 Jun 2017
Look, Listen and Learn Relja Arandjelović Andrew Zisserman SSL 80 900 0 23 May 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset João Carreira Andrew Zisserman 199 7,961 0 22 May 2017
Towards Automatic Learning of Procedures from Web Instructional Videos Luowei Zhou Chenliang Xu Jason J. Corso EgoV 59 819 0 28 Mar 2017
Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos De-An Huang Joseph J. Lim Li Fei-Fei Juan Carlos Niebles 42 56 0 07 Mar 2017
Representations of language in a model of visually grounded speech signal Grzegorz Chrupała Lieke Gelderloos Afra Alishahi 68 131 0 07 Feb 2017
Learning Word-Like Units from Joint Audio-Visual Analysis David Harwath James R. Glass 46 106 0 25 Jan 2017
Self-Supervised Video Representation Learning With Odd-One-Out Networks Basura Fernando Hakan Bilen E. Gavves Stephen Gould SSL 37 450 0 21 Nov 2016
SoundNet: Learning Sound Representations from Unlabeled Video Y. Aytar Carl Vondrick Antonio Torralba SSL 85 1,040 0 27 Oct 2016
Movie Description Anna Rohrbach Atousa Torabi Marcus Rohrbach Niket Tandon C. Pal Hugo Larochelle Aaron Courville Bernt Schiele 3DV VGen 58 355 0 12 May 2016
Exploring the Limits of Language Modeling Rafal Jozefowicz Oriol Vinyals M. Schuster Noam M. Shazeer Yonghui Wu 120 1,143 0 07 Feb 2016
Visually Indicated Sounds Andrew Owens Phillip Isola Josh H. McDermott Antonio Torralba Edward H. Adelson William T. Freeman 74 382 0 28 Dec 2015
Deep Residual Learning for Image Recognition Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun MedIm 1.4K 192,638 0 10 Dec 2015
Deep Multimodal Semantic Embeddings for Speech and Images David Harwath James R. Glass 32 156 0 11 Nov 2015
Unsupervised Learning from Narrated Instruction Videos Jean-Baptiste Alayrac Piotr Bojanowski Nishant Agrawal Josef Sivic Ivan Laptev Simon Lacoste-Julien SSL 63 289 0 30 Jun 2015
Adam: A Method for Stochastic Optimization Diederik P. Kingma Jimmy Ba ODL 813 149,474 0 22 Dec 2014
Object Detectors Emerge in Deep Scene CNNs Bolei Zhou A. Khosla Àgata Lapedriza A. Oliva Antonio Torralba ObjD 120 1,279 0 22 Dec 2014
Efficient Estimation of Word Representations in Vector Space Tomas Mikolov Kai Chen G. Corrado J. Dean 3DV 550 31,406 0 16 Jan 2013