Word Discovery in Visually Grounded, Self-Supervised Speech Models

28 March 2022

Puyuan Peng

David Harwath

SSL

Papers citing "Word Discovery in Visually Grounded, Self-Supervised Speech Models"

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation Ziqiao Ma Jing Ding Xuejun Zhang Dezhi Luo Jiahe Ding Sihan Xu Yuchen Huang Run Peng Joyce Chai 216 0 0 22 Apr 2025
Towards Unsupervised Speech Recognition Without Pronunciation Models Junrui Ni Liming Wang Yang Zhang Kaizhi Qian Heting Gao Mark Hasegawa-Johnson Chang D. Yoo SSL OffRL 146 0 0 10 Jan 2025
Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming Simon Malan Benjamin van Niekerk Herman Kamper 105 0 0 22 Sep 2024
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT Cheol Jun Cho Abdelrahman Mohamed Shang-Wen Li Alan W. Black Gopala K. Anumanchipalli 103 9 0 16 Oct 2023
Learning English with Peppa Pig Mitja Nikolaus Afra Alishahi Grzegorz Chrupała 60 14 0 25 Feb 2022
Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring Herman Kamper 91 26 0 24 Feb 2022
Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling Puyuan Peng David Harwath SSL 87 26 0 07 Feb 2022
Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation Khazar Khorrami Okko Räsänen 77 20 0 29 Sep 2021
Fast-Slow Transformer for Visually Grounding Speech Puyuan Peng David Harwath 139 30 0 16 Sep 2021
Layer-wise Analysis of a Self-supervised Speech Representation Model Ankita Pasad Ju-Chieh Chou Karen Livescu SSL 88 308 0 10 Jul 2021
Attention-Based Keyword Localisation in Speech using Visual Grounding Kayode Olaleye Herman Kamper 61 13 0 16 Jun 2021
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units Wei-Ning Hsu Benjamin Bolte Yao-Hung Hubert Tsai Kushal Lakhotia Ruslan Salakhutdinov Abdel-rahman Mohamed SSL 188 3,004 0 14 Jun 2021
Cross-Modal Discrete Representation Learning Alexander H. Liu SouYoung Jin Cheng-I Jeff Lai Andrew Rouditchenko A. Oliva James R. Glass SSL 73 41 0 10 Jun 2021
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation Saurabhchand Bhati Jesús Villalba Piotr Żelasko Laureano Moro-Velazquez Najim Dehak SSL 78 37 0 03 Jun 2021
Emerging Properties in Self-Supervised Vision Transformers Mathilde Caron Hugo Touvron Ishan Misra Hervé Jégou Julien Mairal Piotr Bojanowski Armand Joulin 743 6,139 0 29 Apr 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units Wei-Ning Hsu David Harwath Christopher Song James R. Glass CLIP 85 67 0 31 Dec 2020
Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks Herman Kamper Benjamin van Niekerk SSL MQ 86 36 0 14 Dec 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai ... Matthias Minderer G. Heigold Sylvain Gelly Jakob Uszkoreit N. Houlsby ViT 684 41,563 0 22 Oct 2020
Unsupervised Discovery of Recurring Speech Patterns Using Probabilistic Adaptive Metrics Okko Räsänen María Andrea Cruz Blandón 73 25 0 03 Aug 2020
Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery Saurabhchand Bhati Jesús Villalba Piotr Żelasko Najim Dehak SSL 79 16 0 26 Jul 2020
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations Alexei Baevski Henry Zhou Abdel-rahman Mohamed Michael Auli SSL 303 5,853 0 20 Jun 2020
Learning to Recognise Words using Visually Grounded Speech Sebastiaan Scholten Danny Merkx O. Scharenborg 57 13 0 31 May 2020
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech David Harwath Wei-Ning Hsu James R. Glass 86 85 0 21 Nov 2019
Large-scale representation learning from visually grounded untranscribed speech Gabriel Ilharco Yuan Zhang Jason Baldridge SSL 79 61 0 19 Sep 2019
Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech William N. Havard Jean-Pierre Chevrot Laurent Besacier 60 21 0 18 Sep 2019
Transfer Learning from Audio-Visual Grounding to Speech Recognition Wei-Ning Hsu David Harwath James R. Glass SSL 59 32 0 09 Jul 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova VLM SSL SSeg 1.8K 95,324 0 11 Oct 2018
Representation Learning with Contrastive Predictive Coding Aaron van den Oord Yazhe Li Oriol Vinyals DRL SSL 356 10,369 0 10 Jul 2018
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input David Harwath Adrià Recasens Dídac Surís Galen Chuang Antonio Torralba James R. Glass 104 201 0 04 Apr 2018
The Zero Resource Speech Challenge 2017 Maarten Versteegh Xuan-Nga Cao Roland Thiollière Thomas Schatz Mathieu Bernard A. Jansen Xavier Anguera Miró Emmanuel Dupoux 81 204 0 12 Dec 2017
Attention Is All You Need Ashish Vaswani Noam M. Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan Gomez Lukasz Kaiser Illia Polosukhin 3DV 819 132,725 0 12 Jun 2017
An embedded segmental K-means model for unsupervised segmentation and clustering of speech Herman Kamper Karen Livescu Sharon Goldwater 60 98 0 23 Mar 2017
Cognitive Science in the era of Artificial Intelligence: A roadmap for reverse-engineering the infant language-learner Emmanuel Dupoux 77 158 0 29 Jul 2016
A segmental framework for fully-unsupervised large-vocabulary speech recognition Herman Kamper A. Jansen Sharon Goldwater 84 104 0 22 Jun 2016
Deep Multimodal Semantic Embeddings for Speech and Images David Harwath James R. Glass 73 157 0 11 Nov 2015
Deep Visual-Semantic Alignments for Generating Image Descriptions A. Karpathy Li Fei-Fei 154 5,595 0 07 Dec 2014
Microsoft COCO: Common Objects in Context Tsung-Yi Lin Michael Maire Serge J. Belongie Lubomir Bourdev Ross B. Girshick James Hays Pietro Perona Deva Ramanan C. L. Zitnick Piotr Dollár ObjD 442 43,875 0 01 May 2014