ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2203.15081
  4. Cited By
Word Discovery in Visually Grounded, Self-Supervised Speech Models
v1v2v3v4v5 (latest)

Word Discovery in Visually Grounded, Self-Supervised Speech Models

28 March 2022
Puyuan Peng
David Harwath
    SSL
ArXiv (abs)PDFHTMLGithub (26★)

Papers citing "Word Discovery in Visually Grounded, Self-Supervised Speech Models"

37 / 37 papers shown
Title
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Ziqiao Ma
Jing Ding
Xuejun Zhang
Dezhi Luo
Jiahe Ding
Sihan Xu
Yuchen Huang
Run Peng
Joyce Chai
216
0
0
22 Apr 2025
Towards Unsupervised Speech Recognition Without Pronunciation Models
Towards Unsupervised Speech Recognition Without Pronunciation Models
Junrui Ni
Liming Wang
Yang Zhang
Kaizhi Qian
Heting Gao
Mark Hasegawa-Johnson
Chang D. Yoo
SSLOffRL
146
0
0
10 Jan 2025
Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming
Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming
Simon Malan
Benjamin van Niekerk
Herman Kamper
105
0
0
22 Sep 2024
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
Cheol Jun Cho
Abdelrahman Mohamed
Shang-Wen Li
Alan W. Black
Gopala K. Anumanchipalli
103
9
0
16 Oct 2023
Learning English with Peppa Pig
Learning English with Peppa Pig
Mitja Nikolaus
Afra Alishahi
Grzegorz Chrupała
60
14
0
25 Feb 2022
Word Segmentation on Discovered Phone Units with Dynamic Programming and
  Self-Supervised Scoring
Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring
Herman Kamper
91
26
0
24 Feb 2022
Self-Supervised Representation Learning for Speech Using Visual
  Grounding and Masked Language Modeling
Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
Puyuan Peng
David Harwath
SSL
87
26
0
07 Feb 2022
Can phones, syllables, and words emerge as side-products of
  cross-situational audiovisual learning? -- A computational investigation
Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation
Khazar Khorrami
Okko Räsänen
77
20
0
29 Sep 2021
Fast-Slow Transformer for Visually Grounding Speech
Fast-Slow Transformer for Visually Grounding Speech
Puyuan Peng
David Harwath
139
30
0
16 Sep 2021
Layer-wise Analysis of a Self-supervised Speech Representation Model
Layer-wise Analysis of a Self-supervised Speech Representation Model
Ankita Pasad
Ju-Chieh Chou
Karen Livescu
SSL
88
308
0
10 Jul 2021
Attention-Based Keyword Localisation in Speech using Visual Grounding
Attention-Based Keyword Localisation in Speech using Visual Grounding
Kayode Olaleye
Herman Kamper
61
13
0
16 Jun 2021
HuBERT: Self-Supervised Speech Representation Learning by Masked
  Prediction of Hidden Units
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Wei-Ning Hsu
Benjamin Bolte
Yao-Hung Hubert Tsai
Kushal Lakhotia
Ruslan Salakhutdinov
Abdel-rahman Mohamed
SSL
188
3,004
0
14 Jun 2021
Cross-Modal Discrete Representation Learning
Cross-Modal Discrete Representation Learning
Alexander H. Liu
SouYoung Jin
Cheng-I Jeff Lai
Andrew Rouditchenko
A. Oliva
James R. Glass
SSL
73
41
0
10 Jun 2021
Segmental Contrastive Predictive Coding for Unsupervised Word
  Segmentation
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation
Saurabhchand Bhati
Jesús Villalba
Piotr Żelasko
Laureano Moro-Velazquez
Najim Dehak
SSL
78
37
0
03 Jun 2021
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron
Hugo Touvron
Ishan Misra
Hervé Jégou
Julien Mairal
Piotr Bojanowski
Armand Joulin
743
6,139
0
29 Apr 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Wei-Ning Hsu
David Harwath
Christopher Song
James R. Glass
CLIP
85
67
0
31 Dec 2020
Towards unsupervised phone and word segmentation using self-supervised
  vector-quantized neural networks
Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks
Herman Kamper
Benjamin van Niekerk
SSLMQ
86
36
0
14 Dec 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at
  Scale
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
684
41,563
0
22 Oct 2020
Unsupervised Discovery of Recurring Speech Patterns Using Probabilistic
  Adaptive Metrics
Unsupervised Discovery of Recurring Speech Patterns Using Probabilistic Adaptive Metrics
Okko Räsänen
María Andrea Cruz Blandón
73
25
0
03 Aug 2020
Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery
Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery
Saurabhchand Bhati
Jesús Villalba
Piotr Żelasko
Najim Dehak
SSL
79
16
0
26 Jul 2020
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
  Representations
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Alexei Baevski
Henry Zhou
Abdel-rahman Mohamed
Michael Auli
SSL
303
5,853
0
20 Jun 2020
Learning to Recognise Words using Visually Grounded Speech
Learning to Recognise Words using Visually Grounded Speech
Sebastiaan Scholten
Danny Merkx
O. Scharenborg
57
13
0
31 May 2020
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded
  Speech
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
David Harwath
Wei-Ning Hsu
James R. Glass
86
85
0
21 Nov 2019
Large-scale representation learning from visually grounded untranscribed
  speech
Large-scale representation learning from visually grounded untranscribed speech
Gabriel Ilharco
Yuan Zhang
Jason Baldridge
SSL
79
61
0
19 Sep 2019
Word Recognition, Competition, and Activation in a Model of Visually
  Grounded Speech
Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
William N. Havard
Jean-Pierre Chevrot
Laurent Besacier
60
21
0
18 Sep 2019
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Wei-Ning Hsu
David Harwath
James R. Glass
SSL
59
32
0
09 Jul 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language
  Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLMSSLSSeg
1.8K
95,324
0
11 Oct 2018
Representation Learning with Contrastive Predictive Coding
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord
Yazhe Li
Oriol Vinyals
DRLSSL
356
10,369
0
10 Jul 2018
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory
  Input
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath
Adrià Recasens
Dídac Surís
Galen Chuang
Antonio Torralba
James R. Glass
104
201
0
04 Apr 2018
The Zero Resource Speech Challenge 2017
The Zero Resource Speech Challenge 2017
Maarten Versteegh
Xuan-Nga Cao
Roland Thiollière
Thomas Schatz
Mathieu Bernard
A. Jansen
Xavier Anguera Miró
Emmanuel Dupoux
81
204
0
12 Dec 2017
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
819
132,725
0
12 Jun 2017
An embedded segmental K-means model for unsupervised segmentation and
  clustering of speech
An embedded segmental K-means model for unsupervised segmentation and clustering of speech
Herman Kamper
Karen Livescu
Sharon Goldwater
60
98
0
23 Mar 2017
Cognitive Science in the era of Artificial Intelligence: A roadmap for
  reverse-engineering the infant language-learner
Cognitive Science in the era of Artificial Intelligence: A roadmap for reverse-engineering the infant language-learner
Emmanuel Dupoux
77
158
0
29 Jul 2016
A segmental framework for fully-unsupervised large-vocabulary speech
  recognition
A segmental framework for fully-unsupervised large-vocabulary speech recognition
Herman Kamper
A. Jansen
Sharon Goldwater
84
104
0
22 Jun 2016
Deep Multimodal Semantic Embeddings for Speech and Images
Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath
James R. Glass
73
157
0
11 Nov 2015
Deep Visual-Semantic Alignments for Generating Image Descriptions
Deep Visual-Semantic Alignments for Generating Image Descriptions
A. Karpathy
Li Fei-Fei
154
5,595
0
07 Dec 2014
Microsoft COCO: Common Objects in Context
Microsoft COCO: Common Objects in Context
Nayeon Lee
Michael Maire
Serge J. Belongie
Lubomir Bourdev
Ross B. Girshick
James Hays
Pietro Perona
Deva Ramanan
C. L. Zitnick
Piotr Dollár
ObjD
442
43,875
0
01 May 2014
1