YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding

10 October 2022

Papers citing "YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding"

Title
BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus | Josh Meyer, David Ifeoluwa Adelani, Edresson Casanova, A. Oktem, Daniel Whitenack, Julian Weber, ..., Victor Akinode, Bernard Opoku, S. Olanrewaju, Jesujoba Oluwadara Alabi, Shamsuddeen Hassan Muhammad | 36 | 23 | 0 | 07 Jul 2022
Building African Voices | Perez Ogayo, Graham Neubig, A. Black | 120 | 15 | 0 | 01 Jul 2022
Keyword localisation in untranscribed speech using visually grounded speech models | Kayode Olaleye, Dan Oneaţă, Herman Kamper | 51 | 7 | 0 | 02 Feb 2022
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech | David Harwath, Wei-Ning Hsu, James R. Glass | 75 | 84 | 0 | 21 Nov 2019
On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval | Ankita Pasad, Bowen Shi, Herman Kamper, Karen Livescu | 36 | 12 | 0 | 24 Apr 2019
End-to-End Automatic Speech Translation of Audiobooks | Alexandre Berard, Laurent Besacier, A. Kocabiyikoglu, Olivier Pietquin | 112 | 193 | 0 | 12 Feb 2018
Semantic speech retrieval with a visually grounded model of untranscribed speech | Herman Kamper, Gregory Shakhnarovich, Karen Livescu | 67 | 53 | 0 | 05 Oct 2017
Sequence-to-Sequence Models Can Directly Translate Foreign Speech | Ron J. Weiss, J. Chorowski, Navdeep Jaitly, Yonghui Wu, Zhiwen Chen | 79 | 344 | 0 | 24 Mar 2017
Visually grounded learning of keyword prediction from untranscribed speech | Herman Kamper, Shane Settle, Gregory Shakhnarovich, Karen Livescu | 114 | 63 | 0 | 23 Mar 2017
Representations of language in a model of visually grounded speech signal | Grzegorz Chrupała, Lieke Gelderloos, Afra Alishahi | 75 | 131 | 0 | 07 Feb 2017
Learning Word-Like Units from Joint Audio-Visual Analysis | David Harwath, James R. Glass | 68 | 106 | 0 | 25 Jan 2017
Deep Multimodal Semantic Embeddings for Speech and Images | David Harwath, James R. Glass | 62 | 157 | 0 | 11 Nov 2015
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik | 202 | 2,071 | 0 | 19 May 2015