Cascaded Multilingual Audio-Visual Learning from Videos


8 November 2021
Andrew Rouditchenko
Angie Boggust
David Harwath
Samuel Thomas
Hilde Kuehne
Brian Chen
Yikang Shen
Rogerio Feris
Brian Kingsbury
M. Picheny
James R. Glass

Papers citing "Cascaded Multilingual Audio-Visual Learning from Videos"

27 / 27 papers shown
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort
SouYoung Jin
Alexander H. Liu
David Harwath
Rogerio Feris
James Glass
Aude Oliva
38
59
0
10 May 2021
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen
Andrew Rouditchenko
Kevin Duarte
Hilde Kuehne
Samuel Thomas
...
Rogerio Feris
David Harwath
James R. Glass
M. Picheny
Shih-Fu Chang
SSL
39
91
0
26 Apr 2021
Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval
Ramon Sanabria
Austin Waters
Jason Baldridge
3DV
39
25
0
05 Apr 2021
QuerYD: A video dataset with high-quality text and audio narrations
Andreea-Maria Oncescu
João F. Henriques
Yang Liu
Andrew Zisserman
Samuel Albanie
VGen
43
11
0
22 Nov 2020
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
117
374
0
29 Jun 2020
Unsupervised Cross-lingual Representation Learning for Speech Recognition
Alexis Conneau
Alexei Baevski
R. Collobert
Abdel-rahman Mohamed
Michael Auli
SSL
135
778
0
24 Jun 2020
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko
Angie Boggust
David Harwath
Brian Chen
D. Joshi
...
Rogerio Feris
Brian Kingsbury
M. Picheny
Antonio Torralba
James R. Glass
SSL
57
142
0
16 Jun 2020
Visual Grounding in Video for Unsupervised Word Translation
Gunnar Sigurdsson
Jean-Baptiste Alayrac
Aida Nematzadeh
Lucas Smaira
Mateusz Malinowski
João Carreira
Phil Blunsom
Andrew Zisserman
VGen
64
50
0
11 Mar 2020
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
David Harwath
Wei-Ning Hsu
James R. Glass
69
84
0
21 Nov 2019
Large-scale representation learning from visually grounded untranscribed speech
Gabriel Ilharco
Yuan Zhang
Jason Baldridge
SSL
66
60
0
19 Sep 2019
Language learning using Speech to Image retrieval
Danny Merkx
S. Frank
M. Ernestus
41
43
0
09 Sep 2019
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
Michael Wray
Diane Larlus
G. Csurka
Dima Damen
81
152
0
09 Aug 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
105
1,199
0
07 Jun 2019
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Xin Eric Wang
Jiawei Wu
Junkun Chen
Lei Li
Yuan-fang Wang
William Yang Wang
93
549
0
06 Apr 2019
Models of Visually Grounded Speech Signal Pay Attention To Nouns: a Bilingual Experiment on English and Japanese
William N. Havard
Jean-Pierre Chevrot
Laurent Besacier
45
24
0
08 Feb 2019
Learning from Multiview Correlations in Open-Domain Videos
Nils Holzenberger
Shruti Palaskar
Pranava Madhyastha
Florian Metze
R. Arora
SSL
46
11
0
21 Nov 2018
How2: A Large-scale Dataset for Multimodal Language Understanding
Ramon Sanabria
Ozan Caglayan
Shruti Palaskar
Desmond Elliott
Loïc Barrault
Lucia Specia
Florian Metze
VGen
MLLM
81
288
0
01 Nov 2018
Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling
Jaejin Cho
M. Baskar
Ruizhi Li
Sanjeev Khudanpur
Sri Harish Reddy Mallidi
Nelson Yalta
M. Karafiát
Shinji Watanabe
Takaaki Hori
61
122
0
04 Oct 2018
Visually grounded cross-lingual keyword spotting in speech
Herman Kamper
Michael Roth
43
34
0
13 Jun 2018
Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
David Harwath
Galen Chuang
James R. Glass
55
58
0
09 Apr 2018
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath
Adrià Recasens
Dídac Surís
Galen Chuang
Antonio Torralba
James R. Glass
68
201
0
04 Apr 2018
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Kensho Hara
Hirokatsu Kataoka
Y. Satoh
3DPC
118
1,931
0
27 Nov 2017
Towards Automatic Learning of Procedures from Web Instructional Videos
Luowei Zhou
Chenliang Xu
Jason J. Corso
EgoV
72
825
0
28 Mar 2017
Representations of language in a model of visually grounded speech signal
Grzegorz Chrupała
Lieke Gelderloos
Afra Alishahi
73
131
0
07 Feb 2017
Learning Word-Like Units from Joint Audio-Visual Analysis
David Harwath
James R. Glass
68
106
0
25 Jan 2017
Deep Residual Learning for Image Recognition
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
MedIm
2.1K
193,426
0
10 Dec 2015
Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath
James R. Glass
55
157
0
11 Nov 2015