AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

16 June 2020
Andrew Rouditchenko
Angie Boggust
David Harwath
Brian Chen
D. Joshi
Samuel Thomas
Kartik Audhkhasi
Hilde Kuehne
Yikang Shen
Rogerio Feris
Brian Kingsbury
M. Picheny
Antonio Torralba
James R. Glass
    SSL

Papers citing "AVLnet: Learning Audio-Visual Language Representations from Instructional Videos"

26 / 76 papers shown
Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
O. Scharenborg
Laurent Besacier
A. Black
M. Hasegawa-Johnson
Florian Metze
...
Elin Larsen
Danny Merkx
Rachid Riad
Liming Wang
Emmanuel Dupoux
53
33
0
14 Feb 2018
Objects that Sound
Relja Arandjelović
Andrew Zisserman
ObjD
VOS
70
529
0
18 Dec 2017
Learning Modality-Invariant Representations for Speech and Images
K. Leidal
David Harwath
James R. Glass
SSL
41
29
0
11 Dec 2017
Visual to Sound: Generating Natural Sound for Videos in the Wild
Yipin Zhou
Zhaowen Wang
Chen Fang
Trung Bui
Tamara L. Berg
VGen
49
206
0
04 Dec 2017
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Kensho Hara
Hirokatsu Kataoka
Y. Satoh
3DPC
109
1,926
0
27 Nov 2017
Semantic speech retrieval with a visually grounded model of untranscribed speech
Herman Kamper
Gregory Shakhnarovich
Karen Livescu
55
53
0
05 Oct 2017
Unsupervised Representation Learning by Sorting Sequences
Hsin-Ying Lee
Jia-Bin Huang
Maneesh Kumar Singh
Ming-Hsuan Yang
SSL
DRL
52
534
0
03 Aug 2017
SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set
William N. Havard
Laurent Besacier
O. Rosec
32
28
0
26 Jul 2017
Learnable pooling with Context Gating for video classification
Antoine Miech
Ivan Laptev
Josef Sivic
47
327
0
21 Jun 2017
Look, Listen and Learn
Relja Arandjelović
Andrew Zisserman
SSL
80
900
0
23 May 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
199
7,961
0
22 May 2017
Towards Automatic Learning of Procedures from Web Instructional Videos
Luowei Zhou
Chenliang Xu
Jason J. Corso
EgoV
59
819
0
28 Mar 2017
Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos
De-An Huang
Joseph J. Lim
Li Fei-Fei
Juan Carlos Niebles
42
56
0
07 Mar 2017
Representations of language in a model of visually grounded speech signal
Grzegorz Chrupała
Lieke Gelderloos
Afra Alishahi
68
131
0
07 Feb 2017
Learning Word-Like Units from Joint Audio-Visual Analysis
David Harwath
James R. Glass
46
106
0
25 Jan 2017
Self-Supervised Video Representation Learning With Odd-One-Out Networks
Basura Fernando
Hakan Bilen
E. Gavves
Stephen Gould
SSL
37
450
0
21 Nov 2016
SoundNet: Learning Sound Representations from Unlabeled Video
Y. Aytar
Carl Vondrick
Antonio Torralba
SSL
85
1,040
0
27 Oct 2016
Movie Description
Anna Rohrbach
Atousa Torabi
Marcus Rohrbach
Niket Tandon
C. Pal
Hugo Larochelle
Aaron Courville
Bernt Schiele
3DV
VGen
58
355
0
12 May 2016
Exploring the Limits of Language Modeling
Rafal Jozefowicz
Oriol Vinyals
M. Schuster
Noam M. Shazeer
Yonghui Wu
120
1,143
0
07 Feb 2016
Visually Indicated Sounds
Andrew Owens
Phillip Isola
Josh H. McDermott
Antonio Torralba
Edward H. Adelson
William T. Freeman
74
382
0
28 Dec 2015
Deep Residual Learning for Image Recognition
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
MedIm
1.4K
192,638
0
10 Dec 2015
Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath
James R. Glass
32
156
0
11 Nov 2015
Unsupervised Learning from Narrated Instruction Videos
Jean-Baptiste Alayrac
Piotr Bojanowski
Nishant Agrawal
Josef Sivic
Ivan Laptev
Simon Lacoste-Julien
SSL
63
289
0
30 Jun 2015
Adam: A Method for Stochastic Optimization
Diederik P. Kingma
Jimmy Ba
ODL
813
149,474
0
22 Dec 2014
Object Detectors Emerge in Deep Scene CNNs
Bolei Zhou
A. Khosla
Àgata Lapedriza
A. Oliva
Antonio Torralba
ObjD
120
1,279
0
22 Dec 2014
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
550
31,406
0
16 Jan 2013