ResearchTrend.AI
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

16 June 2020
Andrew Rouditchenko
Angie Boggust
David Harwath
Brian Chen
D. Joshi
Samuel Thomas
Kartik Audhkhasi
Hilde Kuehne
Yikang Shen
Rogerio Feris
Brian Kingsbury
M. Picheny
Antonio Torralba
James R. Glass
    SSL

Papers citing "AVLnet: Learning Audio-Visual Language Representations from Instructional Videos"

50 / 76 papers shown
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
Hilde Kuehne
58
0
0
02 May 2025
Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
Soumya Jahagirdar
Jayasree Saha
C. V. Jawahar
70
0
0
11 Mar 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
Santhosh Malarvannan
111
0
0
20 Feb 2025
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Qifeng Chen
Yu Guo
VGen
158
16
0
06 Jun 2024
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort
SouYoung Jin
Alexander H. Liu
David Harwath
Rogerio Feris
James Glass
Aude Oliva
27
59
0
10 May 2021
Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval
Ramon Sanabria
Austin Waters
Jason Baldridge
3DV
30
25
0
05 Apr 2021
QuerYD: A video dataset with high-quality text and audio narrations
Andreea-Maria Oncescu
João F. Henriques
Yang Liu
Andrew Zisserman
Samuel Albanie
VGen
29
11
0
22 Nov 2020
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. De Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
96
373
0
29 Jun 2020
Telling Left from Right: Learning Spatial Correspondence of Sight and Sound
Karren D. Yang
Bryan C. Russell
Justin Salamon
SSL
49
75
0
11 Jun 2020
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
Elad Amrani
Rami Ben-Ari
Daniel Rotman
A. Bronstein
43
122
0
06 Mar 2020
A Simple Framework for Contrastive Learning of Visual Representations
Ting Chen
Simon Kornblith
Mohammad Norouzi
Geoffrey E. Hinton
SSL
186
18,523
0
13 Feb 2020
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Sivic
Andrew Zisserman
VGen
SSL
93
705
0
13 Dec 2019
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke
Sam Gross
Francisco Massa
Adam Lerer
James Bradbury
...
Sasank Chilamkurthy
Benoit Steiner
Lu Fang
Junjie Bai
Soumith Chintala
ODL
211
42,038
0
03 Dec 2019
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Humam Alwassel
D. Mahajan
Bruno Korbar
Lorenzo Torresani
Guohao Li
Du Tran
SSL
63
429
0
28 Nov 2019
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
David Harwath
Wei-Ning Hsu
James R. Glass
39
84
0
21 Nov 2019
Self-supervised Moving Vehicle Tracking with Stereo Sound
Chuang Gan
Hang Zhao
Peihao Chen
David D. Cox
Antonio Torralba
27
147
0
25 Oct 2019
Large-scale representation learning from visually grounded untranscribed speech
Gabriel Ilharco
Yuan Zhang
Jason Baldridge
SSL
46
60
0
19 Sep 2019
Language learning using Speech to Image retrieval
Danny Merkx
S. Frank
M. Ernestus
23
43
0
09 Sep 2019
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
Michael Wray
Diane Larlus
G. Csurka
Dima Damen
71
152
0
09 Aug 2019
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
Yang Liu
Samuel Albanie
Arsha Nagrani
Andrew Zisserman
55
387
0
31 Jul 2019
Learning Video Representations using Contrastive Bidirectional Transformer
Chen Sun
Fabien Baradel
Kevin Patrick Murphy
Cordelia Schmid
SSL
ViT
94
133
0
13 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
87
1,186
0
07 Jun 2019
Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data
Hilde Kuehne
Ahsan Iqbal
Alexander Richard
Juergen Gall
25
17
0
03 Jun 2019
Large-scale weakly-supervised pre-training for video action recognition
Deepti Ghadiyaram
Matt Feiszli
Du Tran
Xueting Yan
Heng Wang
D. Mahajan
40
299
0
02 May 2019
Self-Supervised Audio-Visual Co-Segmentation
Andrew Rouditchenko
Hang Zhao
Chuang Gan
Josh H. McDermott
Antonio Torralba
VLM
SSL
30
103
0
18 Apr 2019
Co-Separating Sounds of Visual Objects
Ruohan Gao
Kristen Grauman
97
208
0
16 Apr 2019
Semantic query-by-example speech search using visual grounding
Herman Kamper
Aristotelis Anastassiou
Karen Livescu
41
29
0
15 Apr 2019
The Sound of Motions
Hang Zhao
Chuang Gan
Wei-Chiu Ma
Antonio Torralba
51
252
0
11 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
37
1,238
0
03 Apr 2019
Cross-task weakly supervised learning from instructional videos
Dimitri Zhukov
Jean-Baptiste Alayrac
R. G. Cinbis
David Fouhey
Ivan Laptev
Josef Sivic
SSL
103
245
0
19 Mar 2019
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
Yansong Tang
Dajun Ding
Yongming Rao
Yu Zheng
Danyang Zhang
Lili Zhao
Jiwen Lu
Jie Zhou
96
308
0
07 Mar 2019
2.5D Visual Sound
Ruohan Gao
Kristen Grauman
VGen
97
130
0
11 Dec 2018
Learning from Multiview Correlations in Open-Domain Videos
Nils Holzenberger
Shruti Palaskar
Pranava Madhyastha
Florian Metze
R. Arora
SSL
38
11
0
21 Nov 2018
Multimodal One-Shot Learning of Speech and Images
Ryan Eloff
H. Engelbrecht
Herman Kamper
SSL
VLM
27
35
0
09 Nov 2018
How2: A Large-scale Dataset for Multimodal Language Understanding
Ramon Sanabria
Ozan Caglayan
Shruti Palaskar
Desmond Elliott
Loïc Barrault
Lucia Specia
Florian Metze
VGen
MLLM
57
287
0
01 Nov 2018
Self-Supervised Generation of Spatial Audio for 360 Video
Pedro Morgado
Nuno Vasconcelos
Timothy R. Langlois
Oliver Wang
MDE
42
171
0
07 Sep 2018
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
Youngjae Yu
Jongseok Kim
Gunhee Kim
59
343
0
07 Aug 2018
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord
Yazhe Li
Oriol Vinyals
DRL
SSL
215
10,152
0
10 Jul 2018
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
Bruno Korbar
Du Tran
Lorenzo Torresani
67
473
0
30 Jun 2018
Visually grounded cross-lingual keyword spotting in speech
Herman Kamper
Michael Roth
34
34
0
13 Jun 2018
Disentangling by Partitioning: A Representation Learning Framework for Multimodal Sensory Data
Wei-Ning Hsu
James R. Glass
DRL
56
43
0
29 May 2018
On Learning Associations of Faces and Voices
Changil Kim
Hijung Valentina Shin
Tae-Hyun Oh
Alexandre Kaspar
Mohamed A. Elgharib
Wojciech Matusik
CVBM
29
83
0
15 May 2018
Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction
Luowei Zhou
Nathan Louis
Jason J. Corso
57
94
0
08 May 2018
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Andrew Owens
Alexei A. Efros
SSL
69
747
0
10 Apr 2018
The Sound of Pixels
Hang Zhao
Chuang Gan
Andrew Rouditchenko
Carl Vondrick
Josh H. McDermott
Antonio Torralba
VLM
58
532
0
09 Apr 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
Antoine Miech
Ivan Laptev
Josef Sivic
41
233
0
07 Apr 2018
Learning to Separate Object Sounds by Watching Unlabeled Video
Ruohan Gao
Rogerio Feris
Kristen Grauman
SSL
42
284
0
05 Apr 2018
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath
Adrià Recasens
Dídac Surís
Galen Chuang
Antonio Torralba
James R. Glass
59
200
0
04 Apr 2018
Seeing Voices and Hearing Faces: Cross-modal biometric matching
Arsha Nagrani
Samuel Albanie
Andrew Zisserman
CVBM
46
220
0
01 Apr 2018
Unsupervised Learning and Segmentation of Complex Activities from Video
Fadime Sener
Angela Yao
39
112
0
26 Mar 2018