ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2409.19132
  4. Cited By
From Vision to Audio and Beyond: A Unified Model for Audio-Visual
  Representation and Generation

From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

27 September 2024
Kun Su
Xiulong Liu
Eli Shlizerman
    VGen
ArXivPDFHTML

Papers citing "From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation"

28 / 28 papers shown
Title
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
Hilde Kuehne
79
0
0
02 May 2025
High-Fidelity Audio Compression with Improved RVQGAN
High-Fidelity Audio Compression with Improved RVQGAN
Rithesh Kumar
Prem Seetharaman
Alejandro Luebs
I. Kumar
Kundan Kumar
88
326
0
11 Jun 2023
Multimodal Deep Learning
Multimodal Deep Learning
Cem Akkus
Jiquan Ngiam
Vladana Djakovic
Steffen Jauch-Walser
A. Khosla
...
Jann Goschenhofer
Honglak Lee
A. Ng
Daniel Schalk
Matthias Aßenmacher
112
3,169
0
12 Jan 2023
MAViL: Masked Audio-Video Learners
MAViL: Masked Audio-Video Learners
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer
62
54
0
15 Dec 2022
MAGE: MAsked Generative Encoder to Unify Representation Learning and
  Image Synthesis
MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis
Tianhong Li
Huiwen Chang
Shlok Kumar Mishra
Han Zhang
Dina Katabi
Dilip Krishnan
63
166
0
16 Nov 2022
Contrastive Audio-Visual Masked Autoencoder
Contrastive Audio-Visual Masked Autoencoder
Yuan Gong
Andrew Rouditchenko
Alexander H. Liu
David Harwath
Leonid Karlinsky
Hilde Kuehne
James R. Glass
70
125
0
02 Oct 2022
AudioGen: Textually Guided Audio Generation
AudioGen: Textually Guided Audio Generation
Felix Kreuk
Gabriel Synnaeve
Adam Polyak
Uriel Singer
Alexandre Défossez
Jade Copet
Devi Parikh
Yaniv Taigman
Yossi Adi
DiffM
60
308
0
30 Sep 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and
  Vision-Language Tasks
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
133
639
0
22 Aug 2022
Classifier-Free Diffusion Guidance
Classifier-Free Diffusion Guidance
Jonathan Ho
Tim Salimans
FaML
188
3,885
0
26 Jul 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
371
3,535
0
29 Apr 2022
MaskGIT: Masked Generative Image Transformer
MaskGIT: Masked Generative Image Transformer
Huiwen Chang
Han Zhang
Lu Jiang
Ce Liu
William T. Freeman
ViT
125
674
0
08 Feb 2022
Video Background Music Generation with Controllable Music Transformer
Video Background Music Generation with Controllable Music Transformer
Shangzhe Di
Jiang
Sihan Liu
Zhaokai Wang
Leyan Zhu
Zexin He
Hongming Liu
Shuicheng Yan
53
92
0
16 Nov 2021
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
451
7,739
0
11 Nov 2021
VLMo: Unified Vision-Language Pre-Training with
  Mixture-of-Modality-Experts
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLM
MLLM
MoE
69
553
0
03 Nov 2021
Taming Visually Guided Sound Generation
Taming Visually Guided Sound Generation
Vladimir E. Iashin
Esa Rahtu
VLM
85
127
0
17 Oct 2021
Efficient Training of Audio Transformers with Patchout
Efficient Training of Audio Transformers with Patchout
Khaled Koutini
Jan Schluter
Hamid Eghbalzadeh
Gerhard Widmer
ViT
126
257
0
11 Oct 2021
Train Short, Test Long: Attention with Linear Biases Enables Input
  Length Extrapolation
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press
Noah A. Smith
M. Lewis
321
755
0
27 Aug 2021
Attention Bottlenecks for Multimodal Fusion
Attention Bottlenecks for Multimodal Fusion
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
98
565
0
30 Jun 2021
Perceiver: General Perception with Iterative Attention
Perceiver: General Perception with Iterative Attention
Andrew Jaegle
Felix Gimeno
Andrew Brock
Andrew Zisserman
Oriol Vinyals
João Carreira
VLM
ViT
MDE
185
1,014
0
04 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
900
29,372
0
26 Feb 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at
  Scale
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
632
41,003
0
22 Oct 2020
Jukebox: A Generative Model for Music
Jukebox: A Generative Model for Music
Prafulla Dhariwal
Heewoo Jun
Christine Payne
Jong Wook Kim
Alec Radford
Ilya Sutskever
VLM
107
746
0
30 Apr 2020
Audio-Visual Instance Discrimination with Cross-Modal Agreement
Audio-Visual Instance Discrimination with Cross-Modal Agreement
Pedro Morgado
Nuno Vasconcelos
Ishan Misra
SSL
80
273
0
27 Apr 2020
PREDICT & CLUSTER: Unsupervised Skeleton Based Action Recognition
PREDICT & CLUSTER: Unsupervised Skeleton Based Action Recognition
Kun Su
Xiulong Liu
Eli Shlizerman
83
189
0
27 Nov 2019
Objects that Sound
Objects that Sound
Relja Arandjelović
Andrew Zisserman
ObjD
VOS
98
530
0
18 Dec 2017
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
Soroush Mehri
Kundan Kumar
Ishaan Gulrajani
Rithesh Kumar
Shubham Jain
Jose M. R. Sotelo
Aaron Courville
Yoshua Bengio
106
599
0
22 Dec 2016
CNN Architectures for Large-Scale Audio Classification
CNN Architectures for Large-Scale Audio Classification
Shawn Hershey
Sourish Chaudhuri
D. Ellis
J. Gemmeke
A. Jansen
...
Rif A. Saurous
Bryan Seybold
M. Slaney
Ron J. Weiss
K. Wilson
120
2,498
0
29 Sep 2016
Bidirectional Recurrent Neural Networks as Generative Models -
  Reconstructing Gaps in Time Series
Bidirectional Recurrent Neural Networks as Generative Models - Reconstructing Gaps in Time Series
Mathias Berglund
T. Raiko
Mikko Honkala
L. Kärkkäinen
A. Vetek
J. Karhunen
BDL
70
127
0
07 Apr 2015
1