ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2503.10522
  4. Cited By
AudioX: Diffusion Transformer for Anything-to-Audio Generation
v1v2 (latest)

AudioX: Diffusion Transformer for Anything-to-Audio Generation

13 March 2025
Zeyue Tian
Yizhu Jin
Zhaoyang Liu
Ruibin Yuan
Xu Tan
Qifeng Chen
Wei Xue
Yu Guo
ArXiv (abs)PDFHTML

Papers citing "AudioX: Diffusion Transformer for Anything-to-Audio Generation"

50 / 66 papers shown
Title
Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
Chang Liu
Haomin Zhang
Shiyu Xia
Zihao Chen
Chaofan Ding
Xin Yue
Huizhe Chen
Xinhan Di
48
0
0
26 May 2025
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
Sifei Li
Mining Tan
Feier Shen
Minyan Luo
Zijiao Yin
Fan Tang
W. Dong
Changsheng Xu
118
1
0
17 Apr 2025
Vision-to-Music Generation: A Survey
Vision-to-Music Generation: A Survey
Zhaokai Wang
Chenxi Bao
Le Zhuo
Jingrui Han
Yang Yue
Yihong Tang
Victor Shea-Jay Huang
Yue Liao
EGVMVGen
116
1
0
27 Mar 2025
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Andros Tjandra
Yi-Chiao Wu
Baishan Guo
John Hoffman
Brian Ellis
...
Matt Le
Nick Zacharov
Carleigh Wood
Ann Lee
Wei-Ning Hsu
212
18
0
07 Feb 2025
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng
Masato Ishii
Akio Hayakawa
Takashi Shibuya
Alex Schwing
Yuki Mitsufuji
VGen
278
18
0
19 Dec 2024
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large
  Language Models
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
Shansong Liu
Atin Sakkeer Hussain
Qilong Wu
Chenshuo Sun
Ying Shan
AuLLM
114
4
0
09 Dec 2024
VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment
  via Hierarchical Visual Features
VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features
Sifei Li
Binxin Yang
Chunji Yin
Chong Sun
Yuxin Zhang
Weiming Dong
Chen Li
VGen
90
4
0
09 Dec 2024
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic
  Synchronization
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization
Ruiqi Li
Siqi Zheng
Xize Cheng
Ziang Zhang
Shengpeng Ji
Zhou Zhao
VGen
118
9
0
16 Oct 2024
MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation
MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation
T. Pham
Tri Ton
Chang D. Yoo
93
3
0
03 Oct 2024
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music
  Videos
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
Yan-Bo Lin
Yu Tian
L. Yang
Gedas Bertasius
Heng Wang
VGen
72
8
0
11 Sep 2024
Stable Audio Open
Stable Audio Open
Zach Evans
Julian Parker
CJ Carr
Zack Zukowski
Josiah Taylor
Jordi Pons
252
53
0
19 Jul 2024
Qwen2-Audio Technical Report
Qwen2-Audio Technical Report
Yunfei Chu
Jin Xu
Qian Yang
Haojie Wei
Xipin Wei
...
Yuanjun Lv
Jinzheng He
Junyang Lin
Chang Zhou
Jingren Zhou
AuLLMVLM
96
162
0
15 Jul 2024
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized
  Sounds
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
Yiming Zhang
Yicheng Gu
Yanhong Zeng
Zhening Xing
Yuancheng Wang
Zhizheng Wu
Kai Chen
VGen
94
41
0
01 Jul 2024
SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for
  Efficient Audio Synthesis and Beyond
SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond
Marco Comunità
Zhi-Wei Zhong
Akira Takahashi
Shiqi Yang
Mengjie Zhao
Koichi Saito
Yukara Ikemiya
Takashi Shibuya
Shusuke Takahashi
Yuki Mitsufuji
110
6
0
25 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Xu Tan
Qifeng Chen
Yu Guo
VGen
267
17
0
06 Jun 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
Yongqi Wang
Wenxiang Guo
Rongjie Huang
Jia-Bin Huang
Zehan Wang
Fuming You
Ruiqi Li
Zhou Zhao
VGenDiffM
102
13
0
01 Jun 2024
LLMs Meet Multimodal Generation and Editing: A Survey
LLMs Meet Multimodal Generation and Editing: A Survey
Yin-Yin He
Zhaoyang Liu
Jingye Chen
Zeyue Tian
Hongyu Liu
...
Yong Zhang
Wei Xue
Qi-fei Liu
Yi-Ting Guo
Qifeng Chen
75
21
0
29 May 2024
Diff-BGM: A Diffusion Model for Video Background Music Generation
Diff-BGM: A Diffusion Model for Video Background Music Generation
Sizhe Li
Yiming Qin
Minghang Zheng
Xin Jin
Yang Liu
DiffM
48
15
0
20 May 2024
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
Qixin Deng
Qikai Yang
Ruibin Yuan
Yipeng Huang
Yi Wang
...
Emmanouil Benetos
Wenwu Wang
Guangyu Xia
Wei Xue
Yi-Ting Guo
LLMAG
92
36
0
28 Apr 2024
Long-form music generation with latent diffusion
Long-form music generation with latent diffusion
Zach Evans
Julian Parker
CJ Carr
Zack Zukowski
Josiah Taylor
Jordi Pons
MGenDiffM
115
45
0
16 Apr 2024
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through
  Direct Preference Optimization
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
Navonil Majumder
Chia-Yu Hung
Deepanway Ghosal
Wei-Ning Hsu
Rada Mihalcea
Soujanya Poria
124
61
0
15 Apr 2024
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
  Latent Aligners
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Yazhou Xing
Yin-Yin He
Zeyue Tian
Xintao Wang
Qifeng Chen
113
57
0
27 Feb 2024
ChatMusician: Understanding and Generating Music Intrinsically with LLM
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Ti-Fen Pan
Hanfeng Lin
Yi Wang
Zeyue Tian
Shangda Wu
...
Gus Xia
Roger Dannenberg
Wei Xue
Shiyin Kang
Yike Guo
174
44
0
25 Feb 2024
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Jun Zhan
Junqi Dai
Jiasheng Ye
Yunhua Zhou
Dong Zhang
...
Jie Fu
Tao Gui
Tianxiang Sun
Yugang Jiang
Xipeng Qiu
MLLM
92
136
0
19 Feb 2024
Masked Audio Generation using a Single Non-Autoregressive Transformer
Masked Audio Generation using a Single Non-Autoregressive Transformer
Alon Ziv
Itai Gat
Gaël Le Lan
Tal Remez
Felix Kreuk
Alexandre Défossez
Jade Copet
Gabriel Synnaeve
Yossi Adi
107
40
0
09 Jan 2024
Video-LLaVA: Learning United Visual Representation by Alignment Before
  Projection
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLMMLLM
345
709
0
16 Nov 2023
Video2Music: Suitable Music Generation from Videos using an Affective
  Multimodal Transformer model
Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model
Jaeyong Kang
Soujanya Poria
Dorien Herremans
MGenVGen
74
36
0
02 Nov 2023
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Haoxin Chen
Menghan Xia
Yin-Yin He
Yong Zhang
Xiaodong Cun
...
Yaofang Liu
Qifeng Chen
Xintao Wang
Chao-Liang Weng
Ying Shan
DiffM
81
312
0
30 Oct 2023
NExT-GPT: Any-to-Any Multimodal LLM
NExT-GPT: Any-to-Any Multimodal LLM
Shengqiong Wu
Hao Fei
Leigang Qu
Wei Ji
Tat-Seng Chua
MLLM
105
507
0
11 Sep 2023
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised
  Pretraining
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
Haohe Liu
Yiitan Yuan
Xubo Liu
Xinhao Mei
Qiuqiang Kong
Qiao Tian
Yuping Wang
Wenwu Wang
Yuxuan Wang
Mark D. Plumbley
DiffM
121
246
0
10 Aug 2023
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models
  without Specific Tuning
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo
Ceyuan Yang
Anyi Rao
Zhengyang Liang
Yaohui Wang
Yu Qiao
Maneesh Agrawala
Dahua Lin
Bo Dai
VGen
142
877
0
10 Jul 2023
VampNet: Music Generation via Masked Acoustic Token Modeling
VampNet: Music Generation via Masked Acoustic Token Modeling
Hugo Flores Garcia
Prem Seetharaman
Rithesh Kumar
Bryan Pardo
MGen
93
68
0
10 Jul 2023
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion
  Models
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
Simian Luo
Chuanhao Yan
Chenxu Hu
Hang Zhao
DiffM
102
83
0
29 Jun 2023
Simple and Controllable Music Generation
Simple and Controllable Music Generation
Jade Copet
Felix Kreuk
Itai Gat
Tal Remez
David Kant
Gabriel Synnaeve
Yossi Adi
Alexandre Défossez
MGen
136
377
0
08 Jun 2023
ImageBind: One Embedding Space To Bind Them All
ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar
Alaaeldin El-Nouby
Zhuang Liu
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
VLM
176
941
0
09 May 2023
Text-to-Audio Generation using Instruction-Tuned LLM and Latent
  Diffusion Model
Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
Deepanway Ghosal
Navonil Majumder
Ambuj Mehrish
Soujanya Poria
228
151
0
24 Apr 2023
Visual Instruction Tuning
Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDaVLMMLLM
575
4,936
0
17 Apr 2023
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for
  Audio-Language Multimodal Research
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Xinhao Mei
Chutong Meng
Haohe Liu
Qiuqiang Kong
Tom Ko
Chengqi Zhao
Mark D. Plumbley
Yuexian Zou
Wenwu Wang
157
220
0
30 Mar 2023
MusicLM: Generating Music From Text
MusicLM: Generating Music From Text
A. Agostinelli
Timo I. Denk
Zalan Borsos
Jesse Engel
Mauro Verzetti
...
Adam Roberts
Marco Tagliasacchi
Matthew Sharifi
Neil Zeghidour
Christian Frank
MGen
147
450
0
26 Jan 2023
MAGVIT: Masked Generative Video Transformer
MAGVIT: Masked Generative Video Transformer
Lijun Yu
Yong Cheng
Kihyuk Sohn
José Lezama
Han Zhang
...
Alexander G. Hauptmann
Ming-Hsuan Yang
Yuan Hao
Irfan Essa
Lu Jiang
DiffMVGen
82
248
0
10 Dec 2022
InstructPix2Pix: Learning to Follow Image Editing Instructions
InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks
Aleksander Holynski
Alexei A. Efros
DiffM
215
1,835
0
17 Nov 2022
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion
  and Keyword-to-Caption Augmentation
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Yusong Wu
Kai Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
CLIP
150
542
0
12 Nov 2022
I Hear Your True Colors: Image Guided Audio Generation
I Hear Your True Colors: Image Guided Audio Generation
Roy Sheffer
Yossi Adi
VLM
73
76
0
06 Nov 2022
AudioGen: Textually Guided Audio Generation
AudioGen: Textually Guided Audio Generation
Felix Kreuk
Gabriel Synnaeve
Adam Polyak
Uriel Singer
Alexandre Défossez
Jade Copet
Devi Parikh
Yaniv Taigman
Yossi Adi
DiffM
106
309
0
30 Sep 2022
Masked Autoencoders that Listen
Masked Autoencoders that Listen
Po-Yao (Bernie) Huang
Hu Xu
Juncheng Billy Li
Alexei Baevski
Michael Auli
Wojciech Galuba
Florian Metze
Christoph Feichtenhofer
121
287
0
13 Jul 2022
Hierarchical Text-Conditional Image Generation with CLIP Latents
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya A. Ramesh
Prafulla Dhariwal
Alex Nichol
Casey Chu
Mark Chen
VLMDiffM
425
6,921
0
13 Apr 2022
Video Diffusion Models
Video Diffusion Models
Jonathan Ho
Tim Salimans
Alexey A. Gritsenko
William Chan
Mohammad Norouzi
David J. Fleet
DiffMVGen
230
1,642
0
07 Apr 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for
  Self-Supervised Video Pre-Training
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
235
1,207
0
23 Mar 2022
MaskGIT: Masked Generative Image Transformer
MaskGIT: Masked Generative Image Transformer
Huiwen Chang
Han Zhang
Lu Jiang
Ce Liu
William T. Freeman
ViT
156
695
0
08 Feb 2022
High-Resolution Image Synthesis with Latent Diffusion Models
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach
A. Blattmann
Dominik Lorenz
Patrick Esser
Bjorn Ommer
3DV
523
15,809
0
20 Dec 2021
12
Next