ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2211.06687
  4. Cited By
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion
  and Keyword-to-Caption Augmentation

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

12 November 2022
Yusong Wu
Kai Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
    CLIP
ArXivPDFHTML

Papers citing "Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation"

50 / 70 papers shown
Title
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
Yong Ren
Chenxing Li
Le Xu
Hao Gu
Duzhen Zhang
Yujie Chen
Manjie Xu
Ruibo Fu
Shan Yang
Dong Yu
LRM
61
0
0
19 May 2025
Fast Text-to-Audio Generation with Adversarial Post-Training
Fast Text-to-Audio Generation with Adversarial Post-Training
Zachary Novack
Zach Evans
Zack Zukowski
Josiah Taylor
CJ Carr
...
Adnan Al-Sinan
Gian Marco Iodice
Julian McAuley
Taylor Berg-Kirkpatrick
Jordi Pons
83
0
0
13 May 2025
Versatile Framework for Song Generation with Prompt-based Control
Versatile Framework for Song Generation with Prompt-based Control
Yanzhe Zhang
Wenxiang Guo
Changhao Pan
Zehan Zhu
Ruiqi Li
...
Rongjie Huang
Ruiyuan Zhang
Zhiqing Hong
Ziyue Jiang
Zhou Zhao
148
2
0
27 Apr 2025
DOSE : Drum One-Shot Extraction from Music Mixture
DOSE : Drum One-Shot Extraction from Music Mixture
Suntae Hwang
Seonghyeon Kang
Kyungsu Kim
Semin Ahn
K. Lee
92
0
0
25 Apr 2025
FACT: Foundation Model for Assessing Cancer Tissue Margins with Mass Spectrometry
FACT: Foundation Model for Assessing Cancer Tissue Margins with Mass Spectrometry
Mohammad Farahmand
A. Jamzad
Fahimeh Fooladgar
Laura Connolly
Martin Kaufmann
Kevin Yi Mi Ren
John Rudan
Doug McKay
Gabor Fichtinger
P. Mousavi
93
0
0
15 Apr 2025
LoopGen: Training-Free Loopable Music Generation
LoopGen: Training-Free Loopable Music Generation
Davide Marincione
Giorgio Strano
Donato Crisostomi
Roberto Ribuoli
Emanuele Rodolà
MGen
132
0
0
06 Apr 2025
DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap
DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap
Shentong Mo
Zehua Chen
Fan Bao
Jun-Jie Zhu
DiffM
88
1
0
15 Mar 2025
M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper
M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper
Jiaming Zhou
Songtao Zhao
Jiabei He
Hui Wang
Wenjia Zeng
Yong Chen
Haoqin Sun
Aobo Kong
Yong Qin
111
1
0
13 Mar 2025
AudioX: Diffusion Transformer for Anything-to-Audio Generation
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Zeyue Tian
Yizhu Jin
Zhaoyang Liu
Ruibin Yuan
Xu Tan
Qifeng Chen
Wei Xue
Yu Guo
84
5
0
13 Mar 2025
DGFM: Full Body Dance Generation Driven by Music Foundation Models
DGFM: Full Body Dance Generation Driven by Music Foundation Models
Xinran Liu
Zhenhua Feng
Diptesh Kanojia
Wenwu Wang
DiffM
134
1
0
27 Feb 2025
AudioBERT: Audio Knowledge Augmented Language Model
AudioBERT: Audio Knowledge Augmented Language Model
Hyunjong Ok
Suho Yoo
Jaeho Lee
AuLLM
RALM
VLM
64
0
0
17 Jan 2025
FlowSep: Language-Queried Sound Separation with Rectified Flow Matching
FlowSep: Language-Queried Sound Separation with Rectified Flow Matching
Yi Yuan
Xubo Liu
Haohe Liu
Mark D. Plumbley
Wenwu Wang
104
8
0
10 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
127
2
0
10 Jan 2025
COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations
COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations
Ruben Ciranni
Emilian Postolache
Giorgio Mariani
Michele Mancusi
Giorgio Fabbro
Emanuele Rodolà
Luca Cosmo
207
8
0
10 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
Dahua Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
203
126
0
10 Jan 2025
Text2midi: Generating Symbolic Music from Captions
Text2midi: Generating Symbolic Music from Captions
Keshav Bhandari
Abhinaba Roy
Kyra Wang
Geeta Puri
Simon Colton
Dorien Herremans
120
5
0
03 Jan 2025
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
Chia-Yu Hung
Navonil Majumder
Zhifeng Kong
Ambuj Mehrish
Rafael Valle
Bryan Catanzaro
Soujanya Poria
Bryan Catanzaro
Soujanya Poria
100
9
0
30 Dec 2024
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng
Masato Ishii
Akio Hayakawa
Takashi Shibuya
Alex Schwing
Yuki Mitsufuji
VGen
254
17
0
19 Dec 2024
SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor
SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor
Chenyu Yang
Shuai Wang
Hangting Chen
Jianwei Yu
Wei Tan
Rongzhi Gu
Yongjun Xu
Yizhi Zhou
Haina Zhu
Haoyang Li
KELM
382
2
0
18 Dec 2024
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning
Dragos-Alexandru Boldisor
Stefan Smeu
Dan Oneaţă
Elisabeta Oneata
146
2
0
29 Nov 2024
Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures
Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures
Alain Riou
Antonin Gagnere
Gaëtan Hadjeres
Stefan Lattner
Geoffroy Peeters
123
0
0
29 Nov 2024
Gotta Hear Them All: Sound Source Aware Vision to Audio Generation
Gotta Hear Them All: Sound Source Aware Vision to Audio Generation
Wei Guo
Heng Wang
Jianbo Ma
Weidong Cai
DiffM
153
5
0
23 Nov 2024
Towards Open-Vocabulary Audio-Visual Event Localization
Jinxing Zhou
Dan Guo
Ruohao Guo
Yuxin Mao
Jingjing Hu
Yiran Zhong
Xiaojun Chang
Ming Wang
VLM
95
5
0
18 Nov 2024
Past, Present, and Future of Sensor-Based Human Activity Recognition Using Wearables: A Surveying Tutorial on a Still Challenging Task
Past, Present, and Future of Sensor-Based Human Activity Recognition Using Wearables: A Surveying Tutorial on a Still Challenging Task
H. Haresamudram
Chi Ian Tang
Sungho Suh
P. Lukowicz
Thomas Ploetz
149
2
0
11 Nov 2024
Long-Form Text-to-Music Generation with Adaptive Prompts: A Case Study in Tabletop Role-Playing Games Soundtracks
Long-Form Text-to-Music Generation with Adaptive Prompts: A Case Study in Tabletop Role-Playing Games Soundtracks
Felipe Marra
Lucas N. Ferreira
88
0
0
06 Nov 2024
Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection
Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection
Han Yin
Yang Xiao
Jisheng Bai
Rohan Kumar Das
93
0
0
02 Nov 2024
Music Foundation Model as Generic Booster for Music Downstream Tasks
Music Foundation Model as Generic Booster for Music Downstream Tasks
Weihsiang Liao
Yuhta Takida
Yukara Ikemiya
Zhi-Wei Zhong
Chieh-Hsin Lai
...
Stefan Uhlich
Taketo Akama
Woosung Choi
Yuichiro Koyama
Yuki Mitsufuji
196
1
0
02 Nov 2024
Human-inspired Perspectives: A Survey on AI Long-term Memory
Human-inspired Perspectives: A Survey on AI Long-term Memory
Zihong He
Weizhe Lin
Hao Zheng
Fan Zhang
Matt Jones
Laurence Aitchison
X. Xu
Miao Liu
Per Ola Kristensson
Junxiao Shen
177
3
0
01 Nov 2024
Do Audio-Language Models Understand Linguistic Variations?
Do Audio-Language Models Understand Linguistic Variations?
Ramaneswaran Selvakumar
Sonal Kumar
Hemant Kumar Giri
Nishit Anand
Ashish Seth
Sreyan Ghosh
Dinesh Manocha
AuLLM
VLM
113
1
0
21 Oct 2024
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Shangda Wu
Yashan Wang
Ruibin Yuan
Zhancheng Guo
Xu Tan
...
Yuanliang Dong
Jiafeng Liu
Xiaobing Li
Feng Yu
Maosong Sun
177
4
0
17 Oct 2024
Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent
  Approach
Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach
Rory Young
Nicolas Pugeault
AAML
94
4
0
14 Oct 2024
Presto! Distilling Steps and Layers for Accelerating Music Generation
Presto! Distilling Steps and Layers for Accelerating Music Generation
Zachary Novack
Ge Zhu
Jonah Casebeer
Julian McAuley
Taylor Berg-Kirkpatrick
Nicholas J. Bryan
100
7
0
07 Oct 2024
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
Zixuan Wang
Chi-Keung Tang
Chi-Keung Tang
DiffM
VGen
LLMAG
91
4
0
04 Oct 2024
Language-based Audio Moment Retrieval
Language-based Audio Moment Retrieval
Hokuto Munakata
Taichi Nishimura
Shota Nakada
Tatsuya Komatsu
105
2
0
24 Sep 2024
OmniBench: Towards The Future of Universal Omni-Language Models
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li
Ge Zhang
Yinghao Ma
Ruibin Yuan
Kang Zhu
...
Zhaoxiang Zhang
Zachary Liu
Emmanouil Benetos
Wenhao Huang
Chenghua Lin
LRM
99
18
0
23 Sep 2024
A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection
A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection
Lam Pham
Phat Lam
Dat Tran
Hieu Tang
Tin Nguyen
Alexander Schindler
Canh Vu
Alexander Polonsky
Canh Vu
97
5
0
23 Sep 2024
Exploring Text-Queried Sound Event Detection with Audio Source Separation
Exploring Text-Queried Sound Event Detection with Audio Source Separation
Han Yin
Jisheng Bai
Yang Xiao
Hui Wang
Siqi Zheng
Yafeng Chen
Rohan Kumar Das
Chong Deng
Jianfeng Chen
75
3
0
20 Sep 2024
Language-Queried Target Sound Extraction Without Parallel Training Data
Language-Queried Target Sound Extraction Without Parallel Training Data
Hao Ma
Zhiyuan Peng
Xu Li
Yukai Li
Mingjie Shao
Qiuqiang Kong
Xuelong Li
VLM
128
2
0
14 Sep 2024
STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
Yong Ren
Chenxing Li
Manjie Xu
Wei Liang
Yu Gu
Rilin Chen
Dong Yu
VGen
DiffM
74
9
0
13 Sep 2024
Video-to-Audio Generation with Hidden Alignment
Video-to-Audio Generation with Hidden Alignment
Manjie Xu
Chenxing Li
Yong Ren
Rilin Chen
Yu Gu
Yu Gu
Dong Yu
Dong Yu
DiffM
VGen
76
12
0
10 Jul 2024
Sequential Contrastive Audio-Visual Learning
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serrà
80
2
0
08 Jul 2024
A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining
A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining
Feiyang Xiao
Jian Guan
Qiaoxi Zhu
Xubo Liu
Wenbo Wang
Shuhan Qi
Kejia Zhang
Jianyuan Sun
Wenwu Wang
54
6
0
06 Jul 2024
Contrastive Learning from Synthetic Audio Doppelgängers
Contrastive Learning from Synthetic Audio Doppelgängers
Manuel Cherep
Nikhil Singh
83
1
0
09 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Xu Tan
Qifeng Chen
Yu Guo
VGen
247
17
0
06 Jun 2024
Listenable Maps for Zero-Shot Audio Classifiers
Listenable Maps for Zero-Shot Audio Classifiers
Francesco Paissan
Luca Della Libera
Mirco Ravanelli
Cem Subakan
83
4
0
27 May 2024
Images that Sound: Composing Images and Sounds on a Single Canvas
Images that Sound: Composing Images and Sounds on a Single Canvas
Ziyang Chen
Daniel Geng
Andrew Owens
DiffM
91
9
0
20 May 2024
A Diffusion-Based Generative Equalizer for Music Restoration
A Diffusion-Based Generative Equalizer for Music Restoration
Eloi Moliner
Maija Turunen
Filip Elvander
Vesa Valimaki
131
5
0
27 Mar 2024
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
Yongqi Wang
Ruofan Hu
Rongjie Huang
Zhiqing Hong
Ruiqi Li
Wenrui Liu
Fuming You
Tao Jin
Zhou Zhao
82
12
0
18 Mar 2024
Audio-Visual Instance Segmentation
Audio-Visual Instance Segmentation
Ruohao Guo
Yaru Chen
Yanyu Qi
Wenzhen Yue
Dantong Niu
...
Wenzhen Yue
Ji Shi
Qixun Wang
Peiliang Zhang
Buwen Liang
VLM
VOS
74
2
0
28 Oct 2023
Audio Retrieval with WavText5K and CLAP Training
Audio Retrieval with WavText5K and CLAP Training
Soham Deshmukh
Benjamin Elizalde
Huaming Wang
3DV
CLIP
138
53
0
28 Sep 2022
12
Next