ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2211.06687
  4. Cited By
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion
  and Keyword-to-Caption Augmentation
v1v2v3v4 (latest)

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

12 November 2022
Yusong Wu
Kai Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
    CLIP
ArXiv (abs)PDFHTML

Papers citing "Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation"

50 / 383 papers shown
Title
Audio Dialogues: Dialogues dataset for audio and music understanding
Audio Dialogues: Dialogues dataset for audio and music understanding
Arushi Goel
Zhifeng Kong
Rafael Valle
Bryan Catanzaro
AuLLM
100
5
0
11 Apr 2024
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
Tiantian Geng
Teng Wang
Yanfu Zhang
Jinming Duan
Weili Guan
Feng Zheng
84
2
0
04 Apr 2024
Weakly-supervised Audio Separation via Bi-modal Semantic Similarity
Weakly-supervised Audio Separation via Bi-modal Semantic Similarity
Tanvir Mahmud
Saeed Amizadeh
K. Koishida
Diana Marculescu
AI4TS
65
3
0
02 Apr 2024
SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers
SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers
Junghyun Koo
Gordon Wichern
François Germain
Sameer Khurana
Jonathan Le Roux
92
5
0
02 Apr 2024
A Diffusion-Based Generative Equalizer for Music Restoration
A Diffusion-Based Generative Equalizer for Music Restoration
Eloi Moliner
Maija Turunen
Filip Elvander
Vesa Valimaki
156
5
0
27 Mar 2024
Synthetic training set generation using text-to-audio models for
  environmental sound classification
Synthetic training set generation using text-to-audio models for environmental sound classification
Francesca Ronchini
Luca Comanducci
Fabio Antonacci
80
2
0
26 Mar 2024
Correlation of Fréchet Audio Distance With Human Perception of
  Environmental Audio Is Embedding Dependant
Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant
Modan Tailleur
Junwon Lee
Mathieu Lagrange
Keunwoo Choi
Laurie M. Heller
Keisuke Imoto
Yuki Okamoto
89
10
0
26 Mar 2024
Building speech corpus with diverse voice characteristics for its
  prompt-based representation
Building speech corpus with diverse voice characteristics for its prompt-based representation
Aya Watanabe
Shinnosuke Takamichi
Yuki Saito
Wataru Nakata
Detai Xin
Hiroshi Saruwatari
65
1
0
20 Mar 2024
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
Yongqi Wang
Ruofan Hu
Rongjie Huang
Zhiqing Hong
Ruiqi Li
Wenrui Liu
Fuming You
Tao Jin
Zhou Zhao
114
13
0
18 Mar 2024
Generalized Multi-Source Inference for Text Conditioned Music Diffusion
  Models
Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models
Emilian Postolache
Giorgio Mariani
Luca Cosmo
Emmanouil Benetos
Emanuele Rodolà
DiffM
87
11
0
18 Mar 2024
Multiscale Matching Driven by Cross-Modal Similarity Consistency for
  Audio-Text Retrieval
Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval
Qian Wang
Jia-Chen Gu
Zhen-Hua Ling
61
2
0
15 Mar 2024
Knowledge Conflicts for LLMs: A Survey
Knowledge Conflicts for LLMs: A Survey
Rongwu Xu
Zehan Qi
Zhijiang Guo
Cunxiang Wang
Hongru Wang
Yue Zhang
Wei Xu
296
122
0
13 Mar 2024
Text-to-Audio Generation Synchronized with Videos
Text-to-Audio Generation Synchronized with Videos
Shentong Mo
Jing Shi
Yapeng Tian
DiffMVGen
88
18
0
08 Mar 2024
A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
Xuenan Xu
Xiaohang Xu
Zeyu Xie
Pingyue Zhang
Mengyue Wu
Kai Yu
58
6
0
07 Mar 2024
Time Weaver: A Conditional Time Series Generation Model
Time Weaver: A Conditional Time Series Generation Model
Sai Shankar Narasimhan
Shubhankar Agarwal
Oguzhan Akcin
Sujay Sanghavi
Sandeep Chinchali
DiffMMedIm
118
21
0
05 Mar 2024
Retrieval-Augmented Generation for AI-Generated Content: A Survey
Retrieval-Augmented Generation for AI-Generated Content: A Survey
Penghao Zhao
Hailin Zhang
Qinhan Yu
Zhengren Wang
Yunteng Geng
Fangcheng Fu
Ling Yang
Wentao Zhang
Jie Jiang
Tengjiao Wang
3DV
288
284
0
29 Feb 2024
A SOUND APPROACH: Using Large Language Models to generate audio
  descriptions for egocentric text-audio retrieval
A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval
Andreea-Maria Oncescu
João F. Henriques
Andrew Zisserman
Samuel Albanie
A. Sophia Koepke
67
5
0
29 Feb 2024
ArcSin: Adaptive ranged cosine Similarity injected noise for
  Language-Driven Visual Tasks
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
Yang Liu
Xiaomin Yu
Gongyu Zhang
Christos Bergeles
Prokar Dasgupta
Alejandro Granados
Sebastien Ourselin
66
2
0
27 Feb 2024
EDTC: enhance depth of text comprehension in automated audio captioning
EDTC: enhance depth of text comprehension in automated audio captioning
Liwen Tan
Yin Cao
Yi Zhou
70
0
0
27 Feb 2024
Music Style Transfer with Time-Varying Inversion of Diffusion Models
Music Style Transfer with Time-Varying Inversion of Diffusion Models
Sifei Li
Yuxin Zhang
Fan Tang
Chongyang Ma
Weiming Dong
Changsheng Xu
DiffM
72
11
0
21 Feb 2024
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Jun Zhan
Junqi Dai
Jiasheng Ye
Yunhua Zhou
Dong Zhang
...
Jie Fu
Tao Gui
Tianxiang Sun
Yugang Jiang
Xipeng Qiu
MLLM
95
136
0
19 Feb 2024
Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
Hila Manor
T. Michaeli
DiffM
100
29
0
15 Feb 2024
Domain Adaptation for Contrastive Audio-Language Models
Domain Adaptation for Contrastive Audio-Language Models
Soham Deshmukh
Rita Singh
Bhiksha Raj
VLM
65
8
0
14 Feb 2024
Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation
  and Editing via Content-based Controls
Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls
Liwei Lin
Gus Xia
Yixiao Zhang
Junyan Jiang
72
13
0
14 Feb 2024
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and
  Instruction Tuning
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
Hang Zhao
Yifei Xin
Zhesong Yu
Bilei Zhu
Lu Lu
Zejun Ma
AuLLM
93
4
0
12 Feb 2024
Cacophony: An Improved Contrastive Audio-Text Model
Cacophony: An Improved Contrastive Audio-Text Model
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
90
12
0
10 Feb 2024
MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
Yixiao Zhang
Yukara Ikemiya
Gus Xia
Naoki Murata
Marco A. Martínez-Ramírez
Wei-Hsiang Liao
Yuki Mitsufuji
Simon Dixon
123
23
0
09 Feb 2024
Fast Timing-Conditioned Latent Audio Diffusion
Fast Timing-Conditioned Latent Audio Diffusion
Zach Evans
CJ Carr
Josiah Taylor
Scott H. Hawley
Jordi Pons
DiffM
137
117
0
07 Feb 2024
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and
  Dialogue Abilities
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong
Arushi Goel
Rohan Badlani
Ming-Yu Liu
Rafael Valle
Bryan Catanzaro
AuLLMLM&MAMLLM
163
94
0
02 Feb 2024
PAM: Prompting Audio-Language Models for Audio Quality Assessment
PAM: Prompting Audio-Language Models for Audio Quality Assessment
Soham Deshmukh
Dareen Alharthi
Benjamin Elizalde
Hannes Gamper
Mahmoud Al Ismail
Rita Singh
Bhiksha Raj
Huaming Wang
96
13
0
01 Feb 2024
Binding Touch to Everything: Learning Unified Multimodal Tactile
  Representations
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang
Chao Feng
Ziyang Chen
Hyoungseob Park
Daniel Wang
...
Ziyao Zeng
Xien Chen
Rit Gangopadhyay
Andrew Owens
Alex Wong
126
71
0
31 Jan 2024
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for
  Automated Audio Captioning
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Jinjoo Lee
Sang Hoon Woo
CLIPVLM
72
25
0
31 Jan 2024
Masked Audio Modeling with CLAP and Multi-Objective Learning
Masked Audio Modeling with CLAP and Multi-Objective Learning
Yifei Xin
Xiulian Peng
Yan Lu
110
8
0
29 Jan 2024
Exploring Musical Roots: Applying Audio Embeddings to Empower Influence
  Attribution for a Generative Music Model
Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model
Julia Barnett
Hugo Flores Garcia
Bryan Pardo
83
7
0
25 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRLLRM
164
216
0
24 Jan 2024
DITTO: Diffusion Inference-Time T-Optimization for Music Generation
DITTO: Diffusion Inference-Time T-Optimization for Music Generation
Cheng-i Wang
Julian McAuley
Taylor Berg-Kirkpatrick
Nicholas J. Bryan
DiffM
119
41
0
22 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model
  for Multimodal Processing
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
71
0
0
22 Jan 2024
On the Audio Hallucinations in Large Audio-Video Language Models
On the Audio Hallucinations in Large Audio-Video Language Models
Taichi Nishimura
Shota Nakada
Masayoshi Kondo
VLM
60
7
0
18 Jan 2024
Learning Audio Concepts from Counterfactual Natural Language
Learning Audio Concepts from Counterfactual Natural Language
Ali Vosoughi
Luca Bondi
Ho-Hsiang Wu
Chenliang Xu
CML
86
5
0
10 Jan 2024
Masked Audio Generation using a Single Non-Autoregressive Transformer
Masked Audio Generation using a Single Non-Autoregressive Transformer
Alon Ziv
Itai Gat
Gaël Le Lan
Tal Remez
Felix Kreuk
Alexandre Défossez
Jade Copet
Gabriel Synnaeve
Yossi Adi
110
40
0
09 Jan 2024
AccidentGPT: Large Multi-Modal Foundation Model for Traffic Accident
  Analysis
AccidentGPT: Large Multi-Modal Foundation Model for Traffic Accident Analysis
Kebin Wu
Wenbin Li
Xiaofei Xiao
30
4
0
05 Jan 2024
Towards Weakly Supervised Text-to-Audio Grounding
Towards Weakly Supervised Text-to-Audio Grounding
Xuenan Xu
Ziyang Ma
Mengyue Wu
Kai Yu
AI4TS
65
9
0
05 Jan 2024
Oceanship: A Large-Scale Dataset for Underwater Audio Target Recognition
Oceanship: A Large-Scale Dataset for Underwater Audio Target Recognition
Zeyu Li
Suncheng Xiang
Tong Yu
Jingsheng Gao
Jiacheng Ruan
Yanping Hu
Ting Liu
Yuzhuo Fu
42
0
0
04 Jan 2024
Audiobox: Unified Audio Generation with Natural Language Prompts
Audiobox: Unified Audio Generation with Natural Language Prompts
Apoorv Vyas
Bowen Shi
Matt Le
Andros Tjandra
Yi-Chiao Wu
...
Chris Summers
Carleigh Wood
Joshua Lane
Mary Williamson
Wei-Ning Hsu
124
94
0
25 Dec 2023
A Language-based solution to enable Metaverse Retrieval
A Language-based solution to enable Metaverse Retrieval
Ali Abdari
Alex Falcon
Giuseppe Serra
DiffM
62
4
0
22 Dec 2023
SECap: Speech Emotion Captioning with Large Language Model
SECap: Speech Emotion Captioning with Large Language Model
Yaoxun Xu
Hangting Chen
Jianwei Yu
Qiaochu Huang
Zhiyong Wu
Shixiong Zhang
Guangzhi Li
Yi Luo
Rongzhi Gu
110
27
0
16 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
Gabriel Loaiza-Ganem
M. Volkovs
107
3
0
15 Dec 2023
WikiMuTe: A web-sourced dataset of semantic descriptions for music audio
WikiMuTe: A web-sourced dataset of semantic descriptions for music audio
Benno Weck
Holger Kirchhoff
Peter Grosche
Xavier Serra
VLM
46
2
0
14 Dec 2023
Audio-Visual LLM for Video Understanding
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLMMLLM
57
44
0
11 Dec 2023
Speaker-Text Retrieval via Contrastive Learning
Speaker-Text Retrieval via Contrastive Learning
Xuechen Liu
Xin Wang
Erica Cooper
Xiaoxiao Miao
Junichi Yamagishi
VLM
45
1
0
11 Dec 2023
Previous
12345678
Next