ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2211.06687
  4. Cited By
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion
  and Keyword-to-Caption Augmentation
v1v2v3v4 (latest)

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

12 November 2022
Yusong Wu
Kai Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
    CLIP
ArXiv (abs)PDFHTML

Papers citing "Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation"

50 / 383 papers shown
Title
SynthScribe: Deep Multimodal Tools for Synthesizer Sound Retrieval and
  Exploration
SynthScribe: Deep Multimodal Tools for Synthesizer Sound Retrieval and Exploration
Stephen Brade
Bryan Wang
Maurício Sousa
Gregory Lee Newsome
Sageev Oore
Tovi Grossman
64
2
0
07 Dec 2023
C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
Juntao Zhang
Yuehuai Liu
Yu-Wing Tai
Chi-Keung Tang
DiffM
76
5
0
29 Nov 2023
ViT-Lens: Towards Omni-modal Representations
ViT-Lens: Towards Omni-modal Representations
Weixian Lei
Yixiao Ge
Kun Yi
Jianfeng Zhang
Difei Gao
Dylan Sun
Yuying Ge
Ying Shan
Mike Zheng Shou
90
20
0
27 Nov 2023
AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
Zhixi Cai
Shreya Ghosh
Aman Pankaj Adatia
Munawar Hayat
Abhinav Dhall
Kalin Stefanov
83
37
0
26 Nov 2023
PortfolioMentor: Multimodal Generative AI Companion for Learning and
  Crafting Interactive Digital Art Portfolios
PortfolioMentor: Multimodal Generative AI Companion for Learning and Crafting Interactive Digital Art Portfolios
Tao Long
Weirui Peng
50
1
0
23 Nov 2023
Boosting Audio-visual Zero-shot Learning with Large Language Models
Boosting Audio-visual Zero-shot Learning with Large Language Models
Haoxing Chen
Yaohui Li
Yan Hong
Zizheng Huang
Zhuoer Xu
Zhangxuan Gu
Jun Lan
Huijia Zhu
Weiqiang Wang
VLM
78
1
0
21 Nov 2023
A Study on Altering the Latent Space of Pretrained Text to Speech Models
  for Improved Expressiveness
A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness
Mathias Vogel
DiffM
47
0
0
17 Nov 2023
The Song Describer Dataset: a Corpus of Audio Captions for
  Music-and-Language Evaluation
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation
Ilaria Manco
Benno Weck
Seungheon Doh
Minz Won
Yixiao Zhang
...
Philip Tovstogan
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
Juhan Nam
80
30
0
16 Nov 2023
EDMSound: Spectrogram Based Diffusion Models for Efficient and
  High-Quality Audio Synthesis
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
Ge Zhu
Yutong Wen
M. Carbonneau
Zhiyao Duan
DiffM
74
8
0
15 Nov 2023
Zero-shot audio captioning with audio-language model guidance and audio
  context keywords
Zero-shot audio captioning with audio-language model guidance and audio context keywords
Leonard Salewski
Stefan Fauth
A. Sophia Koepke
Zeynep Akata
49
11
0
14 Nov 2023
The taste of IPA: Towards open-vocabulary keyword spotting and forced
  alignment in any language
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
Jian Zhu
Changbing Yang
Farhan Samir
Jahurul Islam
77
6
0
14 Nov 2023
Music ControlNet: Multiple Time-varying Controls for Music Generation
Music ControlNet: Multiple Time-varying Controls for Music Generation
Shih-Lun Wu
Chris Donahue
Shinji Watanabe
Nicholas J. Bryan
DiffMMGen
104
61
0
13 Nov 2023
InstrumentGen: Generating Sample-Based Musical Instruments From Text
InstrumentGen: Generating Sample-Based Musical Instruments From Text
S. Nercessian
Johannes Imort
60
2
0
07 Nov 2023
FLAP: Fast Language-Audio Pre-training
FLAP: Fast Language-Audio Pre-training
Ching-Feng Yeh
Po-Yao Huang
Vasu Sharma
Shang-Wen Li
Gargi Ghosh
CLIPVLM
65
9
0
02 Nov 2023
In-Context Prompt Editing For Conditional Audio Generation
In-Context Prompt Editing For Conditional Audio Generation
Ernie Chang
Pin-Jie Lin
Yang Li
Sidd Srinivasan
Gaël Le Lan
David Kant
Yangyang Shi
Forrest N. Iandola
Vikas Chandra
DiffM
42
4
0
01 Nov 2023
Audio-Visual Instance Segmentation
Audio-Visual Instance Segmentation
Ruohao Guo
Yaru Chen
Yanyu Qi
Wenzhen Yue
Dantong Niu
...
Wenzhen Yue
Ji Shi
Qixun Wang
Peiliang Zhang
Buwen Liang
VLMVOS
122
2
0
28 Oct 2023
Content-based Controls For Music Large Language Modeling
Content-based Controls For Music Large Language Modeling
Liwei Lin
Gus Xia
Junyan Jiang
Yixiao Zhang
56
16
0
26 Oct 2023
Apollo: Zero-shot MultiModal Reasoning with Multiple Experts
Apollo: Zero-shot MultiModal Reasoning with Multiple Experts
Daniela Ben-David
Tzuf Paz-Argaman
Reut Tsarfaty
MoE
68
0
0
25 Oct 2023
On the Language Encoder of Contrastive Cross-modal Models
On the Language Encoder of Contrastive Cross-modal Models
Mengjie Zhao
Junya Ono
Zhi-Wei Zhong
Chieh-Hsin Lai
Yuhta Takida
Naoki Murata
Wei-Hsiang Liao
Takashi Shibuya
Hiromi Wakaki
Yuki Mitsufuji
VLM
58
0
0
20 Oct 2023
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative
  Editing
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing
Yixiao Zhang
Akira Maezawa
Gus Xia
Kazuhiko Yamamoto
Simon Dixon
77
17
0
19 Oct 2023
High-Fidelity Noise Reduction with Differentiable Signal Processing
High-Fidelity Noise Reduction with Differentiable Signal Processing
C. Steinmetz
Thomas Walther
Joshua D. Reiss
48
3
0
17 Oct 2023
Generation or Replication: Auscultating Audio Latent Diffusion Models
Generation or Replication: Auscultating Audio Latent Diffusion Models
Dimitrios Bralios
Gordon Wichern
François Germain
Zexu Pan
Sameer Khurana
Chiori Hori
Jonathan Le Roux
DiffM
60
6
0
16 Oct 2023
Extending Multi-modal Contrastive Representations
Extending Multi-modal Contrastive Representations
Zehan Wang
Ziang Zhang
Luping Liu
Yang Zhao
Haifeng Huang
Tao Jin
Zhou Zhao
63
7
0
13 Oct 2023
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language
  Models
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Sreyan Ghosh
Ashish Seth
Sonal Kumar
Utkarsh Tyagi
Chandra Kiran Reddy Evuru
S. Ramaneswaran
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLMVLMCoGe
122
26
0
12 Oct 2023
LLark: A Multimodal Instruction-Following Language Model for Music
LLark: A Multimodal Instruction-Following Language Model for Music
Josh Gardner
Simon Durand
Daniel Stoller
Rachel M. Bittner
AuLLM
76
16
0
11 Oct 2023
LanguageBind: Extending Video-Language Pretraining to N-modality by
  Language-based Semantic Alignment
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu
Bin Lin
Munan Ning
Yang Yan
Jiaxi Cui
...
Zongwei Li
Wancai Zhang
Zhifeng Li
Wei Liu
Liejie Yuan
VLMMLLM
173
229
0
03 Oct 2023
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Avamarie Brueggeman
Andrea Madotto
Zhaojiang Lin
Tushar Nagarajan
Matt Smith
...
Peyman Heidari
Yue Liu
Kavya Srinet
Babak Damavandi
Anuj Kumar
MLLM
84
94
0
27 Sep 2023
Online Active Learning For Sound Event Detection
Online Active Learning For Sound Event Detection
Mark Lindsey
Ankit Shah
Francis Kubala
R. M. Stern
33
0
0
25 Sep 2023
VoiceLDM: Text-to-Speech with Environmental Context
VoiceLDM: Text-to-Speech with Environmental Context
Yeong-Won Lee
In-won Yeon
Juhan Nam
Joon Son Chung
VLMDiffM
75
15
0
24 Sep 2023
Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics
  Description for Prompt-based Control
Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control
Aya Watanabe
Shinnosuke Takamichi
Yuki Saito
Wataru Nakata
Detai Xin
Hiroshi Saruwatari
52
11
0
24 Sep 2023
Weakly-supervised Automated Audio Captioning via text only training
Weakly-supervised Automated Audio Captioning via text only training
Theodoros Kouzelis
Vassilis Katsouros
CLIP
77
7
0
21 Sep 2023
A Large-scale Dataset for Audio-Language Representation Learning
A Large-scale Dataset for Audio-Language Representation Learning
Luoyi Sun
Xuenan Xu
Mengyue Wu
Weidi Xie
87
27
0
20 Sep 2023
Investigating Personalization Methods in Text to Music Generation
Investigating Personalization Methods in Text to Music Generation
Manos Plitsis
Theodoros Kouzelis
Georgios Paraskevopoulos
Vassilis Katsouros
Yannis Panagakis
DiffM
70
10
0
20 Sep 2023
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation
  with Consistency Distillation
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
Yatong Bai
Trung D. Q. Dang
Dung N. Tran
K. Koishida
Somayeh Sojoudi
DiffM
157
23
0
19 Sep 2023
Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
Subash Khanal
Srikumar Sastry
Aayush Dhakal
Nathan Jacobs
102
10
0
19 Sep 2023
RECAP: Retrieval-Augmented Audio Captioning
RECAP: Retrieval-Augmented Audio Captioning
Sreyan Ghosh
Sonal Kumar
Chandra Kiran Reddy Evuru
R. Duraiswami
Tianyi Zhou
VLM
100
21
0
18 Sep 2023
Zero- and Few-shot Sound Event Localization and Detection
Zero- and Few-shot Sound Event Localization and Detection
Kazuki Shimada
Kengo Uchida
Yuichiro Koyama
Takashi Shibuya
Shusuke Takahashi
Yuki Mitsufuji
Tatsuya Kawahara
79
4
0
17 Sep 2023
MusiLingo: Bridging Music and Text with Pre-trained Language Models for
  Music Captioning and Query Response
MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
Zihao Deng
Yi Ma
Yudong Liu
Rongchen Guo
Ge Zhang
Wenhu Chen
Wenhao Huang
Emmanouil Benetos
MLLMAuLLM
114
27
0
15 Sep 2023
Audio-free Prompt Tuning for Language-Audio Models
Audio-free Prompt Tuning for Language-Audio Models
Yiming Li
Xiangdong Wang
Hong Liu
CLIPVLM
71
10
0
15 Sep 2023
Retrieval-Augmented Text-to-Audio Generation
Retrieval-Augmented Text-to-Audio Generation
Yiitan Yuan
Haohe Liu
Xubo Liu
Qiushi Huang
Mark D. Plumbley
Wenwu Wang
RALM
80
28
0
14 Sep 2023
Training Audio Captioning Models without Audio
Training Audio Captioning Models without Audio
Soham Deshmukh
Benjamin Elizalde
Dimitra Emmanouilidou
Bhiksha Raj
Rita Singh
Huaming Wang
61
20
0
14 Sep 2023
Diffusion models for audio semantic communication
Diffusion models for audio semantic communication
Eleonora Grassucci
Christian Marinoni
Andrea Rodriguez
Danilo Comminiello
DiffM
52
24
0
13 Sep 2023
Natural Language Supervision for General-Purpose Audio Representations
Natural Language Supervision for General-Purpose Audio Representations
Benjamin Elizalde
Soham Deshmukh
Huaming Wang
AuLLMAI4TS
79
59
0
11 Sep 2023
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of
  SSWP
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
Jinzuomu Zhong
Yang Li
Hui Huang
Korin Richmond
Jie Liu
Zhiba Su
Jing Guo
Benlai Tang
Fengjie Zhu
62
1
0
11 Sep 2023
Parameter Efficient Audio Captioning With Faithful Guidance Using
  Audio-text Shared Latent Representation
Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation
A. Sridhar
Yinyi Guo
Erik M. Visser
Rehana Mahfuz
93
5
0
06 Sep 2023
Sparks of Large Audio Models: A Survey and Outlook
Sparks of Large Audio Models: A Survey and Outlook
S. Latif
Moazzam Shoukat
Fahad Shamshad
Muhammad Usama
Yi Ren
...
Wenwu Wang
Xulong Zhang
Roberto Togneri
Min Zhang
Björn W. Schuller
LM&MAAuLLM
183
39
0
24 Aug 2023
Emotion-Aligned Contrastive Learning Between Images and Music
Emotion-Aligned Contrastive Learning Between Images and Music
Shanti Stewart
Kleanthis Avramidis
Tiantian Feng
Shrikanth Narayanan
43
1
0
24 Aug 2023
A Survey of AI Music Generation Tools and Models
A Survey of AI Music Generation Tools and Models
Yueyue Zhu
Jared Baca
Banafsheh Rekabdar
Reza Rawassizadeh
MGen
102
18
0
24 Aug 2023
Music Understanding LLaMA: Advancing Text-to-Music Generation with
  Question Answering and Captioning
Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning
Shansong Liu
Atin Sakkeer Hussain
Chenshuo Sun
Yin Shan
MLLM
77
55
0
22 Aug 2023
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by
  Connecting Foundation Models
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
Heng Wang
Jianbo Ma
Santiago Pascual
Richard Cartwright
Weidong (Tom) Cai
VGen
110
43
0
18 Aug 2023
Previous
12345678
Next