AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

25 April 2023
Rongjie Huang
Mingze Li
Dongchao Yang
Jiatong Shi
Xuankai Chang
Zhenhui Ye
Yuning Wu
Zhiqing Hong
Jia-Bin Huang
Jinglin Liu
Yixiang Ren
Zhou Zhao
Shinji Watanabe
    LM&MA
    AuLLM

Papers citing "AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head"

50 / 158 papers shown
Title
WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
Shengpeng Ji
Tianle Liang
Yong Li
Jialong Zuo
Minghui Fang
...
Xize Cheng
Siqi Zheng
Jin Xu
Junyang Lin
Zhou Zhao
AuLLM
ALM
33
0
0
14 May 2025
SonicRAG : High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation
Yu-Ren Guo
Wen-Kai Tai
57
0
0
06 May 2025
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Zuwei Long
Yunhang Shen
Chaoyou Fu
Heting Gao
Lijiang Li
...
Jinlong Peng
Haoyu Cao
Ke Li
Rongrong Ji
Xing Sun
32
0
0
06 May 2025
A Survey of Foundation Model-Powered Recommender Systems: From Feature-Based, Generative to Agentic Paradigms
Chengkai Huang
Hongtao Huang
Tong Yu
Kaige Xie
Junda Wu
Shuai Zhang
Julian McAuley
Dietmar Jannach
Lina Yao
LRM
AI4CE
29
0
0
23 Apr 2025
Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training
Andrea Amaduzzi
Pierluigi Zama Ramirez
Giuseppe Lisanti
Samuele Salti
Luigi Di Stefano
34
0
0
18 Apr 2025
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
Sifei Li
Mining Tan
Feier Shen
Minyan Luo
Zijiao Yin
Fan Tang
W. Dong
Changsheng Xu
69
0
0
17 Apr 2025
Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation
Yan Rong
Shan Yang
Guangzhi Lei
Li Liu
28
1
0
15 Apr 2025
Spatial Audio Processing with Large Language Model on Wearable Devices
Ayushi Mishra
Yang Bai
Priyadarshan Narayanasamy
Nakul Garg
Nirupam Roy
30
0
0
11 Apr 2025
VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
Yuhao Wang
Heyang Liu
Ziyang Cheng
Ronghua Wu
Qunshan Gu
Yanfeng Wang
Yu Wang
172
0
0
05 Apr 2025
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Runnan Fang
Xiaobin Wang
Yuan Liang
Shuofei Qiao
Jialong Wu
...
N. Zhang
Yong Jiang
Pengjun Xie
Fei Huang
Hongyu Chen
LLMAG
74
0
0
04 Apr 2025
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
Shivam Mehta
Nebojsa Jojic
Hannes Gamper
31
0
0
28 Mar 2025
FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
Yupeng Cao
Haohang Li
Yangyang Yu
Shashidhar Reddy Javaji
Yueru He
...
Xiao-Yang Liu
K. P. Subbalakshmi
Meikang Qiu
Sophia Ananiadou
J. Nie
AuLLM
74
0
0
26 Mar 2025
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
Ziyu Yao
Xuxin Cheng
Zhiqi Huang
Lei Li
59
0
0
22 Mar 2025
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
Zhedong Zhang
Liang-Sheng Li
C. Yan
Chunshan Liu
Anton Van Den Hengel
Yuankai Qi
91
2
0
15 Mar 2025
From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
Kshitij Ambilduke
Ben Peters
Sonal Sannigrahi
Anil Keshwani
Tsz Kin Lam
Bruno Martins
Marcely Zanon Boito
André F. T. Martins
52
0
0
13 Mar 2025
Long-Video Audio Synthesis with Multi-Agent Collaboration
Yehang Zhang
Xinli Xu
Xiaojie Xu
L. Liu
Yuxiao Chen
DiffM
VGen
53
0
0
13 Mar 2025
ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
Siddhant Arora
Yifan Peng
Jiatong Shi
Jinchuan Tian
William Chen
...
Yosuke Kashiwagi
E. Tsunoo
Shuichiro Shimizu
Vaibhav Srivastav
Shinji Watanabe
42
0
0
11 Mar 2025
ReelWave: A Multi-Agent Framework Toward Professional Movie Sound Generation
Zixuan Wang
Chi-Keung Tang
Yu-Wing Tai
DiffM
VGen
63
0
0
10 Mar 2025
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
Umberto Cappellazzo
Minsu Kim
Stavros Petridis
57
0
0
09 Mar 2025
UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
Alexander H. Liu
Sang-gil Lee
Chao-Han Huck Yang
Yuan Gong
Yu-Chun Wang
James Glass
Rafael Valle
Bryan Catanzaro
SSL
52
0
0
02 Mar 2025
PodAgent: A Comprehensive Framework for Podcast Generation
Yujia Xiao
Lei He
Haohan Guo
Fenglong Xie
Tan Lee
171
0
0
01 Mar 2025
FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems
Borui Liao
Yulong Xu
Jiao Ou
Kaiyuan Yang
Weihua Jian
Pengfei Wan
Di Zhang
AuLLM
67
0
0
20 Feb 2025
Megrez-Omni Technical Report
Boxun Li
Yadong Li
Zehan Li
Congyi Liu
Weilin Liu
...
Dong Zhou
Yueqing Zhuang
Shengen Yan
Guohao Dai
Yansen Wang
51
0
0
19 Feb 2025
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
Mingni Tang
Jiajia Li
Lu Yang
Zhiqiang Zhang
Jinghao Tian
Zehan Li
Lefei Zhang
P. Wang
56
0
0
17 Feb 2025
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Ailin Huang
Boyong Wu
Bruce Wang
Chao Yan
Chen Hu
...
Tianyu Wang
Wenjin Deng
Wuxun Xie
Weipeng Ming
Wenqing He
AuLLM
77
9
0
17 Feb 2025
DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities
Xiangyu Lu
Wang Xu
Haoyu Wang
Hongyun Zhou
Haiyan Zhao
Conghui Zhu
T. Zhao
M. Yang
Mamba
AuLLM
66
0
0
16 Feb 2025
SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer
Zhengyan Sheng
Zhihao Du
Shiliang Zhang
Zhijie Yan
Yexin Yang
Zhenhua Ling
51
1
0
16 Feb 2025
Learning Musical Representations for Music Performance Question Answering
Xingjian Diao
Chunhui Zhang
Tingxuan Wu
Ming Cheng
Z. Ouyang
Weiyi Wu
Jiang Gui
73
7
0
10 Feb 2025
DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
Ke-Han Lu
Zhehuai Chen
Szu-Wei Fu
Chao-Han Huck Yang
Jagadeesh Balam
Boris Ginsburg
Yu-Te Wang
Hung-yi Lee
AuLLM
SyDa
118
5
0
28 Jan 2025
Audio-Language Models for Audio-Centric Tasks: A survey
Yi Su
Jisheng Bai
Qisheng Xu
Kele Xu
Yong Dou
AuLLM
99
2
0
28 Jan 2025
A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan
Seowung Leem
Kyle B. See
Joshua K. Wong
Shaoting Zhang
R. Fang
AI4CE
LM&MA
VLM
105
18
0
17 Jan 2025
Generative AI for Cel-Animation: A Survey
Yunlong Tang
Junjia Guo
Pinxin Liu
Zhiyuan Wang
Hang Hua
...
Jing Bi
Mingqian Feng
Xuzhao Li
Zeliang Zhang
Chenliang Xu
VGen
90
7
0
08 Jan 2025
Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
Tsz Kin Lam
Marco Gaido
Sara Papi
L. Bentivogli
Barry Haddow
36
0
0
04 Jan 2025
AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
Se Jin Park
Yeonju Kim
Hyeongseop Rha
Bella Godiva
Y. Ro
36
1
0
23 Dec 2024
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Changan Chen
Juze Zhang
S. K. Lakshmikanth
Yusu Fang
Ruizhi Shao
Gordon Wetzstein
L. Fei-Fei
Ehsan Adeli
VGen
82
3
0
13 Dec 2024
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
Shansong Liu
Atin Sakkeer Hussain
Qilong Wu
Chenshuo Sun
Ying Shan
AuLLM
69
3
0
09 Dec 2024
State-Space Large Audio Language Models
Saurabhchand Bhati
Yuan Gong
Leonid Karlinsky
Hilde Kuehne
Rogerio Feris
James Glass
99
0
0
24 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
69
2
0
14 Nov 2024
CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart
Bowen Zhao
Tianhao Cheng
Yuejie Zhang
Ying Cheng
Rui Feng
Xiaobo Zhang
LMTD
32
1
0
28 Oct 2024
Robust 3D Point Clouds Classification based on Declarative Defenders
Kaidong Li
Tianxiao Zhang
Cuncong Zhong
Zhenru Zhang
G. Wang
3DPC
45
1
0
13 Oct 2024
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities
Xin Zhang
Xiang Lyu
Zhihao Du
Qian Chen
Dong Zhang
...
Yuxuan Wang
Bin Zhang
Heng Lu
Yaqian Zhou
Xipeng Qiu
AuLLM
41
6
0
09 Oct 2024
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
Zixuan Wang
Yu-Wing Tai
Chi-Keung Tang
DiffM
VGen
LLMAG
49
4
0
04 Oct 2024
Self-Powered LLM Modality Expansion for Large Speech-Text Models
Tengfei Yu
Xuebo Liu
Zhiyi Hou
Liang Ding
Dacheng Tao
Min Zhang
32
0
0
04 Oct 2024
Recent Advances in Speech Language Models: A Survey
Wenqian Cui
Dianzhi Yu
Xiaoqi Jiao
Ziqiao Meng
Guangyan Zhang
Qichao Wang
Yiwen Guo
Irwin King
AuLLM
61
14
0
01 Oct 2024
Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
Yiming Chen
Xianghu Yue
Xiaoxue Gao
Chen Zhang
L. F. D’Haro
R. Tan
Haizhou Li
AuLLM
32
0
0
27 Sep 2024
Speechworthy Instruction-tuned Language Models
Hyundong Justin Cho
Nicolaas Jedema
Leonardo F. R. Ribeiro
Karishma Sharma
Pedro Szekely
Alessandro Moschitti
Ruben Janssen
Jonathan May
ALM
44
1
0
23 Sep 2024
A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models
Ryandhimas E. Zezario
Sabato Marco Siniscalchi
Hsin-Min Wang
Yu Tsao
34
2
0
16 Sep 2024
Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition
Chao-Han Huck Yang
Taejin Park
Yuan Gong
Yuanchao Li
Zhehuai Chen
...
E. Chng
Peter Bell
Catherine Lai
Shinji Watanabe
A. Stolcke
AuLLM
ELM
37
4
0
15 Sep 2024
From Experts to the Public: Governing Multimodal Language Models in Politically Sensitive Video Analysis
Tanusree Sharma
Yujin Potter
Zachary Kilhoffer
Yun Huang
Dawn Song
Yang Wang
59
3
0
15 Sep 2024
AppAgent v2: Advanced Agent for Flexible Mobile Interactions
Yanda Li
Chi Zhang
Wanqi Yang
Bin-Bin Fu
Pei Cheng
Xin Chen
Ling Chen
Yunchao Wei
LLMAG
LM&Ro
36
11
0
05 Aug 2024