ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2305.04160
  4. Cited By
X-LLM: Bootstrapping Advanced Large Language Models by Treating
  Multi-Modalities as Foreign Languages
v1v2v3 (latest)

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

7 May 2023
Feilong Chen
Minglun Han
Haozhi Zhao
Qingyang Zhang
Jing Shi
Shuang Xu
Bo Xu
    MLLM
ArXiv (abs)PDFHTML

Papers citing "X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages"

50 / 95 papers shown
Title
Multimodal Representation Alignment for Cross-modal Information Retrieval
Fan Xu
Luis A. Leiva
19
0
0
10 Jun 2025
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
Pooneh Mousavi
Yingzhi Wang
Mirco Ravanelli
Cem Subakan
AuLLM
77
0
0
26 May 2025
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
Jacob A. Hansen
Wei Lin
Junmo Kang
M. Jehanzeb Mirza
Hongyin Luo
Rogerio Feris
Alan Ritter
James R. Glass
Leonid Karlinsky
VLM
258
0
0
23 May 2025
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
Zirui Song
Qian Jiang
Mingxuan Cui
Mingzhe Li
Lang Gao
...
Yanbo Wang
Chenxi Wang
Guangxian Ouyang
Zhenhao Chen
Xiuying Chen
AuLLMAAML
98
0
0
21 May 2025
Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling
Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling
Junlin Li
Guodong DU
Jing Li
Sim Kuan Goh
Wenya Wang
...
Fangming Liu
Jing Li
Saleh Alharbi
Daojing He
Min Zhang
MoMeCLL
141
1
0
21 May 2025
AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care
AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care
Md Asaduzzaman Jabin
Hanqi Jiang
Yuchen Li
Patrick Kaggwa
Eugene Douglass
Juliet N. Sekandi
Tianming Liu
LM&MA
157
0
0
01 May 2025
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye
Hao Tang
LM&MA
189
0
0
29 Apr 2025
SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation
SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation
Keqi Deng
Wenxi Chen
Xie Chen
P. Woodland
122
0
0
22 Apr 2025
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
Efthymios Georgiou
Vassilis Katsouros
Yannis Avrithis
Alexandros Potamianos
100
1
0
15 Apr 2025
On The Landscape of Spoken Language Models: A Comprehensive Survey
On The Landscape of Spoken Language Models: A Comprehensive Survey
Siddhant Arora
Kai-Wei Chang
Chung-Ming Chien
Yifan Peng
Haibin Wu
Yossi Adi
Emmanuel Dupoux
Hung-yi Lee
Karen Livescu
Shinji Watanabe
158
14
0
11 Apr 2025
Think When You Need: Self-Adaptive Chain-of-Thought Learning
Think When You Need: Self-Adaptive Chain-of-Thought Learning
Junjie Yang
Ke Lin
Xing Yu
ReLMLRMAI4CE
125
2
0
04 Apr 2025
SocialGen: Modeling Multi-Human Social Interaction with Language Models
SocialGen: Modeling Multi-Human Social Interaction with Language Models
Heng Yu
Juze Zhang
Changan Chen
Tiange Xiang
Yusu Fang
Juan Carlos Niebles
Ehsan Adeli
VGen
93
1
0
28 Mar 2025
Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
Henglyu Liu
Andong Chen
Kehai Chen
X. Bai
M. Zhong
Yuan Qiu
Min Zhang
78
0
0
13 Mar 2025
X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
Jian Ma
Qirong Peng
Xu Guo
Chen Chen
H. Lu
Zhenyu Yang
VLM
158
1
0
08 Mar 2025
Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs
Dingkun Zhang
Shuhan Qi
Xinyu Xiao
Kehai Chen
Xuan Wang
CLLMoMe
119
0
0
08 Mar 2025
Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text
Guotao Liang
Baoquan Zhang
Zhiyuan Wen
Junteng Zhao
Yunming Ye
Kola Ye
Yao He
96
0
0
03 Mar 2025
Retrieval-Augmented Speech Recognition Approach for Domain Challenges
Retrieval-Augmented Speech Recognition Approach for Domain Challenges
Peng Shen
Xugang Lu
Hisashi Kawai
RALM
105
0
0
24 Feb 2025
OneLLM: One Framework to Align All Modalities with Language
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
Dahua Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
254
134
0
10 Jan 2025
Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
Tsz Kin Lam
Marco Gaido
Sara Papi
L. Bentivogli
Barry Haddow
128
0
0
04 Jan 2025
Unveiling Visual Perception in Language Models: An Attention Head
  Analysis Approach
Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
Jing Bi
Junjia Guo
Yunlong Tang
Lianggong Wen
Zhang Liu
Chenliang Xu
52
6
0
24 Dec 2024
A Review of Multimodal Explainable Artificial Intelligence: Past,
  Present and Future
A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future
Shilin Sun
Wenbin An
Feng Tian
Fang Nan
Qidong Liu
Jing Liu
N. Shah
Ping Chen
162
6
0
18 Dec 2024
Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models
Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models
Wanqi Yang
Yongqian Li
Meng Fang
Yunchao Wei
Dinesh Manocha
AAMLELMAuLLM
121
1
0
22 Nov 2024
New Emerged Security and Privacy of Pre-trained Model: a Survey and
  Outlook
New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook
Meng Yang
Tianqing Zhu
Chi Liu
Wanlei Zhou
Shui Yu
Philip S. Yu
AAMLELMPILM
112
1
0
12 Nov 2024
TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language
  Models
TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models
Jonathan Fhima
Elad Ben Avraham
Oren Nuriel
Yair Kittenplon
Roy Ganz
Aviad Aberdam
Ron Litman
VLM
69
1
0
07 Nov 2024
MME-Finance: A Multimodal Finance Benchmark for Expert-level
  Understanding and Reasoning
MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
Ziliang Gan
Yu Lu
D. Zhang
Haohan Li
Che Liu
...
Haipang Wu
Chaoyou Fu
Z. Xu
Rongjunchen Zhang
Yong Dai
108
13
0
05 Nov 2024
Vision Search Assistant: Empower Vision-Language Models as Multimodal
  Search Engines
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
Zhixin Zhang
Yiyuan Zhang
Xiaohan Ding
Xiangyu Yue
75
4
0
28 Oct 2024
Self-Powered LLM Modality Expansion for Large Speech-Text Models
Self-Powered LLM Modality Expansion for Large Speech-Text Models
Tengfei Yu
Xuebo Liu
Zhiyi Hou
Liang Ding
Dacheng Tao
Min Zhang
65
1
0
04 Oct 2024
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task
  Learning Via Connector-MoE
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
Xun Zhu
Ying Hu
Fanbin Mo
Miao Li
Ji Wu
127
9
0
26 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal
  Reasoning with Large Language Models
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
152
2
0
19 Sep 2024
From Text to Multimodality: Exploring the Evolution and Impact of Large
  Language Models in Medical Practice
From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice
Qian Niu
Keyu Chen
Ming Li
Pohsun Feng
Ziqian Bi
...
Junyu Liu
Benji Peng
Tianyang Wang
Yunze Wang
Silin Chen
LM&MA
109
7
0
14 Sep 2024
Affective Computing Has Changed: The Foundation Model Disruption
Affective Computing Has Changed: The Foundation Model Disruption
Björn Schuller
Adria Mallol-Ragolta
Alejandro Pena Almansa
Iosif Tsangko
Mostafa M. Amin
A. Semertzidou
Lukas Christ
Shahin Amiriparian
113
1
0
13 Sep 2024
NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training
NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training
Minglun Han
Ye Bai
Chen Shen
Youjia Huang
Mingkun Huang
Zehua Lin
Linhao Dong
Lu Lu
Yuxuan Wang
76
1
0
13 Sep 2024
A Comprehensive Review of Multimodal Large Language Models: Performance
  and Challenges Across Different Tasks
A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks
Jiaqi Wang
Hanqi Jiang
Yi-Hsueh Liu
Chong Ma
Xu-Yao Zhang
...
Xin Zhang
Wei Zhang
Dinggang Shen
Tianming Liu
Shu Zhang
VLMAI4TS
111
36
0
02 Aug 2024
SynesLM: A Unified Approach for Audio-visual Speech Recognition and
  Translation via Language Model and Synthetic Data
SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data
Yichen Lu
Álvaro Huertas-García
Xuankai Chang
Hengwei Bian
Soumi Maiti
Shinji Watanabe
93
2
0
01 Aug 2024
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based
  Speech Recognition
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Ye Bai
Jingping Chen
Jitong Chen
Wei Chen
Zhuo Chen
...
Wanyi Zhang
Yang Zhang
Yawei Zhang
Yijie Zheng
Ming Zou
AuLLM
121
28
0
05 Jul 2024
Controlling Whisper: Universal Acoustic Adversarial Attacks to Control
  Speech Foundation Models
Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models
Vyas Raina
Mark Gales
AAML
67
2
0
05 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and
  Time
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLMMLLM
98
15
0
01 Jul 2024
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
Chenming Zhu
Tai Wang
Wenwei Zhang
Kai Chen
Xihui Liu
ReLMLRM
112
24
0
01 Jul 2024
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Guangzhi Sun
Wenyi Yu
Changli Tang
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Yuxuan Wang
Chao Zhang
95
35
0
22 Jun 2024
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible
  Acoustic Reception and Reaction
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction
Haoqiu Yan
Yongxin Zhu
Kai Zheng
Bing Liu
Haoyu Cao
Deqiang Jiang
Linli Xu
AuLLM
90
5
0
18 Jun 2024
Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in
  Zero and Few-shot Learning
Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning
Keqi Deng
Guangzhi Sun
Phil Woodland
VLM
67
4
0
01 Jun 2024
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge
  Distillation
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
Chen Wang
Minpeng Liao
Zhongqiang Huang
Jiajun Zhang
ALMAuLLM
92
3
0
29 May 2024
The Evolution of Multimodal Model Architectures
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Aman Chadha
Eugenio Culurciello
111
18
0
28 May 2024
A Survey of Multimodal Large Language Model from A Data-centric
  Perspective
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
139
45
0
26 May 2024
User-Friendly Customized Generation with Multi-Modal Prompts
User-Friendly Customized Generation with Multi-Modal Prompts
Linhao Zhong
Yan Hong
Wentao Chen
Binglin Zhou
Yiyi Zhang
Jianfu Zhang
Liqing Zhang
DiffM
75
1
0
26 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
335
54
0
23 May 2024
Listen Again and Choose the Right Answer: A New Paradigm for Automatic
  Speech Recognition with Large Language Models
Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models
Yuchen Hu
Chen Chen
Chengwei Qin
Qiushi Zhu
Eng Siong Chng
Ruizhe Li
AuLLMKELM
98
7
0
16 May 2024
HRVDA: High-Resolution Visual Document Assistant
HRVDA: High-Resolution Visual Document Assistant
Chaohu Liu
Kun Yin
Haoyu Cao
Xinghua Jiang
Xin Li
Yinsong Liu
Deqiang Jiang
Xing Sun
Linli Xu
VLM
102
26
0
10 Apr 2024
Facial Affective Behavior Analysis with Instruction Tuning
Facial Affective Behavior Analysis with Instruction Tuning
Yifan Li
Anh Dao
Wentao Bao
Zhen Tan
Tianlong Chen
Huan Liu
Yu Kong
CVBM
116
15
0
07 Apr 2024
RegionGPT: Towards Region Understanding Vision Language Model
RegionGPT: Towards Region Understanding Vision Language Model
Qiushan Guo
Shalini De Mello
Hongxu Yin
Wonmin Byeon
Ka Chun Cheung
Yizhou Yu
Ping Luo
Sifei Liu
VLM
100
37
0
04 Mar 2024
12
Next