Embedding Radiomics into Vision Transformers for Multimodal Medical Image Classification

Abstract
Background: Deep learning has significantly advanced medical image analysis, with Vision Transformers (ViTs) offering a powerful alternative to convolutional models by capturing long-range dependencies through self-attention. However, ViTs are inherently data-intensive and lack domain-specific inductive biases, which limits their applicability in medical imaging. In contrast, radiomics provides interpretable, handcrafted descriptors of tissue heterogeneity but suffers from limited scalability and poor integration into end-to-end learning frameworks. In this work, we propose the Radiomics-Embedded Vision Transformer (RE-ViT), which combines radiomic features with data-driven visual embeddings within a ViT backbone.
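The core idea of combining handcrafted radiomic features with learned patch embeddings inside a ViT backbone could be sketched as follows. This is a minimal illustration under assumed choices, not the paper's actual design: the dimensions, the projection weights, and the fusion scheme (projecting the radiomics vector to a single token prepended to the patch sequence) are all hypothetical.

```python
import numpy as np

# Hedged sketch of RE-ViT-style fusion: a handcrafted radiomics vector is
# projected into the ViT embedding space and joined with the patch tokens,
# so self-attention can mix both representations. All shapes are assumed.

rng = np.random.default_rng(0)

embed_dim = 64            # assumed ViT embedding width
num_patches = 16          # e.g. a 4x4 grid of image patches
num_radiomic_feats = 107  # assumed size of the handcrafted radiomics vector

# Data-driven visual embeddings: one vector per image patch.
patch_embeddings = rng.standard_normal((num_patches, embed_dim))

# Handcrafted radiomic descriptors computed for the same image/ROI.
radiomic_features = rng.standard_normal(num_radiomic_feats)

# A (normally learnable) linear projection maps radiomics into the
# embedding space; here it is random for illustration only.
W_proj = rng.standard_normal((num_radiomic_feats, embed_dim)) * 0.02
radiomic_token = radiomic_features @ W_proj  # shape: (embed_dim,)

# Prepend the radiomic token to the patch sequence (a class token, omitted
# here, would typically be prepended as well) before the transformer blocks.
tokens = np.vstack([radiomic_token[None, :], patch_embeddings])
print(tokens.shape)  # (17, 64): one radiomic token + 16 patch tokens
```

In this sketch the transformer then attends over the combined sequence, letting the radiomic token interact with every patch embedding; whether RE-ViT fuses at the token level or elsewhere in the backbone is not specified by the abstract.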
@article{yang2025_2504.10916,
  title={Embedding Radiomics into Vision Transformers for Multimodal Medical Image Classification},
  author={Zhenyu Yang and Haiming Zhu and Rihui Zhang and Haipeng Zhang and Jianliang Wang and Chunhao Wang and Minbin Chen and Fang-Fang Yin},
  journal={arXiv preprint arXiv:2504.10916},
  year={2025}
}