Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models

Cardiovascular disease (CVD) risk prediction models are essential for identifying high-risk individuals and guiding preventive actions. However, existing models struggle with the challenges of real-world clinical practice because they oversimplify patient profiles, rely on rigid input schemas, and are sensitive to distribution shifts. We developed AdaCVD, an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on data from over half a million UK Biobank participants. In benchmark comparisons, AdaCVD surpasses established risk scores and standard machine learning approaches, achieving state-of-the-art performance. Crucially, for the first time, it addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data. In stratified analyses, it demonstrates robust performance across demographic, socioeconomic, and clinical subgroups, including underrepresented cohorts. AdaCVD offers a promising path toward more flexible, AI-driven clinical decision support tools suited to the realities of heterogeneous and dynamic healthcare environments.

@article{lübeck2025_2505.24655,
  title={Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models},
  author={Frederike Lübeck and Jonas Wildberger and Frederik Träuble and Maximilian Mordig and Sergios Gatidis and Andreas Krause and Bernhard Schölkopf},
  journal={arXiv preprint arXiv:2505.24655},
  year={2025}
}