The use of machine learning and AI on electronic health records (EHRs) holds substantial potential for clinical insight. However, this approach faces challenges due to data heterogeneity, sparsity, temporal misalignment, and limited labeled outcomes. In this context, we leverage a linked EHR dataset of approximately one million de-identified individuals from Bristol, North Somerset, and South Gloucestershire, UK, to characterize urinary tract infections (UTIs). We implemented a data pre-processing and curation pipeline that transforms the raw EHR data into a structured format suitable for developing predictive models focused on data fairness, accountability and transparency. Given the limited availability and biases of ground truth UTI outcomes, we introduce a UTI risk estimation framework informed by clinical expertise to estimate UTI risk across individual patient timelines. Pairwise XGBoost models are trained using this framework to differentiate UTI risk categories with explainable AI techniques applied to identify key predictors and support interpretability. Our findings reveal differences in clinical and demographic predictors across risk groups. While this study highlights the potential of AI-driven insights to support UTI clinical decision-making, further investigation of patient sub-strata and extensive validation are needed to ensure robustness and applicability in clinical practice.
View on arXiv@article{dai2025_2411.17645, title={ Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset }, author={ Yujie Dai and Brian Sullivan and Axel Montout and Amy Dillon and Chris Waller and Peter Acs and Rachel Denholm and Philip Williams and Alastair D Hay and Raul Santos-Rodriguez and Andrew Dowsey }, journal={arXiv preprint arXiv:2411.17645}, year={ 2025 } }