
Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation

Main: 4 pages · 6 figures · Bibliography: 1 page
Abstract

Large Language Models (LLMs) have had a significant impact on healthcare, but they remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors LLMs for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method aggressively prunes irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and we evaluate the compressed models on medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50% compressed Gemma and the 67% compressed LLaMA3 models on a Jetson Orin Nano (18.7 W peak) and a Raspberry Pi 5 (6.3 W peak), achieving real-time, energy-efficient inference under hardware constraints.
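
To make the pruning step concrete, the sketch below shows saliency-guided width pruning of a single feed-forward pair in PyTorch. The saliency score (mean absolute activation over a batch of domain-specific calibration inputs) and all names here are illustrative assumptions, not the paper's exact formulation; real transformer FFNs (e.g., gated variants) need the same kept-index set applied to every coupled projection.

import torch
import torch.nn as nn

@torch.no_grad()
def neuron_saliency(layer: nn.Linear, domain_inputs: torch.Tensor) -> torch.Tensor:
    # Score each output neuron by its mean absolute activation on
    # domain-specific calibration inputs (a hypothetical saliency metric).
    activations = layer(domain_inputs)        # (batch, out_features)
    return activations.abs().mean(dim=0)      # (out_features,)

@torch.no_grad()
def prune_ffn_pair(up: nn.Linear, down: nn.Linear,
                   domain_inputs: torch.Tensor, keep_ratio: float):
    # Keep only the top keep_ratio fraction of hidden neurons; the
    # down-projection must drop the matching input columns.
    saliency = neuron_saliency(up, domain_inputs)
    k = max(1, int(keep_ratio * up.out_features))
    keep = saliency.topk(k).indices.sort().values

    pruned_up = nn.Linear(up.in_features, k, bias=up.bias is not None)
    pruned_up.weight.copy_(up.weight[keep])
    if up.bias is not None:
        pruned_up.bias.copy_(up.bias[keep])

    pruned_down = nn.Linear(k, down.out_features, bias=down.bias is not None)
    pruned_down.weight.copy_(down.weight[:, keep])
    if down.bias is not None:
        pruned_down.bias.copy_(down.bias)
    return pruned_up, pruned_down

# Example: prune a toy FFN to 50% of its hidden width.
up, down = nn.Linear(64, 256), nn.Linear(256, 64)
calib = torch.randn(32, 64)                   # stand-in for domain data
up, down = prune_ffn_pair(up, down, calib, keep_ratio=0.5)

After pruning, post-training quantization (e.g., to 8- or 4-bit weights) would be applied to the shrunken model to further cut the memory footprint before deployment.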

@article{kallakurik2025_2506.11105,
  title={Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation},
  author={Uttej Kallakurik and Edward Humes and Rithvik Jonna and Xiaomin Lin and Tinoosh Mohsenin},
  journal={arXiv preprint arXiv:2506.11105},
  year={2025}
}