The demand for machine learning (ML) model training on edge devices is escalating due to data privacy concerns and the need for personalized services. However, we observe that current on-device model training is hampered by under-utilization of on-device data, owing to low training throughput, limited storage, and varying data importance. To improve data resource utilization, we propose a two-stage data selection framework, {\sf Titan}, which selects the most important data batch from streaming data for model training with guaranteed efficiency and effectiveness. Specifically, in the first stage, {\sf Titan} filters out a candidate dataset with potentially high importance in a coarse-grained manner. In the second stage of fine-grained selection, we propose a theoretically optimal data selection strategy that identifies the data batch yielding the highest model performance improvement for the current training round. To further enhance time and resource efficiency, {\sf Titan} leverages a pipeline to co-execute data selection and model training, and avoids resource conflicts by exploiting idle computing resources. We evaluate {\sf Titan} on real-world edge devices and three representative edge computing tasks with diverse models and data modalities. Empirical results demonstrate that {\sf Titan} substantially reduces training time and increases final accuracy, with minor system overhead in terms of data processing delay, memory footprint, and energy consumption.
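To make the two-stage structure concrete, below is a minimal Python sketch of such a streaming selection loop. The abstract does not specify {\sf Titan}'s actual selection criteria, so this sketch substitutes common heuristics: stage 1 uses a forward-only per-sample loss as a cheap coarse importance proxy, and stage 2 uses a last-layer gradient-norm utility as a stand-in for the paper's theoretically optimal strategy. All function names (`coarse_filter`, `fine_select`, `training_step`) are hypothetical, and the pipelined co-execution of selection and training is omitted for brevity.

```python
# Illustrative sketch of two-stage streaming data selection; the proxies
# below are assumptions, not the paper's actual method.
import torch
import torch.nn.functional as F


def coarse_filter(model, stream_chunk, keep_ratio=0.25):
    """Stage 1: forward-only pass; keep the samples with the highest
    per-sample loss as a coarse proxy for importance."""
    xs, ys = stream_chunk
    with torch.no_grad():
        losses = F.cross_entropy(model(xs), ys, reduction="none")
    k = max(1, int(keep_ratio * len(losses)))
    idx = torch.topk(losses, k).indices
    return xs[idx], ys[idx]


def fine_select(model, candidates, batch_size):
    """Stage 2: score each candidate by the norm of the loss gradient
    w.r.t. the logits (softmax output minus one-hot label), a cheap
    stand-in for a sample's influence on the update, and pick the top
    batch."""
    xs, ys = candidates
    with torch.no_grad():
        logits = model(xs)
        probs = logits.softmax(dim=-1)
        one_hot = F.one_hot(ys, logits.size(-1)).float()
        g_norm = (probs - one_hot).norm(dim=-1)
    idx = torch.topk(g_norm, min(batch_size, len(g_norm))).indices
    return xs[idx], ys[idx]


def training_step(model, optimizer, batch):
    """Standard SGD step on the selected batch."""
    xs, ys = batch
    optimizer.zero_grad()
    loss = F.cross_entropy(model(xs), ys)
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: a linear classifier over a synthetic stream.
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(3):
    chunk = (torch.randn(64, 16), torch.randint(0, 4, (64,)))
    cand = coarse_filter(model, chunk)
    batch = fine_select(model, cand, batch_size=8)
    training_step(model, opt, batch)
```

The key design point this mirrors is the cost asymmetry: stage 1 is cheap enough to run over the full stream, so the more expensive stage-2 scoring only touches the small candidate pool.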
@article{gong2025_2505.16563,
  title={A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices},
  author={Chen Gong and Rui Xing and Zhenzhe Zheng and Fan Wu},
  journal={arXiv preprint arXiv:2505.16563},
  year={2025}
}