The demand for machine learning (ML) model training on edge devices is escalating due to data privacy concerns and the need for personalized services. However, we observe that current on-device model training is hampered by under-utilization of on-device data, owing to low training throughput, limited storage, and varying data importance. To improve data resource utilization, we propose a two-stage data selection framework, {\sf Titan}, which selects the most important data batch from streaming data for model training with guaranteed efficiency and effectiveness. Specifically, in the first stage, {\sf Titan} filters out a candidate dataset with potentially high importance in a coarse-grained manner. In the second stage of fine-grained selection, we propose a theoretically optimal data selection strategy that identifies the data batch yielding the highest model performance improvement for the current training round. To further enhance time and resource efficiency, {\sf Titan} leverages a pipeline to co-execute data selection and model training, and avoids resource conflicts by exploiting idle computing resources. We evaluate {\sf Titan} on real-world edge devices and three representative edge computing tasks with diverse models and data modalities. Empirical results demonstrate that {\sf Titan} substantially reduces training time and increases final accuracy, with minor system overhead in terms of data processing delay, memory footprint, and energy consumption.
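To make the two-stage structure concrete, below is a minimal Python sketch of such a streaming selection loop. The abstract does not specify {\sf Titan}'s actual selection criteria, so this sketch substitutes common heuristics: stage 1 uses a forward-only per-sample loss as a cheap coarse importance proxy, and stage 2 uses a last-layer gradient-norm utility as a stand-in for the paper's theoretically optimal strategy. All function names (`coarse_filter`, `fine_select`, `training_step`) are hypothetical, and the pipelined co-execution of selection and training is omitted for brevity.

```python
# Illustrative sketch of two-stage streaming data selection; the proxies
# below are assumptions, not the paper's actual method.
import torch
import torch.nn.functional as F


def coarse_filter(model, stream_chunk, keep_ratio=0.25):
    """Stage 1: forward-only pass; keep the samples with the highest
    per-sample loss as a coarse proxy for importance."""
    xs, ys = stream_chunk
    with torch.no_grad():
        losses = F.cross_entropy(model(xs), ys, reduction="none")
    k = max(1, int(keep_ratio * len(losses)))
    idx = torch.topk(losses, k).indices
    return xs[idx], ys[idx]


def fine_select(model, candidates, batch_size):
    """Stage 2: score each candidate by the norm of the loss gradient
    w.r.t. the logits (softmax output minus one-hot label), a cheap
    stand-in for a sample's influence on the update, and pick the top
    batch."""
    xs, ys = candidates
    with torch.no_grad():
        logits = model(xs)
        probs = logits.softmax(dim=-1)
        one_hot = F.one_hot(ys, logits.size(-1)).float()
        g_norm = (probs - one_hot).norm(dim=-1)
    idx = torch.topk(g_norm, min(batch_size, len(g_norm))).indices
    return xs[idx], ys[idx]


def training_step(model, optimizer, batch):
    """Standard SGD step on the selected batch."""
    xs, ys = batch
    optimizer.zero_grad()
    loss = F.cross_entropy(model(xs), ys)
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: a linear classifier over a synthetic stream.
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(3):
    chunk = (torch.randn(64, 16), torch.randint(0, 4, (64,)))
    cand = coarse_filter(model, chunk)
    batch = fine_select(model, cand, batch_size=8)
    training_step(model, opt, batch)
```

The key design point this mirrors is the cost asymmetry: stage 1 is cheap enough to run over the full stream, so the more expensive stage-2 scoring only touches the small candidate pool.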
@article{gong2025_2505.16563,
  title={A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices},
  author={Chen Gong and Rui Xing and Zhenzhe Zheng and Fan Wu},
  journal={arXiv preprint arXiv:2505.16563},
  year={2025}
}