Less is More: Undertraining Experts Improves Model Upcycling

17 June 2025
Stefan Horoi
Guy Wolf
Eugene Belilovsky
Gintare Karolina Dziugaite
MoMe · MoE
Main: 9 pages · 7 figures · 2 tables · Bibliography: 6 pages · Appendix: 1 page
Abstract

Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. To leverage these resources, numerous model upcycling methods have emerged, enabling the reuse of fine-tuned models in multi-task systems. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then upcycled into more general-purpose systems. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model upcycling. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance, both for fully fine-tuned and LoRA-adapted models, and to worse downstream results when LoRA adapters are upcycled into MoE layers. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps and are subsequently forgotten during merging. Finally, we demonstrate that a task-dependent aggressive early stopping strategy can significantly improve upcycling performance.
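As context for the merging setting the abstract describes, below is a minimal sketch of one common upcycling route: averaging the task vectors (fine-tuned weights minus base weights) of several experts into the base model. The function name `merge_experts`, the plain averaging scheme, and the `alpha` scaling factor are illustrative assumptions for this sketch, not the paper's specific method.

```python
import torch

def merge_experts(base_state, expert_states, alpha=1.0):
    """Sketch of task-vector averaging (an assumption for illustration,
    not the paper's exact procedure).

    base_state:    state_dict of the pretrained base model
    expert_states: list of state_dicts from fine-tuned expert models
    alpha:         scaling applied to the averaged task vector
    """
    merged = {}
    for name, base_param in base_state.items():
        if not base_param.is_floating_point():
            # Copy non-float buffers (e.g., integer counters) unchanged.
            merged[name] = base_param.clone()
            continue
        # Per-expert deltas relative to the base weights ("task vectors").
        deltas = [expert[name] - base_param for expert in expert_states]
        avg_delta = torch.stack(deltas).mean(dim=0)
        merged[name] = base_param + alpha * avg_delta
    return merged
```

In this setting, the paper's finding suggests that the `expert_states` produced by aggressively early-stopped fine-tuning merge better than those from long fine-tuning runs that push each expert to its best individual performance.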

@article{horoi2025_2506.14126,
  title={Less is More: Undertraining Experts Improves Model Upcycling},
  author={Stefan Horoi and Guy Wolf and Eugene Belilovsky and Gintare Karolina Dziugaite},
  journal={arXiv preprint arXiv:2506.14126},
  year={2025}
}