
MoKD: Multi-Task Optimization for Knowledge Distillation

Main: 8 pages · Appendix: 4 pages · Bibliography: 2 pages · 11 figures · 5 tables
Abstract

Compact models can be trained effectively through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in KD are: 1) balancing learning from the teacher's guidance against the task objective, and 2) handling the disparity in knowledge representation between teacher and student models. To address these, we propose Multi-Task Optimization for Knowledge Distillation (MoKD). MoKD tackles two main gradient issues: a) Gradient Conflicts, where task-specific and distillation gradients are misaligned, and b) Gradient Dominance, where one objective's gradient dominates and causes imbalance. MoKD reformulates KD as a multi-objective optimization problem, enabling a better balance between objectives. It also introduces a subspace learning framework that projects feature representations into a high-dimensional space, improving knowledge transfer. Extensive experiments on image classification with the ImageNet-1K dataset and object detection with the COCO dataset show that MoKD outperforms existing methods, achieving state-of-the-art performance with greater efficiency. To the best of our knowledge, MoKD-trained models also achieve state-of-the-art performance when compared with models trained from scratch.
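
As a rough illustration of the gradient issues described above, the sketch below computes the task and distillation gradients of a student model, flags a conflict (negative cosine similarity) and a dominance ratio, and combines the two losses with a min-norm weighting. This is only an assumed stand-in for MoKD's optimizer: the weighting shown is the standard two-objective MGDA closed form, and the helper names (`flat_grad`, `balanced_kd_step`, `task_loss`, `distill_loss`) are hypothetical, not code from the paper.

```python
# Minimal sketch (PyTorch), NOT the paper's implementation: it illustrates how
# gradient conflicts and gradient dominance between the task loss and the
# distillation loss can be diagnosed and balanced with a min-norm weighting.
import torch
import torch.nn.functional as F


def flat_grad(loss, params):
    """Flatten d(loss)/d(params) into a single vector (graph kept for reuse)."""
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])


def balanced_kd_step(task_loss, distill_loss, params, optimizer):
    """One training step balancing the two objectives (hypothetical helper)."""
    g_task = flat_grad(task_loss, params)
    g_kd = flat_grad(distill_loss, params)

    # Gradient conflict: the two gradients point in opposing directions.
    conflict = F.cosine_similarity(g_task, g_kd, dim=0) < 0
    # Gradient dominance: one gradient is much larger than the other.
    dominance = (g_kd.norm() / (g_task.norm() + 1e-12)).item()

    # Two-objective min-norm weighting (MGDA closed form), a common way to
    # trade off conflicting gradients; MoKD's own scheme may differ.
    diff = g_task - g_kd
    alpha = ((g_kd - g_task).dot(g_kd) / diff.dot(diff).clamp_min(1e-12)).clamp(0.0, 1.0)

    optimizer.zero_grad()
    (alpha * task_loss + (1.0 - alpha) * distill_loss).backward()
    optimizer.step()
    return alpha.item(), bool(conflict), dominance
```

In the same spirit, the subspace learning component could be approximated by a small projection head (for example, a linear layer) that maps student features into the space where the teacher's features are compared, so that `distill_loss` is computed on the projected representations rather than on raw student features.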

@article{hayder2025_2505.08170,
  title={MoKD: Multi-Task Optimization for Knowledge Distillation},
  author={Zeeshan Hayder and Ali Cheraghian and Lars Petersson and Mehrtash Harandi},
  journal={arXiv preprint arXiv:2505.08170},
  year={2025}
}