Understanding the Capabilities and Limitations of Weak-to-Strong Generalization

3 February 2025
Wei Yao
Wenkai Yang
Ziyi Wang
Yankai Lin
Yong Liu
Main: 8 pages · 4 figures · Bibliography: 5 pages · Appendix: 21 pages
Abstract

Weak-to-strong generalization, where weakly supervised strong models outperform their weaker teachers, offers a promising approach to aligning superhuman models with human values. To deepen the understanding of this approach, we provide theoretical insights into its capabilities and limitations. First, in the classification setting, we establish upper and lower generalization error bounds for the strong model, identifying the primary limitations as stemming from the weak model's generalization error and the optimization objective itself. Additionally, we derive lower and upper bounds on the calibration error of the strong model. These theoretical bounds reveal two critical insights: (1) the weak model should demonstrate strong generalization performance and maintain well-calibrated predictions, and (2) the strong model's training process must strike a careful balance, as excessive optimization could undermine its generalization capability by over-relying on the weak supervision signals. Finally, in the regression setting, we extend the work of Charikar et al. (2024) to a loss function based on Kullback-Leibler (KL) divergence, offering guarantees that the strong student can outperform its weak teacher by at least the magnitude of their disagreement. We conduct extensive experiments to validate our theory.
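
The weak-to-strong training setup that the abstract's bounds analyze, i.e. fitting a strong student to a weak teacher's predictions under a KL-divergence objective, can be sketched in a few lines. The following is a minimal illustrative sketch rather than the authors' implementation: the synthetic task, the linear teacher, the MLP student, and the fixed optimization budget (echoing the warning that excessive optimization on weak signals can hurt generalization) are all assumed here for illustration.

# Illustrative sketch (not the paper's code) of weak-to-strong training:
# a small "weak" teacher is trained on ground-truth labels, and a larger
# "strong" student is then fit only to the teacher's soft predictions
# via a KL-divergence loss. All model and data choices are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic task: XOR-like labels that a linear model cannot fit well,
# so the linear teacher is genuinely "weak".
X = torch.randn(1024, 16)
y = (X[:, 0] * X[:, 1] > 0).long()

weak_teacher = nn.Linear(16, 2)                      # low-capacity teacher
strong_student = nn.Sequential(                      # higher-capacity student
    nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2)
)

# Step 1: train the weak teacher on the ground-truth labels.
opt_w = torch.optim.Adam(weak_teacher.parameters(), lr=1e-2)
for _ in range(300):
    opt_w.zero_grad()
    F.cross_entropy(weak_teacher(X), y).backward()
    opt_w.step()

# Step 2: train the strong student only on the teacher's soft predictions,
# minimizing KL(teacher || student). The fixed, modest step budget reflects the
# point that over-optimizing on weak supervision signals can hurt generalization.
with torch.no_grad():
    weak_probs = F.softmax(weak_teacher(X), dim=-1)

opt_s = torch.optim.Adam(strong_student.parameters(), lr=1e-3)
for _ in range(200):
    opt_s.zero_grad()
    student_log_probs = F.log_softmax(strong_student(X), dim=-1)
    F.kl_div(student_log_probs, weak_probs, reduction="batchmean").backward()
    opt_s.step()

# Evaluate both models against the ground-truth labels (the student never saw them).
with torch.no_grad():
    acc_weak = (weak_teacher(X).argmax(-1) == y).float().mean().item()
    acc_strong = (strong_student(X).argmax(-1) == y).float().mean().item()
print(f"weak teacher accuracy:   {acc_weak:.3f}")
print(f"strong student accuracy: {acc_strong:.3f}")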

@article{yao2025_2502.01458,
  title={The Capabilities and Limitations of Weak-to-Strong Generalization: Generalization and Calibration},
  author={Wei Yao and Wenkai Yang and Gengze Xu and Ziqiao Wang and Yankai Lin and Yong Liu},
  journal={arXiv preprint arXiv:2502.01458},
  year={2025}
}