
Self-Play Q-learners Can Provably Collude in the Iterated Prisoner's Dilemma

Main: 9 pages · 7 figures · 5 tables · Bibliography: 3 pages · Appendix: 12 pages
Abstract

A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner's dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated "always defect" policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.
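The Pavlov (win-stay, lose-shift) policy mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's code; the payoff values and helper names (`pavlov`, `play`) are assumptions chosen for illustration, using the standard prisoner's dilemma ordering T > R > P > S:

```python
# Illustrative sketch (not the paper's implementation) of the Pavlov
# (win-stay, lose-shift) policy in the iterated prisoner's dilemma.
C, D = "C", "D"

# Standard PD payoffs with T > R > P > S; these specific values are an assumption.
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

def pavlov(my_last, opp_last):
    """Win-stay, lose-shift: repeat the last action after a 'win'
    (the opponent cooperated, yielding payoff R or T), switch otherwise."""
    if opp_last == C:                     # win: stay with the same action
        return my_last
    return C if my_last == D else D      # lose: shift to the other action

def play(policy_a, policy_b, rounds=10, start=(C, C)):
    """Run an iterated game between two memory-one policies."""
    a, b = start
    history = [start]
    for _ in range(rounds - 1):
        a, b = policy_a(a, b), policy_b(b, a)
        history.append((a, b))
    return history

# Two Pavlov players recover mutual cooperation after a defection.
print(play(pavlov, pavlov, rounds=5, start=(D, C)))
```

Starting from the asymmetric state (D, C), the pair passes through one round of mutual defection and then locks into mutual cooperation, which is the self-correcting behavior that distinguishes Pavlov from "always defect."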

@article{bertrand2025_2312.08484,
  title={Self-Play Q-learners Can Provably Collude in the Iterated Prisoner's Dilemma},
  author={Quentin Bertrand and Juan Duque and Emilio Calvano and Gauthier Gidel},
  journal={arXiv preprint arXiv:2312.08484},
  year={2025}
}