MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization. International Conference on Learning Representations (ICLR), 2024
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions. International Conference on Learning Representations (ICLR), 2024