
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Chandler Smith
Marwa Abdulhai
Manfred Diaz
Marko Tesic
Rakshit S. Trivedi
Alexander Sasha Vezhnevets
Lewis Hammond
Jesse Clifton
Minsuk Chang
Edgar A. Duéñez-Guzmán
John P. Agapiou
Jayd Matyas
Danny Karmon
Akash Kundu
Aliaksei Korshuk
Ananya Ananya
Arrasy Rahman
Avinaash Anand Kulandaivel
Bain McHale
Beining Zhang
Buyantuev Alexander
Carlos Saith Rodriguez Rojas
Caroline Wang
Chetan Talele
Chenao Liu
Chichen Lin
Diana Riazi
Di Yang Shi
Emanuel Tewolde
Elizaveta Tennant
Fangwei Zhong
Fuyang Cui
Gang Zhao
Gema Parreño Piqueras
Hyeonggeun Yun
Ilya Makarov
Jiaxun Cui
Jebish Purbey
Jim Dilkes
Jord Nguyen
Lingyun Xiao
Luis Felipe Giraldo
Manuela Chacon-Chamorro
Manuel Sebastian Rios Beltran
Marta Emili García Segura
Mengmeng Wang
Mogtaba Alim
Nicanor Quijano
Nico Schiavone
Olivia Macmillan-Scott
Oswaldo Peña
Peter Stone
Ram Mohan Rao Kadiyala
Rolando Fernandez
Ruben Manrique
Sunjia Lu
Sheila A. McIlraith
Shamika Dhuri
Shuqing Shi
Siddhant Gupta
Sneheel Sarangi
Sriram Ganapathi Subramanian
Taehun Cha
Toryn Q. Klassen
Wenming Tu
Weijian Fan
Wu Ruiyang
Xue Feng
Yali Du
Yang Liu
Yiding Wang
Yipeng Kang
Yoonchang Sung
Yuxuan Chen
Zhaowei Zhang
Zhihan Wang
Zhiqiang Wu
Ziang Chen
Zilong Zheng
Zixia Jia
Ziyan Wang
Dylan Hadfield-Menell
Natasha Jaques
Tim Baarslag
Jose Hernandez-Orallo
Joel Z. Leibo
Main: 9 pages · 9 figures · 12 tables · Bibliography: 6 pages · Appendix: 22 pages
Abstract

Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.
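The evaluation protocol the abstract describes — scoring a focal agent on the mutual gain it achieves across diverse, previously unseen partners and scenarios, zero-shot — can be pictured as a simple evaluation loop. The Python sketch below is purely illustrative: `Scenario`, `run_episode`, and `evaluate_focal_agent` are hypothetical names chosen for this example, not part of the actual Concordia API or the contest harness.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import statistics


@dataclass
class Scenario:
    """One mixed-motive setting, e.g. a negotiation or collective-action game."""
    name: str
    # Runs a single episode and returns (focal_reward, background_reward).
    run_episode: Callable[[str, str], tuple[float, float]]


def evaluate_focal_agent(
    focal_agent: str,
    background_agents: Sequence[str],
    scenarios: Sequence[Scenario],
) -> float:
    """Mean score of the focal agent over all scenario/partner pairings.

    Zero-shot: the focal agent faces each scenario and partner once,
    with no fine-tuning or adaptation between episodes.
    """
    scores = []
    for scenario in scenarios:
        for partner in background_agents:
            focal_reward, _ = scenario.run_episode(focal_agent, partner)
            scores.append(focal_reward)
    return statistics.mean(scores)


if __name__ == "__main__":
    # Toy stand-in episode: both agents split the available surplus evenly.
    toy = Scenario("toy_negotiation", lambda focal, partner: (0.5, 0.5))
    print(evaluate_focal_agent("my_llm_agent", ["partner_a", "partner_b"], [toy]))
```

In the real contest setting, `run_episode` would stand in for a full Concordia simulation driven by natural-language interaction, and scores would be aggregated per scenario family (negotiation, collective action, and so on) rather than pooled, but the generalization test has this overall shape: the same fixed agent, many unfamiliar partners and contexts.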
