arXiv:2310.01557
SmartPlay: A Benchmark for LLMs as Intelligent Agents
2 October 2023
Yue Wu
Xuan Tang
Tom Michael Mitchell
Yuanzhi Li
ELM
LLMAG
Github (137★)
Papers citing "SmartPlay: A Benchmark for LLMs as Intelligent Agents"
20 / 20 papers shown
WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models
Qiyue Yin
Pei Xu
Qiaozhe Li
Shengda Liu
S. Shen
...
Lei Cui
Chengxin Yan
Jie Sun
Xiangquan Tang
K. Huang
LLMAG
ELM
LRM
114
0
0
12 Jun 2025
DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments
Chiyu Zhang
Marc-Alexandre Cote
Michael Albada
Anush Sankaran
Jack W. Stokes
Tong Wang
Amir H. Abdi
William Blum
Muhammad Abdul-Mageed
LLMAG
AAML
ELM
60
0
0
31 May 2025
lmgame-Bench: How Good are LLMs at Playing Games?
Lanxiang Hu
Mingjia Huo
Yu Zhang
Haoyang Yu
Eric P. Xing
Ion Stoica
Tajana Rosing
Haojian Jin
Hao Zhang
140
1
0
21 May 2025
The Influence of Human-inspired Agentic Sophistication in LLM-driven Strategic Reasoners
Vince Trencsenyi
Agnieszka Mensfelt
Kostas Stathis
LRM
162
0
0
14 May 2025
ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
H. A. Alyahya
Haidar Khan
Yazeed Alnumay
M Saiful Bari
B. Yener
LRM
151
2
0
10 Mar 2025
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
Wenjia Jiang
Yangyang Zhuang
Chenxi Song
Xu Yang
Chi Zhang
LLMAG
200
6
0
04 Mar 2025
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models
Sherzod Hakimov
Lara Pfennigschmidt
David Schlangen
ELM
140
0
0
17 Feb 2025
PlanGenLLMs: A Modern Survey of LLM Planning Capabilities
Hui Wei
Zihao Zhang
Shenghua He
Tian Xia
Shijia Pan
Fei Liu
199
11
0
16 Feb 2025
GAMEBoT: Transparent Assessment of LLM Reasoning in Games
Wenye Lin
Jonathan Roberts
Yunhan Yang
Samuel Albanie
Zongqing Lu
Kai Han
LRM
ELM
135
1
0
18 Dec 2024
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Davide Paglieri
Bartłomiej Cupiał
Samuel Coward
Ulyana Piterbarg
Maciej Wolczyk
...
Lerrel Pinto
Rob Fergus
Jakob Foerster
Jack Parker-Holder
Tim Rocktäschel
LLMAG
LRM
213
22
0
20 Nov 2024
VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models
Chenglin Li
Qianglong Chen
Zhi Li
Feng Tao
Yin Zhang
135
0
0
14 Nov 2024
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
Jingxuan Chen
Derek Yuen
Bin Xie
Yue Yang
Gongwei Chen
...
Liqiang Nie
Yasheng Wang
Jianye Hao
Jun Wang
Kun Shao
LLMAG
217
15
0
19 Oct 2024
Learning to Ask: When LLM Agents Meet Unclear Instruction
Wenxuan Wang
Juluan Shi
Chaozheng Wang
Cheryl Lee
Youliang Yuan
Jen-tse Huang
Wenxiang Jiao
Michael R. Lyu
LLMAG
185
12
0
31 Aug 2024
PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action
Yijia Shao
Tianshi Li
Weiyan Shi
Yanchen Liu
Diyi Yang
PILM
165
31
0
29 Aug 2024
A LLM Benchmark based on the Minecraft Builder Dialog Agent Task
Chris Madge
Massimo Poesio
LLMAG
51
3
0
17 Jul 2024
WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment
Jiefu Ou
Arda Uzunoglu
Benjamin Van Durme
Daniel Khashabi
LM&Ro
VGen
90
3
0
10 Jul 2024
LLM-Craft: Robotic Crafting of Elasto-Plastic Objects with Large Language Models
Alison Bartsch
A. Farimani
160
7
0
12 Jun 2024
Testing and Understanding Erroneous Planning in LLM Agents through Synthesized User Inputs
Zhenlan Ji
Daoyuan Wu
Pingchuan Ma
Zongjie Li
Shuai Wang
LLMAG
87
6
0
27 Apr 2024
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
Tula Masterman
Sandi Besen
Mason Sawtell
Alex Chao
LM&Ro
LLMAG
114
58
0
17 Apr 2024
How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
Jen-tse Huang
E. Li
Man Ho Lam
Tian Liang
Wenxuan Wang
Youliang Yuan
Wenxiang Jiao
Xing Wang
Zhaopeng Tu
Michael R. Lyu
ELM
LLMAG
201
39
0
18 Mar 2024