2410.10934
Agent-as-a-Judge: Evaluate Agents with Agents
14 October 2024
Mingchen Zhuge
Changsheng Zhao
Dylan R. Ashley
Wenyi Wang
Dmitrii Khizbullin
Yunyang Xiong
Zechun Liu
E. Chang
Raghuraman Krishnamoorthi
Yuandong Tian
Yangyang Shi
Vikas Chandra
Jürgen Schmidhuber
ELM
Papers citing
"Agent-as-a-Judge: Evaluate Agents with Agents"
11 / 11 papers shown
Title
On the Evaluation of Engineering Artificial General Intelligence
Sandeep Neema
Susmit Jha
Adam Nagel
Ethan Lew
Chandrasekar Sureshkumar
Aleksa Gordic
Chase Shimmin
Hieu Nguygen
Paul Eremenko
ELM
22
0
0
15 May 2025
TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande
Varun Gangal
Hersh Mehta
Jitin Krishnan
Anand Kannappan
Rebecca Qian
27
0
0
13 May 2025
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Bang Zhang
Ruotian Ma
Qingxuan Jiang
Peisong Wang
Jiaqi Chen
...
Fanghua Ye
Jian Li
Yifan Yang
Zhaopeng Tu
Xiaolong Li
LLMAG
ELM
ALM
109
0
1
01 May 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
Jing Liu
Zhiguang Han
...
Beibin Li
Chi Wang
H. Wang
Yuxiao Chen
Qingyun Wu
49
1
0
30 Apr 2025
PaperBench: Evaluating AI's Ability to Replicate AI Research
Giulio Starace
Oliver Jaffe
Dane Sherburn
James Aung
Jun Shern Chan
...
Benjamin Kinsella
Wyatt Thompson
Johannes Heidecke
Amelia Glaese
Tejal Patwardhan
ALM
ELM
802
7
0
02 Apr 2025
ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
H. A. Alyahya
Haidar Khan
Yazeed Alnumay
M Saiful Bari
B. Yener
LRM
67
1
0
10 Mar 2025
WildIFEval: Instruction Following in the Wild
Gili Lior
Asaf Yehudai
Ariel Gera
L. Ein-Dor
71
0
0
09 Mar 2025
BioAgents: Democratizing Bioinformatics Analysis with Multi-Agent Systems
Nikita Mehandru
Amanda K. Hall
Olesya Melnichenko
Yulia Dubinina
Daniel Tsirulnikov
David Bamman
Ahmed Alaa
Scott Saponas
Venkat S. Malladi
44
3
0
10 Jan 2025
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
123
70
0
25 Nov 2024
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Jun Shern Chan
Neil Chowdhury
Oliver Jaffe
James Aung
Dane Sherburn
...
Kevin Liu
Leon Maksin
Tejal Patwardhan
Lilian Weng
Aleksander Mądry
ELM
LLMAG
54
48
0
09 Oct 2024
On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards
Zhimin Zhao
A. A. Bangash
F. Côgo
Bram Adams
Ahmed E. Hassan
62
1
0
04 Jul 2024