2406.00936
Cited By
A Survey of Useful LLM Evaluation
3 June 2024
Ji-Lun Peng
Sijia Cheng
Egil Diau
Yung-Yu Shih
Po-Heng Chen
Yen-Ting Lin
Yun-Nung Chen
LLMAG
ELM
Papers citing "A Survey of Useful LLM Evaluation"
23 / 23 papers shown
Title
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri
Melissa Z. Pan
Shuyi Yang
Lakshya A Agrawal
Bhavya Chopra
...
Dan Klein
Kannan Ramchandran
Matei A. Zaharia
Joseph E. Gonzalez
Ion Stoica
LLMAG
Presented at ResearchTrend Connect | LLMAG on 23 Apr 2025
129
8
0
17 Mar 2025
Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?
Bo Wang
Yiqiao Li
Jianlong Zhou
Fang Chen
XAI
ELM
42
0
0
28 Feb 2025
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
Khanh-Tung Tran
Dung Dao
Minh-Duong Nguyen
Quoc-Viet Pham
Barry O’Sullivan
Hoang D. Nguyen
LLMAG
95
27
0
10 Jan 2025
QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
Mohammad Aflah Khan
Neemesh Yadav
Sarah Masud
Md. Shad Akhtar
74
0
0
16 Dec 2024
Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs
Hyeonwoo Kim
Dahyun Kim
Jihoo Kim
Sukyung Lee
Y. Kim
Chanjun Park
44
0
0
16 Oct 2024
TestAgent: A Framework for Domain-Adaptive Evaluation of LLMs via Dynamic Benchmark Construction and Exploratory Interaction
Wanying Wang
Zeyu Ma
Pengfei Liu
Mingang Chen
LLMAG
47
1
0
15 Oct 2024
Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models
Dahyun Kim
Sukyung Lee
Yungi Kim
Attapol Rutherford
Chanjun Park
ELM
31
1
0
07 Oct 2024
A Survey on Complex Tasks for Goal-Directed Interactive Agents
Mareike Hartmann
Alexander Koller
LM&Ro
LLMAG
34
0
0
27 Sep 2024
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
Dingyi Yang
Qin Jin
44
5
0
26 Aug 2024
On Protecting the Data Privacy of Large Language Models (LLMs): A Survey
Biwei Yan
Kun Li
Minghui Xu
Yueyan Dong
Yue Zhang
Zhaochun Ren
Xiuzhen Cheng
AILaw
PILM
70
76
0
08 Mar 2024
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
Qiusi Zhan
Zhixiang Liang
Zifan Ying
Daniel Kang
LLMAG
46
73
0
05 Mar 2024
Measuring and Reducing LLM Hallucination without Gold-Standard Answers
Jiaheng Wei
Yuanshun Yao
Jean-François Ton
Hongyi Guo
Andrew Estornell
Yang Liu
HILM
55
18
0
16 Feb 2024
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation
Xiaoying Zhang
Baolin Peng
Ye Tian
Jingyan Zhou
Lifeng Jin
Linfeng Song
Haitao Mi
Helen Meng
HILM
42
43
0
14 Feb 2024
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG
ReLM
LRM
240
2,494
0
06 Oct 2022
Multiple-Choice Question Generation: Towards an Automated Assessment Framework
Vatsal Raina
Mark J. F. Gales
AI4Ed
ELM
26
32
0
23 Sep 2022
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
Dhruv Shah
B. Osinski
Brian Ichter
Sergey Levine
LM&Ro
158
436
0
10 Jul 2022
Large Language Models are Few-Shot Clinical Information Extractors
Monica Agrawal
S. Hegselmann
Hunter Lang
Yoon Kim
David Sontag
BDL
LM&MA
162
334
0
25 May 2022
Teaching language models to support answers with verified quotes
Jacob Menick
Maja Trebacz
Vladimir Mikulik
John Aslanides
Francis Song
...
Mia Glaese
Susannah Young
Lucy Campbell-Gillingham
G. Irving
Nat McAleese
ELM
RALM
243
257
0
21 Mar 2022
BBQ: A Hand-Built Bias Benchmark for Question Answering
Alicia Parrish
Angelica Chen
Nikita Nangia
Vishakh Padmakumar
Jason Phang
Jana Thompson
Phu Mon Htut
Sam Bowman
217
367
0
15 Oct 2021
ALL-IN-ONE: Multi-Task Learning BERT models for Evaluating Peer Assessments
Qinjin Jia
Jiali Cui
Yunkai Xiao
Chengyuan Liu
Parvez Rashid
E. Gehringer
32
43
0
08 Oct 2021
Explaining Answers with Entailment Trees
Bhavana Dalvi
Peter Alexander Jansen
Oyvind Tafjord
Zhengnan Xie
Hannah Smith
Leighanna Pipatanangkura
Peter Clark
ReLM
FAtt
LRM
239
184
0
17 Apr 2021
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
Mor Geva
Daniel Khashabi
Elad Segal
Tushar Khot
Dan Roth
Jonathan Berant
RALM
250
673
0
06 Jan 2021
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktäschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
415
2,586
0
03 Sep 2019