Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2306.05087
Cited By
v1
v2 (latest)
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
8 June 2023
Yidong Wang
Zhuohao Yu
Zhengran Zeng
Linyi Yang
Cunxiang Wang
Hao Chen
Chaoya Jiang
Rui Xie
Jindong Wang
Xingxu Xie
Wei Ye
Shi-Bo Zhang
Yue Zhang
ALM
ELM
Re-assign community
ArXiv (abs)
PDF
HTML
Github (914★)
Papers citing
"PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization"
50 / 184 papers shown
Title
GRAM: A Generative Foundation Reward Model for Reward Generalization
Chenglong Wang
Yang Gan
Yifu Huo
Yongyu Mu
Qiaozhi He
...
Bei Li
Tong Xiao
Chunliang Zhang
Tongran Liu
Jingbo Zhu
ALM
OffRL
LRM
57
0
0
17 Jun 2025
Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation
Tzu-Heng Huang
Harit Vishwakarma
Frederic Sala
ELM
141
0
0
12 Jun 2025
LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge
Songze Li
Chuokun Xu
Jiaying Wang
Xueluan Gong
Chen Chen
J. Zhang
Jun Wang
K. Lam
Shouling Ji
AAML
ELM
91
0
0
11 Jun 2025
Unlocking Recursive Thinking of LLMs: Alignment via Refinement
Haoke Zhang
Xiaobo Liang
Cunxiang Wang
Juntao Li
Min Zhang
LRM
47
0
0
06 Jun 2025
RewardAnything: Generalizable Principle-Following Reward Models
Zhuohao Yu
Jiali Zeng
Weizheng Gu
Yidong Wang
Jindong Wang
Fandong Meng
Jie Zhou
Yue Zhang
Shikun Zhang
Wei Ye
LRM
119
1
0
04 Jun 2025
Quantitative LLM Judges
Aishwarya Sahoo
Jeevana Kruthi Karnuthala
Tushar Parmanand Budhwani
Pranchal Agarwal
Sankaran Vaidyanathan
...
Jennifer Healey
Nedim Lipka
Ryan Rossi
Uttaran Bhattacharya
Branislav Kveton
ELM
63
0
0
03 Jun 2025
Data Swarms: Optimizable Generation of Synthetic Evaluation Data
Shangbin Feng
Yike Wang
Weijia Shi
Yulia Tsvetkov
59
0
0
31 May 2025
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization
Utsav Maskey
Chencheng Zhu
Usman Naseem
AAML
ELM
27
1
0
30 May 2025
Judging LLMs on a Simplex
Patrick Vossler
Fan Xia
Yifan Mai
Jean Feng
65
0
0
28 May 2025
Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge
Zhuo Liu
Moxin Li
Xun Deng
Qifan Wang
Fuli Feng
ELM
74
0
0
25 May 2025
Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
Qian Cao
Xiting Wang
Yuzhuo Yuan
Yahui Liu
Fang Luo
Ruihua Song
53
0
0
25 May 2025
Flex-Judge: Think Once, Judge Anywhere
Jongwoo Ko
S. Kim
Sungwoo Cho
Se-Young Yun
ELM
LRM
218
0
0
24 May 2025
BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text
Ibrahim Al Azher
Miftahul Jannat Mokarrama
Zhishuai Guo
Sagnik Ray Choudhury
Hamed Alhoori
96
0
0
22 May 2025
LLM-based Evaluation Policy Extraction for Ecological Modeling
Qi Cheng
Licheng Liu
Qing Zhu
Runlong Yu
Zhenong Jin
Yiqun Xie
Xiaowei Jia
63
0
0
20 May 2025
AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
Xiechi Zhang
Zetian Ouyang
Linlin Wang
Gerard de Melo
Zhu Cao
Xiaoling Wang
Ya Zhang
Yanfeng Wang
Liang He
LM&MA
ELM
124
0
0
17 May 2025
Towards Better Evaluation for Generated Patent Claims
Lekang Jiang
Pascal A Scherz
Stephan Goetz
ELM
81
2
0
16 May 2025
am-ELO: A Stable Framework for Arena-based LLM Evaluation
Zirui Liu
Jiatong Li
Yan Zhuang
Qiang Liu
Shuanghong Shen
Jie Ouyang
Mingyue Cheng
Shijin Wang
188
1
0
06 May 2025
LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
Joy Lim Jia Yin
Daniel Zhang-Li
Jifan Yu
Haoyang Li
Shangqing Tu
...
Zhiyuan Liu
Huiqin Liu
Lei Hou
Juanzi Li
Bin Xu
81
0
0
04 May 2025
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Bang Zhang
Ruotian Ma
Qingxuan Jiang
Peisong Wang
Jiaqi Chen
...
Fanghua Ye
Jian Li
Yifan Yang
Zhaopeng Tu
Xiaolong Li
LLMAG
ELM
ALM
263
0
1
01 May 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
253
7
0
26 Apr 2025
An Empirical Study of Evaluating Long-form Question Answering
Ning Xian
Yixing Fan
Ruqing Zhang
Maarten de Rijke
Jiafeng Guo
ELM
58
0
0
25 Apr 2025
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Yilun Zhou
Austin Xu
Peifeng Wang
Caiming Xiong
Shafiq Joty
ELM
ALM
LRM
174
5
0
21 Apr 2025
PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
Reya Vir
Shreya Shankar
Harrison Chase
Will Fu-Hinthorn
Aditya G. Parameswaran
AI4TS
87
0
0
20 Apr 2025
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Ding Chen
Qingchen Yu
P. Wang
Wentao Zhang
Simin Niu
Feiyu Xiong
Xiaochen Li
Minchuan Yang
Zhiyu Li
ALM
LRM
136
6
0
14 Apr 2025
AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery
Amirhossein Abaskohi
A. Ramesh
Shailesh Nanisetty
Chirag Goel
David Vazquez
Christopher Pal
Spandana Gella
Giuseppe Carenini
I. Laradji
86
0
0
10 Apr 2025
JudgeLRM: Large Reasoning Models as a Judge
Nuo Chen
Zhiyuan Hu
Qingyun Zou
Jiaying Wu
Qian Wang
Bryan Hooi
Bingsheng He
ReLM
ELM
LRM
190
15
0
31 Mar 2025
When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?
Tuo Liang
Zhe Hu
Jing Li
Hao Zhang
Yiren Lu
...
Yiran Qiao
Disheng Liu
Jeirui Peng
Jing Ma
Yu Yin
136
0
0
29 Mar 2025
Data Poisoning in Deep Learning: A Survey
Pinlong Zhao
Weiyao Zhu
Pengfei Jiao
Di Gao
Ou Wu
AAML
155
1
0
27 Mar 2025
SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma
Dora Zhao
Xinran Zhao
Chenglei Si
Chenyang Yang
Ryan Louie
Ehud Reiter
Diyi Yang
Tongshuang Wu
ALM
148
2
0
24 Mar 2025
REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
Alexander Pugachev
Alena Fenogenova
Vladislav Mikhailov
Ekaterina Artemova
111
0
0
17 Mar 2025
DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process
Minjun Zhu
Yixuan Weng
Linyi Yang
Yue Zhang
ALM
LRM
112
7
0
11 Mar 2025
Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting
Jiyue Jiang
Pengan Chen
Jinqiao Wang
Dongchen He
Ziqin Wei
...
Yimin Fan
Xiangyu Shi
Jimeng Sun
Chuan Wu
Yuan Li
LM&MA
121
3
0
06 Mar 2025
Argument Summarization and its Evaluation in the Era of Large Language Models
Moritz Altemeyer
Steffen Eger
Johannes Daxenberger
Yanran Chen
Tim Altendorf
Philipp Cimiano
Benjamin Schiller
LM&MA
ELM
LRM
124
1
0
02 Mar 2025
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Kaishuai Xu
Tiezheng YU
Wenjun Hou
Yi Cheng
Liangyou Li
Xin Jiang
Lifeng Shang
Qiang Liu
Wenjie Li
ELM
159
0
0
26 Feb 2025
Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Shuliang Liu
Xinze Li
Zhenghao Liu
Yukun Yan
Cheng Yang
Zheni Zeng
Zhiyuan Liu
Maosong Sun
Ge Yu
RALM
264
3
0
26 Feb 2025
Leveraging Large Models for Evaluating Novel Content: A Case Study on Advertisement Creativity
Zhaoyi Joey Hou
Adriana Kovashka
Xiang Lorraine Li
85
0
0
26 Feb 2025
PiCO: Peer Review in LLMs based on the Consistency Optimization
Kun-Peng Ning
Shuo Yang
Yu-Yang Liu
Jia-Yu Yao
Zhen-Hui Liu
Yu Wang
Ming Pang
Li Yuan
ALM
217
9
0
24 Feb 2025
Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning
Kimia Noorbakhsh
Joseph Chandler
Pantea Karimi
M. Alizadeh
H. Balakrishnan
LRM
109
1
0
18 Feb 2025
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
Mingni Tang
Jiajia Li
Lu Yang
Zhiqiang Zhang
Jinghao Tian
Hui Yuan
Lefei Zhang
Peijie Wang
94
0
0
17 Feb 2025
Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models
Yue Xu
Chengyan Fu
Li Xiong
Sibei Yang
Wenjie Wang
123
0
0
17 Feb 2025
Uncertainty-Aware Step-wise Verification with Generative Reward Models
Zihuiwen Ye
Luckeciano C. Melo
Younesse Kaddar
Phil Blunsom
Shivalika Singh
Yarin Gal
LRM
147
5
0
16 Feb 2025
An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Qiujie Xie
Qingqiu Li
Zhuohao Yu
Yuejie Zhang
Yue Zhang
Linyi Yang
ELM
134
5
0
15 Feb 2025
Uni-Retrieval: A Multi-Style Retrieval Framework for STEM's Education
Yanhao Jia
Xinyi Wu
Hao Li
Qinglin Zhang
Yuxiao Hu
Shuai Zhao
Wenqi Fan
183
5
0
09 Feb 2025
Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation
Shunfan Zheng
Xiechi Zhang
Gerard de Melo
Xiaoling Wang
Linlin Wang
LM&MA
ELM
49
1
0
12 Jan 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
ALM
213
13
0
31 Dec 2024
Explaining Length Bias in LLM-Based Preference Evaluations
Zhengyu Hu
Linxin Song
Jieyu Zhang
Zheyuan Xiao
Jingang Wang
Zhengyu Chen
N. Yuan
Jianxun Lian
Kaize Ding
Hui Xiong
ALM
91
7
0
31 Dec 2024
Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation
Zhuohao Yu
Weizheng Gu
Yidong Wang
Xingru Jiang
Zhengran Zeng
Jindong Wang
Wei Ye
Shikun Zhang
LRM
195
3
0
19 Dec 2024
ACE-
M
3
M^3
M
3
: Automatic Capability Evaluator for Multimodal Medical Models
Xiechi Zhang
Shunfan Zheng
Linlin Wang
Gerard de Melo
Zhu Cao
Xiaoling Wang
Liang He
ELM
151
0
0
16 Dec 2024
Towards Action Hijacking of Large Language Model-based Agent
Yuyang Zhang
Kangjie Chen
Xudong Jiang
Yuxiang Sun
Run Wang
Lina Wang
Tianwei Zhang
LLMAG
AAML
185
0
0
14 Dec 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
377
112
0
25 Nov 2024
1
2
3
4
Next