ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2305.17926
  4. Cited By
Large Language Models are not Fair Evaluators
v1v2 (latest)

Large Language Models are not Fair Evaluators

29 May 2023
Peiyi Wang
Lei Li
Liang Chen
Zefan Cai
Dawei Zhu
Binghuai Lin
Yunbo Cao
Qi Liu
Tianyu Liu
Zhifang Sui
    ALM
ArXiv (abs)PDFHTMLGithub (137★)

Papers citing "Large Language Models are not Fair Evaluators"

50 / 133 papers shown
Title
The Role of Model Confidence on Bias Effects in Measured Uncertainties
The Role of Model Confidence on Bias Effects in Measured Uncertainties
Xinyi Liu
Weiguang Wang
Hangfeng He
21
0
0
20 Jun 2025
MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
Yongqi Fan
Yating Wang
Guandong Wang
Jie Zhai
Jingping Liu
Qi Ye
Tong Ruan
26
0
0
18 Jun 2025
GRAM: A Generative Foundation Reward Model for Reward Generalization
GRAM: A Generative Foundation Reward Model for Reward Generalization
Chenglong Wang
Yang Gan
Yifu Huo
Yongyu Mu
Qiaozhi He
...
Bei Li
Tong Xiao
Chunliang Zhang
Tongran Liu
Jingbo Zhu
ALMOffRLLRM
57
0
0
17 Jun 2025
WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models
WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models
Qiyue Yin
Pei Xu
Qiaozhe Li
Shengda Liu
S. Shen
...
Lei Cui
Chengxin Yan
Jie Sun
Xiangquan Tang
K. Huang
LLMAGELMLRM
119
0
0
12 Jun 2025
Towards Understanding Bias in Synthetic Data for Evaluation
Towards Understanding Bias in Synthetic Data for Evaluation
Hossein A. Rahmani
Varsha Ramineni
Nick Craswell
Bhaskar Mitra
Emine Yilmaz
118
0
0
12 Jun 2025
BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions
BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions
Saptarshi Sengupta
Shuhua Yang
Paul Kwong Yu
Fali Wang
Suhang Wang
56
0
0
06 Jun 2025
daDPO: Distribution-Aware DPO for Distilling Conversational Abilities
daDPO: Distribution-Aware DPO for Distilling Conversational Abilities
Zhengze Zhang
Shiqi Wang
Yiqun Shen
Simin Guo
Dahua Lin
Xiaoliang Wang
Nguyen Cam-Tu
Fei Tan
19
0
0
03 Jun 2025
Quantitative LLM Judges
Quantitative LLM Judges
Aishwarya Sahoo
Jeevana Kruthi Karnuthala
Tushar Parmanand Budhwani
Pranchal Agarwal
Sankaran Vaidyanathan
...
Jennifer Healey
Nedim Lipka
Ryan Rossi
Uttaran Bhattacharya
Branislav Kveton
ELM
66
0
0
03 Jun 2025
Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees' Dialogue to Facilitate Nurse Communication Training
Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees' Dialogue to Facilitate Nurse Communication Training
Keyeun Lee
Seolhee Lee
Esther Hehsun Kim
Yena Ko
Jinsu Eun
...
Haiyi Zhu
Robert E. Kraut
Eunyoung Suh
Eun-mee Kim
Hajin Lim
34
0
0
31 May 2025
Flex-Judge: Think Once, Judge Anywhere
Flex-Judge: Think Once, Judge Anywhere
Jongwoo Ko
S. Kim
Sungwoo Cho
Se-Young Yun
ELMLRM
218
0
0
24 May 2025
Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Meng Li
Guangda Huzhang
Haibo Zhang
Xiting Wang
Anxiang Zeng
44
0
0
24 May 2025
Large Language Models for Predictive Analysis: How Far Are They?
Large Language Models for Predictive Analysis: How Far Are They?
Qin Chen
Yuanyi Ren
Xiaojun Ma
Yuyang Shi
101
1
0
22 May 2025
Boosting In-Context Learning in LLMs Through the Lens of Classical Supervised Learning
Boosting In-Context Learning in LLMs Through the Lens of Classical Supervised Learning
Korel Gundem
Juncheng Dong
Dennis Zhang
Vahid Tarokh
Zhengling Qi
24
0
0
22 May 2025
J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization
J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization
Austin Xu
Yilun Zhou
Xuan-Phi Nguyen
Caiming Xiong
Shafiq Joty
ELMLRM
146
0
0
19 May 2025
R3: Robust Rubric-Agnostic Reward Models
R3: Robust Rubric-Agnostic Reward Models
David Anugraha
Zilu Tang
Lester James V. Miranda
Hanyang Zhao
Mohammad Rifqi Farhansyah
Garry Kuwanto
Derry Wijaya
Genta Indra Winata
220
1
0
19 May 2025
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich
Anyi Wang
Raoyuan Zhao
Florian Eichin
Jonas Fischer
Barbara Plank
89
0
0
22 Apr 2025
Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results
Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results
Andrea Santilli
Adam Goliñski
Michael Kirchhof
Federico Danieli
Arno Blaas
Miao Xiong
Luca Zappella
Sinead Williamson
71
3
0
18 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
141
0
0
16 Apr 2025
Large Language Models as Span Annotators
Large Language Models as Span Annotators
Zdeněk Kasner
Vilém Zouhar
Patrícia Schmidtová
Ivan Kartáč
Kristýna Onderková
Ondřej Plátek
Dimitra Gkatzia
Saad Mahamood
Ondrej Dusek
Simone Balloccu
ALM
124
0
0
11 Apr 2025
NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark
NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark
Vladislav Mikhailov
Tita Ranveig Enstad
David Samuel
Hans Christian Farsethås
Andrey Kutuzov
Erik Velldal
Lilja Øvrelid
ELM
115
1
0
10 Apr 2025
Summarizing Speech: A Comprehensive Survey
Summarizing Speech: A Comprehensive Survey
Fabian Retkowski
Maike Züfle
Andreas Sudmann
Dinah Pfau
Jan Niehues
Alexander Waibel
Alexander H. Waibel
119
0
0
10 Apr 2025
A Survey on Transformer Context Extension: Approaches and Evaluation
A Survey on Transformer Context Extension: Approaches and Evaluation
Yijun Liu
Jinzheng Yu
Yang Xu
Zhongyang Li
Qingfu Zhu
LLMAG
128
3
0
17 Mar 2025
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang
Zhangyi Jiang
Zhenqi He
Wenhan Yang
Yanan Zheng
Zeyu Li
Zifan He
Shenyang Tong
Hailei Gong
LRM
169
2
0
16 Mar 2025
Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs?
Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs?
So Young Lee
Russell Scheinberg
Amber Shore
Ameeta Agrawal
79
1
0
13 Mar 2025
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Hongyu Chen
Seraphina Goldfarb-Tarrant
74
0
0
12 Mar 2025
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
Jiho Jin
Woosung Kang
Junho Myung
Alice Oh
72
0
0
10 Mar 2025
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
Bang Nguyen
Tingting Du
Mengxia Yu
Lawrence Angrave
Meng Jiang
AI4Ed
111
0
0
07 Mar 2025
Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm
Hui Yuan
Yuhao Du
Xiaoqi Jiao
Yiwen Guo
Yuege Feng
Xiang Wan
Anningzhe Gao
Jinpeng Hu
98
0
0
04 Mar 2025
Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks
Umar Ali Khan
Ekram Khan
Fiza Khan
A. A. Moinuddin
90
0
0
02 Mar 2025
Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing
Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing
Juntai Cao
Xiang Zhang
Raymond Li
Chuyuan Li
Shafiq Joty
Shafiq Joty
Giuseppe Carenini
181
2
0
27 Feb 2025
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Kaishuai Xu
Tiezheng YU
Wenjun Hou
Yi Cheng
Liangyou Li
Xin Jiang
Lifeng Shang
Qiang Liu
Wenjie Li
ELM
159
0
0
26 Feb 2025
Reward Shaping to Mitigate Reward Hacking in RLHF
Reward Shaping to Mitigate Reward Hacking in RLHF
Jiayi Fu
Xuandong Zhao
Chengyuan Yao
Han Wang
Qi Han
Yanghua Xiao
202
14
0
26 Feb 2025
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation
K. Yan
Hongcheng Guo
Xuanqing Shi
Jinfeng Xu
Yaonan Gu
Hui Yuan
ALM
144
1
0
26 Feb 2025
Investigating Non-Transitivity in LLM-as-a-Judge
Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu
Laura Ruis
Tim Rocktaschel
Robert Kirk
120
3
0
19 Feb 2025
Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison
Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison
George Saad
Scott Sanner
38
0
0
18 Feb 2025
Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning
Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning
Kimia Noorbakhsh
Joseph Chandler
Pantea Karimi
M. Alizadeh
H. Balakrishnan
LRM
109
1
0
18 Feb 2025
An Empirical Analysis of Uncertainty in Large Language Model Evaluations
An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Qiujie Xie
Qingqiu Li
Zhuohao Yu
Yuejie Zhang
Yue Zhang
Linyi Yang
ELM
134
5
0
15 Feb 2025
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Xu Huang
Wenhao Zhu
Hanxu Hu
Zeang Sheng
Lei Li
Shujian Huang
Fei Yuan
ELM
180
4
0
11 Feb 2025
AI Alignment at Your Discretion
AI Alignment at Your Discretion
Maarten Buyl
Hadi Khalaf
C. M. Verdun
Lucas Monteiro Paes
Caio Vieira Machado
Flavio du Pin Calmon
114
1
0
10 Feb 2025
Aligning Black-box Language Models with Human Judgments
Aligning Black-box Language Models with Human Judgments
Gerrit J. J. van den Burg
Gen Suzuki
Wei Liu
Murat Sensoy
ALM
141
0
0
07 Feb 2025
MADP: Multi-Agent Deductive Planning for Enhanced Cognitive-Behavioral Mental Health Question Answer
Qi Chen
Dexi Liu
113
0
0
28 Jan 2025
Explaining Decisions of Agents in Mixed-Motive Games
Explaining Decisions of Agents in Mixed-Motive Games
Maayan Orner
Oleg Maksimov
Akiva Kleinerman
Charles Ortiz
Sarit Kraus
146
1
0
28 Jan 2025
Panoramic Interests: Stylistic-Content Aware Personalized Headline Generation
Panoramic Interests: Stylistic-Content Aware Personalized Headline Generation
Junhong Lian
Xiang Ao
Xinyu Liu
Yang Liu
Qing He
117
0
0
21 Jan 2025
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
Yinhong Liu
Han Zhou
Zhijiang Guo
Ehsan Shareghi
Ivan Vulić
Anna Korhonen
Nigel Collier
ALM
214
83
0
20 Jan 2025
Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis
Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis
Lanling Xu
Junjie Zhang
Bingqian Li
Jinpeng Wang
Sheng Chen
Wayne Xin Zhao
Ji-Rong Wen
185
18
0
17 Jan 2025
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
Ruosen Li
Teerth Patel
Xinya Du
LLMAGALM
187
102
0
03 Jan 2025
A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls
A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls
Sheikh Shafayat
Dongkeun Yoon
Woori Jang
Jiwoo Choi
Alice Oh
Seohyon Jung
210
1
0
03 Jan 2025
Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM
Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM
Dong Yuan
Eti Rastogi
Fen Zhao
Sagar Goyal
Gautam Naik
Sree Prasanna Rajagopal
65
0
0
31 Dec 2024
Geometric-Averaged Preference Optimization for Soft Preference Labels
Geometric-Averaged Preference Optimization for Soft Preference Labels
Hiroki Furuta
Kuang-Huei Lee
Shixiang Shane Gu
Y. Matsuo
Aleksandra Faust
Heiga Zen
Izzeddin Gur
146
13
0
31 Dec 2024
LLM-based relevance assessment still can't replace human relevance assessment
LLM-based relevance assessment still can't replace human relevance assessment
Charles L. A. Clarke
Laura Dietz
ELM
54
13
0
22 Dec 2024
123
Next