ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2502.01534
  4. Cited By
Preference Leakage: A Contamination Problem in LLM-as-a-judge

Preference Leakage: A Contamination Problem in LLM-as-a-judge

3 February 2025
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
ArXivPDFHTML

Papers citing "Preference Leakage: A Contamination Problem in LLM-as-a-judge"

50 / 87 papers shown
Title
Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement
Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement
Keheliya Gallaba
Ali Arabat
Dayi Lin
Mohammed Sayagh
Ahmed E. Hassan
AI4CE
26
0
0
27 May 2025
Judging with Many Minds: Do More Perspectives Mean Less Prejudice?
Judging with Many Minds: Do More Perspectives Mean Less Prejudice?
Chiyu Ma
Enpei Zhang
Yilun Zhao
Wenjun Liu
Yaning Jia
Peijun Qing
Lin Shi
Arman Cohan
Yujun Yan
Soroush Vosoughi
LLMAG
ELM
41
0
0
26 May 2025
CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation
CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation
Guang Yang
Yu Zhou
Xiang Chen
Wei-Shi Zheng
Xing Hu
Xin Zhou
David Lo
Taolue Chen
ALM
LRM
65
0
0
26 May 2025
The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
Ruichen Zhang
Rana Muhammad Shahroz Khan
Zhen Tan
Dawei Li
Song Wang
Tianlong Chen
LRM
35
0
0
24 May 2025
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Licheng Pan
Yongqi Tong
Xin Zhang
Xiaolu Zhang
Jun Zhou
Zhixuan Chu
29
0
0
23 May 2025
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Kaixuan Fan
Kaituo Feng
Haoming Lyu
Dongzhan Zhou
Xiangyu Yue
ReLM
LRM
86
0
0
22 May 2025
OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models
OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models
Burak Erinç Çetin
Yıldırım Özen
Elif Naz Demiryılmaz
Kaan Engür
Cagri Toraman
ELM
65
0
0
21 May 2025
DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models
DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models
Yuxuan Jiang
Dawei Li
Frank Ferraro
LRM
106
0
0
20 May 2025
Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals
Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals
Qianli Wang
Van Bach Nguyen
Nils Feldhus
Luis Felipe Villa-Arenas
Christin Seifert
Sebastian Möller
Vera Schmitt
52
0
0
20 May 2025
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D'Souza
Hamed Babaei Giglou
Quentin Münch
ELM
68
0
0
20 May 2025
Krikri: Advancing Open Large Language Models for Greek
Krikri: Advancing Open Large Language Models for Greek
Dimitris Roussis
Leon Voukoutis
Georgios Paraskevopoulos
Sokratis Sofianopoulos
Prokopis Prokopidis
Vassilis Papavasileiou
Athanasios Katsamanis
Stelios Piperidis
Vassilis Katsouros
ALM
79
0
0
19 May 2025
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz
Oleg Zendel
P. Bailey
Charles L. A. Clarke
Ellese Cotterill
Jeff Dalton
Faegheh Hasibi
Mark Sanderson
Nick Craswell
ELM
82
2
0
27 Apr 2025
DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning
DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning
Jucheng Hu
Steve Yang
Dongzhan Zhou
Lijun Wu
48
0
0
21 Apr 2025
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
Yize Zhang
Tianyi Liang
Xinyue Huang
Erfei Cui
Xu Guo
Pei Chu
Chenhui Li
Ru Zhang
Wenhai Wang
Gongshen Liu
271
0
0
15 Apr 2025
CHARM: Calibrating Reward Models With Chatbot Arena Scores
CHARM: Calibrating Reward Models With Chatbot Arena Scores
Xiao Zhu
Chenmien Tan
Pinzhen Chen
Rico Sennrich
Yanlin Zhang
Hanxu Hu
ALM
77
1
0
14 Apr 2025
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong
Wei Shen
Yanzeng Li
Songyang Gao
Hua Lu
Yicheng Chen
Yang Zhang
Wei Zhou
Jinjie Gu
Lei Zou
LRM
93
11
0
12 Apr 2025
Do LLM Evaluators Prefer Themselves for a Reason?
Do LLM Evaluators Prefer Themselves for a Reason?
Wei-Lin Chen
Zhepei Wei
Xinyu Zhu
Shi Feng
Yu Meng
ELM
LRM
71
3
0
04 Apr 2025
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
Liangjie Huang
Dawei Li
Huan Liu
Lu Cheng
LRM
82
0
0
03 Apr 2025
MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX
MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX
Liuyue Xie
George Z. Wei
Avik Kuthiala
Ce Zheng
Ananya Bal
...
Rohan Choudhury
Morteza Ziyadi
Xu Zhang
Hao Yang
László A. Jeni
94
0
0
27 Mar 2025
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
Marthe Ballon
Andres Algaba
Vincent Ginis
LRM
ReLM
81
15
0
24 Feb 2025
BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment
Sizhe Wang
Yongqi Tong
Hengyuan Zhang
Dawei Li
Xin Zhang
Tianlong Chen
165
9
0
21 Feb 2025
CLIPPER: Compression enables long-context synthetic data generation
CLIPPER: Compression enables long-context synthetic data generation
Chau Minh Pham
Yapei Chang
Mohit Iyyer
SyDa
124
1
0
21 Feb 2025
Who Taught You That? Tracing Teachers in Model Distillation
Who Taught You That? Tracing Teachers in Model Distillation
Somin Wadhwa
Chantal Shaib
Silvio Amir
Byron C. Wallace
174
2
0
10 Feb 2025
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Javier Rando
Jie Zhang
Nicholas Carlini
F. Tramèr
AAML
ELM
122
9
0
04 Feb 2025
Quantification of Large Language Model Distillation
Quantification of Large Language Model Distillation
Sunbowen Lee
Junting Zhou
Chang Ao
Kaige Li
Xinrun Du
...
Hamid Alinejad-Rokny
Min Yang
Yitao Liang
Zhoufutu Wen
Shiwen Ni
102
1
0
22 Jan 2025
Assessing the Impact of Conspiracy Theories Using Large Language Models
Assessing the Impact of Conspiracy Theories Using Large Language Models
Bohan Jiang
Dawei Li
Zhen Tan
Xinyi Zhou
Ashwin Rao
Kristina Lerman
H. Bernard
Huan Liu
161
2
0
09 Dec 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
228
104
0
25 Nov 2024
ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework
ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework
Hengyuan Zhang
Chenming Shang
Sizhe Wang
Dongdong Zhang
Feng Yao
Renliang Sun
Yiyao Yu
Yujiu Yang
Furu Wei
108
5
0
25 Oct 2024
Agent-as-a-Judge: Evaluate Agents with Agents
Agent-as-a-Judge: Evaluate Agents with Agents
Mingchen Zhuge
Changsheng Zhao
Dylan R. Ashley
Wenyi Wang
Dmitrii Khizbullin
...
Raghuraman Krishnamoorthi
Yuandong Tian
Yangyang Shi
Vikas Chandra
Jürgen Schmidhuber
ELM
105
40
0
14 Oct 2024
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye
Yanbo Wang
Yue Huang
Dongping Chen
Qihui Zhang
...
Werner Geyer
Chao Huang
Pin-Yu Chen
Nitesh Chawla
Xiangliang Zhang
ELM
80
69
0
03 Oct 2024
Law of the Weakest Link: Cross Capabilities of Large Language Models
Law of the Weakest Link: Cross Capabilities of Large Language Models
Ming Zhong
Aston Zhang
Xuewei Wang
Rui Hou
Wenhan Xiong
...
Melanie Kambadur
Dhruv Mahajan
Sergey Edunov
Jiawei Han
Laurens van der Maaten
ELM
36
6
0
30 Sep 2024
Exploring Large Language Models for Feature Selection: A Data-centric
  Perspective
Exploring Large Language Models for Feature Selection: A Data-centric Perspective
Dawei Li
Zhen Tan
Huan Liu
LM&MA
71
10
0
21 Aug 2024
Fostering Natural Conversation in Large Language Models with NICO: a
  Natural Interactive COnversation dataset
Fostering Natural Conversation in Large Language Models with NICO: a Natural Interactive COnversation dataset
Renliang Sun
Mengyuan Liu
Shiping Yang
Rui Wang
Junqing He
Jiaxing Zhang
65
2
0
18 Aug 2024
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
Colin White
Samuel Dooley
Manley Roberts
Arka Pal
Ben Feuer
...
Willie Neiswanger
Micah Goldblum
Tom Goldstein
Willie Neiswanger
Micah Goldblum
ELM
83
18
0
27 Jun 2024
DataGen: Unified Synthetic Dataset Generation via Large Language Models
DataGen: Unified Synthetic Dataset Generation via Large Language Models
Yue Huang
Siyuan Wu
Chujie Gao
Dongping Chen
Qihui Zhang
...
Tianyi Zhou
Xiangliang Zhang
Jianfeng Gao
Chaowei Xiao
Lichao Sun
SyDa
84
20
0
27 Jun 2024
Unveiling the Spectrum of Data Contamination in Language Models: A
  Survey from Detection to Remediation
Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation
Chunyuan Deng
Yilun Zhao
Yuzhao Heng
Yitong Li
Jiannan Cao
Xiangru Tang
Arman Cohan
64
15
0
20 Jun 2024
Uncovering Latent Memories: Assessing Data Leakage and Memorization
  Patterns in Frontier AI Models
Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Frontier AI Models
Sunny Duan
Mikail Khona
Abhiram Iyer
Rylan Schaeffer
Ila R Fiete
89
3
0
20 Jun 2024
Data Contamination Can Cross Language Barriers
Data Contamination Can Cross Language Barriers
Feng Yao
Yufan Zhuang
Zihao Sun
Sunan Xu
Animesh Kumar
Jingbo Shang
69
11
0
19 Jun 2024
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur
Kartik Choudhary
Venkat Srinik Ramayapally
Sankaran Vaidyanathan
Dieuwke Hupkes
ELM
ALM
119
64
0
18 Jun 2024
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and
  BenchBuilder Pipeline
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Tianle Li
Wei-Lin Chiang
Evan Frick
Lisa Dunlap
Tianhao Wu
Banghua Zhu
Joseph E. Gonzalez
Ion Stoica
ALM
72
171
0
17 Jun 2024
Measuring memorization in RLHF for code completion
Measuring memorization in RLHF for code completion
Aneesh Pappu
Billy Porter
Ilia Shumailov
Jamie Hayes
63
3
0
17 Jun 2024
Benchmark Data Contamination of Large Language Models: A Survey
Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
66
54
0
06 Jun 2024
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
Jinjie Ni
Fuzhao Xue
Xiang Yue
Yuntian Deng
Mahir Shah
Kabir Jain
Graham Neubig
Yang You
ELM
65
44
0
03 Jun 2024
DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer's
  Disease Questions with Scientific Literature
DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer's Disease Questions with Scientific Literature
Dawei Li
Shu Yang
Zhen Tan
Jae Young Baik
Sunkwon Yun
...
D. Duong-Tran
Ying Ding
Huan Liu
Li Shen
Tianlong Chen
75
37
0
08 May 2024
Prometheus 2: An Open Source Language Model Specialized in Evaluating
  Other Language Models
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Seungone Kim
Juyoung Suk
Shayne Longpre
Bill Yuchen Lin
Jamin Shin
Sean Welleck
Graham Neubig
Moontae Lee
Kyungjae Lee
Minjoon Seo
MoMe
ALM
ELM
91
198
0
02 May 2024
Balancing Speciality and Versatility: a Coarse to Fine Framework for
  Supervised Fine-tuning Large Language Model
Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model
Hengyuan Zhang
Yanru Wu
Dawei Li
Zacc Yang
Rui Zhao
Yong Jiang
Fei Tan
ALM
76
0
0
16 Apr 2024
LLM Evaluators Recognize and Favor Their Own Generations
LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery
Samuel R. Bowman
Shi Feng
84
185
0
15 Apr 2024
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Yann Dubois
Balázs Galambosi
Percy Liang
Tatsunori Hashimoto
ALM
95
379
0
06 Apr 2024
Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to
  Boost for Reasoning
Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning
Yongqi Tong
Dawei Li
Sizhe Wang
Yujia Wang
Fei Teng
Jingbo Shang
LRM
80
57
0
29 Mar 2024
Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Jiawen Shi
Zenghui Yuan
Yinuo Liu
Yue Huang
Pan Zhou
Lichao Sun
Neil Zhenqiang Gong
AAML
104
53
0
26 Mar 2024
12
Next