2404.13076
LLM Evaluators Recognize and Favor Their Own Generations
15 April 2024
Arjun Panickssery
Samuel R. Bowman
Shi Feng
Papers citing
"LLM Evaluators Recognize and Favor Their Own Generations"
50 / 114 papers shown
Title
The Superalignment of Superhuman Intelligence with Large Language Models
Minlie Huang
Yingkang Wang
Shiyao Cui
Pei Ke
J. Tang
113
1
0
15 Dec 2024
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Lei Li
Y. X. Wei
Zhihui Xie
Xuqing Yang
Yifan Song
...
Tianyu Liu
Sujian Li
Bill Yuchen Lin
Lingpeng Kong
Qiang Liu
CoGe
VLM
120
24
0
26 Nov 2024
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
Reshmi Ghosh
Tianyi Yao
Lizzy Chen
Sadid Hasan
Tianwei Chen
Dario Bernal
Huitian Jiao
H M Sajjad Hossain
ELM
76
0
0
25 Nov 2024
Efficient Aspect-Based Summarization of Climate Change Reports with Small Language Models
Iacopo Ghinassi
Leonardo Catalano
Tommaso Colella
70
1
0
21 Nov 2024
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
Chaitanya Malaviya
Joseph Chee Chang
Dan Roth
Mohit Iyyer
Mark Yatskar
Kyle Lo
ELM
45
4
0
11 Nov 2024
Benchmarking LLMs' Judgments with No Gold Standard
Shengwei Xu
Yuxuan Lu
Grant Schoenebeck
Yuqing Kong
34
1
0
11 Nov 2024
Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada
Claire Stevenson
Lonneke van der Plas
LM&MA
LRM
32
3
0
04 Nov 2024
DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers
Rakesh R Menon
Shashank Srivastava
26
1
0
29 Oct 2024
ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
Kimihiro Hasegawa
Wiradee Imrattanatrai
Zhi-Qi Cheng
Masaki Asada
Susan Holm
Yuran Wang
Ken Fukuda
Teruko Mitamura
26
0
0
29 Oct 2024
LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation
Yen-Shan Chen
Jing Jin
Peng-Ting Kuo
Chao-Wei Huang
Yun-Nung (Vivian) Chen
25
1
0
28 Oct 2024
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix J Binder
James Chua
Tomek Korbak
Henry Sleight
John Hughes
Robert Long
Ethan Perez
Miles Turpin
Owain Evans
KELM
AIFin
LRM
35
12
0
17 Oct 2024
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
Nandan Thakur
Suleman Kazi
Ge Luo
Jimmy J. Lin
Amin Ahmad
VLM
RALM
28
7
0
17 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELM
ALM
47
5
0
17 Oct 2024
ReIFE: Re-evaluating Instruction-Following Evaluation
Yixin Liu
Kejian Shi
Alexander R. Fabbri
Yilun Zhao
Peifeng Wang
Chien-Sheng Wu
Shafiq Joty
Arman Cohan
24
6
0
09 Oct 2024
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
Thomas Palmeira Ferraz
Kartik Mehta
Yu-Hsiang Lin
Haw-Shiuan Chang
Shereen Oraby
Sijia Liu
Vivek Subramanian
Tagyoung Chung
Mohit Bansal
Nanyun Peng
56
7
0
09 Oct 2024
Generating bilingual example sentences with large language models as lexicography assistants
Raphael Merx
Ekaterina Vylomova
Kemal Kurniawan
23
2
0
04 Oct 2024
A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
Xiang Dai
Sarvnaz Karimi
Biaoyan Fang
33
0
0
29 Sep 2024
CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells
Atharva Naik
Marcus Alenius
Daniel Fried
Carolyn Rose
26
0
0
29 Sep 2024
Training Language Models to Win Debates with Self-Play Improves Judge Accuracy
Samuel Arnesen
David Rein
Julian Michael
ELM
33
3
0
25 Sep 2024
Direct Judgement Preference Optimization
Peifeng Wang
Austin Xu
Yilun Zhou
Caiming Xiong
Shafiq Joty
ELM
39
12
0
23 Sep 2024
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev
LLMAG
58
3
0
10 Sep 2024
Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
Blair Yang
Fuyang Cui
Keiran Paster
Jimmy Ba
Pashootan Vaezipoor
Silviu Pitis
Michael Ruogu Zhang
28
1
0
01 Sep 2024
Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data
Juncheng Xie
Shensian Syu
Hung-yi Lee
ALM
26
1
0
27 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM
ELM
64
23
0
23 Aug 2024
Summarizing long regulatory documents with a multi-step pipeline
Mika Sie
Ruby Beek
Michiel Bots
S. Brinkkemper
Albert Gatt
AILaw
ELM
29
1
0
19 Aug 2024
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss
Christian Poelitz
Ian Drosos
Vu Le
Nick McKenna
Carina Negreanu
Chris Parnin
Advait Sarkar
ELM
ALM
35
13
0
16 Aug 2024
ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models
Faris Hijazi
Somayah Alharbi
Abdulaziz AlHussein
Harethah Shairah
Reem Alzahrani
Hebah Alshamlan
Omar Knio
G. Turkiyyah
AILaw
ELM
47
2
0
15 Aug 2024
Active Testing of Large Language Model via Multi-Stage Sampling
Yuheng Huang
Jiayang Song
Qiang Hu
Felix Juefei-Xu
Lei Ma
29
2
0
07 Aug 2024
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung
Faeze Brahman
Yejin Choi
ALM
44
12
0
25 Jul 2024
OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context
Steffen Kleinle
Jakob Prange
Annemarie Friedrich
RALM
35
0
0
22 Jul 2024
Understanding Reference Policies in Direct Preference Optimization
Yixin Liu
Pengfei Liu
Arman Cohan
36
7
0
18 Jul 2024
Weak-to-Strong Reasoning
Yuqing Yang
Yan Ma
Pengfei Liu
LRM
37
13
0
18 Jul 2024
Self-Recognition in Language Models
Tim R. Davidson
Viacheslav Surkov
V. Veselovsky
Giuseppe Russo
Robert West
Çağlar Gülçehre
PILM
248
2
0
09 Jul 2024
AI AI Bias: Large Language Models Favor Their Own Generated Content
Walter Laurito
Benjamin Davis
Peli Grietzer
T. Gavenčiak
Ada Böhm
Jan Kulveit
25
3
0
09 Jul 2024
On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton
Noah Y. Siegel
János Kramár
Jonah Brown-Cohen
Samuel Albanie
...
Rishabh Agarwal
David Lindner
Yunhao Tang
Noah D. Goodman
Rohin Shah
ELM
43
29
0
05 Jul 2024
Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks
Adrian Rebmann
Fabian David Schmidt
Goran Glavaš
Han van der Aa
LRM
22
5
0
02 Jul 2024
Compare without Despair: Reliable Preference Evaluation with Generation Separability
Sayan Ghosh
Tejas Srinivasan
Swabha Swayamdipta
43
2
0
02 Jul 2024
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives
Luísa Shimabucoro
Sebastian Ruder
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
SyDa
33
4
0
01 Jul 2024
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
Jiale Cheng
Yida Lu
Xiaotao Gu
Pei Ke
Xiao-Yang Liu
Yuxiao Dong
Hongning Wang
Jie Tang
Minlie Huang
37
4
0
24 Jun 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
77
131
0
22 Jun 2024
A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick
Nishant Balepur
Matthew Shu
Alexander Hoyle
Alison Robey
Shi Feng
Seraphina Goldfarb-Tarrant
Jordan Boyd-Graber
44
2
0
21 Jun 2024
PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts
Varun Gumma
Aditya Yadavalli
Vivek Seshadri
Manohar Swaminathan
Sunayana Sitaram
ELM
45
9
0
21 Jun 2024
Adversaries Can Misuse Combinations of Safe Models
Erik Jones
Anca Dragan
Jacob Steinhardt
45
7
0
20 Jun 2024
PostMark: A Robust Blackbox Watermark for Large Language Models
Yapei Chang
Kalpesh Krishna
Amir Houmansadr
John Wieting
Mohit Iyyer
34
5
0
20 Jun 2024
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Sshubam Verma
Mitesh Khapra
39
11
0
19 Jun 2024
Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba
Ruiqi He
Yushu He
Longju Bai
Jiarui Liu
Zhenjie Sun
Zenghao Tang
He Wang
Hanchen Xia
Naihao Deng
30
0
0
18 Jun 2024
Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments
Han Zhou
Xingchen Wan
Yinhong Liu
Nigel Collier
Ivan Vulić
Anna Korhonen
ALM
38
9
0
17 Jun 2024
CRAG -- Comprehensive RAG Benchmark
Xiao Yang
Kai Sun
Hao Xin
Yushi Sun
Nikita Bhalla
...
Nirav Shah
Rakesh Wanga
Anuj Kumar
Wen-tau Yih
Xin Luna Dong
26
24
0
07 Jun 2024
UltraMedical: Building Specialized Generalists in Biomedicine
Kaiyan Zhang
Sihang Zeng
Ermo Hua
Ning Ding
Zhang-Ren Chen
...
Xuekai Zhu
Xingtai Lv
Hu Jinfang
Zhiyuan Liu
Bowen Zhou
LM&MA
43
22
0
06 Jun 2024
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai
Yuchen Zhang
Shichang Zhang
Fan Yin
Difan Zou
Yisong Yue
Ziniu Hu
30
0
0
04 Jun 2024