2305.01937
Cited By
Can Large Language Models Be an Alternative to Human Evaluations?
3 May 2023
Cheng-Han Chiang
Hung-yi Lee
ALM
LM&MA
Papers citing
"Can Large Language Models Be an Alternative to Human Evaluations?"
50 / 122 papers shown
Title
WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models
Huawen Feng
Pu Zhao
Qingfeng Sun
Can Xu
Fangkai Yang
...
Qianli Ma
Qingwei Lin
Saravan Rajmohan
Dongmei Zhang
Qi Zhang
AAML
ALM
169
0
0
23 Dec 2024
A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI
Beiduo Chen
Siyao Peng
Anna Korhonen
Barbara Plank
139
2
0
18 Dec 2024
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim
Juyoung Suk
Seungone Kim
Niklas Muennighoff
Dongkwan Kim
Alice Oh
ELM
190
1
0
10 Dec 2024
If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World
Adrian de Wynter
105
1
0
02 Dec 2024
MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification
Saptarshi Sengupta
Kristal Curtis
Akshay Mallipeddi
Abhinav Mathur
Joseph Ross
Liang Gou
LLMAG
SyDa
214
2
0
28 Nov 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
362
112
0
25 Nov 2024
Do LLMs Agree on the Creativity Evaluation of Alternative Uses?
Abdullah Al Rabeyah
Fabrício Góes
Marco Volpe
Talles Medeiros
131
1
0
23 Nov 2024
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni
Jonathan Colaço-Carr
Yash More
Jackie CK Cheung
G. Farnadi
175
1
0
12 Nov 2024
UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
Bo Yang
Qingping Yang
Runtao Liu
LRM
ReLM
ELM
AIMat
151
1
0
11 Nov 2024
Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada
Claire Stevenson
Lonneke van der Plas
LM&MA
LRM
138
5
0
04 Nov 2024
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
Guan-Ting Lin
Prashanth Gurunath Shivakumar
Aditya Gourav
Yile Gu
Ankur Gandhe
Hung-yi Lee
I. Bulyko
120
9
0
04 Nov 2024
Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
Aliyah R. Hsu
James Zhu
Zhichao Wang
Bin Bi
Shubham Mehrotra
...
Sougata Chaudhuri
Regunathan Radhakrishnan
S. Asur
Claire Na Cheng
Bin Yu
ALM
LRM
184
0
0
03 Nov 2024
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
Dongryeol Lee
Yerin Hwang
Yongil Kim
Joonsuk Park
Kyomin Jung
ELM
153
10
0
28 Oct 2024
End-to-end Training for Recommendation with Language-based User Profiles
Zhaolin Gao
Joyce Zhou
Yijia Dai
Thorsten Joachims
AI4Ed
153
4
0
24 Oct 2024
MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback
Zonghai Yao
Aditya Parashar
Huixue Zhou
Won Seok Jang
Feiyun Ouyang
Zhichao Yang
Hong-ye Yu
ELM
137
2
0
17 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELM
ALM
121
10
0
17 Oct 2024
Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation
Ryotaro Shimizu
Takashi Wada
Yu Wang
Johannes Kruse
Sean O'Brien
...
Yuya Yoshikawa
Yuki Saito
Fugee Tsung
M. Goto
Julian McAuley
60
0
0
17 Oct 2024
LLM-Human Pipeline for Cultural Context Grounding of Conversations
Rajkumar Pujari
Dan Goldwasser
108
1
0
17 Oct 2024
Expanding Chatbot Knowledge in Customer Service: Context-Aware Similar Question Generation Using Large Language Models
Mengze Hong
Yuanfeng Song
Di Jiang
Lu Wang
Zichang Guo
Yuanqin He
Zhiyang Su
Qing Li
81
2
0
16 Oct 2024
MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation
Aniket Deroy
Subhankar Maity
Sudeshna Sarkar
LLMAG
LRM
108
3
0
16 Oct 2024
4-LEGS: 4D Language Embedded Gaussian Splatting
Gal Fiebelman
Tamir Cohen
Ayellet Morgenstern
Peter Hedman
Hadar Averbuch-Elor
3DGS
146
1
0
14 Oct 2024
EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs
Yijie Li
Yuan Sun
ELM
60
1
0
13 Oct 2024
Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation
Tunazzina Islam
Dan Goldwasser
179
2
0
07 Oct 2024
Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step
Wenxuan Wang
Kuiyi Gao
Zihan Jia
Youliang Yuan
Jen-tse Huang
S. Wang
Wenxiang Jiao
Zhaopeng Tu
386
3
0
04 Oct 2024
Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI
Nicholas Pangakis
Samuel Wolken
71
5
0
14 Sep 2024
Learning to Ask: When LLM Agents Meet Unclear Instruction
Wenxuan Wang
Juluan Shi
Chaozheng Wang
Cheryl Lee
Youliang Yuan
Jen-tse Huang
Wenxiang Jiao
Michael R. Lyu
LLMAG
183
12
0
31 Aug 2024
An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication
Yunmeng Li
Jun Suzuki
Makoto Morishita
Kaori Abe
Kentaro Inui
123
1
0
28 Aug 2024
Can Unconfident LLM Annotations Be Used for Confident Conclusions?
Kristina Gligorić
Tijana Zrnic
Cinoo Lee
Emmanuel J. Candès
Dan Jurafsky
185
12
0
27 Aug 2024
DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang
Jiayi Yuan
Yu-Neng Chuang
Zhuoer Wang
Yingchi Liu
Mark Cusick
Param Kulkarni
Zhengping Ji
Yasser Ibrahim
Xia Hu
LM&MA
ELM
120
4
0
25 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM
ELM
194
32
0
23 Aug 2024
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text
Sher Badshah
Hassan Sajjad
ELM
98
14
0
17 Aug 2024
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs
Do Xuan Long
Hai Nguyen Ngoc
Tiviatis Sim
Hieu Dao
Shafiq Joty
Kenji Kawaguchi
Nancy F. Chen
Min-Yen Kan
131
11
0
16 Aug 2024
A Survey on Employing Large Language Models for Text-to-SQL Tasks
Liang Shi
Zhengju Tang
Nan Zhang
Xiaotong Zhang
Zhi Yang
190
31
0
21 Jul 2024
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Yung-Sung Chuang
Linlu Qiu
Cheng-Yu Hsieh
Ranjay Krishna
Yoon Kim
James R. Glass
HILM
87
48
0
09 Jul 2024
ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
Mehant Kammakomati
Sameer Pimparkhede
Srikanth G. Tamilselvam
Praveen Venkateswaran
Pushpak Bhattacharyya
ALM
135
0
0
03 Jul 2024
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
...
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
ALM
ELM
177
88
0
26 Jun 2024
Carrot and Stick: Inducing Self-Motivation with Positive & Negative Feedback
Jimin Sohn
Jeihee Cho
Junyong Lee
Songmu Heo
Ji-Eun Han
David R. Mortensen
LRM
87
0
0
24 Jun 2024
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur
Kartik Choudhary
Venkat Srinik Ramayapally
Sankaran Vaidyanathan
Dieuwke Hupkes
ELM
ALM
171
65
0
18 Jun 2024
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
Tu Anh Dinh
Carlos Mullov
Leonard Barmann
Zhaolin Li
Danni Liu
...
Michael Beigl
Rainer Stiefelhagen
Carsten Dachsbacher
Klemens Bohm
Jan Niehues
ELM
88
12
0
14 Jun 2024
DCA-Bench: A Benchmark for Dataset Curation Agents
Benhao Huang
Yingzhuo Yu
Jin Huang
Xingjian Zhang
Jiaqi Ma
120
1
0
11 Jun 2024
HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants
Milan Gritta
Gerasimos Lampouras
Ignacio Iacobacci
ALM
64
2
0
15 May 2024
Creative Beam Search: LLM-as-a-Judge For Improving Response Generation
Giorgio Franceschelli
Mirco Musolesi
74
8
0
30 Apr 2024
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
Bibek Upadhayay
Vahid Behzadan
AAML
75
16
0
09 Apr 2024
Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Jiawen Shi
Zenghui Yuan
Yinuo Liu
Yue Huang
Pan Zhou
Lichao Sun
Neil Zhenqiang Gong
AAML
146
57
0
26 Mar 2024
A Survey on Human-AI Teaming with Large Pre-Trained Models
Vanshika Vats
Marzia Binta Nizam
Minghao Liu
Ziyuan Wang
Richard Ho
...
Celeste Shen
Rachel Shen
Nafisa Hussain
Kesav Ravichandran
James Davis
LM&MA
124
9
0
07 Mar 2024
Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency
Akila Wickramasekara
Frank Breitinger
Mark Scanlon
144
10
0
29 Feb 2024
Prediction-Powered Ranking of Large Language Models
Ivi Chatzi
Eleni Straitouri
Suhas Thejaswi
Manuel Gomez Rodriguez
ALM
127
9
0
27 Feb 2024
Large Language Models are Advanced Anonymizers
Robin Staab
Mark Vero
Mislav Balunović
Martin Vechev
100
14
0
21 Feb 2024
High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models
Michela Lorandi
Anya Belz
58
5
0
19 Feb 2024
A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models
Jaylen Jones
Lingbo Mo
Eric Fosler-Lussier
Huan Sun
95
4
0
18 Feb 2024