Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2504.07385
Cited By
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
10 April 2025
Sher Badshah
Ali Emami
Hassan Sajjad
LLMAG
ELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models"
13 / 13 papers shown
Title
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
Austin Xu
Srijan Bansal
Yifei Ming
Semih Yavuz
Shafiq Joty
ELM
125
3
0
19 Mar 2025
DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering
Sher Badshah
Hassan Sajjad
111
1
0
11 Mar 2025
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Pat Verga
Sebastian Hofstatter
Sophia Althammer
Yixuan Su
Aleksandra Piktus
Arkady Arkhangorodsky
Minjie Xu
Naomi White
Patrick Lewis
ALM
ELM
89
99
0
29 Apr 2024
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Lianghui Zhu
Xinggang Wang
Xinlong Wang
ELM
ALM
92
130
0
26 Oct 2023
The Rise and Potential of Large Language Model Based Agents: A Survey
Zhiheng Xi
Wenxiang Chen
Xin Guo
Wei He
Yiwen Ding
...
Wenjuan Qin
Yongyan Zheng
Xipeng Qiu
Xuanjing Huan
Tao Gui
LM&MA
LM&Ro
3DV
AI4CE
92
919
0
14 Sep 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
314
4,288
0
09 Jun 2023
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Sewon Min
Kalpesh Krishna
Xinxi Lyu
M. Lewis
Wen-tau Yih
Pang Wei Koh
Mohit Iyyer
Luke Zettlemoyer
Hannaneh Hajishirzi
HILM
ALM
113
683
0
23 May 2023
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Zhibin Gou
Zhihong Shao
Yeyun Gong
Yelong Shen
Yujiu Yang
Nan Duan
Weizhu Chen
KELM
LRM
61
382
0
19 May 2023
Evaluating Open-Domain Question Answering in the Era of Large Language Models
Ehsan Kamalloo
Nouha Dziri
C. Clarke
Davood Rafiei
ELM
54
105
0
11 May 2023
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
706
41,736
0
28 May 2020
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang
Varsha Kishore
Felix Wu
Kilian Q. Weinberger
Yoav Artzi
289
5,791
0
21 Apr 2019
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang
Peng Qi
Saizheng Zhang
Yoshua Bengio
William W. Cohen
Ruslan Salakhutdinov
Christopher D. Manning
RALM
150
2,635
0
25 Sep 2018
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi
Eunsol Choi
Daniel S. Weld
Luke Zettlemoyer
RALM
195
2,643
0
09 May 2017
1