PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

8 June 2023
Yidong Wang
Zhuohao Yu
Zhengran Zeng
Linyi Yang
Cunxiang Wang
Hao Chen
Chaoya Jiang
Rui Xie
Jindong Wang
Xingxu Xie
Wei Ye
Shi-Bo Zhang
Yue Zhang
    ALM, ELM
ArXiv (abs) | PDF | HTML | GitHub (914★)

Papers citing "PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization"

50 / 184 papers shown
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
Yicheng Gao
G. Xu
Zhe Wang
Arman Cohan
99
6
0
07 Nov 2024
Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
Aliyah R. Hsu
James Zhu
Zhichao Wang
Bin Bi
Shubham Mehrotra
...
Sougata Chaudhuri
Regunathan Radhakrishnan
S. Asur
Claire Na Cheng
Bin Yu
ALM, LRM
186
0
0
03 Nov 2024
Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning
Zongmeng Zhang
Yufeng Shi
Jinhua Zhu
Wengang Zhou
Xiang Qi
Peng Zhang
Haoyang Li
RALM, HILM
41
0
0
22 Oct 2024
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
Maosong Cao
Alexander Lam
Haodong Duan
Hongwei Liu
Shanghang Zhang
Kai Chen
AILaw, ELM
97
20
0
21 Oct 2024
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Dilip Venkatesh
Raj Dabre
Anoop Kunchukuttan
Mitesh M. Khapra
ELM
72
1
0
17 Oct 2024
Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation
Ryotaro Shimizu
Takashi Wada
Yu Wang
Johannes Kruse
Sean O'Brien
...
Yuya Yoshikawa
Yuki Saito
Fugee Tsung
M. Goto
Julian McAuley
67
0
0
17 Oct 2024
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Sijun Tan
Siyuan Zhuang
Kyle Montgomery
William Y. Tang
Alejandro Cuadron
Chenguang Wang
Raluca A. Popa
Ion Stoica
ELM, ALM
155
52
0
16 Oct 2024
4-LEGS: 4D Language Embedded Gaussian Splatting
Gal Fiebelman
Tamir Cohen
Ayellet Morgenstern
Peter Hedman
Hadar Averbuch-Elor
3DGS
150
1
0
14 Oct 2024
EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs
Yijie Li
Yuan Sun
ELM
60
1
0
13 Oct 2024
Multi-Facet Counterfactual Learning for Content Quality Evaluation
Jiasheng Zheng
Hongyu Lin
Boxi Cao
M. Liao
Yaojie Lu
Xianpei Han
Le Sun
49
0
0
10 Oct 2024
ReIFE: Re-evaluating Instruction-Following Evaluation
Yixin Liu
Kejian Shi
Alexander R. Fabbri
Yilun Zhao
Peifeng Wang
Chien-Sheng Wu
Shafiq Joty
Arman Cohan
95
6
0
09 Oct 2024
Self-rationalization improves LLM as a fine-grained judge
Prapti Trivedi
Aditya Gulati
Oliver Molenschot
Meghana Arakkal Rajeev
Rajkumar Ramamurthy
Keith Stevens
Tanveesh Singh Chaudhery
Jahnavi Jambholkar
James Zou
Nazneen Rajani
LRM
99
7
0
07 Oct 2024
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang
Yufei Wang
Tiezheng Yu
Yuxin Jiang
Chuhan Wu
...
Xin Jiang
Lifeng Shang
Ruiming Tang
Fuyuan Lyu
Chen Ma
128
7
0
07 Oct 2024
Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding
Yanming Liu
Xinyue Peng
Jiannan Cao
Shi Bo
Yanxin Shen
Tianyu Du
Sheng Cheng
Xun Wang
Jianwei Yin
Xuhong Zhang
145
9
0
02 Oct 2024
Mitigating the Bias of Large Language Model Evaluation
Hongli Zhou
Hui Huang
Yunfei Long
Bing Xu
Conghui Zhu
Hailong Cao
Muyun Yang
Tiejun Zhao
ELM
52
3
0
25 Sep 2024
FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark
Heegyu Kim
Taeyang Jeon
Seunghwan Choi
Seungtaek Choi
Hyunsouk Cho
127
0
0
24 Sep 2024
Direct Judgement Preference Optimization
Peifeng Wang
Austin Xu
Yilun Zhou
Caiming Xiong
Shafiq Joty
ELM
109
13
0
23 Sep 2024
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan
D. Zhu
Matthias Aßenmacher
Xiaoyu Shen
Benjamin Roth
ELM
125
6
0
06 Sep 2024
Towards a Unified View of Preference Learning for Large Language Models: A Survey
Bofei Gao
Feifan Song
Yibo Miao
Zefan Cai
Zhiyong Yang
...
Houfeng Wang
Zhifang Sui
Peiyi Wang
Baobao Chang
163
14
0
04 Sep 2024
Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
Zhuo Li
Yuhao Du
Jinpeng Hu
Xiang Wan
Anningzhe Gao
73
2
0
03 Sep 2024
Self-Judge: Selective Instruction Following with Alignment Self-Evaluation
Hai Ye
Hwee Tou Ng
ELM, ALM
49
5
0
02 Sep 2024
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
Dingyi Yang
Qin Jin
134
7
0
26 Aug 2024
LalaEval: A Holistic Human Evaluation Framework for Domain-Specific Large Language Models
Chongyan Sun
Ken Lin
Shiwei Wang
Hulong Wu
Chengfei Fu
Zhen Wang
ALM
40
2
0
23 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM, ELM
196
32
0
23 Aug 2024
Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs
Jiancheng Dong
Lei Jiang
Wei Jin
Lu Cheng
110
1
0
18 Aug 2024
The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation
Samee Arif
Sualeha Farid
Abdul Hameed Azeemi
Awais Athar
Agha Ali Raza
LLMAG
116
8
0
16 Aug 2024
AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling
Yuheng Huang
Jiayang Song
Qiang Hu
Felix Juefei Xu
Lei Ma
99
4
0
07 Aug 2024
SAFETY-J: Evaluating Safety with Critique
Yixiu Liu
Yuxiang Zheng
Shijie Xia
Jiajun Li
Yi Tu
Chaoling Song
Pengfei Liu
ELM
60
2
0
24 Jul 2024
Halu-J: Critique-Based Hallucination Judge
Binjie Wang
Steffi Chern
Ethan Chern
Pengfei Liu
HILM
56
8
0
17 Jul 2024
Towards Dataset-scale and Feature-oriented Evaluation of Text Summarization in Large Language Model Prompts
Sam Yu-Te Lee
Aryaman Bahukhandi
Dongyu Liu
Kwan-Liu Ma
AAML
76
5
0
16 Jul 2024
Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models
Jin Liu
Qingquan Li
Wenlong Du
LM&MA, ELM
18
0
0
10 Jul 2024
OffsetBias: Leveraging Debiased Data for Tuning Evaluators
Junsoo Park
Seungyeon Jwa
Meiying Ren
Daeyoung Kim
Sanghyuk Choi
ALM
87
43
0
09 Jul 2024
Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application
Chuanpeng Yang
Wang Lu
Yao Zhu
Yidong Wang
Qian Chen
Chenlong Gao
Bingjie Yan
Yiqiang Chen
ALM, KELM
103
32
0
02 Jul 2024
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
Zhong-Zhi Li
Ming-Liang Zhang
Fei Yin
Zhi-Long Ji
Jin-Feng Bai
Zhen-Ru Pan
Fan-Hu Zeng
Jian Xu
Jia-Xin Zhang
Cheng-Lin Liu
ELM
98
14
0
28 Jun 2024
Themis: Towards Flexible and Interpretable NLG Evaluation
Xinyu Hu
Li Lin
Mingqi Gao
Xunjian Yin
Xiaojun Wan
ELM
91
8
0
26 Jun 2024
TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot
Kaiqi Zhang
Shuai Yuan
Honghan Zhao
ALM, ELM
71
2
0
25 Jun 2024
A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
I. Zubiaga
A. Soroa
Rodrigo Agerri
74
6
0
21 Jun 2024
Hybrid Alignment Training for Large Language Models
Chenglong Wang
Hang Zhou
Kaiyan Chang
Bei Li
Yongyu Mu
Tong Xiao
Tongran Liu
Jingbo Zhu
109
5
0
21 Jun 2024
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang
Xiaoyuan Yi
Zhihua Wei
Ziang Xiao
Shu Wang
Xing Xie
ELM, ALM
164
8
0
20 Jun 2024
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Sshubam Verma
Mitesh Khapra
112
16
0
19 Jun 2024
UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions
Xunzhi Wang
Zhuowei Zhang
Qiongyu Li
Gaonan Chen
Mengting Hu
Zhixin Han
Bitong Luo
Zhiyu Li
Hang Gao
Mengting Hu
ELM
109
3
0
18 Jun 2024
A Survey on Human Preference Learning for Large Language Models
Ruili Jiang
Kehai Chen
Xuefeng Bai
Zhixuan He
Juntao Li
Muyun Yang
Tiejun Zhao
Liqiang Nie
Min Zhang
134
9
0
17 Jun 2024
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Lin Long
Rui Wang
Ruixuan Xiao
Junbo Zhao
Xiao Ding
Gang Chen
Haobo Wang
SyDa
112
127
0
14 Jun 2024
HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation
Wen Luo
Tianshu Shen
Wei Li
Guangyue Peng
Richeng Xuan
Houfeng Wang
Xi Yang
HILM
111
12
0
11 Jun 2024
AutoSurvey: Large Language Models Can Automatically Write Surveys
Yidong Wang
Qi Guo
Wenjin Yao
Hongbo Zhang
Xin Zhang
...
Hao Fei
Qingsong Wen
Wei Ye
Shikun Zhang
Yue Zhang
LM&MA
93
33
0
10 Jun 2024
Two Tales of Persona in LLMs: A Survey of Role-Playing and Personalization
Yu-Min Tseng
Yu-Chao Huang
Teng-Yun Hsiao
Yu-Ching Hsu
Chao-Wei Huang
Jia-Yin Foo
Yun-Nung Chen
LLMAG
428
92
0
03 Jun 2024
Favi-Score: A Measure for Favoritism in Automated Preference Ratings for Generative AI Evaluation
Pius von Daniken
Jan Deriu
Don Tuggener
Mark Cieliebak
68
2
0
03 Jun 2024
Prompt Chaining or Stepwise Prompt? Refinement in Text Summarization
Shichao Sun
Ruifeng Yuan
Ziqiang Cao
Wenjie Li
Pengfei Liu
LRM
50
20
0
01 Jun 2024
Improving Reward Models with Synthetic Critiques
Zihuiwen Ye
Fraser Greenlee-Scott
Max Bartolo
Phil Blunsom
Jon Ander Campos
Matthias Gallé
ALM, SyDa, LRM
103
24
0
31 May 2024
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
Zhe Hu
Tuo Liang
Jing Li
Yiren Lu
Yunlai Zhou
Yiran Qiao
Jing Ma
Yu Yin
90
3
0
29 May 2024