Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2303.04048
Cited By
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
7 March 2023
Jiaan Wang
Yunlong Liang
Fandong Meng
Zengkui Sun
Haoxiang Shi
Zhixu Li
Jinan Xu
Jianfeng Qu
Jie Zhou
LM&MA
ELM
ALM
AI4MH
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Is ChatGPT a Good NLG Evaluator? A Preliminary Study"
50 / 288 papers shown
Title
Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models
Banca Calvo Figueras
Rodrigo Agerri
ALM
ELM
LRM
17
0
0
16 May 2025
SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models
Peichao Lai
K. Zhang
Yi Lin
L. Zhang
Feiyang Ye
...
Yanwei Xu
Conghui He
Yuhui Wang
Wentao Zhang
Bin Cui
ELM
LRM
44
0
0
12 May 2025
PLHF: Prompt Optimization with Few-Shot Human Feedback
Chun-Pai Yang
Kan Zheng
Shou-De Lin
21
0
0
11 May 2025
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
Hanhua Hong
Chenghao Xiao
Yang Wang
Y. Liu
Wenge Rong
Chenghua Lin
31
0
0
29 Apr 2025
An Empirical Study of Evaluating Long-form Question Answering
Ning Xian
Yixing Fan
Ruqing Zhang
Maarten de Rijke
Jiafeng Guo
ELM
32
0
0
25 Apr 2025
Process Reward Models That Think
Muhammad Khalifa
Rishabh Agarwal
Lajanugen Logeswaran
Jaekyeom Kim
Hao Peng
Moontae Lee
Honglak Lee
Lu Wang
OffRL
ALM
LRM
44
1
0
23 Apr 2025
Planning with Diffusion Models for Target-Oriented Dialogue Systems
Hanwen Du
B. Peng
Xia Ning
25
0
0
23 Apr 2025
V
2
^2
2
R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations
Zhiyuan Fan
Yumeng Wang
Sandeep Polisetty
Yi Ren Fung
50
0
0
23 Apr 2025
An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Andrea Piergentili
Beatrice Savoldi
Matteo Negri
L. Bentivogli
ELM
37
0
0
16 Apr 2025
Deep Reasoning Translation via Reinforcement Learning
Jiaan Wang
Fandong Meng
Jie Zhou
OffRL
LRM
33
0
0
14 Apr 2025
Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?
Grgur Kovač
Jérémy Perez
Rémy Portelas
Peter Ford Dominey
Pierre-Yves Oudeyer
35
0
0
04 Apr 2025
TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes
Raj Sanjay Shah
Lei Xu
Qianchu Liu
Jon Burnsky
Drew Bertagnolli
Chaitanya P. Shivade
LM&MA
96
0
0
26 Mar 2025
SLIDE: Sliding Localized Information for Document Extraction
Divyansh Singh
Manuel Nunez Martinez
Bonnie J. Dorr
Sonja Schmer-Galunder
39
0
0
23 Mar 2025
Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes
Sharan Maiya
Yinhong Liu
Ramit Debnath
Anna Korhonen
30
0
0
22 Mar 2025
A Survey on Transformer Context Extension: Approaches and Evaluation
Yijun Liu
Jinzheng Yu
Yang Xu
Zhongyang Li
Qingfu Zhu
LLMAG
68
0
0
17 Mar 2025
Improving LLM-based Document-level Machine Translation with Multi-Knowledge Fusion
Bin Liu
Xinglin Lyu
Junhui Li
Daimeng Wei
M. Zhang
Shimin Tao
Hao Yang
56
0
0
15 Mar 2025
FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance
Fengbin Zhu
Junfeng Li
Liangming Pan
Wenqiang Wang
Fuli Feng
Chao Wang
Huanbo Luan
Tat-Seng Chua
AIFin
62
0
0
07 Mar 2025
How Do Hackathons Foster Creativity? Towards AI Collaborative Evaluation of Creativity at Scale
Jeanette Falk
Yiyi Chen
Janet Rafner
Mike Zhang
Johannes Bjerva
Alexander Nolte
63
1
0
06 Mar 2025
In-depth Analysis of Graph-based RAG in a Unified Framework
Yingli Zhou
Yaodong Su
Youran Sun
Shu Wang
Taotao Wang
...
Yongwei Zhang
Sicong Liang
Xilin Liu
Yuchi Ma
Yixiang Fang
42
0
0
06 Mar 2025
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
Alberto Purpura
Sahil Wadhwa
Jesse Zymet
Akshay Gupta
Andy Luo
Melissa Kazemi Rad
Swapnil Shinde
Mohammad Sorower
AAML
176
0
0
03 Mar 2025
Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks
Umar Ali Khan
Ekram Khan
Fiza Khan
A. A. Moinuddin
48
0
0
02 Mar 2025
BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
Terry Tong
Fei-Yue Wang
Zhe Zhao
M. Chen
AAML
ELM
37
1
0
01 Mar 2025
Towards Enhanced Immersion and Agency for LLM-based Interactive Drama
Hongqiu Wu
Weiqi Wu
Tianyang Xu
Jiameng Zhang
Hai Zhao
AI4CE
68
0
0
25 Feb 2025
Towards an automated workflow in materials science for combining multi-modal simulative and experimental information using data mining and large language models
Balduin Katzer
Steffen Klinder
Katrin Schulz
AI4CE
61
0
0
24 Feb 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer
Punit Singh Koura
Binh Tang
R. Subramanian
Aaditya K. Singh
...
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
ALM
71
0
0
24 Feb 2025
Pastiche Novel Generation Creating: Fan Fiction You Love in Your Favorite Author's Style
Xueran Han
Yuhan Liu
Mingzhe Li
Wei Liu
Sen Hu
Rui Yan
Zhiqiang Xu
Xiuying Chen
69
0
0
24 Feb 2025
Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation
SeongYeub Chu
JongWoo Kim
MunYong Yi
57
3
0
21 Feb 2025
Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
Aliyah R. Hsu
James Zhu
Zhichao Wang
Bin Bi
Shubham Mehrotra
...
Sougata Chaudhuri
Regunathan Radhakrishnan
S. Asur
Claire Na Cheng
Bin Yu
ALM
LRM
69
0
0
20 Feb 2025
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Darren Edge
Ha Trinh
Newman Cheng
Joshua Bradley
Alex Chao
Apurva Mody
Steven Truitt
Dasha Metropolitansky
Robert Osazuwa Ness
Jonathan Larson
RALM
145
336
0
20 Feb 2025
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models
Zihao Wei
Jingcheng Deng
Liang Pang
Hanxing Ding
Huawei Shen
Xueqi Cheng
KELM
81
4
0
20 Feb 2025
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
Yuan Sui
Yufei He
Zifeng Ding
Bryan Hooi
HILM
ELM
RALM
73
7
0
20 Feb 2025
SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning
Junkai Chen
Zhijie Deng
Kening Zheng
Yibo Yan
Shuliang Liu
PeiJun Wu
Peijie Jiang
J. Liu
Xuming Hu
MU
63
3
0
18 Feb 2025
Towards Reasoning Ability of Small Language Models
Gaurav Srivastava
Shuxiang Cao
Xuan Wang
ReLM
LRM
54
4
0
17 Feb 2025
Evaluating Small Language Models for News Summarization: Implications and Factors Influencing Performance
Borui Xu
Yao Chen
Zeyi Wen
Weiguo Liu
Bingsheng He
77
1
0
02 Feb 2025
Learning to Summarize from LLM-generated Feedback
Hwanjun Song
Taewon Yun
Yuho Lee
Jihwan Oh
Gihun Lee
Jason (Jinglun) Cai
Hang Su
73
2
0
28 Jan 2025
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
Yinhong Liu
Han Zhou
Zhijiang Guo
Ehsan Shareghi
Ivan Vulić
Anna Korhonen
Nigel Collier
ALM
132
69
0
20 Jan 2025
Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation
Shunfan Zheng
Xiechi Zhang
Gerard de Melo
Xiaoling Wang
Linlin Wang
LM&MA
ELM
42
0
0
12 Jan 2025
Measuring the Robustness of Reference-Free Dialogue Evaluation Systems
Justin Vasselli
Adam Nohejl
Taro Watanabe
AAML
49
0
0
12 Jan 2025
Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM
Dong Yuan
Eti Rastogi
Fen Zhao
Sagar Goyal
Gautam Naik
Sree Prasanna Rajagopal
39
0
0
31 Dec 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
X. Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw
LM&MA
LRM
61
18
0
31 Dec 2024
ACE-
M
3
M^3
M
3
: Automatic Capability Evaluator for Multimodal Medical Models
Xiechi Zhang
Shunfan Zheng
Linlin Wang
Gerard de Melo
Zhu Cao
Xiaoling Wang
Liang He
ELM
111
0
0
16 Dec 2024
Explingo: Explaining AI Predictions using Large Language Models
Alexandra Zytek
Sara Pido
Sarah Alnegheimish
Laure Berti-Equille
K. Veeramachaneni
69
1
0
06 Dec 2024
Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
S. Ramprasad
Byron C. Wallace
LLMAG
HILM
87
2
0
25 Nov 2024
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
Reshmi Ghosh
Tianyi Yao
Lizzy Chen
Sadid Hasan
Tianwei Chen
Dario Bernal
Huitian Jiao
H M Sajjad Hossain
ELM
72
0
0
25 Nov 2024
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
Yicheng Gao
G. Xu
Zhe Wang
Arman Cohan
38
6
0
07 Nov 2024
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
Yingzi Ma
Jiongxiao Wang
Fei-Yue Wang
Siyuan Ma
Jiazhao Li
...
B. Li
Yejin Choi
M. Chen
Chaowei Xiao
Chaowei Xiao
MU
58
6
0
05 Nov 2024
Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models
Do Xuan Long
Duong Ngoc Yen
Anh Tuan Luu
Kenji Kawaguchi
Min-Yen Kan
Nancy F. Chen
KELM
ELM
LRM
27
4
0
01 Nov 2024
On Positional Bias of Faithfulness for Long-form Summarization
David Wan
Jesse Vig
Mohit Bansal
Shafiq R. Joty
HILM
50
3
0
31 Oct 2024
CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs
Zhihao Liu
Chenhui Hu
ALM
ELM
33
1
0
29 Oct 2024
Evaluating LLMs for Targeted Concept Simplification for Domain-Specific Texts
Sumit Asthana
Hannah Rashkin
Elizabeth Clark
Fantine Huot
Mirella Lapata
36
0
0
28 Oct 2024
1
2
3
4
5
6
Next