G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment


29 March 2023
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
    ELM
    ALM
    LM&MA

Papers citing "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"

50 / 765 papers shown
ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues
John Mendonça
Isabel Trancoso
A. Lavie
39
3
0
16 Jul 2024
AdaptEval: Evaluating Large Language Models on Domain Adaptation for Text Summarization
Anum Afzal
Ribin Chalumattu
Florian Matthes
Laura Mascarell
ALM
ELM
35
4
0
16 Jul 2024
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights
Shunqi Mao
Chaoyi Zhang
Hang Su
Hwanjun Song
Igor Shalyminov
Weidong Cai
46
1
0
16 Jul 2024
GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework
Hannah Sansford
Nicholas Richardson
Hermina Petric Maretic
Juba Nait Saada
47
13
0
15 Jul 2024
CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses
Jing Yao
Xiaoyuan Yi
Xing Xie
ELM
ALM
38
7
0
15 Jul 2024
DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
Anni Zou
Wenhao Yu
Hongming Zhang
Kaixin Ma
Deng Cai
Zhuosheng Zhang
Hai Zhao
Dong Yu
49
6
0
15 Jul 2024
Cohesive Conversations: Enhancing Authenticity in Multi-Agent Simulated Dialogues
Kuanchao Chu
Yi-Pei Chen
Hideki Nakayama
LLMAG
47
2
0
13 Jul 2024
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Shraman Pramanick
Rama Chellappa
Subhashini Venugopalan
50
15
0
12 Jul 2024
Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
Biqing Qi
Kaiyan Zhang
Kai Tian
Haoxiang Li
Zhang-Ren Chen
Sihang Zeng
Ermo Hua
Hu Jinfang
Bowen Zhou
LM&MA
45
11
0
12 Jul 2024
Lynx: An Open Source Hallucination Evaluation Model
Selvan Sunitha Ravi
B. Mielczarek
Anand Kannappan
Douwe Kiela
Rebecca Qian
VLM
RALM
HILM
56
17
0
11 Jul 2024
On LLM Wizards: Identifying Large Language Models' Behaviors for Wizard of Oz Experiments
Jingchao Fang
Nikos Aréchiga
Keiichi Namaoshi
N. Bravo
Candice L Hogan
David A. Shamma
44
3
0
10 Jul 2024
A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability
Ting Fang Tan
Kabilan Elangovan
J. Ong
Nigam Shah
J. Sung
...
Haibo Wang
Chang Fu Kuo
Simon Chesterman
Zee Kin Yeong
Daniel Ting
ELM
35
4
0
10 Jul 2024
CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
Tong Chen
Akari Asai
Niloofar Mireshghallah
Sewon Min
James Grimmelmann
Yejin Choi
Hannaneh Hajishirzi
Luke Zettlemoyer
Pang Wei Koh
56
17
0
09 Jul 2024
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Yung-Sung Chuang
Linlu Qiu
Cheng-Yu Hsieh
Ranjay Krishna
Yoon Kim
James R. Glass
HILM
18
36
0
09 Jul 2024
Source Code Summarization in the Era of Large Language Models
Weisong Sun
Yun Miao
Yuekang Li
Hongyu Zhang
Chunrong Fang
Yi Liu
Gelei Deng
Yang Liu
Zhenyu Chen
ELM
60
14
0
09 Jul 2024
OffsetBias: Leveraging Debiased Data for Tuning Evaluators
Junsoo Park
Seungyeon Jwa
Meiying Ren
Daeyoung Kim
Sanghyuk Choi
ALM
34
36
0
09 Jul 2024
Efficient and Accurate Memorable Conversation Model using DPO based on sLLM
Youngkyung Seo
Yoonseok Heo
Jun-Seok Koh
Du-Seong Chang
55
0
0
09 Jul 2024
A Factuality and Diversity Reconciled Decoding Method for Knowledge-Grounded Dialogue Generation
Chenxu Yang
Zheng Lin
Chong Tian
Liang Pang
Lanrui Wang
Zhengyang Tong
Qirong Ho
Yanan Cao
Weiping Wang
HILM
44
0
0
08 Jul 2024
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation
Gaurav Sahu
Abhay Puri
Juan A. Rodriguez
Alexandre Drouin
Perouz Taslakian
...
Christopher Pal
Nicolas Chapados
I. Laradji
Sai Rajeswar Mudumba
Issam Hadj Laradji
ELM
50
5
0
08 Jul 2024
Enhancing Hallucination Detection through Perturbation-Based Synthetic Data Generation in System Responses
Dongxu Zhang
Varun Gangal
B. Lattimer
Yi Yang
40
6
0
07 Jul 2024
Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course
Cheng-Han Chiang
Wei-Chih Chen
Chun-Yi Kuan
Chienchou Yang
Hung-yi Lee
ELM
AI4Ed
49
5
0
07 Jul 2024
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Yijia Xiao
Edward Sun
Tianyu Liu
Wei Wang
LRM
35
28
0
06 Jul 2024
Evaluating Language Models for Generating and Judging Programming Feedback
Charles Koutcheme
Nicola Dainese
Arto Hellas
Sami Sarsa
Juho Leinonen
Syed Ashraf
Paul Denny
ELM
39
2
0
05 Jul 2024
Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs
Mihir Parmar
Hanieh Deilamsalehy
Franck Dernoncourt
Seunghyun Yoon
Ryan A. Rossi
Trung Bui
34
2
0
05 Jul 2024
On Evaluating The Performance of Watermarked Machine-Generated Texts Under Adversarial Attacks
Zesen Liu
Tianshuo Cong
Xinlei He
Qi Li
AAML
WaLM
68
1
0
05 Jul 2024
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation
John Mendonça
A. Lavie
Isabel Trancoso
ELM
43
2
0
04 Jul 2024
Large Language Models as Evaluators for Scientific Synthesis
Julia Evans
Jennifer D'Souza
Sören Auer
ELM
42
4
0
03 Jul 2024
Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation
Xinglin Wang
Yiwei Li
Shaoxiong Feng
Peiwen Yuan
Boyuan Pan
Heda Wang
Yao Hu
Kan Li
38
10
0
02 Jul 2024
Compare without Despair: Reliable Preference Evaluation with Generation Separability
Sayan Ghosh
Tejas Srinivasan
Swabha Swayamdipta
56
2
0
02 Jul 2024
Free-text Rationale Generation under Readability Level Control
Yi-Sheng Hsu
Nils Feldhus
Sherzod Hakimov
46
0
0
01 Jul 2024
FineSurE: Fine-grained Summarization Evaluation using LLMs
Hwanjun Song
Hang Su
Igor Shalyminov
Jason (Jinglun) Cai
Saab Mansour
HILM
41
32
0
01 Jul 2024
PerSEval: Assessing Personalization in Text Summarizers
Sourish Dasgupta
Ankush Chander
Parth Borad
Isha Motiyani
Tanmoy Chakraborty
42
0
0
29 Jun 2024
The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
Xinyi Chen
Baohao Liao
Jirui Qi
Panagiotis Eustratiadis
Christof Monz
Arianna Bisazza
Maarten de Rijke
ALM
ELM
LRM
38
5
0
28 Jun 2024
When Search Engine Services meet Large Language Models: Visions and Challenges
Haoyi Xiong
Jiang Bian
Yuchen Li
Xuhong Li
Mengnan Du
Shuaiqiang Wang
Dawei Yin
Sumi Helal
64
29
0
28 Jun 2024
Can Large Language Models Generate High-quality Patent Claims?
Lekang Jiang
Caiqi Zhang
Pascal A Scherz
Stephan Goetz
ELM
38
5
0
27 Jun 2024
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
...
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
ALM
ELM
56
62
0
26 Jun 2024
Themis: Towards Flexible and Interpretable NLG Evaluation
Xinyu Hu
Li Lin
Mingqi Gao
Xunjian Yin
Xiaojun Wan
ELM
34
7
0
26 Jun 2024
ConvoCache: Smart Re-Use of Chatbot Responses
Conor Atkins
Ian D. Wood
M. Kâafar
Hassan Jameel Asghar
Nardine Basta
Michal Kepkowski
48
0
0
26 Jun 2024
BADGE: BADminton report Generation and Evaluation with LLM
Shang-Hsuan Chiang
Lin-Wei Chao
Kuang-Da Wang
Chih-Chuan Wang
Wen-Chih Peng
63
2
0
26 Jun 2024
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
Ju-Seung Byun
Jiyun Chun
Jihyung Kil
Andrew Perrault
ReLM
LRM
43
2
0
25 Jun 2024
CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems
Tao Feng
Lizhen Qu
Xiaoxi Kang
Gholamreza Haffari
38
1
0
25 Jun 2024
Crafting Customisable Characters with LLMs: Introducing SimsChat, a Persona-Driven Role-Playing Agent Framework
Bohao Yang
Dong Liu
Chen Tang
Chenghao Xiao
Kun Zhao
Chao Li
Lin Yuan
Guang Yang
Lanxiao Huang
Chenghua Lin
51
2
0
25 Jun 2024
RaTEScore: A Metric for Radiology Report Generation
W. Zhao
Chaoyi Wu
X. Zhang
Ya Zhang
Yanfeng Wang
Weidi Xie
37
8
0
24 Jun 2024
AnnotatedTables: A Large Tabular Dataset with Language Model Annotations
Yaojie Hu
Ilias Fountalis
Jin Tian
N. Vasiloglou
LMTD
41
4
0
24 Jun 2024
A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick
Nishant Balepur
Matthew Shu
Alexander Hoyle
Alison Robey
Shi Feng
Seraphina Goldfarb-Tarrant
Jordan Boyd-Graber
49
2
0
21 Jun 2024
A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
I. Zubiaga
A. Soroa
Rodrigo Agerri
42
5
0
21 Jun 2024
Factual Dialogue Summarization via Learning from Large Language Models
Rongxin Zhu
Jey Han Lau
Jianzhong Qi
HILM
60
1
0
20 Jun 2024
Holistic Evaluation for Interleaved Text-and-Image Generation
Minqian Liu
Zhiyang Xu
Zihao Lin
Trevor Ashby
Joy Rimchala
Jiaxin Zhang
Lifu Huang
EGVM
46
7
0
20 Jun 2024
Step-Back Profiling: Distilling User History for Personalized Scientific Writing
Xiangru Tang
Xingyao Zhang
Yanjun Shao
Jie Wu
Yilun Zhao
Arman Cohan
Ming Gong
Dongmei Zhang
Mark B. Gerstein
52
2
0
20 Jun 2024
Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking
Mohamed S. Elaraby
Diane Litman
Xiang Lorraine Li
Ahmed Magooda
LRM
42
2
0
20 Jun 2024