ResearchTrend.AI

© 2025 ResearchTrend.AI, All rights reserved.


Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory

24 May 2023
Ziang Xiao
Susu Zhang
Vivian Lai
Q. V. Liao
    ELM
ArXiv | PDF | HTML

Papers citing "Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory"

19 / 19 papers shown
MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks
Jaime Raldua Veuthey
Zainab Ali Majid
Suhas Hariharan
Jacob Haimes
ELM
31
0
0
18 Apr 2025
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
Athiya Deviyani
Fernando Diaz
34
0
0
25 Mar 2025
SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction
Lu Dai
Yijie Xu
Jinhui Ye
Hao Liu
Hui Xiong
3DV
RALM
83
2
0
03 Mar 2025
LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
Helia Hashemi
J. Eisner
Corby Rosset
Benjamin Van Durme
Chris Kedzie
68
1
0
03 Jan 2025
What makes a good metric? Evaluating automatic metrics for text-to-image consistency
Candace Ross
Melissa Hall
Adriana Romero Soriano
Adina Williams
92
3
0
18 Dec 2024
Can LLM "Self-report"?: Evaluating the Validity of Self-report Scales in Measuring Personality Design in LLM-based Chatbots
Huiqi Zou
Pengda Wang
Zihan Yan
Tianjun Sun
Ziang Xiao
90
1
0
29 Nov 2024
Reliability of Topic Modeling
Kayla Schroeder
Zach Wood-Doughty
32
0
0
30 Oct 2024
A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models
Elena Kardanova
Alina Ivanova
Ksenia Tarasova
Taras Pashchenko
Aleksei Tikhoniuk
Elen Yusupova
Anatoly Kasprzhak
Yaroslav Kuzminov
Ekaterina Kruchinskaia
Irina Brun
47
1
0
29 Oct 2024
4-LEGS: 4D Language Embedded Gaussian Splatting
Gal Fiebelman
Tamir Cohen
Ayellet Morgenstern
Peter Hedman
Hadar Averbuch-Elor
3DGS
46
3
0
14 Oct 2024
Exploring Bengali Religious Dialect Biases in Large Language Models with Evaluation Perspectives
Azmine Toushik Wasi
Raima Islam
Mst Rafia Islam
Taki Hasan Rafi
Dong-Kyu Chae
35
3
0
25 Jul 2024
Position: Measure Dataset Diversity, Don't Just Claim It
Dora Zhao
Jerone T. A. Andrews
Orestis Papakyriakopoulos
Alice Xiang
64
14
0
11 Jul 2024
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang
Xiaoyuan Yi
Zhihua Wei
Shu Wang
Xing Xie
ALM
ELM
50
5
0
20 Jun 2024
SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation
Kun Zhao
Bohao Yang
Chen Tang
Chenghua Lin
Liang Zhan
41
5
0
24 May 2024
What Can Natural Language Processing Do for Peer Review?
Ilia Kuznetsov
Osama Mohammed Afzal
Koen Dercksen
Nils Dycke
Alexander Goldberg
...
Jingyan Wang
Xiaodan Zhu
Anna Rogers
Nihar B. Shah
Iryna Gurevych
38
12
0
10 May 2024
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM
LM&MA
60
29
0
02 Feb 2024
Value FULCRA: Mapping Large Language Models to the Multidimensional Spectrum of Basic Human Values
Jing Yao
Xiaoyuan Yi
Xiting Wang
Yifan Gong
Xing Xie
30
21
0
15 Nov 2023
Evaluating General-Purpose AI with Psychometrics
Xiting Wang
Liming Jiang
Jose Hernandez-Orallo
David Stillwell
Luning Sun
Fang Luo
Xing Xie
AI4MH
ELM
30
12
0
25 Oct 2023
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications
Kaitlyn Zhou
Su Lin Blodgett
Adam Trischler
Hal Daumé
Kaheer Suleman
Alexandra Olteanu
ELM
99
26
0
13 May 2022
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Ananya B. Sai
Tanay Dixit
D. Y. Sheth
S. Mohan
Mitesh M. Khapra
AAML
113
57
0
13 Sep 2021