2106.03706
A Comprehensive Assessment of Dialog Evaluation Metrics
7 June 2021
Yi-Ting Yeh
M. Eskénazi
Shikib Mehri
Papers citing
"A Comprehensive Assessment of Dialog Evaluation Metrics"
50 / 81 papers shown
Title
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti
Sherzod Hakimov
David Schlangen
LLMAG
49
0
0
08 May 2025
BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation
Suvodip Dey
M. Desarkar
OffRL
41
0
0
20 Jan 2025
Towards Automatic Evaluation of Task-Oriented Dialogue Flows
Mehrnoosh Mirtaheri
Nikhil Varghese
Chandra Khatri
Amol Kelkar
26
0
0
15 Nov 2024
Is Our Chatbot Telling Lies? Assessing Correctness of an LLM-based Dutch Support Chatbot
Herman Lassche
Michiel Overeem
Ayushi Rastogi
45
0
0
29 Oct 2024
Findings of the WMT 2024 Shared Task on Chat Translation
Wafaa Mohammed
Sweta Agrawal
M. Amin Farajian
Vera Cabarrão
Bryan Eikema
Ana C. Farinha
José G. C. de Souza
29
3
0
15 Oct 2024
Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations
Ike Ebubechukwu
Johane Takeuchi
Antonello Ceravola
Frank Joublin
47
0
0
03 Sep 2024
Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs
John Mendonça
Isabel Trancoso
A. Lavie
ALM
31
1
0
20 Aug 2024
Survey of Design Paradigms for Social Robots
Rita Frieske
Xiaoyu Mo
Yini Fang
Jay Nieles
Bertram E. Shi
23
1
0
30 Jul 2024
Impact of Decoding Methods on Human Alignment of Conversational LLMs
Shaz Furniturewala
Kokil Jaidka
Yashvardhan Sharma
30
1
0
28 Jul 2024
ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues
John Mendonça
Isabel Trancoso
A. Lavie
34
3
0
16 Jul 2024
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation
John Mendonça
A. Lavie
Isabel Trancoso
ELM
43
2
0
04 Jul 2024
CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems
Tao Feng
Lizhen Qu
Xiaoxi Kang
Gholamreza Haffari
35
1
0
25 Jun 2024
Favi-Score: A Measure for Favoritism in Automated Preference Ratings for Generative AI Evaluation
Pius von Däniken
Jan Deriu
Don Tuggener
Mark Cieliebak
28
1
0
03 Jun 2024
Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations
Yi-Pei Chen
Noriki Nishida
Hideki Nakayama
Yuji Matsumoto
LLMAG
49
11
0
28 May 2024
CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems
Abbas Ghaddar
David Alfonso-Hermelo
Philippe Langlais
Mehdi Rezagholizadeh
Boxing Chen
Prasanna Parthasarathi
39
0
0
24 May 2024
Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models
Yiming Chen
Chen Zhang
Danqing Luo
L. F. D’Haro
R. Tan
Haizhou Li
AAML
ELM
40
2
0
23 May 2024
It Couldn't Help But Overhear: On the Limits of Modelling Meta-Communicative Grounding Acts with Supervised Learning
Brielen Madureira
David Schlangen
37
0
0
02 May 2024
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Yann Dubois
Balázs Galambosi
Percy Liang
Tatsunori Hashimoto
ALM
55
321
0
06 Apr 2024
PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison
ChaeHun Park
Minseok Choi
Dohyun Lee
Jaegul Choo
35
5
0
01 Apr 2024
Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems
Yuma Tsuta
Naoki Yoshinaga
Shoetsu Sato
Masashi Toyoda
31
1
0
04 Jan 2024
DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models
Songbo Hu
Xiaobin Wang
Moy Yuan
Anna Korhonen
Ivan Vulić
32
3
0
04 Jan 2024
A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators
Chen Zhang
L. F. D’Haro
Yiming Chen
Malu Zhang
Haizhou Li
ELM
21
29
0
24 Dec 2023
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
Bingbing Wen
Zhengyuan Yang
Jianfeng Wang
Zhe Gan
Bill Howe
Lijuan Wang
MLLM
44
1
0
21 Dec 2023
Dialogue Quality and Emotion Annotations for Customer Support Conversations
John Mendonça
Patrícia Pereira
Miguel Menezes
Vera Cabarrão
Ana C. Farinha
Helena Moniz
João Paulo Carvalho
A. Lavie
Isabel Trancoso
15
3
0
23 Nov 2023
A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems
Songbo Hu
Han Zhou
Moy Yuan
Milan Gritta
Guchun Zhang
Ignacio Iacobacci
Anna Korhonen
Ivan Vulić
32
3
0
19 Oct 2023
xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark
Chen Zhang
L. F. D’Haro
Chengguang Tang
Ke Shi
Guohua Tang
Haizhou Li
ELM
40
9
0
13 Oct 2023
Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores
Rikiya Takehi
Akihisa Watanabe
Tetsuya Sakai
20
3
0
30 Sep 2023
PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems
Bryan Wilie
Yan Xu
Willy Chung
Samuel Cahyawijaya
Holy Lovenia
Pascale Fung
30
1
0
19 Sep 2023
Towards Multilingual Automatic Dialogue Evaluation
John Mendonça
A. Lavie
Isabel Trancoso
19
0
0
31 Aug 2023
Three Ways of Using Large Language Models to Evaluate Chat
Ondřej Plátek
Vojtěch Hudeček
Patrícia Schmidtová
Mateusz Lango
Ondřej Dušek
ALM
19
6
0
12 Aug 2023
C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation
Liliang Ren
Mankeerat Sidhu
Qi Zeng
R. Reddy
Heng Ji
Chengxiang Zhai
14
6
0
27 Jun 2023
Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4
Mario Rodríguez-Cantelar
Chen Zhang
Chengguang Tang
Ke Shi
Sarik Ghazarian
João Sedoc
L. F. D’Haro
Alexander I. Rudnicky
28
9
0
22 Jun 2023
The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues
Anaïs Tack
E. Kochmar
Zheng Yuan
Serge Bibauw
Chris Piech
25
20
0
12 Jun 2023
Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs
A. Komma
Nagesh Panyam Chandrasekarasastry
Timothy Leffel
Anuj Kumar Goyal
A. Metallinou
Spyros Matsoukas
Aram Galstyan
33
3
0
06 Jun 2023
Correction of Errors in Preference Ratings from Automated Metrics for Text Generation
Jan Deriu
Pius von Däniken
Don Tuggener
Mark Cieliebak
27
2
0
06 Jun 2023
Don't Take This Out of Context! On the Need for Contextual Models and Evaluations for Stylistic Rewriting
Akhila Yerukola
Xuhui Zhou
Elizabeth Clark
Maarten Sap
25
6
0
24 May 2023
Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response
Yongkang Liu
Shi Feng
Daling Wang
Yifei Zhang
Hinrich Schütze
ALM
ELM
33
1
0
24 May 2023
How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation
Huda Khayrallah
Zuhaib Akhtar
Edward Cohen
João Sedoc
27
2
0
23 May 2023
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
Yen-Ting Lin
Yun-Nung (Vivian) Chen
24
91
0
23 May 2023
NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist
Iftitahu Ni'mah
Meng Fang
Vlado Menkovski
Mykola Pechenizkiy
27
13
0
15 May 2023
Talking with Machines: A Comprehensive Survey of Emergent Dialogue Systems
William Tholke
21
0
0
10 May 2023
Controllable Mixed-Initiative Dialogue Generation through Prompting
Maximillian Chen
Xiao Yu
Weiyan Shi
Urvi Awasthi
Zhou Yu
24
23
0
06 May 2023
Modeling What-to-ask and How-to-ask for Answer-unaware Conversational Question Generation
Do Xuan Long
Bowei Zou
Shafiq R. Joty
Anh Tai Tran
Liangming Pan
Nancy F. Chen
A. Aw
21
8
0
04 May 2023
Approximating Online Human Evaluation of Social Chatbots with Prompting
Ekaterina Svikhnushina
Pearl Pu
ELM
10
13
0
11 Apr 2023
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback
Baolin Peng
Michel Galley
Pengcheng He
Hao Cheng
Yujia Xie
...
Qiuyuan Huang
Lars Liden
Zhou Yu
Weizhu Chen
Jianfeng Gao
KELM
HILM
LRM
16
375
0
24 Feb 2023
A Transformer-based Response Evaluator for Open-Domain Spoken Conversation
Vrindavan Harrison
Rishi Rajasekaran
M. Walker
OffRL
24
3
0
09 Feb 2023
PoE: a Panel of Experts for Generalized Automatic Dialogue Assessment
Chen Zhang
L. F. D’Haro
Qiquan Zhang
Thomas Friedrichs
Haizhou Li
26
7
0
18 Dec 2022
FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Chen Zhang
L. F. D’Haro
Qiquan Zhang
Thomas Friedrichs
Haizhou Li
21
15
0
25 Oct 2022
EnDex: Evaluation of Dialogue Engagingness at Scale
Guangxuan Xu
Ruibo Liu
Fabrice Harel-Canada
Nischal Reddy Chandra
Nanyun Peng
15
5
0
22 Oct 2022
DialoGen: Generalized Long-Range Context Representation for Dialogue Systems
Suvodip Dey
M. Desarkar
Asif Ekbal
P. K. Srijith
24
2
0
12 Oct 2022