

The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics (arXiv:2310.19792)

30 October 2023
Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, Steffen Eger
Tags: ALM, LRM, ELM

Papers citing "The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics"

26 papers shown
Large Language Models as Span Annotators
Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondrej Dusek, Simone Balloccu
Tags: ALM · Metrics: 35 / 0 / 0 · 11 Apr 2025
DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
Tags: ReLM, ELM, LRM · Metrics: 72 / 0 / 0 · 10 Apr 2025
Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?
Jeremy Barnes, Naiara Perez, Alba Bonet-Jover, Begoña Altuna
Metrics: 59 / 1 / 0 · 21 Mar 2025
Argument Summarization and its Evaluation in the Era of Large Language Models
Moritz Altemeyer, Steffen Eger, Johannes Daxenberger, Tim Altendorf, Philipp Cimiano, Benjamin Schiller
Tags: LM&MA, ELM, LRM · Metrics: 65 / 0 / 0 · 02 Mar 2025
ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?
Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Christoph Leiter, Simone Paolo Ponzetto, Fahimeh Moafian, Zhixue Zhao
Tags: MLLM · Metrics: 76 / 1 / 0 · 03 Dec 2024
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang, Wei-Ye Zhao, Steffen Eger
Metrics: 71 / 4 / 0 · 24 Oct 2024
MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task
Juraj Juraska, Daniel Deutsch, Mara Finkelstein, Markus Freitag
Metrics: 39 / 14 / 0 · 04 Oct 2024
MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators
Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao
Metrics: 35 / 3 / 0 · 22 Sep 2024
What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
Shashidhar Reddy Javaji, Zining Zhu
Tags: ELM, ALM · Metrics: 34 / 0 / 0 · 19 Sep 2024
PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation
Christoph Leiter, Steffen Eger
Metrics: 34 / 8 / 0 · 26 Jun 2024
Themis: Towards Flexible and Interpretable NLG Evaluation
Xinyu Hu, Li Lin, Mingqi Gao, Xunjian Yin, Xiaojun Wan
Tags: ELM · Metrics: 34 / 6 / 0 · 26 Jun 2024
Evaluating Diversity in Automatic Poetry Generation
Yanran Chen, Hannes Groner, Sina Zarrieß, Steffen Eger
Metrics: 34 / 8 / 0 · 21 Jun 2024
xCOMET-lite: Bridging the Gap Between Efficiency and Quality in Learned MT Evaluation Metrics
Daniil Larionov, Mikhail Seleznyov, Vasiliy Viskov, Alexander Panchenko, Steffen Eger
Metrics: 32 / 3 / 0 · 20 Jun 2024
Natural Language Processing RELIES on Linguistics
Juri Opitz, Shira Wein, Nathan Schneider
Tags: AI4CE · Metrics: 52 / 7 / 0 · 09 May 2024
Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph
Vladyslav Nechakhin, Jennifer D'Souza, Steffen Eger
Metrics: 46 / 4 / 0 · 03 May 2024
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, Xiaojun Wan
Tags: ELM, LM&MA · Metrics: 57 / 29 / 0 · 02 Feb 2024
Leveraging Large Language Models for NLG Evaluation: Advances and Challenges
Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, Shuai Ma
Tags: LM&MA, ELM · Metrics: 34 / 9 / 0 · 13 Jan 2024
Exploring Prompting Large Language Models as Explainable Metrics
Ghazaleh Mahmoudi
Tags: LRM · Metrics: 11 / 4 / 0 · 20 Nov 2023
Which is better? Exploring Prompting Strategy For LLM-based Metrics
Joonghoon Kim, Saeran Park, Kiyoon Jeong, Sangmin Lee, S. Han, Jiyoon Lee, Pilsung Kang
Metrics: 6 / 15 / 0 · 07 Nov 2023
Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task
Neema Kotonya, Saran Krishnasamy, Joel R. Tetreault, Alejandro Jaimes
Metrics: 16 / 9 / 0 · 01 Nov 2023
Towards Explainable Evaluation Metrics for Machine Translation
Christoph Leiter, Piyawat Lertvittayakumjorn, M. Fomicheva, Wei-Ye Zhao, Yang Gao, Steffen Eger
Tags: ELM · Metrics: 28 / 13 / 0 · 22 Jun 2023
Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang, Hung-yi Lee
Tags: ALM, LM&MA · Metrics: 224 / 571 / 0 · 03 May 2023
Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?
Doan Nam Long Vu, N. Moosavi, Steffen Eger
Metrics: 21 / 9 / 0 · 06 Sep 2022
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason W. Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou
Tags: ReLM, BDL, LRM, AI4CE · Metrics: 314 / 3,248 / 0 · 21 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
Tags: OSLM, ALM · Metrics: 313 / 11,915 / 0 · 04 Mar 2022
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform
Zhen Xu, Sergio Escalera, Isabelle M Guyon, Adrien Pavao, M. Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao
Metrics: 95 / 49 / 0 · 12 Oct 2021