2006.06264
Cited By
Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
11 June 2020
Nitika Mathur
Tim Baldwin
Trevor Cohn
Papers citing
"Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics"
50 / 67 papers shown
Title
BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Yapei Chang
Yekyung Kim
Michael Krumdick
Amir Zadeh
Chuan Li
Chris Tanner
Mohit Iyyer
ALM
27
0
0
16 May 2025
On Benchmarking Code LLMs for Android Malware Analysis
Yiling He
Hongyu She
Xingzhi Qian
Xinran Zheng
Zhuo Chen
Zhan Qin
Lorenzo Cavallaro
ELM
55
1
0
01 Apr 2025
Training and Inference Efficiency of Encoder-Decoder Speech Models
Piotr Żelasko
Kunal Dhawan
Daniel Galvez
Krishna Puvvada
Ankita Pasad
Nithin Rao Koluguri
Ke Hu
Vitaly Lavrukhin
Jagadeesh Balam
Boris Ginsburg
56
0
0
07 Mar 2025
Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation
Mingqi Gao
Xinyu Hu
Li Lin
Xiaojun Wan
33
1
0
28 Jan 2025
Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics
Ameya Godbole
Robin Jia
HILM
62
1
0
24 Jan 2025
Investigating Length Issues in Document-level Machine Translation
Ziqian Peng
Rachel Bawden
François Yvon
77
1
0
23 Dec 2024
TeXBLEU: Automatic Metric for Evaluate LaTeX Format
Kyudan Jung
N. Kim
Hyongon Ryu
Sieun Hyeon
Seung-jun Lee
Hyeok-jae Lee
42
0
0
10 Sep 2024
Towards Zero-Shot Multimodal Machine Translation
Matthieu Futeral
Cordelia Schmid
Benoît Sagot
Rachel Bawden
45
3
0
18 Jul 2024
Sentence-level Aggregation of Lexical Metrics Correlates Stronger with Human Judgements than Corpus-level Aggregation
Paulo Cavalin
P. Domingues
Claudio S. Pinhanez
42
0
0
03 Jul 2024
Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation
Boxuan Lyu
Hidetaka Kamigaito
Kotaro Funakoshi
Manabu Okumura
48
0
0
17 Jun 2024
Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation
Matthias Sperber
Ondrej Bojar
Barry Haddow
Dávid Javorský
Xutai Ma
...
Jan Niehues
Peter Polák
Elizabeth Salesky
Katsuhito Sudoh
Marco Turchi
31
2
0
06 Jun 2024
Do LLMs Work on Charts? Designing Few-Shot Prompts for Chart Question Answering and Summarization
Do Xuan Long
Mohammad Hassanpour
Ahmed Masry
P. Kavehzadeh
Enamul Hoque
Chenyu You
LRM
32
9
0
17 Dec 2023
Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model
Christian Tomani
David Vilar
Markus Freitag
Colin Cherry
Subhajit Naskar
Mara Finkelstein
Xavier Garcia
Daniel Cremers
26
7
0
10 Oct 2023
TeCS: A Dataset and Benchmark for Tense Consistency of Machine Translation
Yiming Ai
Zhiwei He
Kai Yu
Rui Wang
8
1
0
23 May 2023
How Good are Commercial Large Language Models on African Languages?
Jessica Ojo
Kelechi Ogueji
31
5
0
11 May 2023
Angler: Helping Machine Translation Practitioners Prioritize Model Improvements
Samantha Robertson
Zijie J. Wang
Dominik Moritz
Mary Beth Kery
Fred Hohman
43
15
0
12 Apr 2023
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models
Qingyu Lu
Baopu Qiu
Liang Ding
Liping Xie
Tom Kocmi
Dacheng Tao
LRM
ALM
ELM
31
109
0
24 Mar 2023
Extrinsic Evaluation of Machine Translation Metrics
Nikita Moghe
Tom Sherborne
Mark Steedman
Alexandra Birch
ELM
36
18
0
20 Dec 2022
Toward Human-Like Evaluation for Natural Language Generation with Error Analysis
Qingyu Lu
Liang Ding
Liping Xie
Kanjian Zhang
Derek F. Wong
Dacheng Tao
ELM
ALM
41
14
0
20 Dec 2022
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
Tianxing He
Jingyu Zhang
Tianle Wang
Sachin Kumar
Kyunghyun Cho
James R. Glass
Yulia Tsvetkov
55
44
0
20 Dec 2022
T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
Yiwei Qin
Weizhe Yuan
Graham Neubig
Pengfei Liu
17
23
0
12 Dec 2022
DC-MBR: Distributional Cooling for Minimum Bayesian Risk Decoding
Jianhao Yan
Jin Xu
Fandong Meng
Jie Zhou
Yue Zhang
29
3
0
08 Dec 2022
Calibrated Interpretation: Confidence Estimation in Semantic Parsing
Elias Stengel-Eskin
Benjamin Van Durme
UQLM
46
24
0
14 Nov 2022
Leveraging Affirmative Interpretations from Negation Improves Natural Language Understanding
Md Mosharaf Hossain
Eduardo Blanco
50
4
0
26 Oct 2022
DEMETR: Diagnosing Evaluation Metrics for Translation
Marzena Karpinska
N. Raj
Katherine Thai
Yixiao Song
Ankita Gupta
Mohit Iyyer
34
38
0
25 Oct 2022
m^4Adapter: Multilingual Multi-Domain Adaptation for Machine Translation with a Meta-Adapter
Wen Lai
Alexandra Chronopoulou
Alexander Fraser
31
3
0
21 Oct 2022
Searching for a higher power in the human evaluation of MT
Johnny Tian-Zheng Wei
Tom Kocmi
C. Federmann
23
6
0
20 Oct 2022
Belief Revision based Caption Re-ranker with Visual Semantic Information
Ahmed Sabir
Francesc Moreno-Noguer
Pranava Madhyastha
Lluís Padró
BDL
34
2
0
16 Sep 2022
Rethinking Round-Trip Translation for Machine Translation Evaluation
Terry Yue Zhuo
Qiongkai Xu
Xuanli He
Trevor Cohn
LRM
29
2
0
15 Sep 2022
Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation
Cyril Chhun
Pierre Colombo
Chloé Clavel
Fabian M. Suchanek
58
51
0
24 Aug 2022
Lack of Fluency is Hurting Your Translation Model
J. Yoo
Jaewoo Kang
23
0
0
24 May 2022
Non-Autoregressive Machine Translation: It's Not as Fast as it Seems
Jindřich Helcl
Barry Haddow
Alexandra Birch
27
20
0
04 May 2022
Quality-Aware Decoding for Neural Machine Translation
Patrick Fernandes
António Farinhas
Ricardo Rei
José G. C. de Souza
Perez Ogayo
Graham Neubig
André F. T. Martins
49
57
0
02 May 2022
RoBLEURT Submission for the WMT2021 Metrics Task
Boyi Deng
Dayiheng Liu
Baosong Yang
Tianchi Bi
Haibo Zhang
Boxing Chen
Weihua Luo
Derek F. Wong
Lidia S. Chao
39
13
0
28 Apr 2022
UniTE: Unified Translation Evaluation
Boyi Deng
Dayiheng Liu
Baosong Yang
Haibo Zhang
Boxing Chen
Derek F. Wong
Lidia S. Chao
41
41
0
28 Apr 2022
Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics
Daniel Deutsch
Rotem Dror
Dan Roth
22
44
0
21 Apr 2022
A Call for Clarity in Beam Search: How It Works and When It Stops
Jungo Kasai
Keisuke Sakaguchi
Ronan Le Bras
Dragomir R. Radev
Yejin Choi
Noah A. Smith
31
7
0
11 Apr 2022
Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics
Jiannan Xiang
Huayang Li
Yahui Liu
Lemao Liu
Guoping Huang
Defu Lian
Shuming Shi
10
4
0
29 Mar 2022
Improving Both Domain Robustness and Domain Adaptability in Machine Translation
Wen Lai
Jindřich Libovický
Alexander Fraser
AI4CE
39
14
0
15 Dec 2021
Better than Average: Paired Evaluation of NLP Systems
Maxime Peyrard
Wei Zhao
Steffen Eger
Robert West
ELM
21
24
0
20 Oct 2021
Control Prefixes for Parameter-Efficient Text Generation
Jordan Clive
Kris Cao
Marek Rei
47
32
0
15 Oct 2021
Learning Compact Metrics for MT
Amy Pu
Hyung Won Chung
Ankur P. Parikh
Sebastian Gehrmann
Thibault Sellam
38
99
0
12 Oct 2021
Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors
Marvin Kaster
Wei Zhao
Steffen Eger
35
24
0
08 Oct 2021
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
Mingkai Deng
Bowen Tan
Zhengzhong Liu
Eric Xing
Zhiting Hu
21
73
0
14 Sep 2021
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Ananya B. Sai
Tanay Dixit
D. Y. Sheth
S. Mohan
Mitesh M. Khapra
AAML
116
58
0
13 Sep 2021
What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers
Boseop Kim
Hyoungseok Kim
Sang-Woo Lee
Gichang Lee
Donghyun Kwak
...
Jaewook Kang
Inho Kang
Jung-Woo Ha
W. Park
Nako Sung
VLM
249
121
0
10 Sep 2021
To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation
Tom Kocmi
C. Federmann
Roman Grundkiewicz
Marcin Junczys-Dowmunt
Hitokazu Matsushita
Arul Menezes
45
204
0
22 Jul 2021
Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers
Benjamin Marie
Atsushi Fujita
Raphaël Rubino
ELM
15
103
0
29 Jun 2021
BARTScore: Evaluating Generated Text as Text Generation
Weizhe Yuan
Graham Neubig
Pengfei Liu
57
811
0
22 Jun 2021
Machine Translation into Low-resource Language Varieties
Sachin Kumar
Antonios Anastasopoulos
S. Wintner
Yulia Tsvetkov
11
29
0
12 Jun 2021