Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? (arXiv:2309.07462)

14 September 2023
Rishav Hada
Varun Gumma
Adrian de Wynter
Harshita Diddee
Mohamed Ahmed
Monojit Choudhury
Kalika Bali
Sunayana Sitaram
    ALM
    LM&MA
    ELM

Papers citing "Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?"

49 / 49 papers shown
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Minghao Wu
Weixuan Wang
Sinuo Liu
Huifeng Yin
Xintong Wang
Yu Zhao
Chenyang Lyu
Longyue Wang
Weihua Luo
Kaifu Zhang
ELM
79
0
0
22 Apr 2025
From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy
Adrianna Romanowski
Pedro Valois
Kazuhiro Fukui
34
0
0
12 Apr 2025
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
Sher Badshah
Ali Emami
Hassan Sajjad
LLMAG
ELM
45
0
0
10 Apr 2025
A Multilingual, Culture-First Approach to Addressing Misgendering in LLM Applications
Sunayana Sitaram
Adrian de Wynter
Isobel McCrum
Qilong Gu
Si-Qing Chen
AILaw
106
0
0
26 Mar 2025
Got Compute, but No Data: Lessons From Post-training a Finnish LLM
Elaine Zosa
Ville Komulainen
S. Pyysalo
68
0
0
12 Mar 2025
DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering
Sher Badshah
Hassan Sajjad
62
1
0
11 Mar 2025
XIFBench: Evaluating Large Language Models on Multilingual Instruction Following
Z. Li
Kehai Chen
Yunfei Long
X. Bai
Yaoyin Zhang
Xuchen Wei
J. Li
Min Zhang
ELM
66
0
0
10 Mar 2025
An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Qiujie Xie
Qingqiu Li
Zhuohao Yu
Yuejie Zhang
Yue Zhang
Linyi Yang
ELM
63
1
0
15 Feb 2025
Aligning Black-box Language Models with Human Judgments
Gerrit J. J. van den Burg
Gen Suzuki
Wei Liu
Murat Sensoy
ALM
82
0
0
07 Feb 2025
Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali
Sharad Duwal
Suraj Prasai
Suresh Manandhar
CLL
84
1
0
18 Dec 2024
If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World
Adrian de Wynter
68
0
0
02 Dec 2024
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
Reshmi Ghosh
Tianyi Yao
Lizzy Chen
Sadid Hasan
Tianwei Chen
Dario Bernal
Huitian Jiao
H M Sajjad Hossain
ELM
76
0
0
25 Nov 2024
LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation
Yen-Shan Chen
Jing Jin
Peng-Ting Kuo
Chao-Wei Huang
Yun-Nung (Vivian) Chen
25
1
0
28 Oct 2024
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son
Dongkeun Yoon
Juyoung Suk
Javier Aula-Blasco
Mano Aslan
Vu Trong Kim
Shayekh Bin Islam
Jaume Prats-Cristià
Lucía Tormo-Bañuelos
Seungone Kim
ELM
LRM
25
8
0
23 Oct 2024
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
Varun Gumma
Anandhita Raghunath
Mohit Jain
Sunayana Sitaram
LM&MA
34
1
0
17 Oct 2024
Language Imbalance Driven Rewarding for Multilingual Self-improving
Wen Yang
Junhong Wu
Chen Wang
Chengqing Zong
Junzhe Zhang
ALM
LRM
66
4
0
11 Oct 2024
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!
Jiwan Chung
Seungwon Lim
Jaehyun Jeon
Seungbeen Lee
Youngjae Yu
22
0
0
01 Oct 2024
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text
Sher Badshah
Hassan Sajjad
ELM
40
9
0
17 Aug 2024
Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models
Kenza Benkirane
Laura Gongas
Shahar Pelles
Naomi Fuchs
Joshua Darmon
Pontus Stenetorp
David Ifeoluwa Adelani
Eduardo Sánchez
HILM
40
4
0
23 Jul 2024
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar
Sawsan Alqahtani
M Saiful Bari
Mizanur Rahman
Mohammad Abdullah Matin Khan
...
Chee Wei Tan
Md. Rizwan Parvez
Enamul Hoque
Shafiq R. Joty
Jimmy Huang
ELM
ALM
29
28
0
04 Jul 2024
ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
Mehant Kammakomati
Sameer Pimparkhede
Srikanth G. Tamilselvam
Prince Kumar
Pushpak Bhattacharyya
ALM
40
0
0
03 Jul 2024
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
...
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
ALM
ELM
54
62
0
26 Jun 2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts
Varun Gumma
Aditya Yadavalli
Vivek Seshadri
Manohar Swaminathan
Sunayana Sitaram
ELM
45
9
0
21 Jun 2024
On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations?
Rochelle Choenni
Sara Rajaee
Christof Monz
Ekaterina Shutova
39
1
0
20 Jun 2024
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur
Kartik Choudhary
Venkat Srinik Ramayapally
Sankaran Vaidyanathan
Dieuwke Hupkes
ELM
ALM
61
55
0
18 Jun 2024
Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios
Millicent Ochieng
Varun Gumma
Sunayana Sitaram
Jindong Wang
Vishrav Chaudhary
K. Ronen
Kalika Bali
Jacki O'Neill
34
4
0
01 Jun 2024
Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
Ruochen Zhao
Wenxuan Zhang
Yew Ken Chia
Deli Zhao
Lidong Bing
41
10
0
30 May 2024
Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models
Zhangyue Yin
Qiushi Sun
Qipeng Guo
Zhiyuan Zeng
Xiaonan Li
...
Qinyuan Cheng
Ding Wang
Xiaofeng Mou
Xipeng Qiu
XuanJing Huang
LRM
46
4
0
21 May 2024
ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation
Ana Brassard
Benjamin Heinzerling
Keito Kudo
Keisuke Sakaguchi
Kentaro Inui
LRM
39
0
0
08 May 2024
RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?
Adrian de Wynter
Ishaan Watts
Nektar Ege Altıntoprak
Tua Wongsangaroonsri
Minghui Zhang
...
Anna Vickers
Stéphanie Visser
Herdyan Widarmanto
A. Zaikin
Si-Qing Chen
LM&MA
52
16
0
22 Apr 2024
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
Libo Qin
Qiguang Chen
Yuhang Zhou
Zhi Chen
Hai-Tao Zheng
Lizi Liao
Min Li
Wanxiang Che
Philip S. Yu
LRM
55
36
0
07 Apr 2024
METAL: Towards Multilingual Meta-Evaluation
Rishav Hada
Varun Gumma
Mohamed Ahmed
Kalika Bali
Sunayana Sitaram
ELM
40
2
0
02 Apr 2024
The Minimum Information about CLinical Artificial Intelligence Checklist for Generative Modeling Research (MI-CLAIM-GEN)
Brenda Y. Miao
Irene Y. Chen
C. Y. Williams
Jaysón M. Davidson
Augusto Garcia-Agundez
...
Bin Yu
Milena Gianfrancesco
A. Butte
Beau Norgeot
Madhumita Sushil
VLM
39
2
0
05 Mar 2024
HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition
Yuxuan Liu
Tianchi Yang
Shaohan Huang
Zihan Zhang
Haizhen Huang
Furu Wei
Weiwei Deng
Feng Sun
Qi Zhang
31
13
0
24 Feb 2024
High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models
Michela Lorandi
Anya Belz
6
5
0
19 Feb 2024
Are LLM-based Evaluators Confusing NLG Quality Criteria?
Xinyu Hu
Mingqi Gao
Sen Hu
Yang Zhang
Yicheng Chen
Teng Xu
Xiaojun Wan
AAML
ELM
36
22
0
19 Feb 2024
MoRAL: MoE Augmented LoRA for LLMs' Lifelong Learning
Shu Yang
Muhammad Asif Ali
Cheng-Long Wang
Lijie Hu
Di Wang
CLL
MoE
37
38
0
17 Feb 2024
How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
Ehsan Doostmohammadi
Oskar Holmstrom
Marco Kuhlmann
35
8
0
16 Feb 2024
Financial Report Chunking for Effective Retrieval Augmented Generation
Antonio Jimeno-Yepes
Yao You
Jan Milczek
Sebastian Laverde
Renyu Li
43
20
0
05 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM
LM&MA
60
29
0
02 Feb 2024
Generating Zero-shot Abstractive Explanations for Rumour Verification
I. Bilal
Preslav Nakov
Rob Procter
M. Liakata
19
0
0
23 Jan 2024
Towards Conversational Diagnostic AI
Tao Tu
Anil Palepu
M. Schaekermann
Khaled Saab
Jan Freyberg
...
Katherine Chou
Greg S. Corrado
Yossi Matias
Alan Karthikesalingam
Vivek Natarajan
AI4MH
LM&MA
26
92
0
11 Jan 2024
The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance
A. Salinas
Fred Morstatter
45
49
0
08 Jan 2024
Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?
Tannon Kew
Florian Schottmann
Rico Sennrich
LRM
26
35
0
20 Dec 2023
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
Swarnadeep Saha
Omer Levy
Asli Celikyilmaz
Mohit Bansal
Jason Weston
Xian Li
MoMe
25
71
0
23 Oct 2023
Risk Aware Benchmarking of Large Language Models
Apoorva Nitsure
Youssef Mroueh
Mattia Rigotti
Kristjan Greenewald
Brian M. Belgodere
Mikhail Yurochkin
Jirí Navrátil
Igor Melnyk
Jerret Ross
30
1
0
11 Oct 2023
Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang
Hung-yi Lee
ALM
LM&MA
224
572
0
03 May 2023
Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning
Barun Patra
Saksham Singhal
Shaohan Huang
Zewen Chi
Li Dong
Furu Wei
Vishrav Chaudhary
Xia Song
56
23
0
26 Oct 2022
Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages
Kabir Ahuja
Sandipan Dandapat
Sunayana Sitaram
Monojit Choudhury
LRM
39
16
0
12 May 2022