State of What Art? A Call for Multi-Prompt LLM Evaluation
Moran Mizrahi, Guy Kaplan, Daniel Malkin, Rotem Dror, Dafna Shahaf, Gabriel Stanovsky
31 December 2023 · arXiv 2401.00595 · v3 (latest) · Tags: ELM
Links: arXiv (abs) · PDF · HTML

Papers citing "State of What Art? A Call for Multi-Prompt LLM Evaluation"

38 / 38 papers shown
VLM@school -- Evaluation of AI image understanding on German middle school knowledge
René Peinl, Vincent Tischler
13 Jun 2025 · Tags: CoGe, VLM

Improving LLM Reasoning through Interpretable Role-Playing Steering
Anyi Wang, Dong Shu, Yifan Wang, Yunpu Ma, Mengnan Du
09 Jun 2025 · Tags: LLMSV, LRM

PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
Jenny Schmalfuss, Nadine Chang, Vibashan VS, Maying Shen, Andrés Bruhn, Jose M. Alvarez
03 Jun 2025 · Tags: VLM

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky
28 May 2025

Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG)
Clayton Cohn, Surya Rayala, Caitlin Snyder, J. Fonteles, Shruti Jain, ..., Ashwin T S, Namrata Srivastava, Menton Deweese, Angela Eeds, Gautam Biswas
22 May 2025 · Tags: RALM

Leveraging LLM Inconsistency to Boost Pass@k Performance
Uri Dalal, Meirav Segal, Zvika Ben-Haim, Dan Lahav, Omer Nevo
19 May 2025

Evaluations at Work: Measuring the Capabilities of GenAI in Use
Brandon Lepine, Gawesha Weerantunga, Juho Kim, Pamela Mishkin, Matthew Beane
15 May 2025

What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich, Anyi Wang, Raoyuan Zhao, Florian Eichin, Jonas Fischer, Barbara Plank
22 Apr 2025

MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
Dieuwke Hupkes, Nikolay Bogoychev
14 Apr 2025

Synthetic Fluency: Hallucinations, Confabulations, and the Creation of Irish Words in LLM-Generated Translations
Sheila Castilho, Zoe Fitzsimmons, Claire Holton, Aoife Mc Donagh
10 Apr 2025

SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff, Erblina Purellku, Jakob Hackstein, Jonas Loos, Leo Pinetzki, Lorenz Hufe
07 Apr 2025 · Tags: AAML

Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, R. Padman
28 Mar 2025

ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models
Alexey Karev, Dong Xu
18 Mar 2025

Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu
15 Mar 2025

Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results
Peter Fettke, Constantin Houy
14 Mar 2025 · Tags: ELM

DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Eliya Habba, Ofir Arviv, Itay Itzhak, Yotam Perlitz, Elron Bandel, Leshem Choshen, Michal Shmueli-Scheuer, Gabriel Stanovsky
03 Mar 2025

Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction
Sarah Ball, Simeon Allmendinger, Frauke Kreuter, Niklas Kühl
22 Feb 2025

From Selection to Generation: A Survey of LLM-based Active Learning
Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, ..., Namyong Park, T. Nguyen, Jiebo Luo, Ryan Rossi, Julian McAuley
17 Feb 2025

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, David Fernandez-Llorca
10 Feb 2025 · Tags: ELM

Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization
Yuanye Liu, Jiahang Xu, Li Zhang, Qi Chen, Xuan Feng, Yang Chen, Zhongxin Guo, Yuqing Yang, Peng Cheng
06 Feb 2025

LCTG Bench: LLM Controlled Text Generation Benchmark
Kemal Kurniawan, Masato Mita, Peinan Zhang, S. Sasaki, Ryosuke Ishigami, Naoaki Okazaki
28 Jan 2025

Personalizing Education through an Adaptive LMS with Integrated LLMs
Kyle Spriggs, Meng Cheng Lau, Kalpdrum Passi
24 Jan 2025 · Tags: AI4Ed

JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai
12 Dec 2024 · Tags: ALM, ELM

The Interaction Layer: An Exploration for Co-Designing User-LLM Interactions in Parental Wellbeing Support Systems
Sruthi Viswanathan, Seray Ibrahim, Ravi Shankar, Reuben Binns, Max Van Kleek, Petr Slovák
02 Nov 2024

How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang, Wei Zhao, Steffen Eger
24 Oct 2024

BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data
Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona Diab, Maarten Sap
21 Oct 2024

Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido, Roser Morante, Julio Gonzalo, Guillermo Marco, Jorge Carrillo-de-Albornoz, ..., Enrique Amigó, Andrés Fernández, Alejandro Benito-Santos, Adrián Ghajari Espinosa, Victor Fresno
19 Sep 2024 · Tags: ELM

Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
Yihang Zheng, Yue Liu, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li
05 Sep 2024 · Tags: ELM

A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios
Samuel Ackerman, Ella Rabinovich, E. Farchi, Ateret Anaby-Tavor
04 Aug 2024

Paraphrase Types Elicit Prompt Engineering Capabilities
Jan Philip Wahle, Terry Ruas, Yang Xu, Bela Gipp
28 Jun 2024

SEAM: A Stochastic Benchmark for Multi-Document Tasks
Gili Lior, Avi Caciularu, Arie Cattan, Shahar Levy, Ori Shapira, Gabriel Stanovsky
23 Jun 2024 · Tags: RALM

An Investigation of Prompt Variations for Zero-shot LLM-based Rankers
Shuoqi Sun, Shengyao Zhuang, Shuai Wang, Guido Zuccon
20 Jun 2024

ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models
Hwiyeol Jo, Hyunwoo Lee, Kang Min Yoo, Taiwoo Park
19 Jun 2024

Efficient multi-prompt evaluation of LLMs
Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin
27 May 2024

Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs
Sylvain Kouemo Ngassom, Arghavan Moradi Dakhel, Florian Tambon, Foutse Khomh
22 May 2024

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
25 Apr 2024

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
Norah A. Alzahrani, H. A. Alyahya, Sultan Yazeed Alnumay, Muhtasim Tahmid, Shaykhah Alsubaie, ..., Saleh Soltan, Nathan Scales, Marie-Anne Lachaux, Samuel R. Bowman, Haidar Khan
01 Feb 2024 · Tags: ELM

Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
Anton Voronov, Lena Wolf, Max Ryabinin
12 Jan 2024