2401.00595
Cited By
State of What Art? A Call for Multi-Prompt LLM Evaluation
31 December 2023
Moran Mizrahi
Guy Kaplan
Daniel Malkin
Rotem Dror
Dafna Shahaf
Gabriel Stanovsky
ELM
Papers citing
"State of What Art? A Call for Multi-Prompt LLM Evaluation"
50 / 95 papers shown
Title
Cooking Up Creativity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations
Moran Mizrahi
Chen Shani
Gabriel Stanovsky
Dan Jurafsky
Dafna Shahaf
29
0
0
29 Apr 2025
How Effective are Generative Large Language Models in Performing Requirements Classification?
Waad Alhoshan
Alessio Ferrari
Liping Zhao
27
0
0
23 Apr 2025
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich
Anyi Wang
Raoyuan Zhao
Florian Eichin
Barbara Plank
35
0
0
22 Apr 2025
MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks
Jaime Raldua Veuthey
Zainab Ali Majid
Suhas Hariharan
Jacob Haimes
ELM
31
0
0
18 Apr 2025
MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
Dieuwke Hupkes
Nikolay Bogoychev
124
0
0
14 Apr 2025
Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability
Jennifer Haase
P. Hanel
Sebastian Pokutta
ALM
LRM
67
0
0
10 Apr 2025
Synthetic Fluency: Hallucinations, Confabulations, and the Creation of Irish Words in LLM-Generated Translations
Sheila Castilho
Zoe Fitzsimmons
Claire Holton
Aoife Mc Donagh
33
0
0
10 Apr 2025
Towards LLMs Robustness to Changes in Prompt Format Styles
Lilian Ngweta
Kiran Kate
Jason Tsay
Yara Rizk
AAML
VLM
35
0
0
09 Apr 2025
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff
Erblina Purellku
Jakob Hackstein
Jonas Loos
Leo Pinetzki
Lorenz Hufe
AAML
28
0
0
07 Apr 2025
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study
Aryan Agrawal
Lisa Alazraki
Shahin Honarvar
Marek Rei
57
0
0
03 Apr 2025
Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Yubo Li
Yidi Miao
Xueying Ding
Ramayya Krishnan
R. Padman
37
0
0
28 Mar 2025
ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models
Alexey Karev
Dong Xu
58
0
0
18 Mar 2025
Aligned Probing: Relating Toxic Behavior and Model Internals
Andreas Waldis
Vagrant Gautam
Anne Lauscher
Dietrich Klakow
Iryna Gurevych
45
0
0
17 Mar 2025
Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Junjie Chen
X. Liu
Subin Huang
Linfeng Zhang
Hang Yu
58
0
0
15 Mar 2025
Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results
Peter Fettke
Constantin Houy
ELM
46
0
0
14 Mar 2025
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Eliya Habba
Ofir Arviv
Itay Itzhak
Yotam Perlitz
Elron Bandel
Leshem Choshen
Michal Shmueli-Scheuer
Gabriel Stanovsky
77
2
0
03 Mar 2025
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
Tingchen Fu
Fazl Barez
AAML
65
0
0
03 Mar 2025
Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models
Cheng-Kuang Wu
Zhi Rui Tam
Chieh-Yen Lin
Yun-Nung Chen
Hung-yi Lee
64
0
0
03 Mar 2025
ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer
Omer Goldman
Uri Shaham
Dan Malkin
Sivan Eiger
Avinatan Hassidim
...
Shruti Rijhwani
Laura Rimell
Idan Szpektor
Reut Tsarfaty
Matan Eyal
47
3
0
28 Feb 2025
SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models
Grigor Nalbandyan
Rima Shahbazyan
Evelina Bakhturina
ELM
38
0
0
28 Feb 2025
Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction
Sarah Ball
Simeon Allmendinger
Frauke Kreuter
Niklas Kühl
57
0
0
22 Feb 2025
From Selection to Generation: A Survey of LLM-based Active Learning
Yu Xia
Subhojyoti Mukherjee
Zhouhang Xie
Junda Wu
Xintong Li
...
Namyong Park
T. Nguyen
Jiebo Luo
Ryan A. Rossi
Julian McAuley
55
0
0
17 Feb 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez Llorca
ELM
139
1
0
10 Feb 2025
Evalita-LLM: Benchmarking Large Language Models on Italian
Bernardo Magnini
Roberto Zanoli
Michele Resta
Martin Cimmino
Paolo Albano
Marco Madeddu
V. Patti
55
1
0
04 Feb 2025
LCTG Bench: LLM Controlled Text Generation Benchmark
Kemal Kurniawan
Masato Mita
Peinan Zhang
S. Sasaki
Ryosuke Ishigami
Naoaki Okazaki
55
0
0
28 Jan 2025
MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models
Zhongpu Chen
Y. Liu
Long Shi
Zhi-Jie Wang
Xingyan Chen
Yu Zhao
Fuji Ren
46
0
0
28 Jan 2025
Personalizing Education through an Adaptive LMS with Integrated LLMs
Kyle Spriggs
Meng Cheng Lau
Kalpdrum Passi
AI4Ed
57
0
0
24 Jan 2025
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera
Odellia Boni
Yotam Perlitz
Roy Bar-Haim
Lilach Eden
Asaf Yehudai
ALM
ELM
98
3
0
12 Dec 2024
ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?
Leixin Zhang
Steffen Eger
Yinjie Cheng
Weihe Zhai
Jonas Belouadi
Christoph Leiter
Simone Paolo Ponzetto
Fahimeh Moafian
Zhixue Zhao
MLLM
84
1
0
03 Dec 2024
SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts
Aihua Pei
Zehua Yang
Shunan Zhu
Ruoxi Cheng
Ju Jia
AAML
80
2
0
01 Dec 2024
Explaining GPT-4's Schema of Depression Using Machine Behavior Analysis
Adithya V Ganesan
Vasudha Varadarajan
Yash Kumar Lal
Veerle C. Eijsbroek
Katarina Kjell
...
Elizabeth C. Stade
J. Eichstaedt
Ryan L. Boyd
H. A. Schwartz
Lucie Flek
AI4MH
77
0
0
21 Nov 2024
The Interaction Layer: An Exploration for Co-Designing User-LLM Interactions in Parental Wellbeing Support Systems
Sruthi Viswanathan
Seray Ibrahim
Ravi Shankar
Reuben Binns
Max Van Kleek
Petr Slovák
71
1
0
02 Nov 2024
Knowledge Distillation Using Frontier Open-source LLMs: Generalizability and the Role of Synthetic Data
Anup Shirgaonkar
Nikhil Pandey
Nazmiye Ceren Abay
Tolga Aktas
Vijay Aski
ALM
SyDa
31
0
0
24 Oct 2024
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang
Wei-Ye Zhao
Steffen Eger
76
4
0
24 Oct 2024
BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data
Wenkai Li
Jiarui Liu
Andy Liu
Xuhui Zhou
Mona Diab
Maarten Sap
53
6
0
21 Oct 2024
LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks
Akshara Prabhakar
Yuanzhi Li
Karthik Narasimhan
Sham Kakade
Eran Malach
Samy Jelassi
MoMe
36
9
0
16 Oct 2024
ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Jingming Zhuo
S. Zhang
Xinyu Fang
Haodong Duan
Dahua Lin
Kai Chen
32
19
0
16 Oct 2024
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
Lorenzo Pacchiardi
Marko Tesic
Lucy G. Cheke
José Hernández-Orallo
36
3
0
15 Oct 2024
Eliciting Textual Descriptions from Representations of Continuous Prompts
Dana Ramati
Daniela Gottesman
Mor Geva
37
0
0
15 Oct 2024
A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies
Yen-Hsiang Wang
Feng-Dian Su
Tzu-Yu Yeh
Yao-Chung Fan
RALM
AILaw
31
0
0
15 Oct 2024
Skill Learning Using Process Mining for Large Language Model Plan Generation
Andrei Cosmin Redis
M. Sani
Bahram Zarrin
Andrea Burattin
34
0
0
14 Oct 2024
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
Thomas Palmeira Ferraz
Kartik Mehta
Yu-Hsiang Lin
Haw-Shiuan Chang
Shereen Oraby
Sijia Liu
Vivek Subramanian
Tagyoung Chung
Mohit Bansal
Nanyun Peng
56
7
0
09 Oct 2024
POSIX: A Prompt Sensitivity Index For Large Language Models
Anwoy Chatterjee
H. S. V. N. S. K. Renduchintala
S. Bhatia
Tanmoy Chakraborty
AAML
39
6
0
03 Oct 2024
A Survey on the Honesty of Large Language Models
Siheng Li
Cheng Yang
Taiqiang Wu
Chufan Shi
Yuji Zhang
...
Jie Zhou
Yujiu Yang
Ngai Wong
Xixin Wu
Wai Lam
HILM
32
4
0
27 Sep 2024
The Lou Dataset -- Exploring the Impact of Gender-Fair Language in German Text Classification
Andreas Waldis
Joel Birrer
Anne Lauscher
Iryna Gurevych
28
1
0
26 Sep 2024
SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation
Maying Shen
Nadine Chang
Sifei Liu
Jose M. Alvarez
36
0
0
20 Sep 2024
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido
Roser Morante
Julio Gonzalo
Guillermo Marco
Jorge Carrillo-de-Albornoz
...
Enrique Amigó
Andrés Fernández
Alejandro Benito-Santos
Adrián Ghajari Espinosa
Victor Fresno
ELM
51
0
0
19 Sep 2024
Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
Yihang Zheng
Bo-wen Li
Zhenghao Lin
Yi Luo
Xuanhe Zhou
Chen Lin
Jinsong Su
Guoliang Li
Shifu Li
ELM
46
1
0
05 Sep 2024
Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs
Maxim Ifergan
Leshem Choshen
Roee Aharoni
Idan Szpektor
Omri Abend
HILM
48
3
0
20 Aug 2024
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Zhi Rui Tam
Cheng-Kuang Wu
Yi-Lin Tsai
Chieh-Yen Lin
Hung-yi Lee
Yun-Nung Chen
27
24
0
05 Aug 2024