State of What Art? A Call for Multi-Prompt LLM Evaluation
Moran Mizrahi, Guy Kaplan, Daniel Malkin, Rotem Dror, Dafna Shahaf, Gabriel Stanovsky
31 December 2023 · arXiv 2401.00595 · v3 (latest) · Tags: ELM
Links: arXiv (abs) · PDF · HTML

Papers citing "State of What Art? A Call for Multi-Prompt LLM Evaluation"

38 / 38 papers shown
VLM@school -- Evaluation of AI image understanding on German middle school knowledge
René Peinl, Vincent Tischler
13 Jun 2025 · Tags: CoGe, VLM

Improving LLM Reasoning through Interpretable Role-Playing Steering
Anyi Wang, Dong Shu, Yifan Wang, Yunpu Ma, Mengnan Du
09 Jun 2025 · Tags: LLMSV, LRM

PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
Jenny Schmalfuss, Nadine Chang, Vibashan VS, Maying Shen, Andrés Bruhn, Jose M. Alvarez
03 Jun 2025 · Tags: VLM

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky
28 May 2025

Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG)
Clayton Cohn, Surya Rayala, Caitlin Snyder, J. Fonteles, Shruti Jain, ..., Ashwin T S, Namrata Srivastava, Menton Deweese, Angela Eeds, Gautam Biswas
22 May 2025 · Tags: RALM

Leveraging LLM Inconsistency to Boost Pass@k Performance
Uri Dalal, Meirav Segal, Zvika Ben-Haim, Dan Lahav, Omer Nevo
19 May 2025

Evaluations at Work: Measuring the Capabilities of GenAI in Use
Brandon Lepine, Gawesha Weerantunga, Juho Kim, Pamela Mishkin, Matthew Beane
15 May 2025

What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich, Anyi Wang, Raoyuan Zhao, Florian Eichin, Jonas Fischer, Barbara Plank
22 Apr 2025

MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
Dieuwke Hupkes, Nikolay Bogoychev
14 Apr 2025

Synthetic Fluency: Hallucinations, Confabulations, and the Creation of Irish Words in LLM-Generated Translations
Sheila Castilho, Zoe Fitzsimmons, Claire Holton, Aoife Mc Donagh
10 Apr 2025

SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff, Erblina Purellku, Jakob Hackstein, Jonas Loos, Leo Pinetzki, Lorenz Hufe
07 Apr 2025 · Tags: AAML

Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, R. Padman
28 Mar 2025

ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models
Alexey Karev, Dong Xu
18 Mar 2025

Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu
15 Mar 2025

Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results
Peter Fettke, Constantin Houy
14 Mar 2025 · Tags: ELM

DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Eliya Habba, Ofir Arviv, Itay Itzhak, Yotam Perlitz, Elron Bandel, Leshem Choshen, Michal Shmueli-Scheuer, Gabriel Stanovsky
03 Mar 2025

Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction
Sarah Ball, Simeon Allmendinger, Frauke Kreuter, Niklas Kühl
22 Feb 2025

From Selection to Generation: A Survey of LLM-based Active Learning
Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, ..., Namyong Park, T. Nguyen, Jiebo Luo, Ryan Rossi, Julian McAuley
17 Feb 2025

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, David Fernandez-Llorca
10 Feb 2025 · Tags: ELM

Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization
Yuanye Liu, Jiahang Xu, Li Zhang, Qi Chen, Xuan Feng, Yang Chen, Zhongxin Guo, Yuqing Yang, Peng Cheng
06 Feb 2025

LCTG Bench: LLM Controlled Text Generation Benchmark
Kemal Kurniawan, Masato Mita, Peinan Zhang, S. Sasaki, Ryosuke Ishigami, Naoaki Okazaki
28 Jan 2025

Personalizing Education through an Adaptive LMS with Integrated LLMs
Kyle Spriggs, Meng Cheng Lau, Kalpdrum Passi
24 Jan 2025 · Tags: AI4Ed

JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai
12 Dec 2024 · Tags: ALM, ELM

The Interaction Layer: An Exploration for Co-Designing User-LLM Interactions in Parental Wellbeing Support Systems
Sruthi Viswanathan, Seray Ibrahim, Ravi Shankar, Reuben Binns, Max Van Kleek, Petr Slovák
02 Nov 2024

How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang, Wei Zhao, Steffen Eger
24 Oct 2024

BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data
Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona Diab, Maarten Sap
21 Oct 2024

Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido, Roser Morante, Julio Gonzalo, Guillermo Marco, Jorge Carrillo-de-Albornoz, ..., Enrique Amigó, Andrés Fernández, Alejandro Benito-Santos, Adrián Ghajari Espinosa, Victor Fresno
19 Sep 2024 · Tags: ELM

Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
Yihang Zheng, Yue Liu, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li
05 Sep 2024 · Tags: ELM

A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios
Samuel Ackerman, Ella Rabinovich, E. Farchi, Ateret Anaby-Tavor
04 Aug 2024

Paraphrase Types Elicit Prompt Engineering Capabilities
Jan Philip Wahle, Terry Ruas, Yang Xu, Bela Gipp
28 Jun 2024

SEAM: A Stochastic Benchmark for Multi-Document Tasks
Gili Lior, Avi Caciularu, Arie Cattan, Shahar Levy, Ori Shapira, Gabriel Stanovsky
23 Jun 2024 · Tags: RALM

An Investigation of Prompt Variations for Zero-shot LLM-based Rankers
Shuoqi Sun, Shengyao Zhuang, Shuai Wang, Guido Zuccon
20 Jun 2024

ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models
Hwiyeol Jo, Hyunwoo Lee, Kang Min Yoo, Taiwoo Park
19 Jun 2024

Efficient multi-prompt evaluation of LLMs
Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin
27 May 2024

Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs
Sylvain Kouemo Ngassom, Arghavan Moradi Dakhel, Florian Tambon, Foutse Khomh
22 May 2024

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
25 Apr 2024

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
Norah A. Alzahrani, H. A. Alyahya, Sultan Yazeed Alnumay, Muhtasim Tahmid, Shaykhah Alsubaie, ..., Saleh Soltan, Nathan Scales, Marie-Anne Lachaux, Samuel R. Bowman, Haidar Khan
01 Feb 2024 · Tags: ELM

Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
Anton Voronov, Lena Wolf, Max Ryabinin
12 Jan 2024