Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

19 February 2025

Nishant Balepur

Rachel Rudinger

Jordan Lee Boyd-Graber

Papers citing "Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above"

WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models
Abdullah Mushtaq, Imran Taj, Rafay Naeem, Ibrahim Ghaznavi, Junaid Qadir
59 0 0, 14 May 2025

BLAB: Brutally Long Audio Bench
Orevaoghene Ahia, Martijn Bartelds, Kabir Ahuja, Hila Gonen, Valentin Hofmann, ..., Noah Bennett, Shinji Watanabe, Noah A. Smith, Yulia Tsvetkov, Sachin Kumar
AuLLM, LM&MA, VLM, 112 0 0, 05 May 2025

COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky
VLM, CoGe, 125 1 0, 30 Apr 2025

Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation
Hannah Murray, Brian Hyeongseok Kim, Isabelle Lee, Jason Byun, Dani Yogatama, Evi Micha
82 1 0, 29 Mar 2025

Language Models Fail to Introspect About Their Knowledge of Language
Siyuan Song, Jennifer Hu, Kyle Mahowald
LRM, KELM, HILM, ELM, 115 4 0, 10 Mar 2025