Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2307.13692
Cited By
ARB: Advanced Reasoning Benchmark for Large Language Models
25 July 2023
Tomohiro Sawada
Daniel Paleka
Alexander Havrilla
Pranav Tadepalli
Paula Vidas
Alexander Kranias
John J. Nay
Kshitij Gupta
Aran Komatsuzaki
ELM
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ARB: Advanced Reasoning Benchmark for Large Language Models"
16 / 16 papers shown
Title
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Zhaoxin Fan
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
86
2
0
26 Apr 2025
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Hamed Mahdavi
Alireza Hashemi
Majid Daliri
Pegah Mohammadipour
Alireza Farhadi
Samira Malek
Yekta Yazdanifard
Amir Khasahmadi
V. Honavar
ELM
LRM
61
1
0
01 Apr 2025
An Empirically-grounded tool for Automatic Prompt Linting and Repair: A Case Study on Bias, Vulnerability, and Optimization in Developer Prompts
Dhia Elhaq Rzig
Dhruba Jyoti Paul
Kaiser Pister
Jordan Henkel
Foyzul Hassan
80
0
0
21 Jan 2025
What is the best model? Application-driven Evaluation for Large Language Models
Shiguo Lian
Kaikai Zhao
Xinhui Liu
Xuejiao Lei
Bikun Yang
Wenjing Zhang
Kai Wang
Zhaoxiang Liu
ALM
ELM
40
2
0
14 Jun 2024
Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities
Wenyue Hua
Kaijie Zhu
Lingyao Li
Lizhou Fan
Shuhang Lin
Mingyu Jin
Haochen Xue
Zelong Li
Jindong Wang
Yongfeng Zhang
LRM
64
9
0
04 Jun 2024
Evaluating Mathematical Reasoning Beyond Accuracy
Shijie Xia
Xuefeng Li
Yixin Liu
Tongshuang Wu
Pengfei Liu
LRM
ReLM
52
21
0
08 Apr 2024
GLoRE: Evaluating Logical Reasoning of Large Language Models
Hanmeng Liu
Zhiyang Teng
Ruoxi Ning
Jian Liu
Qiji Zhou
Yuexin Zhang
Yue Zhang
ReLM
ELM
LRM
70
8
0
13 Oct 2023
LawBench: Benchmarking Legal Knowledge of Large Language Models
Zhiwei Fei
Xiaoyu Shen
D. Zhu
Fengzhe Zhou
Zhuo Han
Songyang Zhang
Kai-xiang Chen
Zongwen Shen
Jidong Ge
ELM
AILaw
34
34
0
28 Sep 2023
Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang
Hung-yi Lee
ALM
LM&MA
226
572
0
03 May 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
298
2,232
0
22 Mar 2023
Revisiting Neural Scaling Laws in Language and Vision
Ibrahim M. Alabdulmohsin
Behnam Neyshabur
Xiaohua Zhai
159
102
0
13 Sep 2022
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang
Jason W. Wei
Dale Schuurmans
Quoc Le
Ed H. Chi
Sharan Narang
Aakanksha Chowdhery
Denny Zhou
ReLM
BDL
LRM
AI4CE
314
3,248
0
21 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
389
8,495
0
28 Jan 2022
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
Mor Geva
Daniel Khashabi
Elad Segal
Tushar Khot
Dan Roth
Jonathan Berant
RALM
250
677
0
06 Jan 2021
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
261
4,489
0
23 Jan 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,959
0
20 Apr 2018
1