Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2404.01869
Cited By
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
2 April 2024
Philipp Mondorf
Barbara Plank
ELM
LRM
LM&MA
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey"
21 / 21 papers shown
Title
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani
Sachit Menon
Ahmet Iscen
Shyamal Buch
Ramin Mehran
...
Yukun Zhu
Carl Vondrick
Mikhail Sirotenko
Cordelia Schmid
Tobias Weyand
58
0
0
01 May 2025
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment
Cheryl Li
Tianyuan Xu
Yiwen Guo
LRM
158
2
0
05 Feb 2025
Mathematical Language Models: A Survey
W. Liu
Hanglei Hu
Jie Zhou
Yuyang Ding
Junsong Li
...
Mengliang He
Qin Chen
Bo Jiang
Aimin Zhou
Liang He
LRM
79
12
0
03 Jan 2025
LLM-based Discriminative Reasoning for Knowledge Graph Question Answering
Mufan Xu
K. Chen
Xuefeng Bai
Muyun Yang
T. Zhao
Min Zhang
80
1
0
17 Dec 2024
Uncovering Latent Chain of Thought Vectors in Language Models
Jason Zhang
Scott Viteri
LLMSV
LRM
36
1
0
21 Sep 2024
Cognitively Inspired Energy-Based World Models
Alexi Gladstone
Ganesh Nanduru
Md. Mofijul Islam
Aman Chadha
Jundong Li
Tariq Iqbal
31
0
0
13 Jun 2024
Benchmarking Benchmark Leakage in Large Language Models
Ruijie Xu
Zengzhi Wang
Run-Ze Fan
Pengfei Liu
56
42
0
29 Apr 2024
CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning
Peiyuan Liu
Hang Guo
Tao Dai
Naiqi Li
Jigang Bao
Xudong Ren
Yong Jiang
Shu-Tao Xia
AI4TS
62
16
0
12 Mar 2024
Premise Order Matters in Reasoning with Large Language Models
Xinyun Chen
Ryan A. Chi
Xuezhi Wang
Denny Zhou
ReLM
LRM
44
26
0
14 Feb 2024
A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models
Yuxuan Wan
Wenxuan Wang
Yiliu Yang
Youliang Yuan
Jen-tse Huang
Pinjia He
Wenxiang Jiao
Michael R. Lyu
ELM
LRM
94
12
0
01 Jan 2024
Position: AI Evaluation Should Learn from How We Test Humans
Yan Zhuang
Q. Liu
Yuting Ning
Wei Huang
Rui Lv
Zhenya Huang
Guanhao Zhao
Zheng-Wei Zhang
ELM
ALM
66
21
0
18 Jun 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
286
2,232
0
22 Mar 2023
Understanding Natural Language Understanding Systems. A Critical Analysis
Alessandro Lenci
ELM
34
12
0
01 Mar 2023
The Debate Over Understanding in AI's Large Language Models
Melanie Mitchell
D. Krakauer
ELM
74
202
0
14 Oct 2022
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Abulhair Saparov
He He
ELM
LRM
ReLM
121
275
0
03 Oct 2022
RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning
Soumya Sanyal
Zeyi Liao
Xiang Ren
ELM
ReLM
LRM
59
19
0
25 May 2022
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
S. Gu
Machel Reid
Yutaka Matsuo
Yusuke Iwasawa
ReLM
LRM
307
4,077
0
24 May 2022
On the Paradox of Learning to Reason from Data
Honghua Zhang
Liunian Harold Li
Tao Meng
Kai-Wei Chang
Guy Van den Broeck
NAI
ReLM
OOD
LRM
134
103
0
23 May 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
311
11,915
0
04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
352
8,457
0
28 Jan 2022
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
228
4,460
0
23 Jan 2020
1