2311.12022
Cited By
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
20 November 2023
David Rein
Betty Li Hou
Asa Cooper Stickland
Jackson Petty
Richard Yuanzhe Pang
Julien Dirani
Julian Michael
Samuel R. Bowman
AI4MH
ELM
Papers citing
"GPQA: A Graduate-Level Google-Proof Q&A Benchmark"
50 / 289 papers shown
Evaluating Gemini in an arena for learning
LearnLM Team Google
Abhinit Modi
Aditya Srikanth Veerubhotla
Aliya Rysbek
Andrea Huber
...
Theofilos Strinopoulos
Wei-Jen Ko
Yael Gold-Zamir
Yael Haramaty
Yannis Assael
ELM
40
0
0
30 May 2025
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Zafir Stojanovski
Oliver Stanley
Joe Sharratt
Richard Jones
Abdulhakeem Adefioye
Jean Kaddour
Andreas Kopf
OffRL
LRM
66
1
0
30 May 2025
DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance
Ali Khoramfar
Ali Ramezani
Mohammad Mahdi Mohajeri
M. Dousti
Majid Nili Ahmadabadi
Heshaam Faili
LRM
36
0
0
30 May 2025
Semi-structured LLM Reasoners Can Be Rigorously Audited
Jixuan Leng
Cassandra A. Cohen
Zhixian Zhang
Chenyan Xiong
William W. Cohen
LRM
37
0
0
30 May 2025
LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
Li yunhan
Wu gengshen
AILaw
ELM
ALM
30
0
0
30 May 2025
ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases
Y. Li
Xiaojun Zeng
Chihua Fang
Jian Yang
Fucang Jia
L. Zhang
LM&MA
ELM
AI4MH
51
0
0
30 May 2025
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Wei Fu
Jiaxuan Gao
Xujie Shen
Chen Zhu
Zhiyu Mei
...
Jun Mei
Jiashu Wang
Tongkai Yang
Binhang Yuan
Yi Wu
OffRL
SyDa
LRM
72
0
0
30 May 2025
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
Y. Liu
Kun Ouyang
Haoning Wu
Yi Liu
Lin Sui
Xinhao Li
Y. Zhong
Y. Charles
Xinyu Zhou
Xu Sun
VLM
LRM
96
0
0
29 May 2025
Evaluating the Sensitivity of LLMs to Prior Context
Robert Hankache
Kingsley Nketia Acheampong
Liang Song
Marek Brynda
Raad Khraishi
Greig A. Cowan
30
0
0
29 May 2025
DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
Ziyin Zhang
Jiahao Xu
Zhiwei He
Tian Liang
Qiuzhi Liu
...
Zhuosheng Zhang
Rui Wang
Zhaopeng Tu
Haitao Mi
Dong Yu
OffRL
LRM
75
1
0
29 May 2025
The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets
Shenzhe Zhu
Jiao Sun
Yi Nian
Tobin South
Alex Pentland
Jiaxin Pei
49
0
0
29 May 2025
Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data
Christopher Lee Lübbers
22
0
0
28 May 2025
THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models
Zhiyuan Li
Yi-Ju Chang
Yuan Wu
LLMAG
LRM
76
0
0
28 May 2025
Maximizing Confidence Alone Improves Reasoning
Mihir Prabhudesai
Lili Chen
Alex Ippoliti
Katerina Fragkiadaki
Hao Liu
Deepak Pathak
OOD
OffRL
ReLM
LRM
130
3
0
28 May 2025
AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models
Feng Luo
Yu-Neng Chuang
Guanchu Wang
Hoang Anh Duy Le
Shaochen Zhong
...
Jiayi Yuan
Yang Sui
Vladimir Braverman
Vipin Chaudhary
Helen Zhou
LRM
75
1
0
28 May 2025
When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy
Jirui Qi
Shan Chen
Zidi Xiong
Raquel Fernández
Danielle S. Bitterman
Arianna Bisazza
LRM
97
0
0
28 May 2025
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition
Hanting Chen
Yasheng Wang
Kai Han
Dong Li
Lin Li
...
Hailin Hu
Yehui Tang
Dacheng Tao
Xinghao Chen
Yunhe Wang
LRM
98
0
0
28 May 2025
ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
M. Shalyt
Rotem Elimelech
I. Kaminer
35
0
0
28 May 2025
Advancing Expert Specialization for Better MoE
Hongcan Guo
Haolang Lu
Guoshun Nan
Bolun Chu
Jialin Zhuang
Yuan Yang
Wenhao Che
Sicong Leng
Qimei Cui
Xudong Jiang
MoE
MoMe
101
0
0
28 May 2025
Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy
Paramita Mirza
Lucas Weber
Fabian Küch
51
0
0
28 May 2025
Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems
Y. Cho
Sharath Chandra Guntuku
Lyle Ungar
27
0
0
27 May 2025
Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning
Yang He
Xiao Ding
Bibo Cai
Yufei Zhang
Kai Xiong
Zhouhao Sun
Bing Qin
Ting Liu
LRM
54
0
0
27 May 2025
Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models
Injae Na
Keonwoong Noh
Woohwan Jung
72
0
0
27 May 2025
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge
Fengqing Jiang
Fengbo Ma
Zhangchen Xu
Yuetai Li
Bhaskar Ramasubramanian
Luyao Niu
Bo Li
Xianyan Chen
Zhen Xiang
Radha Poovendran
ALM
ELM
76
1
0
27 May 2025
Interleaved Reasoning for Large Language Models via Reinforcement Learning
Roy Xie
David Qiu
Deepak Gopinath
Dong Lin
Yanchao Sun
Chong-Jun Wang
Saloni Potdar
Bhuwan Dhingra
KELM
LRM
75
0
0
26 May 2025
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
Jiangjie Chen
Qianyu He
Siyu Yuan
Aili Chen
Zhicheng Cai
...
Qiying Yu
Xuefeng Li
Jiaze Chen
Hao Zhou
Mingxuan Wang
ReLM
LRM
99
2
0
26 May 2025
Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting
Yifan Wu
Jingze Shi
Bingheng Wu
Jiayi Zhang
Xiaotian Lin
Nan Tang
Yuyu Luo
LRM
100
1
0
26 May 2025
Toward Scientific Reasoning in LLMs: Training from Expert Discussions via Reinforcement Learning
Ming Yin
Yuanhao Qu
Dyllan Liu
Ling Yang
Le Cong
39
0
0
26 May 2025
Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition
Zihao Zeng
Xuyao Huang
Boxiu Li
Hao Zhang
Zhijie Deng
ReLM
LRM
33
0
0
26 May 2025
The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants
Yiqun Zhang
Hao Li
Chenxu Wang
L. Chen
Qiaosheng Zhang
...
Xinrun Wang
Jia Xu
Lei Bai
Wanli Ouyang
Shuyue Hu
79
0
0
26 May 2025
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Minheng Ni
Zhengyuan Yang
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
W. Zuo
Lijuan Wang
ReLM
LRM
85
1
0
26 May 2025
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
Junteng Liu
Yuanxiang Fan
Z. L. Jiang
Han Ding
Yongyi Hu
...
Yunan Huang
Mozhi Zhang
Pengyu Zhao
Junjie Yan
Junxian He
OffRL
NAI
SyDa
LRM
ELM
47
4
0
26 May 2025
Token-Importance Guided Direct Preference Optimization
Yang Ning
Lin Hai
Liu Yibo
Tian Baoliang
Liu Guoqing
Zhang Haijun
71
0
0
26 May 2025
Faster and Better LLMs via Latency-Aware Test-Time Scaling
Zili Wang
Tianyu Zhang
Haoli Bai
Lu Hou
Xianzhi Yu
Wulong Liu
Shiming Xiang
Lei Zhu
LRM
91
0
0
26 May 2025
RECAST: Strengthening LLMs' Complex Instruction Following with Constraint-Verifiable Data
Wenhao Liu
Zhengkang Guo
Mingchen Xie
Jingwen Xu
Zisu Huang
...
Changze Lv
He-Da Wang
Hu Yao
Xiaoqing Zheng
Xuanjing Huang
181
0
0
25 May 2025
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning
Kun Xiang
Heng Li
Terry Jingchen Zhang
Yinya Huang
Zirong Liu
...
J. N. Han
Hang Xu
Hanhui Li
Mrinmaya Sachan
Xiaodan Liang
LRM
186
0
0
25 May 2025
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
Fengqi Zhu
Rongzhen Wang
Shen Nie
Xiaolu Zhang
Chunwei Wu
...
Jun Zhou
Jianfei Chen
Yankai Lin
Ji-Rong Wen
Chongxuan Li
195
2
0
25 May 2025
ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models
Benjamin Clavié
Florian Brand
VLM
CoGe
64
0
0
25 May 2025
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
Jiwan Chung
Junhyeok Kim
Siyeol Kim
Jaeyoung Lee
Min Soo Kim
Youngjae Yu
LRM
95
0
0
24 May 2025
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
Hao Chen
Haoze Li
Zhiqing Xiao
Lirong Gao
Qi Zhang
Xiaomeng Hu
Ningtao Wang
Xing Fu
Junbo Zhao
206
0
0
24 May 2025
How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
Takashi Ishida
Thanawat Lodkaew
Ikko Yamane
221
0
0
23 May 2025
First Finish Search: Efficient Test-Time Scaling in Large Language Models
Aradhye Agarwal
Ayan Sengupta
Tanmoy Chakraborty
ReLM
RALM
ALM
LRM
111
0
0
23 May 2025
ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework
Lisheng Huang
Yichen Liu
Jinhao Jiang
Rongxiang Zhang
Jiahao Yan
Junyi Li
Wayne Xin Zhao
LLMAG
63
0
0
23 May 2025
Thought calibration: Efficient and confident test-time scaling
Menghua Wu
Cai Zhou
Stephen Bates
Tommi Jaakkola
LRM
83
0
0
23 May 2025
PD^3: A Project Duplication Detection Framework via Adapted Multi-Agent Debate
Dezheng Bao
Yueci Yang
Xin Chen
Zhengxuan Jiang
Zeguo Fei
...
Xuanwen Huang
Junru Chen
Chutian Yu
Xiang Yuan
Yang Yang
205
0
0
23 May 2025
Stable Reinforcement Learning for Efficient Reasoning
Muzhi Dai
Shixuan Liu
Qingyi Si
OffRL
LRM
117
0
0
23 May 2025
Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL
Che Liu
Haozhe Wang
J. Pan
Zhongwei Wan
Yong Dai
Fangzhen Lin
Wenjia Bai
Daniel Rueckert
Rossella Arcucci
OffRL
LRM
ELM
118
1
0
23 May 2025
ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models
Razvan-Gabriel Dumitru
Darius Peteleaza
Vikas Yadav
Liangming Pan
ReLM
LRM
115
1
0
22 May 2025
Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning
Cehao Yang
Xueyuan Lin
Chengjin Xu
Xuhui Jiang
Xiaojun Wu
Honghao Liu
Hui Xiong
Jian Guo
LRM
104
0
0
22 May 2025
MAPLE: Many-Shot Adaptive Pseudo-Labeling for In-Context Learning
Zihan Chen
Song Wang
Zhen Tan
Jundong Li
Cong Shen
OffRL
238
1
0
22 May 2025