ResearchTrend.AI
Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

24 May 2023
Daman Arora
H. Singh
Mausam
    ELM
    LRM

Papers citing "Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models"

31 / 31 papers shown
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
Haoxiang Sun
Yingqian Min
Z. Chen
Wayne Xin Zhao
Zhengyang Liang
Zihan Wang
Lei Fang
Zhicheng Dou
ELM
LRM
59
2
0
27 Mar 2025
SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
Chuan Qin
Xiusi Chen
Chengrui Wang
Pengmin Wu
Xi Chen
...
Han Wu
Chong Li
Yuanchun Zhou
H. Xiong
Hengshu Zhu
ELM
60
1
0
12 Mar 2025
Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
Daniel J.H. Chung
Zhiqi Gao
Yurii Kvasiuk
Tianyi Li
Moritz Münchmeyer
Maja Rudolph
Frederic Sala
Sai Chaitanya Tadepalli
AIMat
52
3
0
19 Feb 2025
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
X. Zhang
Yuxuan Dong
Yunsheng Wu
Jiaxing Huang
Chengyou Jia
Basura Fernando
Mike Zheng Shou
L. Zhang
Jun Liu
AIMat
ReLM
LRM
53
3
0
17 Feb 2025
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Yibo Yan
Shen Wang
Jiahao Huo
Jingheng Ye
Zhendong Chu
Xuming Hu
Philip S. Yu
Carla P. Gomes
B. Selman
Qingsong Wen
LRM
130
9
0
05 Feb 2025
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
Xin Xu
Qiyun Xu
Tong Xiao
Tianhao Chen
Yuchen Yan
Jiaxin Zhang
Shizhe Diao
Can Yang
Yang Wang
ELM
LRM
AI4CE
111
3
0
01 Feb 2025
HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics
Jingxuan Fan
Sarah Martinson
Erik Y. Wang
Kaylie Hausknecht
Jonah Brenner
Danxian Liu
Nianli Peng
Corey Wang
Michael P. Brenner
34
6
0
13 Oct 2024
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Bofei Gao
Feifan Song
Zhiyong Yang
Zefan Cai
Yibo Miao
...
Lei Sha
Yichang Zhang
Xuancheng Ren
Tianyu Liu
Baobao Chang
ELM
LRM
47
39
0
10 Oct 2024
Synergistic Simulations: Multi-Agent Problem Solving with Large Language Models
Asher Sprigler
Alexander Drobek
Keagan Weinstock
Wendpanga Tapsoba
Gavin Childress
Andy Dao
Lucas Gral
AI4CE
LLMAG
26
0
0
14 Sep 2024
A Perspective on Large Language Models, Intelligent Machines, and Knowledge Acquisition
V. Cherkassky
Eng Hock Lee
ELM
41
1
0
13 Aug 2024
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Zhen Huang
Zengzhi Wang
Shijie Xia
Xuefeng Li
Haoyang Zou
...
Yuxiang Zheng
Shaoting Zhang
Dahua Lin
Yu Qiao
Pengfei Liu
ELM
LRM
49
28
0
18 Jun 2024
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning
Joykirat Singh
A. Nambi
Vibhav Vineet
LRM
45
5
0
16 Jun 2024
Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming
Victor-Alexandru Pădurean
Adish Singla
ELM
54
3
0
14 Jun 2024
PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
Jiatong Li
Renjun Hu
Kunzhe Huang
Zhuang Yan
Qi Liu
Mengxiao Zhu
Xing Shi
Wei Lin
KELM
54
5
0
30 May 2024
MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting
Avinash Anand
Janak Kapuriya
Apoorv Singh
Jay Saraf
Naman Lal
Astha Verma
Rushali Gupta
R. Shah
LRM
38
12
0
11 Apr 2024
Are large language models superhuman chemists?
Adrian Mirza
Nawaf Alampara
Sreekanth Kunchapu
Benedict Emoekabu
Aswanth Krishnan
...
Leanne M. Stafast
Dinga Wonanke
Michael Pieler
P. Schwaller
Kevin Maik Jablonka
ELM
AI4MH
LRM
LM&MA
31
5
0
01 Apr 2024
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
Saurabh Srivastava
B. AnnaroseM
V. AntoP
Shashank Menon
Ajay Sukumar
T. AdwaithSamod
Alan Philipose
Stevin Prince
Sooraj Thomas
ELM
ReLM
LRM
39
46
0
29 Feb 2024
Evaluation of LLM Chatbots for OSINT-based Cyber Threat Awareness
Samaneh Shafee
A. Bessani
Pedro M. Ferreira
31
19
0
26 Jan 2024
Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives
Chen Gao
Xiaochong Lan
Nian Li
Yuan Yuan
Jingtao Ding
Zhilun Zhou
Fengli Xu
Yong Li
LLMAG
AI4CE
LM&Ro
44
106
0
19 Dec 2023
ChatGPT-4 with Code Interpreter can be used to solve introductory college-level vector calculus and electromagnetism problems
Tanuj Kumar
M. Kats
21
9
0
16 Sep 2023
A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time
Yeqi Gao
Zhao Song
Weixin Wang
Junze Yin
22
25
0
14 Sep 2023
MaScQA: A Question Answering Dataset for Investigating Materials Science Knowledge of Large Language Models
Mohd Zaki
J. Jayadeva
Mausam
N. M. A. Krishnan
ELM
27
4
0
17 Aug 2023
ARB: Advanced Reasoning Benchmark for Large Language Models
Tomohiro Sawada
Daniel Paleka
Alexander Havrilla
Pranav Tadepalli
Paula Vidas
Alexander Kranias
John J. Nay
Kshitij Gupta
Aran Komatsuzaki
ELM
LRM
45
37
0
25 Jul 2023
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
Xiaoxuan Wang
Ziniu Hu
Pan Lu
Yanqiao Zhu
Jieyu Zhang
Satyen Subramaniam
Arjun R. Loomba
Shichang Zhang
Yizhou Sun
Wei Wang
ELM
LRM
30
86
0
20 Jul 2023
Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models
Yuxi Ma
Chi Zhang
Song-Chun Zhu
ELM
ALM
40
8
0
07 Jul 2023
A Survey on Evaluation of Large Language Models
Yu-Chu Chang
Xu Wang
Jindong Wang
Yuanyi Wu
Linyi Yang
...
Yue Zhang
Yi-Ju Chang
Philip S. Yu
Qian Yang
Xingxu Xie
ELM
LM&MA
ALM
75
1,517
0
06 Jul 2023
Neural Task Synthesis for Visual Programming
Victor-Alexandru Pădurean
Georgios Tzannetos
Adish Singla
33
17
0
26 May 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
339
2,232
0
22 Mar 2023
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELM
ReLM
LRM
211
1,113
0
20 Sep 2022
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
S. Gu
Machel Reid
Yutaka Matsuo
Yusuke Iwasawa
ReLM
LRM
328
4,077
0
24 May 2022
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang
Jason W. Wei
Dale Schuurmans
Quoc Le
Ed H. Chi
Sharan Narang
Aakanksha Chowdhery
Denny Zhou
ReLM
BDL
LRM
AI4CE
326
3,273
0
21 Mar 2022