Measuring Coding Challenge Competence With APPS


20 May 2021
Dan Hendrycks
Steven Basart
Saurav Kadavath
Mantas Mazeika
Akul Arora
Ethan Guo
Collin Burns
Samir Puranik
Horace He
D. Song
Jacob Steinhardt
    ELM
    AIMat
    ALM

Papers citing "Measuring Coding Challenge Competence With APPS"

50 / 129 papers shown
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks
Kai Xu
YiWei Mao
XinYi Guan
ZiLong Feng
38
0
0
12 May 2025
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
Rushi Qiang
Yuchen Zhuang
Yinghao Li
D. Kilman
Rongzhi Zhang
...
Ian Shu-Hei Wong
Sherry Yang
Percy Liang
Chao Zhang
Bo Dai
ELM
39
0
0
12 May 2025
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Zimu Lu
Y. Yang
Houxing Ren
Haotian Hou
Han Xiao
Ke Wang
Weikang Shi
Aojun Zhou
Mingjie Zhan
H. Li
LLMAG
45
0
0
06 May 2025
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
Sizhe Wang
Z. Wang
Dongsheng Ma
Yongan Yu
Rui Ling
Z. Li
Feiyu Xiong
W. Zhang
LRM
60
0
0
30 Apr 2025
SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories
Connor Dilgren
Purva Chiniya
Luke Griffith
Yu Ding
Yizheng Chen
42
0
0
29 Apr 2025
Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges
Yunseo Lee
John Youngeun Song
Dongsun Kim
Jindae Kim
Mijung Kim
Jaechang Nam
HILM
LRM
37
0
0
29 Apr 2025
GenTorrent: Scaling Large Language Model Serving with An Overlay Network
Fei Fang
Yifan Hua
Shengze Wang
Ruilin Zhou
Y. Liu
Chen Qian
X. Zhang
57
0
0
27 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
1
0
26 Apr 2025
Scaling Laws For Scalable Oversight
Joshua Engels
David D. Baek
Subhash Kantamneni
Max Tegmark
ELM
72
0
0
25 Apr 2025
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Minju Seo
Jinheon Baek
Seongyun Lee
S. Hwang
AI4CE
39
0
0
24 Apr 2025
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Jasper Götting
Pedro Medeiros
Jon G Sanders
Nathaniel Li
Long Phan
Karam Elabd
Lennart Justen
Dan Hendrycks
Seth Donoughe
ELM
52
2
0
21 Apr 2025
Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs
Marina Sakharova
Abhinav Anand
Mira Mezini
59
0
0
21 Apr 2025
CODECRASH: Stress Testing LLM Reasoning under Structural and Semantic Perturbations
Man Ho Adrian Lam
Chaozheng Wang
Jen-tse Huang
M. Lyu
LRM
34
0
0
19 Apr 2025
Measuring AI Ability to Complete Long Tasks
Thomas Kwa
Ben West
Joel Becker
Amy Deng
Katharyn Garcia
...
Lucas Jun Koba Sato
H. Wijk
Daniel M. Ziegler
Elizabeth Barnes
Lawrence Chan
ELM
77
6
0
18 Mar 2025
Enhancing High-Quality Code Generation in Large Language Models with Comparative Prefix-Tuning
Yuan Jiang
Yujian Zhang
Liang Lu
Christoph Treude
Xiaohong Su
Shan Huang
Tiantian Wang
ALM
63
0
0
12 Mar 2025
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
Wei Li
Xin Zhang
Zhongxin Guo
Shaoguang Mao
Wen Luo
Guangyue Peng
Yangyu Huang
Houfeng Wang
Scarlett Li
57
0
0
09 Mar 2025
ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions
Julian Aron Prenner
Romain Robbes
59
0
0
06 Mar 2025
Transferable Foundation Models for Geometric Tasks on Point Cloud Representations: Geometric Neural Operators
Blaine Quackenbush
P. Atzberger
3DPC
AI4CE
65
2
0
06 Mar 2025
CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation
Peiding Wang
L. Zhang
Fang Liu
Lin Shi
Minxiao Li
Bo Shen
An Fu
ELM
LRM
140
0
0
05 Mar 2025
How Diversely Can Language Models Solve Problems? Exploring the Algorithmic Diversity of Model-Generated Code
Seonghyeon Lee
Heejae Chon
Joonwon Jang
Dongha Lee
Hwanjo Yu
ALM
39
0
0
02 Mar 2025
Selective Prompt Anchoring for Code Generation
Yuan Tian
Tianyi Zhang
88
3
0
24 Feb 2025
How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark
Ruizhong Qiu
Weiliang Will Zeng
Hanghang Tong
James Ezick
Christopher Lott
88
15
0
20 Feb 2025
LeDex: Training LLMs to Better Self-Debug and Explain Code
Nan Jiang
Xiaopeng Li
Shiqi Wang
Qiang Zhou
Soneya Binta Hossain
Baishakhi Ray
Varun Kumar
Xiaofei Ma
Anoop Deoras
LRM
92
11
0
17 Feb 2025
InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
C. Xie
Shuo Cai
Wenjun Wang
Pengxiang Li
Zhijie Sang
...
Xiaotian Han
Jianbo Yuan
Shengyu Zhang
Fei Wu
Hongxia Yang
LRM
51
1
0
17 Feb 2025
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Samuel Miserendino
M. Wang
Tejal Patwardhan
Johannes Heidecke
43
17
0
17 Feb 2025
Preference Optimization for Reasoning with Pseudo Feedback
Fangkai Jiao
Geyang Guo
Xingxing Zhang
Nancy F. Chen
Shafiq R. Joty
Furu Wei
LRM
99
9
0
17 Feb 2025
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Bohan Lyu
Siqiao Huang
Zichen Liang
Qi-An Sun
Jiaming Zhang
ELM
LRM
57
0
0
16 Feb 2025
CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories
Yijia Xiao
Runhui Wang
Luyang Kong
Davor Golac
Wei Wang
LLMAG
148
0
0
10 Feb 2025
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
Md. Ashraful Islam
Mohammed Eunus Ali
Md. Rizwan Parvez
LLMAG
68
2
0
08 Feb 2025
Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization
Zishun Yu
Tengyu Xu
Di Jin
Karthik Abinav Sankararaman
Yun He
...
Eryk Helenowski
Chen Zhu
Sinong Wang
Hao Ma
Han Fang
LRM
54
4
0
29 Jan 2025
Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code
Shahin Honarvar
Mark van der Wilk
Alastair Donaldson
78
6
0
28 Jan 2025
QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks
Yaojie Hu
Qiang Zhou
Qihong Chen
Xiaopeng Li
Linbo Liu
Dejiao Zhang
Amit Kachroo
Talha Oz
Omer Tripp
66
4
0
20 Jan 2025
Planning-Driven Programming: A Large Language Model Programming Workflow
Chao Lei
Yanchuan Chang
N. Lipovetzky
Krista A. Ehinger
83
2
0
10 Jan 2025
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown
Jordan Juravsky
Ryan Ehrlich
Ronald Clark
Quoc V. Le
Christopher Ré
Azalia Mirhoseini
ALM
LRM
78
216
0
03 Jan 2025
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Shanghaoran Quan
Jiaxi Yang
Bowen Yu
Bo Zheng
Dayiheng Liu
...
Zeyu Cui
Yang Fan
Y. Zhang
Binyuan Hui
Junyang Lin
ALM
ELM
LRM
72
15
0
02 Jan 2025
Are Large Language Models Memorizing Bug Benchmarks?
Daniel Ramos
Claudia Mamede
Kush Jain
Paulo Canelas
Catarina Gamboa
Claire Le Goues
PILM
ELM
94
6
0
20 Nov 2024
Human-In-the-Loop Software Development Agents
Wannita Takerngsaksiri
Jirat Pasuksmit
Patanamon Thongtanunam
C. Tantithamthavorn
Ruixiong Zhang
Fan Jiang
Jing Li
Evan Cook
K. Chen
Ming Wu
LLMAG
100
1
0
19 Nov 2024
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
Nizar Islah
Justine Gehring
Diganta Misra
Eilif B. Muller
Irina Rish
Terry Yue Zhuo
Massimo Caccia
SyDa
38
1
0
05 Nov 2024
Enhancing LLM Agents for Code Generation with Possibility and Pass-rate Prioritized Experience Replay
Yuyang Chen
Kaiyan Zhao
Yiming Wang
Ming Yang
Jian Zhang
Xiaoguang Niu
33
1
0
16 Oct 2024
Mastering the Craft of Data Synthesis for CodeLLMs
Meng Chen
Philip Arthur
Qianyu Feng
Cong Duy Vu Hoang
Yu-Heng Hong
...
Mark Johnson
K. K.
Don Dharmasiri
Long Duong
Yuan-Fang Li
SyDa
58
1
0
16 Oct 2024
TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees
Weibin Liao
Xu Chu
Yasha Wang
LRM
42
6
0
10 Oct 2024
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
Yekun Chai
Haoran Sun
Huang Fang
Shuohuan Wang
Yu Sun
Hua-Hong Wu
144
1
0
03 Oct 2024
ChatGPT as a Solver and Grader of Programming Exams written in Spanish
Pablo Fernández-Saborido
Marcos Fernández-Pichel
David E. Losada
ELM
42
0
0
23 Sep 2024
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen
Ruiqi Zhong
Akbir Khan
Ethan Perez
Jacob Steinhardt
Minlie Huang
Samuel R. Bowman
He He
Shi Feng
29
34
0
19 Sep 2024
Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis
Adrian Marius Dumitran
Adrian Catalin Badea
Stefan-Gabriel Muscalu
ELM
LRM
30
1
0
31 Aug 2024
What can Large Language Models Capture about Code Functional Equivalence?
Nickil Maveli
Antonio Vergari
Shay B. Cohen
36
2
0
20 Aug 2024
Strong Copyright Protection for Language Models via Adaptive Model Fusion
Javier Abad
Konstantin Donhauser
Francesco Pinto
Fanny Yang
45
4
0
29 Jul 2024
Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models
Somshubra Majumdar
Vahid Noroozi
Sean Narenthiran
Aleksander Ficek
Wasi Uddin Ahmad
Jocelyn Huang
Jagadeesh Balam
Boris Ginsburg
SyDa
54
2
0
29 Jul 2024
Effective Large Language Model Debugging with Best-first Tree Search
Jialin Song
Jonathan Raiman
Bryan Catanzaro
LRM
38
0
0
26 Jul 2024
Prompting Techniques for Secure Code Generation: A Systematic Investigation
Catherine Tony
Nicolás E. Díaz Ferreyra
Markus Mutas
Salem Dhiff
Riccardo Scandariato
SILM
73
9
0
09 Jul 2024