Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
2 May 2023
Jiawei Liu
Chun Xia
Yuyao Wang
Lingming Zhang
ELM
ALM
Papers citing
"Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation"
50 / 132 papers shown
Title
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks
Kai Xu
YiWei Mao
XinYi Guan
ZiLong Feng
38
0
0
12 May 2025
xGen-small Technical Report
Erik Nijkamp
Bo Pang
Egor Pakhomov
Akash Gokul
Jin Qu
Silvio Savarese
Yingbo Zhou
Caiming Xiong
LLMAG
53
0
0
10 May 2025
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization
Jae-Won Chung
Jiachen Liu
Jeff J. Ma
Ruofan Wu
Oh Jun Kweon
Yuxuan Xia
Zhiyu Wu
Mosharaf Chowdhury
28
0
0
09 May 2025
Scalable Chain of Thoughts via Elastic Reasoning
Yuhui Xu
Hanze Dong
Lei Wang
Doyen Sahoo
Junnan Li
Caiming Xiong
OffRL
LRM
51
1
0
08 May 2025
Directed Greybox Fuzzing via Large Language Model
HanXiang Xu
Yanjie Zhao
Haoyu Wang
48
0
0
06 May 2025
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
Kazuki Fujii
Yukito Tajima
Sakae Mizuki
Hinari Shimada
Taihei Shiotani
...
Kakeru Hattori
Youmi Ma
Hiroya Takamura
Rio Yokota
Naoaki Okazaki
SyDa
49
0
0
05 May 2025
AKD: Adversarial Knowledge Distillation For Large Language Models Alignment on Coding tasks
Ilyas Oulkadda
Julien Perez
ALM
42
0
0
05 May 2025
Phi-4-reasoning Technical Report
Marah Abdin
Sahaj Agarwal
Ahmed Hassan Awadallah
Vidhisha Balachandran
Harkirat Singh Behl
...
Vaishnavi Shrivastava
Vibhav Vineet
Yue Wu
Safoora Yousefi
Guoqing Zheng
ReLM
LRM
84
0
0
30 Apr 2025
Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges
Yunseo Lee
John Youngeun Song
Dongsun Kim
Jindae Kim
Mijung Kim
Jaechang Nam
HILM
LRM
37
0
0
29 Apr 2025
Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs
Paiheng Xu
Gang Wu
Xiang Chen
Tong Yu
Chang Xiao
Franck Dernoncourt
Tianyi Zhou
Wei Ai
Viswanathan Swaminathan
OffRL
52
0
0
29 Apr 2025
SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories
Connor Dilgren
Purva Chiniya
Luke Griffith
Yu Ding
Yizheng Chen
40
0
0
29 Apr 2025
AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers
Zijie Lin
Yiqing Shen
Qilin Cai
He Sun
Jinrui Zhou
Mingjun Xiao
57
0
0
28 Apr 2025
Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling
Ishan Kavathekar
Raghav Donakanti
Ponnurangam Kumaraguru
Karthik Vaidhyanathan
54
0
0
27 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
1
0
26 Apr 2025
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Yilun Zhou
Austin Xu
Peifeng Wang
Caiming Xiong
Shafiq R. Joty
ELM
ALM
LRM
50
2
0
21 Apr 2025
CODECRASH: Stress Testing LLM Reasoning under Structural and Semantic Perturbations
Man Ho Adrian Lam
Chaozheng Wang
Jen-tse Huang
M. Lyu
LRM
34
0
0
19 Apr 2025
LLM Sensitivity Evaluation Framework for Clinical Diagnosis
Chenwei Yan
Xiangling Fu
Yuxuan Xiong
Tianyi Wang
Siu Cheung Hui
Ji Wu
Xien Liu
LM&MA
ELM
35
0
0
18 Apr 2025
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue
Zhiqi Chen
Rui Lu
Andrew Zhao
Zhaokai Wang
Yang Yue
Shiji Song
Gao Huang
ReLM
LRM
46
11
0
18 Apr 2025
Generating Planning Feedback for Open-Ended Programming Exercises with LLMs
Mehmet Arif Demirtaş
Claire Zheng
Max Fowler
Kathryn Cunningham
LRM
31
1
0
11 Apr 2025
How Accurately Do Large Language Models Understand Code?
Sabaat Haroon
Ahmad Faraz Khan
Ahmad Humayun
Waris Gill
Abdul Haddi Amjad
A. R. Butt
Mohammad Taha Khan
Muhammad Ali Gulzar
ELM
LRM
30
1
0
06 Apr 2025
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan
Yufei Xu
Zhengyin Du
Xuesong Yao
Z. Wang
Xiaowen Guo
Jiecao Chen
ReLM
ELM
LRM
95
3
0
01 Apr 2025
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute
Yingwei Ma
Binhua Li
Yihong Dong
Xue Jiang
Rongyu Cao
J. Chen
Fei Huang
Y. Li
LLMAG
LRM
57
0
0
31 Mar 2025
Effective Skill Unlearning through Intervention and Abstention
Yongce Li
Chung-En Sun
Tsui-Wei Weng
MU
149
0
0
27 Mar 2025
SandboxEval: Towards Securing Test Environment for Untrusted Code
Rafiqul Rabin
Jesse Hostetler
Sean McGregor
Brett Weir
Nick Judd
ELM
39
0
0
27 Mar 2025
Can Reasoning Models Reason about Hardware? An Agentic HLS Perspective
L. Collini
Andrew Hennessee
Ramesh Karri
Siddharth Garg
ELM
LRM
41
0
0
17 Mar 2025
Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution
Zhi Chen
Wei Ma
Lingxiao Jiang
LLMAG
53
0
0
16 Mar 2025
Unified Modeling Language Code Generation from Diagram Images Using Multimodal Large Language Models
Averi Bates
Ryan Vavricka
Shane Carleton
Ruosi Shao
Chongle Pan
59
0
0
15 Mar 2025
PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
Cheng Deng
Luoyang Sun
Jiwen Jiang
Yongcheng Zeng
Xinjian Wu
...
Haoyang Li
Lei Chen
Lionel M. Ni
H. Zhang
Jun Wang
157
0
0
15 Mar 2025
ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness
Ce Guo
Tong Zhao
61
1
0
11 Mar 2025
Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models
Anastasiia Grishina
Vadim Liventsev
Aki Härmä
Leon Moonen
ELM
79
0
0
10 Mar 2025
From Idea to Implementation: Evaluating the Influence of Large Language Models in Software Development -- An Opinion Paper
Sargam Yadav
Asifa Mehmood Qureshi
Abhishek Kaushik
Shubham Sharma
Roisin Loughran
...
Nikhil Singh
Padraic O'Hara
Pranay Jaiswal
Roshan Chandru
David Lillis
56
1
0
10 Mar 2025
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
Wei Li
Xin Zhang
Zhongxin Guo
Shaoguang Mao
Wen Luo
Guangyue Peng
Yangyu Huang
Houfeng Wang
Scarlett Li
57
0
0
09 Mar 2025
ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions
Julian Aron Prenner
Romain Robbes
59
0
0
06 Mar 2025
Robust Learning of Diverse Code Edits
Tushar Aggarwal
Swayam Singh
Abhijeet Awasthi
Aditya Kanade
Nagarajan Natarajan
SyDa
151
0
0
05 Mar 2025
IterPref: Focal Preference Learning for Code Generation via Iterative Debugging
Jie Wu
Haoling Li
Xin Zhang
Jianwen Luo
Yangyu Huang
Ruihang Chu
Y. Yang
Scarlett Li
73
0
0
04 Mar 2025
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdelrahman Abouelenin
Atabak Ashfaq
Adam Atkinson
Hany Awadalla
Nguyen Bach
...
Ishmam Zabir
Yunan Zhang
Li Zhang
Y. Zhang
Xiren Zhou
MoE
SyDa
68
23
0
03 Mar 2025
How Diversely Can Language Models Solve Problems? Exploring the Algorithmic Diversity of Model-Generated Code
Seonghyeon Lee
Heejae Chon
Joonwon Jang
Dongha Lee
Hwanjo Yu
ALM
39
0
0
02 Mar 2025
Kanana: Compute-efficient Bilingual Language Models
Kanana LLM Team
Yunju Bak
Hojin Lee
Minho Ryu
Jiyeon Ham
...
Daniel Lee
Minchul Lee
M. Lee
Shinbok Lee
Gaeun Seo
90
1
0
26 Feb 2025
Selective Prompt Anchoring for Code Generation
Yuan Tian
Tianyi Zhang
86
3
0
24 Feb 2025
Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference
Thanh Le-Cong
Bach Le
Toby Murray
LRM
47
1
0
22 Feb 2025
How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark
Ruizhong Qiu
Weiliang Will Zeng
Hanghang Tong
James Ezick
Christopher Lott
88
15
0
20 Feb 2025
Pragmatic Reasoning improves LLM Code Generation
Zhuchen Cao
Sven Apel
Adish Singla
Vera Demberg
LRM
37
0
0
20 Feb 2025
Flaming-hot Initiation with Regular Execution Sampling for Large Language Models
Weizhe Chen
Zhicheng Zhang
Guanlin Liu
Renjie Zheng
Wenlei Shi
Chen Dun
Zheng Wu
Xing Jin
Lin Yan
ALM
LRM
51
1
0
17 Feb 2025
LeDex: Training LLMs to Better Self-Debug and Explain Code
Nan Jiang
Xiaopeng Li
Shiqi Wang
Qiang Zhou
Soneya Binta Hossain
Baishakhi Ray
Varun Kumar
Xiaofei Ma
Anoop Deoras
LRM
92
11
0
17 Feb 2025
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Bohan Lyu
Siqiao Huang
Zichen Liang
Qi-An Sun
Jiaming Zhang
ELM
LRM
53
0
0
16 Feb 2025
Typhoon T1: An Open Thai Reasoning Model
Pittawat Taveekitworachai
Potsawee Manakul
Kasima Tharnpipitchai
Kunat Pipatanakul
OffRL
LRM
99
0
0
13 Feb 2025
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Xu Huang
Wenhao Zhu
Hanxu Hu
Conghui He
Lei Li
Shujian Huang
Fei Yuan
ELM
51
3
0
11 Feb 2025
CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories
Yijia Xiao
Runhui Wang
Luyang Kong
Davor Golac
Wei Wang
LLMAG
141
0
0
10 Feb 2025
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations
Kaixuan Huang
Jiacheng Guo
Zihao Li
X. Ji
Jiawei Ge
...
Yangsibo Huang
Chi Jin
Xinyun Chen
Chiyuan Zhang
Mengdi Wang
AAML
LRM
95
7
0
10 Feb 2025
Cardiverse: Harnessing LLMs for Novel Card Game Prototyping
Danrui Li
Sen Zhang
Sam S. Sohn
Kaidong Hu
Muhammad Usman
Mubbasir Kapadia
35
0
0
10 Feb 2025