Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

2 May 2023
Jiawei Liu
Chun Xia
Yuyao Wang
Lingming Zhang
    ELM
    ALM

Papers citing "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation"

50 / 138 papers shown
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations
Kaixuan Huang
Jiacheng Guo
Zihao Li
X. Ji
Jiawei Ge
...
Yangsibo Huang
Chi Jin
Xinyun Chen
Chiyuan Zhang
Mengdi Wang
AAML
LRM
100
7
0
10 Feb 2025
Cardiverse: Harnessing LLMs for Novel Card Game Prototyping
Danrui Li
Sen Zhang
Sam S. Sohn
Kaidong Hu
Muhammad Usman
Mubbasir Kapadia
40
0
0
10 Feb 2025
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
Md. Ashraful Islam
Mohammed Eunus Ali
Md. Rizwan Parvez
LLMAG
68
2
0
08 Feb 2025
Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment
Haoyu Wang
Zeyu Qin
Li Shen
Xueqian Wang
Minhao Cheng
Dacheng Tao
99
2
0
06 Feb 2025
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment
Cheryl Li
Tianyuan Xu
Yiwen Guo
LRM
170
2
0
05 Feb 2025
Learning to Generate Unit Tests for Automated Debugging
Archiki Prasad
Elias Stengel-Eskin
Justin Chih-Yao Chen
Zaid Khan
Joey Tianyi Zhou
ELM
88
1
0
03 Feb 2025
Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code
Shahin Honarvar
Mark van der Wilk
Alastair Donaldson
80
6
0
28 Jan 2025
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He
Rui Mao
Qika Lin
Yucheng Ruan
Xiang Lan
Mengling Feng
Erik Cambria
LM&MA
AILaw
93
154
0
28 Jan 2025
mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation
Nishat Raihan
Antonios Anastasopoulos
Marcos Zampieri
ELM
43
6
0
28 Jan 2025
Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
Xing Zhang
Jiaheng Wen
Fangkai Yang
Pu Zhao
Yu Kang
...
Qingwei Lin
Yingnong Dang
Saravan Rajmohan
Dongmei Zhang
Qi Zhang
56
2
0
28 Jan 2025
Treefix: Enabling Execution with a Tree of Prefixes
Beatriz Souza
Michael Pradel
42
1
0
21 Jan 2025
Towards Advancing Code Generation with Large Language Models: A Research Roadmap
Haolin Jin
Huaming Chen
Qinghua Lu
Liming Zhu
LLMAG
45
1
0
20 Jan 2025
QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks
Yaojie Hu
Qiang Zhou
Qihong Chen
Xiaopeng Li
Linbo Liu
Dejiao Zhang
Amit Kachroo
Talha Oz
Omer Tripp
68
4
0
20 Jan 2025
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
Zhaojian Yu
Yilun Zhao
Arman Cohan
Xiao-Ping Zhang
LRM
36
2
0
03 Jan 2025
Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey
Junqiao Wang
Zeng Zhang
Yangfan He
Yuyang Song
Tianyu Shi
...
Hengyuan Xu
Kunyu Wu
Guangwu Qian
Qiuwu Chen
Lewei He
38
11
0
03 Jan 2025
Unifying KV Cache Compression for Large Language Models with LeanKV
Yanqi Zhang
Yuwei Hu
Runyuan Zhao
John C. S. Lui
Haibo Chen
MQ
136
5
0
04 Dec 2024
The importance of visual modelling languages in generative software engineering
Roberto Rossi
79
1
0
27 Nov 2024
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Siming Huang
Tianhao Cheng
J.K. Liu
Jiaran Hao
L. Song
...
Ge Zhang
Zili Wang
Yuan Qi
Yinghui Xu
Wei Chu
ALM
80
17
0
07 Nov 2024
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models
Nizar Islah
Justine Gehring
Diganta Misra
Eilif B. Muller
Irina Rish
Terry Yue Zhuo
Massimo Caccia
SyDa
40
1
0
05 Nov 2024
MCCoder: Streamlining Motion Control with LLM-Assisted Code Generation and Rigorous Verification
Yin Li
Liangwei Wang
Shiyuan Piao
Boo-Ho Yang
Ziyue Li
Wei Zeng
Fugee Tsung
33
0
0
19 Oct 2024
Agent Skill Acquisition for Large Language Models via CycleQD
So Kuroki
Taishi Nakamura
Takuya Akiba
Yujin Tang
MoMe
34
0
0
16 Oct 2024
MIND: Math Informed syNthetic Dialogues for Pretraining LLMs
Syeda Nahida Akter
Shrimai Prabhumoye
John Kamalu
S. Satheesh
Eric Nyberg
M. Patwary
M. Shoeybi
Bryan Catanzaro
LRM
SyDa
ReLM
100
1
0
15 Oct 2024
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
Fangru Lin
Shaoguang Mao
Emanuele La Malfa
Valentin Hofmann
Adrian de Wynter
Jing Yao
Si-Qing Chen
Michael Wooldridge
Furu Wei
51
2
0
14 Oct 2024
Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks
Rushang Karia
Daniel Bramblett
D. Dobhal
Siddharth Srivastava
ELM
LRM
32
0
0
11 Oct 2024
MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
Yougang Lyu
Lingyong Yan
Zihan Wang
Dawei Yin
Pengjie Ren
Maarten de Rijke
Z. Z. Ren
60
6
0
10 Oct 2024
CursorCore: Assist Programming through Aligning Anything
Hao Jiang
Qi Liu
Rui Li
Shengyu Ye
Shijin Wang
53
1
0
09 Oct 2024
FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering
Siqiao Xue
Tingting Chen
Fan Zhou
Qingyang Dai
Zhixuan Chu
Hongyuan Mei
38
4
0
06 Oct 2024
Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
Ulyana Piterbarg
Lerrel Pinto
Rob Fergus
SyDa
37
2
0
03 Oct 2024
Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
Samuel Arcadinho
David Aparicio
Mariana Almeida
31
5
0
24 Sep 2024
CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair
Mingjie Liu
Yun-Da Tsai
Wenfei Zhou
Haoxing Ren
SyDa
3DV
45
6
0
19 Sep 2024
VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation
Zhijie Wang
Zhehua Zhou
Jiayang Song
Yuheng Huang
Zhan Shu
Lei Ma
LM&Ro
71
5
0
19 Sep 2024
AutoVerus: Automated Proof Generation for Rust Code
Chenyuan Yang
Xuheng Li
Md Rakib Hossain Misu
Jianan Yao
Weidong Cui
...
Jacob R. Lorch
Shuai Lu
Fan Yang
Ziqiao Zhou
Shan Lu
27
7
0
19 Sep 2024
What can Large Language Models Capture about Code Functional Equivalence?
Nickil Maveli
Antonio Vergari
Shay B. Cohen
44
2
0
20 Aug 2024
CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs
Weijie Lv
Xuan Xia
Sheng-Jun Huang
ALM
36
2
0
05 Aug 2024
Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models
Somshubra Majumdar
Vahid Noroozi
Sean Narenthiran
Aleksander Ficek
Wasi Uddin Ahmad
Jocelyn Huang
Jagadeesh Balam
Boris Ginsburg
SyDa
58
2
0
29 Jul 2024
Effective Large Language Model Debugging with Best-first Tree Search
Jialin Song
Jonathan Raiman
Bryan Catanzaro
LRM
48
0
0
26 Jul 2024
CodeV: Empowering LLMs with HDL Generation through Multi-Level Summarization
Yang Zhao
Di Huang
Chongxiao Li
Pengwei Jin
Muxin Song
...
Rui Zhang
Xingui Hu
Yunji Chen
Qi Guo
Xing Hu
71
22
0
15 Jul 2024
Defending Code Language Models against Backdoor Attacks with Deceptive Cross-Entropy Loss
Guang Yang
Yu Zhou
Xiang Chen
Xiangyu Zhang
Terry Yue Zhuo
David Lo
Taolue Chen
AAML
52
4
0
12 Jul 2024
Prompting Techniques for Secure Code Generation: A Systematic Investigation
Catherine Tony
Nicolás E. Díaz Ferreyra
Markus Mutas
Salem Dhiff
Riccardo Scandariato
SILM
76
9
0
09 Jul 2024
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates
Zeyu Leo Liu
Shrey Pandit
Xi Ye
Eunsol Choi
Greg Durrett
KELM
ALM
78
4
0
08 Jul 2024
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar
Sawsan Alqahtani
M Saiful Bari
Mizanur Rahman
Mohammad Abdullah Matin Khan
...
Chee Wei Tan
Md. Rizwan Parvez
Enamul Hoque
Shafiq R. Joty
Jimmy Huang
ELM
ALM
29
28
0
04 Jul 2024
Agentless: Demystifying LLM-based Software Engineering Agents
Chunqiu Steven Xia
Yinlin Deng
Soren Dunn
Lingming Zhang
LLMAG
43
85
0
01 Jul 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
77
134
0
22 Jun 2024
Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks
Hokyung Lee
Sumanyu Sharma
Bing Hu
39
2
0
21 Jun 2024
Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models
Sanjay Vishwakarma
Francis Harkins
Siddharth Golecha
Vishal Sharathchandra Bajpe
Nicolas Dupuis
Luca Buratti
David Kremer
Ismael Faro
Ruchir Puri
Juan Cruz-Benito
ELM
50
3
0
20 Jun 2024
CodeRAG-Bench: Can Retrieval Augment Code Generation?
Zora Zhiruo Wang
Akari Asai
Xinyan Velocity Yu
Frank F. Xu
Yiqing Xie
Graham Neubig
Daniel Fried
RALM
74
30
0
20 Jun 2024
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering
Federico Errica
G. Siracusano
D. Sanvito
Roberto Bifulco
80
19
0
18 Jun 2024
Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning
Zhihan Zhang
Zhenwen Liang
Wenhao Yu
Dian Yu
Mengzhao Jia
Dong Yu
Meng Jiang
AIMat
RALM
LRM
ReLM
30
12
0
17 Jun 2024
Evaluating the Performance of Large Language Models via Debates
Behrad Moniri
Hamed Hassani
Yan Sun
ELM
ALM
58
5
0
16 Jun 2024
Is Programming by Example solved by LLMs?
Wen-Ding Li
Kevin Ellis
37
10
0
12 Jun 2024