Execution-Based Evaluation for Open-Domain Code Generation

20 December 2022
Zhiruo Wang
Shuyan Zhou
Daniel Fried
Graham Neubig
    ELM

Papers citing "Execution-Based Evaluation for Open-Domain Code Generation"

50 / 59 papers shown
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Daoguang Zan
Zhirong Huang
Wei Liu
Hanwu Chen
L. Zhang
...
Jing Su
Tianyu Liu
Rui Long
Kai Shen
Liang Xiang
43
2
0
03 Apr 2025
LLMs Love Python: A Study of LLMs' Bias for Programming Languages and Libraries
Lukas Twist
Jie M. Zhang
Mark Harman
Don Syme
Joost Noppen
Detlef Nauck
50
0
0
21 Mar 2025
Survey on Evaluation of LLM-based Agents
Asaf Yehudai
Lilach Eden
Alan Li
Guy Uziel
Yilun Zhao
Roy Bar-Haim
Arman Cohan
Michal Shmueli-Scheuer
LLMAG
ELM
Presented at ResearchTrend Connect | LLMAG on 07 May 2025
95
7
0
20 Mar 2025
CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings
Daniil Orel
Dilshod Azizov
Preslav Nakov
DeLMO
50
0
0
17 Mar 2025
Transferable Foundation Models for Geometric Tasks on Point Cloud Representations: Geometric Neural Operators
Blaine Quackenbush
P. Atzberger
3DPC
AI4CE
65
0
0
06 Mar 2025
How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
Jialun Cao
Yuk-Kit Chan
Zixuan Ling
Wenxuan Wang
Shuqing Li
...
Pinjia He
Shuai Wang
Zibin Zheng
Michael R. Lyu
Shing-Chi Cheung
ALM
71
1
0
18 Jan 2025
Multi-Programming Language Sandbox for LLMs
Shihan Dou
Jiazheng Zhang
Jianxiang Zang
Yunbo Tao
W. Zhou
...
Yixin Cao
Tao Gui
Xipeng Qiu
Qi Zhang
Xuanjing Huang
56
1
0
30 Oct 2024
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
Fengji Zhang
Linquan Wu
Huiyu Bai
Guancheng Lin
Xiao Li
Xiao Yu
Yue Wang
Bei Chen
Jacky Keung
MLLM
ELM
LRM
32
0
0
16 Oct 2024
An evaluation of LLM code generation capabilities through graded exercises
Álvaro Barbero Jiménez
ELM
31
1
0
06 Oct 2024
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
John Yang
Carlos E. Jimenez
Alex Zhang
K. Lieret
Joyce Yang
...
Gabriel Synnaeve
Karthik Narasimhan
Diyi Yang
Sida I. Wang
Ofir Press
41
23
0
04 Oct 2024
CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow
Nathanael Beau
Benoît Crabbé
42
1
0
25 Sep 2024
Game On: Towards Language Models as RL Experimenters
Jingwei Zhang
Thomas Lampe
A. Abdolmaleki
Jost Tobias Springenberg
Martin Riedmiller
LM&Ro
36
0
0
05 Sep 2024
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Daoguang Zan
Zhirong Huang
Ailun Yu
Shaoxin Lin
Yifan Shi
...
Bei Guan
Pengjie Huang
Tao Xie
Yongji Wang
Qianxiang Wang
31
8
0
26 Aug 2024
DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
Qiming Zhu
Jialun Cao
Yaojie Lu
Hongyu Lin
Xianpei Han
Le Sun
Shing-Chi Cheung
ALM
35
7
0
23 Aug 2024
SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins
Jingquan Wang
Harry Zhang
H. Unjhawala
Peter Negrut
Shu Wang
Khailanii Slaton
R. Serban
Jin-Long Wu
Dan Negrut
56
0
0
21 Aug 2024
RePair: Automated Program Repair with Process-based Feedback
Yuze Zhao
Zhenya Huang
Yixiao Ma
Rui Li
Kai Zhang
Hao Jiang
Qi Liu
Linbo Zhu
Yu Su
KELM
34
6
0
21 Aug 2024
What can Large Language Models Capture about Code Functional Equivalence?
Nickil Maveli
Antonio Vergari
Shay B. Cohen
41
2
0
20 Aug 2024
Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code Generation in LLMs via Zero-Shot Cross-Lingual Transfer
Mingda Li
Abhijit Mishra
Utkarsh Mujumdar
39
0
0
19 Aug 2024
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
Shihan Dou
Haoxiang Jia
Shenxi Wu
Huiyuan Zheng
Weikang Zhou
...
Xunliang Cai
Tao Gui
Xipeng Qiu
Qi Zhang
Xuanjing Huang
31
32
0
08 Jul 2024
Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs
Lei Zhang
Yunshui Li
Jiaming Li
Xiaobo Xia
Jiaxi Yang
Run Luo
Minzheng Wang
Longze Chen
Junhao Liu
Min Yang
32
1
0
26 Jun 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
77
131
0
22 Jun 2024
CodeRAG-Bench: Can Retrieval Augment Code Generation?
Zora Zhiruo Wang
Akari Asai
Xinyan Velocity Yu
Frank F. Xu
Yiqing Xie
Graham Neubig
Daniel Fried
RALM
71
30
0
20 Jun 2024
A Survey on Large Language Models for Code Generation
Juyong Jiang
Fan Wang
Jiasi Shen
Sungju Kim
Sunghun Kim
53
161
0
01 Jun 2024
MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation
Jianbo Dai
Jianqiao Lu
Yunlong Feng
Rongju Ruan
Ming Cheng
Haochen Tan
Zhijiang Guo
ELM
LRM
36
12
0
19 May 2024
On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation
Atharva Naik
43
2
0
26 Apr 2024
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
Libo Qin
Qiguang Chen
Yuhang Zhou
Zhi Chen
Hai-Tao Zheng
Lizi Liao
Min Li
Wanxiang Che
Philip S. Yu
LRM
55
36
0
07 Apr 2024
CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
Yiqing Xie
Alex Xie
Divyanshu Sheth
Pengfei Liu
Daniel Fried
Carolyn Rose
43
8
0
31 Mar 2024
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
Chun Xia
Yinlin Deng
Lingming Zhang
ALM
ELM
41
27
0
28 Mar 2024
What Are Tools Anyway? A Survey from the Language Model Perspective
Zhiruo Wang
Zhoujun Cheng
Hao Zhu
Daniel Fried
Graham Neubig
65
27
0
18 Mar 2024
Exploring Language Model's Code Generation Ability with Auxiliary Functions
Seonghyeon Lee
Sanghwan Jang
Seongbo Jang
Dongha Lee
Hwanjo Yu
ALM
32
2
0
15 Mar 2024
DevBench: A Comprehensive Benchmark for Software Development
Bowen Li
Wenhan Wu
Ziwei Tang
Lin Shi
John Yang
...
He Du
Ping Yang
Dahua Lin
Chao Peng
Kai Chen
91
10
0
13 Mar 2024
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain
King Han
Alex Gu
Wen-Ding Li
Fanjia Yan
Tianjun Zhang
Sida I. Wang
Armando Solar-Lezama
Koushik Sen
Ion Stoica
ELM
36
274
0
12 Mar 2024
Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks
Linyuan Gong
Sida Wang
Mostafa Elhoushi
Alvin Cheung
27
15
0
07 Mar 2024
Compositional API Recommendation for Library-Oriented Code Generation
Zexiong Ma
Shengnan An
Bing Xie
Zeqi Lin
34
17
0
29 Feb 2024
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
Qiwei Peng
Yekun Chai
Xuhong Li
ELM
LM&MA
39
35
0
26 Feb 2024
Mercury: A Code Efficiency Benchmark for Code Large Language Models
Mingzhe Du
A. Luu
Bin Ji
Qian Liu
See-Kiong Ng
ALM
ELM
OffRL
24
6
0
12 Feb 2024
Solution-oriented Agent-based Models Generation with Verifier-assisted Iterative In-context Learning
Tong Niu
Weihao Zhang
Rong Zhao
LLMAG
32
2
0
04 Feb 2024
EffiBench: Benchmarking the Efficiency of Automatically Generated Code
Dong Huang
Yuhao Qing
Weiyi Shang
Heming Cui
Jie M. Zhang
82
31
0
03 Feb 2024
OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models
Shuai Wang
Liang Ding
Li Shen
Yong Luo
Bo Du
Dacheng Tao
ELM
ALM
45
2
0
12 Jan 2024
CodeScholar: Growing Idiomatic Code Examples
Manish Shetty
Koushik Sen
Ion Stoica
ELM
30
1
0
23 Dec 2023
Capture the Flag: Uncovering Data Insights with Large Language Models
I. Laradji
Perouz Taslakian
Sai Rajeswar
Valentina Zantedeschi
Alexandre Lacoste
Nicolas Chapados
David Vazquez
Christopher Pal
Alexandre Drouin
57
3
0
21 Dec 2023
An In-depth Look at Gemini's Language Abilities
Syeda Nahida Akter
Zichun Yu
Aashiq Muhamed
Tianyue Ou
Alex Bäuerle
Ángel Alexander Cabrera
Krish Dholakia
Chenyan Xiong
Graham Neubig
LRM
ELM
33
34
0
18 Dec 2023
LLM-Assisted Code Cleaning For Training Accurate Code Generators
Naman Jain
Tianjun Zhang
Wei-Lin Chiang
Joseph E. Gonzalez
Koushik Sen
Ion Stoica
39
27
0
25 Nov 2023
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
Ansong Ni
Pengcheng Yin
Yilun Zhao
Chen Wei
Yanjun Wang
...
Mingyuan Zhang
Chen Change Loy
Yingbo Zhou
Dragomir R. Radev
Arman Cohan
ELM
24
16
0
29 Sep 2023
BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models
Xiangru Tang
Bill Qian
Rick Gao
Jiakang Chen
Xinyun Chen
Mark B. Gerstein
23
11
0
31 Aug 2023
Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models
Martin Weyssow
Xin Zhou
Kisub Kim
David Lo
H. Sahraoui
32
27
0
21 Aug 2023
OctoPack: Instruction Tuning Code Large Language Models
Niklas Muennighoff
Qian Liu
A. Zebaze
Qinkai Zheng
Binyuan Hui
Terry Yue Zhuo
Swayam Singh
Xiangru Tang
Leandro von Werra
Shayne Longpre
VLM
ALM
65
117
0
14 Aug 2023
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou
Frank F. Xu
Hao Zhu
Xuhui Zhou
Robert Lo
...
Tianyue Ou
Yonatan Bisk
Daniel Fried
Uri Alon
Graham Neubig
LLMAG
36
381
0
25 Jul 2023
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback
John Yang
Akshara Prabhakar
Karthik Narasimhan
Shunyu Yao
22
102
0
26 Jun 2023
StarCoder: may the source be with you!
Raymond Li
Loubna Ben Allal
Yangtian Zi
Niklas Muennighoff
Denis Kocetkov
...
Sean M. Hughes
Thomas Wolf
Arjun Guha
Leandro von Werra
H. D. Vries
48
716
0
09 May 2023