2402.14261
Cited By
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
22 February 2024
Anisha Agarwal
Aaron Chan
Shubham Chandel
Jinu Jang
Shaun Miller
Roshanak Zilouchian Moghaddam
Yevhen Mohylevskyy
Neel Sundaresan
Michele Tufano
ELM
Papers citing "Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming"
11 / 11 papers shown
Do Prompt Patterns Affect Code Quality? A First Empirical Assessment of ChatGPT-Generated Code
Antonio Della Porta
Stefano Lambiase
Fabio Palomba
24
0
0
18 Apr 2025
RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code
Dhruv Gautam
Spandan Garg
Jinu Jang
Neel Sundaresan
Roshanak Zilouchian Moghaddam
LLMAG
LRM
78
2
0
10 Mar 2025
Human-AI Experience in Integrated Development Environments: A Systematic Literature Review
Agnia Sergeyuk
Ilya Zakharov
Ekaterina Koshchenko
M. Izadi
60
0
0
08 Mar 2025
Towards Evaluating Large Language Models for Graph Query Generation
Siraj Munir
Alessandro Aldini
ELM
41
0
0
13 Nov 2024
From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating UI Operation Impacts
Zhuohao Jerry Zhang
E. Schoop
Jeffrey Nichols
Anuj Mahajan
Amanda Swearngin
LLMAG
31
0
0
11 Oct 2024
HDL-GPT: High-Quality HDL is All You Need
Bhuvnesh Kumar
Saurav Nanda
G. Parthasarathy
Pawan Patil
Austin Tsai
Parivesh Choudhary
LM&MA
36
0
0
25 Jul 2024
Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs
Sylvain Kouemo Ngassom
Arghavan Moradi Dakhel
Florian Tambon
Foutse Khomh
40
2
0
22 May 2024
Automated Creation of Source Code Variants of a Cryptographic Hash Function Implementation Using Generative Pre-Trained Transformer Models
Elijah Pelofske
Vincent Urias
L. Liebrock
35
0
0
24 Apr 2024
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
372
12,081
0
04 Mar 2022
Making Pre-trained Language Models Better Few-shot Learners
Tianyu Gao
Adam Fisch
Danqi Chen
243
1,927
0
31 Dec 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
299
6,996
0
20 Apr 2018