Mastering the Craft of Data Synthesis for CodeLLMs

16 October 2024
Meng Chen
Philip Arthur
Qianyu Feng
Cong Duy Vu Hoang
Yu-Heng Hong
Mahdi Kazemi Moghaddam
Omid Nezami
Tien N Nguyen
Gioacchino Tangari
Duy Vu
Thanh Tien Vu
Mark Johnson
Kemal Kurniawan
Don Dharmasiri
Long Duong
Yuan-Fang Li
    SyDa

Papers citing "Mastering the Craft of Data Synthesis for CodeLLMs"

50 / 64 papers shown
Title
Synthetic Data Generation Using Large Language Models: Advances in Text and Code
Mihai Nadas
Laura Diosan
Andreea Tomescu
SyDa
120
3
0
18 Mar 2025
Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications
Nam Huynh
Beiyu Lin
LM&MA
130
19
0
03 Mar 2025
Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
Ulyana Piterbarg
Lerrel Pinto
Rob Fergus
SyDa
142
2
0
03 Oct 2024
Synthesizing Text-to-SQL Data from Weak and Strong LLMs
Jiaxi Yang
Binyuan Hui
Min Yang
Jian Yang
Junyang Lin
Chang Zhou
SyDa
102
34
0
06 Aug 2024
Case2Code: Scalable Synthetic Data for Code Generation
Yunfan Shao
Linyang Li
Yichuan Ma
Peiji Li
Demin Song
...
Qipeng Guo
Hang Yan
Xipeng Qiu
Xuanjing Huang
Dahua Lin
LRM
94
2
0
17 Jul 2024
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Daixuan Cheng
Yuxian Gu
Shaohan Huang
Junyu Bi
Minlie Huang
Furu Wei
SyDa
137
27
0
20 Jun 2024
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
DeepSeek-AI
Qihao Zhu
Daya Guo
Zhihong Shao
Dejian Yang
...
Jiashi Li
Chenggang Zhao
Chong Ruan
Fuli Luo
Wenfeng Liang
MoE, LRM, ELM, VLM
103
209
0
17 Jun 2024
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Lin Long
Rui Wang
Ruixuan Xiao
Junbo Zhao
Xiao Ding
Gang Chen
Haobo Wang
SyDa
115
127
0
14 Jun 2024
Synthetic Programming Elicitation and Repair for Text-to-Code in Very Low-Resource Programming Languages
Federico Mora
Justin Wong
Haley Lepe
Sahil Bhatia
Karim Elmaaroufi
George Varghese
Joseph E. Gonzalez
Elizabeth Polgreen
Sanjit A. Seshia
SyDa
90
4
0
05 Jun 2024
SemCoder: Training Code Language Models with Comprehensive Semantics
Yangruibo Ding
Jinjun Peng
Marcus J. Min
Gail E. Kaiser
Junfeng Yang
Baishakhi Ray
OffRL
110
21
0
03 Jun 2024
Automatic Programming: Large Language Models and Beyond
Michael R. Lyu
Baishakhi Ray
Abhik Roychoudhury
Shin Hwei Tan
Patanamon Thongtanunam
103
22
0
03 May 2024
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Saumya Gandhi
Ritu Gala
Vijay Viswanathan
Tongshuang Wu
Graham Neubig
SyDa
133
25
0
22 Apr 2024
CYCLE: Learning to Self-Refine the Code Generation
Yangruibo Ding
Marcus J. Min
Gail E. Kaiser
Baishakhi Ray
133
37
0
27 Mar 2024
CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
Martin Weyssow
Aton Kamanda
H. Sahraoui
ALM
116
38
0
14 Mar 2024
Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
Martin Riddell
Ansong Ni
Arman Cohan
ELM
90
32
0
06 Mar 2024
StarCoder 2 and The Stack v2: The Next Generation
Anton Lozhkov
Raymond Li
Loubna Ben Allal
Federico Cassano
J. Lamy-Poirier
...
Sean M. Hughes
Thomas Wolf
Arjun Guha
Leandro von Werra
H. D. Vries
OSLM, ELM
86
362
0
29 Feb 2024
Large Language Models for Data Annotation: A Survey
Zhen Tan
Dawei Li
Song Wang
Alimohammad Beigi
Bohan Jiang
Amrita Bhattacharjee
Mansooreh Karami
Wenlin Yao
Lu Cheng
Huan Liu
SyDa
134
80
0
21 Feb 2024
A Survey on Data Selection for LLM Instruction Tuning
Bolin Zhang
Jiahao Wang
Qianlong Du
Jiajun Zhang
Zhiying Tu
Dianhui Chu
103
48
0
04 Feb 2024
Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
Ming Li
Yong Zhang
Shwai He
Zhitao Li
Hongyu Zhao
Jianzong Wang
Ning Cheng
Dinesh Manocha
111
82
0
01 Feb 2024
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
Daya Guo
Qihao Zhu
Dejian Yang
Zhenda Xie
Kai Dong
...
Yu-Huan Wu
Yiming Li
Fuli Luo
Yingfei Xiong
W. Liang
ELM
145
813
0
25 Jan 2024
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Alex Gu
Baptiste Rozière
Hugh Leather
Armando Solar-Lezama
Gabriel Synnaeve
Sida I. Wang
ELM, ALM, LRM
63
116
0
05 Jan 2024
Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit
Yao Wan
Yang He
Zhangqian Bi
Jianguo Zhang
Hongyu Zhang
Yulei Sui
Guandong Xu
Hai Jin
Philip S. Yu
102
27
0
30 Dec 2023
What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Wei Liu
Weihao Zeng
Keqing He
Yong Jiang
Junxian He
ALM
138
239
0
25 Dec 2023
WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning
Zhaojian Yu
Xin Zhang
Ning Shang
Yangyu Huang
Can Xu
Yishujie Zhao
Wenxiang Hu
Qiufeng Yin
SyDa
135
28
0
20 Dec 2023
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Avi Singh
John D. Co-Reyes
Rishabh Agarwal
Ankesh Anand
Piyush Patil
...
Yamini Bansal
Ethan Dyer
Behnam Neyshabur
Jascha Narain Sohl-Dickstein
Noah Fiedel
ALM, LRM, ReLM, SyDa
286
190
0
11 Dec 2023
Efficient Online Data Mixing For Language Model Pre-Training
Alon Albalak
Liangming Pan
Colin Raffel
Wenjie Wang
101
46
0
05 Dec 2023
Magicoder: Empowering Code Generation with OSS-Instruct
Yuxiang Wei
Zhe Wang
Jiawei Liu
Yifeng Ding
Lingming Zhang
SyDa
111
118
0
04 Dec 2023
LLM-Assisted Code Cleaning For Training Accurate Code Generators
Naman Jain
Tianjun Zhang
Wei-Lin Chiang
Joseph E. Gonzalez
Koushik Sen
Ion Stoica
80
32
0
25 Nov 2023
Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code
Ziyin Zhang
Chaoyu Chen
Bingchang Liu
Cong Liao
Zi Gong
Hang Yu
Jianguo Li
Rui Wang
ELM
84
59
0
14 Nov 2023
Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis
P. Gorinski
Matthieu Zimmer
Gerasimos Lampouras
Derrick-Goh-Xin Deik
Ignacio Iacobacci
ALM, OffRL
95
3
0
20 Oct 2023
Benchmarking and Improving Text-to-SQL Generation under Ambiguity
Adithya Bhaskar
Tushar Tomar
Ashutosh Sathe
Sunita Sarawagi
95
22
0
20 Oct 2023
Qwen Technical Report
Jinze Bai
Shuai Bai
Yunfei Chu
Zeyu Cui
Kai Dang
...
Zhenru Zhang
Chang Zhou
Jingren Zhou
Xiaohuan Zhou
Tianhang Zhu
OSLM
375
1,924
0
28 Sep 2023
Human Feedback is not Gold Standard
Tom Hosking
Phil Blunsom
Max Bartolo
ALM
126
55
0
28 Sep 2023
SlimPajama-DC: Understanding Data Combinations for LLM Training
Zhiqiang Shen
Tianhua Tao
Liqun Ma
Willie Neiswanger
Zhengzhong Liu
...
Bowen Tan
Joel Hestness
Natalia Vassilieva
Daria Soboleva
Eric Xing
118
51
0
19 Sep 2023
Textbooks Are All You Need II: phi-1.5 technical report
Yuanzhi Li
Sébastien Bubeck
Ronen Eldan
Allison Del Giorno
Suriya Gunasekar
Yin Tat Lee
ALM, LRM
183
482
0
11 Sep 2023
Distilled GPT for Source Code Summarization
Chia-Yi Su
Collin McMillan
89
41
0
28 Aug 2023
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
Federico Cassano
John Gouwar
Francesca Lucchetti
Claire Schlesinger
Anders Freeman
Carolyn Jane Anderson
Molly Q. Feldman
Michael Greenberg
Abhinav Jangda
Arjun Guha
107
38
0
19 Aug 2023
Is Self-Repair a Silver Bullet for Code Generation?
Theo X. Olausson
J. Inala
Chenglong Wang
Jianfeng Gao
Armando Solar-Lezama
LRM
139
121
0
16 Jun 2023
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Ziyang Luo
Can Xu
Pu Zhao
Qingfeng Sun
Xiubo Geng
Wenxiang Hu
Chongyang Tao
Jing Ma
Qingwei Lin
Daxin Jiang
ELM, SyDa, ALM
220
698
0
14 Jun 2023
ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems
Yi Zhang
Jan Deriu
George Katsogiannis-Meimarakis
Catherine Kosten
Georgia Koutrika
Kurt Stockinger
81
25
0
07 Jun 2023
Uncovering and Quantifying Social Biases in Code Generation
Yang Liu
Xiaokang Chen
Yan Gao
Zhe Su
Fengji Zhang
Daoguang Zan
Jian-Guang Lou
Pin-Yu Chen
Tsung-Yi Ho
94
20
0
24 May 2023
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Sang Michael Xie
Hieu H. Pham
Xuanyi Dong
Nan Du
Hanxiao Liu
Yifeng Lu
Percy Liang
Quoc V. Le
Tengyu Ma
Adams Wei Yu
MoMe, MoE
169
205
0
17 May 2023
LeTI: Learning to Generate from Textual Interactions
Xingyao Wang
Hao Peng
Reyhaneh Jabbarvand
Heng Ji
116
30
0
17 May 2023
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning
Haowen Chen
Yiming Zhang
Qi Zhang
Hantao Yang
Xiaomeng Hu
Xuetao Ma
Yifan YangGong
Jiaqi Zhao
ALM
112
52
0
16 May 2023
ICE-Score: Instructing Large Language Models to Evaluate Code
Terry Yue Zhuo
ELM, ALM
127
45
0
27 Apr 2023
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X
Qinkai Zheng
Xiao Xia
Xu Zou
Yuxiao Dong
Shanshan Wang
...
Andi Wang
Yang Li
Teng Su
Zhilin Yang
Jie Tang
ELM, ALM, SyDa
172
347
0
30 Mar 2023
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
...
Violette Lepercq
Suzana Ilić
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
AI4CE, AILaw
75
169
0
07 Mar 2023
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
Shuyan Zhou
Uri Alon
Sumit Agarwal
Graham Neubig
ELM, ALM
99
114
0
10 Feb 2023
Exploring Data Augmentation for Code Generation Tasks
Pinzhen Chen
Gerasimos Lampouras
101
10
0
05 Feb 2023
Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness
Shuaichen Chang
Jun Wang
Mingwen Dong
Lin Pan
Henghui Zhu
...
William Yang Wang
Zhiguo Wang
Vittorio Castelli
Patrick Ng
Bing Xiang
OOD
101
35
0
21 Jan 2023