We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from $50 bug fixes to $32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (this https URL). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
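The idea of mapping model performance to monetary value can be illustrated with a minimal Python sketch. It assumes, as the abstract does not spell out, that a model "earns" the full payout of each task it passes and nothing otherwise. The `TaskResult` type, the function names, and the sample payouts are hypothetical and are not taken from the benchmark's code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one task (hypothetical structure, for illustration only)."""
    task_id: str
    payout_usd: float  # real-world payout attached to the task
    passed: bool       # True if the model's submission was graded as correct


def total_earned(results):
    """Sum the payouts of passed tasks (assumed all-or-nothing scoring)."""
    return sum(r.payout_usd for r in results if r.passed)


def earn_rate(results):
    """Fraction of the total available payout that was earned."""
    available = sum(r.payout_usd for r in results)
    return total_earned(results) / available if available else 0.0


# Illustrative values only; these are not benchmark results.
results = [
    TaskResult("bug-fix", 50.0, True),
    TaskResult("feature", 32000.0, False),
    TaskResult("manager-choice", 1000.0, True),
]

print(f"earned ${total_earned(results):,.2f} ({earn_rate(results):.1%} of available)")
```

Under this assumed scoring, passing two of three tasks yields only a small share of the available payout, because payout-weighted scores are dominated by the high-value tasks rather than by the raw pass count.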
@article{miserendino2025_2502.12115,
title={SWE-Lancer: Can Frontier LLMs Earn \$1 Million from Real-World Freelance Software Engineering?},
author={Samuel Miserendino and Michele Wang and Tejal Patwardhan and Johannes Heidecke},
journal={arXiv preprint arXiv:2502.12115},
year={2025}
}