
Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models

Abstract

In this paper, we test the hypothesis that although OpenAI's GPT-4 performs well generally, open-source models can be fine-tuned to outperform GPT-4 in smart contract vulnerability detection. We fine-tune two models from Meta's Code Llama on a dataset of 17k prompts, Detect Llama - Foundation and Detect Llama - Instruct, and we also fine-tune OpenAI's GPT-3.5 Turbo model (GPT-3.5FT). We then evaluate these models, plus a random baseline, on a test set we develop, comparing them against GPT-4 and GPT-4 Turbo on the detection of eight vulnerability types from the dataset and of the two most frequently identified vulnerabilities, reporting weighted F1 scores. We find that for binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of 0.776 and 0.68, outperforming both GPT-4 and GPT-4 Turbo (0.66 and 0.675). For individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperform GPT-4 and GPT-4 Turbo, in weighted F1 across all vulnerabilities (0.61 and 0.56, respectively, against GPT-4's 0.218 and GPT-4 Turbo's 0.243) and in weighted F1 for the top two identified vulnerabilities (0.719 for GPT-3.5FT and 0.674 for Detect Llama - Foundation, against GPT-4's 0.363 and GPT-4 Turbo's 0.429).
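
As an illustration of the evaluation metrics only (not the paper's actual evaluation code), the sketch below shows how the binary and weighted F1 scores referenced above could be computed with scikit-learn; the vulnerability label names and example predictions are hypothetical.

```python
# Minimal sketch (not the paper's code): binary and weighted F1 for
# smart contract vulnerability detection, using scikit-learn.
from sklearn.metrics import f1_score

# Hypothetical per-contract ground-truth and predicted labels
# (e.g., "reentrancy", "integer-overflow", or "none" for not vulnerable).
y_true = ["reentrancy", "none", "integer-overflow", "reentrancy", "none"]
y_pred = ["reentrancy", "none", "none", "integer-overflow", "none"]

# Binary classification: is the contract vulnerable at all?
binary_true = [label != "none" for label in y_true]
binary_pred = [label != "none" for label in y_pred]
binary_f1 = f1_score(binary_true, binary_pred)

# Per-vulnerability identification: per-class F1 averaged,
# weighted by each class's support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"binary F1: {binary_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```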
