Uncovering blockchain smart contract vulnerabilities with CodeBERT
Sapkota, Shankar (2024)
Master's thesis
School of Engineering Science, Computational Engineering
All rights reserved.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe202501021089
Abstract
Smart contracts (Szabo 1997) enable self-executing, trustless transactions on decentralized platforms without requiring intermediaries. With the rise of decentralized finance, however, serious security flaws in smart contracts have come to light, and code exploits have caused substantial financial losses. This thesis applies CodeBERT, a pre-trained transformer model for programming languages, to address the pressing need for automated vulnerability detection in smart contracts.
We fine-tune CodeBERT to automatically analyze code structure and keyword patterns in order to detect likely vulnerabilities in Solidity-based smart contracts.
The results show that fine-tuning CodeBERT on a labeled dataset of 47,398 Solidity contracts is sufficient for vulnerability detection in smart contracts. The model achieved an accuracy of 85.3% on the test dataset and an ROC AUC score of 0.93, indicating robust classification capability. To address class imbalance, the dataset was balanced so that the Vulnerable and Non-Vulnerable labels are equally represented. The experimental results further support the model's ability to generalize, as it identified vulnerabilities in unseen Solidity code with high confidence.
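To make the balancing step and the two reported metrics concrete, the following is a minimal Python sketch and not the thesis's actual pipeline. It assumes that balancing is done by random undersampling (the abstract does not state the method), that labels are encoded as 1 for Vulnerable and 0 for Non-Vulnerable, and that the classifier outputs a probability-like score per contract. The function names `balance_by_undersampling`, `accuracy`, and `roc_auc` are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_undersampling(samples, seed=0):
    """Downsample every class to the size of the smallest class.

    `samples` is a list of (source_code, label) pairs.
    Assumption: undersampling is only one possible balancing strategy.
    """
    by_label = defaultdict(list)
    for code, label in samples:
        by_label[label].append((code, label))
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum((s >= threshold) == bool(t) for t, s in zip(y_true, y_score))
    return hits / len(y_true)


def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy depends on the chosen decision threshold, whereas ROC AUC summarizes ranking quality across all thresholds, which is one reason the two figures are usually reported together on a balanced test set.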
