Quality of dataset for data-driven software vulnerability detection
Sah, Dharmendra (2024)
Master's thesis
School of Engineering Science, Information Technology
All rights reserved.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe202501021060
Abstract
As modern software grows more complex, security flaws become more likely, which calls for automated methods of identifying them. To identify vulnerable Python code, this work proposes and assesses VUDENC, a deep learning model that combines word2vec embeddings with a long short-term memory (LSTM) network. It addresses limited data availability, noisy labels, and dataset imbalance by systematically filtering samples and removing label noise. The methodology is grounded in a structured literature review, which stresses that high-quality labeled datasets are a decisive factor in machine-learning-based software vulnerability detection. The data was mined from Python repositories on GitHub and preprocessed to obtain a large dataset for model training and evaluation. The VUDENC model targets critical vulnerability types, including SQL injection and cross-site scripting, with the aim of improving automation and scalable integration into the software development life cycle (SDLC). Key findings show that the model performs effectively in terms of accuracy and precision when identifying code vulnerabilities. Color-coded output built into the model greatly improved usability for developers, making vulnerabilities considerably easier to find and fix. The work thereby advances automated vulnerability detection by bringing machine learning techniques into practical, real-world applications. The results show promise for integrating such tools into CI/CD pipelines as a means of proactive security. Future work includes broadening the dataset to more programming languages and exploring advanced preprocessing techniques to make models more robust and generalizable across a wider range of coding environments.
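The abstract mentions systematic sample filtering and label-noise removal without giving details. As a rough illustration only, and not the thesis's actual pipeline, the following Python sketch shows one common way such a step could look: duplicates are dropped, and any snippet that appears with conflicting labels is discarded. The function names (normalize, filter_samples, class_balance) and the whitespace-based duplicate heuristic are assumptions made for this example.

```python
from collections import defaultdict


def normalize(code: str) -> str:
    """Collapse whitespace so trivially different copies share one key.

    Assumption: this is only a duplicate-detection heuristic; the original
    code text is what gets kept, not the normalized form.
    """
    return " ".join(code.split())


def filter_samples(samples):
    """Return samples without duplicates or conflicting labels.

    `samples` is a list of (code, label) pairs, where label 1 means
    vulnerable and 0 means clean (a hypothetical encoding).
    """
    # First pass: collect every label seen for each normalized snippet.
    labels_by_code = defaultdict(set)
    for code, label in samples:
        labels_by_code[normalize(code)].add(label)

    # Second pass: keep the first copy of each consistently labeled snippet.
    kept = []
    seen = set()
    for code, label in samples:
        key = normalize(code)
        if len(labels_by_code[key]) > 1:
            continue  # same snippet labeled both ways: treat as label noise
        if key in seen:
            continue  # exact or whitespace-only duplicate
        seen.add(key)
        kept.append((code, label))
    return kept


def class_balance(samples):
    """Count samples per label to expose any remaining imbalance."""
    counts = defaultdict(int)
    for _, label in samples:
        counts[label] += 1
    return dict(counts)
```

Checking class_balance before and after filtering gives a quick view of how much imbalance remains, which could then inform resampling or class weighting during training.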
