Paper 2024/1228

Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models

Elijah Pelofske, Sandia National Laboratories
Vincent Urias, Sandia National Laboratories
Lorie M. Liebrock, Sandia National Laboratories, New Mexico Cybersecurity Center of Excellence, New Mexico Tech
Abstract

Generative Pre-Trained Transformer (GPT) models have been shown to be surprisingly effective at a variety of natural language processing tasks, including generating computer code. However, GPT models have generally been shown to be less effective at specific computational tasks (such as evaluating mathematical functions). In this study, we evaluate the effectiveness of open source GPT models, with no fine-tuning and with context introduced via the langchain and localGPT Large Language Model (LLM) frameworks, for the task of automatically identifying vulnerable code syntax (specifically targeting C and C++ source code). This task is evaluated on a selection of $36$ source code examples from the NIST SARD dataset, which were specifically curated not to contain natural English indicating the presence, or absence, of a particular vulnerability (including the removal of all source code comments). The NIST SARD source code dataset labels the vulnerable lines of source code, each an example of one of the $839$ distinct Common Weakness Enumerations (CWEs), allowing exact quantification of the GPT output classification error rate. A total of $5$ GPT models are evaluated, using $10$ different inference temperatures and $100$ repetitions at each setting, resulting in $5,000$ GPT queries per vulnerable source code example analyzed. Ultimately, we find that the open source GPT models we evaluated are not suitable for fully automated vulnerability scanning: the false positive and false negative rates are too high to be useful in practice. However, the GPT models perform surprisingly well at automated vulnerability detection for some of the test cases, in particular surpassing random sampling (for some GPT models and inference temperatures) and identifying the exact lines of code that are vulnerable, albeit at a low success rate.
The best performing GPT model result was Llama-2-70b-chat-hf with an inference temperature of $0.1$ applied to NIST SARD test case 149165 (an example of a buffer overflow vulnerability), which achieved a binary classification recall of $1.0$ and a precision of $1.0$, correctly and uniquely identifying both the vulnerable line of code and the correct CWE number. Additionally, for many of the NIST SARD test cases, the GPT models identify the specific line of source code containing the identified CWE at a rate quantifiably better than random sampling.
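As a minimal illustration of the binary classification metrics reported above, the precision and recall for vulnerable-line identification can be sketched as follows. This is an assumed reconstruction of the scoring, not the paper's actual evaluation code; the function name and the line numbers in the example are hypothetical.

```python
# Sketch of precision/recall scoring for vulnerable-line identification.
# The line numbers below are hypothetical illustrations, not data from
# the NIST SARD test cases.

def precision_recall(predicted_lines, true_lines):
    """Score a model's flagged source lines against the labeled
    vulnerable lines for one test case."""
    predicted = set(predicted_lines)
    truth = set(true_lines)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Best case described above: the model flags exactly the one truly
# vulnerable line, giving precision = 1.0 and recall = 1.0.
p, r = precision_recall(predicted_lines=[42], true_lines=[42])
print(p, r)  # 1.0 1.0

# A false positive alongside the correct line lowers precision only.
p, r = precision_recall(predicted_lines=[17, 42], true_lines=[42])
print(p, r)  # 0.5 1.0
```

Under this scoring, a model that flags many lines indiscriminately can reach high recall while its precision collapses, which is why the paper reports both metrics together.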

Metadata
Available format(s): PDF
Category: Applications
Publication info: Preprint
Keywords: CWE, Common Weakness Enumeration, GPT model, Generative Pre-Trained Transformer, Static Code Analysis, NIST SARD dataset
Contact author(s): elijah pelofske @ protonmail com
History:
2024-08-02: approved
2024-07-31: received
Short URL: https://ia.cr/2024/1228
License: Creative Commons Attribution (CC BY)

BibTeX

@misc{cryptoeprint:2024/1228,
      author = {Elijah Pelofske and Vincent Urias and Lorie M. Liebrock},
      title = {Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models},
      howpublished = {Cryptology {ePrint} Archive, Paper 2024/1228},
      year = {2024},
      url = {https://eprint.iacr.org/2024/1228}
}