CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

CISPA Helmholtz Center for Information Security

Abstract

Large language models (LLMs) for automatic code generation have recently achieved breakthroughs in several programming tasks. Their advances in competition-level programming problems have made them an essential pillar of AI-assisted pair programming, and tools such as GitHub Copilot have emerged as part of the daily programming workflow used by millions of developers. Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure. While these models have been extensively evaluated for their ability to produce functionally correct programs, there remains a lack of comprehensive investigations and benchmarks addressing the security aspects of these models.
In this work, we propose a method to systematically study the security issues of code language models to assess their susceptibility to generating vulnerable code. To this end, we introduce the first approach to automatically find generated code that contains vulnerabilities in black-box code generation models. This is achieved through a novel few-shot prompting approach. We evaluate the effectiveness of our approach by examining code language models in generating high-risk security weaknesses. Furthermore, we use our method to create a collection of diverse non-secure prompts for various vulnerability scenarios. This dataset serves as a benchmark to evaluate and compare the security weaknesses of code language models.

CodeLM Security Benchmark: We employ our proposed non-secure prompt dataset as a benchmark to assess and compare different large language models (LLMs). The dataset contains 280 non-secure prompts, 200 designed for Python and 80 for C. We evaluate the security vulnerabilities that these models can generate through the following steps:

  1. Use an LLM (e.g., CodeGen-6B) to generate code completions for each non-secure prompt.
  2. Treat each non-secure prompt together with its generated code completion(s) as complete program(s).
  3. Run the CodeQL security analyzer to detect security issues in the generated programs.

To generate code completions for each non-secure prompt, we use the following settings: a maximum of 512 new tokens, nucleus sampling with a top-p value of 0.95 and a temperature of 0.2, and 5 sampled completions per prompt.
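
The following is a minimal sketch of this three-step pipeline, assuming a Hugging Face causal code LM and the CodeQL CLI installed on the PATH. The model checkpoint, prompt file layout, output directory, and query suite name are illustrative assumptions rather than the exact artifacts shipped with the benchmark.

```python
# Minimal sketch of the three evaluation steps, assuming a Hugging Face causal
# code LM and the CodeQL CLI on PATH. Paths, the model checkpoint, and the
# query suite are illustrative assumptions, not the benchmark's exact artifacts.
import subprocess
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-6B-mono"  # assumption: any code LM could be plugged in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def complete(prompt: str, n: int = 5) -> list[str]:
    """Step 1: sample n completions with nucleus sampling (top-p 0.95, T 0.2)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,
        temperature=0.2,
        max_new_tokens=512,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Step 2: the decoded output contains the prompt followed by the completion,
    # i.e., the complete program to be analyzed.
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

out_dir = Path("generated_codes")
out_dir.mkdir(exist_ok=True)
for i, prompt_file in enumerate(sorted(Path("non_secure_prompts/python").glob("*.py"))):
    for j, code in enumerate(complete(prompt_file.read_text())):
        (out_dir / f"prompt{i:03d}_sample{j}.py").write_text(code)

# Step 3: build a CodeQL database over the completed programs and analyze it
# with a Python security query suite (the suite name here is an assumption).
subprocess.run(["codeql", "database", "create", "codeql_db",
                "--language=python", f"--source-root={out_dir}"], check=True)
subprocess.run(["codeql", "database", "analyze", "codeql_db",
                "codeql/python-queries:codeql-suites/python-security-and-quality.qls",
                "--format=sarif-latest", "--output=results.sarif"], check=True)
```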


Notice: The following table shows the number of vulnerable Python and C code samples generated by various models on our non-secure prompt dataset. The top-1 column reports the number of vulnerable code samples in the top-ranked output of each model. The top-5 column reports the number of vulnerable code samples among the five most probable model outputs.


Model Name      | top-1 (Python) | top-5 (Python) | top-1 (C) | top-5 (C)
CodeGen-6B      | 108            | 544            | 38        | 203
ChatGPT         | 118            | 567            | 44        | 256
Code Llama-13B  | 115            | 588            | 45        | 252
StarCoder-7B    | 122            | 622            | 59        | 283
WizardCoder-15B | 152            | 747            | 51        | 260

Notice: The following table shows the number of vulnerable Python code samples generated by various models on our non-secure prompt dataset. The results report the number of vulnerable code samples among the five most probable model outputs. Columns two to eleven give detailed results for individual CWEs. Column twelve gives the number of vulnerable code samples found for the other CWEs covered by CodeQL's queries. The last column gives the total number of code samples with at least one security vulnerability.


Model Name      | CWE-020 | CWE-022 | CWE-078 | CWE-079 | CWE-089 | CWE-094 | CWE-117 | CWE-502 | CWE-601 | CWE-611 | Other | Total
CodeGen-6B      | 8       | 78      | 24      | 172     | 33      | 52      | 9       | 31      | 64      | 49      | 24    | 544
ChatGPT         | 19      | 43      | 59      | 118     | 23      | 52      | 32      | 36      | 56      | 48      | 81    | 567
Code Llama-13B  | 34      | 90      | 40      | 128     | 1       | 53      | 35      | 26      | 59      | 43      | 79    | 588
StarCoder-7B    | 18      | 87      | 39      | 155     | 3       | 50      | 11      | 39      | 42      | 48      | 130   | 622
WizardCoder-15B | 16      | 69      | 44      | 133     | 7       | 53      | 21      | 27      | 28      | 26      | 323   | 747

Notice: The following table shows the number of vulnerable C code samples generated by various models on our non-secure prompt dataset. The results report the number of vulnerable code samples among the five most probable model outputs. Columns two to five give detailed results for individual CWEs. Column six gives the number of vulnerable code samples found for the other CWEs covered by CodeQL's queries. The last column gives the total number of code samples with at least one security vulnerability.


Model Name      | CWE-022 | CWE-190 | CWE-476 | CWE-787 | Other | Total
CodeGen-6B      | 35      | 22      | 50      | 79      | 17    | 203
ChatGPT         | 40      | 58      | 47      | 97      | 14    | 256
Code Llama-13B  | 58      | 30      | 53      | 102     | 9     | 252
StarCoder-7B    | 58      | 33      | 74      | 101     | 17    | 283
WizardCoder-15B | 44      | 38      | 57      | 114     | 7     | 260
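
Per-CWE counts such as those in the tables above can be derived from CodeQL's SARIF output by mapping each alert's rule to its CWE tags. The sketch below illustrates one way to do this; the file name results.sarif and the choice of counting unique files per CWE are assumptions, not the benchmark's official aggregation script.

```python
# Illustrative sketch: count vulnerable code samples per CWE from a CodeQL
# SARIF report (e.g., results.sarif produced by `codeql database analyze`).
# Counting unique files per CWE is an assumption about the aggregation.
import json
from collections import defaultdict

with open("results.sarif") as f:
    sarif = json.load(f)

run = sarif["runs"][0]
rules = run["tool"]["driver"]["rules"]

vulnerable_files_per_cwe = defaultdict(set)
for result in run.get("results", []):
    rule = rules[result["ruleIndex"]]
    # CodeQL security rules carry tags such as "external/cwe/cwe-078".
    cwe_tags = [t for t in rule.get("properties", {}).get("tags", [])
                if t.startswith("external/cwe/")]
    uri = result["locations"][0]["physicalLocation"]["artifactLocation"]["uri"]
    for tag in cwe_tags:
        vulnerable_files_per_cwe[tag.rsplit("/", 1)[-1].upper()].add(uri)

for cwe, files in sorted(vulnerable_files_per_cwe.items()):
    print(f"{cwe}: {len(files)} vulnerable code samples")
```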

BibTeX

@inproceedings{
hajipour2024codelmsec,
title={CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models},
author={Hossein Hajipour and Keno Hassler and Thorsten Holz and Lea Schönherr and Mario Fritz},
booktitle={Second IEEE Conference on Secure and Trustworthy Machine Learning},
year={2024}
}