CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

CISPA Helmholtz Center for Information Security

Abstract

Large language models (LLMs) for automatic code generation have recently achieved breakthroughs in several programming tasks. Their advances in competition-level programming problems have made them an essential pillar of AI-assisted pair programming, and tools such as GitHub Copilot have emerged as part of the daily programming workflow used by millions of developers. Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure. While these models have been extensively evaluated for their ability to produce functionally correct programs, there remains a lack of comprehensive investigations and benchmarks addressing the security aspects of these models.
In this work, we propose a method to systematically study the security issues of code language models to assess their susceptibility to generating vulnerable code. To this end, we introduce the first approach to automatically find generated code that contains vulnerabilities in black-box code generation models. This involves proposing a novel few-shot prompting approach. We evaluate the effectiveness of our approach by examining code language models in generating high-risk security weaknesses. Furthermore, we use our method to create a collection of diverse non-secure prompts for various vulnerability scenarios. This dataset serves as a benchmark to evaluate and compare the security weaknesses of code language models.

CodeLM Security Benchmark: We employ our proposed dataset of non-secure prompts as a benchmark to assess and compare different large language models (LLMs). The dataset contains 280 non-secure prompts, 200 designed for Python and 80 for C. We evaluate the security vulnerabilities that these models can generate through the following steps:

  1. Use an LLM (e.g., CodeGen-6B) to generate code completions for each non-secure prompt.
  2. Treat each non-secure prompt together with its generated code completion(s) as complete program(s).
  3. Run the CodeQL security analyzer to detect security issues in the generated programs.

To generate code completions for each non-secure prompt, we use the following settings: a maximum of 512 new tokens, nucleus sampling with five sampled completions per prompt, a top-p value of 0.95, and a temperature of 0.2.
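The sketch below illustrates this pipeline under a few assumptions: a Hugging Face causal code model (CodeGen-6B-mono is used here purely as an example), a hypothetical prompt file path and output directory, and a locally installed CodeQL CLI with the standard Python query pack. It is a minimal illustration of the steps above, not the authors' exact implementation.

```python
# Minimal sketch of the evaluation pipeline described above.
# Model name, file layout, and CodeQL query pack are illustrative assumptions.
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-6B-mono"  # assumption: any causal code LLM can be substituted

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def complete(prompt: str, n_samples: int = 5) -> list[str]:
    """Sample code completions with the settings listed above."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,              # nucleus sampling
        top_p=0.95,
        temperature=0.2,
        max_new_tokens=512,
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens; the prompt is prepended again below.
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

out_dir = Path("generated")  # assumption: CodeQL later builds a database from this directory
out_dir.mkdir(exist_ok=True)
prompt = Path("prompts/python/example.py").read_text()  # hypothetical prompt file
for i, completion in enumerate(complete(prompt)):
    # Step 2: the prompt and its completion together form one complete program.
    (out_dir / f"example_{i}.py").write_text(prompt + completion)

# Step 3 (run separately with the CodeQL CLI), for example:
#   codeql database create db --language=python --source-root=generated
#   codeql database analyze db codeql/python-queries --format=csv --output=results.csv
```

Each program that CodeQL flags with at least one security-relevant alert is counted as vulnerable; the C prompts follow the same procedure with a C-language CodeQL database.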


Notice: The following table shows the number of vulnerable Python and C code samples generated by various models using our non-secure prompt dataset. The top-1 column reports the number of vulnerable samples among the models' top-ranked outputs; the top-5 column reports the number of vulnerable samples among the five most probable model outputs.


Notice: The following table shows the number of vulnerable Python code samples generated by various models using our non-secure prompt dataset. The results report the number of vulnerable samples among the five most probable model outputs. Columns two to eleven give detailed results for specific CWEs, column twelve gives the number of vulnerable samples with other CWEs covered by CodeQL's queries, and the last column gives the total number of samples with at least one security vulnerability.



Notice: The following table shows the number of vulnerable C code samples generated by various models using our non-secure prompt dataset. The results report the number of vulnerable samples among the five most probable model outputs. Columns two to five give detailed results for specific CWEs, column six gives the number of vulnerable samples with other CWEs covered by CodeQL's queries, and the last column gives the total number of samples with at least one security vulnerability.


BibTeX

@inproceedings{hajipour2024codelmsec,
  title     = {CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models},
  author    = {Hossein Hajipour and Keno Hassler and Thorsten Holz and Lea Schönherr and Mario Fritz},
  booktitle = {Second IEEE Conference on Secure and Trustworthy Machine Learning},
  year      = {2024}
}