CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

CISPA Helmholtz Center for Information Security

Abstract

Large language models (LLMs) for automatic code generation have recently achieved breakthroughs in several programming tasks. Their advances in competition-level programming problems have made them an essential pillar of AI-assisted pair programming, and tools such as GitHub Copilot have emerged as part of the daily programming workflow used by millions of developers. Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure. While these models have been extensively evaluated for their ability to produce functionally correct programs, there remains a lack of comprehensive investigations and benchmarks addressing the security aspects of these models.
In this work, we propose a method to systematically study the security issues of code language models to assess their susceptibility to generating vulnerable code. To this end, we introduce the first approach to automatically find generated code that contains vulnerabilities in black-box code generation models. This is achieved through a novel few-shot prompting approach. We evaluate the effectiveness of our approach by examining code language models in generating high-risk security weaknesses. Furthermore, we use our method to create a collection of diverse non-secure prompts for various vulnerability scenarios. This dataset serves as a benchmark to evaluate and compare the security weaknesses of code language models.

CodeLM Security Benchmark: We employ our proposed non-secure prompt dataset as a benchmark to assess and compare different large language models (LLMs). The dataset contains 280 non-secure prompts, 200 designed for Python and 80 for C. We evaluate the security vulnerabilities that these models can generate through the following steps:

  1. Use an LLM (e.g., CodeGen-6B) to generate code completions for each non-secure prompt.
  2. Treat each non-secure prompt together with its generated code completion(s) as complete program(s).
  3. Run the CodeQL security analyzer to detect security issues in the generated programs.

To generate code completions for each non-secure prompt, we use the following settings: a maximum of 512 new tokens, nucleus sampling with a top-p value of 0.95 and a temperature of 0.2, and 5 sampled completions per prompt.
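
The following is a minimal sketch of this three-step pipeline, assuming a Hugging Face causal code LM and the CodeQL CLI installed on the PATH. The model checkpoint, prompt file layout, output directory, and query suite name are illustrative assumptions rather than the exact artifacts shipped with the benchmark.

```python
# Minimal sketch of the three evaluation steps, assuming a Hugging Face causal
# code LM and the CodeQL CLI on PATH. Paths, the model checkpoint, and the
# query suite are illustrative assumptions, not the benchmark's exact artifacts.
import subprocess
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-6B-mono"  # assumption: any code LM could be plugged in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def complete(prompt: str, n: int = 5) -> list[str]:
    """Step 1: sample n completions with nucleus sampling (top-p 0.95, T 0.2)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,
        temperature=0.2,
        max_new_tokens=512,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Step 2: the decoded output contains the prompt followed by the completion,
    # i.e., the complete program to be analyzed.
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

out_dir = Path("generated_codes")
out_dir.mkdir(exist_ok=True)
for i, prompt_file in enumerate(sorted(Path("non_secure_prompts/python").glob("*.py"))):
    for j, code in enumerate(complete(prompt_file.read_text())):
        (out_dir / f"prompt{i:03d}_sample{j}.py").write_text(code)

# Step 3: build a CodeQL database over the completed programs and analyze it
# with a Python security query suite (the suite name here is an assumption).
subprocess.run(["codeql", "database", "create", "codeql_db",
                "--language=python", f"--source-root={out_dir}"], check=True)
subprocess.run(["codeql", "database", "analyze", "codeql_db",
                "codeql/python-queries:codeql-suites/python-security-and-quality.qls",
                "--format=sarif-latest", "--output=results.sarif"], check=True)
```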


Notice: The following table shows the number of vulnerable Python and C code samples generated by various models on our non-secure prompt dataset. The top-1 column reports the number of vulnerable code samples in the top-ranked output of each model. The top-5 column reports the number of vulnerable code samples among the five most probable model outputs.


Model Name      | top-1 (Python) | top-5 (Python) | top-1 (C) | top-5 (C)
CodeGen-6B      | 108            | 544            | 38        | 203
ChatGPT         | 118            | 567            | 44        | 256
Code Llama-13B  | 115            | 588            | 45        | 252
StarCoder-7B    | 122            | 622            | 59        | 283
WizardCoder-15B | 152            | 747            | 51        | 260

Notice: The following table shows the number of vulnerable Python code samples generated by various models on our non-secure prompt dataset. The results report the number of vulnerable code samples among the five most probable model outputs. Columns two to eleven give detailed results for individual CWEs. Column twelve gives the number of vulnerable code samples found for the other CWEs covered by CodeQL's queries. The last column gives the total number of code samples with at least one security vulnerability.


Model Name      | CWE-020 | CWE-022 | CWE-078 | CWE-079 | CWE-089 | CWE-094 | CWE-117 | CWE-502 | CWE-601 | CWE-611 | Other | Total
CodeGen-6B      | 8       | 78      | 24      | 172     | 33      | 52      | 9       | 31      | 64      | 49      | 24    | 544
ChatGPT         | 19      | 43      | 59      | 118     | 23      | 52      | 32      | 36      | 56      | 48      | 81    | 567
Code Llama-13B  | 34      | 90      | 40      | 128     | 1       | 53      | 35      | 26      | 59      | 43      | 79    | 588
StarCoder-7B    | 18      | 87      | 39      | 155     | 3       | 50      | 11      | 39      | 42      | 48      | 130   | 622
WizardCoder-15B | 16      | 69      | 44      | 133     | 7       | 53      | 21      | 27      | 28      | 26      | 323   | 747

Notice: The following table shows the number of vulnerable C code samples generated by various models on our non-secure prompt dataset. The results report the number of vulnerable code samples among the five most probable model outputs. Columns two to five give detailed results for individual CWEs. Column six gives the number of vulnerable code samples found for the other CWEs covered by CodeQL's queries. The last column gives the total number of code samples with at least one security vulnerability.


Model Name      | CWE-022 | CWE-190 | CWE-476 | CWE-787 | Other | Total
CodeGen-6B      | 35      | 22      | 50      | 79      | 17    | 203
ChatGPT         | 40      | 58      | 47      | 97      | 14    | 256
Code Llama-13B  | 58      | 30      | 53      | 102     | 9     | 252
StarCoder-7B    | 58      | 33      | 74      | 101     | 17    | 283
WizardCoder-15B | 44      | 38      | 57      | 114     | 7     | 260
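
Per-CWE counts such as those in the tables above can be derived from CodeQL's SARIF output by mapping each alert's rule to its CWE tags. The sketch below illustrates one way to do this; the file name results.sarif and the choice of counting unique files per CWE are assumptions, not the benchmark's official aggregation script.

```python
# Illustrative sketch: count vulnerable code samples per CWE from a CodeQL
# SARIF report (e.g., results.sarif produced by `codeql database analyze`).
# Counting unique files per CWE is an assumption about the aggregation.
import json
from collections import defaultdict

with open("results.sarif") as f:
    sarif = json.load(f)

run = sarif["runs"][0]
rules = run["tool"]["driver"]["rules"]

vulnerable_files_per_cwe = defaultdict(set)
for result in run.get("results", []):
    rule = rules[result["ruleIndex"]]
    # CodeQL security rules carry tags such as "external/cwe/cwe-078".
    cwe_tags = [t for t in rule.get("properties", {}).get("tags", [])
                if t.startswith("external/cwe/")]
    uri = result["locations"][0]["physicalLocation"]["artifactLocation"]["uri"]
    for tag in cwe_tags:
        vulnerable_files_per_cwe[tag.rsplit("/", 1)[-1].upper()].add(uri)

for cwe, files in sorted(vulnerable_files_per_cwe.items()):
    print(f"{cwe}: {len(files)} vulnerable code samples")
```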

BibTeX

@inproceedings{
hajipour2024codelmsec,
title={CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models},
author={Hossein Hajipour and Keno Hassler and Thorsten Holz and Lea Schönherr and Mario Fritz},
booktitle={Second IEEE Conference on Secure and Trustworthy Machine Learning},
year={2024}
}