How secure is AI-generated code: a large-scale comparison of large language models


Research field: Strategies



Venue: Empirical Software Engineering


Abstract

This study compares state-of-the-art Large Language Models (LLMs) on their tendency to generate vulnerabilities when writing C programs from a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE ’23, featuring 112,000 C programs generated by GPT-3.5-turbo, with over 51.24% identified as vulnerable. We extend that research with a large-scale study of nine state-of-the-art models, including OpenAI’s GPT-4o-mini, Google’s Gemini Pro 1.0, TII’s 180-billion-parameter Falcon, Meta’s 13-billion-parameter Code Llama, and several other compact models. We also introduce the FormAI-v2 dataset, which comprises 331,000 compilable C programs generated by these LLMs. Each program in the dataset is labeled with the vulnerabilities detected in its source code through formal verification, using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for each specific vulnerability, and reduces false negatives by running the verification process to completion. Our study reveals that at least 62.07% of the generated programs are vulnerable. The differences between the models are minor: they all make similar coding errors, with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires proper risk assessment and validation.
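
The phrase "at least 62.07%" reads as a lower bound. One way such a bound can arise is sketched below, under the assumption (ours, not stated in the abstract) that a program counts as vulnerable only when the verifier produces a counterexample, and that programs whose verification does not finish are left uncounted. The outcome labels, the function name, and the toy data are all hypothetical and are not taken from the FormAI-v2 schema.

```python
from collections import Counter

# Hypothetical outcome labels, for illustration only.
VULNERABLE = "failed"   # verifier produced a counterexample
SAFE = "success"        # verification finished without finding a violation
UNKNOWN = "unknown"     # verification did not finish (assumed category)


def vulnerable_share_bounds(outcomes):
    """Return (lower, upper) bounds on the fraction of vulnerable programs.

    The lower bound counts only programs with a counterexample; the upper
    bound also counts every program whose verification is inconclusive.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes given")
    lower = counts[VULNERABLE] / total
    upper = (counts[VULNERABLE] + counts[UNKNOWN]) / total
    return lower, upper


# Toy data, made up for this example (not results from the study).
toy = [VULNERABLE] * 6 + [SAFE] * 3 + [UNKNOWN] * 1
low, high = vulnerable_share_bounds(toy)
print(f"vulnerable share between {low:.0%} and {high:.0%}")
```

On this toy input the reported figure would be the lower bound, since the inconclusive program might or might not contain a vulnerability.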


Author Profile
Norbert Tihanyi

Eötvös Loránd University (ELTE) Budapest Hungary

Hungary
Author Profile
Tamas Bisztray

Technology Innovation Institute (TII) Abu Dhabi UAE

No information
Author Profile
Mohamed Amine Ferrag

University of Oslo Oslo Norway

Norway

📄 Paper information

Publication year: 2024
Citations: 0
Publication countries: Hungary, Norway, Algeria
Site: Springer
