
-----

## Model Details

  * **Developed by:** CipherSaber
  * **Base Model:** `Qwen/Qwen1.5-7B-Chat`
  * **Fine-tuning Method:** Supervised Fine-Tuning (SFT) with the TRL library.
  * **Language:** English
  * **License:** Apache-2.0
  * **Repository:** [https://huggingface.co/CipherSaber/Foss-Cherub-Vuln-Detector-v1](https://huggingface.co/CipherSaber/Foss-Cherub-Vuln-Detector-v1)

-----

## Model Description

`Foss-Cherub-Vuln-Detector-v1` is a specialized generative language model fine-tuned to act as an expert security analyst. Its primary function is to analyze snippets of source code across various languages (C, C++, Java, Python) and identify potential security flaws.

The model is trained to follow a specific instruction format: it receives code and responds with either "**Vulnerable**" plus the relevant Common Weakness Enumeration (CWE) ID, or "**Not Vulnerable**" if the code appears secure. It can also generate mitigation advice when prompted for it.
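
Because the reply follows a fixed two-way format, downstream tools can turn it into structured data. The following is a minimal sketch, not part of the model's tooling; the names `Verdict` and `parse_verdict` are hypothetical, and it assumes the reply begins with one of the two verdict strings described above.

```python
import re
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    vulnerable: Optional[bool]  # None when the reply matches neither form
    cwe_id: Optional[str]       # e.g. "CWE-120", or None if absent


_CWE = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def parse_verdict(reply: str) -> Verdict:
    """Map a raw model reply onto (vulnerable, cwe_id)."""
    text = reply.strip()
    lowered = text.lower()
    # Check "not vulnerable" first: it contains "vulnerable" as a substring.
    if lowered.startswith("not vulnerable"):
        return Verdict(False, None)
    if lowered.startswith("vulnerable"):
        match = _CWE.search(text)
        return Verdict(True, f"CWE-{match.group(1)}" if match else None)
    # Anything else is treated as unparseable rather than guessed at.
    return Verdict(None, None)
```

Returning `None` for an unrecognized reply lets the caller route it to manual review instead of silently counting it as secure.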

**Key Features:**

  * Detects common vulnerability patterns.
  * May flag novel, "zero-day"-style flaws that rule-based systems miss.
  * Provides CWE classification for identified weaknesses.
  * Can generate detailed mitigation advice and code patches.

-----

## Intended Use

This model is intended to be used as the core AI engine for automated security auditing tools, IDE extensions, and CI/CD pipeline security gates. It is designed to assist:

  * **Developers:** By providing real-time feedback on potential security issues as they write code.
  * **Security Analysts:** By automating the initial triage of large codebases, allowing them to focus on verifying the most critical findings.
  * **Open-Source Maintainers:** By providing a first line of defense against insecure code contributions.

This model is **not** a replacement for a comprehensive security review by a human expert, especially for critical applications. It should be used as an assistive tool to augment, not replace, manual security practices.

-----

## Limitations & Bias

  * **False Positives/Negatives:** Like all AI models, this model can produce false positives (flagging secure code as vulnerable) and false negatives (missing actual vulnerabilities). All findings should be manually verified.
  * **Language Scope:** While trained on a multi-language dataset, its performance may vary between languages. It is most proficient with C, C++, Java, and Python.
  * **Context Window:** The model's analysis is limited to the provided code snippet. It cannot analyze an entire application's architecture or data flow, which may be necessary to understand the full context of a vulnerability.
  * **Hallucination:** The model may occasionally generate incorrect CWE IDs or mitigation advice. The generated fixes should be carefully reviewed and tested before implementation.
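
Since a reported CWE ID may be hallucinated, a cheap sanity check before showing a finding can help. The sketch below is illustrative only: `KNOWN_CWES` is a small hand-picked sample rather than the full CWE catalogue, and `review_status` is a hypothetical helper name.

```python
import re

# Illustrative sample only; a real tool would load the complete CWE list.
KNOWN_CWES = {
    "CWE-22": "Path Traversal",
    "CWE-78": "OS Command Injection",
    "CWE-79": "Cross-site Scripting",
    "CWE-89": "SQL Injection",
    "CWE-120": "Classic Buffer Overflow",
    "CWE-416": "Use After Free",
    "CWE-476": "NULL Pointer Dereference",
}

_CWE_FORMAT = re.compile(r"CWE-[1-9]\d*")


def review_status(cwe_id, known=KNOWN_CWES):
    """Classify a reported CWE ID before it reaches a human reviewer."""
    if cwe_id is None or not _CWE_FORMAT.fullmatch(cwe_id):
        return "malformed"
    if cwe_id not in known:
        return "unknown-id"
    # A well-formed, known ID is still only a candidate finding.
    return "needs-human-verification"
```

Even the best case returns `needs-human-verification`, reflecting the point above that every finding should be checked manually.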

-----

## Training Data

The model was fine-tuned on a synthetically generated dataset comprising thousands of instruction-response pairs. Each data point consisted of a source code snippet (either secure or containing a specific vulnerability) and an expert-style analysis. Training used the following prompt structure, and inference prompts should match it for reliable results:

````
<|im_start|>system
You are an expert security analyst. Analyze the code for vulnerabilities. Respond with "Vulnerable" and the CWE ID if a flaw exists, or "Not Vulnerable".
<|im_end|>
<|im_start|>user
Analyze this {language} code:
```
{code}
```
<|im_end|>
<|im_start|>assistant
{expert_analysis}
````
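
The template above can be filled in programmatically so that every request matches the training format. This is a minimal sketch; `build_prompt`, `SYSTEM_PROMPT`, and `FENCE` are hypothetical names, and the prompt stops after the assistant tag so that the model generates the analysis itself.

```python
SYSTEM_PROMPT = (
    "You are an expert security analyst. Analyze the code for vulnerabilities. "
    'Respond with "Vulnerable" and the CWE ID if a flaw exists, or "Not Vulnerable".'
)

FENCE = "`" * 3  # the triple-backtick fence that wraps the code inside the prompt


def build_prompt(code: str, language: str) -> str:
    """Fill the training-time template, ending at the assistant turn."""
    return (
        "<|im_start|>system\n"
        f"{SYSTEM_PROMPT}\n"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"Analyze this {language} code:\n"
        f"{FENCE}\n{code.strip()}\n{FENCE}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```
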
-----

## How to Use

The model should be used with the `transformers` library. It is critical to format the input using the same prompt structure it was trained on.

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Hugging Face repository name
model_name = "CipherSaber/Foss-Cherub-Vuln-Detector-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"  # places the weights automatically; requires `accelerate`
)

# Example vulnerable C code (raw string so the C escape \n stays literal)
vulnerable_code = r"""
#include <stdio.h>
#include <string.h>

void vulnerable_function(char *input) {
    char buffer[100];
    strcpy(buffer, input); // Classic buffer overflow
    printf("Input was: %s\n", buffer);
}
"""

# Format the input using the specific prompt template
prompt = f"""<|im_start|>system
You are an expert security analyst. Analyze the code for vulnerabilities. Respond with "Vulnerable" and the CWE ID if a flaw exists, or "Not Vulnerable".
<|im_end|>
<|im_start|>user
Analyze this C code:
```
{vulnerable_code.strip()}
```
<|im_end|>
<|im_start|>assistant
"""

# Generate the analysis; inputs go to whichever device holds the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens. Splitting on "<|im_start|>assistant"
# would fail, because skip_special_tokens=True removes that marker.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
analysis = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

print(analysis)
# Expected output (exact wording may vary):
# Vulnerable. CWE-120: Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')
````
-----

## Citation

If you use this model in your research or application, please cite it as follows:

```bibtex
@software{CipherSaber_2025_Foss-Cherub,
  author = {CipherSaber},
  title = {{Foss-Cherub-Vuln-Detector-v1: An AI-Powered Source Code Vulnerability Detector}},
  month = oct,
  year = 2025,
  url = {https://huggingface.co/CipherSaber/Foss-Cherub-Vuln-Detector-v1}
}
```