-----

## Model Details

* **Developed by:** CipherSaber
* **Base Model:** `Qwen/Qwen1.5-7B-Chat`
* **Fine-tuning Method:** Supervised Fine-Tuning (SFT) with the TRL library.
* **Language:** English
* **License:** Apache-2.0
* **Repository:** [https://huggingface.co/CipherSaber/Foss-Cherub-Vuln-Detector-v1](https://huggingface.co/CipherSaber/Foss-Cherub-Vuln-Detector-v1)

-----

## Model Description

`Foss-Cherub-Vuln-Detector-v1` is a specialized generative language model fine-tuned to act as an expert security analyst. Its primary function is to analyze snippets of source code across various languages (C, C++, Java, Python) and identify potential security flaws.

The model is trained to follow a specific instruction format, where it receives code and responds with either "**Vulnerable**" along with the relevant Common Weakness Enumeration (CWE) ID, or "**Not Vulnerable**" if the code appears secure. It can also generate mitigation advice when prompted accordingly.

**Key Features:**

* Detects common vulnerability patterns.
* Identifies potentially novel, "zero-day"-like flaws that rule-based systems might miss.
* Provides CWE classification for identified weaknesses.
* Can generate detailed mitigation advice and code patches.

-----

## Intended Use

This model is intended to be used as the core AI engine for automated security auditing tools, IDE extensions, and CI/CD pipeline security gates. It is designed to assist:

* **Developers:** By providing real-time feedback on potential security issues as they write code.
* **Security Analysts:** By automating the initial triage of large codebases, allowing them to focus on verifying the most critical findings.
* **Open-Source Maintainers:** By providing a first line of defense against insecure code contributions.

This model is **not** a replacement for a comprehensive security review by a human expert, especially for critical applications.
It should be used as an assistive tool to augment, not replace, manual security practices.

-----

## Limitations & Bias

* **False Positives/Negatives:** Like all AI models, this model can produce false positives (flagging secure code as vulnerable) and false negatives (missing actual vulnerabilities). All findings should be manually verified.
* **Language Scope:** While trained on a multi-language dataset, its performance may vary between languages. It is most proficient with C, C++, Java, and Python.
* **Context Window:** The model's analysis is limited to the provided code snippet. It cannot analyze an entire application's architecture or data flow, which may be necessary to understand the full context of a vulnerability.
* **Hallucination:** The model may occasionally generate incorrect CWE IDs or mitigation advice. The generated fixes should be carefully reviewed and tested before implementation.

-----

## Training Data

The model was fine-tuned on a synthetically generated dataset comprising thousands of instruction-response pairs. Each data point consisted of a source code snippet (either secure or containing a specific vulnerability) and an expert-style analysis.

The training was performed using the following prompt structure, which is crucial for effective inference:

````
<|im_start|>system
You are an expert security analyst. Analyze the code for vulnerabilities. Respond with "Vulnerable" and the CWE ID if a flaw exists, or "Not Vulnerable".
<|im_end|>
<|im_start|>user
Analyze this {language} code:
```
{code}
```
<|im_end|>
<|im_start|>assistant
{expert_analysis}
````

-----

## How to Use

The model should be used with the `transformers` library. It is critical to format the input using the same prompt structure it was trained on.
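Because the prompt has to match the training format, it can help to build it in one place. The sketch below is a hypothetical helper, `build_prompt`, that is not part of the model repository. It fills in the template shown under Training Data; the exact line breaks around the `<|im_end|>` markers and the trimming of surrounding newlines from the code are assumptions, since the original whitespace is not documented here.

```python
# Hypothetical helper (an assumption, not shipped with the model): fills in
# the ChatML-style template quoted in the Training Data section.
SYSTEM_PROMPT = (
    "You are an expert security analyst. Analyze the code for vulnerabilities. "
    'Respond with "Vulnerable" and the CWE ID if a flaw exists, or "Not Vulnerable".'
)
FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting fences


def build_prompt(language: str, code: str) -> str:
    """Return the inference prompt, ending where the assistant's reply begins."""
    snippet = code.strip("\n")  # assumption: no blank lines inside the fence
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}\n<|im_end|>\n"
        f"<|im_start|>user\nAnalyze this {language} code:\n"
        f"{FENCE}\n{snippet}\n{FENCE}\n<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# Example: build a prompt for a short Python snippet and inspect it.
prompt = build_prompt("Python", "import os\nos.system(user_input)")
print(prompt)
```

The returned string can be passed to the tokenizer directly. Keeping the template in a single function makes it easier to stay consistent across languages and callers.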
A complete inference example that loads the model and analyzes a short C function:

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Your Hugging Face repository name
model_name = "CipherSaber/Foss-Cherub-Vuln-Detector-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # places the model on the available device(s)
)

# Example vulnerable C code (raw string so the "\n" inside printf stays literal)
vulnerable_code = r"""
#include <stdio.h>
#include <string.h>

void vulnerable_function(char *input) {
    char buffer[100];
    strcpy(buffer, input); // Classic buffer overflow
    printf("Input was: %s\n", buffer);
}
"""

# Format the input using the specific prompt template
prompt = f"""<|im_start|>system
You are an expert security analyst. Analyze the code for vulnerabilities. Respond with "Vulnerable" and the CWE ID if a flaw exists, or "Not Vulnerable".
<|im_end|>
<|im_start|>user
Analyze this C code:
```
{vulnerable_code}
```
<|im_end|>
<|im_start|>assistant
"""

# Generate the analysis; inputs go to whichever device holds the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens. With skip_special_tokens=True the
# "<|im_start|>" markers are removed, so splitting the text on them fails.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
analysis = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(analysis)
# Expected Output: Vulnerable. CWE-120: Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')
````

-----

## Citation

If you use this model in your research or application, please cite it as follows:

```bibtex
@software{CipherSaber_2025_Foss-Cherub,
  author = {CipherSaber},
  title = {{Foss-Cherub-Vuln-Detector-v1: An AI-Powered Source Code Vulnerability Detector}},
  month = oct,
  year = 2025,
  url = {https://huggingface.co/CipherSaber/Foss-Cherub-Vuln-Detector-v1}
}
```