
## Model Details


## Model Description

Foss-Cherub-Vuln-Detector-v1 is a specialized generative language model fine-tuned to act as an expert security analyst. Its primary function is to analyze snippets of source code across various languages (C, C++, Java, Python) and identify potential security flaws.

The model is trained to follow a specific instruction format: it receives code and responds with either "Vulnerable" plus the relevant Common Weakness Enumeration (CWE) ID, or "Not Vulnerable" if the code appears secure. It can also generate mitigation advice when prompted for it.
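
Downstream tooling usually needs this response as structured data rather than free text. The following is a minimal sketch of such a parser, assuming the response begins with "Vulnerable" or "Not Vulnerable" and that any CWE ID appears in the form `CWE-<number>`. The names `Verdict` and `parse_verdict` are hypothetical helpers and are not shipped with the model.

```python
import re
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    vulnerable: bool
    cwe_id: Optional[str]  # e.g. "CWE-120", or None when no ID was found


# Assumption: CWE IDs are written as "CWE-<digits>" somewhere in the response.
_CWE_PATTERN = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def parse_verdict(response: str) -> Verdict:
    """Turn a raw model response into a (vulnerable, cwe_id) pair."""
    text = response.strip()
    lowered = text.lower()
    # Check the negative case first: "not vulnerable" contains "vulnerable".
    if lowered.startswith("not vulnerable"):
        return Verdict(False, None)
    match = _CWE_PATTERN.search(text)
    cwe_id = f"CWE-{match.group(1)}" if match else None
    return Verdict(lowered.startswith("vulnerable"), cwe_id)
```

A response that matches neither prefix yields `Verdict(False, ...)`, so callers may want to treat unparseable output as "needs review" rather than as a clean result.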

Key Features:

  • Detects common vulnerability patterns.
  • Identifies potentially novel, "zero-day"-like flaws that rule-based systems might miss.
  • Provides CWE classification for identified weaknesses.
  • Can generate detailed mitigation advice and code patches.

## Intended Use

This model is intended to be used as the core AI engine for automated security auditing tools, IDE extensions, and CI/CD pipeline security gates. It is designed to assist:

  • Developers: By providing real-time feedback on potential security issues as they write code.
  • Security Analysts: By automating the initial triage of large codebases, allowing them to focus on verifying the most critical findings.
  • Open-Source Maintainers: By providing a first line of defense against insecure code contributions.

This model is not a replacement for a comprehensive security review by a human expert, especially for critical applications. It should be used as an assistive tool to augment, not replace, manual security practices.
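
As one illustration of the CI/CD use case, the sketch below turns a set of per-file model responses into a pipeline exit code. It assumes the responses have already been collected into a dict keyed by file path. The function names and the `advisory` flag are hypothetical. The default advisory mode only reports findings and never fails the build, in keeping with the assistive role described above.

```python
from typing import Dict, List


def flagged_files(responses: Dict[str, str]) -> List[str]:
    """Return the paths whose model response starts with "Vulnerable"."""
    return sorted(
        path
        for path, text in responses.items()
        if text.strip().lower().startswith("vulnerable")
    )


def gate_exit_code(responses: Dict[str, str], advisory: bool = True) -> int:
    """Report flagged files; fail the build only when advisory mode is off."""
    flagged = flagged_files(responses)
    for path in flagged:
        print(f"needs human review: {path}")
    return 0 if advisory or not flagged else 1
```

A pipeline step could pass the returned value to `sys.exit`, switching `advisory` off only once the team has measured how often the findings are correct.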


## Limitations & Bias

  • False Positives/Negatives: Like all AI models, this model can produce false positives (flagging secure code as vulnerable) and false negatives (missing actual vulnerabilities). All findings should be manually verified.
  • Language Scope: While trained on a multi-language dataset, its performance may vary between languages. It is most proficient with C, C++, Java, and Python.
  • Context Window: The model's analysis is limited to the provided code snippet. It cannot analyze an entire application's architecture or data flow, which may be necessary to understand the full context of a vulnerability.
  • Hallucination: The model may occasionally generate incorrect CWE IDs or mitigation advice. The generated fixes should be carefully reviewed and tested before implementation.
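
One way to quantify the false positive and false negative behavior described above is to run the model on a hand-labeled sample and compute precision and recall. The sketch below assumes two parallel lists of booleans (model verdicts and human-verified labels); the function name is hypothetical and no figures for this model are implied.

```python
from typing import Sequence, Tuple


def precision_recall(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Compute (precision, recall) for "vulnerable" verdicts.

    predicted[i] is the model's verdict for snippet i,
    actual[i] is the human-verified label for the same snippet.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    # Guard against empty denominators (no positives predicted or present).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall
```

Low precision means reviewers spend time on false alarms; low recall means real flaws slip through, which matters more when the model is used as a gate.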

## Training Data

The model was fine-tuned on a synthetically generated dataset comprising thousands of instruction-response pairs. Each data point consisted of a source code snippet (either secure or containing a specific vulnerability) and an expert-style analysis. The training was performed using the following prompt structure, which is crucial for effective inference:

<|im_start|>system
You are an expert security analyst. Analyze the code for vulnerabilities. Respond with "Vulnerable" and the CWE ID if a flaw exists, or "Not Vulnerable".
<|im_end|>
<|im_start|>user
Analyze this {language} code:

{code}
<|im_end|>
<|im_start|>assistant
{expert_analysis}
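
Because the template matters for inference, it can help to build prompts through a single helper instead of hand-writing the string each time. This is a minimal sketch; `build_prompt` and `SYSTEM_PROMPT` are hypothetical names, and the exact whitespace between sections is an assumption based on the template shown above.

```python
SYSTEM_PROMPT = (
    "You are an expert security analyst. Analyze the code for vulnerabilities. "
    'Respond with "Vulnerable" and the CWE ID if a flaw exists, or "Not Vulnerable".'
)


def build_prompt(language: str, code: str) -> str:
    """Fill the training-time template, stopping where the assistant turn begins."""
    return (
        "<|im_start|>system\n"
        f"{SYSTEM_PROMPT}\n"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"Analyze this {language} code:\n\n"
        f"{code}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"  # the model continues from here
    )
```

For example, `build_prompt("Python", "eval(input())")` returns a prompt ending in the assistant marker, ready to be tokenized.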

## How to Use

The model should be used with the transformers library. It is critical to format the input using the same prompt structure it was trained on.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Your Hugging Face repository name
model_name = "CipherSaber/Foss-Cherub-Vuln-Detector-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example vulnerable C code (raw string, so the C escape sequence \n is kept literally)
vulnerable_code = r"""
#include <stdio.h>
#include <string.h>

void vulnerable_function(char *input) {
    char buffer[100];
    strcpy(buffer, input); // Classic buffer overflow
    printf("Input was: %s\n", buffer);
}
"""
# Format the input using the specific prompt template
prompt = f"""<|im_start|>system
You are an expert security analyst. Analyze the code for vulnerabilities. Respond with "Vulnerable" and the CWE ID if a flaw exists, or "Not Vulnerable".
<|im_end|>
<|im_start|>user
Analyze this C code:

{vulnerable_code}

<|im_end|>
<|im_start|>assistant
"""
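# Generate the analysis
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens. Splitting the full text on
# "<|im_start|>assistant" is unreliable because skip_special_tokens=True
# can strip that marker from the decoded string.
prompt_length = inputs["input_ids"].shape[1]
analysis = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()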

print(analysis)
# Expected Output: Vulnerable. CWE-120: Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')

## Citation

If you use this model in your research or application, please cite it as follows:

@software{CipherSaber_2025_Foss-Cherub,
  author = {CipherSaber},
  title = {{Foss-Cherub-Vuln-Detector-v1: An AI-Powered Source Code Vulnerability Detector}},
  month = oct,
  year = 2025,
  url = {https://huggingface.co/CipherSaber/Foss-Cherub-Vuln-Detector-v1}
}