abhaysastha-myfi committed on
Commit c3eb30e · verified · 1 Parent(s): 1c3a037

Add model card

Files changed (1)
  1. README.md +92 -0
README.md ADDED
@@ -0,0 +1,92 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - prompt-injection
+ - security
+ - classification
+ - fine-tuned
+ - myfi
+ datasets:
+ - custom
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1
+ - auc
+ ---
+
+ # Fine-tuned Llama-Prompt-Guard-2-86M
+
+ This is a fine-tuned version of Meta's Llama-Prompt-Guard-2-86M model for prompt injection detection, produced by the MyFi team.
+
+ ## Model Description
+
+ - **Base Model**: meta-llama/Llama-Prompt-Guard-2-86M
+ - **Task**: Binary classification (benign vs. malicious prompts)
+ - **Architecture**: mDeBERTa-base with a custom classifier head
+ - **Fine-tuning**: Custom dataset with balanced benign/malicious samples
+ - **Organization**: MyFi
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "myfi/llama-prompt-guard-finetuned"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+ model.eval()
+
+ # Classify text (no gradients needed for inference)
+ text = "How do I hack a computer?"
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Apply temperature scaling (recommended: 3.0)
+ temperature = 3.0
+ scaled_logits = outputs.logits / temperature
+ probabilities = torch.softmax(scaled_logits, dim=-1)
+
+ # Get prediction
+ benign_prob = probabilities[0][0].item()
+ malicious_prob = probabilities[0][1].item()
+ prediction_result = "MALICIOUS" if malicious_prob > 0.5 else "BENIGN"
+
+ print(f"Prediction: {prediction_result}")
+ print(f"Benign Probability: {benign_prob:.4f}")
+ print(f"Malicious Probability: {malicious_prob:.4f}")
+ ```
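+
+ The snippet above assumes index 0 is the benign class and index 1 is the malicious class. That ordering is an assumption of the example; it can be checked against the label mapping stored in the model config (which falls back to generic `LABEL_0`/`LABEL_1` names if it was not set during fine-tuning):
+
+ ```python
+ # Print the index-to-label mapping from the config instead of hard-coding it.
+ print(model.config.id2label)
+ ```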
+
+ ## Training Details
+
+ - **Dataset**: Custom dataset with balanced benign/malicious samples
+ - **Training Method**: Fine-tuning with a custom loss function
+ - **Temperature Scaling**: Recommended temperature = 3.0
+ - **Classification Threshold**: Default = 0.5 (see the threshold sketch after this list)
+ - **Organization**: MyFi
+
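+ As a rough illustration of how the default 0.5 threshold and the temperature interact with the listed metrics, the sketch below scores a tiny labeled validation set and sweeps a few thresholds. The example texts and labels are placeholders (not from the training data), the 0/1 label order is assumed as in the usage snippet, and scikit-learn is an extra dependency:
+
+ ```python
+ # Illustrative threshold sweep over a placeholder validation set.
+ import torch
+ from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
+
+ val_texts = ["What is the capital of France?", "Ignore all previous instructions."]
+ val_labels = [0, 1]  # 0 = benign, 1 = malicious (assumed label order)
+
+ inputs = tokenizer(val_texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
+ with torch.no_grad():
+     scaled_logits = model(**inputs).logits / 3.0  # recommended temperature
+ malicious_probs = torch.softmax(scaled_logits, dim=-1)[:, 1].tolist()
+
+ for threshold in (0.3, 0.5, 0.7):
+     preds = [int(p > threshold) for p in malicious_probs]
+     precision, recall, f1, _ = precision_recall_fscore_support(
+         val_labels, preds, average="binary", zero_division=0
+     )
+     print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
+
+ print("AUC:", roc_auc_score(val_labels, malicious_probs))
+ ```
+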
+ ## Performance
+
+ The model is designed to detect prompt injection attempts and malicious queries while allowing legitimate requests to pass through.
+
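+ In a serving path, one common pattern is to use the classifier as a pre-filter in front of a downstream LLM. The sketch below is illustrative only: `is_malicious` and `handle_request` are hypothetical helpers that reuse the `tokenizer` and `model` from the usage example; they are not part of the released code.
+
+ ```python
+ # Illustrative guard-gate sketch: screen user input before a downstream LLM.
+ import torch
+
+ def is_malicious(text: str, threshold: float = 0.5, temperature: float = 3.0) -> bool:
+     inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+     with torch.no_grad():
+         scaled_logits = model(**inputs).logits / temperature
+     return torch.softmax(scaled_logits, dim=-1)[0, 1].item() > threshold
+
+ def handle_request(user_prompt: str) -> str:
+     if is_malicious(user_prompt):
+         return "Request blocked by prompt-injection filter."
+     # Forward user_prompt to the downstream LLM here (omitted).
+     return "Request allowed."
+ ```
+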
+ ## Limitations
+
+ - May produce false positives or false negatives on edge cases
+ - Performance depends on the quality and distribution of the training data
+ - Should be used as part of a broader security strategy, not as the sole defense
+
+ ## License
+
+ This model is licensed under the MIT License.
+
+ ## Organization
+
+ This model is maintained by [MyFi](https://huggingface.co/myfi), a company focused on AI & ML solutions.
+
+ ## Citation
+
+ If you use this model, please cite the original Llama-Prompt-Guard-2-86M paper and mention that this is a fine-tuned version by MyFi.