synapti committed
Commit 833baa4 · verified · 1 Parent(s): a6f5099

Model save

Files changed (3):
  1. README.md +50 -107
  2. model.safetensors +1 -1
  3. test_results.json +12 -0
README.md CHANGED
@@ -1,134 +1,77 @@
  ---
  license: apache-2.0
  base_model: answerdotai/ModernBERT-base
  tags:
- - transformers
- - modernbert
- - text-classification
- - propaganda-detection
- - binary-classification
- - nci-protocol
- datasets:
- - synapti/nci-propaganda-production
  metrics:
  - accuracy
  - f1
  - precision
  - recall
- pipeline_tag: text-classification
  ---

- # NCI Binary Propaganda Detector
-
- Binary classifier that detects whether text contains propaganda/manipulation techniques.
-
- ## Model Description
-
- This model is **Stage 1** of the NCI (Narrative Credibility Index) two-stage propaganda detection pipeline:
-
- - **Stage 1 (this model)**: Fast binary detection - "Does this text contain propaganda?"
- - **Stage 2**: Multi-label technique classification - "Which specific techniques are used?"
-
- The binary detector is optimized for **high recall** to ensure manipulative content is not missed, while Stage 2 provides detailed technique classification.
-
- ## Intended Uses
-
- - Fast filtering of content for propaganda presence
- - First-pass screening in content moderation pipelines
- - Real-time detection in social media monitoring
- - Input gating for detailed technique analysis
-
- ## Training Data
-
- Trained on the [synapti/nci-propaganda-production](https://huggingface.co/datasets/synapti/nci-propaganda-production) dataset:
-
- - **23,000+ examples** from multiple sources
- - **Positive examples**: SemEval-2020 Task 11 propaganda techniques
- - **Hard negatives**: LIAR2 factual statements, Qbias center-biased news
- - **Train/Val/Test split**: 80/10/10
-
- ## Performance
-
- | Metric | Score |
- |--------|-------|
- | Accuracy | ~95% |
- | F1 | ~94% |
- | Precision | ~96% |
- | Recall | ~92% |
-
- ## Usage
-
- ```python
- from transformers import pipeline
-
- # Load the model
- detector = pipeline("text-classification", model="synapti/nci-binary-detector")
-
- # Detect propaganda
- text = "The radical left wants to DESTROY our country!"
- result = detector(text)
-
- # Result: {'label': 'LABEL_1', 'score': 0.99}
- # LABEL_0 = no propaganda, LABEL_1 = has propaganda
- ```
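The pipeline emits the generic ids shown in the comment above. A minimal sketch for mapping them to readable names, assuming the checkpoint's config does not already define `id2label`; the mapping simply restates the comment in the block above:

```python
from transformers import pipeline

detector = pipeline("text-classification", model="synapti/nci-binary-detector")

# Restates the LABEL_0 / LABEL_1 convention documented above.
LABELS = {"LABEL_0": "no_propaganda", "LABEL_1": "has_propaganda"}

result = detector("Example input text")[0]
print(LABELS[result["label"]], f"{result['score']:.2f}")
```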
-
- ### Two-Stage Pipeline
-
- For complete propaganda analysis, use this model together with the technique classifier:
-
- ```python
- from transformers import pipeline
-
- binary = pipeline("text-classification", model="synapti/nci-binary-detector")
- technique = pipeline("text-classification", model="synapti/nci-technique-classifier", top_k=None)
-
- text = "Your text here..."
-
- # Stage 1: Binary detection
- binary_result = binary(text)[0]
- has_propaganda = binary_result["label"] == "LABEL_1"
-
- if has_propaganda:
-     # Stage 2: Technique classification (top_k=None returns a score for every label)
-     techniques = technique(text)
-     detected = [t for t in techniques if t["score"] > 0.3]
- ```
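Gating Stage 2 on the Stage 1 decision keeps the multi-label pass off the bulk of benign traffic, and because the binary model is tuned for high recall, little manipulative content should be filtered out before technique analysis. Note that the 0.3 cutoff above is the card's illustrative threshold, not a calibrated value.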
-
- ## Model Architecture
-
- - **Base Model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
- - **Parameters**: 149.6M
- - **Max Sequence Length**: 512 tokens
- - **Output**: 2 classes (no_propaganda, has_propaganda)
-
- ## Training Details
-
- - **Loss Function**: Focal Loss (gamma=2.0, alpha=0.25); see the sketch after this list
- - **Optimizer**: AdamW
- - **Learning Rate**: 2e-5
- - **Batch Size**: 16 (effective 64 with gradient accumulation)
- - **Epochs**: 5 with early stopping
- - **Hardware**: NVIDIA A10G GPU
-
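As referenced in the loss-function bullet, a minimal focal-loss sketch using the stated gamma and alpha; the training script is not part of this repository, so treat this as an illustration of the technique rather than the exact code used:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Focal loss over 2-class logits with the gamma/alpha listed above."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-example CE
    p_t = torch.exp(-ce)  # probability the model assigns to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weighting
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```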
- ## Limitations
-
- - Trained primarily on English text
- - May not detect novel propaganda techniques absent from the training data
- - Optimized for short-to-medium-length text (tweets, headlines, paragraphs)
- - Should be used as part of a larger analysis pipeline, not as the sole arbiter
-
- ## Citation
-
- ```bibtex
- @misc{nci-binary-detector,
-   author = {NCI Protocol Team},
-   title = {NCI Binary Propaganda Detector},
-   year = {2024},
-   publisher = {HuggingFace},
-   url = {https://huggingface.co/synapti/nci-binary-detector}
- }
- ```
-
- ## License
-
- Apache 2.0
 
  ---
+ library_name: transformers
  license: apache-2.0
  base_model: answerdotai/ModernBERT-base
  tags:
+ - generated_from_trainer
  metrics:
  - accuracy
  - f1
  - precision
  - recall
+ model-index:
+ - name: nci-binary-detector
+   results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # nci-binary-detector

+ This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on an unknown dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0031
+ - Accuracy: 0.9954
+ - F1: 0.9959
+ - Precision: 0.9919
+ - Recall: 1.0
+ - Roc Auc: 0.9986

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training (a `TrainingArguments` sketch follows this list):
+ - learning_rate: 2e-05
+ - train_batch_size: 16
+ - eval_batch_size: 32
+ - seed: 42
+ - gradient_accumulation_steps: 2
+ - total_train_batch_size: 32
+ - optimizer: fused AdamW (adamw_torch_fused) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 5
+ - mixed_precision_training: Native AMP
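A minimal sketch of how the listed hyperparameters map onto `transformers.TrainingArguments`; the training script itself is not in this commit, so `output_dir` and the choice of `fp16` for "Native AMP" are assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="nci-binary-detector",  # placeholder, not taken from the commit
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    seed=42,
    gradient_accumulation_steps=2,     # total train batch size: 32
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=5,
    fp16=True,                         # "Native AMP"; bf16 is also plausible
)
```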

+ ### Training results

+ | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall | Roc Auc |
+ |:-------------:|:------:|:----:|:---------------:|:--------:|:------:|:---------:|:------:|:-------:|
+ | 0.0093 | 0.1634 | 100 | 0.0043 | 0.9844 | 0.9865 | 0.9763 | 0.9970 | 0.9990 |
+ | 0.0021 | 0.3268 | 200 | 0.0036 | 0.9954 | 0.9960 | 0.9930 | 0.9990 | 0.9978 |
+ | 0.0001 | 0.4902 | 300 | 0.0011 | 0.9988 | 0.9990 | 0.9980 | 1.0 | 0.9999 |
+ | 0.0043 | 0.6536 | 400 | 0.0009 | 0.9959 | 0.9965 | 0.9930 | 1.0 | 1.0000 |
+ | 0.0001 | 0.8170 | 500 | 0.0006 | 0.9988 | 0.9990 | 0.9980 | 1.0 | 1.0000 |
+ | 0.0006 | 0.9804 | 600 | 0.0010 | 0.9977 | 0.9980 | 0.9980 | 0.9980 | 0.9999 |

+ ### Framework versions

+ - Transformers 4.57.3
+ - Pytorch 2.9.1+cu128
+ - Datasets 4.4.1
+ - Tokenizers 0.22.1
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a024037033ebcb1cc2672d604b7ad65c9bb5cf40f6b885f9aaa31033514c7b18
+ oid sha256:e48724b649fbcc9f55d6b989782528171959b58f41052e547603338a6b75baa5
  size 598439784
test_results.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "eval_loss": 0.003097335109487176,
+   "eval_accuracy": 0.995373048004627,
+   "eval_f1": 0.9959432048681541,
+   "eval_precision": 0.9919191919191919,
+   "eval_recall": 1.0,
+   "eval_roc_auc": 0.998592468993421,
+   "eval_runtime": 10.1758,
+   "eval_samples_per_second": 169.913,
+   "eval_steps_per_second": 5.405,
+   "epoch": 0.9803921568627451
+ }