---
license: apache-2.0
datasets:
- synapti/nci-propaganda-production
base_model: answerdotai/ModernBERT-base
tags:
- transformers
- modernbert
- text-classification
- propaganda-detection
- multi-label-classification
- nci-protocol
- semeval-2020
- onnx
library_name: transformers
pipeline_tag: text-classification
---

# NCI Technique Classifier

Multi-label classifier that identifies specific propaganda techniques in text.

## Model Description

This model is **Stage 2** of the NCI (Narrative Credibility Index) two-stage propaganda detection pipeline:

- **Stage 1**: Fast binary detection - "Does this text contain propaganda?"
- **Stage 2 (this model)**: Multi-label technique classification - "Which specific techniques are used?"

The classifier identifies **18 propaganda techniques** from the SemEval-2020 Task 11 taxonomy.

## Propaganda Techniques

| # | Technique | F1 Score | Optimal Threshold |
|---|-----------|----------|-------------------|
| 0 | Loaded_Language | 95.3% | 0.3 |
| 1 | Appeal_to_fear-prejudice | 85.1% | 0.3 |
| 2 | Exaggeration,Minimisation | 49.0% | 0.4 |
| 3 | Repetition | 55.9% | 0.4 |
| 4 | Flag-Waving | 50.9% | 0.4 |
| 5 | Name_Calling,Labeling | 79.0% | 0.1 |
| 6 | Reductio_ad_hitlerum | 82.4% | 0.3 |
| 7 | Black-and-White_Fallacy | 68.8% | 0.5 |
| 8 | Causal_Oversimplification | 67.9% | 0.4 |
| 9 | Whataboutism,Straw_Men,Red_Herring | 47.7% | 0.3 |
| 10 | Straw_Man | 60.3% | 0.5 |
| 11 | Red_Herring | 86.3% | 0.5 |
| 12 | Doubt | 63.4% | 0.3 |
| 13 | Appeal_to_Authority | 50.0% | 0.3 |
| 14 | Thought-terminating_Cliches | 71.2% | 0.5 |
| 15 | Bandwagon | 46.7% | 0.5 |
| 16 | Slogans | 46.0% | 0.3 |
| 17 | Obfuscation,Intentional_Vagueness,Confusion | 86.3% | 0.5 |

## Performance

**Test Set Results (1,729 samples):**

| Metric | Default (0.5) | Optimized Thresholds |
|--------|---------------|----------------------|
| Micro F1 | 72.7% | **80.3%** |
| Macro F1 | 62.5% | **68.3%** |
| ECE (Expected Calibration Error) | - | **0.0096** |
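The gap between the two columns comes entirely from replacing the flat 0.5 cutoff with a per-technique decision threshold. The exact sweep used for this model is not documented here, but the idea is a simple per-label grid search on a validation split. A minimal sketch, assuming hypothetical `val_probs` and `val_labels` arrays from your own data:

```python
import numpy as np
from sklearn.metrics import f1_score

def sweep_thresholds(val_probs, val_labels, grid=np.arange(0.05, 0.95, 0.05)):
    """For each label, pick the threshold that maximizes F1 on validation data.

    val_probs:  (n_samples, n_labels) predicted probabilities
    val_labels: (n_samples, n_labels) multi-hot ground truth
    """
    best = {}
    for j in range(val_probs.shape[1]):
        f1s = [
            f1_score(val_labels[:, j], (val_probs[:, j] >= t).astype(int), zero_division=0)
            for t in grid
        ]
        best[j] = float(grid[int(np.argmax(f1s))])
    return best
```

For rare techniques a per-label sweep can overfit a small validation set, so prefer the shipped thresholds in `calibration_config.json` unless you have enough labeled data of your own.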
## Usage

### Basic Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="synapti/nci-technique-classifier",
    top_k=None  # Return all labels
)

text = "The radical left is DESTROYING our country!"
results = classifier(text)[0]

# Get detected techniques (using the default 0.5 threshold)
detected = [r for r in results if r["score"] > 0.5]
for d in detected:
    print(f"{d['label']}: {d['score']:.2%}")
```

### With Calibration Config (Recommended)

The model ships with a `calibration_config.json` file containing optimized per-technique thresholds and a temperature-scaling factor for better-calibrated confidence scores.

```python
import json
from transformers import pipeline
from huggingface_hub import hf_hub_download

# Load calibration config
config_path = hf_hub_download(
    repo_id="synapti/nci-technique-classifier",
    filename="calibration_config.json"
)
with open(config_path) as f:
    config = json.load(f)

temperature = config["temperature"]  # 0.75
thresholds = config["thresholds"]
labels = config["technique_labels"]

classifier = pipeline(
    "text-classification",
    model="synapti/nci-technique-classifier",
    top_k=None
)

text = "Your text here..."
results = classifier(text)[0]

# Apply per-technique thresholds (pipeline labels look like "LABEL_0")
detected = []
for r in results:
    idx = int(r["label"].split("_")[1])
    technique = labels[idx]
    threshold = thresholds.get(technique, 0.5)
    if r["score"] > threshold:
        detected.append((technique, r["score"]))
```

### ONNX Inference (Faster)

The model is also available in ONNX format for optimized inference:

```python
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import numpy as np

# Download ONNX model
onnx_path = hf_hub_download(
    repo_id="synapti/nci-technique-classifier",
    filename="onnx/model.onnx"
)

# Load tokenizer and ONNX session
tokenizer = AutoTokenizer.from_pretrained("synapti/nci-technique-classifier")
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

# Inference
text = "Your text here..."
inputs = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="np")
onnx_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
}
logits = session.run(None, onnx_inputs)[0]
probs = 1 / (1 + np.exp(-logits))  # Sigmoid for multi-label
```

### Two-Stage Pipeline

For best results, pair this model with the Stage 1 binary detector:

```python
from transformers import pipeline

# Stage 1: Binary detection (fast filter)
detector = pipeline("text-classification", model="synapti/nci-binary-detector")

# Stage 2: Technique classification
classifier = pipeline("text-classification", model="synapti/nci-technique-classifier", top_k=None)

text = "Your text to analyze..."

# Quick check first
detection = detector(text)[0]
if detection["label"] == "has_propaganda" and detection["score"] > 0.5:
    # Detailed technique analysis
    techniques = classifier(text)[0]
    detected = [t for t in techniques if t["score"] > 0.3]
    for t in detected:
        print(f"{t['label']}: {t['score']:.2%}")
else:
    print("No propaganda detected")
```

## Calibration Config

The `calibration_config.json` file contains:

```json
{
  "temperature": 0.75,
  "thresholds": {
    "Loaded_Language": 0.3,
    "Appeal_to_fear-prejudice": 0.3,
    "Name_Calling,Labeling": 0.1,
    ...
  },
  "metrics": {
    "ece": 0.0096,
    "micro_f1_optimized": 0.803,
    "macro_f1_optimized": 0.683
  }
}
```
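Note that the `pipeline` examples above return plain sigmoid scores and never apply the temperature. To obtain temperature-scaled probabilities, divide the raw logits by `temperature` before the sigmoid. A minimal sketch under that assumption (the config documents the value, not the exact recipe):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "synapti/nci-technique-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

inputs = tokenizer("Your text here...", truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 18)

temperature = 0.75  # from calibration_config.json
probs = torch.sigmoid(logits / temperature)[0]  # temperature-scaled per-technique probabilities
```

Because T = 0.75 is below 1, the scaling sharpens rather than softens the sigmoid, pushing scores away from 0.5; the same division works on the ONNX logits above. The config does not state whether the shipped thresholds expect raw or temperature-scaled scores, so apply thresholds to the same kind of score consistently.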
## Training Data

Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/synapti/nci-propaganda-production):

- **23,000+ examples** with multi-hot technique labels
- **Augmented data** for minority techniques (MLSMOTE)
- **Hard negatives** from LIAR2 and Qbias datasets
- **Class-weighted Focal Loss** to handle imbalance

## Model Architecture

- **Base Model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
- **Parameters**: 149.6M
- **Max Sequence Length**: 512 tokens
- **Output**: 18 labels (multi-label sigmoid)
- **Calibration Temperature**: 0.75

## Available Files

| File | Description |
|------|-------------|
| `model.safetensors` | PyTorch model weights |
| `calibration_config.json` | Optimized thresholds & temperature |
| `onnx/model.onnx` | ONNX model for fast inference |
| `config.json` | Model configuration |

## Training Details

- **Loss Function**: Class-weighted Focal Loss (gamma=2.0)
- **Class Weights**: Inverse frequency weighting
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 8 (effective 32 with gradient accumulation)
- **Epochs**: 5 with early stopping (patience=3)
- **Hardware**: NVIDIA A10G GPU

## Limitations

- Trained primarily on English text
- Performance varies by technique (see table above)
- Some techniques overlap semantically
- Should be used with the binary detector for best results
- Threshold optimization recommended for specific use cases

## Related Models

- [synapti/nci-binary-detector](https://huggingface.co/synapti/nci-binary-detector) - Stage 1 binary detector

## Citation

```bibtex
@inproceedings{da-san-martino-etal-2020-semeval,
    title = "{S}em{E}val-2020 Task 11: Detection of Propaganda Techniques in News Articles",
    author = "Da San Martino, Giovanni and others",
    booktitle = "Proceedings of SemEval-2020",
    year = "2020",
}

@misc{nci-technique-classifier,
    author = {NCI Protocol Team},
    title = {NCI Technique Classifier},
    year = {2024},
    publisher = {HuggingFace},
    url = {https://huggingface.co/synapti/nci-technique-classifier}
}
```

## License

Apache 2.0