---
license: mit
language:
- en
base_model: distilbert/distilbert-base-uncased
library_name: transformers
tags:
- distilbert
- bert
- text-classification
- commission-detection
- social-media
pipeline_tag: text-classification
datasets:
- custom
model-index:
- name: distilbert-commissions
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Scraped Social Media Profiles (Bluesky & Twitter)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9506
      verified: false
    - name: Precision
      type: precision
      value: 0.9513
      verified: false
    - name: Recall
      type: recall
      value: 0.9506
      verified: false
    - name: F1 Score
      type: f1
      value: 0.9508
      verified: false
---

# DistilBERT Commission Detection Model

## Model Description

This is a fine-tuned DistilBERT model for detecting commission-related content in social media profiles and posts. Given an artist's profile name, bio, or post text, the model classifies it into one of three labels: open for commissions, closed, or unclear.

## Model Details

### Model Architecture

- **Base Model**: [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
- **Model Type**: Text Classification
- **Language**: English
- **License**: MIT

### Training Data

- **Sources**: Profile names, bios, and posts manually scraped from Bluesky and Twitter; classifications were crowdsourced from furry volunteers through a custom browser extension built specifically to assemble this dataset
- **Dataset**: Custom dataset of ~1,000 labeled rows of user classifications, plus an equal amount of artificial data to boost pattern recognition

## Performance

| Metric | Value |
|--------|-------|
| Accuracy | 95.06% |
| Precision | 95.13% |
| Recall | 95.06% |
| F1 Score | 95.08% |

*Note: These metrics are not independently verified.*
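
Per the Training Details below, evaluation used the Trainer together with `sklearn.metrics`. The evaluation code itself is not published; the following is a minimal sketch of a typical `compute_metrics` hook that produces these four numbers, assuming weighted averaging across the three classes:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Trainer hook: derive accuracy/precision/recall/F1 from eval logits."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```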

## Usage

I recommend a high temperature at inference time to soften the model's confidence; I use values between 1.5 and 3.0. Dividing the logits by a temperature above 1 flattens the softmax distribution, which tempers overconfident predictions on ambiguous text.

```python
# Example inference

from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

# Load model and tokenizer
model_name = 'zohfur/distilbert-commissions'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Example sentences
example_sentences = [
    "Commissions are currently closed.",
    "Check my bio for commission status.",
    "C*mms 0pen on p-site",
    "DM for comms",
    "Taking art requests, dm me",
    "comm completed for personmcperson, thank you <3",
    "open for trades",
    "Comms are not open",
    "Comms form will be open soon, please check back later",
    "~ Furry artist - 25 y.o - he/him - c*mms 0pen: 2/5 - bots dni ~"
]

# Map label integers back to strings
label_map = {0: 'open', 1: 'closed', 2: 'unclear'}

def predict_with_temperature(model, tokenizer, sentences, temperature=1.5):
    # Tokenize the batch and move everything to the same device as the model
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    encoded_input = {key: value.to(device) for key, value in encoded_input.items()}
    model.to(device)
    model.eval()

    # Make predictions with temperature scaling
    with torch.no_grad():
        outputs = model(**encoded_input)
        logits = outputs['logits'] / temperature  # Temperatures > 1 soften the softmax
        probabilities = torch.softmax(logits, dim=1)

    # Extract predicted classes and their confidence scores
    predicted_class_indices = torch.argmax(probabilities, dim=1)
    confidences = torch.max(probabilities, dim=1).values

    # Convert tensors to Python values and prepare results
    predictions = {
        'sentences': sentences,
        'labels': [label_map[idx.item()] for idx in predicted_class_indices],
        'confidences': [score.item() for score in confidences]
    }

    return predictions

def print_predictions(predictions):
    """Print formatted predictions with confidence scores."""
    print("\nClassification Results:")
    print("=" * 50)
    for i, (sentence, label, confidence) in enumerate(zip(
        predictions['sentences'],
        predictions['labels'],
        predictions['confidences']
    ), 1):
        print(f"\n{i}. Sentence: '{sentence}'")
        print(f"   Predicted Label: {label}")
        print(f"   Confidence Score: {confidence:.4f}")

# Make predictions with temperature scaling
predictions = predict_with_temperature(model, tokenizer, example_sentences, temperature=1.5)

# Print results
print_predictions(predictions)
```
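
For a quick smoke test, the `pipeline` API also works, though it applies a plain softmax and skips the temperature scaling recommended above. If the hosted config does not name the labels, outputs may appear as generic `LABEL_0`/`LABEL_1`/`LABEL_2`, which map to open/closed/unclear as in the example above:

```python
from transformers import pipeline

# Quick check only; no temperature scaling is applied here.
classifier = pipeline("text-classification", model="zohfur/distilbert-commissions")
print(classifier("Comms open! DM me for prices"))
```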

## Limitations and Biases

### Limitations

- **Language**: Trained on English text only
- **False Positives**: Requires a high inference temperature to avoid false positives (particularly around the words "open" and "closed")
- **Platform Bias**: Trained on Bluesky and Twitter/X data; may not perform as well on other platforms such as FurAffinity or Instagram

## Training Details

### Training Procedure

- **Base Model**: DistilBERT base uncased
- **Fine-tuning**: Fine-tuned with Hugging Face's Trainer; evaluated with Trainer and sklearn.metrics
- **Optimization**: Weights & Biases hyperparameter sweep using the Bayesian (`bayes`) search method to maximize the F1 score, as sketched below
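
The sweep configuration itself is not published; this is a minimal sketch of a W&B Bayesian sweep targeting F1, where the metric name (`eval/f1`, as the Trainer logs it to W&B) and the search space are illustrative assumptions:

```python
import wandb

# Illustrative Bayesian sweep config; the actual search space is not published.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/f1", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 5e-5},
        "num_train_epochs": {"values": [2, 3, 4]},
        "per_device_train_batch_size": {"values": [8, 16, 32]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="distilbert-commissions")
# wandb.agent(sweep_id, function=train_fn)  # train_fn would build and run a Trainer
```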

### Data Preprocessing

- Classifications uploaded voluntarily by users of the crowdsourcing extension
- Problematic Unicode characters cleaned from the dataset
- Labels integer-encoded for classification
- Class weights computed inversely proportional to class frequencies, as sketched below
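
The class-weighting step matches what scikit-learn's "balanced" heuristic produces. A hedged sketch of the usual recipe (the training script is not published, so the weighted-loss wiring is an assumption, and the labels shown are placeholders):

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Integer-encoded training labels (0 = open, 1 = closed, 2 = unclear); placeholder data
train_labels = np.array([0, 1, 2, 2, 2, 0, 1])

# "balanced" weights are inversely proportional to each class's frequency
weights = compute_class_weight(class_weight="balanced", classes=np.unique(train_labels), y=train_labels)
class_weights = torch.tensor(weights, dtype=torch.float)

# A weighted loss up-weights under-represented classes during fine-tuning,
# e.g. inside a custom Trainer.compute_loss override
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
```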

## Model Card Authors

All credit to the original author, Zohfur. The base model is attributed to distilbert.

## Model Card Contact

For questions or concerns about this model, please contact: [email protected]