# Email Classifier
A fine-tuned DistilBERT model for binary classification of emails as productive or unproductive. This model is designed to automatically categorize emails to help prioritize important communications.
## Model Details

### Model Description
- Model Type: Text Classification (Binary)
- Base Model: distilbert-base-uncased
- Task: Email productivity classification
- Language: Portuguese and English (multilingual)
- Labels:
  - 0: Unproductive (emails that don't require action)
  - 1: Productive (emails that require action or response)
### Model Architecture
- Architecture: DistilBERT (Distilled BERT)
- Max Sequence Length: 512 tokens
- Number of Labels: 2
- Output: Binary classification with confidence scores
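These values can be verified without downloading the weights; a minimal check of the published configuration (the repository name is taken from this card):

```python
from transformers import AutoConfig

# Load only the configuration, not the model weights.
config = AutoConfig.from_pretrained("MiguelJeronimoOliveira/email-classifier")
print(config.num_labels)               # expected: 2
print(config.max_position_embeddings)  # expected: 512
```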
## Intended Use

### Primary Use Cases
- Email Prioritization: Automatically identify emails that require immediate attention (see the sketch after this list)
- Productivity Tools: Integrate into email management systems to filter and organize messages
- Auto-Reply Systems: Determine which emails should trigger automated responses
- Email Analytics: Analyze email patterns and productivity metrics
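As a concrete sketch of the prioritization use case, the snippet below ranks a hypothetical inbox by the probability assigned to the productive class; the sample messages and ranking logic are illustrative, not part of the model:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MiguelJeronimoOliveira/email-classifier"
)

# Hypothetical inbox; in practice, messages would come from a mail API.
inbox = [
    "Happy holidays to the whole team!",
    "The production server is down, please advise.",
    "Could you confirm the budget numbers before Friday?",
]

def productive_score(res):
    # Probability of the productive class (LABEL_1).
    return res["score"] if res["label"] == "LABEL_1" else 1.0 - res["score"]

results = classifier(inbox)
for text, res in sorted(zip(inbox, results),
                        key=lambda pair: productive_score(pair[1]),
                        reverse=True):
    print(f"{productive_score(res):.2f}  {text}")
```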
### Out-of-Scope Use Cases
- Spam detection (this model focuses on productivity, not spam)
- Sentiment analysis (positive/negative emotions)
- Topic classification (specific email topics)
- Language detection (assumes input language is known)
## Training Details

### Training Data
The model was trained on a synthetic dataset of ~6,000 emails (balanced between productive and unproductive) generated using templates that simulate real-world email scenarios. The training data includes:
- Productive Emails: Technical support requests, meeting requests, information requests, urgent problems, project discussions, etc.
- Unproductive Emails: Thank you messages, congratulations, holiday greetings, status updates without action required, confirmations, etc.
### Training Procedure
- Training Framework: Hugging Face Transformers
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 8
- Epochs: 5 (with early stopping)
- Early Stopping Patience: 3 epochs
- Evaluation Metric: F1 score
- Train/Test Split: 80/20
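The training script itself is not published; the sketch below reconstructs the stated setup with the Hugging Face `Trainer`, whose default optimizer is AdamW. `train_ds` and `eval_ds` are placeholders for the tokenized 80/20 split, which is not released:

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"f1": f1_score(labels, np.argmax(logits, axis=-1))}

args = TrainingArguments(
    output_dir="email-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    eval_strategy="epoch",        # evaluation_strategy on older releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,   # placeholder: tokenized 80% split
    eval_dataset=eval_ds,     # placeholder: tokenized 20% split
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```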
### Training Features
- Data Augmentation: Template-based generation with variations
- Anti-Overfitting Techniques (toy versions are sketched after this list):
  - Context shuffling (gratitude before/after requests)
  - Negation injection
  - Order inversion
  - Noise injection
- Multilingual Support: Portuguese and English emails in training data
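The augmentation code is not released either; the toy sketch below improvises three of these techniques on a list of sentences (negation injection is omitted here, since it must also flip the label):

```python
import random

def augment(sentences, seed=0):
    """Toy illustration of the augmentation ideas above; the real
    training pipeline is not released."""
    rng = random.Random(seed)
    out = list(sentences)

    # Context shuffling: place a gratitude phrase before or after the request.
    gratitude = "Thank you in advance."
    out = [gratitude] + out if rng.random() < 0.5 else out + [gratitude]

    # Order inversion: swap two sentences so position is not a label cue.
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]

    # Noise injection: delete one character to simulate a typo.
    k = rng.randrange(len(out))
    if len(out[k]) > 1:
        p = rng.randrange(len(out[k]))
        out[k] = out[k][:p] + out[k][p + 1:]

    return " ".join(out)

print(augment(["The build is failing on main.", "Can you take a look?"]))
```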
## Evaluation

### Metrics
The model was evaluated on a held-out test set with the following metrics:
- Accuracy: ~0.95 or higher
- F1 Score: ~0.95 or higher
- Precision: ~0.95 or higher
- Recall: ~0.95 or higher

Note: Exact metrics vary between training runs; treat these figures as approximate and re-evaluate on your own data (a sketch follows).
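A minimal way to recompute these metrics with scikit-learn, where `y_true` and `y_pred` are gold and predicted labels (0/1) from your own held-out emails:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary")
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f1": f1}

# Toy example; substitute real labels from your held-out set.
print(report([1, 0, 1, 1], [1, 0, 0, 1]))
```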
## How to Use

### Installation

```bash
pip install transformers torch
```
### Basic Usage

#### Using Pipeline
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MiguelJeronimoOliveira/email-classifier"
)

# Classify an email
result = classifier("Hi, I need urgent technical support. The system is down.")
print(result)
# [{'label': 'LABEL_1', 'score': 0.98}]

result = classifier("Thank you for the excellent work!")
print(result)
# [{'label': 'LABEL_0', 'score': 0.95}]
```
#### Using Model Directly
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "MiguelJeronimoOliveira/email-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare input
email_text = "Hi, I would like to schedule a meeting to discuss the project timeline."
inputs = tokenizer(
    email_text,
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = predictions.argmax(dim=-1).item()
    confidence = predictions[0][predicted_class].item()

# Interpret result
label = "productive" if predicted_class == 1 else "unproductive"
print(f"Classification: {label} (confidence: {confidence:.2f})")
```
### Label Mapping

- `LABEL_0` or `0`: Unproductive
- `LABEL_1` or `1`: Productive
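Since the pipeline reports raw `LABEL_*` names, a small mapping makes the output readable; the mapping below mirrors the table above (alternatively, check `model.config.id2label` in case the published config defines friendlier names):

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MiguelJeronimoOliveira/email-classifier"
)

# Mirror the label table above; fall back to the raw name if unmapped.
READABLE = {"LABEL_0": "unproductive", "LABEL_1": "productive"}

result = classifier("Can you send me the Q3 report by Friday?")[0]
print(f"{READABLE.get(result['label'], result['label'])} ({result['score']:.2f})")
```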
## Limitations and Bias

### Known Limitations
- Language Coverage: While trained on Portuguese and English, performance may vary for other languages
- Domain Specificity: Model is optimized for business/professional emails; may not perform well on personal emails
- Context Dependency: Classification is based on email content only; doesn't consider sender, subject line, or metadata
- Synthetic Training Data: Model was trained on synthetic data, which may not capture all real-world email patterns
### Potential Biases
- The model may have biases based on the training data distribution
- Cultural and linguistic nuances may affect classification accuracy
- Technical terminology may be over-represented in productive emails
### Recommendations
- Fine-tune on your specific email domain for best results
- Consider combining with other signals (sender, subject, metadata)
- Regularly evaluate and retrain with new data
- Use confidence thresholds to filter uncertain predictions (see the sketch after this list)
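For the last recommendation, a minimal abstention wrapper; the 0.85 threshold is an assumption to be tuned on your own validation data:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MiguelJeronimoOliveira/email-classifier"
)

THRESHOLD = 0.85  # assumed value; tune on your own validation data

def classify_or_abstain(text):
    res = classifier(text)[0]
    if res["score"] < THRESHOLD:
        return "uncertain"  # route to a human or a fallback rule
    return "productive" if res["label"] == "LABEL_1" else "unproductive"

print(classify_or_abstain("FYI, the office closes early tomorrow."))
```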
## Ethical Considerations

### Privacy
- This model processes email content; ensure compliance with privacy regulations (GDPR, etc.)
- Consider data anonymization before processing (a minimal redaction sketch follows this list)
- Be transparent about automated email classification to users
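As one way to approach the anonymization point above, a minimal redaction pass before classification; the regexes are illustrative, not a complete anonymizer:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_NUMBER_RE = re.compile(r"\d{6,}")

def redact(text):
    """Mask obvious identifiers before sending text to the classifier."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return LONG_NUMBER_RE.sub("[NUMBER]", text)

print(redact("Contact joao.silva@example.com about invoice 20240117."))
# Contact [EMAIL] about invoice [NUMBER].
```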
### Fairness
- Monitor for potential biases in classification across different email types
- Ensure the model doesn't systematically misclassify emails from certain groups or domains
- Provide mechanisms for users to correct misclassifications
## Citation
If you use this model in your research or application, please cite:
```bibtex
@misc{email-classifier-2024,
  title={Email Classifier: A Fine-tuned DistilBERT for Productivity Classification},
  author={Miguel Jeronimo Oliveira},
  year={2024},
  howpublished={\url{https://huggingface.co/MiguelJeronimoOliveira/email-classifier}}
}
```
## Model Card Contact
For questions, issues, or contributions, please contact:
- Model Author: Miguel Jeronimo Oliveira
- Repository: AutoU Case Project
## License
This model is licensed under the Apache 2.0 License. See the LICENSE file for more details.
## Acknowledgments
- Built on top of DistilBERT by Hugging Face
- Training infrastructure supported by Hugging Face Transformers
- Part of the AutoU Case email management system
Model Version: 1.0.0
Last Updated: 2024
Base Model: distilbert-base-uncased
Framework: PyTorch