# Email Classifier
A fine-tuned DistilBERT model for binary classification of emails as productive or unproductive. This model is designed to automatically categorize emails to help prioritize important communications.
## Model Details

### Model Description
- Model Type: Text Classification (Binary)
- Base Model: distilbert-base-uncased
- Task: Email productivity classification
- Language: Portuguese and English (multilingual)
- Labels:
  - 0: Unproductive (emails that don't require action)
  - 1: Productive (emails that require action or response)
### Model Architecture
- Architecture: DistilBERT (Distilled BERT)
- Max Sequence Length: 512 tokens
- Number of Labels: 2
- Output: Binary classification with confidence scores
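These values can be verified without downloading the weights; a minimal check of the published configuration (the repository name is taken from this card):

```python
from transformers import AutoConfig

# Load only the configuration, not the model weights.
config = AutoConfig.from_pretrained("MiguelJeronimoOliveira/email-classifier")
print(config.num_labels)               # expected: 2
print(config.max_position_embeddings)  # expected: 512
```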
## Intended Use

### Primary Use Cases
- Email Prioritization: Automatically identify emails that require immediate attention (see the sketch after this list)
- Productivity Tools: Integrate into email management systems to filter and organize messages
- Auto-Reply Systems: Determine which emails should trigger automated responses
- Email Analytics: Analyze email patterns and productivity metrics
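As a concrete sketch of the prioritization use case, the snippet below ranks a hypothetical inbox by the probability assigned to the productive class; the sample messages and ranking logic are illustrative, not part of the model:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MiguelJeronimoOliveira/email-classifier"
)

# Hypothetical inbox; in practice, messages would come from a mail API.
inbox = [
    "Happy holidays to the whole team!",
    "The production server is down, please advise.",
    "Could you confirm the budget numbers before Friday?",
]

def productive_score(res):
    # Probability of the productive class (LABEL_1).
    return res["score"] if res["label"] == "LABEL_1" else 1.0 - res["score"]

results = classifier(inbox)
for text, res in sorted(zip(inbox, results),
                        key=lambda pair: productive_score(pair[1]),
                        reverse=True):
    print(f"{productive_score(res):.2f}  {text}")
```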
### Out-of-Scope Use Cases
- Spam detection (this model focuses on productivity, not spam)
- Sentiment analysis (positive/negative emotions)
- Topic classification (specific email topics)
- Language detection (assumes input language is known)
## Training Details

### Training Data
The model was trained on a synthetic dataset of ~6,000 emails (balanced between productive and unproductive) generated using templates that simulate real-world email scenarios. The training data includes:
- Productive Emails: Technical support requests, meeting requests, information requests, urgent problems, project discussions, etc.
- Unproductive Emails: Thank you messages, congratulations, holiday greetings, status updates without action required, confirmations, etc.
### Training Procedure
- Training Framework: Hugging Face Transformers
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 8
- Epochs: 5 (with early stopping)
- Early Stopping Patience: 3 epochs
- Evaluation Metric: F1 score
- Train/Test Split: 80/20
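The training script itself is not published; the sketch below reconstructs the stated setup with the Hugging Face `Trainer`, whose default optimizer is AdamW. `train_ds` and `eval_ds` are placeholders for the tokenized 80/20 split, which is not released:

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"f1": f1_score(labels, np.argmax(logits, axis=-1))}

args = TrainingArguments(
    output_dir="email-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    eval_strategy="epoch",        # evaluation_strategy on older releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,   # placeholder: tokenized 80% split
    eval_dataset=eval_ds,     # placeholder: tokenized 20% split
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```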
### Training Features
- Data Augmentation: Template-based generation with variations
- Anti-Overfitting Techniques (toy versions are sketched after this list):
  - Context shuffling (gratitude before/after requests)
  - Negation injection
  - Order inversion
  - Noise injection
- Multilingual Support: Portuguese and English emails in training data
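The augmentation code is not released either; the toy sketch below improvises three of these techniques on a list of sentences (negation injection is omitted here, since it must also flip the label):

```python
import random

def augment(sentences, seed=0):
    """Toy illustration of the augmentation ideas above; the real
    training pipeline is not released."""
    rng = random.Random(seed)
    out = list(sentences)

    # Context shuffling: place a gratitude phrase before or after the request.
    gratitude = "Thank you in advance."
    out = [gratitude] + out if rng.random() < 0.5 else out + [gratitude]

    # Order inversion: swap two sentences so position is not a label cue.
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]

    # Noise injection: delete one character to simulate a typo.
    k = rng.randrange(len(out))
    if len(out[k]) > 1:
        p = rng.randrange(len(out[k]))
        out[k] = out[k][:p] + out[k][p + 1:]

    return " ".join(out)

print(augment(["The build is failing on main.", "Can you take a look?"]))
```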
## Evaluation

### Metrics
The model was evaluated on a held-out test set with the following metrics:
- Accuracy: ~0.95 or higher
- F1 Score: ~0.95 or higher
- Precision: ~0.95 or higher
- Recall: ~0.95 or higher

Note: Exact metrics vary between training runs; treat these figures as approximate and re-evaluate on your own data (a sketch follows).
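A minimal way to recompute these metrics with scikit-learn, where `y_true` and `y_pred` are gold and predicted labels (0/1) from your own held-out emails:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary")
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f1": f1}

# Toy example; substitute real labels from your held-out set.
print(report([1, 0, 1, 1], [1, 0, 0, 1]))
```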
## How to Use

### Installation

```bash
pip install transformers torch
```
### Basic Usage

#### Using Pipeline
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MiguelJeronimoOliveira/email-classifier"
)

# Classify an email
result = classifier("Hi, I need urgent technical support. The system is down.")
print(result)
# [{'label': 'LABEL_1', 'score': 0.98}]

result = classifier("Thank you for the excellent work!")
print(result)
# [{'label': 'LABEL_0', 'score': 0.95}]
```
#### Using Model Directly
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "MiguelJeronimoOliveira/email-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare input
email_text = "Hi, I would like to schedule a meeting to discuss the project timeline."
inputs = tokenizer(
    email_text,
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = predictions.argmax(dim=-1).item()
    confidence = predictions[0][predicted_class].item()

# Interpret result
label = "productive" if predicted_class == 1 else "unproductive"
print(f"Classification: {label} (confidence: {confidence:.2f})")
```
### Label Mapping

- `LABEL_0` or `0`: Unproductive
- `LABEL_1` or `1`: Productive
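Since the pipeline reports raw `LABEL_*` names, a small mapping makes the output readable; the mapping below mirrors the table above (alternatively, check `model.config.id2label` in case the published config defines friendlier names):

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MiguelJeronimoOliveira/email-classifier"
)

# Mirror the label table above; fall back to the raw name if unmapped.
READABLE = {"LABEL_0": "unproductive", "LABEL_1": "productive"}

result = classifier("Can you send me the Q3 report by Friday?")[0]
print(f"{READABLE.get(result['label'], result['label'])} ({result['score']:.2f})")
```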
## Limitations and Bias

### Known Limitations
- Language Coverage: While trained on Portuguese and English, performance may vary for other languages
- Domain Specificity: Model is optimized for business/professional emails; may not perform well on personal emails
- Context Dependency: Classification is based on email content only; doesn't consider sender, subject line, or metadata
- Synthetic Training Data: Model was trained on synthetic data, which may not capture all real-world email patterns
### Potential Biases
- The model may have biases based on the training data distribution
- Cultural and linguistic nuances may affect classification accuracy
- Technical terminology may be over-represented in productive emails
### Recommendations
- Fine-tune on your specific email domain for best results
- Consider combining with other signals (sender, subject, metadata)
- Regularly evaluate and retrain with new data
- Use confidence thresholds to filter uncertain predictions (see the sketch after this list)
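For the last recommendation, a minimal abstention wrapper; the 0.85 threshold is an assumption to be tuned on your own validation data:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MiguelJeronimoOliveira/email-classifier"
)

THRESHOLD = 0.85  # assumed value; tune on your own validation data

def classify_or_abstain(text):
    res = classifier(text)[0]
    if res["score"] < THRESHOLD:
        return "uncertain"  # route to a human or a fallback rule
    return "productive" if res["label"] == "LABEL_1" else "unproductive"

print(classify_or_abstain("FYI, the office closes early tomorrow."))
```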
## Ethical Considerations

### Privacy
- This model processes email content; ensure compliance with privacy regulations (GDPR, etc.)
- Consider data anonymization before processing (a minimal redaction sketch follows this list)
- Be transparent about automated email classification to users
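As one way to approach the anonymization point above, a minimal redaction pass before classification; the regexes are illustrative, not a complete anonymizer:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_NUMBER_RE = re.compile(r"\d{6,}")

def redact(text):
    """Mask obvious identifiers before sending text to the classifier."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return LONG_NUMBER_RE.sub("[NUMBER]", text)

print(redact("Contact joao.silva@example.com about invoice 20240117."))
# Contact [EMAIL] about invoice [NUMBER].
```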
### Fairness
- Monitor for potential biases in classification across different email types
- Ensure the model doesn't systematically misclassify emails from certain groups or domains
- Provide mechanisms for users to correct misclassifications
## Citation
If you use this model in your research or application, please cite:
```bibtex
@misc{email-classifier-2024,
  title={Email Classifier: A Fine-tuned DistilBERT for Productivity Classification},
  author={Miguel Jeronimo Oliveira},
  year={2024},
  howpublished={\url{https://huggingface.co/MiguelJeronimoOliveira/email-classifier}}
}
```
## Model Card Contact
For questions, issues, or contributions, please contact:
- Model Author: Miguel Jeronimo Oliveira
- Repository: AutoU Case Project
## License
This model is licensed under the Apache 2.0 License. See the LICENSE file for more details.
## Acknowledgments
- Built on top of DistilBERT by Hugging Face
- Training infrastructure supported by Hugging Face Transformers
- Part of the AutoU Case email management system
Model Version: 1.0.0
Last Updated: 2024
Base Model: distilbert-base-uncased
Framework: PyTorch