🎯 MultiSent-E5-Pro: Advanced Thai Sentiment Classifier
📋 Quick Overview
MultiSent-E5-Pro is a sentiment analysis model fine-tuned from intfloat/multilingual-e5-large and optimized for Thai, with support for multilingual contexts. The model classifies text into four categories: Positive, Negative, Neutral, and Question.
🎯 Key Features
- Handles Thai-specific expressions, colloquialisms, and sarcasm effectively
- Performs well on real-world social media & review data
- Multilingual support for Southeast and East Asian languages
🏆 Benchmark Summary
| Rank | Model | Accuracy | F1-Macro | Notes |
|---|---|---|---|---|
| 🥇 1st | MultiSent-E5-Pro | 84.61% | 84.61% | Best overall |
| 2nd | MultiSent-E5 | 80.62% | 80.62% | Baseline model |
| 3rd | sentiment-103 | 57.40% | 49.87% | Moderate baseline |
📊 Detailed Metrics (2,183 samples)
| Metric | Score |
|---|---|
| Accuracy | 84.61% |
| F1-Macro | 84.61% |
| F1-Weighted | 84.75% |
| Avg Confidence | 98.53% |
| Low Confidence Rate (<60%) | 0.96% |
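The card does not say how confidence is computed. The sketch below assumes it is the top-class softmax probability of each prediction, and shows how the two confidence figures could be derived from a list of such values. The function name, the threshold constant, and the example values are all hypothetical.

```python
from typing import Sequence

# The "<60%" cutoff named in the metrics table.
LOW_CONFIDENCE_THRESHOLD = 0.60


def confidence_summary(confidences: Sequence[float],
                       threshold: float = LOW_CONFIDENCE_THRESHOLD) -> dict:
    """Return the mean confidence and the share of predictions below threshold."""
    if not confidences:
        raise ValueError("confidences must not be empty")
    low = sum(1 for c in confidences if c < threshold)
    return {
        "avg_confidence": sum(confidences) / len(confidences),
        "low_confidence_rate": low / len(confidences),
    }


# Hypothetical top-class probabilities, for illustration only.
example = [0.99, 0.97, 0.55, 0.98]
summary = confidence_summary(example)
print(f"Avg confidence: {summary['avg_confidence']:.2%}")
print(f"Low confidence rate: {summary['low_confidence_rate']:.2%}")
```

In practice, predictions that fall below the threshold are reasonable candidates for manual review rather than automatic labeling.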
Per-Class Performance
| Class | Precision | Recall | F1 | Notes |
|---|---|---|---|---|
| Negative | 91.0% | 84.6% | 87.7% | Excellent |
| Positive | 83.0% | 94.3% | 88.3% | Excellent |
| Neutral | 71.9% | 81.6% | 76.4% | Moderate |
| Question | 94.4% | 79.0% | 86.0% | Good |
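As a consistency check, the F1 column and the reported F1-Macro can be recomputed from the precision and recall values in the Per-Class Performance table. This is a small sketch using the standard definitions (F1 as the harmonic mean of precision and recall, macro F1 as the unweighted mean over classes). The variable names are illustrative.

```python
# (precision, recall) in percent, copied from the Per-Class Performance table.
per_class = {
    "Negative": (91.0, 84.6),
    "Positive": (83.0, 94.3),
    "Neutral": (71.9, 81.6),
    "Question": (94.4, 79.0),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_scores = {name: f1(p, r) for name, (p, r) in per_class.items()}
# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for name, score in f1_scores.items():
    print(f"{name}: {score:.1f}%")
print(f"F1-Macro: {macro_f1:.2f}%")
```

The recomputed values agree with the table after rounding, and the macro average comes out at about 84.61%, matching the reported F1-Macro.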
🌍 Language Support
| Region | Languages | Performance |
|---|---|---|
| Thai | Thai | 🟢 Excellent |
| SEA | ID, VI, MS | 🟡 Good |
| East Asia | ZH, JA, KO | 🟠 Moderate |
| Europe | EN, ES, FR | 🔴 Low |
⚡ Quick Start
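from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub
model_name = "ZombitX64/MultiSent-E5-Pro"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

text = "ผลิตภัณฑ์นี้ดีมาก ใช้งานง่าย"  # "This product is very good and easy to use"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run the forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and pick the most likely class
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted = torch.argmax(probs, dim=-1).item()

# Label order as given in this card
labels = ["Question", "Negative", "Neutral", "Positive"]
print(f"Sentiment: {labels[predicted]} (Confidence: {probs[0, predicted].item():.2%})")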
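The Quick Start hard-codes the label order. A slightly more defensive variant reads the mapping from the checkpoint's configuration (`model.config.id2label` in transformers) and keeps the decoding step separate from the model call. The sketch below shows only that decoding step in plain Python so it runs without downloading the model. The helper name, the example mapping, and the example logits are hypothetical.

```python
import math
from typing import Mapping, Sequence, Tuple


def decode_prediction(logits: Sequence[float],
                      id2label: Mapping[int, str]) -> Tuple[str, float]:
    """Softmax the logits and return (label, probability) for the top class."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical mapping and logits, for illustration only. With a loaded
# checkpoint you would pass model.config.id2label and outputs.logits[0].tolist().
example_id2label = {0: "Question", 1: "Negative", 2: "Neutral", 3: "Positive"}
label, prob = decode_prediction([-1.2, -0.8, 0.3, 4.1], example_id2label)
print(f"Sentiment: {label} (Confidence: {prob:.2%})")
```

Reading the mapping from the configuration avoids silently mislabeling outputs if the class order stored with the checkpoint differs from a hard-coded list.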
🌟 Use Cases
| Application | Suitability |
|---|---|
| Product Reviews | 🟢 Excellent |
| Social Media | 🟢 Excellent |
| Customer Support | 🟢 Excellent |
| Content Moderation | 🟡 Good |
| Research Analysis | 🟡 Good |
⚠ Known Limitations
- Sarcasm is sometimes misclassified, especially in Chinese
- Texts with mixed sentiment tend to be classified as Neutral
- Lowest recall is on the Question class (79.0%), attributed to limited training data
- Bias toward Positive predictions due to class imbalance
- Overconfident predictions on some non-Thai inputs
🛠 Technical Info
| Config | Value |
|---|---|
| Base Model | multilingual-e5-large |
| Params | ~1.02B |
| Classes | 4 |
| Max Length | 512 |
| Training Time | ~27 min |
Data Summary:
- Training: 2,456 samples
- Validation: 273 samples
- Evaluation: 2,183 samples
📄 Citation
@misc{MultiSent-E5-Pro-2024,
title={MultiSent-E5-Pro: Advanced Thai Sentiment Analysis},
author={ZombitX64 and Janutsaha, Krittanut and Saengwichain, Chanyut},
year={2024},
url={https://huggingface.co/ZombitX64/MultiSent-E5-Pro},
note={Hugging Face Model Card}
}
@article{wang2024multilingual,
title={Multilingual E5 Text Embeddings: A Technical Report},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2402.05672},
year={2024}
}
👨‍💼 Authors
| Role | Name |
|---|---|
| Lead Dev | ZombitX64 |
| Data Scientist | Krittanut Janutsaha |
| Engineer | Chanyut Saengwichain |
😊 Feedback & Contributions
Last Updated: Dec 2024 | Version: 1.1 | Docs: v2.0