Reddit Sentiment Analysis - Hybrid Model
🎯 Test Accuracy: 0.9966
Model Description
This hybrid sentiment analysis model combines Sentence Transformers for semantic embeddings with XGBoost for classification. It was trained on Reddit comments for three-class sentiment classification: Negative, Positive, and Neutral.
Architecture
Input Text → SentenceTransformer → Embeddings (768D) →
Feature Engineering (Length + Sentiment + POS) → XGBoost → Prediction
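As a rough illustration of the pipeline above, the sketch below lays out the columns of the concatenated feature vector. The block widths follow the Quick Start code (768 embedding values, one word count, one TextBlob polarity, three POS counts); the names `FEATURE_BLOCKS` and `feature_layout` are hypothetical and not part of the released model.

```python
# Block widths follow the Quick Start code; the names are illustrative only.
FEATURE_BLOCKS = [
    ("embedding", 768),         # SentenceTransformer output
    ("comment_length", 1),      # number of whitespace-separated words
    ("sentiment_polarity", 1),  # TextBlob polarity score
    ("pos_counts", 3),          # adjectives, nouns, verbs
]


def feature_layout(blocks=FEATURE_BLOCKS):
    """Return {name: (start, stop)} column ranges in concatenation order."""
    layout = {}
    start = 0
    for name, width in blocks:
        layout[name] = (start, start + width)
        start += width
    return layout


layout = feature_layout()
total_dim = max(stop for _, stop in layout.values())
print(layout["pos_counts"], total_dim)
```

Knowing the column ranges makes it easier to inspect which block a given XGBoost feature index belongs to.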
Quick Start
import os
import pickle

import nltk
import numpy as np
from huggingface_hub import hf_hub_download, snapshot_download
from sentence_transformers import SentenceTransformer
from textblob import TextBlob

# Download NLTK data. Newer NLTK releases look up "punkt_tab" and
# "averaged_perceptron_tagger_eng", so both naming schemes are requested.
for resource in ('punkt', 'punkt_tab',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'):
    nltk.download(resource, quiet=True)

REPO_ID = "USERNAME/mahek-sentiment"

# Load XGBoost model
xgb_path = hf_hub_download(repo_id=REPO_ID, filename="xgboost_model.pkl")
with open(xgb_path, 'rb') as f:
    pipeline_data = pickle.load(f)
xgb_model = pipeline_data['xgboost_model']
label_names = pipeline_data['label_names']

# Load SentenceTransformer. Assumption: "sentence_transformer" is a folder in
# the repo. hf_hub_download fetches single files only, so the folder is
# downloaded with snapshot_download instead.
repo_dir = snapshot_download(repo_id=REPO_ID, allow_patterns=["sentence_transformer/*"])
sentence_model = SentenceTransformer(os.path.join(repo_dir, "sentence_transformer"))

def predict_sentiment(text):
    # Extract features
    embedding = sentence_model.encode([text])  # shape (1, 768)
    comment_length = np.array([[len(text.split())]])
    sentiment_polarity = np.array([[TextBlob(text).sentiment.polarity]])

    # POS counts; fall back to zeros if tokenizing or tagging fails
    try:
        tags = nltk.pos_tag(nltk.word_tokenize(text))
        pos_counts = np.array([[
            sum(1 for _, tag in tags if tag.startswith('J')),  # Adjectives
            sum(1 for _, tag in tags if tag.startswith('N')),  # Nouns
            sum(1 for _, tag in tags if tag.startswith('V')),  # Verbs
        ]])
    except Exception:
        pos_counts = np.array([[0, 0, 0]])

    # Combine features into a single row
    features = np.hstack([embedding, comment_length, sentiment_polarity, pos_counts])

    # Predict the class id and take the highest class probability as confidence
    prediction = int(xgb_model.predict(features)[0])
    confidence = float(xgb_model.predict_proba(features)[0].max())
    return {
        'label': label_names[prediction],
        'confidence': confidence,
        'prediction_id': prediction,
    }

# Example usage
result = predict_sentiment("I love this new phone! It's amazing!")
print(f"Sentiment: {result['label']} (confidence: {result['confidence']:.3f})")
Model Details
- Base Model: paraphrase-mpnet-base-v2
- Classifier: XGBoost with GPU acceleration
- Features: 773 dimensions (768 embeddings + 5 engineered: comment length, TextBlob polarity, and 3 POS counts)
- Classes: 0=Negative, 1=Positive, 2=Neutral
- Training Data: Reddit comments
- Test Accuracy: 0.9966
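To make the class mapping concrete, here is a minimal sketch of turning a row of class probabilities into a label and confidence, mirroring what the Quick Start function returns. The `LABEL_NAMES` dictionary and the `decode` helper are hypothetical; the actual `label_names` object stored in the pickle may be structured differently.

```python
# Class ids as listed in Model Details; this dict is an illustrative stand-in
# for the label_names object stored alongside the model.
LABEL_NAMES = {0: "Negative", 1: "Positive", 2: "Neutral"}


def decode(probabilities):
    """Pick the most probable class and report its probability as confidence."""
    pred_id = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "label": LABEL_NAMES[pred_id],
        "confidence": float(probabilities[pred_id]),
        "prediction_id": pred_id,
    }


# Made-up probabilities in class-id order (Negative, Positive, Neutral)
print(decode([0.05, 0.90, 0.05]))
```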
Training Configuration
- XGBoost Parameters: n_estimators=300, learning_rate=0.05, max_depth=6
- Features: Embeddings + Comment Length + TextBlob Sentiment + POS Counts
- Class Balancing: Sample weights for imbalanced data
- Validation: Stratified train/val/test split
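The training script is not included in this card, so the following is only a sketch of how class-balancing sample weights and a stratified train/val/test split could be computed. The "balanced" weighting formula, the 80/10/10 proportions, the seed, and the helper names are all assumptions; only the three XGBoost parameters come from the list above.

```python
import random
from collections import Counter, defaultdict

# Hyperparameters as listed above; how they were passed to XGBoost is not shown here.
XGB_PARAMS = {"n_estimators": 300, "learning_rate": 0.05, "max_depth": 6}


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


def stratified_split(labels, percents=(80, 10, 10), seed=0):
    """Split indices into train/val/test, preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, y in enumerate(labels):
        by_class[y].append(index)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = len(indices) * percents[0] // 100
        n_val = len(indices) * percents[1] // 100
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy, made-up imbalanced labels: 0=Negative, 1=Positive, 2=Neutral
labels = [0] * 10 + [1] * 20 + [2] * 30
weights = balanced_sample_weights(labels)
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With weights like these, each class contributes the same total weight to the loss, and the training portion could be passed to the classifier through the `sample_weight` argument of its `fit` method.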
Citation
@misc{reddit-sentiment-hybrid,
title={Reddit Sentiment Analysis - Hybrid Model},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/USERNAME/mahek-sentiment}
}
License
MIT License