Indonesian Spam Detection Model
Model Overview
Indonesian Spam Detection Model is a spam classifier fine-tuned from the Gemma 2 2B architecture. It is designed to identify spam in Indonesian text, particularly in WhatsApp chatbot interactions, and was fine-tuned on a dataset of 40,000 spam messages collected over a year.
Labels
The model classifies text into two categories:
- 0: Non-spam (legitimate message)
- 1: Spam (unwanted/malicious message)
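As a small illustration of the label scheme above, a class index can be mapped to a readable name. The `ID2LABEL` dictionary and the `label_name` helper are hypothetical names introduced here for illustration; they are not part of the published model.

```python
# Hypothetical helper: maps the two class indices listed above to names.
ID2LABEL = {0: "Non-spam", 1: "Spam"}


def label_name(class_id: int) -> str:
    """Return the readable label for a class index (0 or 1)."""
    if class_id not in ID2LABEL:
        raise ValueError(f"unexpected class id: {class_id}")
    return ID2LABEL[class_id]


print(label_name(1))
```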
Detection Capabilities
The model is intended to detect several types of spam, including:
- Offensive and abusive language
- Profane content
- Gibberish text and random characters
- Suspicious links and URLs
- Promotional spam
- Fraudulent messages
Use this Model
Installation
First, install the required dependencies:
pip install transformers torch
Quick Start
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "nahiar/spam-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example texts to classify
texts = [
    "Halo, bagaimana kabar Anda hari ini?",  # Non-spam
    "MENANG JUTAAN RUPIAH! Klik link ini sekarang: http://suspicious-link.com",  # Spam
    "adsfwcasdfad12345",  # Spam (gibberish)
    "Terima kasih atas informasinya",  # Non-spam
]

# Tokenize and predict
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(prediction, dim=1).item()
    confidence = torch.max(prediction, dim=1)[0].item()
    label = "Spam" if predicted_class == 1 else "Non-spam"
    print(f"Text: {text}")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print("-" * 50)
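The Quick Start picks the class with the highest probability. Some deployments may prefer a stricter cutoff so that fewer legitimate messages are flagged. The sketch below shows one way to do this in plain Python, assuming a two-element list of logits ordered as [non-spam, spam]. The `spam_probability` and `is_spam` helpers and the `threshold` parameter are assumptions made for illustration, not part of the model's API.

```python
import math


def spam_probability(logits):
    """Softmax over [non_spam_logit, spam_logit]; returns P(spam)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def is_spam(logits, threshold=0.5):
    """Flag a message as spam only if P(spam) reaches the threshold."""
    return spam_probability(logits) >= threshold


# Toy logits, not real model output.
print(is_spam([0.0, 2.0]))                 # default cutoff of 0.5
print(is_spam([0.0, 2.0], threshold=0.9))  # stricter cutoff
```

In the Quick Start loop, `outputs.logits[0].tolist()` would yield a list in this form. A suitable threshold depends on the application and would need to be chosen on held-out data.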
Batch Processing
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


def classify_spam_batch(texts, model_name="nahiar/spam-analysis"):
    """
    Classify multiple texts for spam detection

    Args:
        texts (list): List of texts to classify
        model_name (str): Hugging Face model name

    Returns:
        list: List of predictions with confidence scores
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Tokenize all texts
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)

    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_classes = torch.argmax(predictions, dim=1)
    confidences = torch.max(predictions, dim=1)[0]

    results = []
    for i, text in enumerate(texts):
        results.append({
            'text': text,
            'is_spam': bool(predicted_classes[i].item()),
            'confidence': confidences[i].item(),
            'label': 'Spam' if predicted_classes[i].item() == 1 else 'Non-spam'
        })
    return results


# Example usage
texts = [
    "Selamat pagi, semoga harimu menyenangkan",
    "URGENT!!! Dapatkan uang 10 juta hanya dengan klik link ini",
    "Terima kasih sudah membantu kemarin"
]

results = classify_spam_batch(texts)
for result in results:
    print(f"Text: {result['text']}")
    print(f"Label: {result['label']} (Confidence: {result['confidence']:.4f})")
    print()
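The batch function above tokenizes every text in a single forward pass, which may use a lot of memory for long lists. One possible approach is to split the input into smaller chunks first. The `chunked` helper and its `batch_size` default are hypothetical and shown only as a sketch.

```python
def chunked(items, batch_size=32):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy data to show the slicing; each chunk could then be classified on its own.
messages = ["pesan 1", "pesan 2", "pesan 3", "pesan 4", "pesan 5"]
for chunk in chunked(messages, batch_size=2):
    print(chunk)
```

Each chunk could be passed to `classify_spam_batch`. Note that the function as written loads the tokenizer and model on every call, so for many chunks it is probably better to load them once and reuse them.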
Model Performance
This model was trained on a diverse dataset of Indonesian text messages and is intended to distinguish spam from legitimate messages across contexts including:
- WhatsApp chatbot interactions
- SMS messages
- Social media content
- Customer service communications
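To check how the model behaves in one of these contexts, predictions can be compared against your own labeled messages. The sketch below computes precision, recall, and F1 for the spam class (label 1). The `spam_metrics` helper is hypothetical, and the sample labels are toy values, not results from this model.

```python
def spam_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the spam class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only.
print(spam_metrics([1, 1, 0, 0], [1, 0, 1, 0]))
```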
Limitations
- The model is trained primarily on Indonesian-language text, so performance on other languages may be weaker
- Performance may vary on very short messages (fewer than 10 characters)
- Context-dependent spam (messages that are spam only in specific contexts) may be hard for the model to detect
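Given the short-message limitation above, one option is to route very short inputs to a separate path, such as manual review, instead of trusting the classifier. This is a minimal sketch; the `needs_review` helper and the `MIN_CHARS` constant are hypothetical, with the 10-character cutoff taken from the limitation listed above.

```python
MIN_CHARS = 10  # cutoff taken from the limitation listed above


def needs_review(text, min_chars=MIN_CHARS):
    """Return True when a message is too short for the prediction to be trusted."""
    return len(text.strip()) < min_chars


for message in ["ok", "Terima kasih atas informasinya"]:
    route = "manual review" if needs_review(message) else "classifier"
    print(f"{message!r} -> {route}")
```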
Repository
For more information about the training process and code implementation, visit:
https://github.com/nahiar/spam-analysis
Citation
If you use this model in your research or applications, please cite:
@misc{spam-analysis-indo,
title={Indonesian Spam Detection Model},
author={Nahiar},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/nahiar/spam-analysis}
}
Base model: google/gemma-2-2b