---
license: apache-2.0
base_model: uitnlp/visobert
tags:
- vietnamese
- spam-detection
- text-classification
- e-commerce
datasets:
- ViSpamReviews
metrics:
- accuracy
- macro-f1
- macro-precision
- macro-recall
model-index:
- name: visobert-spam-binary
  results:
  - task:
      type: text-classification
      name: Spam Review Detection
    dataset:
      name: ViSpamReviews
      type: ViSpamReviews
    metrics:
    - type: accuracy
      value: 0.9144
    - type: macro-f1
      value: 0.8916
---
# visobert-spam-binary: Spam Review Detection for Vietnamese Text

This model is a fine-tuned version of uitnlp/visobert on the ViSpamReviews dataset for spam review detection in Vietnamese e-commerce reviews.

## Model Details

- Base Model: uitnlp/visobert (ViSoBERT - Vietnamese Social BERT)
- Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
- Fine-tuning Framework: HuggingFace Transformers
- Task: Spam Review Detection (binary classification)
- Number of Classes: 2
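
Because this is a binary classifier, the saved config should expose two labels. A quick, hedged way to confirm this is to load the config as below; the exact `id2label` names stored in the checkpoint are not documented here and may be the generic `LABEL_0`/`LABEL_1`.

```python
from transformers import AutoConfig

# Inspect the classification head; the concrete id2label names are an assumption.
config = AutoConfig.from_pretrained("visolex/visobert-spam-binary")
print(config.num_labels)  # expected: 2
print(config.id2label)    # e.g. {0: "LABEL_0", 1: "LABEL_1"} or {0: "Non-spam", 1: "Spam"}
```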

## Hyperparameters

- Max sequence length: 256
- Learning rate: 5e-5
- Batch size: 32
- Epochs: 100
- Early stopping patience: 5
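
For reference, the sketch below shows how these hyperparameters map onto a HuggingFace `Trainer` fine-tuning run. It is a minimal sketch, not the exact training script: the local CSV files and the `comment`/`label` column names are assumptions about how ViSpamReviews is stored.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Hypothetical local CSV files with "comment" (text) and "label" (0/1) columns.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

tokenizer = AutoTokenizer.from_pretrained("uitnlp/visobert")
model = AutoModelForSequenceClassification.from_pretrained("uitnlp/visobert", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["comment"], truncation=True, max_length=256)

data = data.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
    }

args = TrainingArguments(
    output_dir="visobert-spam-binary",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=100,
    eval_strategy="epoch",        # "evaluation_strategy" in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
    metric_for_best_model="macro_f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```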

## Dataset

The model was trained on the ViSpamReviews dataset, which contains 19,860 Vietnamese e-commerce review samples, split as follows:
- Train set: 14,299 samples (72%)
- Validation set: 1,590 samples (8%)
- Test set: 3,971 samples (20%)

### Label Distribution

- Non-spam (0): Genuine product reviews
- Spam (1): Fake or promotional reviews
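
To sanity-check the split sizes and label balance before training, something along these lines can be used; the file names and the `label` column are assumptions about a local copy of the dataset.

```python
from datasets import load_dataset

# Hypothetical local CSV split files with a binary "label" column (0 = non-spam, 1 = spam).
splits = load_dataset(
    "csv",
    data_files={"train": "train.csv", "validation": "dev.csv", "test": "test.csv"},
)
for name, split in splits.items():
    labels = split["label"]
    print(name, len(split), "spam ratio:", sum(labels) / len(labels))
```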

## Results

The model was evaluated on the test set with the following metrics:

- Accuracy: 0.9144
- Macro-F1: 0.8916
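
The reported numbers follow the standard scikit-learn definitions of accuracy and macro-averaged F1 (macro-precision and macro-recall are listed in the metadata as well). The toy example below only illustrates the metric calls; the labels in it are made up.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy gold labels and predictions (0 = non-spam, 1 = spam), purely for illustration.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]

print("accuracy:       ", accuracy_score(y_true, y_pred))
print("macro precision:", precision_score(y_true, y_pred, average="macro"))
print("macro recall:   ", recall_score(y_true, y_pred, average="macro"))
print("macro F1:       ", f1_score(y_true, y_pred, average="macro"))
```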

## Usage

You can use this model for spam review detection in Vietnamese text. Below is an example:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "visolex/visobert-spam-binary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example review text ("This product is very good, the shop delivered quickly!")
text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"

# Tokenize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Predict
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(dim=-1).item()
probabilities = torch.softmax(outputs.logits, dim=-1)

# Map to label
label_map = {0: "Non-spam", 1: "Spam"}
predicted_label = label_map[predicted_class]
confidence = probabilities[0][predicted_class].item()

print(f"Text: {text}")
print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
```

## Citation

If you use this model, please cite:

```bibtex
@misc{visobert_spam_binary,
  title        = {visobert-spam-binary: Spam Review Detection for Vietnamese Text},
  author       = {{ViSoLex Team}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/visolex/visobert-spam-binary}}
}
```

## License

This model is released under the Apache-2.0 license.

## Acknowledgments

- Base model: uitnlp/visobert
- Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
- ViSoLex Toolkit for Vietnamese NLP