QomSSLab/Anonymizer-4b

QomSSLab/Anonymizer-4b is a fine-tuned Gemma 3 4B model designed to anonymize Persian legal texts by masking or replacing personally identifiable information (PII). It is trained on the QomSSLab/Anonymized_Cases dataset.

💡 Use Cases

Data privacy for legal document processing.
Preprocessing step for building publicly shareable Persian legal corpora.
Protecting PII in judicial NLP pipelines.

🧠 Model Details

Base Model: Gemma 3 4B
Language: Persian (Farsi)
Training Data: Synthetic and real anonymized Persian legal cases.
Task: Text-to-text generation (anonymization)

📦 Example Usage

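from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the token-classification checkpoint and its tokenizer.
# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForTokenClassification.from_pretrained("QomSSLab/Anonymizer-xlm-roberta", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("QomSSLab/Anonymizer-xlm-roberta")

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Sample input, in English: "A case about the marriage between Hanieh and Abdolrahim, with multiple identity details"
text = "پرونده‌ای درباره ازدواج بین هانیه و عبدالرحیم با اطلاعات هویتی متعدد"

entities = ner(text)

# Each entity carries the matched text, its character offsets, a label, and a confidence score.
for ent in entities:
    print(f"Entity: {ent['word']} ({ent['start']}-{ent['end']}), Type: {ent['entity_group']}, Score: {ent['score']:.4f}")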
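
The example above only lists the detected entities. As a minimal sketch of the masking step itself, the function below replaces each detected span with a bracketed placeholder, using the `start`, `end`, and `entity_group` keys printed above. The function name `mask_entities`, the placeholder format, and the `PER`/`LOC` labels are assumptions for illustration; this card does not state the model's actual label set.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a placeholder such as [PER]."""
    masked = text
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


# Hand-written spans standing in for pipeline output (labels are hypothetical).
sample_text = "Hanieh married Abdolrahim in Qom."
sample_entities = [
    {"entity_group": "PER", "start": 0, "end": 6},
    {"entity_group": "PER", "start": 15, "end": 25},
    {"entity_group": "LOC", "start": 29, "end": 32},
]
print(mask_entities(sample_text, sample_entities))
# prints: [PER] married [PER] in [LOC].
```

With the real pipeline, `mask_entities(text, entities)` could be called directly on the `entities` list, since the aggregated output includes the same keys. Overlapping spans are not handled in this sketch.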

📊 Evaluation

The model was evaluated qualitatively on a diverse collection of Persian legal documents. It effectively identifies and anonymizes a range of personally identifiable information (PII), including:

Full names
National IDs
Addresses
Dates of birth
Case numbers
Geographic locations

The model is particularly well-suited for preprocessing court cases for research, public data release, or downstream tasks like summarization and classification while preserving privacy.

Limitations

May occasionally miss rare or out-of-distribution PII formats.
Anonymization is not guaranteed on very short or extremely noisy texts.
Trained primarily on formal legal language; performance may degrade on informal Persian.
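
Because the model may miss some PII formats, one common mitigation is a rule-based second pass over the model's output. The sketch below is not part of the model or its training: it assumes that a standalone run of exactly 10 digits may be a national ID and masks it. The name `mask_digit_runs`, the `[ID]` placeholder, and the 10-digit rule are all illustrative assumptions and should be adapted to the identifiers that actually occur in your documents.

```python
import re

# Assumption: a standalone run of exactly 10 digits may be a national ID.
# In Python 3, \d in a str pattern also matches Persian digits (۰-۹).
ID_PATTERN = re.compile(r"(?<!\d)\d{10}(?!\d)")


def mask_digit_runs(text: str, placeholder: str = "[ID]") -> str:
    """Mask 10-digit runs that are not part of a longer number."""
    return ID_PATTERN.sub(placeholder, text)


print(mask_digit_runs("ID 0012345678 filed"))
```

A rule like this trades precision for recall: it will also mask unrelated 10-digit numbers, so it fits best as a safety net before public release rather than as a replacement for the model.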

📁 Dataset

This model was fine-tuned on the QomSSLab/Anonymized_Cases dataset, which includes manually and synthetically anonymized court documents and legal filings in Persian. The dataset contains a mix of real and simulated entities, helping the model generalize across varied legal formats and writing styles.

Safetensors

Model size: 0.6B params
Tensor type: F32
