Garbled Text Detector
Detect garbled, corrupted, or malformed text with 96.2% accuracy. Perfect for data quality assessment, PDF parsing validation, and text preprocessing pipelines.
Quick Start
Installation
pip install transformers torch
Basic Usage
The simplest way to use the model:
from transformers import pipeline
classifier = pipeline("text-classification", model="brightertiger/garbled-text-detector")
result = classifier("Your text here")[0]
print(f"Label: {result['label']}, Confidence: {result['score']:.2%}")
Batch Processing
from transformers import pipeline
classifier = pipeline("text-classification", model="brightertiger/garbled-text-detector")
texts = [
    "This is normal, well-formed text.",
    "H3ll0 w0rld! Th1s 1s g4rbl3d t3xt.",
    "The quick brown fox jumps over the lazy dog."
]
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text[:50]:50} -> {result['label']} ({result['score']:.2%})")
Manual Model Loading (Advanced)
If you need more control:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("brightertiger/garbled-text-detector")
model = AutoModelForSequenceClassification.from_pretrained("brightertiger/garbled-text-detector")
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)

prediction = torch.argmax(probs, dim=1).item()
confidence = torch.max(probs).item()
label = model.config.id2label[prediction]
print(f"Prediction: {label} (confidence: {confidence:.2%})")
What It Does
This model classifies text into two categories:
Normal (Label 0): Clean, well-formed, readable text
- Example: "The company reported revenue of $1.2 million in Q4."
Garbled (Label 1): Corrupted, malformed, or unreadable text
- Example: "The%รขรฃรรcompany%รขรฃรรreported$1.2%รขรฃรรmillion"
Common Use Cases
- Data Quality: Filter corrupted text from datasets before training
- PDF Parsing: Validate text extraction quality from PDFs
- OCR Validation: Check if OCR output is readable
- Encoding Detection: Identify character encoding issues
- Content Filtering: Remove malformed user-generated content
What It Handles
✅ Normal Text: Including technical content with:
- Mathematical symbols: ≈, ∫, ±, ≤, α, β
- Currency symbols: $, €, £, ¥
- Legal citations and references
- Formatted tables and structured data
✅ Detects Corruption:
- Binary data interpretation
- Encoding corruption
- Mixed character sets
- Non-ASCII noise
Performance
| Metric | Score |
|---|---|
| Accuracy | 96.2% |
| Precision | 92.5% |
| Recall | 95.4% |
| F1-Score | 93.9% |
| ROC-AUC | 99.1% |
Inference Performance (CPU):
- Throughput: ~70 samples/second
- Latency: ~14ms per sample
- Model Size: 55MB
- Device: CPU, optimized with dynamic quantization (see the sketch below)
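The exact optimization recipe behind these numbers is not documented here, but PyTorch's dynamic quantization is the standard way to get int8 Linear layers for CPU inference; a minimal sketch:

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("brightertiger/garbled-text-detector")

# Replace Linear layers with dynamically quantized int8 versions for faster CPU inference
# (assumed recipe, shown for illustration; not necessarily how the published numbers were produced)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

The quantized model can then be used for inference exactly as in the manual-loading example above.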
Confusion Matrix (6,646 validation samples; see the derivation sketch below):
- True Negatives: 4,417 (Normal → Normal)
- False Positives: 171 (Normal → Garbled)
- False Negatives: 86 (Garbled → Normal)
- True Positives: 1,972 (Garbled → Garbled)
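For reference, these counts relate to accuracy, precision, recall, and F1 through the standard formulas; a minimal sketch using the validation counts above as inputs:

# Confusion-matrix counts from the validation set reported above
tn, fp, fn, tp = 4417, 171, 86, 1972

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")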
Model Details
Architecture:
- Base Model: TinyBERT (huawei-noah/TinyBERT_General_4L_312D)
- Model Type: AutoModelForSequenceClassification
- Parameters: 14.4M
- Max Sequence Length: 512 tokens
- Task: Binary classification (Normal vs Garbled)
Model Specifications:
- Framework: PyTorch + Transformers
- Model Size: 55MB
- Input: Text (up to 512 tokens)
- Output: Binary classification with confidence scores
- Labels: NORMAL (0) and GARBLED (1) (see the config snippet below)
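The label mapping can be read straight from the model config; the expected output below follows from the labels listed above:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("brightertiger/garbled-text-detector")
print(config.id2label)  # expected: {0: 'NORMAL', 1: 'GARBLED'}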
Training Data:
- Synthetic data generated with LLMs simulating real-world PDF parsing scenarios
- Includes diverse corruption patterns and edge cases
- Balanced dataset with both normal and garbled examples
Example Outputs
from transformers import pipeline
classifier = pipeline("text-classification", model="brightertiger/garbled-text-detector")
# Normal text
classifier("The quarterly earnings report shows strong growth.")
# → [{'label': 'NORMAL', 'score': 0.998}]
# Technical content (correctly identified as normal)
classifier("Calculate ROI = (Gain - Cost) / Cost × 100%")
# → [{'label': 'NORMAL', 'score': 0.973}]
# Currency symbols (correctly identified as normal)
classifier("Exchange rates: USD$1.00, EUR€0.85, GBP£0.73")
# → [{'label': 'NORMAL', 'score': 0.961}]
# Garbled text
classifier("PDF-1.4%รขรฃรร1 0 obj<< /Type /Catalog")
# → [{'label': 'GARBLED', 'score': 0.992}]
# Encoding corruption
classifier("Revenue%รขรฃรร$1,250,000%รขรฃรรProfit%รขรฃรร$890,000")
# → [{'label': 'GARBLED', 'score': 0.987}]
Integration Examples
Data Cleaning Pipeline
from transformers import pipeline
def clean_dataset(texts):
    classifier = pipeline("text-classification", model="brightertiger/garbled-text-detector")
    results = classifier(texts)
    # Keep only texts classified as NORMAL with high confidence
    clean_texts = [
        text for text, result in zip(texts, results)
        if result['label'] == "NORMAL" and result['score'] > 0.9
    ]
    return clean_texts
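A quick usage sketch, reusing example strings from earlier in this card:

raw_texts = [
    "This is normal, well-formed text.",
    "H3ll0 w0rld! Th1s 1s g4rbl3d t3xt.",
]
print(clean_dataset(raw_texts))  # keeps only texts classified as NORMAL with score > 0.9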
Quality Control
from transformers import pipeline
def validate_text_quality(text, threshold=0.85):
    classifier = pipeline("text-classification", model="brightertiger/garbled-text-detector")
    result = classifier(text)[0]
    if result['label'] == "GARBLED" and result['score'] > threshold:
        return {
            "status": "rejected",
            "reason": "garbled_text",
            "confidence": result['score']
        }
    return {"status": "approved", "confidence": result['score']}
PDF Extraction Validation
from transformers import pipeline
def validate_pdf_extraction(pdf_text):
    classifier = pipeline("text-classification", model="brightertiger/garbled-text-detector")
    # Classify the extracted text in 1,000-character chunks
    chunks = [pdf_text[i:i+1000] for i in range(0, len(pdf_text), 1000)]
    results = classifier(chunks)
    garbled_count = sum(1 for r in results if r['label'] == "GARBLED")
    garbled_ratio = garbled_count / len(chunks)
    # Flag the extraction as poor if more than 20% of chunks look garbled
    if garbled_ratio > 0.2:
        return {"quality": "poor", "garbled_ratio": garbled_ratio}
    return {"quality": "good", "garbled_ratio": garbled_ratio}
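As an end-to-end illustration, the validator can sit directly behind a text extractor such as pypdf (a hypothetical choice; any library that yields plain text works the same way):

from pypdf import PdfReader  # pip install pypdf

def check_pdf(path):
    reader = PdfReader(path)
    # Concatenate extracted text from all pages before validating quality
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return validate_pdf_extraction(text)

print(check_pdf("report.pdf"))  # e.g. {'quality': 'good', 'garbled_ratio': 0.0}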
API Reference
Using with Transformers Pipeline
from transformers import pipeline
# Load the model
classifier = pipeline(
"text-classification",
model="brightertiger/garbled-text-detector"
)
# Single prediction
result = classifier("Your text here")[0]
print(f"{result['label']}: {result['score']:.2%}")
# Batch prediction
results = classifier(["Text 1", "Text 2", "Text 3"])
Custom Device Selection
from transformers import pipeline
# Use GPU if available
classifier = pipeline(
"text-classification",
model="brightertiger/garbled-text-detector",
device=0 # Use GPU 0, or -1 for CPU
)
# Use CPU explicitly
classifier = pipeline(
"text-classification",
model="brightertiger/garbled-text-detector",
device=-1
)
Confidence Thresholding
from transformers import pipeline
classifier = pipeline("text-classification", model="brightertiger/garbled-text-detector")
def is_text_valid(text, confidence_threshold=0.85):
    result = classifier(text)[0]
    # Accept if classified as NORMAL with high confidence
    if result['label'] == 'NORMAL' and result['score'] > confidence_threshold:
        return True
    # Reject if classified as GARBLED with high confidence
    if result['label'] == 'GARBLED' and result['score'] > confidence_threshold:
        return False
    # Uncertain - manual review recommended
    return None  # Flag for manual review
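The three-way return value maps naturally onto keep / drop / review actions; a small usage sketch:

for text in [
    "The quarterly earnings report shows strong growth.",
    "H3ll0 w0rld! Th1s 1s g4rbl3d t3xt.",
]:
    verdict = is_text_valid(text)
    action = {True: "keep", False: "drop", None: "manual review"}[verdict]
    print(f"{text[:45]:45} -> {action}")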
Limitations
- Optimized for English text
- May struggle with intentionally stylized text (e.g., l33t speak used artistically)
- Designed for PDF parsing validation; may not generalize to all corruption types
- Max sequence length: 512 tokens; longer inputs must be truncated or split (see the chunking sketch below)
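For inputs longer than 512 tokens, one option is to split on token boundaries before classification; a minimal sketch (the 480-token window is an arbitrary choice that stays under the limit):

from transformers import AutoTokenizer, pipeline

model_id = "brightertiger/garbled-text-detector"
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model_id)

def classify_long_text(text, window=480):
    # Tokenize once, then decode fixed-size token windows back to text for the pipeline
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [tokenizer.decode(ids[i:i + window]) for i in range(0, len(ids), window)]
    return classifier(chunks)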
License
Apache 2.0
Citation
@misc{garbled_text_detector_2025,
title={Garbled Text Detector: BERT-based Binary Classifier for Text Quality Assessment},
author={BrighterTiger},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/brightertiger/garbled-text-detector}
}
Acknowledgments
- Hugging Face for model hosting and the Transformers library
- Huawei Noah's Ark Lab for the TinyBERT base model
- PyTorch Lightning for the training framework
Model Card: brightertiger/garbled-text-detector
Version: 1.0.0
Last Updated: September 2025
Task: Binary Text Classification
Support & Issues
For questions, bug reports, or feature requests:
- Open an issue on GitHub
- Discuss on the Hugging Face Community
- Contact: [[email protected]]
Contributing
Contributions are welcome! If you'd like to improve the model or add features, please open a pull request on GitHub.