ParaDetect: DeBERTa-v3-Large Fine-tuned for AI vs Human Text Detection

Model Description

ParaDetect is a DeBERTa-v3-large model fine-tuned with LoRA (Low-Rank Adaptation) to classify text as AI-generated or human-written. It reaches 99.31% accuracy on its held-out test set and is intended for academic integrity, content verification, and research applications.

Model Details

  • Base Model: microsoft/deberta-v3-large (~435M parameters)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Trainable Parameters: ~28M parameters (6% of total)
  • Task: Binary text classification (Human: 0, AI: 1)
  • Dataset: AI Text Detection Pile (cleaned, 100K samples)
  • Training Framework: Hugging Face Transformers + PEFT
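
The ~28M trainable-parameter figure can be sanity-checked with a rough count. The sketch below assumes the DeBERTa-v3-large encoder has 24 layers with hidden size 1024 and feed-forward size 4096, and that the LoRA target modules match six linear layers per encoder layer. It ignores bias terms, the pooler, and the classification head, so it is an approximation, not the exact count PEFT would report.

```python
# Assumed DeBERTa-v3-large dimensions (not stated in this card).
HIDDEN = 1024
INTERMEDIATE = 4096
NUM_LAYERS = 24
RANK = 64  # LoRA rank from this card's configuration


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """Parameters added by one LoRA pair: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# Linear layers assumed to match the target modules in each encoder layer.
per_layer = (
    3 * lora_params(HIDDEN, HIDDEN)      # query_proj, key_proj, value_proj
    + lora_params(HIDDEN, HIDDEN)        # attention output dense
    + lora_params(HIDDEN, INTERMEDIATE)  # feed-forward intermediate dense
    + lora_params(INTERMEDIATE, HIDDEN)  # feed-forward output dense
)
total = NUM_LAYERS * per_layer
print(f"{total:,} LoRA parameters (~{total / 435e6:.1%} of ~435M)")
```

Under these assumptions the count comes to 28,311,552, which is consistent with the ~28M (roughly 6%) reported above.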

Performance Metrics

Test Set Results

  • Accuracy: 99.31%
  • Precision (Weighted): 99.31%
  • Recall (Weighted): 99.31%
  • F1-Score (Weighted): 99.31%

Class-wise Performance

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Human (0) | 99.72% | 98.89% | 99.30% | 7,500 |
| AI (1) | 98.91% | 99.72% | 99.31% | 7,500 |
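
The per-class F1 scores follow from the precision and recall values as their harmonic mean. A minimal sketch that recomputes them from the reported figures (the `f1_score` helper is written here for illustration and is not taken from the author's evaluation code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) as reported in this card.
per_class = {
    "Human (0)": (99.72, 98.89),
    "AI (1)": (98.91, 99.72),
}

for label, (precision, recall) in per_class.items():
    print(f"{label}: F1 = {f1_score(precision, recall):.2f}%")
```

This reproduces the reported 99.30% and 99.31%.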

Training Details

LoRA Configuration

  • Rank (r): 64
  • Alpha: 128
  • Dropout: 0.1
  • Target Modules: query_proj, key_proj, value_proj, dense, output.dense
  • Bias: all
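
As a sketch, these settings map onto PEFT's `LoraConfig` roughly as follows. The `task_type` value and the `build_lora_config` helper are assumptions made for illustration; the author's actual training code is not shown in this card.

```python
# Keyword arguments mirroring the LoRA settings listed in this card.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.1,
    "target_modules": ["query_proj", "key_proj", "value_proj", "dense", "output.dense"],
    "bias": "all",
}


def build_lora_config():
    """Build a PEFT LoraConfig for sequence classification (requires `peft`)."""
    # Imported lazily so the settings dict can be inspected without peft installed.
    from peft import LoraConfig, TaskType

    # task_type is an assumption based on the binary classification task.
    return LoraConfig(task_type=TaskType.SEQ_CLS, **LORA_KWARGS)


# With standard LoRA scaling, the adapter update is multiplied by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(f"LoRA scaling factor: {scaling}")
```

With rank 64 and alpha 128, the scaling factor is 2.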

Training Parameters

  • Epochs: 3 (with early stopping)
  • Batch Size: 32 (train/eval)
  • Learning Rate: 2e-4
  • Optimizer: AdamW
  • Weight Decay: 0.01
  • Warmup Ratio: 0.1
  • Max Gradient Norm: 1.0
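
To give a sense of scale, the step counts implied by these settings can be worked out directly. The sketch assumes a single device, no gradient accumulation, a 70% training split of the 100K samples, a kept final partial batch, and that early stopping does not cut training short.

```python
import math

train_samples = 100_000 * 70 // 100  # 70% of the 100K-sample dataset
batch_size = 32
epochs = 3
warmup_ratio = 0.1

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"{steps_per_epoch} steps/epoch, {total_steps} total, {warmup_steps} warmup")
```

Under these assumptions that is 2,188 steps per epoch, 6,564 optimizer steps in total, and 657 warmup steps.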

Early Stopping

  • Patience: 5 evaluation steps
  • Metric: F1-score
  • Threshold: 0.001
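
The patience and threshold settings interact as follows: training stops once the F1-score has failed to beat the best value so far by more than the threshold for five evaluations in a row. The function below is a plain-Python illustration of that rule, not the author's code. In Hugging Face Transformers the same behavior is available through `EarlyStoppingCallback` with its `early_stopping_patience` and `early_stopping_threshold` arguments, although the card does not confirm that this callback was used.

```python
def should_stop(f1_history, patience=5, threshold=0.001):
    """Return True once F1 has failed to beat the best score by more than
    `threshold` for `patience` consecutive evaluations."""
    best = None
    stale = 0
    for score in f1_history:
        if best is None or score - best > threshold:
            best = score  # meaningful improvement: reset the counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return True
    return False


# Hypothetical F1 values: a plateau after the second evaluation.
plateau = [0.90, 0.95, 0.9505, 0.9506, 0.9504, 0.9507, 0.9509]
print(should_stop(plateau))
```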

Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained("srikanthgali/paradetect-deberta-v3-lora")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", 
    num_labels=2
)

# Load LoRA adapter and switch to evaluation mode (disables dropout)
model = PeftModel.from_pretrained(base_model, "srikanthgali/paradetect-deberta-v3-lora")
model.eval()

# Prediction function
def predict_text_origin(text):
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        truncation=True, 
        max_length=512,
        padding=True
    )
    
    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        prediction = torch.argmax(probabilities, dim=-1)
    
    human_prob = probabilities[0][0].item()
    ai_prob = probabilities[0][1].item()
    
    return {
        "prediction": "AI" if prediction.item() == 1 else "Human",
        "confidence": max(human_prob, ai_prob),
        "human_probability": human_prob,
        "ai_probability": ai_prob
    }

# Example usage
text = "Your text here..."
result = predict_text_origin(text)
print(f"Prediction: {result['prediction']} (Confidence: {result['confidence']:.1%})")

Gradio Interface

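import gradio as gr

# Adapt the prediction dict to the two output components below
def classify(text):
    result = predict_text_origin(text)
    scores = {
        "Human": result["human_probability"],
        "AI": result["ai_probability"],
    }
    return result["prediction"], scores

# Create interface (see full notebook for complete implementation)
demo = gr.Interface(
    fn=classify,
    inputs=gr.Textbox(lines=10, placeholder="Enter text to analyze..."),
    outputs=[
        gr.Textbox(label="Prediction"),
        gr.Label(label="Confidence Scores")
    ],
    title="ParaDetect - AI vs Human Text Detection",
    description="Detect whether text is written by humans or generated by AI"
)

demo.launch()

## Technical Specifications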

- **Input**: Text (up to 512 tokens)
- **Output**: Binary classification with confidence scores
- **Inference Speed**: ~100ms per text
- **Memory Usage**: LoRA trains only ~6% of the parameters (~94% fewer than full fine-tuning), which reduces training memory
- **GPU Support**: CUDA-enabled for faster inference
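
Inference speed depends on hardware and text length, so the ~100ms figure is best checked on the target machine. A minimal timing sketch follows; `measure_latency_ms` and `dummy_predict` are hypothetical helpers, and the dummy would be replaced with `predict_text_origin` from the Quick Start section.

```python
import statistics
import time


def measure_latency_ms(predict_fn, texts, warmup=2):
    """Return the median per-text latency of `predict_fn` in milliseconds."""
    for text in texts[:warmup]:  # warm-up calls are not timed
        predict_fn(text)
    timings = []
    for text in texts:
        start = time.perf_counter()
        predict_fn(text)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Stand-in predictor so the sketch runs on its own.
def dummy_predict(text):
    return {"prediction": "Human", "confidence": 1.0}


latency = measure_latency_ms(dummy_predict, ["sample text"] * 10)
print(f"Median latency: {latency:.3f} ms")
```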

## Training Dataset

- **Source**: artem9k/ai-text-detection-pile (cleaned)
- **Size**: 100,000 samples (subset for efficient training)
- **Split**: 70% train, 15% validation, 15% test
- **Balance**: Equal distribution of human vs AI text
- **Text Length**: 10-512 tokens, optimized for 50-500 words
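
A 70/15/15 split of a balanced 100K-sample dataset gives 70,000 training, 15,000 validation, and 15,000 test samples, with 7,500 per class in the test set. The sketch below shows one way to produce such a stratified split; the `stratified_split` helper and the seed are illustrative assumptions, since the author's splitting code is not shown here.

```python
import random


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split sample indices into train/val/test while preserving class balance."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


labels = [0] * 50_000 + [1] * 50_000  # balanced: 0 = human, 1 = AI
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```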

## Limitations and Considerations

- **Language**: Optimized for English text
- **Text Length**: Best performance on 50-500 word texts
- **Domain**: May not generalize to very recent AI models
- **Context**: Performance may vary on highly technical or domain-specific content
- **Updates**: Regular retraining recommended as AI models evolve

## Intended Use Cases

### Primary Applications
- Academic integrity verification
- Content authenticity checking
- Research and analysis
- Educational demonstrations
- Journalism and fact-checking

### Not Recommended For
- Legal evidence without human verification
- Automated content moderation decisions
- High-stakes authentication without additional validation

## Ethical Considerations

- **Bias**: Model trained on specific dataset; may not represent all text types
- **Fairness**: Regular evaluation across different demographics recommended
- **Transparency**: Predictions are probabilistic, not definitive
- **Human Oversight**: Critical decisions should involve human judgment
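
One way to put the transparency and human-oversight points into practice is to route low-confidence predictions to a reviewer instead of acting on them. The sketch below takes the dict returned by `predict_text_origin`; the `triage` helper and the 0.95 threshold are hypothetical and would need tuning on real data, since softmax confidence is not guaranteed to be well calibrated.

```python
def triage(result, review_threshold=0.95):
    """Decide how to treat a prediction dict from predict_text_origin."""
    if result["confidence"] < review_threshold:
        return "needs_human_review"
    return f"likely_{result['prediction'].lower()}"


# Example prediction dicts with made-up probabilities.
print(triage({"prediction": "AI", "confidence": 0.99}))
print(triage({"prediction": "Human", "confidence": 0.80}))
```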

## Model Card Authors

- **Developer**: Srikanth Gali
- **Organization**: Independent Research
- **Contact**: [GitHub Repository](https://github.com/srikanthgali/ParaDetect)

## Citation
@misc{paradetect2024,
  title={ParaDetect: AI vs Human Text Detection with DeBERTa-v3-Large},
  author={Srikanth Gali},
  year={2024},
  url={https://github.com/srikanthgali/ParaDetect},
  note={Fine-tuned using LoRA for efficient parameter adaptation}
}

## Additional Resources
- **GitHub Repository**: [ParaDetect](https://github.com/srikanthgali/ParaDetect)
- **Dataset**: AI Text Detection Pile - Cleaned
- **Demo**: Gradio Interface
- **Training Notebook**: Fine-tuning Details
- **EDA**: Data Analysis

## Version History
- **v1.0**: Initial release with DeBERTa-v3-Large + LoRA
- **Training Date**: 2025-10-06
- **Model Size**: ~28M trainable parameters
- **Performance**: 99.31% test accuracy