Model Card for Csv-AI-Cleaner-V3

Transforms English instructions, together with data context, into executable pandas code.

Model Description

Csv-AI-Cleaner converts natural-language instructions into pandas code for data cleaning, filtering, grouping, sorting, merging, and more.
It was fine-tuned from CodeT5 on a mix of synthetic and real-world datasets, using LoRA for efficiency.

  • Developed by: ArhanSD1
  • Model type: Seq2Seq Transformer (CodeT5)
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: Salesforce/CodeT5-base

Direct Use

  • Input: Context (sample dataset) + instruction in natural language
  • Output: Executable pandas code snippet

Example:

Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR

Instruction: Show IT department employees earning over 45000

Output:

df[(df['department'] == 'IT') & (df['salary'] > 45000)]

Out-of-Scope Use

  • Ambiguous or poorly defined instructions without dataset context
  • Complex multi-step pipelines that exceed the ~500-token context limit (see the length-check sketch below)
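
For long or multi-step requests, it can help to check whether the serialized context plus instruction fits the model's input window before generating. A minimal sketch, assuming the 512-token truncation setting used in the getting-started code below; the helper name is illustrative:

from transformers import AutoTokenizer

MODEL_REPO = "arhansd1/Csv-AI-Cleaner-V3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)

def fits_context(input_text, limit=512):
    # Count tokens the same way the generation code does and compare to the limit.
    n_tokens = len(tokenizer(input_text).input_ids)
    return n_tokens <= limit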

Bias, Risks, and Limitations

  • Works best with clean and clear column names
  • May generate suboptimal code if context is incomplete or contains noise
  • No awareness of business-logic correctness; the model learns syntax and patterns only

Recommendations

Users should verify generated code for correctness and safety before execution.
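
One lightweight way to follow this recommendation is to syntax-check a generated snippet and screen it for obviously unsafe constructs before running it against real data. A minimal sketch, assuming the snippet is plain Python; the helper below is illustrative and not part of the model:

import ast

def check_generated_code(code_str):
    # Reject snippets that are not valid Python before anything else.
    try:
        tree = ast.parse(code_str)
    except SyntaxError as err:
        return False, f"Syntax error: {err}"
    # Flag obviously unsafe constructs such as imports or exec/eval calls.
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False, "Generated code should not import modules."
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and node.func.id in {"exec", "eval"}:
            return False, "Generated code should not call exec or eval."
    return True, "Syntactically valid; still review manually before executing."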

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

MODEL_REPO = "arhansd1/Csv-AI-Cleaner-V3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_REPO)

def generate_code(input_text):
    # Prepend the task prompt this model expects.
    prefixed_input = "Generate pandas code: " + input_text
    inputs = tokenizer(prefixed_input, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=128,
            num_beams=5,         # deterministic beam search; sampling is off, so no temperature is set
            early_stopping=True
        )
    # Return the decoded pandas snippet without special tokens.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
input_example = """
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR

Instruction: Show IT department employees earning over 45000
"""
print(generate_code(input_example))
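
The model expects the context in the pipe-delimited table format shown above. A small helper for serializing the first few rows of a real DataFrame into that format is sketched below; the function name and the choice of sample-row count are assumptions, not part of the model card:

import pandas as pd

def build_input(df, instruction, n_rows=3):
    # Header row: column names separated by " | ", matching the example format above.
    lines = [" | ".join(df.columns)]
    # A few sample rows give the model concrete values without exceeding the context limit.
    for _, row in df.head(n_rows).iterrows():
        lines.append(" | ".join(str(v) for v in row))
    return "Context:\n" + "\n".join(lines) + f"\n\nInstruction: {instruction}"

df = pd.DataFrame({
    "employee_id": ["E001", "E002"],
    "name": ["Alice", "Bob"],
    "salary": [50000, 45000],
    "department": ["IT", "HR"],
})
print(generate_code(build_input(df, "Show IT department employees earning over 45000")))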

Training Data

  • Combination of synthetic data cleaning instructions + public dataset column contexts
  • Augmented with filtered StackOverflow code snippets for pandas tasks

Preprocessing

  • Normalized table format for context section
  • Instruction phrasing normalized to imperative form

Training Hyperparameters

  • LoRA fine-tuning
  • Learning rate: 5e-5
  • Epochs: 3
  • Batch size: 8
  • Precision: fp16 mixed precision
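
A sketch of how the hyperparameters above map onto a Transformers + PEFT setup is shown below. The LoRA rank, alpha, dropout, and target modules are not stated in this card and are illustrative defaults for T5-style models; the training dataset is a placeholder:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank: assumption, not reported in this card
    lora_alpha=32,              # assumption
    lora_dropout=0.05,          # assumption
    target_modules=["q", "v"],  # common choice for T5 attention projections
)
model = get_peft_model(base, lora_config)

args = Seq2SeqTrainingArguments(
    output_dir="csv-ai-cleaner-lora",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    fp16=True,
)
# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset, ...)
# train_dataset stands in for the tokenized instruction -> pandas-code pairs.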

Testing Data

  • Held-out set of 500 natural language → pandas task pairs

Metrics

Metric            Score
Exact Match       71%
Partial Match     92%
Syntax Accuracy   100%
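
The card does not define how these metrics were computed; a plausible reading is that exact match compares normalized strings, syntax accuracy checks that a prediction parses as Python, and partial match measures token overlap. A hedged sketch under those assumptions (not the authors' evaluation script):

import ast

def exact_match(pred, target):
    # Whitespace-normalized string equality (assumed definition).
    return " ".join(pred.split()) == " ".join(target.split())

def syntax_ok(pred):
    # Counts as syntactically accurate if the snippet parses as Python.
    try:
        ast.parse(pred)
        return True
    except SyntaxError:
        return False

def partial_match(pred, target):
    # Fraction of reference tokens that appear in the prediction (assumed definition).
    pred_tokens, target_tokens = set(pred.split()), set(target.split())
    return len(pred_tokens & target_tokens) / max(len(target_tokens), 1)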

Results Summary

High syntax accuracy and a strong partial-match rate; exact match is slightly lower on multi-condition or chained operations.

Environmental Impact

  • Hardware Type: Single NVIDIA A100
  • Hours used: ~4 hours
  • Cloud Provider: AWS
  • Compute Region: US-East
  • Carbon Emitted: ~1.2 kg CO2eq (estimate)

Model Architecture and Objective

  • CodeT5-base (encoder-decoder)
  • Objective: Seq2Seq code generation from natural language + data context

Compute Infrastructure

  • Training done on 1×A100 GPU
  • Fine-tuning with Hugging Face Transformers + PEFT (LoRA)
