Model Card for Csv-AI-Cleaner-V3
Transforms English instructions, together with a data context, into executable pandas code.
Model Description
Csv-AI-Cleaner converts natural language instructions into pandas code for data cleaning, filtering, grouping, sorting, merging, and more.
It is fine-tuned from CodeT5 on synthetic and real-world datasets, using LoRA for parameter-efficient training.
- Developed by: ArhanSD1
- Model type: Seq2Seq Transformer (CodeT5)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Salesforce/codet5-base
Direct Use
- Input: Context (sample dataset) + instruction in natural language
- Output: Executable pandas code snippet
Example:
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR
Instruction: Show IT department employees earning over 45000
Output:
df[(df['department'] == 'IT') & (df['salary'] > 45000)]
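The context block can be produced from any DataFrame. Below is a minimal sketch; the build_context helper is illustrative and not part of the model or its API:

import pandas as pd

def build_context(df: pd.DataFrame, n_rows: int = 2) -> str:
    # Format the header and a small row sample in the pipe-delimited layout shown above
    header = " | ".join(df.columns)
    rows = [" | ".join(str(v) for v in row)
            for row in df.head(n_rows).itertuples(index=False)]
    return "\n".join([header] + rows)

df = pd.DataFrame({
    "employee_id": ["E001", "E002"],
    "name": ["Alice", "Bob"],
    "salary": [50000, 45000],
    "department": ["IT", "HR"],
})
print(build_context(df))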
Out-of-Scope Use
- Ambiguous or poorly defined instructions without dataset context
- Complex multi-step pipelines exceeding the ~500-token context limit (a quick token-count check is sketched below)
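If you are unsure whether a prompt fits, you can count tokens with the model's tokenizer before generating. A quick check; the 512 limit here matches the truncation length used in the getting-started code further down:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("arhansd1/Csv-AI-Cleaner-V3")

def fits_context(text: str, limit: int = 512) -> bool:
    # Tokenize without truncation and compare against the model's input window
    return len(tokenizer(text).input_ids) <= limit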
Bias, Risks, and Limitations
- Works best with clean and clear column names
- May generate suboptimal code if context is incomplete or contains noise
- No awareness of business-logic correctness; the model learns syntax and patterns only
Recommendations
Users should verify generated code for correctness and safety before execution.
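One way to act on this recommendation is to parse the generated snippet and reject obvious red flags before evaluating it. A rough sketch, illustrative only and not a real sandbox or security boundary:

import ast
import pandas as pd

def looks_safe(code: str) -> bool:
    # Reject snippets that fail to parse or that contain imports,
    # dunder attribute access, or names like exec/eval/open.
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False
        if isinstance(node, ast.Name) and node.id in {"exec", "eval", "open", "__import__"}:
            return False
    return True

df = pd.DataFrame({"department": ["IT", "HR"], "salary": [50000, 45000]})
snippet = "df[(df['department'] == 'IT') & (df['salary'] > 45000)]"
if looks_safe(snippet):
    result = eval(snippet, {"df": df})  # evaluate against your own DataFrame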
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
MODEL_REPO = "arhansd1/Csv-AI-Cleaner-V3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_REPO)
def generate_code(input_text):
    # Prepend the task prefix the model was trained with
    prefixed_input = "Generate pandas code: " + input_text
    inputs = tokenizer(prefixed_input, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=128,
            num_beams=5,          # beam search; temperature is omitted since it has no effect without sampling
            early_stopping=True,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Example
input_example = """
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR
Instruction: Show IT department employees earning over 45000
"""
print(generate_code(input_example))
Training Data
- Combination of synthetic data cleaning instructions + public dataset column contexts
- Augmented with filtered Stack Overflow code snippets for pandas tasks
Preprocessing
- Normalized table format for context section
- Instruction phrasing normalized to imperative form (the resulting source/target layout is sketched below)
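The exact preprocessing pipeline is not published; the sketch below only illustrates the normalized source/target format implied by the Direct Use example, and the function and field names are hypothetical:

def make_training_example(columns, rows, instruction, target_code):
    # Normalize the table sample into the pipe-delimited context and pair it
    # with an imperative instruction and the reference pandas snippet.
    # The "Generate pandas code: " prefix is added separately, as in the inference code.
    context = "\n".join([" | ".join(columns)]
                        + [" | ".join(str(v) for v in row) for row in rows])
    source = f"Context:\n{context}\nInstruction: {instruction}"
    return {"source": source, "target": target_code}

example = make_training_example(
    columns=["employee_id", "name", "salary", "department"],
    rows=[["E001", "Alice", 50000, "IT"], ["E002", "Bob", 45000, "HR"]],
    instruction="Show IT department employees earning over 45000",
    target_code="df[(df['department'] == 'IT') & (df['salary'] > 45000)]",
)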
Training Hyperparameters
- LoRA fine-tuning (a configuration sketch follows this list)
- Learning rate: 5e-5
- Epochs: 3
- Batch size: 8
- Precision: fp16 mixed precision
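A configuration sketch using Hugging Face Transformers + PEFT. Only the learning rate, epochs, batch size, and precision come from this card; the LoRA rank, alpha, dropout, and target modules are assumptions:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,   # assumed values, not from the card
    target_modules=["q", "v"],                # T5-style attention projections (assumed)
)
model = get_peft_model(base, lora_cfg)

args = Seq2SeqTrainingArguments(
    output_dir="csv-ai-cleaner-v3",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    fp16=True,
)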
Testing Data
- Held-out set of 500 natural language → pandas task pairs
Metrics
| Metric | Score |
|---|---|
| Exact Match | 71% |
| Partial Match | 92% |
| Syntax Accuracy | 100% |
Results Summary
High syntax accuracy, good partial match rate, slightly lower exact match on multi-condition or chained operations.
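The evaluation scripts are not published; the sketch below shows one plausible way the syntax-accuracy and exact-match checks could be implemented:

import ast

def syntax_ok(code: str) -> bool:
    # Syntax accuracy: does the generated snippet parse as valid Python?
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def exact_match(pred: str, ref: str) -> bool:
    # Exact match after trivial whitespace normalization (illustrative definition)
    return " ".join(pred.split()) == " ".join(ref.split())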
Environmental Impact
- Hardware Type: Single NVIDIA A100
- Hours used: ~4
- Cloud Provider: AWS
- Compute Region: US-East
- Carbon Emitted: ~1.2 kg CO2eq (estimate)
Model Architecture and Objective
- CodeT5-base (encoder-decoder)
- Objective: Seq2Seq code generation from natural language + data context
Compute Infrastructure
- Training done on 1×A100 GPU
- Fine-tuning with Hugging Face Transformers + PEFT (LoRA)
Model Card Contact
- Author: ArhanSD1
- Hugging Face: https://huggingface.co/arhansd1
- Email: N/A