Model Card for Csv-AI-Cleaner-V3
Transforms English instructions, together with a data context, into executable pandas code.
Model Description
Csv-AI-Cleaner converts natural language instructions into pandas code for data cleaning, filtering, grouping, sorting, merging, and more.
It is fine-tuned from CodeT5 on synthetic and real-world datasets, using LoRA for parameter-efficient training.
- Developed by: ArhanSD1
- Model type: Seq2Seq Transformer (CodeT5)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Salesforce/codet5-base
Direct Use
- Input: Context (sample dataset) + instruction in natural language
- Output: Executable pandas code snippet
Example:
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR
Instruction: Show IT department employees earning over 45000
Output:
df[(df['department'] == 'IT') & (df['salary'] > 45000)]
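The context block can be produced from any DataFrame. Below is a minimal sketch; the build_context helper is illustrative and not part of the model or its API:

import pandas as pd

def build_context(df: pd.DataFrame, n_rows: int = 2) -> str:
    # Format the header and a small row sample in the pipe-delimited layout shown above
    header = " | ".join(df.columns)
    rows = [" | ".join(str(v) for v in row)
            for row in df.head(n_rows).itertuples(index=False)]
    return "\n".join([header] + rows)

df = pd.DataFrame({
    "employee_id": ["E001", "E002"],
    "name": ["Alice", "Bob"],
    "salary": [50000, 45000],
    "department": ["IT", "HR"],
})
print(build_context(df))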
Out-of-Scope Use
- Ambiguous or poorly defined instructions without dataset context
- Complex multi-step pipelines exceeding the ~500-token context limit (a quick token-count check is sketched below)
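If you are unsure whether a prompt fits, you can count tokens with the model's tokenizer before generating. A quick check; the 512 limit here matches the truncation length used in the getting-started code further down:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("arhansd1/Csv-AI-Cleaner-V3")

def fits_context(text: str, limit: int = 512) -> bool:
    # Tokenize without truncation and compare against the model's input window
    return len(tokenizer(text).input_ids) <= limit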
Bias, Risks, and Limitations
- Works best with clean and clear column names
- May generate suboptimal code if context is incomplete or contains noise
- No awareness of business-logic correctness; the model learns syntax and patterns only
Recommendations
Users should verify generated code for correctness and safety before execution.
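One way to act on this recommendation is to parse the generated snippet and reject obvious red flags before evaluating it. A rough sketch, illustrative only and not a real sandbox or security boundary:

import ast
import pandas as pd

def looks_safe(code: str) -> bool:
    # Reject snippets that fail to parse or that contain imports,
    # dunder attribute access, or names like exec/eval/open.
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False
        if isinstance(node, ast.Name) and node.id in {"exec", "eval", "open", "__import__"}:
            return False
    return True

df = pd.DataFrame({"department": ["IT", "HR"], "salary": [50000, 45000]})
snippet = "df[(df['department'] == 'IT') & (df['salary'] > 45000)]"
if looks_safe(snippet):
    result = eval(snippet, {"df": df})  # evaluate against your own DataFrame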
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
MODEL_REPO = "arhansd1/Csv-AI-Cleaner-V3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_REPO)
def generate_code(input_text):
    # Prepend the task prefix the model was trained with
    prefixed_input = "Generate pandas code: " + input_text
    inputs = tokenizer(prefixed_input, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=128,
            num_beams=5,          # beam search; temperature is omitted since it has no effect without sampling
            early_stopping=True,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Example
input_example = """
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR
Instruction: Show IT department employees earning over 45000
"""
print(generate_code(input_example))
Training Data
- Combination of synthetic data cleaning instructions + public dataset column contexts
- Augmented with filtered Stack Overflow code snippets for pandas tasks
Preprocessing
- Normalized table format for context section
- Instruction phrasing normalized to imperative form (the resulting source/target layout is sketched below)
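The exact preprocessing pipeline is not published; the sketch below only illustrates the normalized source/target format implied by the Direct Use example, and the function and field names are hypothetical:

def make_training_example(columns, rows, instruction, target_code):
    # Normalize the table sample into the pipe-delimited context and pair it
    # with an imperative instruction and the reference pandas snippet.
    # The "Generate pandas code: " prefix is added separately, as in the inference code.
    context = "\n".join([" | ".join(columns)]
                        + [" | ".join(str(v) for v in row) for row in rows])
    source = f"Context:\n{context}\nInstruction: {instruction}"
    return {"source": source, "target": target_code}

example = make_training_example(
    columns=["employee_id", "name", "salary", "department"],
    rows=[["E001", "Alice", 50000, "IT"], ["E002", "Bob", 45000, "HR"]],
    instruction="Show IT department employees earning over 45000",
    target_code="df[(df['department'] == 'IT') & (df['salary'] > 45000)]",
)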
Training Hyperparameters
- LoRA fine-tuning (a configuration sketch follows this list)
- Learning rate: 5e-5
- Epochs: 3
- Batch size: 8
- Precision: fp16 mixed precision
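A configuration sketch using Hugging Face Transformers + PEFT. Only the learning rate, epochs, batch size, and precision come from this card; the LoRA rank, alpha, dropout, and target modules are assumptions:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,   # assumed values, not from the card
    target_modules=["q", "v"],                # T5-style attention projections (assumed)
)
model = get_peft_model(base, lora_cfg)

args = Seq2SeqTrainingArguments(
    output_dir="csv-ai-cleaner-v3",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    fp16=True,
)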
Testing Data
- Held-out set of 500 natural language → pandas task pairs
Metrics
| Metric | Score |
|---|---|
| Exact Match | 71% |
| Partial Match | 92% |
| Syntax Accuracy | 100% |
Results Summary
High syntax accuracy, good partial match rate, slightly lower exact match on multi-condition or chained operations.
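The evaluation scripts are not published; the sketch below shows one plausible way the syntax-accuracy and exact-match checks could be implemented:

import ast

def syntax_ok(code: str) -> bool:
    # Syntax accuracy: does the generated snippet parse as valid Python?
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def exact_match(pred: str, ref: str) -> bool:
    # Exact match after trivial whitespace normalization (illustrative definition)
    return " ".join(pred.split()) == " ".join(ref.split())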
Environmental Impact
- Hardware Type: Single NVIDIA A100
- Hours used: ~4
- Cloud Provider: AWS
- Compute Region: US-East
- Carbon Emitted: ~1.2 kg CO2eq (estimate)
Model Architecture and Objective
- CodeT5-base (encoder-decoder)
- Objective: Seq2Seq code generation from natural language + data context
Compute Infrastructure
- Training done on 1×A100 GPU
- Fine-tuning with Hugging Face Transformers + PEFT (LoRA)
Model Card Contact
- Author: ArhanSD1
- Hugging Face: https://huggingface.co/arhansd1
- Email: N/A