Qwen 1.5 1.8B - Python Code Generation with Step-by-Step Reasoning
A fine-tuned version of Qwen 1.5 1.8B that generates Python code with detailed step-by-step reasoning explanations. This model teaches users how to solve programming problems by explaining its thought process before writing code.
Model Details
Model Description
This model is fine-tuned using QLoRA on a synthetic dataset of 1,000 Python programming problems enriched with step-by-step reasoning. The model learns to explain its problem-solving approach before generating code, making it ideal for educational purposes and transparent code generation.
- Developed by: Rachit Verma (vrachit)
- Model type: Causal Language Model (Fine-tuned with LoRA adapters)
- Language(s): English (code generation in Python)
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen1.5-1.8B
Model Sources
- Base Model: Qwen/Qwen1.5-1.8B
- Training Data: Synthetic dataset generated from MBPP and CodeAlpaca using Llama 3.1 8B
Uses
Direct Use
This model is designed for:
- Educational code generation: Teaching programming concepts through explained solutions
- Transparent AI coding assistants: Understanding how the model approaches problems
- Code explanation: Generating step-by-step breakdowns of problem-solving strategies
- Learning tool: Helping beginners understand algorithmic thinking
Example Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen1.5-1.8B",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "vrachit/Qwen-1.5-1.8b-PythonCOT-coder")
# Generate code with reasoning
prompt = "Write a Python function to find the longest common prefix in a list of strings."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
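To deploy without the PEFT dependency, the adapter can also be folded into the base weights. This is a minimal sketch using PEFT's merge_and_unload; the output directory name is only an example.
# Optionally merge the LoRA adapter into the base weights for standalone use
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen1.5-1.8b-python-cot-merged")  # example path
tokenizer.save_pretrained("qwen1.5-1.8b-python-cot-merged")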
Out-of-Scope Use
- Production-critical systems: This model is fine-tuned on a limited dataset and should not be used for safety-critical applications
- Non-Python languages: The model is specifically trained on Python problems
- Complex software architecture: Best suited for algorithm-level problems, not large-scale system design
- Security-sensitive code: Should not be used for generating cryptographic or security-critical code without expert review
Bias, Risks, and Limitations
Limitations
- Dataset size: Trained on only 1,000 examples, may not generalize to all problem types
- Teacher model quality: Synthetic data generated by Llama 3.1 8B may contain errors
- Small test set: Evaluated on only 7 problems, true generalization unknown
- Potential overfitting: High accuracy on test set may indicate memorization rather than true learning
- No code validation: Training data was not validated for correctness before fine-tuning
Recommendations
- Always review and test generated code before using in production
- Use as a learning tool rather than a replacement for human expertise
- Validate outputs against test cases and edge cases
- Consider the model's explanations as one perspective, not absolute truth
Training Details
Training Data
- Source datasets: MBPP (Mostly Basic Programming Problems) and CodeAlpaca
- Dataset size: 1,000 Python programming problems
- Data generation: Synthetic step-by-step reasoning generated using Llama 3.1 8B Instant via the Groq API (an illustrative generation call is sketched after this list)
- Data structure: Each example contains:
- Original programming problem
- Step-by-step reasoning (problem understanding, algorithm design, implementation strategy)
- Python solution
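For reference, a teacher call of this kind could look like the sketch below. The exact prompt wording, system message, and sampling temperature used to build this dataset are not documented, so everything beyond the Groq client and the llama-3.1-8b-instant model id is an assumption.
from groq import Groq  # Groq SDK; reads GROQ_API_KEY from the environment

client = Groq()

def generate_reasoning(problem: str, reference_solution: str) -> str:
    # Ask the teacher model for step-by-step reasoning plus a final solution.
    # Prompt wording here is illustrative, not the exact prompt used for this dataset.
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system",
             "content": "You explain how to solve Python programming problems step by step."},
            {"role": "user",
             "content": (
                 "Problem:\n" + problem +
                 "\n\nExplain the problem understanding, algorithm design, and implementation "
                 "strategy step by step, then give the final Python solution.\n\n"
                 "Reference solution:\n" + reference_solution
             )},
        ],
        temperature=0.3,  # assumption
    )
    return response.choices[0].message.content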
Training Procedure
Fine-tuning Method
- Technique: QLoRA (Quantized Low-Rank Adaptation)
- Quantization: 4-bit quantization for memory efficiency
- LoRA Configuration:
- Rank (r): 8
- Alpha: 16
- Target modules: q_proj, k_proj, v_proj, o_proj (attention layers)
- Dropout: 0.05
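Put together, the quantization and adapter settings above correspond roughly to the setup below. The NF4 quantization type and fp16 compute dtype are assumptions beyond the documented "4-bit quantization".
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization of the base model (NF4 / fp16 compute are assumptions)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-1.8B",
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# LoRA adapter on the attention projections, as configured above
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)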
Training Hyperparameters
- Training epochs: 3
- Learning rate: 2e-4
- Optimizer: paged_adamw_8bit
- Batch size: [Specify if known]
- Training regime: Mixed-precision training on a 4-bit quantized base model (QLoRA)
- Hardware: Google Colab T4 GPU (free tier)
- Framework: PEFT 0.17.1, Transformers, bitsandbytes
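The hyperparameters above map onto transformers TrainingArguments roughly as sketched below; the batch size, gradient accumulation, and logging/save settings are not documented, and the values shown are assumptions sized for a 16 GB T4.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen1.5-1.8b-python-cot",  # example output directory
    num_train_epochs=3,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=2,   # assumption: small batch to fit T4 memory
    gradient_accumulation_steps=8,   # assumption
    fp16=True,                       # mixed precision; the T4 has no bf16 support
    logging_steps=10,                # assumption
    save_strategy="epoch",           # assumption
)
# A transformers Trainer (or trl SFTTrainer) would then be constructed with
# the PEFT-wrapped model, the tokenizer, and the 1,000-example dataset.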
Training Time
- Approximately [X hours] on Google Colab T4 GPU
Evaluation
Testing Data & Metrics
Testing Data
- Test set size: 7 diverse Python programming problems
- Problem types: Mix of algorithmic challenges from the training distribution
Metrics
- Primary metric: Pass@1 (functional correctness: does a single generated solution pass its test cases?)
- Secondary metric: Reasoning structure presence (does output include step-by-step explanation?)
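A Pass@1 check of this kind can be implemented by running each generated solution together with its reference tests in a subprocess. The code-extraction and test format used for this evaluation are not documented, so the harness below is an illustrative sketch.
import subprocess
import sys
import tempfile

def passes_tests(generated_code: str, test_code: str, timeout: int = 10) -> bool:
    # Concatenate the candidate solution with its tests and run in a fresh interpreter.
    # Note: model-generated code should be executed in a sandbox in practice.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Pass@1 over a problem set is then the fraction of problems whose single
# sampled solution passes, e.g.:
# pass_at_1 = sum(passes_tests(c, t) for c, t in zip(candidates, tests)) / len(tests)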
Results
| Metric | Base Model (Qwen 1.5 1.8B) | Fine-tuned Model |
|---|---|---|
| Pass@1 | 75% | 100% |
| Reasoning Structure | Inconsistent | 100% |
Key Findings:
- +25 percentage point improvement in functional correctness
- 100% of outputs now include structured step-by-step reasoning
- All 7 test cases passed successfully
Important Note: Results are based on a small test set (7 examples). Larger-scale evaluation needed to confirm generalization.
Environmental Impact
- Hardware Type: NVIDIA T4 GPU (Google Colab)
- Hours used: ~[X hours for fine-tuning]
- Cloud Provider: Google Cloud Platform
- Compute Region: [Specify if known]
- Carbon Emitted: Minimal due to use of QLoRA on single T4 GPU
Carbon emissions can be estimated using the Machine Learning Impact calculator.
Technical Specifications
Model Architecture
- Base architecture: Qwen 1.5 1.8B (Transformer decoder)
- Fine-tuning method: LoRA adapters on attention layers
- Total parameters: 1.8B (base) + ~4.7M (LoRA adapters)
- Trainable parameters: ~4.7M (0.26% of total)
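These counts can be checked directly on the PEFT-wrapped model (assuming model is the object returned by get_peft_model in the training sketch above):
# Report the adapter's share of trainable parameters
model.print_trainable_parameters()
# Prints a line of the form:
# trainable params: ... || all params: ... || trainable%: ...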
Compute Infrastructure
Hardware
- GPU: NVIDIA T4 (16GB VRAM)
- Platform: Google Colab (free tier)
Software
- PEFT 0.17.1
- Transformers
- bitsandbytes (for 4-bit quantization)
- PyTorch
- Groq API (for synthetic data generation)
Project Insights
What Worked Well
- Cross-model knowledge distillation (8B teacher → 1.8B student)
- QLoRA enabled fine-tuning on free-tier GPU
- Structured prompts for synthetic data generation
- Teaching reasoning process alongside code generation
Future Improvements
- Better teacher model: Use Llama 3.1 70B for higher-quality synthetic data
- Data validation: Verify all generated code executes correctly before training
- Larger dataset: Scale to 5,000-10,000 examples
- Robust evaluation: Test on 50-100 problems from benchmarks like HumanEval
- Higher LoRA rank: Experiment with rank 16 or 32 for more capacity
Citation
If you use this model, please cite:
@misc{qwen15-code-reasoning,
  author    = {Rachit Verma},
  title     = {Qwen 1.5 1.8B Fine-tuned for Python Code Generation with Reasoning},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/vrachit/Qwen-1.5-1.8b-PythonCOT-coder}
}
Model Card Authors
Rachit Verma