Qwen 1.5 1.8B - Python Code Generation with Step-by-Step Reasoning

A fine-tuned version of Qwen 1.5 1.8B that generates Python code with detailed step-by-step reasoning explanations. This model teaches users how to solve programming problems by explaining its thought process before writing code.

Model Details

Model Description

This model is fine-tuned using QLoRA on a synthetic dataset of 1,000 Python programming problems enriched with step-by-step reasoning. The model learns to explain its problem-solving approach before generating code, making it ideal for educational purposes and transparent code generation.

  • Developed by: Rachit Verma
  • Model type: Causal Language Model (Fine-tuned with LoRA adapters)
  • Language(s): English (code generation in Python)
  • License: Apache 2.0
  • Finetuned from model: Qwen/Qwen1.5-1.8B

Model Sources

  • Base Model: Qwen/Qwen1.5-1.8B
  • Training Data: Synthetic dataset generated from MBPP and CodeAlpaca using Llama 3.1 8B

Uses

Direct Use

This model is designed for:

  • Educational code generation: Teaching programming concepts through explained solutions
  • Transparent AI coding assistants: Understanding how the model approaches problems
  • Code explanation: Generating step-by-step breakdowns of problem-solving strategies
  • Learning tool: Helping beginners understand algorithmic thinking

Example Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-1.8B",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "vrachit/Qwen-1.5-1.8b-PythonCOT-coder")

# Generate code with reasoning
prompt = "Write a Python function to find the longest common prefix in a list of strings."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
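
If you want a standalone checkpoint that does not require PEFT at inference time, the LoRA adapter can be folded into the base weights. A minimal sketch using PEFT's merge_and_unload; the output path is illustrative:

# Fold the LoRA weights into the base model for use without the PEFT dependency
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen1.5-1.8b-python-cot-merged")  # illustrative path
tokenizer.save_pretrained("qwen1.5-1.8b-python-cot-merged")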

Out-of-Scope Use

  • Production-critical systems: This model is fine-tuned on a limited dataset and should not be used for safety-critical applications
  • Non-Python languages: The model is specifically trained on Python problems
  • Complex software architecture: Best suited for algorithm-level problems, not large-scale system design
  • Security-sensitive code: Should not be used for generating cryptographic or security-critical code without expert review

Bias, Risks, and Limitations

Limitations

  1. Dataset size: Trained on only 1,000 examples, so it may not generalize to all problem types
  2. Teacher model quality: Synthetic data generated by Llama 3.1 8B may contain errors
  3. Small test set: Evaluated on only 7 problems, so true generalization is unknown
  4. Potential overfitting: High accuracy on the test set may indicate memorization rather than true learning
  5. No code validation: Training data was not validated for correctness before fine-tuning

Recommendations

  • Always review and test generated code before using in production
  • Use as a learning tool rather than a replacement for human expertise
  • Validate outputs against test cases and edge cases
  • Consider the model's explanations as one perspective, not absolute truth

Training Details

Training Data

  • Source datasets: MBPP (Mostly Basic Programming Problems) and CodeAlpaca
  • Dataset size: 1,000 Python programming problems
  • Data generation: Synthetic step-by-step reasoning generated using Llama 3.1 8B Instant via the Groq API (see the generation sketch after this list)
  • Data structure: Each example contains:
    • Original programming problem
    • Step-by-step reasoning (problem understanding, algorithm design, implementation strategy)
    • Python solution
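
The data-generation step can be reproduced roughly as follows. This is a minimal sketch assuming the official groq Python client with GROQ_API_KEY set in the environment; the prompt wording and the returned field names are illustrative, not the exact ones used to build the dataset:

from groq import Groq  # pip install groq

client = Groq()  # reads GROQ_API_KEY from the environment

def generate_reasoning_example(problem: str) -> dict:
    """Ask the teacher model for step-by-step reasoning followed by a Python solution."""
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system",
             "content": "Explain your step-by-step reasoning (problem understanding, "
                        "algorithm design, implementation strategy), then write the Python solution."},
            {"role": "user", "content": problem},
        ],
        temperature=0.7,  # illustrative sampling temperature
    )
    return {"problem": problem, "response": response.choices[0].message.content}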

Training Procedure

Fine-tuning Method

  • Technique: QLoRA (Quantized Low-Rank Adaptation)
  • Quantization: 4-bit quantization of the base model weights for memory efficiency
  • LoRA Configuration (see the configuration sketch after this list):
    • Rank (r): 8
    • Alpha: 16
    • Target modules: q_proj, k_proj, v_proj, o_proj (attention layers)
    • Dropout: 0.05
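
The settings above correspond to a PEFT/bitsandbytes configuration along these lines. A minimal sketch: the rank, alpha, dropout, and target modules come from the list above, while the NF4 quantization type and fp16 compute dtype are assumptions:

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization of the frozen base model (NF4 + fp16 compute are assumptions)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapter configuration matching the values listed above
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)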

Training Hyperparameters

  • Training epochs: 3
  • Learning rate: 2e-4
  • Optimizer: paged_adamw_8bit (see the TrainingArguments sketch after this list)
  • Batch size: [Specify if known]
  • Training regime: Mixed precision with a 4-bit quantized base model (QLoRA)
  • Hardware: Google Colab T4 GPU (free tier)
  • Framework: PEFT 0.17.1, Transformers, bitsandbytes
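
Put together, the run corresponds roughly to the TrainingArguments below. A minimal sketch: epochs, learning rate, optimizer, and mixed precision are taken from the list above; the batch size, gradient accumulation, and logging/saving settings are placeholders, since the card does not specify them:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen1.5-1.8b-python-cot",  # illustrative output path
    num_train_epochs=3,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=2,      # placeholder: not specified in the card
    gradient_accumulation_steps=8,      # placeholder: not specified in the card
    fp16=True,                          # mixed precision; fp16 assumed for the T4
    logging_steps=10,                   # placeholder
    save_strategy="epoch",              # placeholder
)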

Training Time

  • Approximately [X hours] on Google Colab T4 GPU

Evaluation

Testing Data & Metrics

Testing Data

  • Test set size: 7 diverse Python programming problems
  • Problem types: Mix of algorithmic challenges from the training distribution

Metrics

  • Primary metric: Pass@1 (functional correctness: does the first generated solution pass the problem's test cases? see the evaluation sketch after this list)
  • Secondary metric: Reasoning structure presence (does the output include a step-by-step explanation?)
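
Concretely, Pass@1 here means generating one completion per problem and checking whether the produced code runs and satisfies that problem's test cases. A minimal sketch of that check; the helper name is hypothetical, and in practice generated code should be executed in a sandbox rather than with a bare exec:

def passes_tests(generated_code: str, test_cases: list[str]) -> bool:
    """Return True if the generated code runs and every assert-style test passes."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate function(s)
        for test in test_cases:
            exec(test, namespace)         # e.g. "assert lcp(['flower', 'flow']) == 'flo'"
        return True
    except Exception:
        return False

# Pass@1 over the 7-problem test set = fraction of problems for which passes_tests(...) is True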

Results

  Metric                Base Model (Qwen 1.5 1.8B)    Fine-tuned Model
  Pass@1                75%                           100%
  Reasoning Structure   Inconsistent                  100%

Key Findings:

  • +25 percentage point improvement in functional correctness
  • 100% of outputs now include structured step-by-step reasoning
  • All 7 test cases passed successfully

Important Note: Results are based on a small test set (7 examples); larger-scale evaluation is needed to confirm generalization.

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU (Google Colab)
  • Hours used: ~[X hours for fine-tuning]
  • Cloud Provider: Google Cloud Platform
  • Compute Region: [Specify if known]
  • Carbon Emitted: Minimal, owing to QLoRA fine-tuning on a single T4 GPU

Carbon emissions can be estimated using the Machine Learning Impact calculator.

Technical Specifications

Model Architecture

  • Base architecture: Qwen 1.5 1.8B (Transformer decoder)
  • Fine-tuning method: LoRA adapters on attention layers
  • Total parameters: 1.8B (base) + ~4.7M (LoRA adapters)
  • Trainable parameters: ~4.7M (0.26% of total; see the sketch after this list)
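
The trainable-parameter figure quoted above can be checked with PEFT once the adapter is attached during fine-tuning. A minimal sketch reusing the LoRA configuration from the Training Procedure section:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # prints trainable vs. total parameter counts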

Compute Infrastructure

Hardware

  • GPU: NVIDIA T4 (16GB VRAM)
  • Platform: Google Colab (free tier)

Software

  • PEFT 0.17.1
  • Transformers
  • bitsandbytes (for 4-bit quantization)
  • PyTorch
  • Groq API (for synthetic data generation)

Project Insights

What Worked Well

  • Cross-model knowledge distillation (8B teacher → 1.8B student)
  • QLoRA enabled fine-tuning on free-tier GPU
  • Structured prompts for synthetic data generation
  • Teaching reasoning process alongside code generation

Future Improvements

  1. Better teacher model: Use Llama 3.1 70B for higher-quality synthetic data
  2. Data validation: Verify all generated code executes correctly before training
  3. Larger dataset: Scale to 5,000-10,000 examples
  4. Robust evaluation: Test on 50-100 problems from benchmarks like HumanEval
  5. Higher LoRA rank: Experiment with rank 16 or 32 for more capacity

Citation

If you use this model, please cite:

@misc{qwen15-code-reasoning,
  author = {Rachit Verma},
  title = {Qwen 1.5 1.8B Fine-tuned for Python Code Generation with Reasoning},
  year = {2025},
  publisher = {HuggingFace},
}

Model Card Authors

Rachit Verma
