Qwen 1.5 1.8B - Python Code Generation with Step-by-Step Reasoning

A fine-tuned version of Qwen 1.5 1.8B that generates Python code with detailed step-by-step reasoning explanations. This model teaches users how to solve programming problems by explaining its thought process before writing code.

Model Details

Model Description

This model is fine-tuned using QLoRA on a synthetic dataset of 1,000 Python programming problems enriched with step-by-step reasoning. The model learns to explain its problem-solving approach before generating code, making it ideal for educational purposes and transparent code generation.

  • Developed by: Rachit Verma
  • Model type: Causal Language Model (Fine-tuned with LoRA adapters)
  • Language(s): English (code generation in Python)
  • License: Apache 2.0
  • Finetuned from model: Qwen/Qwen1.5-1.8B

Model Sources

  • Base Model: Qwen/Qwen1.5-1.8B
  • Training Data: Synthetic dataset generated from MBPP and CodeAlpaca using Llama 3.1 8B

Uses

Direct Use

This model is designed for:

  • Educational code generation: Teaching programming concepts through explained solutions
  • Transparent AI coding assistants: Understanding how the model approaches problems
  • Code explanation: Generating step-by-step breakdowns of problem-solving strategies
  • Learning tool: Helping beginners understand algorithmic thinking

Example Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-1.8B",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "vrachit/Qwen-1.5-1.8b-PythonCOT-coder")

# Generate code with reasoning
prompt = "Write a Python function to find the longest common prefix in a list of strings."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
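
If you want a standalone checkpoint that does not require PEFT at inference time, the LoRA adapter can be folded into the base weights. A minimal sketch using PEFT's merge_and_unload; the output path is illustrative:

# Fold the LoRA weights into the base model for use without the PEFT dependency
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen1.5-1.8b-python-cot-merged")  # illustrative path
tokenizer.save_pretrained("qwen1.5-1.8b-python-cot-merged")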

Out-of-Scope Use

  • Production-critical systems: This model is fine-tuned on a limited dataset and should not be used for safety-critical applications
  • Non-Python languages: The model is specifically trained on Python problems
  • Complex software architecture: Best suited for algorithm-level problems, not large-scale system design
  • Security-sensitive code: Should not be used for generating cryptographic or security-critical code without expert review

Bias, Risks, and Limitations

Limitations

  1. Dataset size: Trained on only 1,000 examples, so it may not generalize to all problem types
  2. Teacher model quality: Synthetic data generated by Llama 3.1 8B may contain errors
  3. Small test set: Evaluated on only 7 problems, so true generalization is unknown
  4. Potential overfitting: High accuracy on the test set may indicate memorization rather than true learning
  5. No code validation: Training data was not validated for correctness before fine-tuning

Recommendations

  • Always review and test generated code before using in production
  • Use as a learning tool rather than a replacement for human expertise
  • Validate outputs against test cases and edge cases
  • Consider the model's explanations as one perspective, not absolute truth

Training Details

Training Data

  • Source datasets: MBPP (Mostly Basic Programming Problems) and CodeAlpaca
  • Dataset size: 1,000 Python programming problems
  • Data generation: Synthetic step-by-step reasoning generated using Llama 3.1 8B Instant via the Groq API (see the generation sketch after this list)
  • Data structure: Each example contains:
    • Original programming problem
    • Step-by-step reasoning (problem understanding, algorithm design, implementation strategy)
    • Python solution
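
The data-generation step can be reproduced roughly as follows. This is a minimal sketch assuming the official groq Python client with GROQ_API_KEY set in the environment; the prompt wording and the returned field names are illustrative, not the exact ones used to build the dataset:

from groq import Groq  # pip install groq

client = Groq()  # reads GROQ_API_KEY from the environment

def generate_reasoning_example(problem: str) -> dict:
    """Ask the teacher model for step-by-step reasoning followed by a Python solution."""
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system",
             "content": "Explain your step-by-step reasoning (problem understanding, "
                        "algorithm design, implementation strategy), then write the Python solution."},
            {"role": "user", "content": problem},
        ],
        temperature=0.7,  # illustrative sampling temperature
    )
    return {"problem": problem, "response": response.choices[0].message.content}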

Training Procedure

Fine-tuning Method

  • Technique: QLoRA (Quantized Low-Rank Adaptation)
  • Quantization: 4-bit quantization of the base model weights for memory efficiency
  • LoRA Configuration (see the configuration sketch after this list):
    • Rank (r): 8
    • Alpha: 16
    • Target modules: q_proj, k_proj, v_proj, o_proj (attention layers)
    • Dropout: 0.05
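
The settings above correspond to a PEFT/bitsandbytes configuration along these lines. A minimal sketch: the rank, alpha, dropout, and target modules come from the list above, while the NF4 quantization type and fp16 compute dtype are assumptions:

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization of the frozen base model (NF4 + fp16 compute are assumptions)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapter configuration matching the values listed above
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)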

Training Hyperparameters

  • Training epochs: 3
  • Learning rate: 2e-4
  • Optimizer: paged_adamw_8bit (see the TrainingArguments sketch after this list)
  • Batch size: [Specify if known]
  • Training regime: Mixed precision with a 4-bit quantized base model (QLoRA)
  • Hardware: Google Colab T4 GPU (free tier)
  • Framework: PEFT 0.17.1, Transformers, bitsandbytes
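
Put together, the run corresponds roughly to the TrainingArguments below. A minimal sketch: epochs, learning rate, optimizer, and mixed precision are taken from the list above; the batch size, gradient accumulation, and logging/saving settings are placeholders, since the card does not specify them:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen1.5-1.8b-python-cot",  # illustrative output path
    num_train_epochs=3,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=2,      # placeholder: not specified in the card
    gradient_accumulation_steps=8,      # placeholder: not specified in the card
    fp16=True,                          # mixed precision; fp16 assumed for the T4
    logging_steps=10,                   # placeholder
    save_strategy="epoch",              # placeholder
)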

Training Time

  • Approximately [X hours] on Google Colab T4 GPU

Evaluation

Testing Data & Metrics

Testing Data

  • Test set size: 7 diverse Python programming problems
  • Problem types: Mix of algorithmic challenges from the training distribution

Metrics

  • Primary metric: Pass@1 (functional correctness: does the first generated solution pass the problem's test cases? see the evaluation sketch after this list)
  • Secondary metric: Reasoning structure presence (does the output include a step-by-step explanation?)
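
Concretely, Pass@1 here means generating one completion per problem and checking whether the produced code runs and satisfies that problem's test cases. A minimal sketch of that check; the helper name is hypothetical, and in practice generated code should be executed in a sandbox rather than with a bare exec:

def passes_tests(generated_code: str, test_cases: list[str]) -> bool:
    """Return True if the generated code runs and every assert-style test passes."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate function(s)
        for test in test_cases:
            exec(test, namespace)         # e.g. "assert lcp(['flower', 'flow']) == 'flo'"
        return True
    except Exception:
        return False

# Pass@1 over the 7-problem test set = fraction of problems for which passes_tests(...) is True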

Results

  Metric                Base Model (Qwen 1.5 1.8B)    Fine-tuned Model
  Pass@1                75%                           100%
  Reasoning Structure   Inconsistent                  100%

Key Findings:

  • +25 percentage point improvement in functional correctness
  • 100% of outputs now include structured step-by-step reasoning
  • All 7 test cases passed successfully

Important Note: Results are based on a small test set (7 examples); larger-scale evaluation is needed to confirm generalization.

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU (Google Colab)
  • Hours used: ~[X hours for fine-tuning]
  • Cloud Provider: Google Cloud Platform
  • Compute Region: [Specify if known]
  • Carbon Emitted: Minimal, owing to QLoRA fine-tuning on a single T4 GPU

Carbon emissions can be estimated using the Machine Learning Impact calculator.

Technical Specifications

Model Architecture

  • Base architecture: Qwen 1.5 1.8B (Transformer decoder)
  • Fine-tuning method: LoRA adapters on attention layers
  • Total parameters: 1.8B (base) + ~4.7M (LoRA adapters)
  • Trainable parameters: ~4.7M (0.26% of total; see the sketch after this list)
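
The trainable-parameter figure quoted above can be checked with PEFT once the adapter is attached during fine-tuning. A minimal sketch reusing the LoRA configuration from the Training Procedure section:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # prints trainable vs. total parameter counts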

Compute Infrastructure

Hardware

  • GPU: NVIDIA T4 (16GB VRAM)
  • Platform: Google Colab (free tier)

Software

  • PEFT 0.17.1
  • Transformers
  • bitsandbytes (for 4-bit quantization)
  • PyTorch
  • Groq API (for synthetic data generation)

Project Insights

What Worked Well

  • Cross-model knowledge distillation (8B teacher → 1.8B student)
  • QLoRA enabled fine-tuning on free-tier GPU
  • Structured prompts for synthetic data generation
  • Teaching reasoning process alongside code generation

Future Improvements

  1. Better teacher model: Use Llama 3.1 70B for higher-quality synthetic data
  2. Data validation: Verify all generated code executes correctly before training
  3. Larger dataset: Scale to 5,000-10,000 examples
  4. Robust evaluation: Test on 50-100 problems from benchmarks like HumanEval
  5. Higher LoRA rank: Experiment with rank 16 or 32 for more capacity

Citation

If you use this model, please cite:

@misc{qwen15-code-reasoning,
  author = {Rachit Verma},
  title = {Qwen 1.5 1.8B Fine-tuned for Python Code Generation with Reasoning},
  year = {2025},
  publisher = {HuggingFace},
}

Model Card Authors

Rachit Verma
