qwen0.5b-tech-interview-test

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B on mathematical reasoning tasks. It has been trained using TRL with QLoRA (Quantized LoRA).

Model Details

  • Base Model: Qwen/Qwen2.5-0.5B
  • Fine-tuning Method: QLoRA (Quantized LoRA) followed by weight merging
  • Task: Mathematical reasoning (GSM8K benchmark)
  • Training Framework: TRL (Transformer Reinforcement Learning)

Training Data

The model was fine-tuned on a mixture of datasets:

  • GSM8K (15.7%): 7,473 samples from the GSM8K training set (human-written natural reasoning)
  • NuminaMath-CoT (84.3%): 40,000 samples from the NuminaMath-CoT dataset (model-generated CoT examples)

Total training samples: 47,473
Train/test split: 90% / 10% (42,726 train / 4,747 test)

Dataset Composition Strategy

The combination strategy aimed to balance:

  • Natural human reasoning patterns from GSM8K
  • Diverse Chain-of-Thought (CoT) patterns from NuminaMath-CoT

Both datasets were converted to a unified messages format compatible with Qwen's chat template.
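
As a rough illustration of this preprocessing, the sketch below loads both sources, converts them to a shared messages schema, and produces the 90/10 split. The exact field mapping and the shuffle/split seeds are assumptions, since the card does not include the preprocessing script.

from datasets import load_dataset, concatenate_datasets

def to_messages(question, answer):
    # Shared chat schema; Qwen's chat template is applied later during training.
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

# GSM8K: 7,473 human-written question/answer pairs.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")
gsm8k = gsm8k.map(lambda ex: to_messages(ex["question"], ex["answer"]),
                  remove_columns=gsm8k.column_names)

# NuminaMath-CoT: model-generated CoT, subsampled to 40,000 examples (seed assumed).
numina = load_dataset("AI-MO/NuminaMath-CoT", split="train")
numina = numina.shuffle(seed=42).select(range(40_000))
numina = numina.map(lambda ex: to_messages(ex["problem"], ex["solution"]),
                    remove_columns=numina.column_names)

# 47,473 samples total, split 90%/10% into train/test.
mixture = concatenate_datasets([gsm8k, numina]).train_test_split(test_size=0.1, seed=42)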

Evaluation Results

GSM8K Benchmark

Metric       | Method           | Few-shot | Score  | Std Error
exact_match  | flexible-extract | 5        | 34.12% | ±1.31%
exact_match  | strict-match     | 5        | 33.59% | ±1.30%
  • Baseline (Qwen2.5-0.5B-Instruct): 34.42% (flexible-extract), 31.69% (strict-match)
  • Comparison with the baseline:
    • Flexible-extract: comparable performance (34.12% vs 34.42%)
    • Strict-match: +1.90 percentage points (33.59% vs 31.69%)
  • Note: The model was fine-tuned on the curated 47,473-sample dataset mixture described above to improve mathematical reasoning capabilities

Evaluation Details

  • Evaluation Tool: EleutherAI's lm-evaluation-harness
  • Inference Engine: vLLM (for efficient batch inference)
  • Test Samples: 1,319 (GSM8K test split)
  • Generation Settings:
    • temperature=0.0
    • do_sample=False
    • max_tokens=256
  • Evaluation Method: Few-shot evaluation with 5 examples
  • Data Leakage Prevention: Only the GSM8K train split was used for training; the test split was held out exclusively for evaluation
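
A run along these lines can be reproduced through the harness's Python API; this is a sketch under the settings listed above (the gsm8k task itself supplies the 5-shot prompts and both answer-extraction variants):

from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=dongwookkwon/qwen0.5b-tech-interview-test,"
        "dtype=float16,trust_remote_code=True"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])  # exact_match for flexible-extract and strict-match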

Training Procedure

Training Hyperparameters

  • Learning Rate: 2e-5 (increased from 5e-6 for faster convergence)
  • Training Epochs: 2 (with early stopping)
  • Batch Size: 1 (per device)
  • Effective Batch Size: 8 (with gradient accumulation)
  • Gradient Accumulation Steps: 8 (increased from 4 for stable gradients)
  • Weight Decay: 0.01
  • Max Gradient Norm: 1.0
  • Max Sequence Length: 2048
  • Warmup Ratio: 0.15 (increased from 0.05 for better training stability)
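
Expressed as a TRL SFTConfig, these settings map roughly to the sketch below; the output path is a placeholder, and field names follow recent TRL releases.

from trl import SFTConfig

training_args = SFTConfig(
    output_dir="qwen0.5b-math-sft",  # placeholder path
    learning_rate=2e-5,
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size: 1 x 8 = 8
    weight_decay=0.01,
    max_grad_norm=1.0,
    max_length=2048,                 # max sequence length
    warmup_ratio=0.15,
    lr_scheduler_type="cosine",      # cosine decay, per the training process below
    gradient_checkpointing=True,
)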

QLoRA Configuration

  • Quantization: 8-bit (BitsAndBytes)
  • Quantization Config:
    • llm_int8_threshold=6.0
    • llm_int8_has_fp16_weight=False
    • llm_int8_enable_fp32_cpu_offload=False
  • LoRA Rank (r): 32 (increased from 16 for more capacity)
  • LoRA Alpha: 64 (increased from 32, typically 2x rank)
  • LoRA Dropout: 0.1
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Trainable Parameters: ~17.6M (3.4% of total parameters: 511.6M)
  • Gradient Checkpointing: Enabled (trades recomputation for lower activation memory)
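
A sketch of the corresponding quantization and adapter setup with BitsAndBytes and PEFT, using the values listed above:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

# 8-bit quantization with the thresholds listed above.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    llm_int8_enable_fp32_cpu_offload=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on all attention and MLP projections.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)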

Training Process

The model was trained using:

  • Training Framework: TRL SFTTrainer with QLoRA
  • Data Formatting: Qwen chat template applied to messages format
  • Evaluation Strategy: Steps (every 250 steps)
  • Checkpoint Saving: Every 500 steps
  • Early Stopping: Enabled with patience=3 (based on eval_loss)
  • Best Model Selection: Based on lowest eval_loss
  • Optimizer: paged_adamw_8bit (8-bit AdamW optimizer for memory efficiency)
  • Learning Rate Schedule: Cosine decay
  • Packing: Enabled (multiple short examples packed into each 2048-token sequence to reduce padding)
  • Model Merging: LoRA weights merged with base model after training for inference
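
Putting the pieces together, the training loop would look roughly like this. It assumes the model, lora_config, training_args, and mixture objects from the sketches above, with the evaluation, saving, packing, and optimizer settings from this list added to the SFTConfig.

from transformers import EarlyStoppingCallback
from trl import SFTTrainer

# Settings from this list, set on the SFTConfig sketched earlier:
#   eval_strategy="steps", eval_steps=250, save_steps=500,
#   load_best_model_at_end=True, metric_for_best_model="eval_loss",
#   optim="paged_adamw_8bit", packing=True

trainer = SFTTrainer(
    model=model,                    # quantized base model from the QLoRA sketch
    args=training_args,             # SFTConfig from the hyperparameter sketch
    train_dataset=mixture["train"],
    eval_dataset=mixture["test"],
    peft_config=lora_config,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()

# Merge the LoRA adapter into the base weights so the model runs without PEFT.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("qwen0.5b-merged")  # placeholder output path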

Key Optimizations

  1. Dataset Curation: Combined GSM8K (human-written) and NuminaMath-CoT (model-generated) for balanced learning
  2. Hyperparameter Tuning: Increased learning rate and warmup ratio for better convergence
  3. Memory Efficiency: 8-bit quantization + gradient checkpointing + LoRA adapters
  4. Training Stability: Gradient accumulation for stable effective-batch gradients, plus early stopping to prevent overfitting

Model Usage

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dongwookkwon/qwen0.5b-tech-interview-test"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Format your question
question = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"

messages = [
    {"role": "user", "content": question}
]

# Apply chat template and move the prompt to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate greedily; temperature is irrelevant when do_sample=False
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=False
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Using vLLM (for faster inference)

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "dongwookkwon/qwen0.5b-tech-interview-test"

model = LLM(
    model=model_name,
    trust_remote_code=True,
    dtype="float16",
    gpu_memory_utilization=0.5
)

sampling_params = SamplingParams(
    temperature=0.0,  # greedy decoding, matching the evaluation settings
    max_tokens=256
)

# The model was trained on Qwen's chat template, so format prompts the same way
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
question = "Question: Natalia sold clips to 48 of her friends in April..."
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True
)

outputs = model.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Limitations

  • Domain Specificity: This model is fine-tuned specifically for mathematical reasoning tasks and may not perform well on other domains
  • Model Size: The 0.5B parameter size limits reasoning capabilities compared to larger models (7B+)
  • Problem Complexity: Performance may vary depending on the complexity of mathematical problems
  • Data Dependency: Model performance is dependent on the quality and diversity of the training data mixture
  • Inference Requirements: While optimized for inference, the model still requires GPU resources for best performance

Training Infrastructure

Framework Versions

  • TRL: 0.24.0 (SFTTrainer)
  • Transformers: 4.57.1
  • PyTorch: 2.8.0
  • Datasets: 4.3.0
  • PEFT: Latest (for LoRA/QLoRA support)
  • BitsAndBytes: Latest (for 8-bit quantization)
  • Accelerate: >=0.26.0
  • lm-evaluation-harness: 0.4.9.1 (for evaluation)
  • vLLM: Latest (for efficient batch inference during evaluation)

Hardware Requirements

  • Training: GPU with CUDA support (tested on A100, T4)
  • Inference: GPU recommended for best performance
  • Memory: ~8GB VRAM minimum for 8-bit QLoRA training

Citation

If you use this model, please cite:

@misc{qwen0.5b-tech-interview-test,
  title={qwen0.5b-tech-interview-test: Fine-tuned Qwen2.5-0.5B for Mathematical Reasoning},
  author={Dongwook Kwon},
  year={2024},
  howpublished={\url{https://huggingface.co/dongwookkwon/qwen0.5b-tech-interview-test}}
}

Base Model Citation

@misc{qwen2.5,
  title={Qwen2.5: A Party of Foundation Models},
  author={Qwen Team},
  year={2024},
  howpublished={\url{https://huggingface.co/Qwen/Qwen2.5-0.5B}}
}

Dataset Citations

@misc{cobbe2021gsm8k,
  title={Training Verifiers to Solve Math Word Problems},
  author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
  year={2021},
  eprint={2110.14168},
  archivePrefix={arXiv}
}

@misc{numinamath,
  title={NuminaMath},
  author={Li, Jia and Beeching, Edward and Tunstall, Lewis and others},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/AI-MO/NuminaMath-CoT}}
}

Acknowledgments

This model was developed as part of a coding challenge focused on optimizing small language models for mathematical reasoning tasks. The approach combines efficient fine-tuning techniques (QLoRA) with curated dataset mixtures to improve performance on the GSM8K benchmark.
