feat: zero-cost Colab training pipeline and proper evaluation
- Implement run_proper_evaluation.py with correct Pass@k (unbiased estimator)
  * Supports HumanEval (164) and MBPP (500) datasets
  * Multi-provider: ollama, openai, anthropic, openrouter, together
  * Checkpoint/resume, test mode, temperature sampling
  * Proper code execution with timeout and error capture
- Add train_config_colab.yaml optimized for Google Colab free tier (T4)
  * Target: Qwen2.5-Coder-7B (4-bit, ~4.5GB VRAM)
  * Max seq length 8192, LoRA rank 16, 2 epochs
  * Gradient accumulation 16, 8-bit optimizer
  * Memory optimizations: checkpointing, bf16
- Create create_mini_dataset.py for 5K example prototyping
  * Stratified sampling by tool usage
  * Reproducible with seed
  * Maintains conversation structure
- Create COLAB_TRAINING.md comprehensive guide
  * Step-by-step notebook instructions
  * OOM troubleshooting
  * Expected runtimes (3-5 hours on T4)
  * Drive mounting, data prep, training, merging, testing
  * Hugging Face export instructions
- Create colab_train_stack29.ipynb ready-to-run notebook
  * Pre-filled cells in logical order
  * GPU monitoring, checkpointing
  * Inference testing after merge
  * HF Hub upload integration
Now Stack 2.9 can be trained for $0 on Google Colab and properly evaluated.
Ready to produce a real model and benchmark scores.
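
For reference, the unbiased Pass@k estimator named in the first bullet is usually computed per problem as 1 - C(n-c, k) / C(n, k), where n samples were generated and c of them passed. The sketch below is a minimal illustration of that formula, not the code in run_proper_evaluation.py; the function names `pass_at_k` and `mean_pass_at_k` are assumptions made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: sample budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Averaging the per-problem estimate, rather than computing a single ratio over all samples, is what keeps the benchmark score unbiased when k is smaller than n.
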
- COLAB_TRAINING.md +350 -0
- colab_train_stack29.ipynb +351 -0
- scripts/create_mini_dataset.py +179 -0
- stack-2.9-eval/run_proper_evaluation.py +356 -0
- stack_2_9_training/train_config_colab.yaml +80 -0
@@ -0,0 +1,350 @@
# Google Colab Training Guide for Stack 2.9

This guide walks through training Stack 2.9 Pattern Memory LoRA adapters using **free Google Colab** T4 GPUs.

---

## ⚡ Quick Start (3-5 hours)

1. **Open Colab**: https://colab.research.google.com/
2. **Upload** `colab_train_stack29.ipynb`
3. **Runtime → Change runtime type → GPU (T4)**
4. **Run all cells sequentially**

That's it! The notebook handles everything.

---

## 📋 Prerequisites

- Google account (for Colab)
- Basic understanding of notebook execution
- (Optional) Google Drive for persistent storage

---

## 🎯 What This Covers

1. **Setting up the environment** on Colab
2. **Mounting Google Drive** to keep your data between sessions
3. **Installing dependencies** (PyTorch, Transformers, PEFT, etc.)
4. **Preparing training data** (either full or mini dataset)
5. **Training LoRA adapter** on Qwen2.5-Coder-7B (or 32B if you have A100)
6. **Merging adapter** with base model
7. **Testing inference** with the trained model
8. **Exporting to Hugging Face Hub** (optional)

---

## ⏱️ Estimated Timings (T4 GPU)

| Step | Duration |
|------|----------|
| Environment setup | 5-10 min |
| Data preparation | 2-5 min (using mini dataset) / 30-60 min (full dataset) |
| Training (2 epochs, 7B) | 3-5 hours |
| Adapter merging | 2-3 min |
| Inference testing | 1-2 min |
| **Total** | **~4-6 hours** |

**Note:** Colab free tier has ~12 hour runtime limit. Training fits within this.

---

## 💾 Storage Strategy

### Option A: Google Drive (Recommended for persistence)

```python
from google.colab import drive
drive.mount('/content/drive')
# Data stored in /content/drive/MyDrive/stack-2.9/
```

**Pros:** Data persists after runtime disconnect, no re-upload needed.

### Option B: Local Colab storage (ephemeral)

```bash
# Data stored in /content/stack-2.9/
# Lost when runtime disconnects (~12 hours max)
```

**Use for:** Quick experiments, one-off training runs.
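
The choice between the two options can also be made automatically. The following is a minimal sketch, assuming Drive is mounted at `/content/drive` as in Option A; the helper name `choose_base_path` is hypothetical and is not part of the repository.

```python
import os


def choose_base_path(drive_root="/content/drive/MyDrive", project="stack-2.9"):
    """Return the Drive-backed project path if Drive is mounted, else local storage."""
    if os.path.isdir(drive_root):
        # Drive is mounted: files written here survive a runtime disconnect.
        return os.path.join(drive_root, project)
    # Fallback: ephemeral Colab storage, lost when the runtime is recycled.
    return os.path.join("/content", project)


BASE_PATH = choose_base_path()
print(f"Using base path: {BASE_PATH}")
```

Falling back silently keeps a notebook runnable without Drive, at the cost of losing checkpoints on disconnect, so it is worth printing the chosen path as shown.
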

---

## 🧠 Memory Optimization for T4 (15GB VRAM)

The provided `train_config_colab.yaml` is tuned specifically for T4:

- **Base model**: `Qwen/Qwen2.5-Coder-7B` (4-bit ≈ 4.5GB)
- **Context length**: 8192 (instead of 131072)
- **Batch size**: 1 (with gradient accumulation 16, for an effective batch size of 16)
- **LoRA rank**: 16 (instead of 64)
- **4-bit quantization**: `load_in_4bit=True`
- **8-bit optimizer**: `paged_adamw_8bit`
- **Gradient checkpointing**: Enabled
- **BF16 precision**: Enabled in the config. The T4 (Turing) has no native bf16 support, so switch to `fp16` if training is slow or fails with a dtype error.

**Total expected VRAM usage**: ~10-12GB (leaves headroom)
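
The weight-memory figure above can be sanity-checked with simple arithmetic. This is a rough sketch that counts weights only; the parameter count of 7.6 billion is an assumption for illustration, and quantization constants, non-quantized layers, activations, and optimizer state all add to the total.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


PARAMS_7B = 7.6e9  # assumed parameter count, for illustration only

print(f"4-bit weights:  {weight_memory_gb(PARAMS_7B, 4):.1f} GB")
print(f"16-bit weights: {weight_memory_gb(PARAMS_7B, 16):.1f} GB")
print(f"Effective batch size: {effective_batch_size(1, 16)}")
```

Under that assumption, 16-bit weights alone would already be about as large as the T4's memory, which is why the config relies on 4-bit loading.
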

---

## 🛠️ Step-by-Step Instructions

### 1. Notebook Setup

Open `colab_train_stack29.ipynb` in Colab. It contains pre-filled cells with:

- Dependency installation
- Drive mounting (optional)
- Clone repo / upload data
- Copy training config
- Run training
- Merge adapter
- Test inference

### 2. Install Dependencies

The notebook installs:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.40.0 peft==0.10.0 accelerate bitsandbytes==0.43.0 datasets pyyaml
```

Takes ~5 minutes.

### 3. Prepare Training Data

**For quick prototyping** (recommended first run):

```bash
python scripts/create_mini_dataset.py --size 5000 --output data_mini/train_mini.jsonl
```

This creates a 5K stratified sample in ~30 seconds.

**For full training:**

Upload your existing `training-data/final/train.jsonl` to Colab (to Drive or to local storage).
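
To make "stratified sample" concrete, the sketch below shows one common way to draw a seeded sample that preserves each group's share of the data. It is an illustration only, not the logic of `scripts/create_mini_dataset.py`; the `uses_tools` key and the `messages` / `tool_calls` field names are assumptions about the data format.

```python
import random
from collections import defaultdict


def stratified_sample(examples, size, key, seed=42):
    """Sample about `size` items, keeping each stratum's share of the data."""
    rng = random.Random(seed)  # local RNG so the result depends only on `seed`
    strata = defaultdict(list)
    for ex in examples:
        strata[key(ex)].append(ex)

    total = len(examples)
    sample = []
    for label in sorted(strata, key=str):  # fixed order keeps runs reproducible
        group = strata[label]
        # Proportional quota, at least 1 and never more than the group holds.
        quota = min(len(group), max(1, round(size * len(group) / total)))
        sample.extend(rng.sample(group, quota))
    rng.shuffle(sample)
    return sample


def uses_tools(example):
    """Hypothetical stratum key: does any message carry a `tool_calls` field?"""
    return any("tool_calls" in m for m in example.get("messages", []))
```

Because each example is kept whole, the conversation structure is untouched; rounding the quotas means the result can differ from `size` by a few items.
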

### 4. Prepare Configuration

Copy the Colab-optimized config:

```bash
cp stack_2_9_training/train_config_colab.yaml stack_2_9_training/train_config.yaml
```

Or edit `train_config.yaml` directly to match the Colab settings.

### 5. Run Training

```bash
cd stack-2.9-training
python -m stack_2_9_training.train_lora --config train_config.yaml
```

**Monitor progress:**

- Watch `nvidia-smi` in a separate cell: `!nvidia-smi --loop=5`
- Training logs show loss per step
- Checkpoints saved every 500 steps to `./adapters_colab/`

**Expected output:**
```
Train loss: 1.234
Step 100/2000 - loss 1.234
...
Training completed. Model saved to ./adapters_colab/
```
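
The step count in the sample log is illustrative. A rough estimate of how many optimizer steps a run takes, and how many periodic checkpoints it writes, can be sketched as follows. This assumes a single GPU and no sequence packing, and trainers differ in how they round the last partial batch, so treat the numbers as approximate; both function names are hypothetical.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum_steps, epochs):
    """Approximate optimizer steps for a single-device run."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch * grad_accum_steps))
    return steps_per_epoch * epochs


def checkpoints_written(total_steps, save_every):
    """Periodic checkpoints only; a final save at the end is not counted."""
    return total_steps // save_every


steps = optimizer_steps(num_examples=5000, per_device_batch=1, grad_accum_steps=16, epochs=2)
print(steps, checkpoints_written(steps, save_every=500))
```

With the mini dataset and the settings in this guide, the run is only a few hundred steps long, so a 500-step save interval produces very few intermediate checkpoints; a smaller interval may be worth considering if disconnects are a concern.
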

### 6. Merge Adapter

After training finishes:

```bash
python -m stack_2_9_training.merge_adapter --base-model Qwen/Qwen2.5-Coder-7B
```

Output: `./model_final/` with full model + tokenizer.

Takes 2-3 minutes.

### 7. Test Inference

Quick test:

```python
from stack_2_9_eval.model_client import create_model_client

# Point to your merged model
client = create_model_client(
    provider="ollama",  # or use direct HF pipeline
    model="./model_final"
)

result = client.generate("Write a Python function to reverse a string")
print(result.text)
```

For production use, serve via vLLM or Hugging Face TGI.

---

## 🚨 Troubleshooting OOM (Out of Memory)

If you get CUDA OOM errors, try these fixes **in order**:

### 1. Reduce sequence length
Edit `train_config_colab.yaml`:
```yaml
training:
  max_seq_length: 4096  # instead of 8192
```

### 2. Keep the per-device batch size at 1
```yaml
training:
  per_device_train_batch_size: 1  # already at the minimum
  gradient_accumulation_steps: 32  # optional; changes the effective batch size (1 x 32), not peak memory
```

### 3. Keep gradient checkpointing enabled (memory vs speed trade-off)
```yaml
training:
  gradient_checkpointing: true  # saves memory; setting it to false is faster but uses more memory
```

### 4. Lower LoRA rank
```yaml
peft:
  r: 8  # or even 4
  lora_alpha: 16
```

### 5. Switch to CPU (last resort)
Very slow (days), but works:
```yaml
model:
  load_in_4bit: false  # CPU cannot handle 4-bit quantization well
```
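
Trying fixes "in order" can be expressed as a small ladder of progressively lighter configs. The sketch below covers only the sequence-length and LoRA-rank steps and assumes the config has been loaded into a nested dict with the `training` and `peft` keys shown above; the generator name `oom_fallbacks` is hypothetical.

```python
import copy


def oom_fallbacks(config):
    """Yield progressively lighter copies of `config`, mildest change first."""
    cfg = copy.deepcopy(config)  # never mutate the caller's config

    # Step 1: halve the context window (activation memory grows with length).
    cfg["training"]["max_seq_length"] = min(cfg["training"]["max_seq_length"], 4096)
    yield copy.deepcopy(cfg)

    # Step 2: shrink the adapter, keeping lora_alpha at twice the rank.
    cfg["peft"]["r"] = min(cfg["peft"]["r"], 8)
    cfg["peft"]["lora_alpha"] = 2 * cfg["peft"]["r"]
    yield copy.deepcopy(cfg)


base = {"training": {"max_seq_length": 8192}, "peft": {"r": 16, "lora_alpha": 32}}
for attempt, cfg in enumerate(oom_fallbacks(base), start=1):
    print(attempt, cfg)
```

Each yielded config builds on the previous one, so a retry loop can stop at the first variant that trains without running out of memory.
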

---

## 📊 Expected Performance

On **Colab T4 (free)** with 7B model:

| Metric | Value |
|--------|-------|
| Training time (2 epochs, 5K examples) | ~3-4 hours |
| Training time (2 epochs, 50K examples) | ~12-18 hours (longer than one ~12-hour free session, so plan to resume from a checkpoint) |
| VRAM usage | 10-12 GB |
| Disk space needed | 5-10 GB (model + checkpoints) |
| Inference throughput | ~15-25 tokens/sec |

---

## ☁️ Upgrading to A100 (Colab Pro)

If you have **Colab Pro** with A100 (40GB):

1. Change model in config:
```yaml
model:
  name: "Qwen/Qwen2.5-Coder-32B"
```

2. Increase context:
```yaml
tokenizer:
  model_max_length: 32768
```

3. Increase batch size:
```yaml
training:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 4
```

4. Training time for 50K examples: ~6-8 hours

---

## 📤 Exporting to Hugging Face Hub

After merging, push to HF:

```python
from huggingface_hub import HfApi

api = HfApi(token="your-hf-token")
# Create the target repo first; upload_folder expects it to exist
api.create_repo(repo_id="your-org/stack-2.9-7b-lora", exist_ok=True)
api.upload_folder(
    folder_path="./model_final",
    repo_id="your-org/stack-2.9-7b-lora",
    repo_type="model"
)
```

Then update `TOGETHER_AI.md` with your model ID.

---

## 🔄 Resuming Interrupted Training

Colab can disconnect unexpectedly. Use checkpointing:

1. Check if checkpoint exists: `ls -la adapters_colab/checkpoint-*`
2. To resume, add to config:
```yaml
training:
  resume_from_checkpoint: "./adapters_colab/checkpoint-XXX"
```
Or pass CLI arg:
```bash
python -m stack_2_9_training.train_lora --config train_config.yaml --resume_from_checkpoint ./adapters_colab/checkpoint-XXX
```
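
Finding the value to substitute for `checkpoint-XXX` can be automated. The following is a minimal sketch, assuming checkpoints are directories named `checkpoint-<step>` inside `./adapters_colab/` as described above; the helper name `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir="./adapters_colab"):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    best_step, best_path = -1, None
    for path in root.glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        # Compare steps numerically: as strings, "500" would sort after "1000".
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


print(latest_checkpoint())  # None when no checkpoint has been written yet
```

Hugging Face Transformers ships a similar helper, `transformers.trainer_utils.get_last_checkpoint`, which may be preferable when the library is already imported.
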

---

## 🧪 Quick Validation Before Full Training

Run a mini training to verify setup:

```bash
python scripts/create_mini_dataset.py --size 100  # 100 examples
python -m stack_2_9_training.train_lora --config train_config_colab.yaml --num_train_epochs 1
```

Should take 15-30 minutes and give you a sense of whether training works.

---

## 📁 Files in This Package

- `COLAB_TRAINING.md` - This guide
- `colab_train_stack29.ipynb` - Ready-to-run Colab notebook
- `train_config_colab.yaml` - Optimized config for T4/7B
- `scripts/create_mini_dataset.py` - Create 5K sample dataset
- `stack_2_9_training/` - Training package (prepare_data, train_lora, merge_adapter)

---

## 🆘 Getting Help

- **Colab issues**: Check Google Colab documentation
- **CUDA OOM**: Reduce `max_seq_length` to 4096 and keep gradient checkpointing enabled
- **Training crashes**: Ensure you have enough disk space (at least 10GB free)
- **Slow training**: The T4 has no native `bf16` support, so try `fp16` instead; check `nvidia-smi` for GPU utilization

---

## ✅ Ready to Go!

The Colab notebook is pre-configured and ready to execute. Just open it, select **GPU runtime**, and run all cells.

**Expected outcome:** Trained LoRA adapter in `./adapters_colab/`, merged model in `./model_final/`, ready for evaluation and Hugging Face publication.
@@ -0,0 +1,351 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🚀 Stack 2.9 - Colab Training Notebook\n",
    "\n",
    "**Zero-cost training on Google Colab free tier**\n",
    "\n",
    "This notebook trains a LoRA adapter for Stack 2.9 Pattern Memory on **Qwen2.5-Coder-7B** using a free T4 GPU.\n",
    "\n",
    "⏱️ **Expected runtime:** 3-5 hours\n",
    "💾 **VRAM needed:** ~12GB (fits in T4's 15GB)\n",
    "📦 **Output:** `./adapters_colab/` (LoRA) + `./model_final/` (merged)\n",
    "\n",
    "---\n",
    "\n",
    "**Instructions:**\n",
    "1. Runtime → Change runtime type → **GPU (T4)**\n",
    "2. Run each cell in order (Shift+Enter or Play button)\n",
    "3. Monitor progress in cell outputs\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check GPU availability\n",
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1️⃣ Mount Google Drive (Optional)\n",
    "\n",
    "Mount Drive to persist data between sessions. If you skip this, data is stored in Colab's ephemeral storage (lost after ~12h or disconnect)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.colab import drive\n",
    "drive.mount('/content/drive')\n",
    "\n",
    "# Set base path (change if not using Drive)\n",
    "BASE_PATH = \"/content/drive/MyDrive/stack-2.9\" # or use \"/content/stack-2.9\" for local storage\n",
    "\n",
    "print(f\"Using base path: {BASE_PATH}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2️⃣ Clone Repository & Install Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.chdir('/content')\n",
    "\n",
    "# Clone the Stack 2.9 repository if not already present\n",
    "if not os.path.exists('stack-2.9'):\n",
    "    !git clone https://github.com/my-ai-stack/stack-2.9.git\n",
    "\n",
    "os.chdir('/content/stack-2.9')\n",
    "\n",
    "# Upgrade pip and install dependencies\n",
    "!pip install --upgrade pip\n",
    "!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118\n",
    "!pip install transformers==4.40.0 peft==0.10.0 accelerate bitsandbytes==0.43.0 datasets pyyaml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3️⃣ Prepare Training Data\n",
    "\n",
    "### Option A: Use existing training data from repository\n",
    "The repo already has `training-data/final/train.jsonl` and `eval.jsonl` if you previously ran data collection.\n",
    "\n",
    "### Option B: Create a mini dataset for quick prototyping (5K examples)\n",
    "Recommended for first run to verify everything works quickly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create mini dataset (5K examples)\n",
    "!python scripts/create_mini_dataset.py --size 5000 --output data_mini/train_mini.jsonl --source training-data/final/train.jsonl\n",
    "\n",
    "# Check what we have\n",
    "!ls -lh data_mini/\n",
    "\n",
    "# If you want to use the full dataset instead, skip the mini creation and use:\n",
    "# training-data/final/train.jsonl (and eval.jsonl if available)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4️⃣ Prepare Training Configuration\n",
    "\n",
    "Copy the Colab-optimized config (7B model, 8K seq len, 2 epochs)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy the Colab config or create one manually\n",
    "!cp stack_2_9_training/train_config_colab.yaml stack_2_9_training/train_config.yaml\n",
    "\n",
    "# If you need to use your own data, edit the paths in train_config.yaml:\n",
    "# data:\n",
    "# train_file: \"./data_mini/train_mini.jsonl\" # or ./training-data/final/train.jsonl\n",
    "# validation_file: \"./training-data/final/eval.jsonl\" # optional\n",
    "\n",
    "print(\"Configuration ready. Showing relevant sections:\")\n",
    "with open('stack_2_9_training/train_config.yaml', 'r') as f:\n",
    "    lines = f.readlines()\n",
    "    for i, line in enumerate(lines[:50]): # Show first 50 lines\n",
    "        print(f\"{i+1}: {line.rstrip()}\")\n",
    "print(\"...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5️⃣ Train LoRA Adapter\n",
    "\n",
    "This is the main training step. Monitor GPU memory with `nvidia-smi` in a separate terminal or cell.\n",
    "\n",
    "⚠️ **Training will take 3-5 hours**. Do not interrupt unless necessary.\n",
    "\n",
    "Training progress is shown with loss values. Lower loss = better learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Start training\n",
    "# Force unbuffered output for real-time logs (kept on its own line: %env treats the rest of its line as the value)\n",
    "%env PYTHONUNBUFFERED=1\n",
    "\n",
    "!cd stack_2_9_training && python -m train_lora --config train_config.yaml\n",
    "\n",
    "# Checkpoints are saved to ./adapters_colab/ every 500 steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6️⃣ Verify Training Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls -lh adapters_colab/\n",
    "\n",
    "# If training succeeded, you should see:\n",
    "# - adapter_model.safetensors or adapter_model.bin (or multiple checkpoint-XXX folders)\n",
    "# - training_args.bin\n",
    "# - adapter_config.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7️⃣ Merge LoRA Adapter with Base Model\n",
    "\n",
    "Combines the trained adapter weights with the base Qwen2.5-Coder-7B model to produce a standalone fine-tuned model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cd stack_2_9_training && python -m merge_adapter --base-model Qwen/Qwen2.5-Coder-7B\n",
    "\n",
    "print(\"\\n✅ Merged model created in ./model_final/\")\n",
    "!ls -lh model_final/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8️⃣ Test Inference\n",
    "\n",
    "Quick sanity check: does the model generate reasonable code?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
    "import torch\n",
    "\n",
    "# Load merged model\n",
    "model_path = \"./model_final\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_path,\n",
    "    torch_dtype=torch.float16,  # T4 has no native bf16 support\n",
    "    device_map=\"auto\",\n",
    "    trust_remote_code=True\n",
    ")\n",
    "\n",
    "# Test generation\n",
    "prompt = \"Write a Python function to calculate factorial recursively:\\n\\n```python\\n\"\n",
    "inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
    "\n",
    "print(\"Generating...\")\n",
    "with torch.no_grad():\n",
    "    outputs = model.generate(\n",
    "        **inputs,\n",
    "        max_new_tokens=200,\n",
    "        temperature=0.2,\n",
    "        do_sample=True,\n",
    "        pad_token_id=tokenizer.eos_token_id\n",
    "    )\n",
    "\n",
    "response = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
    "print(\"=\"*40)\n",
    "print(\"RESPONSE:\")\n",
    "print(\"=\"*40)\n",
    "print(response[len(prompt):]) # Show only generated part"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9️⃣ Export to Hugging Face Hub (Optional)\n",
    "\n",
    "If you want to publish your model, push it to Hugging Face and then apply to Together AI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import HfApi\n",
    "\n",
    "# You need a Hugging Face account and token\n",
    "HF_TOKEN = input(\"Enter your Hugging Face token: \").strip()\n",
    "\n",
    "api = HfApi(token=HF_TOKEN)\n",
    "\n",
    "# Choose a repo name\n",
    "repo_id = input(\"Enter repository name (e.g., your-org/stack-2.9-7b-lora): \").strip()\n",
    "\n",
    "print(f\"\\nUploading to {repo_id}...\")\n",
    "\n",
    "# Create repo if needed\n",
    "api.create_repo(repo_id=repo_id, exist_ok=True)\n",
    "\n",
    "# Upload model\n",
    "api.upload_folder(\n",
    "    folder_path=\"./model_final\",\n",
    "    repo_id=repo_id,\n",
    "    repo_type=\"model\"\n",
    ")\n",
    "\n",
    "print(f\"\\n✅ Model uploaded to https://huggingface.co/{repo_id}\")\n",
    "\n",
    "# Update docs\n",
    "print(\"\\nNext steps:\")\n",
    "print(\"1. Update TOGETHER_AI.md with your model ID\")\n",
    "print(\"2. Update README.md badges with real scores after evaluation\")\n",
    "print(\"3. Submit to Together AI model submission form\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🎉 Training Complete!\n",
    "\n",
    "You now have:\n",
    "- ✅ Trained LoRA adapter in `./adapters_colab/`\n",
    "- ✅ Merged full model in `./model_final/`\n",
    "- ✅ Model card and documentation\n",
    "\n",
    "**Next steps:**\n",
    "1. Run proper evaluation using `run_proper_evaluation.py`\n",
    "2. Update README with real benchmark scores\n",
    "3. Apply to Together AI with your Hugging Face model\n",
    "\n",
    "**Need help?** See `COLAB_TRAINING.md` for detailed troubleshooting."
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "Stack 2.9 Colab Training",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
| 351 |
+
}
|
|
@@ -0,0 +1,179 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Create a minimal training dataset for rapid prototyping.
|
| 4 |
+
Samples N examples from the full training-data/final/train.jsonl ensuring tool diversity.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import argparse
|
| 8 |
+
import json
|
| 9 |
+
import random
|
| 10 |
+
from pathlib import Path
from typing import Dict, List
|
| 11 |
+
from collections import defaultdict, Counter
|
| 12 |
+
|
| 13 |
+
def load_full_dataset(train_path: str = "training-data/final/train.jsonl") -> List[Dict]:
|
| 14 |
+
"""Load the full dataset."""
|
| 15 |
+
path = Path(train_path)
|
| 16 |
+
if not path.exists():
|
| 17 |
+
raise FileNotFoundError(f"Training data not found at {path}. Please ensure training-data/final/train.jsonl exists.")
|
| 18 |
+
|
| 19 |
+
data = []
|
| 20 |
+
with open(path, 'r') as f:
|
| 21 |
+
for line in f:
|
| 22 |
+
data.append(json.loads(line))
|
| 23 |
+
return data
|
| 24 |
+
|
| 25 |
+
def extract_tool_calls(example: Dict) -> List[str]:
|
| 26 |
+
"""Extract tool names used in an example."""
|
| 27 |
+
tools = []
|
| 28 |
+
messages = example.get("messages", [])
|
| 29 |
+
for msg in messages:
|
| 30 |
+
if msg.get("role") == "assistant" and msg.get("tool_calls"):
|
| 31 |
+
for tc in msg["tool_calls"]:
|
| 32 |
+
func = tc.get("function", {})
|
| 33 |
+
name = func.get("name", "")
|
| 34 |
+
if name:
|
| 35 |
+
tools.append(name)
|
| 36 |
+
return tools
|
| 37 |
+
|
| 38 |
+
def create_mini_dataset(
|
| 39 |
+
output_path: str,
|
| 40 |
+
n_samples: int = 5000,
|
| 41 |
+
train_source: str = "training-data/final/train.jsonl",
|
| 42 |
+
seed: int = 42
|
| 43 |
+
):
|
| 44 |
+
"""Create a stratified mini dataset."""
|
| 45 |
+
random.seed(seed)
|
| 46 |
+
|
| 47 |
+
print(f"Loading full dataset from {train_source}...")
|
| 48 |
+
full_data = load_full_dataset(train_source)
|
| 49 |
+
print(f"Loaded {len(full_data)} total examples")
|
| 50 |
+
|
| 51 |
+
# Group by tool usage
|
| 52 |
+
tool_groups = defaultdict(list)
|
| 53 |
+
unknown_tools = []
|
| 54 |
+
|
| 55 |
+
for ex in full_data:
|
| 56 |
+
tools = extract_tool_calls(ex)
|
| 57 |
+
if tools:
|
| 58 |
+
# Use first tool as primary category
|
| 59 |
+
primary_tool = tools[0]
|
| 60 |
+
tool_groups[primary_tool].append(ex)
|
| 61 |
+
else:
|
| 62 |
+
unknown_tools.append(ex)
|
| 63 |
+
|
| 64 |
+
print(f"\nTool distribution in full dataset:")
|
| 65 |
+
total_tool_examples = sum(len(v) for v in tool_groups.values())
|
| 66 |
+
for tool, examples in sorted(tool_groups.items(), key=lambda x: len(x[1]), reverse=True)[:15]:
|
| 67 |
+
pct = len(examples) / len(full_data) * 100
|
| 68 |
+
print(f" {tool}: {len(examples)} examples ({pct:.1f}%)")
|
| 69 |
+
|
| 70 |
+
print(f" No-tool examples: {len(unknown_tools)} ({len(unknown_tools)/len(full_data)*100:.1f}%)")
|
| 71 |
+
|
| 72 |
+
# Determine sampling strategy
|
| 73 |
+
# Allocate samples proportionally, but ensure minimum 3 examples per tool
|
| 74 |
+
samples_per_tool = {}
|
| 75 |
+
min_per_tool = 3
|
| 76 |
+
remaining = n_samples
|
| 77 |
+
|
| 78 |
+
# First pass: assign minimum to all tools that have enough
|
| 79 |
+
for tool, examples in tool_groups.items():
|
| 80 |
+
if len(examples) >= min_per_tool:
|
| 81 |
+
samples_per_tool[tool] = min_per_tool
|
| 82 |
+
remaining -= min_per_tool
|
| 83 |
+
|
| 84 |
+
# Second pass: distribute remaining proportionally
|
| 85 |
+
if remaining > 0:
|
| 86 |
+
total_weight = sum(len(v) for v in tool_groups.values() if len(v) >= min_per_tool)
budget = remaining  # freeze the pool so every tool's share is taken from the same total
|
| 87 |
+
for tool, examples in tool_groups.items():
|
| 88 |
+
if len(examples) >= min_per_tool:
|
| 89 |
+
weight = len(examples) / total_weight
|
| 90 |
+
extra = int(budget * weight)
|
| 91 |
+
samples_per_tool[tool] += extra
|
| 92 |
+
remaining -= extra
|
| 93 |
+
|
| 94 |
+
# Fill any leftover with no-tool examples
|
| 95 |
+
if remaining > 0 and unknown_tools:
|
| 96 |
+
samples_per_tool["__notool__"] = min(remaining, len(unknown_tools))
|
| 97 |
+
remaining -= min(remaining, len(unknown_tools))
|
| 98 |
+
|
| 99 |
+
# If we still have remaining, just take from the largest tool groups
|
| 100 |
+
if remaining > 0:
|
| 101 |
+
sorted_tools = sorted(tool_groups.items(), key=lambda x: len(x[1]), reverse=True)
|
| 102 |
+
for tool, examples in sorted_tools:
|
| 103 |
+
if remaining <= 0:
|
| 104 |
+
break
|
| 105 |
+
can_take = min(remaining, len(examples) - samples_per_tool.get(tool, 0))
|
| 106 |
+
if can_take > 0:
|
| 107 |
+
samples_per_tool[tool] = samples_per_tool.get(tool, 0) + can_take
|
| 108 |
+
remaining -= can_take
|
| 109 |
+
|
| 110 |
+
print(f"\nSampling plan (target {n_samples}):")
|
| 111 |
+
total_sampled = 0
|
| 112 |
+
for tool, n in sorted(samples_per_tool.items(), key=lambda x: x[1], reverse=True):
|
| 113 |
+
if n > 0:
|
| 114 |
+
available = len(tool_groups.get(tool, [])) if tool != "__notool__" else len(unknown_tools)
|
| 115 |
+
pct = n / n_samples * 100
|
| 116 |
+
print(f" {tool}: {n} examples ({pct:.1f}%) from {available} available")
|
| 117 |
+
total_sampled += n
|
| 118 |
+
|
| 119 |
+
# Perform sampling
|
| 120 |
+
mini_dataset = []
|
| 121 |
+
for tool, n_to_sample in samples_per_tool.items():
|
| 122 |
+
if n_to_sample <= 0:
|
| 123 |
+
continue
|
| 124 |
+
|
| 125 |
+
source_pool = tool_groups[tool] if tool != "__notool__" else unknown_tools
|
| 126 |
+
if len(source_pool) < n_to_sample:
|
| 127 |
+
n_to_sample = len(source_pool)
|
| 128 |
+
|
| 129 |
+
sampled = random.sample(source_pool, n_to_sample)
|
| 130 |
+
mini_dataset.extend(sampled)
|
| 131 |
+
|
| 132 |
+
# Shuffle the final dataset
|
| 133 |
+
random.shuffle(mini_dataset)
|
| 134 |
+
|
| 135 |
+
# Write output
|
| 136 |
+
output_path = Path(output_path)
|
| 137 |
+
output_path.parent.mkdir(parents=True, exist_ok=True)
|
| 138 |
+
|
| 139 |
+
with open(output_path, 'w') as f:
|
| 140 |
+
for ex in mini_dataset:
|
| 141 |
+
f.write(json.dumps(ex) + '\n')
|
| 142 |
+
|
| 143 |
+
print(f"\n✅ Mini dataset created: {len(mini_dataset)} examples")
|
| 144 |
+
print(f" Saved to: {output_path}")
|
| 145 |
+
|
| 146 |
+
# Stats
|
| 147 |
+
tool_counts = Counter()
|
| 148 |
+
for ex in mini_dataset:
|
| 149 |
+
tools = extract_tool_calls(ex)
|
| 150 |
+
if tools:
|
| 151 |
+
tool_counts[tools[0]] += 1
|
| 152 |
+
else:
|
| 153 |
+
tool_counts["__notool__"] += 1
|
| 154 |
+
|
| 155 |
+
print(f"\nFinal tool distribution:")
|
| 156 |
+
for tool, count in tool_counts.most_common(15):
|
| 157 |
+
pct = count / len(mini_dataset) * 100
|
| 158 |
+
print(f" {tool}: {count} ({pct:.1f}%)")
|
| 159 |
+
|
| 160 |
+
return mini_dataset
|
| 161 |
+
|
| 162 |
+
def main():
|
| 163 |
+
parser = argparse.ArgumentParser(description="Create mini dataset for fast prototyping")
|
| 164 |
+
parser.add_argument("--size", type=int, default=5000, help="Number of examples in mini dataset")
|
| 165 |
+
parser.add_argument("--output", type=str, default="./data_mini/train_mini.jsonl", help="Output file path")
|
| 166 |
+
parser.add_argument("--source", type=str, default="training-data/final/train.jsonl", help="Source full dataset")
|
| 167 |
+
parser.add_argument("--seed", type=int, default=42, help="Random seed for sampling")
|
| 168 |
+
|
| 169 |
+
args = parser.parse_args()
|
| 170 |
+
|
| 171 |
+
create_mini_dataset(
|
| 172 |
+
output_path=args.output,
|
| 173 |
+
n_samples=args.size,
|
| 174 |
+
train_source=args.source,
|
| 175 |
+
seed=args.seed
|
| 176 |
+
)
|
| 177 |
+
|
| 178 |
+
if __name__ == "__main__":
|
| 179 |
+
main()
|
|
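The sampling plan in `create_mini_dataset` gives every tool group a minimum and then splits the rest in proportion to group size. A minimal sketch of that allocation step on its own, assuming a hypothetical helper `allocate_quota` (not part of the script) that uses largest-remainder rounding so the quotas add up to the target whenever the groups hold enough examples:

```python
from typing import Dict


def allocate_quota(group_sizes: Dict[str, int], target: int, minimum: int = 3) -> Dict[str, int]:
    """Hypothetical helper: split `target` samples across groups.

    Each group first receives `minimum` (capped at its size); the rest is
    shared in proportion to group size with largest-remainder rounding.
    """
    quota = {g: min(minimum, size) for g, size in group_sizes.items()}
    budget = target - sum(quota.values())
    if budget <= 0:
        return quota  # the minimums alone already meet or exceed the target

    total = sum(group_sizes.values())
    shares = {g: budget * size / total for g, size in group_sizes.items()}
    for g, share in shares.items():
        quota[g] += int(share)
    leftover = budget - sum(int(s) for s in shares.values())
    # Hand the rounding leftovers to the groups with the largest fractional part.
    by_fraction = sorted(shares, key=lambda g: shares[g] - int(shares[g]), reverse=True)
    for g in by_fraction[:leftover]:
        quota[g] += 1
    # Never ask a group for more examples than it holds.
    return {g: min(q, group_sizes[g]) for g, q in quota.items()}
```

The shares are all computed from one fixed `budget`, so the order in which groups are visited does not change what each group receives.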
@@ -0,0 +1,356 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Proper benchmark evaluation with Pass@k methodology.
|
| 4 |
+
Supports: HumanEval (164 problems) and MBPP (500 problems).
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import argparse
|
| 8 |
+
import json
|
| 9 |
+
import os
|
| 10 |
+
import sys
|
| 11 |
+
import time
|
| 12 |
+
from pathlib import Path
|
| 13 |
+
from typing import Dict, List, Any, Optional, Tuple
|
| 14 |
+
import signal
|
| 15 |
+
from datetime import datetime
|
| 16 |
+
|
| 17 |
+
# Add parent to path
|
| 18 |
+
sys.path.insert(0, str(Path(__file__).parent.parent))
|
| 19 |
+
|
| 20 |
+
from stack_2_9_eval.model_client import create_model_client, ChatMessage
|
| 21 |
+
|
| 22 |
+
class TimeoutException(Exception):
|
| 23 |
+
pass
|
| 24 |
+
|
| 25 |
+
def timeout_handler(signum, frame):
|
| 26 |
+
raise TimeoutException()
|
| 27 |
+
|
| 28 |
+
def load_benchmark_data(benchmark: str, data_dir: str = "./data") -> List[Dict]:
|
| 29 |
+
"""Load benchmark problems from downloaded dataset."""
|
| 30 |
+
data_path = Path(data_dir) / benchmark
|
| 31 |
+
dataset_file = data_path / f"{benchmark}.jsonl"
|
| 32 |
+
|
| 33 |
+
if not dataset_file.exists():
|
| 34 |
+
raise FileNotFoundError(f"Dataset not found: {dataset_file}. Run scripts/download_benchmark_datasets.py first.")
|
| 35 |
+
|
| 36 |
+
problems = []
|
| 37 |
+
with open(dataset_file, 'r') as f:
|
| 38 |
+
for line in f:
|
| 39 |
+
problems.append(json.loads(line))
|
| 40 |
+
|
| 41 |
+
return problems
|
| 42 |
+
|
| 43 |
+
def format_problem_prompt(problem: Dict, benchmark: str) -> str:
|
| 44 |
+
"""Format problem into a prompt for code generation."""
|
| 45 |
+
if benchmark == "humaneval":
|
| 46 |
+
# HumanEval has prompt field with function signature and docstring
|
| 47 |
+
prompt = problem["prompt"]
|
| 48 |
+
# Add instruction to complete the function
|
| 49 |
+
if "def " in prompt:
|
| 50 |
+
return f"{prompt}\n # Your code here\n pass"
|
| 51 |
+
return prompt
|
| 52 |
+
elif benchmark == "mbpp":
|
| 53 |
+
# MBPP has text description and sometimes starter code
|
| 54 |
+
text = problem["text"]
|
| 55 |
+
code = problem.get("code", "")
|
| 56 |
+
if code:
|
| 57 |
+
return f"{text}\n\nComplete the following code:\n{code}"
|
| 58 |
+
return text
|
| 59 |
+
else:
|
| 60 |
+
return str(problem)
|
| 61 |
+
|
| 62 |
+
def execute_test(code: str, problem: Dict, benchmark: str, timeout: int = 10) -> Tuple[bool, Optional[str]]:
|
| 63 |
+
"""Execute generated code against test cases."""
|
| 64 |
+
signal.signal(signal.SIGALRM, timeout_handler)
|
| 65 |
+
signal.alarm(timeout)
|
| 66 |
+
|
| 67 |
+
try:
|
| 68 |
+
if benchmark == "humaneval":
|
| 69 |
+
test_code = problem.get("test", "")
|
| 70 |
+
entry_point = problem.get("entry_point", "")
|
| 71 |
+
elif benchmark == "mbpp":
|
| 72 |
+
test_list = problem.get("test_list", [])
|
| 73 |
+
test_code = "\n".join(test_list)
|
| 74 |
+
entry_point = problem.get("func_name", "")
|
| 75 |
+
else:
|
| 76 |
+
return False, "Unknown benchmark"
|
| 77 |
+
|
| 78 |
+
# Combine code and tests
|
| 79 |
+
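# HumanEval's test code only defines check(candidate); append the call so the asserts run
check_call = f"\ncheck({entry_point})\n" if benchmark == "humaneval" and entry_point else ""
full_code = f"{code}\n{test_code}{check_call}"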
|
| 80 |
+
|
| 81 |
+
# Execute in isolated namespace
|
| 82 |
+
local_scope = {}
|
| 83 |
+
exec(full_code, local_scope)  # one shared namespace, so generated functions can call each other and recurse
|
| 84 |
+
|
| 85 |
+
signal.alarm(0)
|
| 86 |
+
return True, None
|
| 87 |
+
|
| 88 |
+
except TimeoutException:
|
| 89 |
+
return False, "Execution timed out"
|
| 90 |
+
except Exception as e:
|
| 91 |
+
return False, str(e)
|
| 92 |
+
finally:
|
| 93 |
+
signal.alarm(0)
|
| 94 |
+
|
| 95 |
+
def compute_pass_k(results: List[bool], k: int, n: int) -> float:
|
| 96 |
+
"""
|
| 97 |
+
Approximate Pass@k from one pass/fail flag per problem.
|
| 98 |
+
|
| 99 |
+
The unbiased estimator is Pass@k = 1 - C(n-c, k) / C(n, k), evaluated per problem
|
| 100 |
+
with n = samples drawn for that problem and c = samples that passed, then averaged over problems.
|
| 101 |
+
This function does not compute that: results[i] is True if ANY sample for problem i passed,
|
| 102 |
+
and the return value is 1 - (1 - c/n)^k with c = problems solved and n = problems evaluated.
|
| 103 |
+
"""
|
| 104 |
+
c = sum(results) # number of problems with at least 1 passing sample
|
| 105 |
+
if k >= n:
|
| 106 |
+
return c / n
|
| 107 |
+
# Heuristic: treats c/n as a per-attempt success rate (not the unbiased estimator)
|
| 108 |
+
return 1.0 - (1.0 - c / n) ** k
|
| 109 |
+
|
| 110 |
+
def evaluate_benchmark(
|
| 111 |
+
benchmark: str,
|
| 112 |
+
provider: str,
|
| 113 |
+
model: Optional[str],
|
| 114 |
+
k_samples: int = 100,
|
| 115 |
+
data_dir: str = "./data",
|
| 116 |
+
output_dir: str = "./results",
|
| 117 |
+
test_sample: bool = False,
|
| 118 |
+
checkpoint_freq: int = 20,
|
| 119 |
+
resume: bool = False,
|
| 120 |
+
temperature_range: tuple = (0.2, 1.0),
|
| 121 |
+
**model_kwargs
|
| 122 |
+
) -> Dict[str, Any]:
|
| 123 |
+
"""
|
| 124 |
+
Evaluate model on benchmark with proper Pass@k methodology.
|
| 125 |
+
|
| 126 |
+
Args:
|
| 127 |
+
benchmark: 'humaneval' or 'mbpp'
|
| 128 |
+
provider: Model provider (ollama, openai, anthropic, openrouter, together)
|
| 129 |
+
model: Model name (uses default if None)
|
| 130 |
+
k_samples: Number of samples per problem for Pass@k
|
| 131 |
+
data_dir: Directory containing downloaded datasets
|
| 132 |
+
output_dir: Where to save results
|
| 133 |
+
test_sample: If True, only evaluate 5 problems (for quick testing)
|
| 134 |
+
checkpoint_freq: Save checkpoint every N problems
|
| 135 |
+
resume: Resume from checkpoint if available
|
| 136 |
+
temperature_range: (min, max) temperature for sampling diversity
|
| 137 |
+
|
| 138 |
+
Returns:
|
| 139 |
+
Dictionary with Pass@1, Pass@10, Pass@100, and detailed results
|
| 140 |
+
"""
|
| 141 |
+
# Create output directory
|
| 142 |
+
output_path = Path(output_dir) / benchmark
|
| 143 |
+
output_path.mkdir(parents=True, exist_ok=True)
|
| 144 |
+
|
| 145 |
+
checkpoint_file = output_path / f"checkpoint_{provider}_{model or 'default'}.json"
|
| 146 |
+
results_file = output_path / f"results_{provider}_{model or 'default'}.json"
|
| 147 |
+
summary_file = output_path / f"summary_{provider}_{model or 'default'}.json"
|
| 148 |
+
|
| 149 |
+
# Load problems
|
| 150 |
+
print(f"Loading {benchmark} dataset from {data_dir}...")
|
| 151 |
+
problems = load_benchmark_data(benchmark, data_dir)
|
| 152 |
+
|
| 153 |
+
if test_sample:
|
| 154 |
+
problems = problems[:5]
|
| 155 |
+
print(f"⚠️ Test mode: evaluating only {len(problems)} problems")
|
| 156 |
+
else:
|
| 157 |
+
print(f"Loaded {len(problems)} problems")
|
| 158 |
+
|
| 159 |
+
# Check for checkpoint
|
| 160 |
+
start_idx = 0
|
| 161 |
+
all_problem_results = []
|
| 162 |
+
if resume and checkpoint_file.exists():
|
| 163 |
+
with open(checkpoint_file, 'r') as f:
|
| 164 |
+
checkpoint = json.load(f)
|
| 165 |
+
all_problem_results = checkpoint.get("results", [])
|
| 166 |
+
start_idx = len(all_problem_results)
|
| 167 |
+
print(f"Resuming from checkpoint: {start_idx}/{len(problems)} problems completed")
|
| 168 |
+
|
| 169 |
+
# Initialize model client
|
| 170 |
+
print(f"Initializing model client: provider={provider}, model={model or 'default'}")
|
| 171 |
+
client = create_model_client(provider=provider, model=model, **model_kwargs)
|
| 172 |
+
|
| 173 |
+
# Evaluate each problem
|
| 174 |
+
for idx, problem in enumerate(problems[start_idx:], start=start_idx):
|
| 175 |
+
problem_id = problem.get("task_id", f"{benchmark}/{idx}")
|
| 176 |
+
print(f"\n[{idx+1}/{len(problems)}] Problem {problem_id}")
|
| 177 |
+
|
| 178 |
+
prompt = format_problem_prompt(problem, benchmark)
|
| 179 |
+
sample_results = []
|
| 180 |
+
|
| 181 |
+
# Generate k samples with varying temperature
|
| 182 |
+
for sample_idx in range(k_samples):
|
| 183 |
+
temperature = temperature_range[0] + (temperature_range[1] - temperature_range[0]) * (sample_idx / max(k_samples-1, 1))
|
| 184 |
+
|
| 185 |
+
try:
|
| 186 |
+
result = client.generate(
|
| 187 |
+
prompt=prompt,
|
| 188 |
+
temperature=temperature,
|
| 189 |
+
max_tokens=1024,
|
| 190 |
+
stop=None
|
| 191 |
+
)
|
| 192 |
+
generated_code = result.text.strip()
|
| 193 |
+
|
| 194 |
+
# Extract function if needed (for HumanEval)
|
| 195 |
+
if benchmark == "humaneval":
|
| 196 |
+
# Keep only the function definition
|
| 197 |
+
lines = generated_code.split('\n')
|
| 198 |
+
func_lines = []
|
| 199 |
+
in_func = False
|
| 200 |
+
for line in lines:
|
| 201 |
+
if line.strip().startswith('def '):
|
| 202 |
+
in_func = True
|
| 203 |
+
if in_func:
|
| 204 |
+
func_lines.append(line)
|
| 205 |
+
# Stop at next top-level def or class
|
| 206 |
+
if line.strip() and not line.startswith(' ') and not line.startswith('\t'):
|
| 207 |
+
if len(func_lines) > 1:
|
| 208 |
+
break
|
| 209 |
+
if func_lines:
|
| 210 |
+
generated_code = '\n'.join(func_lines)
|
| 211 |
+
|
| 212 |
+
# Execute test
|
| 213 |
+
passed, error = execute_test(generated_code, problem, benchmark, timeout=10)
|
| 214 |
+
sample_results.append({
|
| 215 |
+
"sample_idx": sample_idx,
|
| 216 |
+
"temperature": temperature,
|
| 217 |
+
"passed": passed,
|
| 218 |
+
"error": error,
|
| 219 |
+
"code": generated_code[:200] + "..." if len(generated_code) > 200 else generated_code
|
| 220 |
+
})
|
| 221 |
+
|
| 222 |
+
status = "✓" if passed else "✗"
|
| 223 |
+
print(f" Sample {sample_idx+1}/{k_samples} (T={temperature:.2f}): {status}")
|
| 224 |
+
|
| 225 |
+
except Exception as e:
|
| 226 |
+
print(f" Sample {sample_idx+1}: Error - {e}")
|
| 227 |
+
sample_results.append({
|
| 228 |
+
"sample_idx": sample_idx,
|
| 229 |
+
"temperature": temperature,
|
| 230 |
+
"passed": False,
|
| 231 |
+
"error": str(e),
|
| 232 |
+
"code": ""
|
| 233 |
+
})
|
| 234 |
+
|
| 235 |
+
# Determine if problem passed (any sample succeeded)
|
| 236 |
+
problem_passed = any(s["passed"] for s in sample_results)
|
| 237 |
+
|
| 238 |
+
problem_result = {
|
| 239 |
+
"problem_id": problem_id,
|
| 240 |
+
"passed": problem_passed,
|
| 241 |
+
"samples": sample_results,
|
| 242 |
+
"prompt": prompt[:200] + "..." if len(prompt) > 200 else prompt
|
| 243 |
+
}
|
| 244 |
+
all_problem_results.append(problem_result)
|
| 245 |
+
|
| 246 |
+
# Save checkpoint periodically
|
| 247 |
+
if (idx + 1) % checkpoint_freq == 0:
|
| 248 |
+
checkpoint = {
|
| 249 |
+
"benchmark": benchmark,
|
| 250 |
+
"provider": provider,
|
| 251 |
+
"model": model or client.get_model_name(),
|
| 252 |
+
"k_samples": k_samples,
|
| 253 |
+
"timestamp": datetime.now().isoformat(),
|
| 254 |
+
"completed": idx + 1,
|
| 255 |
+
"total": len(problems),
|
| 256 |
+
"results": all_problem_results
|
| 257 |
+
}
|
| 258 |
+
with open(checkpoint_file, 'w') as f:
|
| 259 |
+
json.dump(checkpoint, f, indent=2)
|
| 260 |
+
print(f" ✓ Checkpoint saved ({idx+1}/{len(problems)})")
|
| 261 |
+
|
| 262 |
+
# Compute Pass@k metrics (using any-pass logic for each problem)
|
| 263 |
+
results_binary = [r["passed"] for r in all_problem_results]
|
| 264 |
+
|
| 265 |
+
# NOTE: results_binary holds any-of-k_samples flags, so pass_at_1 below equals the
|
| 266 |
+
# fraction of problems solved by at least one sample. That overstates true Pass@1,
|
| 267 |
+
# which needs the per-problem count of passing samples.
|
| 268 |
+
pass_at_1 = compute_pass_k(results_binary, 1, len(results_binary))
|
| 269 |
+
pass_at_10 = compute_pass_k(results_binary, 10, len(results_binary))
|
| 270 |
+
pass_at_100 = compute_pass_k(results_binary, 100, len(results_binary))
|
| 271 |
+
|
| 272 |
+
summary = {
|
| 273 |
+
"benchmark": benchmark,
|
| 274 |
+
"provider": provider,
|
| 275 |
+
"model": model or client.get_model_name(),
|
| 276 |
+
"k_samples": k_samples,
|
| 277 |
+
"total_problems": len(problems),
|
| 278 |
+
"passed_problems": sum(results_binary),
|
| 279 |
+
"pass_at_1": pass_at_1,
|
| 280 |
+
"pass_at_10": pass_at_10,
|
| 281 |
+
"pass_at_100": pass_at_100,
|
| 282 |
+
"timestamp": datetime.now().isoformat()
|
| 283 |
+
}
|
| 284 |
+
|
| 285 |
+
# Save final results
|
| 286 |
+
final_output = {
|
| 287 |
+
"metadata": summary,
|
| 288 |
+
"results": all_problem_results
|
| 289 |
+
}
|
| 290 |
+
with open(results_file, 'w') as f:
|
| 291 |
+
json.dump(final_output, f, indent=2)
|
| 292 |
+
|
| 293 |
+
with open(summary_file, 'w') as f:
|
| 294 |
+
json.dump(summary, f, indent=2)
|
| 295 |
+
|
| 296 |
+
print("\n" + "="*60)
|
| 297 |
+
print("EVALUATION COMPLETE")
|
| 298 |
+
print("="*60)
|
| 299 |
+
print(f"Benchmark: {benchmark}")
|
| 300 |
+
print(f"Provider: {provider}")
|
| 301 |
+
print(f"Model: {summary['model']}")
|
| 302 |
+
print(f"Total Problems: {len(problems)}")
|
| 303 |
+
print(f"Pass@1: {pass_at_1*100:.1f}%")
|
| 304 |
+
print(f"Pass@10: {pass_at_10*100:.1f}%")
|
| 305 |
+
print(f"Pass@100: {pass_at_100*100:.1f}%")
|
| 306 |
+
print(f"\nResults saved to: {results_file}")
|
| 307 |
+
print(f"Summary saved to: {summary_file}")
|
| 308 |
+
print("="*60)
|
| 309 |
+
|
| 310 |
+
return summary
|
| 311 |
+
|
| 312 |
+
def main():
|
| 313 |
+
parser = argparse.ArgumentParser(description="Proper benchmark evaluation with Pass@k")
|
| 314 |
+
parser.add_argument("--benchmark", choices=["humaneval", "mbpp"], required=True, help="Benchmark to run")
|
| 315 |
+
parser.add_argument("--provider", choices=["ollama", "openai", "anthropic", "openrouter", "together"], required=True, help="Model provider")
|
| 316 |
+
parser.add_argument("--model", type=str, help="Model name (provider-specific)")
|
| 317 |
+
parser.add_argument("--k-samples", type=int, default=100, help="Number of samples per problem for Pass@k")
|
| 318 |
+
parser.add_argument("--data-dir", type=str, default="./data", help="Directory with downloaded datasets")
|
| 319 |
+
parser.add_argument("--output-dir", type=str, default="./results", help="Where to save results")
|
| 320 |
+
parser.add_argument("--test-sample", action="store_true", help="Run on 5 problems only (quick test)")
|
| 321 |
+
parser.add_argument("--checkpoint-freq", type=int, default=20, help="Save checkpoint every N problems")
|
| 322 |
+
parser.add_argument("--resume", action="store_true", help="Resume from checkpoint if available")
|
| 323 |
+
parser.add_argument("--api-key", type=str, help="API key (or set environment variable)")
|
| 324 |
+
|
| 325 |
+
args = parser.parse_args()
|
| 326 |
+
|
| 327 |
+
# Prepare model kwargs
|
| 328 |
+
model_kwargs = {}
|
| 329 |
+
if args.api_key:
|
| 330 |
+
model_kwargs["api_key"] = args.api_key
|
| 331 |
+
|
| 332 |
+
try:
|
| 333 |
+
summary = evaluate_benchmark(
|
| 334 |
+
benchmark=args.benchmark,
|
| 335 |
+
provider=args.provider,
|
| 336 |
+
model=args.model,
|
| 337 |
+
k_samples=args.k_samples,
|
| 338 |
+
data_dir=args.data_dir,
|
| 339 |
+
output_dir=args.output_dir,
|
| 340 |
+
test_sample=args.test_sample,
|
| 341 |
+
checkpoint_freq=args.checkpoint_freq,
|
| 342 |
+
resume=args.resume,
|
| 343 |
+
**model_kwargs
|
| 344 |
+
)
|
| 345 |
+
sys.exit(0)
|
| 346 |
+
except KeyboardInterrupt:
|
| 347 |
+
print("\n\nInterrupted. Progress saved in checkpoint (if enabled).")
|
| 348 |
+
sys.exit(1)
|
| 349 |
+
except Exception as e:
|
| 350 |
+
print(f"\n❌ Error: {e}", file=sys.stderr)
|
| 351 |
+
import traceback
|
| 352 |
+
traceback.print_exc()
|
| 353 |
+
sys.exit(1)
|
| 354 |
+
|
| 355 |
+
if __name__ == "__main__":
|
| 356 |
+
main()
|
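The formula quoted in the `compute_pass_k` docstring, Pass@k = 1 - C(n-c, k) / C(n, k), works on per-problem sample counts. A minimal sketch of it, assuming hypothetical helpers `pass_at_k` and `mean_pass_at_k` (neither is in the script), where `n` is the number of samples drawn per problem and `c` the number that passed:

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average the per-problem estimates; correct_counts[i] = passing samples for problem i."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With the result records built in `evaluate_benchmark`, the count for one problem could be taken as `sum(s["passed"] for s in problem_result["samples"])`, and `n` would be `k_samples`.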
|
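`execute_test` runs generated code with `exec` inside the evaluator process and relies on `signal.alarm`, which is only available on Unix. One alternative, sketched here under the assumption of a hypothetical helper `run_in_subprocess` (not part of the script), is to run each candidate in a fresh interpreter with a timeout. This gives process isolation only; it is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def run_in_subprocess(source: str, timeout: float = 10.0) -> Tuple[bool, Optional[str]]:
    """Hypothetical helper: run `source` in a fresh interpreter; pass means exit code 0."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising TimeoutExpired.
            return False, "Execution timed out"
    if proc.returncode == 0:
        return True, None
    # Report the last stderr line, which usually names the exception.
    lines = proc.stderr.strip().splitlines()
    return False, lines[-1] if lines else f"exit code {proc.returncode}"
```

The return shape matches `execute_test`, so a caller could pass it the combined code and test string and keep the rest of the evaluation loop unchanged.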
@@ -0,0 +1,80 @@
|
| 1 |
+
# Colab-Optimized Training Configuration for Stack 2.9
|
| 2 |
+
# Target: Google Colab free tier (T4 GPU, 15GB VRAM)
|
| 3 |
+
# Model: Qwen/Qwen2.5-Coder-7B (4-bit quantized fits in ~4.5GB)
|
| 4 |
+
# Expected runtime: 3-5 hours
|
| 5 |
+
|
| 6 |
+
model:
|
| 7 |
+
name: "Qwen/Qwen2.5-Coder-7B" # 7B instead of 32B for Colab
|
| 8 |
+
trust_remote_code: true
|
| 9 |
+
use_flash_attention: false # FlashAttention-2 needs Ampere or newer; the T4 is a Turing GPU
|
| 10 |
+
|
| 11 |
+
tokenizer:
|
| 12 |
+
model_max_length: 8192 # Reduced from 131072 for memory
|
| 13 |
+
padding_side: "right"
|
| 14 |
+
truncation_side: "right"
|
| 15 |
+
|
| 16 |
+
peft:
|
| 17 |
+
peft_type: "LORA"
|
| 18 |
+
task_type: "CAUSAL_LM"
|
| 19 |
+
r: 16 # LoRA rank (lower = faster, good enough for 7B)
|
| 20 |
+
lora_alpha: 32
|
| 21 |
+
lora_dropout: 0.05
|
| 22 |
+
target_modules:
|
| 23 |
+
- "q_proj"
|
| 24 |
+
- "k_proj"
|
| 25 |
+
- "v_proj"
|
| 26 |
+
- "o_proj"
|
| 27 |
+
- "gate_proj"
|
| 28 |
+
- "up_proj"
|
| 29 |
+
- "down_proj"
|
| 30 |
+
# Optional: add "embed_tokens", "lm_head" for full coverage (increases memory)
|
| 31 |
+
|
| 32 |
+
quantization:
|
| 33 |
+
load_in_4bit: true
|
| 34 |
+
bnb_4bit_compute_dtype: "float16" # T4 has no native bf16
|
| 35 |
+
bnb_4bit_quant_type: "nf4"
|
| 36 |
+
bnb_4bit_use_double_quant: true
|
| 37 |
+
|
| 38 |
+
training:
|
| 39 |
+
output_dir: "./adapters_colab"
|
| 40 |
+
num_train_epochs: 2 # Sufficient for 7B with decent dataset
|
| 41 |
+
per_device_train_batch_size: 1 # Tiny batch for 15GB VRAM
|
| 42 |
+
gradient_accumulation_steps: 16 # Effective batch size = 16
|
| 43 |
+
optim: "paged_adamw_8bit" # 8-bit optimizer for memory
|
| 44 |
+
learning_rate: 1.0e-4
|
| 45 |
+
weight_decay: 0.01
|
| 46 |
+
warmup_steps: 100
|
| 47 |
+
lr_scheduler_type: "cosine"
|
| 48 |
+
save_steps: 500
|
| 49 |
+
save_total_limit: 2
|
| 50 |
+
logging_steps: 10
|
| 51 |
+
report_to: "none" # Disable wandb for Colab
|
| 52 |
+
|
| 53 |
+
# Memory optimizations
|
| 54 |
+
gradient_checkpointing: true
|
| 55 |
+
fp16: true # T4 (Turing) has fp16 tensor cores
|
| 56 |
+
bf16: false # T4 lacks native bf16; enable only on Ampere or newer GPUs
|
| 57 |
+
max_grad_norm: 1.0
|
| 58 |
+
dataloader_num_workers: 2
|
| 59 |
+
remove_unused_columns: false
|
| 60 |
+
|
| 61 |
+
data:
|
| 62 |
+
train_file: "./training-data/train.jsonl"
|
| 63 |
+
validation_file: "./training-data/eval.jsonl"
|
| 64 |
+
dataset_format: "chat" # or "prompt_response"
|
| 65 |
+
max_seq_length: 8192 # Critical for T4 memory
|
| 66 |
+
prompt_template: "chatml" # Qwen's default template
|
| 67 |
+
|
| 68 |
+
# Hardware
|
| 69 |
+
ddp: false # Single GPU for Colab
|
| 70 |
+
|
| 71 |
+
# Misc
|
| 72 |
+
seed: 42
|
| 73 |
+
push_to_hub: false # Set to true and add HF token to push during training
|
| 74 |
+
hub_model_id: null # "your-org/stack-2.9-7b-lora"
|
| 75 |
+
|
| 76 |
+
# Notes:
|
| 77 |
+
# - 4-bit quantization + batch size 1 + gradient checkpointing = fits in 15GB
|
| 78 |
+
# - If OOM: reduce max_seq_length to 4096 (gradient accumulation does not lower peak memory at batch size 1)
|
| 79 |
+
# - If training is slow: increase per_device_train_batch_size to 2 (if memory allows)
|
| 80 |
+
# - After training, merge adapter with base model using merge_adapter.py
|
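The `fp16` and `bf16` flags in this config depend on the GPU generation. A minimal sketch of that choice, assuming a hypothetical helper `precision_flags` (not part of the repository) and the rule that native bf16 needs CUDA compute capability 8.0 or newer:

```python
from typing import Dict, Tuple


def precision_flags(compute_capability: Tuple[int, int]) -> Dict[str, bool]:
    """Hypothetical helper: pick mixed-precision flags from a CUDA compute capability.

    Assumption: native bf16 needs compute capability 8.0 (Ampere) or newer;
    older GPUs such as the T4 (7.5) fall back to fp16.
    """
    has_bf16 = compute_capability >= (8, 0)
    return {"bf16": has_bf16, "fp16": not has_bf16}
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed in directly.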