zgrgr committed
Commit 59848a4 · verified · 1 Parent(s): 384f47e

Update README.md

Files changed (1): README.md +35 -10
README.md CHANGED
@@ -12,6 +12,7 @@ tags:
 - custom-reward
 - trl
 - llm
+- adapter
 library_name: transformers
 model_name: newmindai/QwQ-32B-r1
 pipeline_tag: text-generation
@@ -21,11 +22,15 @@ datasets:
 
 # Overview
 
-This model was fine-tuned using **Reinforcement Learning** on top of a pretrained LLM, enhanced with:
+**newmindai/QwQ-32B-r1** is a **LoRA adapter**, fine-tuned via **Reinforcement Learning (RL)** on top of the base model `QwQ-32B`. It incorporates:
 
 - **ORMs** (Open Reward Modules)
 - **DAPO** (Decoder Appearance Optimization)
-- **SimpleScaling** (loss scaling strategy)
+- **SimpleScaling** (Multi-objective loss balancing)
+
+> This is an **adapter**, not a fully merged model. To use it, you must load it on top of the base model (`Qwen/QwQ-32B`) using the `peft` library.
+
+---
 
 ## Training Setup
 
@@ -37,8 +42,6 @@ This model was fine-tuned using **Reinforcement Learning** on top of a pretraine
 
 ### Reward Modules (ORMs)
 
-The following reward functions guided RL fine-tuning:
-
 | Reward Function | Description |
 |-------------------|-------------------------------------------------------|
 | `math` | Evaluates symbolic math correctness (MathORM) |
@@ -55,27 +58,49 @@ These were combined and scaled during training with adaptive weighting.
 - **DAPO (Appearance Optimization):** Regularizes attention and layout structure in decoder outputs.
 - **SimpleScaling** ([`newmindai/simplescaling`](https://huggingface.co/newmindai/simplescaling)): Controls optimizer behavior and reward balance across multiple objectives.
 
+---
+
 ## Training Regime
 
 - **Stage 1 (Wait #1):** Model explores reward landscape; initial rewards unstable.
 - **Stage 2 (Wait #2):** Convergence improves as ORM signals align.
 - **Aha Moment:** Clear gains in math and formatting scores around ~2K steps after warm-up.
 
+---
+
 ## Evaluation
 
 🐍 **Mezura-SnakeBench Benchmarking**
-Final performance was benchmarked using the [Mezura](https://huggingface.co/spaces/newmindai/Mezura) SnakeBench framework — a standardized evaluation suite developed by NewmindAI for structured Turkish NLP tasks.
+Final performance was benchmarked using the [Mezura](https://huggingface.co/spaces/newmindai/Mezura) SnakeBench framework — a standardized evaluation suite developed by NewmindAI for structured Turkish NLP tasks.
+
+---
 
-## Usage Example
+## Usage Example (LoRA Adapter)
 
+This adapter must be loaded on top of the base model `Qwen/QwQ-32B` using the [`peft`](https://github.com/huggingface/peft) library:
 
 ```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from peft import PeftModel
+import torch
+
+base_model_id = "Qwen/QwQ-32B"
+adapter_id = "newmindai/QwQ-32B-r1"
+
+# Load tokenizer
+tokenizer = AutoTokenizer.from_pretrained(base_model_id)
+
+# Load base model
+base_model = AutoModelForCausalLM.from_pretrained(
+    base_model_id,
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
 
-model_id = "newmindai/QwQ-32B-r1"
-model = AutoModelForCausalLM.from_pretrained(model_id)
-tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Load LoRA adapter
+model = PeftModel.from_pretrained(base_model, adapter_id)
 
+# Inference
 prompt = "Türkiye'nin en yüksek dağı nedir?"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 outputs = model.generate(**inputs, max_new_tokens=100)
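
The usage hunk above ends at `model.generate`, before the output is ever decoded. Continuing from the variables defined in that example (`tokenizer`, `inputs`, `outputs`), a minimal follow-up sketch, not part of the committed README, would decode and print only the newly generated tokens:

```python
# Strip the prompt tokens from the front of the sequence, then decode the rest.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```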
 
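Because the repository ships a LoRA adapter rather than merged weights, a common deployment step is to fold the adapter into the base model and save a standalone checkpoint. Below is a minimal sketch using `peft`'s `merge_and_unload()`, reusing `model` and `tokenizer` from the example above; the output directory name is illustrative, not taken from the README:

```python
# Merge the LoRA deltas into the base weights so the model can be served
# without peft, then save the merged checkpoint next to the tokenizer.
merged = model.merge_and_unload()              # returns a plain transformers model
merged.save_pretrained("QwQ-32B-r1-merged")    # illustrative local path
tokenizer.save_pretrained("QwQ-32B-r1-merged")
```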
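The Reward Modules section of the diff states that the reward functions "were combined and scaled during training with adaptive weighting" but does not show how. The sketch below is purely illustrative of such a scheme (inverse-variance scaling of each reward stream); it is an assumption, not the ORM/SimpleScaling implementation, and the `format` reward name is hypothetical:

```python
import numpy as np

def combine_rewards(reward_histories: dict[str, list[float]]) -> float:
    """Illustrative adaptive weighting: scale each reward stream by the inverse
    of its running standard deviation so no single objective dominates, then
    average the scaled latest values. Not the actual training implementation."""
    scaled = []
    for name, history in reward_histories.items():
        spread = np.std(history) + 1e-6      # avoid division by zero
        scaled.append(history[-1] / spread)  # weight the most recent reward
    return float(np.mean(scaled))

# Toy example: a noisy math reward and a stable (hypothetical) format reward
print(combine_rewards({"math": [0.1, 0.9, 0.2, 0.8], "format": [0.60, 0.65, 0.70, 0.68]}))
```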