BFS-Search committed on
Commit 71630e5 · verified · 1 Parent(s): 622455f

Upload model files

Files changed (1)
  1. checkpoint-100/trainer_state.json +2134 -0
checkpoint-100/trainer_state.json ADDED
@@ -0,0 +1,2134 @@
+ {
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "episode": 2400,
+ "epoch": 0.34285714285714286,
+ "eval_steps": 500,
+ "global_step": 100,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "episode": 24,
+ "epoch": 0.0034285714285714284,
+ "eps": 3,
+ "loss/policy_avg": -0.00039759615901857615,
+ "loss/value_avg": 0.001173102529719472,
+ "lr": 0.0,
+ "objective/entropy": 9.603790283203125,
+ "objective/kl": 0.0522441528737545,
+ "objective/non_score_reward": -0.0026122077833861113,
+ "objective/rlhf_reward": -0.0026122077833861113,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00023563228023704141,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.17652443051338196,
+ "step": 1,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9997509717941284,
+ "val/ratio_var": 1.331204543930653e-06
+ },
+ {
+ "episode": 48,
+ "epoch": 0.006857142857142857,
+ "eps": 2,
+ "loss/policy_avg": 9.456602856516838e-05,
+ "loss/value_avg": 0.0012460823636502028,
+ "lr": 3.3333333333333334e-08,
+ "objective/entropy": 11.654745101928711,
+ "objective/kl": 0.04007173329591751,
+ "objective/non_score_reward": -0.002003586385399103,
+ "objective/rlhf_reward": -0.002003586385399103,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0005037242081016302,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.22399930655956268,
+ "step": 2,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.999730110168457,
+ "val/ratio_var": 1.496027721259452e-06
+ },
+ {
+ "episode": 72,
+ "epoch": 0.010285714285714285,
+ "eps": 3,
+ "loss/policy_avg": 0.0005133129889145494,
+ "loss/value_avg": 0.0011833460303023458,
+ "lr": 6.666666666666667e-08,
+ "objective/entropy": 9.299583435058594,
+ "objective/kl": -0.02202693186700344,
+ "objective/non_score_reward": 0.0011013468028977513,
+ "objective/rlhf_reward": 0.0011013468028977513,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0002756627509370446,
+ "policy/clipfrac_avg": 0.00019654087373055518,
+ "policy/entropy_avg": 0.16759081184864044,
+ "step": 3,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0005955696105957,
+ "val/ratio_var": 1.5924795206956333e-06
+ },
+ {
+ "episode": 96,
+ "epoch": 0.013714285714285714,
+ "eps": 3,
+ "loss/policy_avg": -0.0007660436676815152,
+ "loss/value_avg": 0.0012440377613529563,
+ "lr": 1e-07,
+ "objective/entropy": 10.092122077941895,
+ "objective/kl": -0.001453061937354505,
+ "objective/non_score_reward": 7.26534053683281e-05,
+ "objective/rlhf_reward": 7.26534053683281e-05,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004117934440728277,
+ "policy/clipfrac_avg": 0.001179245300590992,
+ "policy/entropy_avg": 0.18802356719970703,
+ "step": 4,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0004326105117798,
+ "val/ratio_var": 2.240923549834406e-06
+ },
+ {
+ "episode": 120,
+ "epoch": 0.017142857142857144,
+ "eps": 3,
+ "loss/policy_avg": -0.0014500913675874472,
+ "loss/value_avg": 0.001122939051128924,
+ "lr": 1.3333333333333334e-07,
+ "objective/entropy": 10.164737701416016,
+ "objective/kl": 0.05467784032225609,
+ "objective/non_score_reward": -0.002733892295509577,
+ "objective/rlhf_reward": -0.002733892295509577,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004754411638714373,
+ "policy/clipfrac_avg": 0.002358490601181984,
+ "policy/entropy_avg": 0.18593449890613556,
+ "step": 5,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0011851787567139,
+ "val/ratio_var": 2.860480663002818e-06
+ },
+ {
+ "episode": 144,
+ "epoch": 0.02057142857142857,
+ "eps": 3,
+ "loss/policy_avg": -0.00038924324326217175,
+ "loss/value_avg": 0.0011501125991344452,
+ "lr": 1.6666666666666665e-07,
+ "objective/entropy": 10.74693489074707,
+ "objective/kl": -0.019970744848251343,
+ "objective/non_score_reward": 0.0009985374053940177,
+ "objective/rlhf_reward": 0.0009985374053940177,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004121576203033328,
+ "policy/clipfrac_avg": 0.0013757861452177167,
+ "policy/entropy_avg": 0.20963406562805176,
+ "step": 6,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0004911422729492,
+ "val/ratio_var": 3.83469341613818e-06
+ },
+ {
+ "episode": 168,
+ "epoch": 0.024,
+ "eps": 3,
+ "loss/policy_avg": -0.0006043941248208284,
+ "loss/value_avg": 0.0011014372576028109,
+ "lr": 2e-07,
+ "objective/entropy": 9.983142852783203,
+ "objective/kl": -0.023275122046470642,
+ "objective/non_score_reward": 0.0011637563584372401,
+ "objective/rlhf_reward": 0.0011637563584372401,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00038099882658571005,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.18453757464885712,
+ "step": 7,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.998975396156311,
+ "val/ratio_var": 2.2136255211080424e-06
+ },
+ {
+ "episode": 192,
+ "epoch": 0.027428571428571427,
+ "eps": 3,
+ "loss/policy_avg": -0.0012414506636559963,
+ "loss/value_avg": 0.0011209447402507067,
+ "lr": 2.3333333333333333e-07,
+ "objective/entropy": 10.798426628112793,
+ "objective/kl": 0.0037710717879235744,
+ "objective/non_score_reward": -0.00018855356029234827,
+ "objective/rlhf_reward": -0.00018855356029234827,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00043036130955442786,
+ "policy/clipfrac_avg": 0.0021619496401399374,
+ "policy/entropy_avg": 0.1988888382911682,
+ "step": 8,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.999695360660553,
+ "val/ratio_var": 1.6811103478175937e-06
+ },
+ {
+ "episode": 216,
+ "epoch": 0.030857142857142857,
+ "eps": 3,
+ "loss/policy_avg": -0.0001550057204440236,
+ "loss/value_avg": 0.0011778806801885366,
+ "lr": 2.6666666666666667e-07,
+ "objective/entropy": 11.547513961791992,
+ "objective/kl": -0.006700088735669851,
+ "objective/non_score_reward": 0.00033500464633107185,
+ "objective/rlhf_reward": 0.00033500464633107185,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003637151967268437,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.20382140576839447,
+ "step": 9,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000375509262085,
+ "val/ratio_var": 2.181258651035023e-06
+ },
+ {
+ "episode": 240,
+ "epoch": 0.03428571428571429,
+ "eps": 3,
+ "loss/policy_avg": 0.00045341846998780966,
+ "loss/value_avg": 0.0011125364108011127,
+ "lr": 3e-07,
+ "objective/entropy": 10.174510955810547,
+ "objective/kl": 0.07684259861707687,
+ "objective/non_score_reward": -0.0038421298377215862,
+ "objective/rlhf_reward": -0.0038421298377215862,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00037233877810649574,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.20036275684833527,
+ "step": 10,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0004992485046387,
+ "val/ratio_var": 3.829253728326876e-06
+ },
+ {
+ "episode": 264,
+ "epoch": 0.037714285714285714,
+ "eps": 4,
+ "loss/policy_avg": -0.0006368225440382957,
+ "loss/value_avg": 0.0012326778378337622,
+ "lr": 3.333333333333333e-07,
+ "objective/entropy": 9.915946006774902,
+ "objective/kl": 0.08295571058988571,
+ "objective/non_score_reward": -0.0041477857157588005,
+ "objective/rlhf_reward": -0.0041477857157588005,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003957168955821544,
+ "policy/clipfrac_avg": 0.00019654087373055518,
+ "policy/entropy_avg": 0.18829384446144104,
+ "step": 11,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0000991821289062,
+ "val/ratio_var": 2.684702167243813e-06
+ },
+ {
+ "episode": 288,
+ "epoch": 0.04114285714285714,
+ "eps": 4,
+ "loss/policy_avg": -0.00045989686623215675,
+ "loss/value_avg": 0.001033363863825798,
+ "lr": 3.666666666666666e-07,
+ "objective/entropy": 9.914142608642578,
+ "objective/kl": 0.001713767647743225,
+ "objective/non_score_reward": -8.56887418194674e-05,
+ "objective/rlhf_reward": -8.56887418194674e-05,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00041497976053506136,
+ "policy/clipfrac_avg": 0.0013757861452177167,
+ "policy/entropy_avg": 0.20258067548274994,
+ "step": 12,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000044345855713,
+ "val/ratio_var": 2.319321311006206e-06
+ },
+ {
+ "episode": 312,
+ "epoch": 0.044571428571428574,
+ "eps": 4,
+ "loss/policy_avg": -4.981772508472204e-05,
+ "loss/value_avg": 0.0011141800787299871,
+ "lr": 4e-07,
+ "objective/entropy": 8.326139450073242,
+ "objective/kl": 0.03665222227573395,
+ "objective/non_score_reward": -0.001832611276768148,
+ "objective/rlhf_reward": -0.001832611276768148,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0002924801083281636,
+ "policy/clipfrac_avg": 0.00019654087373055518,
+ "policy/entropy_avg": 0.15716251730918884,
+ "step": 13,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0000745058059692,
+ "val/ratio_var": 9.655751682657865e-07
+ },
+ {
+ "episode": 336,
+ "epoch": 0.048,
+ "eps": 4,
+ "loss/policy_avg": -0.0008058485109359026,
+ "loss/value_avg": 0.0011609160574153066,
+ "lr": 4.3333333333333335e-07,
+ "objective/entropy": 10.01640510559082,
+ "objective/kl": 0.02480611763894558,
+ "objective/non_score_reward": -0.001240305951796472,
+ "objective/rlhf_reward": -0.001240305951796472,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00044258785783313215,
+ "policy/clipfrac_avg": 0.001965408679097891,
+ "policy/entropy_avg": 0.19537438452243805,
+ "step": 14,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9992372989654541,
+ "val/ratio_var": 2.064736008833279e-06
+ },
+ {
+ "episode": 360,
+ "epoch": 0.05142857142857143,
+ "eps": 4,
+ "loss/policy_avg": -0.0012431369395926595,
+ "loss/value_avg": 0.0011450606398284435,
+ "lr": 4.6666666666666666e-07,
+ "objective/entropy": 8.250815391540527,
+ "objective/kl": 0.026728516444563866,
+ "objective/non_score_reward": -0.0013364258920773864,
+ "objective/rlhf_reward": -0.0013364258920773864,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00041782623156905174,
+ "policy/clipfrac_avg": 0.0027515722904354334,
+ "policy/entropy_avg": 0.15008953213691711,
+ "step": 15,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9994743466377258,
+ "val/ratio_var": 1.3626447525894037e-06
+ },
+ {
+ "episode": 384,
+ "epoch": 0.054857142857142854,
+ "eps": 4,
+ "loss/policy_avg": -0.0016457759775221348,
+ "loss/value_avg": 0.0012071160599589348,
+ "lr": 5e-07,
+ "objective/entropy": 9.560142517089844,
+ "objective/kl": 0.04084904119372368,
+ "objective/non_score_reward": -0.0020424523390829563,
+ "objective/rlhf_reward": -0.0020424523390829563,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00030206821975298226,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.19114547967910767,
+ "step": 16,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9999871253967285,
+ "val/ratio_var": 1.3688204489881173e-06
+ },
+ {
+ "episode": 408,
+ "epoch": 0.05828571428571429,
+ "eps": 4,
+ "loss/policy_avg": 0.00015279045328497887,
+ "loss/value_avg": 0.0011765253730118275,
+ "lr": 5.333333333333333e-07,
+ "objective/entropy": 10.473526000976562,
+ "objective/kl": 0.012600630521774292,
+ "objective/non_score_reward": -0.0006300313398241997,
+ "objective/rlhf_reward": -0.0006300313398241997,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.000512410537339747,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.1867837905883789,
+ "step": 17,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0001708269119263,
+ "val/ratio_var": 3.6231228932592785e-06
+ },
+ {
+ "episode": 432,
+ "epoch": 0.061714285714285715,
+ "eps": 4,
+ "loss/policy_avg": -0.0015998110175132751,
+ "loss/value_avg": 0.0011272402480244637,
+ "lr": 5.666666666666666e-07,
+ "objective/entropy": 9.374881744384766,
+ "objective/kl": -0.022168656811118126,
+ "objective/non_score_reward": 0.001108432887122035,
+ "objective/rlhf_reward": 0.001108432887122035,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003468349459581077,
+ "policy/clipfrac_avg": 0.00019654087373055518,
+ "policy/entropy_avg": 0.17824786901474,
+ "step": 18,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0005429983139038,
+ "val/ratio_var": 1.015850784824579e-06
+ },
+ {
+ "episode": 456,
+ "epoch": 0.06514285714285714,
+ "eps": 4,
+ "loss/policy_avg": 0.0005391768645495176,
+ "loss/value_avg": 0.0010727113112807274,
+ "lr": 6e-07,
+ "objective/entropy": 9.847137451171875,
+ "objective/kl": 0.10544447600841522,
+ "objective/non_score_reward": -0.0052722240798175335,
+ "objective/rlhf_reward": -0.0052722240798175335,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003828281769528985,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.191776305437088,
+ "step": 19,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9989888072013855,
+ "val/ratio_var": 1.1592291002671118e-06
+ },
+ {
+ "episode": 480,
+ "epoch": 0.06857142857142857,
+ "eps": 4,
+ "loss/policy_avg": 0.0004999454831704497,
+ "loss/value_avg": 0.0011963938595727086,
+ "lr": 6.333333333333332e-07,
+ "objective/entropy": 9.980831146240234,
+ "objective/kl": 0.009486228227615356,
+ "objective/non_score_reward": -0.0004743114986922592,
+ "objective/rlhf_reward": -0.0004743114986922592,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003429501666687429,
+ "policy/clipfrac_avg": 0.0,
+ "policy/entropy_avg": 0.18964967131614685,
+ "step": 20,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9999833703041077,
+ "val/ratio_var": 3.019130190295982e-06
+ },
+ {
+ "episode": 504,
+ "epoch": 0.072,
+ "eps": 4,
+ "loss/policy_avg": -0.0006457410054281354,
+ "loss/value_avg": 0.0011750897392630577,
+ "lr": 6.666666666666666e-07,
+ "objective/entropy": 8.674556732177734,
+ "objective/kl": 0.16487348079681396,
+ "objective/non_score_reward": -0.008243675343692303,
+ "objective/rlhf_reward": -0.008243675343692303,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00034728588070720434,
+ "policy/clipfrac_avg": 0.0007861634949222207,
+ "policy/entropy_avg": 0.17806154489517212,
+ "step": 21,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9990400671958923,
+ "val/ratio_var": 1.2348247082627495e-06
+ },
+ {
+ "episode": 528,
+ "epoch": 0.07542857142857143,
+ "eps": 4,
+ "loss/policy_avg": -0.0015686429105699062,
+ "loss/value_avg": 0.001164754037745297,
+ "lr": 7e-07,
+ "objective/entropy": 10.79302978515625,
+ "objective/kl": -0.08593261241912842,
+ "objective/non_score_reward": 0.004296630620956421,
+ "objective/rlhf_reward": 0.004296630620956421,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00043445732444524765,
+ "policy/clipfrac_avg": 0.001768867950886488,
+ "policy/entropy_avg": 0.1922873705625534,
+ "step": 22,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0014443397521973,
+ "val/ratio_var": 3.1963945730240084e-06
+ },
+ {
+ "episode": 552,
+ "epoch": 0.07885714285714286,
+ "eps": 4,
+ "loss/policy_avg": -0.0007012488786131144,
+ "loss/value_avg": 0.0011740096379071474,
+ "lr": 7.333333333333332e-07,
+ "objective/entropy": 9.420793533325195,
+ "objective/kl": 0.13005897402763367,
+ "objective/non_score_reward": -0.006502948235720396,
+ "objective/rlhf_reward": -0.006502948235720396,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.000359610392479226,
+ "policy/clipfrac_avg": 0.0013757860288023949,
+ "policy/entropy_avg": 0.1827276051044464,
+ "step": 23,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9988819360733032,
+ "val/ratio_var": 1.4016917475601076e-06
+ },
+ {
+ "episode": 576,
+ "epoch": 0.08228571428571428,
+ "eps": 4,
+ "loss/policy_avg": -0.0009957049041986465,
+ "loss/value_avg": 0.001171375741250813,
+ "lr": 7.666666666666667e-07,
+ "objective/entropy": 12.150233268737793,
+ "objective/kl": 0.11499027907848358,
+ "objective/non_score_reward": -0.005749514326453209,
+ "objective/rlhf_reward": -0.005749514326453209,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0012159445323050022,
+ "policy/clipfrac_avg": 0.004323899280279875,
+ "policy/entropy_avg": 0.22310155630111694,
+ "step": 24,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9989429712295532,
+ "val/ratio_var": 1.6364158454962308e-06
+ },
+ {
+ "episode": 600,
+ "epoch": 0.08571428571428572,
+ "eps": 4,
+ "loss/policy_avg": 5.847762804478407e-05,
+ "loss/value_avg": 0.0012743598781526089,
+ "lr": 8e-07,
+ "objective/entropy": 10.221760749816895,
+ "objective/kl": -0.07403554022312164,
+ "objective/non_score_reward": 0.0037017769645899534,
+ "objective/rlhf_reward": 0.0037017769645899534,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00038175773806869984,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.1923346221446991,
+ "step": 25,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0002760887145996,
+ "val/ratio_var": 2.4850394311215496e-06
+ },
+ {
+ "episode": 624,
+ "epoch": 0.08914285714285715,
+ "eps": 4,
+ "loss/policy_avg": 0.0005327528342604637,
+ "loss/value_avg": 0.001223960192874074,
+ "lr": 8.333333333333333e-07,
+ "objective/entropy": 9.961271286010742,
+ "objective/kl": -0.03506360575556755,
+ "objective/non_score_reward": 0.0017531800549477339,
+ "objective/rlhf_reward": 0.0017531800549477339,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003690699813887477,
+ "policy/clipfrac_avg": 0.00019654087373055518,
+ "policy/entropy_avg": 0.19056138396263123,
+ "step": 26,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9998593926429749,
+ "val/ratio_var": 2.2398712644644547e-06
+ },
+ {
+ "episode": 648,
+ "epoch": 0.09257142857142857,
+ "eps": 4,
+ "loss/policy_avg": -0.0012908531352877617,
+ "loss/value_avg": 0.0011609859066084027,
+ "lr": 8.666666666666667e-07,
+ "objective/entropy": 10.788581848144531,
+ "objective/kl": -0.035950906574726105,
+ "objective/non_score_reward": 0.0017975454684346914,
+ "objective/rlhf_reward": 0.0017975454684346914,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00042478417162783444,
+ "policy/clipfrac_avg": 0.001179245300590992,
+ "policy/entropy_avg": 0.20873138308525085,
+ "step": 27,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9998935461044312,
+ "val/ratio_var": 2.4417051918135257e-06
+ },
+ {
+ "episode": 672,
+ "epoch": 0.096,
+ "eps": 4,
+ "loss/policy_avg": 0.000542065070476383,
+ "loss/value_avg": 0.0012408210895955563,
+ "lr": 9e-07,
+ "objective/entropy": 8.87173080444336,
+ "objective/kl": -0.033091746270656586,
+ "objective/non_score_reward": 0.0016545872204005718,
+ "objective/rlhf_reward": 0.0016545872204005718,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00035692937672138214,
+ "policy/clipfrac_avg": 0.0,
+ "policy/entropy_avg": 0.17584078013896942,
+ "step": 28,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9999797940254211,
+ "val/ratio_var": 2.1010207547078608e-06
+ },
+ {
+ "episode": 696,
+ "epoch": 0.09942857142857142,
+ "eps": 3,
+ "loss/policy_avg": 0.00025542714865878224,
+ "loss/value_avg": 0.0009407824254594743,
+ "lr": 9.333333333333333e-07,
+ "objective/entropy": 10.091318130493164,
+ "objective/kl": -0.0042562782764434814,
+ "objective/non_score_reward": 0.00021281388762872666,
+ "objective/rlhf_reward": 0.00021281388762872666,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004177533555775881,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.20332473516464233,
+ "step": 29,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0008416175842285,
+ "val/ratio_var": 4.070296654390404e-06
+ },
+ {
+ "episode": 720,
+ "epoch": 0.10285714285714286,
+ "eps": 3,
+ "loss/policy_avg": -0.0018287734128534794,
+ "loss/value_avg": 0.0010721104918047786,
+ "lr": 9.666666666666666e-07,
+ "objective/entropy": 11.005254745483398,
+ "objective/kl": 0.05638129264116287,
+ "objective/non_score_reward": -0.002819064538925886,
+ "objective/rlhf_reward": -0.002819064538925886,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00047117556096054614,
+ "policy/clipfrac_avg": 0.0015723269898444414,
+ "policy/entropy_avg": 0.20323297381401062,
+ "step": 30,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9995233416557312,
+ "val/ratio_var": 2.1699356693716254e-06
+ },
+ {
+ "episode": 744,
+ "epoch": 0.10628571428571429,
+ "eps": 3,
+ "loss/policy_avg": -0.0013574458425864577,
+ "loss/value_avg": 0.0011641676537692547,
+ "lr": 1e-06,
+ "objective/entropy": 11.450809478759766,
+ "objective/kl": -0.010124157182872295,
+ "objective/non_score_reward": 0.000506207812577486,
+ "objective/rlhf_reward": 0.000506207812577486,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004096939228475094,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.2194271683692932,
+ "step": 31,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000313401222229,
+ "val/ratio_var": 2.678104010556126e-06
+ },
+ {
+ "episode": 768,
+ "epoch": 0.10971428571428571,
+ "eps": 3,
+ "loss/policy_avg": -0.0002802757080644369,
+ "loss/value_avg": 0.001182440435513854,
+ "lr": 9.999676499856762e-07,
+ "objective/entropy": 9.76666259765625,
+ "objective/kl": 0.06829522550106049,
+ "objective/non_score_reward": -0.0034147612750530243,
+ "objective/rlhf_reward": -0.0034147612750530243,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00030742710805498064,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.19262900948524475,
+ "step": 32,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0005744695663452,
+ "val/ratio_var": 2.010586285905447e-06
+ },
+ {
+ "episode": 792,
+ "epoch": 0.11314285714285714,
+ "eps": 3,
+ "loss/policy_avg": -4.241405986249447e-05,
+ "loss/value_avg": 0.001140992040745914,
+ "lr": 9.998706045939205e-07,
+ "objective/entropy": 9.192793846130371,
+ "objective/kl": 0.011741682887077332,
+ "objective/non_score_reward": -0.0005870835739187896,
+ "objective/rlhf_reward": -0.0005870835739187896,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003442139131948352,
+ "policy/clipfrac_avg": 0.0007861634949222207,
+ "policy/entropy_avg": 0.18409989774227142,
+ "step": 33,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0005278587341309,
+ "val/ratio_var": 2.262753923787386e-06
+ },
+ {
+ "episode": 816,
+ "epoch": 0.11657142857142858,
+ "eps": 3,
+ "loss/policy_avg": -0.0008663232438266277,
+ "loss/value_avg": 0.001219075871631503,
+ "lr": 9.997088777777095e-07,
+ "objective/entropy": 9.554176330566406,
+ "objective/kl": 0.09085823595523834,
+ "objective/non_score_reward": -0.004542911425232887,
+ "objective/rlhf_reward": -0.004542911425232887,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003868684871122241,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.2021941840648651,
+ "step": 34,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9998293519020081,
+ "val/ratio_var": 1.1972449556196807e-06
+ },
+ {
+ "episode": 840,
+ "epoch": 0.12,
+ "eps": 3,
+ "loss/policy_avg": -1.8059727153740823e-05,
+ "loss/value_avg": 0.00110139069147408,
+ "lr": 9.994824927897762e-07,
+ "objective/entropy": 10.6250638961792,
+ "objective/kl": -0.057124651968479156,
+ "objective/non_score_reward": 0.002856232225894928,
+ "objective/rlhf_reward": 0.002856232225894928,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00036990485386922956,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.19656652212142944,
+ "step": 35,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000251054763794,
+ "val/ratio_var": 3.1737645258544944e-06
+ },
747
+ {
748
+ "episode": 864,
749
+ "epoch": 0.12342857142857143,
750
+ "eps": 3,
751
+ "loss/policy_avg": 0.0005211837124079466,
752
+ "loss/value_avg": 0.001098247361369431,
753
+ "lr": 9.99191482179265e-07,
754
+ "objective/entropy": 11.432059288024902,
755
+ "objective/kl": 0.03775228559970856,
756
+ "objective/non_score_reward": -0.001887614605948329,
757
+ "objective/rlhf_reward": -0.001887614605948329,
758
+ "objective/scores": 0.0,
759
+ "policy/approxkl_avg": 0.0015855329111218452,
+ "policy/clipfrac_avg": 0.001965408679097891,
+ "policy/entropy_avg": 0.2018502950668335,
+ "step": 36,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9998794794082642,
+ "val/ratio_var": 2.63856213678082e-06
+ },
+ {
+ "episode": 888,
+ "epoch": 0.12685714285714286,
+ "eps": 3,
+ "loss/policy_avg": -0.0003723090048879385,
+ "loss/value_avg": 0.0011761891655623913,
+ "lr": 9.988358877870534e-07,
+ "objective/entropy": 13.564403533935547,
+ "objective/kl": 0.14848901331424713,
+ "objective/non_score_reward": -0.007424450945109129,
+ "objective/rlhf_reward": -0.007424450945109129,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0005069933249615133,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.21795369684696198,
+ "step": 37,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9994122982025146,
+ "val/ratio_var": 3.516516017043614e-06
+ },
+ {
+ "episode": 912,
+ "epoch": 0.13028571428571428,
+ "eps": 3,
+ "loss/policy_avg": -0.0005580224096775055,
+ "loss/value_avg": 0.0010930404532700777,
+ "lr": 9.984157607397357e-07,
+ "objective/entropy": 12.46867561340332,
+ "objective/kl": 0.04366648197174072,
+ "objective/non_score_reward": -0.00218332395888865,
+ "objective/rlhf_reward": -0.00218332395888865,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.000503862276673317,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.2015783190727234,
+ "step": 38,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0004000663757324,
+ "val/ratio_var": 6.4164978539338335e-06
+ },
+ {
+ "episode": 936,
+ "epoch": 0.1337142857142857,
+ "eps": 3,
+ "loss/policy_avg": -0.0002338151098228991,
+ "loss/value_avg": 0.0011000724043697119,
+ "lr": 9.979311614422718e-07,
+ "objective/entropy": 11.266685485839844,
+ "objective/kl": 0.10951797664165497,
+ "objective/non_score_reward": -0.005475898738950491,
+ "objective/rlhf_reward": -0.005475898738950491,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004268661141395569,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.21669596433639526,
+ "step": 39,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9992050528526306,
+ "val/ratio_var": 1.6559835103180376e-06
+ },
+ {
+ "episode": 960,
+ "epoch": 0.13714285714285715,
+ "eps": 3,
+ "loss/policy_avg": -0.0007765925256535411,
+ "loss/value_avg": 0.001088449265807867,
+ "lr": 9.973821595693026e-07,
+ "objective/entropy": 12.338579177856445,
+ "objective/kl": 0.06209728866815567,
+ "objective/non_score_reward": -0.0031048643868416548,
+ "objective/rlhf_reward": -0.0031048643868416548,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0005026735016144812,
+ "policy/clipfrac_avg": 0.001179245300590992,
+ "policy/entropy_avg": 0.2244863510131836,
+ "step": 40,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9999690651893616,
+ "val/ratio_var": 5.0828230087063275e-06
+ },
+ {
+ "episode": 984,
+ "epoch": 0.14057142857142857,
+ "eps": 3,
+ "loss/policy_avg": 0.00013598555233329535,
+ "loss/value_avg": 0.001187118818052113,
+ "lr": 9.967688340551327e-07,
+ "objective/entropy": 9.387279510498047,
+ "objective/kl": -0.033916082233190536,
+ "objective/non_score_reward": 0.0016958042979240417,
+ "objective/rlhf_reward": 0.0016958042979240417,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004025290545541793,
+ "policy/clipfrac_avg": 0.0013757861452177167,
+ "policy/entropy_avg": 0.19263054430484772,
+ "step": 41,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0011463165283203,
+ "val/ratio_var": 3.6780629670829512e-06
+ },
+ {
+ "episode": 1008,
+ "epoch": 0.144,
+ "eps": 3,
+ "loss/policy_avg": -0.001398796564899385,
+ "loss/value_avg": 0.0012791166082024574,
+ "lr": 9.960912730823802e-07,
+ "objective/entropy": 9.443194389343262,
+ "objective/kl": 0.040594689548015594,
+ "objective/non_score_reward": -0.0020297346636652946,
+ "objective/rlhf_reward": -0.0020297346636652946,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00038655230309814215,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.17716416716575623,
+ "step": 42,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9991838335990906,
+ "val/ratio_var": 1.4638164884672733e-06
+ },
+ {
+ "episode": 1032,
+ "epoch": 0.14742857142857144,
+ "eps": 3,
+ "loss/policy_avg": -0.0008904861751943827,
+ "loss/value_avg": 0.0012651337310671806,
+ "lr": 9.953495740692994e-07,
+ "objective/entropy": 8.784332275390625,
+ "objective/kl": 0.048220910131931305,
+ "objective/non_score_reward": -0.0024110455997288227,
+ "objective/rlhf_reward": -0.0024110455997288227,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00036320951767265797,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.19501766562461853,
+ "step": 43,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9992011785507202,
+ "val/ratio_var": 7.953665885906958e-07
+ },
+ {
+ "episode": 1056,
+ "epoch": 0.15085714285714286,
+ "eps": 3,
+ "loss/policy_avg": -0.00040724524296820164,
+ "loss/value_avg": 0.0012381336418911815,
+ "lr": 9.945438436557734e-07,
+ "objective/entropy": 10.724536895751953,
+ "objective/kl": 0.09735743701457977,
+ "objective/non_score_reward": -0.004867871757596731,
+ "objective/rlhf_reward": -0.004867871757596731,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004834880237467587,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.19912828505039215,
+ "step": 44,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9993612766265869,
+ "val/ratio_var": 3.948975518142106e-06
+ },
+ {
+ "episode": 1080,
+ "epoch": 0.15428571428571428,
+ "eps": 3,
+ "loss/policy_avg": -0.0007694043451920152,
+ "loss/value_avg": 0.0012855801032856107,
+ "lr": 9.93674197687982e-07,
+ "objective/entropy": 10.774508476257324,
+ "objective/kl": 0.036663156002759933,
+ "objective/non_score_reward": -0.0018331576138734818,
+ "objective/rlhf_reward": -0.0018331576138734818,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003691558085847646,
+ "policy/clipfrac_avg": 0.0007861634949222207,
+ "policy/entropy_avg": 0.19582445919513702,
+ "step": 45,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0001473426818848,
+ "val/ratio_var": 3.0011931357876165e-06
+ },
+ {
+ "episode": 1104,
+ "epoch": 0.15771428571428572,
+ "eps": 3,
+ "loss/policy_avg": -0.0010882224887609482,
+ "loss/value_avg": 0.0011191555531695485,
+ "lr": 9.927407612017446e-07,
+ "objective/entropy": 8.94318962097168,
+ "objective/kl": 0.036568351089954376,
+ "objective/non_score_reward": -0.0018284174147993326,
+ "objective/rlhf_reward": -0.0018284174147993326,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00030608323868364096,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.18505753576755524,
+ "step": 46,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9989262819290161,
+ "val/ratio_var": 2.5265023850806756e-06
+ },
+ {
+ "episode": 1128,
+ "epoch": 0.16114285714285714,
+ "eps": 3,
+ "loss/policy_avg": 6.927899084985256e-05,
+ "loss/value_avg": 0.0010387528454884887,
+ "lr": 9.91743668404545e-07,
+ "objective/entropy": 13.284429550170898,
+ "objective/kl": 0.0841490626335144,
+ "objective/non_score_reward": -0.00420745275914669,
+ "objective/rlhf_reward": -0.00420745275914669,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0005432004109025002,
+ "policy/clipfrac_avg": 0.0011792451841756701,
+ "policy/entropy_avg": 0.2116156369447708,
+ "step": 47,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9999104738235474,
+ "val/ratio_var": 2.7350765776645858e-06
+ },
+ {
+ "episode": 1152,
+ "epoch": 0.16457142857142856,
+ "eps": 3,
+ "loss/policy_avg": -0.00019468856044113636,
+ "loss/value_avg": 0.0011783144436776638,
+ "lr": 9.906830626562331e-07,
+ "objective/entropy": 11.494680404663086,
+ "objective/kl": 0.04910100996494293,
+ "objective/non_score_reward": -0.0024550508242100477,
+ "objective/rlhf_reward": -0.0024550508242100477,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00043969100806862116,
+ "policy/clipfrac_avg": 0.001179245300590992,
+ "policy/entropy_avg": 0.23010241985321045,
+ "step": 48,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0009230375289917,
+ "val/ratio_var": 1.998189873120282e-06
+ },
+ {
+ "episode": 1176,
+ "epoch": 0.168,
+ "eps": 3,
+ "loss/policy_avg": -0.0006489114603027701,
+ "loss/value_avg": 0.001087805489078164,
+ "lr": 9.89559096448414e-07,
+ "objective/entropy": 11.260568618774414,
+ "objective/kl": 0.0814318135380745,
+ "objective/non_score_reward": -0.004071590956300497,
+ "objective/rlhf_reward": -0.004071590956300497,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00044973386684432626,
+ "policy/clipfrac_avg": 0.0007861634949222207,
+ "policy/entropy_avg": 0.1947825849056244,
+ "step": 49,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0001856088638306,
+ "val/ratio_var": 8.380131362173415e-07
+ },
+ {
+ "episode": 1200,
+ "epoch": 0.17142857142857143,
+ "eps": 3,
+ "loss/policy_avg": -0.0005568858468905091,
+ "loss/value_avg": 0.0010799263836815953,
+ "lr": 9.883719313825227e-07,
+ "objective/entropy": 12.4381742477417,
+ "objective/kl": 0.00899545382708311,
+ "objective/non_score_reward": -0.00044977269135415554,
+ "objective/rlhf_reward": -0.00044977269135415554,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0008998862467706203,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.2059018909931183,
+ "step": 50,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9994922280311584,
+ "val/ratio_var": 4.479263679968426e-06
+ },
+ {
+ "episode": 1224,
+ "epoch": 0.17485714285714285,
+ "eps": 3,
+ "loss/policy_avg": 0.00012879818677902222,
+ "loss/value_avg": 0.0011688255472108722,
+ "lr": 9.871217381465902e-07,
+ "objective/entropy": 9.98705768585205,
+ "objective/kl": -0.022387906908988953,
+ "objective/non_score_reward": 0.0011193952523171902,
+ "objective/rlhf_reward": 0.0011193952523171902,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.000400608463678509,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.1886395812034607,
+ "step": 51,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9999980926513672,
+ "val/ratio_var": 2.1912596821493935e-06
+ },
+ {
+ "episode": 1248,
+ "epoch": 0.1782857142857143,
+ "eps": 3,
+ "loss/policy_avg": -0.0006774354260414839,
+ "loss/value_avg": 0.001140483422204852,
+ "lr": 9.85808696490701e-07,
+ "objective/entropy": 10.60033893585205,
+ "objective/kl": -0.04120694845914841,
+ "objective/non_score_reward": 0.002060347469523549,
+ "objective/rlhf_reward": 0.002060347469523549,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003651762963272631,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.18694177269935608,
+ "step": 52,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0005837678909302,
+ "val/ratio_var": 3.188570644852007e-06
+ },
+ {
+ "episode": 1272,
+ "epoch": 0.18171428571428572,
+ "eps": 3,
+ "loss/policy_avg": -0.0014181339647620916,
+ "loss/value_avg": 0.0011729756370186806,
+ "lr": 9.844329952011504e-07,
+ "objective/entropy": 11.802154541015625,
+ "objective/kl": 0.048689838498830795,
+ "objective/non_score_reward": -0.0024344921112060547,
+ "objective/rlhf_reward": -0.0024344921112060547,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00045496394159272313,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.21137630939483643,
+ "step": 53,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0002349615097046,
+ "val/ratio_var": 1.731065481180849e-06
+ },
+ {
+ "episode": 1296,
+ "epoch": 0.18514285714285714,
+ "eps": 3,
+ "loss/policy_avg": -0.0007270214264281094,
+ "loss/value_avg": 0.0010591228492558002,
+ "lr": 9.829948320733e-07,
+ "objective/entropy": 11.661518096923828,
+ "objective/kl": 0.004553104750812054,
+ "objective/non_score_reward": -0.00022765521134715527,
+ "objective/rlhf_reward": -0.00022765521134715527,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.000439059833297506,
+ "policy/clipfrac_avg": 0.0007861634949222207,
+ "policy/entropy_avg": 0.19910752773284912,
+ "step": 54,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0010268688201904,
+ "val/ratio_var": 2.5225742774637183e-06
+ },
+ {
+ "episode": 1320,
+ "epoch": 0.18857142857142858,
+ "eps": 3,
+ "loss/policy_avg": -0.0006840452551841736,
+ "loss/value_avg": 0.0011034862836822867,
+ "lr": 9.8149441388314e-07,
+ "objective/entropy": 9.149593353271484,
+ "objective/kl": 0.057699695229530334,
+ "objective/non_score_reward": -0.002884984714910388,
+ "objective/rlhf_reward": -0.002884984714910388,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00033010850893333554,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.1894957423210144,
+ "step": 55,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9992285966873169,
+ "val/ratio_var": 8.357873753084277e-07
+ },
+ {
+ "episode": 1344,
+ "epoch": 0.192,
+ "eps": 3,
+ "loss/policy_avg": -0.0006162170320749283,
+ "loss/value_avg": 0.0011356081813573837,
+ "lr": 9.799319563575593e-07,
+ "objective/entropy": 11.028936386108398,
+ "objective/kl": -0.02875569462776184,
+ "objective/non_score_reward": 0.0014377848710864782,
+ "objective/rlhf_reward": 0.0014377848710864782,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00047001155326142907,
+ "policy/clipfrac_avg": 0.001768867950886488,
+ "policy/entropy_avg": 0.21663422882556915,
+ "step": 56,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000121831893921,
+ "val/ratio_var": 3.489588834781898e-06
+ },
+ {
+ "episode": 1368,
+ "epoch": 0.19542857142857142,
+ "eps": 3,
+ "loss/policy_avg": 0.00045855995267629623,
+ "loss/value_avg": 0.0011290146503597498,
+ "lr": 9.783076841433279e-07,
+ "objective/entropy": 10.055129051208496,
+ "objective/kl": 0.09907305240631104,
+ "objective/non_score_reward": -0.004953653551638126,
+ "objective/rlhf_reward": -0.004953653551638126,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004340106970630586,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.180865079164505,
+ "step": 57,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9993646740913391,
+ "val/ratio_var": 1.6994080169752124e-06
+ },
+ {
+ "episode": 1392,
+ "epoch": 0.19885714285714284,
+ "eps": 3,
+ "loss/policy_avg": -0.000757572939619422,
+ "loss/value_avg": 0.001213478040881455,
+ "lr": 9.76621830774799e-07,
+ "objective/entropy": 10.346697807312012,
+ "objective/kl": 0.06749838590621948,
+ "objective/non_score_reward": -0.003374919295310974,
+ "objective/rlhf_reward": -0.003374919295310974,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004295073449611664,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.18486541509628296,
+ "step": 58,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000084400177002,
+ "val/ratio_var": 2.244758661618107e-06
+ },
+ {
+ "episode": 1416,
+ "epoch": 0.2022857142857143,
+ "eps": 3,
+ "loss/policy_avg": -0.00044112117029726505,
+ "loss/value_avg": 0.001154930330812931,
+ "lr": 9.748746386403305e-07,
+ "objective/entropy": 11.446981430053711,
+ "objective/kl": 0.04666760563850403,
+ "objective/non_score_reward": -0.0023333802819252014,
+ "objective/rlhf_reward": -0.0023333802819252014,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00045786722330376506,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.19822391867637634,
+ "step": 59,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0004160404205322,
+ "val/ratio_var": 1.9798999346676283e-06
+ },
+ {
+ "episode": 1440,
+ "epoch": 0.2057142857142857,
+ "eps": 3,
+ "loss/policy_avg": -0.0011781796347349882,
+ "loss/value_avg": 0.0011596104595810175,
+ "lr": 9.730663589474364e-07,
+ "objective/entropy": 8.769947052001953,
+ "objective/kl": 0.06850416958332062,
+ "objective/non_score_reward": -0.0034252083860337734,
+ "objective/rlhf_reward": -0.0034252083860337734,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00029326786170713603,
+ "policy/clipfrac_avg": 0.0,
+ "policy/entropy_avg": 0.18261712789535522,
+ "step": 60,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000569224357605,
+ "val/ratio_var": 1.8464993445377331e-06
+ },
+ {
+ "episode": 1464,
+ "epoch": 0.20914285714285713,
+ "eps": 3,
+ "loss/policy_avg": -0.000520245754159987,
+ "loss/value_avg": 0.001076899585314095,
+ "lr": 9.711972516866678e-07,
+ "objective/entropy": 10.513251304626465,
+ "objective/kl": 0.008584271185100079,
+ "objective/non_score_reward": -0.0004292135126888752,
+ "objective/rlhf_reward": -0.0004292135126888752,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00044821258052252233,
+ "policy/clipfrac_avg": 0.0015723269898444414,
+ "policy/entropy_avg": 0.18403959274291992,
+ "step": 61,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0004669427871704,
+ "val/ratio_var": 1.8683210782910464e-06
+ },
+ {
+ "episode": 1488,
+ "epoch": 0.21257142857142858,
+ "eps": 3,
+ "loss/policy_avg": -7.968657882884145e-05,
+ "loss/value_avg": 0.0013330953661352396,
+ "lr": 9.692675855942318e-07,
+ "objective/entropy": 9.325455665588379,
+ "objective/kl": -0.04262801632285118,
+ "objective/non_score_reward": 0.002131400629878044,
+ "objective/rlhf_reward": 0.002131400629878044,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003892054664902389,
+ "policy/clipfrac_avg": 0.0017688678344711661,
+ "policy/entropy_avg": 0.1799222230911255,
+ "step": 62,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9998717308044434,
+ "val/ratio_var": 2.2758265458833193e-06
+ },
+ {
+ "episode": 1512,
+ "epoch": 0.216,
+ "eps": 3,
+ "loss/policy_avg": -0.0009765516733750701,
+ "loss/value_avg": 0.0009945188648998737,
+ "lr": 9.67277638113354e-07,
+ "objective/entropy": 13.20396614074707,
+ "objective/kl": 0.04450948163866997,
+ "objective/non_score_reward": -0.0022254744544625282,
+ "objective/rlhf_reward": -0.0022254744544625282,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.000418822281062603,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.24310021102428436,
+ "step": 63,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000614047050476,
+ "val/ratio_var": 3.149192025375669e-06
+ },
+ {
+ "episode": 1536,
+ "epoch": 0.21942857142857142,
+ "eps": 3,
+ "loss/policy_avg": -0.00018674280727282166,
+ "loss/value_avg": 0.0012268940918147564,
+ "lr": 9.652276953543877e-07,
+ "objective/entropy": 10.03724193572998,
+ "objective/kl": 0.010379712097346783,
+ "objective/non_score_reward": -0.0005189854418858886,
+ "objective/rlhf_reward": -0.0005189854418858886,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.000406760664191097,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.18875974416732788,
+ "step": 64,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9995108842849731,
+ "val/ratio_var": 1.4013304507898283e-06
+ },
+ {
+ "episode": 1560,
+ "epoch": 0.22285714285714286,
+ "eps": 3,
+ "loss/policy_avg": 0.00012768094893544912,
+ "loss/value_avg": 0.001114123035222292,
+ "lr": 9.631180520536777e-07,
+ "objective/entropy": 10.841609001159668,
+ "objective/kl": 0.08112704008817673,
+ "objective/non_score_reward": -0.004056352190673351,
+ "objective/rlhf_reward": -0.004056352190673351,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00039341190131381154,
+ "policy/clipfrac_avg": 0.00019654087373055518,
+ "policy/entropy_avg": 0.19926729798316956,
+ "step": 65,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.999116063117981,
+ "val/ratio_var": 2.176814632548485e-06
+ },
+ {
+ "episode": 1584,
+ "epoch": 0.22628571428571428,
+ "eps": 3,
+ "loss/policy_avg": 0.0002728139515966177,
+ "loss/value_avg": 0.001176491379737854,
+ "lr": 9.60949011531184e-07,
+ "objective/entropy": 8.628952026367188,
+ "objective/kl": 0.06317141652107239,
+ "objective/non_score_reward": -0.0031585709657520056,
+ "objective/rlhf_reward": -0.0031585709657520056,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00034158816561102867,
+ "policy/clipfrac_avg": 0.0,
+ "policy/entropy_avg": 0.17121586203575134,
+ "step": 66,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9997283220291138,
+ "val/ratio_var": 1.2662626431847457e-06
+ },
+ {
+ "episode": 1608,
+ "epoch": 0.2297142857142857,
+ "eps": 3,
+ "loss/policy_avg": -0.0013626827858388424,
+ "loss/value_avg": 0.001127548050135374,
+ "lr": 9.587208856468713e-07,
+ "objective/entropy": 10.164885520935059,
+ "objective/kl": -0.05043753236532211,
+ "objective/non_score_reward": 0.002521876711398363,
+ "objective/rlhf_reward": 0.002521876711398363,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00039231160189956427,
+ "policy/clipfrac_avg": 0.0007861634949222207,
+ "policy/entropy_avg": 0.18126381933689117,
+ "step": 67,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0005989074707031,
+ "val/ratio_var": 3.3499827623018064e-06
+ },
+ {
+ "episode": 1632,
+ "epoch": 0.23314285714285715,
+ "eps": 3,
+ "loss/policy_avg": 0.00018104736227542162,
+ "loss/value_avg": 0.001195235294289887,
+ "lr": 9.564339947558697e-07,
+ "objective/entropy": 8.992341995239258,
+ "objective/kl": 0.0629781186580658,
+ "objective/non_score_reward": -0.003148905700072646,
+ "objective/rlhf_reward": -0.003148905700072646,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00031165953259915113,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.1769479513168335,
+ "step": 68,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9993069171905518,
+ "val/ratio_var": 2.6862960567086702e-06
+ },
+ {
+ "episode": 1656,
+ "epoch": 0.23657142857142857,
+ "eps": 3,
+ "loss/policy_avg": -0.0006377133540809155,
+ "loss/value_avg": 0.001105936011299491,
+ "lr": 9.540886676624145e-07,
+ "objective/entropy": 9.024686813354492,
+ "objective/kl": 0.07867498695850372,
+ "objective/non_score_reward": -0.003933749161660671,
+ "objective/rlhf_reward": -0.003933749161660671,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004139389202464372,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.18283474445343018,
+ "step": 69,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.999620258808136,
+ "val/ratio_var": 1.2224793408677215e-06
+ },
+ {
+ "episode": 1680,
+ "epoch": 0.24,
+ "eps": 3,
+ "loss/policy_avg": -0.00034369935747236013,
+ "loss/value_avg": 0.0010239611146971583,
+ "lr": 9.516852415725732e-07,
+ "objective/entropy": 12.412653923034668,
+ "objective/kl": 0.08225254714488983,
+ "objective/non_score_reward": -0.004112627357244492,
+ "objective/rlhf_reward": -0.004112627357244492,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0005226304056122899,
+ "policy/clipfrac_avg": 0.001179245300590992,
+ "policy/entropy_avg": 0.23866336047649384,
+ "step": 70,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0004711151123047,
+ "val/ratio_var": 3.487248477540561e-06
+ },
+ {
+ "episode": 1704,
+ "epoch": 0.24342857142857144,
+ "eps": 3,
+ "loss/policy_avg": -0.0012513366527855396,
+ "loss/value_avg": 0.0012419146951287985,
+ "lr": 9.492240620457606e-07,
+ "objective/entropy": 13.152254104614258,
+ "objective/kl": 0.06805577129125595,
+ "objective/non_score_reward": -0.0034027881920337677,
+ "objective/rlhf_reward": -0.0034027881920337677,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0005608252249658108,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.2300390750169754,
+ "step": 71,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0004956722259521,
+ "val/ratio_var": 2.79126152236131e-06
+ },
+ {
+ "episode": 1728,
+ "epoch": 0.24685714285714286,
+ "eps": 3,
+ "loss/policy_avg": -0.0006021471926942468,
+ "loss/value_avg": 0.0010983510874211788,
+ "lr": 9.467054829450571e-07,
+ "objective/entropy": 11.628573417663574,
+ "objective/kl": 0.0897781252861023,
+ "objective/non_score_reward": -0.004488906357437372,
+ "objective/rlhf_reward": -0.004488906357437372,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004015603626612574,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.20409011840820312,
+ "step": 72,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000084638595581,
+ "val/ratio_var": 3.225579803256551e-06
+ },
+ {
+ "episode": 1752,
+ "epoch": 0.2502857142857143,
+ "eps": 3,
+ "loss/policy_avg": -0.00017274607671424747,
+ "loss/value_avg": 0.0011638551950454712,
+ "lr": 9.441298663863289e-07,
+ "objective/entropy": 10.489394187927246,
+ "objective/kl": 0.05405331775546074,
+ "objective/non_score_reward": -0.0027026659809052944,
+ "objective/rlhf_reward": -0.0027026659809052944,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003594418230932206,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.19105064868927002,
+ "step": 73,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0002226829528809,
+ "val/ratio_var": 1.964122702702298e-06
+ },
+ {
+ "episode": 1776,
+ "epoch": 0.2537142857142857,
+ "eps": 3,
+ "loss/policy_avg": 4.641339182853699e-05,
+ "loss/value_avg": 0.0012347290758043528,
+ "lr": 9.414975826861651e-07,
+ "objective/entropy": 8.907478332519531,
+ "objective/kl": 0.04718625172972679,
+ "objective/non_score_reward": -0.0023593127261847258,
+ "objective/rlhf_reward": -0.0023593127261847258,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003988428507000208,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.19814759492874146,
+ "step": 74,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000138521194458,
+ "val/ratio_var": 1.9056941482631373e-06
+ },
+ {
+ "episode": 1800,
+ "epoch": 0.2571428571428571,
+ "eps": 3,
+ "loss/policy_avg": -0.0015050144866108894,
+ "loss/value_avg": 0.0012774434871971607,
+ "lr": 9.388090103086343e-07,
+ "objective/entropy": 13.666069030761719,
+ "objective/kl": 0.2161487638950348,
+ "objective/non_score_reward": -0.01080743782222271,
+ "objective/rlhf_reward": -0.01080743782222271,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0012116986326873302,
+ "policy/clipfrac_avg": 0.003144653979688883,
+ "policy/entropy_avg": 0.24723494052886963,
+ "step": 75,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9994445443153381,
+ "val/ratio_var": 2.9358038773352746e-06
+ },
+ {
+ "episode": 1824,
+ "epoch": 0.26057142857142856,
+ "eps": 3,
+ "loss/policy_avg": -0.00010800664313137531,
+ "loss/value_avg": 0.0011818609200417995,
+ "lr": 9.360645358108695e-07,
+ "objective/entropy": 8.11184310913086,
+ "objective/kl": 0.04297180101275444,
+ "objective/non_score_reward": -0.0021485905162990093,
+ "objective/rlhf_reward": -0.0021485905162990093,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003635706380009651,
+ "policy/clipfrac_avg": 0.0017688678344711661,
+ "policy/entropy_avg": 0.17554748058319092,
+ "step": 76,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0001165866851807,
+ "val/ratio_var": 1.6492232361997594e-06
+ },
+ {
+ "episode": 1848,
+ "epoch": 0.264,
+ "eps": 3,
+ "loss/policy_avg": -0.000423701130785048,
+ "loss/value_avg": 0.0011581408325582743,
+ "lr": 9.332645537874899e-07,
+ "objective/entropy": 9.99191951751709,
+ "objective/kl": -0.004131610505282879,
+ "objective/non_score_reward": 0.00020658024004660547,
+ "objective/rlhf_reward": 0.00020658024004660547,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00039624073542654514,
+ "policy/clipfrac_avg": 0.0007861634949222207,
+ "policy/entropy_avg": 0.18171575665473938,
+ "step": 77,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0006225109100342,
+ "val/ratio_var": 2.7435085030447226e-06
+ },
+ {
+ "episode": 1872,
+ "epoch": 0.2674285714285714,
+ "eps": 3,
+ "loss/policy_avg": -0.0017575371311977506,
+ "loss/value_avg": 0.0011438459623605013,
+ "lr": 9.304094668138669e-07,
+ "objective/entropy": 10.796030044555664,
+ "objective/kl": 0.04405898600816727,
+ "objective/non_score_reward": -0.002202949021011591,
+ "objective/rlhf_reward": -0.002202949021011591,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003987160453107208,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.20184190571308136,
+ "step": 78,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0002810955047607,
+ "val/ratio_var": 3.7199308735580416e-06
+ },
+ {
+ "episode": 1896,
+ "epoch": 0.27085714285714285,
+ "eps": 3,
+ "loss/policy_avg": -0.0006982747581787407,
+ "loss/value_avg": 0.001091040438041091,
+ "lr": 9.274996853882425e-07,
+ "objective/entropy": 11.000476837158203,
+ "objective/kl": 0.0654320940375328,
+ "objective/non_score_reward": -0.0032716048881411552,
+ "objective/rlhf_reward": -0.0032716048881411552,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.000344442087225616,
+ "policy/clipfrac_avg": 0.00019654087373055518,
+ "policy/entropy_avg": 0.21253754198551178,
+ "step": 79,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9996973872184753,
+ "val/ratio_var": 1.350840307168255e-06
+ },
+ {
+ "episode": 1920,
+ "epoch": 0.2742857142857143,
+ "eps": 3,
+ "loss/policy_avg": -0.00022802106104791164,
+ "loss/value_avg": 0.0011568802874535322,
+ "lr": 9.245356278727093e-07,
+ "objective/entropy": 12.402453422546387,
+ "objective/kl": 0.017512556165456772,
+ "objective/non_score_reward": -0.0008756277966313064,
+ "objective/rlhf_reward": -0.0008756277966313064,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004438001196831465,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.21272364258766174,
+ "step": 80,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0009765625,
+ "val/ratio_var": 1.2678815437539015e-06
+ },
+ {
+ "episode": 1944,
+ "epoch": 0.2777142857142857,
+ "eps": 3,
+ "loss/policy_avg": -0.0021045708563178778,
+ "loss/value_avg": 0.0011669377563521266,
+ "lr": 9.215177204330587e-07,
+ "objective/entropy": 11.159307479858398,
+ "objective/kl": -0.03767850622534752,
+ "objective/non_score_reward": 0.0018839254043996334,
+ "objective/rlhf_reward": 0.0018839254043996334,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00048406008863821626,
+ "policy/clipfrac_avg": 0.002555031329393387,
+ "policy/entropy_avg": 0.2017732560634613,
+ "step": 81,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.001840353012085,
+ "val/ratio_var": 3.842494152195286e-06
+ },
+ {
+ "episode": 1968,
+ "epoch": 0.28114285714285714,
+ "eps": 3,
+ "loss/policy_avg": 0.000858477724250406,
+ "loss/value_avg": 0.0012590258847922087,
+ "lr": 9.184463969775083e-07,
+ "objective/entropy": 9.627300262451172,
+ "objective/kl": -0.05987191200256348,
+ "objective/non_score_reward": 0.002993595553562045,
+ "objective/rlhf_reward": 0.002993595553562045,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00036076403921470046,
+ "policy/clipfrac_avg": 0.00019654087373055518,
+ "policy/entropy_avg": 0.188791424036026,
+ "step": 82,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0008705854415894,
+ "val/ratio_var": 2.1515820662898477e-06
+ },
+ {
+ "episode": 1992,
+ "epoch": 0.2845714285714286,
+ "eps": 3,
+ "loss/policy_avg": 0.00014637189451605082,
+ "loss/value_avg": 0.0012444497551769018,
+ "lr": 9.153220990943145e-07,
+ "objective/entropy": 12.66540813446045,
+ "objective/kl": -0.0041860491037368774,
+ "objective/non_score_reward": 0.000209302845178172,
+ "objective/rlhf_reward": 0.000209302845178172,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004341425374150276,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.22873584926128387,
+ "step": 83,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9995162487030029,
+ "val/ratio_var": 4.206229277770035e-06
+ },
+ {
+ "episode": 2016,
+ "epoch": 0.288,
+ "eps": 3,
+ "loss/policy_avg": -0.0008962788851931691,
+ "loss/value_avg": 0.0012219806667417288,
+ "lr": 9.121452759882831e-07,
+ "objective/entropy": 9.737704277038574,
+ "objective/kl": 0.05109195038676262,
+ "objective/non_score_reward": -0.0025545977987349033,
+ "objective/rlhf_reward": -0.0025545977987349033,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00039628270315006375,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.19647139310836792,
+ "step": 84,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9999145269393921,
+ "val/ratio_var": 2.936012378995656e-06
+ },
+ {
+ "episode": 2040,
+ "epoch": 0.2914285714285714,
+ "eps": 3,
+ "loss/policy_avg": -0.001591079868376255,
+ "loss/value_avg": 0.0010952729498967528,
+ "lr": 9.08916384416183e-07,
+ "objective/entropy": 11.758407592773438,
+ "objective/kl": 0.14314039051532745,
+ "objective/non_score_reward": -0.007157019339501858,
+ "objective/rlhf_reward": -0.007157019339501858,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00046683146501891315,
+ "policy/clipfrac_avg": 0.001179245300590992,
+ "policy/entropy_avg": 0.2205151468515396,
+ "step": 85,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0003023147583008,
+ "val/ratio_var": 2.956066737169749e-06
+ },
+ {
+ "episode": 2064,
+ "epoch": 0.2948571428571429,
+ "eps": 3,
+ "loss/policy_avg": -0.0015333988703787327,
+ "loss/value_avg": 0.001151310047134757,
+ "lr": 9.056358886210747e-07,
+ "objective/entropy": 12.441439628601074,
+ "objective/kl": 0.08128754049539566,
+ "objective/non_score_reward": -0.0040643769316375256,
+ "objective/rlhf_reward": -0.0040643769316375256,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0005020004464313388,
+ "policy/clipfrac_avg": 0.001965408679097891,
+ "policy/entropy_avg": 0.22599831223487854,
+ "step": 86,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9994167685508728,
+ "val/ratio_var": 3.2662399007676868e-06
+ },
+ {
+ "episode": 2088,
+ "epoch": 0.29828571428571427,
+ "eps": 3,
+ "loss/policy_avg": 0.00033648847602307796,
+ "loss/value_avg": 0.001143318135291338,
+ "lr": 9.023042602655623e-07,
+ "objective/entropy": 9.817390441894531,
+ "objective/kl": -0.09843956679105759,
+ "objective/non_score_reward": 0.004921978339552879,
+ "objective/rlhf_reward": 0.004921978339552879,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00033533855457790196,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.19580014050006866,
+ "step": 87,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0008838176727295,
+ "val/ratio_var": 1.9849708223773632e-06
+ },
+ {
+ "episode": 2112,
+ "epoch": 0.3017142857142857,
+ "eps": 3,
+ "loss/policy_avg": 0.00017705616482999176,
+ "loss/value_avg": 0.0010654886718839407,
+ "lr": 8.989219783639795e-07,
+ "objective/entropy": 9.474151611328125,
+ "objective/kl": 0.08406557887792587,
+ "objective/non_score_reward": -0.004203279037028551,
+ "objective/rlhf_reward": -0.004203279037028551,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003710829478222877,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.1872844099998474,
+ "step": 88,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0003993511199951,
+ "val/ratio_var": 1.4447776948145474e-06
+ },
+ {
+ "episode": 2136,
+ "epoch": 0.30514285714285716,
+ "eps": 3,
+ "loss/policy_avg": -0.0004872347926720977,
+ "loss/value_avg": 0.001096963882446289,
+ "lr": 8.95489529213517e-07,
+ "objective/entropy": 9.022795677185059,
+ "objective/kl": 0.027196139097213745,
+ "objective/non_score_reward": -0.0013598070945590734,
+ "objective/rlhf_reward": -0.0013598070945590734,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003935418208129704,
+ "policy/clipfrac_avg": 0.001965408679097891,
+ "policy/entropy_avg": 0.1806870847940445,
+ "step": 89,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.999650239944458,
+ "val/ratio_var": 1.598498329258291e-06
+ },
+ {
+ "episode": 2160,
+ "epoch": 0.30857142857142855,
+ "eps": 3,
+ "loss/policy_avg": 6.0095335356891155e-05,
+ "loss/value_avg": 0.0011415532790124416,
+ "lr": 8.920074063243045e-07,
+ "objective/entropy": 10.911069869995117,
+ "objective/kl": 0.06437133252620697,
+ "objective/non_score_reward": -0.003218566533178091,
+ "objective/rlhf_reward": -0.003218566533178091,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003742141416296363,
+ "policy/clipfrac_avg": 0.001179245300590992,
+ "policy/entropy_avg": 0.18816140294075012,
+ "step": 90,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9992557764053345,
+ "val/ratio_var": 9.944599241862306e-07
+ },
+ {
+ "episode": 2184,
+ "epoch": 0.312,
+ "eps": 3,
+ "loss/policy_avg": 2.5081797502934933e-05,
+ "loss/value_avg": 0.0012206505052745342,
+ "lr": 8.884761103484547e-07,
+ "objective/entropy": 9.843175888061523,
+ "objective/kl": 0.03545796871185303,
+ "objective/non_score_reward": -0.0017728982493281364,
+ "objective/rlhf_reward": -0.0017728982493281364,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003428169875405729,
+ "policy/clipfrac_avg": 0.0,
+ "policy/entropy_avg": 0.1816243827342987,
+ "step": 91,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0000550746917725,
+ "val/ratio_var": 7.711231546636554e-07
+ },
+ {
+ "episode": 2208,
+ "epoch": 0.31542857142857145,
+ "eps": 3,
+ "loss/policy_avg": -0.0016599048394709826,
+ "loss/value_avg": 0.0011678379960358143,
+ "lr": 8.848961490080805e-07,
+ "objective/entropy": 10.757587432861328,
+ "objective/kl": 0.02934662066400051,
+ "objective/non_score_reward": -0.0014673307305201888,
+ "objective/rlhf_reward": -0.0014673307305201888,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00047666137106716633,
+ "policy/clipfrac_avg": 0.0027515720576047897,
+ "policy/entropy_avg": 0.21150188148021698,
+ "step": 92,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0001424551010132,
+ "val/ratio_var": 2.449437033646973e-06
+ },
+ {
+ "episode": 2232,
+ "epoch": 0.31885714285714284,
+ "eps": 3,
+ "loss/policy_avg": -0.00034765334567055106,
+ "loss/value_avg": 0.0010505674872547388,
+ "lr": 8.81268037022296e-07,
+ "objective/entropy": 10.09364128112793,
+ "objective/kl": 0.052005209028720856,
+ "objective/non_score_reward": -0.002600260078907013,
+ "objective/rlhf_reward": -0.002600260078907013,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00031719336402602494,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.17978450655937195,
+ "step": 93,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9992406964302063,
+ "val/ratio_var": 1.4672579027319443e-06
+ },
+ {
+ "episode": 2256,
+ "epoch": 0.3222857142857143,
+ "eps": 3,
+ "loss/policy_avg": -0.00022107455879449844,
+ "loss/value_avg": 0.0010250682244077325,
+ "lr": 8.775922960332108e-07,
+ "objective/entropy": 10.98604965209961,
+ "objective/kl": 0.036630094051361084,
+ "objective/non_score_reward": -0.0018315049819648266,
+ "objective/rlhf_reward": -0.0018315049819648266,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004163292469456792,
+ "policy/clipfrac_avg": 0.0007861634949222207,
+ "policy/entropy_avg": 0.19221962988376617,
+ "step": 94,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.0002532005310059,
+ "val/ratio_var": 3.751965323317563e-06
+ },
+ {
+ "episode": 2280,
+ "epoch": 0.32571428571428573,
+ "eps": 3,
+ "loss/policy_avg": -0.001208254136145115,
+ "loss/value_avg": 0.0010296719847247005,
+ "lr": 8.738694545309298e-07,
+ "objective/entropy": 9.24979019165039,
+ "objective/kl": 0.005700349807739258,
+ "objective/non_score_reward": -0.00028501744964160025,
+ "objective/rlhf_reward": -0.00028501744964160025,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003547874221112579,
+ "policy/clipfrac_avg": 0.00039308174746111035,
+ "policy/entropy_avg": 0.1905178278684616,
+ "step": 95,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9988124370574951,
+ "val/ratio_var": 2.4154353468475165e-06
+ },
+ {
+ "episode": 2304,
+ "epoch": 0.3291428571428571,
+ "eps": 3,
+ "loss/policy_avg": -0.0005715511506423354,
+ "loss/value_avg": 0.0010314981918781996,
+ "lr": 8.701000477775687e-07,
+ "objective/entropy": 10.696990966796875,
+ "objective/kl": 0.015906011685729027,
+ "objective/non_score_reward": -0.0007953002932481468,
+ "objective/rlhf_reward": -0.0007953002932481468,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0003968015080317855,
+ "policy/clipfrac_avg": 0.00019654087373055518,
+ "policy/entropy_avg": 0.2042274922132492,
+ "step": 96,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9995560646057129,
+ "val/ratio_var": 9.015092246045242e-07
+ },
+ {
+ "episode": 2328,
+ "epoch": 0.3325714285714286,
+ "eps": 3,
+ "loss/policy_avg": -0.00013049808330833912,
+ "loss/value_avg": 0.0011974009685218334,
+ "lr": 8.662846177302938e-07,
+ "objective/entropy": 11.063315391540527,
+ "objective/kl": 0.13328056037425995,
+ "objective/non_score_reward": -0.006664028391242027,
+ "objective/rlhf_reward": -0.006664028391242027,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004643582506105304,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.2180095911026001,
+ "step": 97,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9988436698913574,
+ "val/ratio_var": 4.182348675385583e-06
+ },
+ {
+ "episode": 2352,
+ "epoch": 0.336,
+ "eps": 3,
+ "loss/policy_avg": -0.00021513155661523342,
+ "loss/value_avg": 0.0011991425417363644,
+ "lr": 8.624237129634014e-07,
+ "objective/entropy": 8.599056243896484,
+ "objective/kl": 0.03197745233774185,
+ "objective/non_score_reward": -0.0015988722443580627,
+ "objective/rlhf_reward": -0.0015988722443580627,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00039817538345232606,
+ "policy/clipfrac_avg": 0.0009827043395489454,
+ "policy/entropy_avg": 0.18570010364055634,
+ "step": 98,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9998412728309631,
+ "val/ratio_var": 1.992225406866055e-06
+ },
+ {
+ "episode": 2376,
+ "epoch": 0.3394285714285714,
+ "eps": 3,
+ "loss/policy_avg": -0.0004443543148227036,
+ "loss/value_avg": 0.0011885116109624505,
+ "lr": 8.58517888589445e-07,
+ "objective/entropy": 11.050878524780273,
+ "objective/kl": 0.03590617701411247,
+ "objective/non_score_reward": -0.001795308431610465,
+ "objective/rlhf_reward": -0.001795308431610465,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.0004836043808609247,
+ "policy/clipfrac_avg": 0.0017688678344711661,
+ "policy/entropy_avg": 0.21578478813171387,
+ "step": 99,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 1.000609040260315,
+ "val/ratio_var": 2.0210025013511768e-06
+ },
+ {
+ "episode": 2400,
+ "epoch": 0.34285714285714286,
+ "eps": 3,
+ "loss/policy_avg": -0.0003154139267280698,
+ "loss/value_avg": 0.0011843700194731355,
+ "lr": 8.54567706179422e-07,
+ "objective/entropy": 10.662446975708008,
+ "objective/kl": 0.029185041785240173,
+ "objective/non_score_reward": -0.0014592523220926523,
+ "objective/rlhf_reward": -0.0014592523220926523,
+ "objective/scores": 0.0,
+ "policy/approxkl_avg": 0.00036341603845357895,
+ "policy/clipfrac_avg": 0.000589622650295496,
+ "policy/entropy_avg": 0.18732158839702606,
+ "step": 100,
+ "val/clipfrac_avg": 0.0,
+ "val/num_eos_tokens": 0,
+ "val/ratio": 0.9997389316558838,
+ "val/ratio_var": 2.346555902477121e-06
+ }
+ ],
+ "logging_steps": 1,
+ "max_steps": 292,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 1.0,
+ "save_steps": 50,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": true,
+ "should_save": true,
+ "should_training_stop": false
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0,
+ "train_batch_size": null,
+ "trial_name": null,
+ "trial_params": null
+ }