ismaelR committed on
Commit
d039457
·
verified ·
1 Parent(s): bf117a3

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,216 @@
1
+ ---
2
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
3
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
4
+ base_model:
5
+ - Qwen/Qwen3-1.7B
6
+ datasets: []
7
+ language:
8
+ - en
9
+ library_name: transformers
10
+ metrics: []
11
+ pipeline_tag: text-generation
12
+ tags: []
13
+
14
+ ---
15
+
16
+ # Model Card for ismaelR/(complete)
17
+
18
+ <!-- Provide a quick summary of what the model is/does. -->
19
+
20
+ This model was fine-tuned from Qwen/Qwen3-1.7B with GRPO (Group Relative Policy Optimization).
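The core of GRPO is that each completion's reward is compared with the other completions sampled for the same prompt. The sketch below illustrates that normalization step only; the function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration and are not taken from the training code behind this commit.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the group sampled for the same prompt.

    Hypothetical helper: `eps` and the population standard deviation are
    illustrative choices, not values read from this repository.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One completion out of four earns a small reward: it gets a positive
# advantage and the others a negative one.
mixed = group_relative_advantages([0.0, 0.0, 0.05, 0.0])

# Every completion earns the same reward: all advantages are zero, so the
# group contributes no policy-gradient signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this formulation a group whose rewards are all identical yields zero advantages, which may be one reason many entries in `trainer_state.json` pair a `reward_std` of 0.0 with a `loss` of 0.0.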
21
+
22
+ ## Model Details
23
+
24
+ ### Model Description
25
+
26
+ <!-- Provide a longer summary of what this model is. -->
27
+
28
+
29
+
30
+ - **Developed by:** Orange
31
+ - **Funded by [optional]:** [More Information Needed]
32
+ - **Shared by [optional]:** [More Information Needed]
33
+ - **Model type:** [More Information Needed]
34
+ - **Language(s) (NLP):** English
35
+ - **License:** [More Information Needed]
36
+ - **Finetuned from model [optional]:** Qwen/Qwen3-1.7B
37
+ - **Date [optional]:** 2025-07-21 21:48:00
38
+
39
+ ### Model Sources [optional]
40
+
41
+ <!-- Provide the basic links for the model. -->
42
+
43
+ - **Repository:** [More Information Needed]
44
+ - **Paper [optional]:** [More Information Needed]
45
+ - **Demo [optional]:** [More Information Needed]
46
+
47
+ ## Uses
48
+
49
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
50
+
51
+ ### Direct Use
52
+
53
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
54
+
55
+
56
+ This model can be used with the `transformers` library using `pipeline` abstraction as follows:
57
+
58
+ ```python
59
+ import torch
60
+ from transformers import pipeline
61
+
62
+ model_id = "Orange/Qwen-2.5-O.5B-regexp"  # as written in the card; it does not match this repo (fine-tuned from Qwen/Qwen3-1.7B), so substitute the actual repo id
63
+ pipe = pipeline(
64
+     "text-generation",
65
+     model=model_id,
66
+     torch_dtype=torch.bfloat16,
67
+     device_map="auto",
68
+ )
69
+ messages = [
70
+     {"role": "system", "content": "You are chatbot specialized on Unknown domain."},
71
+     {"role": "user", "content": "Can you give a sample of your specialized knowledge?"},
72
+ ]
73
+ outputs = pipe(
74
+     messages,
75
+     max_new_tokens=256,
76
+ )
77
+ print(outputs[0]["generated_text"][-1])  # last message of the conversation, i.e. the assistant reply
78
+ ```
79
+
80
+ ### Downstream Use [optional]
81
+
82
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
83
+
84
+ [More Information Needed]
85
+
86
+ ### Out-of-Scope Use
87
+
88
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
89
+
90
+ [More Information Needed]
91
+
92
+ ## Bias, Risks, and Limitations
93
+
94
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
95
+
96
+ [More Information Needed]
97
+
98
+ ### Recommendations
99
+
100
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
101
+
102
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
103
+
104
+ ## How to Get Started with the Model
105
+
106
+ Use the code below to get started with the model.
107
+
108
+ [More Information Needed]
109
+
110
+ ## Training Details
111
+
112
+ This model was fine-tuned with [Orange internal fine-tuning tools](https://gitlab.tech.orange/NEPAL/knowledge/orangelm/lm-adaptation/), using the Docker image tagged `0.1.1` in the [registry](https://gitlab.tech.orange/NEPAL/knowledge/orangelm/lm-adaptation/container_registry/84664) and the following configuration file:
113
+
114
+ #### Speeds, Sizes, Times [optional]
115
+
116
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
117
+
118
+ [More Information Needed]
119
+
120
+ ## Evaluation
121
+
122
+ <!-- This section describes the evaluation protocols and provides the results. -->
123
+
124
+ ### Testing Data, Factors & Metrics
125
+
126
+ #### Testing Data
127
+
128
+ <!-- This should link to a Dataset Card if possible. -->
129
+
130
+ [More Information Needed]
131
+
132
+ #### Factors
133
+
134
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
135
+
136
+ [More Information Needed]
137
+
138
+ #### Metrics
139
+
140
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
141
+
142
+ [More Information Needed]
143
+
144
+ ### Results
145
+
146
+ [More Information Needed]
147
+
148
+ #### Summary
149
+
150
+
151
+
152
+ ## Model Examination [optional]
153
+
154
+ <!-- Relevant interpretability work for the model goes here -->
155
+
156
+ [More Information Needed]
157
+
158
+ ## Environmental Impact
159
+
160
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
161
+
162
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
163
+
164
+ - **Hardware Type:** [More Information Needed]
165
+ - **Hours used:** [More Information Needed]
166
+ - **Cloud Provider:** [More Information Needed]
167
+ - **Compute Region:** [More Information Needed]
168
+ - **Carbon Emitted:** [More Information Needed]
169
+
170
+ ## Technical Specifications [optional]
171
+
172
+ ### Model Architecture and Objective
173
+
174
+ [More Information Needed]
175
+
176
+ ### Compute Infrastructure
177
+
178
+ [More Information Needed]
179
+
180
+ #### Hardware
181
+
182
+ [More Information Needed]
183
+
184
+ #### Software
185
+
186
+ [More Information Needed]
187
+
188
+ ## Citation [optional]
189
+
190
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
191
+
192
+ **BibTeX:**
193
+
194
+ [More Information Needed]
195
+
196
+ **APA:**
197
+
198
+ [More Information Needed]
199
+
200
+ ## Glossary [optional]
201
+
202
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
203
+
204
+ [More Information Needed]
205
+
206
+ ## More Information [optional]
207
+
208
+ [More Information Needed]
209
+
210
+ ## Model Card Authors [optional]
211
+
212
+ [More Information Needed]
213
+
214
+ ## Model Card Contact
215
+
216
+ Thanks to [Ismaël Rousseau](mailto:ismael.rousseau@orange.com) for adding this model.
added_tokens.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "</think>": 151668,
3
+ "</tool_call>": 151658,
4
+ "</tool_response>": 151666,
5
+ "<think>": 151667,
6
+ "<tool_call>": 151657,
7
+ "<tool_response>": 151665,
8
+ "<|box_end|>": 151649,
9
+ "<|box_start|>": 151648,
10
+ "<|endoftext|>": 151643,
11
+ "<|file_sep|>": 151664,
12
+ "<|fim_middle|>": 151660,
13
+ "<|fim_pad|>": 151662,
14
+ "<|fim_prefix|>": 151659,
15
+ "<|fim_suffix|>": 151661,
16
+ "<|im_end|>": 151645,
17
+ "<|im_start|>": 151644,
18
+ "<|image_pad|>": 151655,
19
+ "<|object_ref_end|>": 151647,
20
+ "<|object_ref_start|>": 151646,
21
+ "<|quad_end|>": 151651,
22
+ "<|quad_start|>": 151650,
23
+ "<|repo_name|>": 151663,
24
+ "<|video_pad|>": 151656,
25
+ "<|vision_end|>": 151653,
26
+ "<|vision_pad|>": 151654,
27
+ "<|vision_start|>": 151652
28
+ }
config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen3ForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 151643,
8
+ "eos_token_id": 151645,
9
+ "head_dim": 128,
10
+ "hidden_act": "silu",
11
+ "hidden_size": 2048,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 6144,
14
+ "max_position_embeddings": 40960,
15
+ "max_window_layers": 28,
16
+ "model_type": "qwen3",
17
+ "num_attention_heads": 16,
18
+ "num_hidden_layers": 28,
19
+ "num_key_value_heads": 8,
20
+ "rms_norm_eps": 1e-06,
21
+ "rope_scaling": null,
22
+ "rope_theta": 1000000,
23
+ "sliding_window": null,
24
+ "tie_word_embeddings": true,
25
+ "torch_dtype": "bfloat16",
26
+ "transformers_version": "4.51.3",
27
+ "use_cache": false,
28
+ "use_sliding_window": false,
29
+ "vocab_size": 151936
30
+ }
generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "bos_token_id": 151643,
3
+ "do_sample": true,
4
+ "eos_token_id": [
5
+ 151645,
6
+ 151643
7
+ ],
8
+ "pad_token_id": 151643,
9
+ "temperature": 0.6,
10
+ "top_k": 20,
11
+ "top_p": 0.95,
12
+ "transformers_version": "4.51.3"
13
+ }
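The sampling settings above (`temperature` 0.6, `top_k` 20, `top_p` 0.95) reshape the next-token distribution before a token is drawn. The following is a minimal pure-Python sketch of that filtering order (temperature, then top-k, then top-p); `filter_distribution` is a hypothetical helper written for illustration and is not the `transformers` implementation, whose boundary handling may differ.

```python
import math


def filter_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


# Toy five-token vocabulary; top_k is lowered to 3 so the cut is visible.
dist = filter_distribution([2.0, 1.0, 0.5, -1.0, -3.0], top_k=3)
tighter = filter_distribution([2.0, 1.0, 0.5, -1.0, -3.0], top_k=3, top_p=0.9)
```

With the toy logits, the two lowest-scoring tokens are removed by top-k, and lowering `top_p` to 0.9 additionally drops the third-ranked token.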
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea429df27a66e791c31d4955b3116ae7f380762ebbe1f116e6ab606c20d24b0b
3
+ size 3441185608
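As a sanity check, the LFS size above can be compared with a parameter count derived from `config.json`. The sketch below assumes the usual Qwen3 decoder layout (bias-free q/k/v/o projections, per-head q/k RMSNorm weights of size `head_dim`, a gated MLP with three projections, two RMSNorm weights per layer, one final norm, and tied input/output embeddings); that layout is an assumption, not something stated in this commit.

```python
# Values copied from config.json in this commit.
cfg = {
    "vocab_size": 151936,
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "tie_word_embeddings": True,
}


def count_parameters(c):
    """Estimate the parameter count under the assumed Qwen3 layer layout."""
    h, d = c["hidden_size"], c["head_dim"]
    q_out = c["num_attention_heads"] * d
    kv_out = c["num_key_value_heads"] * d
    attention = h * q_out + 2 * h * kv_out + q_out * h  # q, k, v, o projections
    qk_norm = 2 * d                                      # per-head q/k RMSNorm weights
    mlp = 3 * h * c["intermediate_size"]                 # gate, up, down projections
    layer_norms = 2 * h                                  # pre-attention and pre-MLP norms
    per_layer = attention + qk_norm + mlp + layer_norms
    embeddings = c["vocab_size"] * h
    lm_head = 0 if c["tie_word_embeddings"] else c["vocab_size"] * h
    final_norm = h
    return embeddings + c["num_hidden_layers"] * per_layer + final_norm + lm_head


total = count_parameters(cfg)
weight_bytes = 2 * total             # bfloat16 stores 2 bytes per parameter
lfs_size = 3441185608                # size recorded in the LFS pointer above
overhead = lfs_size - weight_bytes   # remainder, presumably the safetensors header
```

Under these assumptions the estimate is 1,720,574,976 parameters, or 3,441,149,952 bytes in bfloat16, which leaves 35,656 bytes unaccounted for; that is plausibly the safetensors JSON header.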
rng_state.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b06ef6eaca7186a82f570d7c2103aedeadb16957e46079b291ef42a482a8f605
3
+ size 14688
scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a88e06f1ec2f76af747b9baddeaee79747d2440d7c4e0730f8204da519e6a294
3
+ size 1064
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:67cc0080ffd7555f723f423c27cfef314e1ad9d335c8b79f465c5faba1ed478b
3
+ size 11422821
tokenizer_config.json ADDED
@@ -0,0 +1,240 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ }
213
+ },
214
+ "additional_special_tokens": [
215
+ "<|im_start|>",
216
+ "<|im_end|>",
217
+ "<|object_ref_start|>",
218
+ "<|object_ref_end|>",
219
+ "<|box_start|>",
220
+ "<|box_end|>",
221
+ "<|quad_start|>",
222
+ "<|quad_end|>",
223
+ "<|vision_start|>",
224
+ "<|vision_end|>",
225
+ "<|vision_pad|>",
226
+ "<|image_pad|>",
227
+ "<|video_pad|>"
228
+ ],
229
+ "bos_token": null,
230
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is string %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if '</think>' in content %}\n {%- set reasoning_content = 
content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}",
231
+ "clean_up_tokenization_spaces": false,
232
+ "eos_token": "<|im_end|>",
233
+ "errors": "replace",
234
+ "extra_special_tokens": {},
235
+ "model_max_length": 131072,
236
+ "pad_token": "<|endoftext|>",
237
+ "split_special_tokens": false,
238
+ "tokenizer_class": "Qwen2Tokenizer",
239
+ "unk_token": null
240
+ }
trainer_state.json ADDED
@@ -0,0 +1,2538 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 20.0,
6
+ "eval_steps": 250,
7
+ "global_step": 2000,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "completion_length": 67.9625,
14
+ "epoch": 0.1,
15
+ "grad_norm": 0.0,
16
+ "kl": 0.0,
17
+ "learning_rate": 9.97e-07,
18
+ "loss": 0.0,
19
+ "reward": 0.0,
20
+ "reward_std": 0.0,
21
+ "rewards/DCR_reward": 0.0,
22
+ "step": 10
23
+ },
24
+ {
25
+ "completion_length": 67.125,
26
+ "epoch": 0.2,
27
+ "grad_norm": 0.0,
28
+ "kl": 0.0,
29
+ "learning_rate": 9.936666666666667e-07,
30
+ "loss": 0.0,
31
+ "reward": 0.0,
32
+ "reward_std": 0.0,
33
+ "rewards/DCR_reward": 0.0,
34
+ "step": 20
35
+ },
36
+ {
37
+ "completion_length": 58.3125,
38
+ "epoch": 0.3,
39
+ "grad_norm": 0.0,
40
+ "kl": 1.6021728515625e-05,
41
+ "learning_rate": 9.903333333333333e-07,
42
+ "loss": 0.0,
43
+ "reward": 0.0007361111231148243,
44
+ "reward_std": 0.0019508397206664085,
45
+ "rewards/DCR_reward": 0.0007361111231148243,
46
+ "step": 30
47
+ },
48
+ {
49
+ "completion_length": 81.925,
50
+ "epoch": 0.4,
51
+ "grad_norm": 0.0,
52
+ "kl": 0.0004387378692626953,
53
+ "learning_rate": 9.87e-07,
54
+ "loss": 0.0,
55
+ "reward": 0.0,
56
+ "reward_std": 0.0,
57
+ "rewards/DCR_reward": 0.0,
58
+ "step": 40
59
+ },
60
+ {
61
+ "completion_length": 69.9625,
62
+ "epoch": 0.5,
63
+ "grad_norm": 0.0,
64
+ "kl": 0.001148056983947754,
65
+ "learning_rate": 9.836666666666666e-07,
66
+ "loss": 0.0,
67
+ "reward": 0.0,
68
+ "reward_std": 0.0,
69
+ "rewards/DCR_reward": 0.0,
70
+ "step": 50
71
+ },
72
+ {
73
+ "completion_length": 103.45,
74
+ "epoch": 0.6,
75
+ "grad_norm": 0.0,
76
+ "kl": 0.02473886013031006,
77
+ "learning_rate": 9.803333333333332e-07,
78
+ "loss": -0.0,
79
+ "reward": 0.004910160228610039,
80
+ "reward_std": 0.013888029754161835,
81
+ "rewards/DCR_reward": 0.004910160228610039,
82
+ "step": 60
83
+ },
84
+ {
85
+ "completion_length": 74.0125,
86
+ "epoch": 0.7,
87
+ "grad_norm": 24.0,
88
+ "kl": 0.000667726993560791,
89
+ "learning_rate": 9.77e-07,
90
+ "loss": -0.0,
91
+ "reward": 0.052958965534344316,
92
+ "reward_std": 0.022137212846428157,
93
+ "rewards/DCR_reward": 0.052958965534344316,
94
+ "step": 70
95
+ },
96
+ {
97
+ "completion_length": 64.8625,
98
+ "epoch": 0.8,
99
+ "grad_norm": 0.0,
100
+ "kl": 0.000375521183013916,
101
+ "learning_rate": 9.736666666666667e-07,
102
+ "loss": 0.0,
103
+ "reward": 0.0,
104
+ "reward_std": 0.0,
105
+ "rewards/DCR_reward": 0.0,
106
+ "step": 80
107
+ },
108
+ {
109
+ "completion_length": 65.25,
110
+ "epoch": 0.9,
111
+ "grad_norm": 0.0,
112
+ "kl": 0.0009482383728027343,
113
+ "learning_rate": 9.703333333333332e-07,
114
+ "loss": 0.0,
115
+ "reward": 0.0,
116
+ "reward_std": 0.0,
117
+ "rewards/DCR_reward": 0.0,
118
+ "step": 90
119
+ },
120
+ {
121
+ "completion_length": 67.1875,
122
+ "epoch": 1.0,
123
+ "grad_norm": 0.0,
124
+ "kl": 0.0008351325988769532,
125
+ "learning_rate": 9.67e-07,
126
+ "loss": 0.0,
127
+ "reward": 0.02388598620891571,
128
+ "reward_std": 0.026214474439620973,
129
+ "rewards/DCR_reward": 0.02388598620891571,
130
+ "step": 100
131
+ },
132
+ {
133
+ "completion_length": 58.2125,
134
+ "epoch": 1.1,
135
+ "grad_norm": 0.0,
136
+ "kl": 0.0007820606231689453,
137
+ "learning_rate": 9.636666666666666e-07,
138
+ "loss": 0.0,
139
+ "reward": 0.0,
140
+ "reward_std": 0.0,
141
+ "rewards/DCR_reward": 0.0,
142
+ "step": 110
143
+ },
144
+ {
145
+ "completion_length": 63.0375,
146
+ "epoch": 1.2,
147
+ "grad_norm": 0.0,
148
+ "kl": 0.002427983283996582,
149
+ "learning_rate": 9.603333333333333e-07,
150
+ "loss": 0.0,
151
+ "reward": 0.009990863502025604,
152
+ "reward_std": 0.01850293278694153,
153
+ "rewards/DCR_reward": 0.009990863502025604,
154
+ "step": 120
155
+ },
156
+ {
157
+ "completion_length": 98.425,
158
+ "epoch": 1.3,
159
+ "grad_norm": 0.0,
160
+ "kl": 0.0005758762359619141,
161
+ "learning_rate": 9.57e-07,
162
+ "loss": 0.0,
163
+ "reward": 0.00625,
164
+ "reward_std": 0.01767766922712326,
165
+ "rewards/DCR_reward": 0.00625,
166
+ "step": 130
167
+ },
168
+ {
169
+ "completion_length": 74.8375,
170
+ "epoch": 1.4,
171
+ "grad_norm": 0.0,
172
+ "kl": 0.0006227374076843261,
173
+ "learning_rate": 9.536666666666667e-07,
174
+ "loss": 0.0,
175
+ "reward": 0.0,
176
+ "reward_std": 0.0,
177
+ "rewards/DCR_reward": 0.0,
178
+ "step": 140
179
+ },
180
+ {
181
+ "completion_length": 74.525,
182
+ "epoch": 1.5,
183
+ "grad_norm": 0.0,
184
+ "kl": 0.0026155948638916016,
185
+ "learning_rate": 9.503333333333333e-07,
186
+ "loss": 0.0,
187
+ "reward": 0.05994144082069397,
188
+ "reward_std": 0.0,
189
+ "rewards/DCR_reward": 0.05994144082069397,
190
+ "step": 150
191
+ },
192
+ {
193
+ "completion_length": 107.5625,
194
+ "epoch": 1.6,
195
+ "grad_norm": 0.0,
196
+ "kl": 0.0027020096778869627,
197
+ "learning_rate": 9.469999999999999e-07,
198
+ "loss": 0.0,
199
+ "reward": 0.028785842657089233,
200
+ "reward_std": 0.024412646889686584,
201
+ "rewards/DCR_reward": 0.028785842657089233,
202
+ "step": 160
203
+ },
204
+ {
205
+ "completion_length": 92.0,
206
+ "epoch": 1.7,
207
+ "grad_norm": 0.0,
208
+ "kl": 0.001540231704711914,
209
+ "learning_rate": 9.436666666666667e-07,
210
+ "loss": -0.0,
211
+ "reward": 0.005954481801018119,
212
+ "reward_std": 0.01657787673175335,
213
+ "rewards/DCR_reward": 0.005954481801018119,
214
+ "step": 170
215
+ },
216
+ {
217
+ "completion_length": 89.55,
218
+ "epoch": 1.8,
219
+ "grad_norm": 0.0,
220
+ "kl": 0.00039608478546142577,
221
+ "learning_rate": 9.403333333333333e-07,
222
+ "loss": 0.0,
223
+ "reward": 0.0,
224
+ "reward_std": 0.0,
225
+ "rewards/DCR_reward": 0.0,
226
+ "step": 180
227
+ },
228
+ {
229
+ "completion_length": 66.075,
230
+ "epoch": 1.9,
231
+ "grad_norm": 0.0,
232
+ "kl": 0.0017781257629394531,
233
+ "learning_rate": 9.37e-07,
234
+ "loss": 0.0,
235
+ "reward": 0.0013888888992369176,
236
+ "reward_std": 0.003928371146321297,
237
+ "rewards/DCR_reward": 0.0013888888992369176,
238
+ "step": 190
239
+ },
240
+ {
241
+ "completion_length": 57.3625,
242
+ "epoch": 2.0,
243
+ "grad_norm": 0.0,
244
+ "kl": 0.0024953842163085937,
245
+ "learning_rate": 9.336666666666666e-07,
246
+ "loss": 0.0,
247
+ "reward": 0.00016590238083153964,
248
+ "reward_std": 6.703471299260855e-05,
249
+ "rewards/DCR_reward": 0.00016590238083153964,
250
+ "step": 200
251
+ },
252
+ {
253
+ "completion_length": 112.0625,
254
+ "epoch": 2.1,
255
+ "grad_norm": 0.0,
256
+ "kl": 0.004532432556152344,
257
+ "learning_rate": 9.303333333333333e-07,
258
+ "loss": -0.0,
259
+ "reward": 0.005078960955142975,
260
+ "reward_std": 0.014365470409393311,
261
+ "rewards/DCR_reward": 0.005078960955142975,
262
+ "step": 210
263
+ },
264
+ {
265
+ "completion_length": 87.725,
266
+ "epoch": 2.2,
267
+ "grad_norm": 0.0,
268
+ "kl": 0.009652423858642577,
269
+ "learning_rate": 9.27e-07,
270
+ "loss": -0.0,
271
+ "reward": 0.03418266177177429,
272
+ "reward_std": 0.021894259750843047,
273
+ "rewards/DCR_reward": 0.03418266177177429,
274
+ "step": 220
275
+ },
276
+ {
277
+ "completion_length": 99.2375,
278
+ "epoch": 2.3,
279
+ "grad_norm": 0.0,
280
+ "kl": 0.001990985870361328,
281
+ "learning_rate": 9.236666666666666e-07,
282
+ "loss": 0.0,
283
+ "reward": 0.006875000149011612,
284
+ "reward_std": 0.01944543719291687,
285
+ "rewards/DCR_reward": 0.006875000149011612,
286
+ "step": 230
287
+ },
288
+ {
289
+ "completion_length": 87.775,
290
+ "epoch": 2.4,
291
+ "grad_norm": 0.0,
292
+ "kl": 0.0026398658752441405,
293
+ "learning_rate": 9.203333333333333e-07,
294
+ "loss": 0.0,
295
+ "reward": 0.002777777798473835,
296
+ "reward_std": 0.005143444985151291,
297
+ "rewards/DCR_reward": 0.002777777798473835,
298
+ "step": 240
299
+ },
300
+ {
301
+ "completion_length": 74.575,
302
+ "epoch": 2.5,
303
+ "grad_norm": 0.0,
304
+ "kl": 0.01164557933807373,
305
+ "learning_rate": 9.17e-07,
306
+ "loss": 0.0,
307
+ "reward": 0.00416666679084301,
308
+ "reward_std": 0.0025717224925756454,
309
+ "rewards/DCR_reward": 0.00416666679084301,
310
+ "step": 250
311
+ },
312
+ {
313
+ "epoch": 2.5,
314
+ "eval_completion_length": 84.36375,
315
+ "eval_kl": 0.006733891963958741,
316
+ "eval_loss": -2.638111595842929e-07,
317
+ "eval_reward": 0.016272501772618853,
318
+ "eval_reward_std": 0.014607391188983741,
319
+ "eval_rewards/DCR_reward": 0.016272501772618853,
320
+ "eval_runtime": 478.3232,
321
+ "eval_samples_per_second": 0.209,
322
+ "eval_steps_per_second": 0.027,
323
+ "step": 250
324
+ },
325
+ {
326
+ "completion_length": 89.6375,
327
+ "epoch": 2.6,
328
+ "grad_norm": 0.0,
329
+ "kl": 0.009255027770996094,
330
+ "learning_rate": 9.136666666666666e-07,
331
+ "loss": 0.0,
332
+ "reward": 0.018863770365715026,
333
+ "reward_std": 0.025783935189247133,
334
+ "rewards/DCR_reward": 0.018863770365715026,
335
+ "step": 260
336
+ },
337
+ {
338
+ "completion_length": 60.0,
339
+ "epoch": 2.7,
340
+ "grad_norm": 0.0,
341
+ "kl": 0.005229568481445313,
342
+ "learning_rate": 9.103333333333333e-07,
343
+ "loss": 0.0,
344
+ "reward": 0.0,
345
+ "reward_std": 0.0,
346
+ "rewards/DCR_reward": 0.0,
347
+ "step": 270
348
+ },
349
+ {
350
+ "completion_length": 89.1875,
351
+ "epoch": 2.8,
352
+ "grad_norm": 0.0,
353
+ "kl": 0.005951988697052002,
354
+ "learning_rate": 9.07e-07,
355
+ "loss": 0.0,
356
+ "reward": 0.0,
357
+ "reward_std": 0.0,
358
+ "rewards/DCR_reward": 0.0,
359
+ "step": 280
360
+ },
361
+ {
362
+ "completion_length": 76.9125,
363
+ "epoch": 2.9,
364
+ "grad_norm": 0.0,
365
+ "kl": 0.008041763305664062,
366
+ "learning_rate": 9.036666666666666e-07,
367
+ "loss": 0.0,
368
+ "reward": 0.0,
369
+ "reward_std": 0.0,
370
+ "rewards/DCR_reward": 0.0,
371
+ "step": 290
372
+ },
373
+ {
374
+ "completion_length": 64.475,
375
+ "epoch": 3.0,
376
+ "grad_norm": 0.0,
377
+ "kl": 0.005487966537475586,
378
+ "learning_rate": 9.003333333333333e-07,
379
+ "loss": 0.0,
380
+ "reward": 0.06013102883007378,
381
+ "reward_std": 3.2332184218830664e-08,
382
+ "rewards/DCR_reward": 0.06013102883007378,
383
+ "step": 300
384
+ },
385
+ {
386
+ "completion_length": 97.425,
387
+ "epoch": 3.1,
388
+ "grad_norm": 0.0,
389
+ "kl": 0.00860748291015625,
390
+ "learning_rate": 8.969999999999999e-07,
391
+ "loss": 0.0,
392
+ "reward": 0.041901327669620514,
393
+ "reward_std": 0.027376230992376804,
394
+ "rewards/DCR_reward": 0.041901327669620514,
395
+ "step": 310
396
+ },
397
+ {
398
+ "completion_length": 96.8125,
399
+ "epoch": 3.2,
400
+ "grad_norm": 0.0,
401
+ "kl": 0.01550760269165039,
402
+ "learning_rate": 8.936666666666667e-07,
403
+ "loss": 0.0,
404
+ "reward": 0.00040142778307199477,
405
+ "reward_std": 0.001058001583442092,
406
+ "rewards/DCR_reward": 0.00040142778307199477,
407
+ "step": 320
408
+ },
409
+ {
410
+ "completion_length": 90.7875,
411
+ "epoch": 3.3,
412
+ "grad_norm": 0.0,
413
+ "kl": 0.0132171630859375,
414
+ "learning_rate": 8.903333333333333e-07,
415
+ "loss": 0.0,
416
+ "reward": 0.05994144082069397,
417
+ "reward_std": 0.0,
418
+ "rewards/DCR_reward": 0.05994144082069397,
419
+ "step": 330
420
+ },
421
+ {
422
+ "completion_length": 68.4875,
423
+ "epoch": 3.4,
424
+ "grad_norm": 0.0,
425
+ "kl": 0.01154632568359375,
426
+ "learning_rate": 8.869999999999999e-07,
427
+ "loss": 0.0,
428
+ "reward": 0.005439520999789238,
429
+ "reward_std": 0.015385289490222932,
430
+ "rewards/DCR_reward": 0.005439520999789238,
431
+ "step": 340
432
+ },
433
+ {
434
+ "completion_length": 98.8875,
435
+ "epoch": 3.5,
436
+ "grad_norm": 0.0,
437
+ "kl": 0.00932455062866211,
438
+ "learning_rate": 8.836666666666667e-07,
439
+ "loss": 0.0,
440
+ "reward": 0.0,
441
+ "reward_std": 0.0,
442
+ "rewards/DCR_reward": 0.0,
443
+ "step": 350
444
+ },
445
+ {
446
+ "completion_length": 89.1875,
447
+ "epoch": 3.6,
448
+ "grad_norm": 0.0,
449
+ "kl": 0.009013175964355469,
450
+ "learning_rate": 8.803333333333333e-07,
451
+ "loss": -0.0,
452
+ "reward": 0.025011462631664472,
453
+ "reward_std": 0.04632342683034949,
454
+ "rewards/DCR_reward": 0.025011462631664472,
455
+ "step": 360
456
+ },
457
+ {
458
+ "completion_length": 72.875,
459
+ "epoch": 3.7,
460
+ "grad_norm": 0.0,
461
+ "kl": 0.01702561378479004,
462
+ "learning_rate": 8.769999999999999e-07,
463
+ "loss": 0.0,
464
+ "reward": 0.009234594414010644,
465
+ "reward_std": 0.0059265575400786474,
466
+ "rewards/DCR_reward": 0.009234594414010644,
467
+ "step": 370
468
+ },
469
+ {
470
+ "completion_length": 113.2875,
471
+ "epoch": 3.8,
472
+ "grad_norm": 0.0,
473
+ "kl": 0.02369537353515625,
474
+ "learning_rate": 8.736666666666667e-07,
475
+ "loss": 0.0,
476
+ "reward": 0.021757069602608682,
477
+ "reward_std": 0.045767249166965486,
478
+ "rewards/DCR_reward": 0.021757069602608682,
479
+ "step": 380
480
+ },
481
+ {
482
+ "completion_length": 93.05,
483
+ "epoch": 3.9,
484
+ "grad_norm": 0.0,
485
+ "kl": 0.017023229598999025,
486
+ "learning_rate": 8.703333333333333e-07,
487
+ "loss": 0.0,
488
+ "reward": 0.024103129375725986,
489
+ "reward_std": 0.0580611415207386,
490
+ "rewards/DCR_reward": 0.024103129375725986,
491
+ "step": 390
492
+ },
493
+ {
494
+ "completion_length": 92.2875,
495
+ "epoch": 4.0,
496
+ "grad_norm": 3.109375,
497
+ "kl": 0.0402587890625,
498
+ "learning_rate": 8.669999999999999e-07,
499
+ "loss": 0.0,
500
+ "reward": 0.06285573422210292,
501
+ "reward_std": 0.07154790845233946,
502
+ "rewards/DCR_reward": 0.06285573422210292,
503
+ "step": 400
504
+ },
505
+ {
506
+ "completion_length": 66.1,
507
+ "epoch": 4.1,
508
+ "grad_norm": 0.0,
509
+ "kl": 0.022429752349853515,
510
+ "learning_rate": 8.636666666666667e-07,
511
+ "loss": 0.0,
512
+ "reward": 0.0001454122830182314,
513
+ "reward_std": 0.0003706513671204448,
514
+ "rewards/DCR_reward": 0.0001454122830182314,
515
+ "step": 410
516
+ },
517
+ {
518
+ "completion_length": 137.125,
519
+ "epoch": 4.2,
520
+ "grad_norm": 0.0,
521
+ "kl": 0.024456501007080078,
522
+ "learning_rate": 8.603333333333332e-07,
523
+ "loss": 0.0,
524
+ "reward": 0.05007606785511598,
525
+ "reward_std": 0.06857064368668944,
526
+ "rewards/DCR_reward": 0.05007606785511598,
527
+ "step": 420
528
+ },
529
+ {
530
+ "completion_length": 92.9625,
531
+ "epoch": 4.3,
532
+ "grad_norm": 0.0,
533
+ "kl": 0.023699188232421876,
534
+ "learning_rate": 8.569999999999999e-07,
535
+ "loss": 0.0,
536
+ "reward": 0.0036682719714008273,
537
+ "reward_std": 0.0028008831664919852,
538
+ "rewards/DCR_reward": 0.0036682719714008273,
539
+ "step": 430
540
+ },
541
+ {
542
+ "completion_length": 95.85,
543
+ "epoch": 4.4,
544
+ "grad_norm": 0.0,
545
+ "kl": 0.03151016235351563,
546
+ "learning_rate": 8.536666666666667e-07,
547
+ "loss": -0.0,
548
+ "reward": 0.05042804731056094,
549
+ "reward_std": 0.05373804932460189,
550
+ "rewards/DCR_reward": 0.05042804731056094,
551
+ "step": 440
552
+ },
553
+ {
554
+ "completion_length": 118.325,
555
+ "epoch": 4.5,
556
+ "grad_norm": 7.3125,
557
+ "kl": 0.03622512817382813,
558
+ "learning_rate": 8.503333333333333e-07,
559
+ "loss": 0.0,
560
+ "reward": 0.011024426942458376,
561
+ "reward_std": 0.02974587368662469,
562
+ "rewards/DCR_reward": 0.011024426942458376,
563
+ "step": 450
564
+ },
565
+ {
566
+ "completion_length": 73.175,
567
+ "epoch": 4.6,
568
+ "grad_norm": 1.015625,
569
+ "kl": 0.03716583251953125,
570
+ "learning_rate": 8.469999999999999e-07,
571
+ "loss": 0.0,
572
+ "reward": 0.07448566257953644,
573
+ "reward_std": 0.023564168593065916,
574
+ "rewards/DCR_reward": 0.07448566257953644,
575
+ "step": 460
576
+ },
577
+ {
578
+ "completion_length": 81.0875,
579
+ "epoch": 4.7,
580
+ "grad_norm": 0.0,
581
+ "kl": 0.052947998046875,
582
+ "learning_rate": 8.436666666666667e-07,
583
+ "loss": 0.0,
584
+ "reward": 0.09821241516910958,
585
+ "reward_std": 0.0942218255950138,
586
+ "rewards/DCR_reward": 0.09821241516910958,
587
+ "step": 470
588
+ },
589
+ {
590
+ "completion_length": 99.0375,
591
+ "epoch": 4.8,
592
+ "grad_norm": 9.8125,
593
+ "kl": 0.03128204345703125,
594
+ "learning_rate": 8.403333333333333e-07,
595
+ "loss": -0.0,
596
+ "reward": 0.04295953951077536,
597
+ "reward_std": 0.01739810509607196,
598
+ "rewards/DCR_reward": 0.04295953951077536,
599
+ "step": 480
600
+ },
601
+ {
602
+ "completion_length": 98.1,
603
+ "epoch": 4.9,
604
+ "grad_norm": 28.5,
605
+ "kl": 0.051171875,
606
+ "learning_rate": 8.369999999999999e-07,
607
+ "loss": 0.0,
608
+ "reward": 0.08375828897114843,
609
+ "reward_std": 0.1000140183372423,
610
+ "rewards/DCR_reward": 0.08375828897114843,
611
+ "step": 490
612
+ },
613
+ {
614
+ "completion_length": 117.9375,
615
+ "epoch": 5.0,
616
+ "grad_norm": 0.0,
617
+ "kl": 0.039456844329833984,
618
+ "learning_rate": 8.336666666666667e-07,
619
+ "loss": -0.0,
620
+ "reward": 0.006845223042182625,
621
+ "reward_std": 0.017963543720543384,
622
+ "rewards/DCR_reward": 0.006845223042182625,
623
+ "step": 500
624
+ },
625
+ {
626
+ "epoch": 5.0,
627
+ "eval_completion_length": 102.48375,
628
+ "eval_kl": 0.06146286010742188,
629
+ "eval_loss": 1.9775862369897368e-07,
630
+ "eval_reward": 0.08125734120461857,
631
+ "eval_reward_std": 0.05342855209446043,
632
+ "eval_rewards/DCR_reward": 0.08125734120461857,
633
+ "eval_runtime": 1460.9247,
634
+ "eval_samples_per_second": 0.068,
635
+ "eval_steps_per_second": 0.009,
636
+ "step": 500
637
+ },
638
+ {
639
+ "completion_length": 105.3125,
640
+ "epoch": 5.1,
641
+ "grad_norm": 14.875,
642
+ "kl": 0.073431396484375,
643
+ "learning_rate": 8.303333333333333e-07,
644
+ "loss": -0.0,
645
+ "reward": 0.09785518775461241,
646
+ "reward_std": 0.08415243490599096,
647
+ "rewards/DCR_reward": 0.09785518775461241,
648
+ "step": 510
649
+ },
650
+ {
651
+ "completion_length": 92.6375,
652
+ "epoch": 5.2,
653
+ "grad_norm": 0.0,
654
+ "kl": 0.05964393615722656,
655
+ "learning_rate": 8.269999999999999e-07,
656
+ "loss": 0.0,
657
+ "reward": 0.0921564630290959,
658
+ "reward_std": 0.04181258587050252,
659
+ "rewards/DCR_reward": 0.0921564630290959,
660
+ "step": 520
661
+ },
662
+ {
663
+ "completion_length": 91.2875,
664
+ "epoch": 5.3,
665
+ "grad_norm": 19.5,
666
+ "kl": 0.08701934814453124,
667
+ "learning_rate": 8.236666666666666e-07,
668
+ "loss": 0.0,
669
+ "reward": 0.16571124200709164,
670
+ "reward_std": 0.04422773125115782,
671
+ "rewards/DCR_reward": 0.16571124200709164,
672
+ "step": 530
673
+ },
674
+ {
675
+ "completion_length": 118.0875,
676
+ "epoch": 5.4,
677
+ "grad_norm": 4.90625,
678
+ "kl": 0.09522705078125,
679
+ "learning_rate": 8.203333333333333e-07,
680
+ "loss": -0.0,
681
+ "reward": 0.12790518356487154,
682
+ "reward_std": 0.04112411521346075,
683
+ "rewards/DCR_reward": 0.12790518356487154,
684
+ "step": 540
685
+ },
686
+ {
687
+ "completion_length": 92.725,
688
+ "epoch": 5.5,
689
+ "grad_norm": 29.25,
690
+ "kl": 0.08489990234375,
691
+ "learning_rate": 8.169999999999999e-07,
692
+ "loss": -0.0,
693
+ "reward": 0.09454966578632593,
694
+ "reward_std": 0.04570461367693497,
695
+ "rewards/DCR_reward": 0.09454966578632593,
696
+ "step": 550
697
+ },
698
+ {
699
+ "completion_length": 83.2375,
700
+ "epoch": 5.6,
701
+ "grad_norm": 0.0,
702
+ "kl": 0.0607513427734375,
703
+ "learning_rate": 8.136666666666666e-07,
704
+ "loss": 0.0,
705
+ "reward": 0.07220683824270964,
706
+ "reward_std": 0.050523467175662515,
707
+ "rewards/DCR_reward": 0.07220683824270964,
708
+ "step": 560
709
+ },
710
+ {
711
+ "completion_length": 99.6875,
712
+ "epoch": 5.7,
713
+ "grad_norm": 0.0,
714
+ "kl": 0.068798828125,
715
+ "learning_rate": 8.103333333333333e-07,
716
+ "loss": -0.0,
717
+ "reward": 0.096117812365992,
718
+ "reward_std": 0.05572115568793379,
719
+ "rewards/DCR_reward": 0.096117812365992,
720
+ "step": 570
721
+ },
722
+ {
723
+ "completion_length": 61.5,
724
+ "epoch": 5.8,
725
+ "grad_norm": 0.0,
726
+ "kl": 0.07020721435546876,
727
+ "learning_rate": 8.070000000000001e-07,
728
+ "loss": -0.0,
729
+ "reward": 0.047906511649489406,
730
+ "reward_std": 4.001859270204022e-05,
731
+ "rewards/DCR_reward": 0.047906511649489406,
732
+ "step": 580
733
+ },
734
+ {
735
+ "completion_length": 99.3125,
736
+ "epoch": 5.9,
737
+ "grad_norm": 0.0,
738
+ "kl": 0.123101806640625,
739
+ "learning_rate": 8.036666666666666e-07,
740
+ "loss": -0.0,
741
+ "reward": 0.13380602395627647,
742
+ "reward_std": 0.10035560713149608,
743
+ "rewards/DCR_reward": 0.13380602395627647,
744
+ "step": 590
745
+ },
746
+ {
747
+ "completion_length": 103.025,
748
+ "epoch": 6.0,
749
+ "grad_norm": 0.0,
750
+ "kl": 0.081585693359375,
751
+ "learning_rate": 8.003333333333333e-07,
752
+ "loss": -0.0,
753
+ "reward": 0.061570275388658044,
754
+ "reward_std": 0.04306753019336611,
755
+ "rewards/DCR_reward": 0.061570275388658044,
756
+ "step": 600
757
+ },
758
+ {
759
+ "completion_length": 97.9125,
760
+ "epoch": 6.1,
761
+ "grad_norm": 0.0,
762
+ "kl": 0.11351318359375,
763
+ "learning_rate": 7.970000000000001e-07,
764
+ "loss": 0.0,
765
+ "reward": 0.09666887713829056,
766
+ "reward_std": 0.059037036258087025,
767
+ "rewards/DCR_reward": 0.09666887713829056,
768
+ "step": 610
769
+ },
770
+ {
771
+ "completion_length": 91.7,
772
+ "epoch": 6.2,
773
+ "grad_norm": 0.0,
774
+ "kl": 0.055133056640625,
775
+ "learning_rate": 7.936666666666666e-07,
776
+ "loss": 0.0,
777
+ "reward": 0.13362326713686343,
778
+ "reward_std": 0.07874106459098584,
779
+ "rewards/DCR_reward": 0.13362326713686343,
780
+ "step": 620
781
+ },
782
+ {
783
+ "completion_length": 107.4125,
784
+ "epoch": 6.3,
785
+ "grad_norm": 9.25,
786
+ "kl": 0.1246337890625,
787
+ "learning_rate": 7.903333333333333e-07,
788
+ "loss": 0.0,
789
+ "reward": 0.08847815722692758,
790
+ "reward_std": 0.06451865802846442,
791
+ "rewards/DCR_reward": 0.08847815722692758,
792
+ "step": 630
793
+ },
794
+ {
795
+ "completion_length": 77.0875,
796
+ "epoch": 6.4,
797
+ "grad_norm": 25.0,
798
+ "kl": 0.15296096801757814,
799
+ "learning_rate": 7.87e-07,
800
+ "loss": -0.0,
801
+ "reward": 0.1423336612060666,
802
+ "reward_std": 0.055696507578250024,
803
+ "rewards/DCR_reward": 0.1423336612060666,
804
+ "step": 640
805
+ },
806
+ {
807
+ "completion_length": 99.6375,
808
+ "epoch": 6.5,
809
+ "grad_norm": 0.0,
810
+ "kl": 0.08688135147094726,
811
+ "learning_rate": 7.836666666666666e-07,
812
+ "loss": -0.0,
813
+ "reward": 0.173234991542995,
814
+ "reward_std": 0.06990011496527586,
815
+ "rewards/DCR_reward": 0.173234991542995,
816
+ "step": 650
817
+ },
818
+ {
819
+ "completion_length": 95.975,
820
+ "epoch": 6.6,
821
+ "grad_norm": 12.9375,
822
+ "kl": 0.14229736328125,
823
+ "learning_rate": 7.803333333333333e-07,
824
+ "loss": 0.0,
825
+ "reward": 0.19796406209934503,
826
+ "reward_std": 0.08495658059261757,
827
+ "rewards/DCR_reward": 0.19796406209934503,
828
+ "step": 660
829
+ },
830
+ {
831
+ "completion_length": 51.9875,
832
+ "epoch": 6.7,
833
+ "grad_norm": 26.0,
834
+ "kl": 0.19248046875,
835
+ "learning_rate": 7.77e-07,
836
+ "loss": 0.0,
837
+ "reward": 0.12773155540926381,
838
+ "reward_std": 0.018970384920248762,
839
+ "rewards/DCR_reward": 0.12773155540926381,
840
+ "step": 670
841
+ },
842
+ {
843
+ "completion_length": 81.6875,
844
+ "epoch": 6.8,
845
+ "grad_norm": 10.0625,
846
+ "kl": 0.09774169921875,
847
+ "learning_rate": 7.736666666666666e-07,
848
+ "loss": -0.0,
849
+ "reward": 0.09495227632578462,
850
+ "reward_std": 0.05938452887904759,
851
+ "rewards/DCR_reward": 0.09495227632578462,
852
+ "step": 680
853
+ },
854
+ {
855
+ "completion_length": 100.675,
856
+ "epoch": 6.9,
857
+ "grad_norm": 19.75,
858
+ "kl": 0.1118896484375,
859
+ "learning_rate": 7.703333333333333e-07,
860
+ "loss": 0.0,
861
+ "reward": 0.09238526365661529,
862
+ "reward_std": 0.04735179884301033,
863
+ "rewards/DCR_reward": 0.09238526365661529,
864
+ "step": 690
865
+ },
866
+ {
867
+ "completion_length": 63.1625,
868
+ "epoch": 7.0,
869
+ "grad_norm": 17.625,
870
+ "kl": 0.0974609375,
871
+ "learning_rate": 7.67e-07,
872
+ "loss": 0.0,
873
+ "reward": 0.15320108719170095,
874
+ "reward_std": 0.04531068232899997,
875
+ "rewards/DCR_reward": 0.15320108719170095,
876
+ "step": 700
877
+ },
878
+ {
879
+ "completion_length": 93.4125,
880
+ "epoch": 7.1,
881
+ "grad_norm": 18.5,
882
+ "kl": 0.101190185546875,
883
+ "learning_rate": 7.636666666666667e-07,
884
+ "loss": 0.0,
885
+ "reward": 0.02544752674875781,
886
+ "reward_std": 0.05906593499239534,
887
+ "rewards/DCR_reward": 0.02544752674875781,
888
+ "step": 710
889
+ },
890
+ {
891
+ "completion_length": 98.0875,
892
+ "epoch": 7.2,
893
+ "grad_norm": 0.0,
894
+ "kl": 0.12484130859375,
895
+ "learning_rate": 7.603333333333332e-07,
896
+ "loss": -0.0,
897
+ "reward": 0.17555389162153007,
898
+ "reward_std": 0.06005739986721892,
899
+ "rewards/DCR_reward": 0.17555389162153007,
900
+ "step": 720
901
+ },
902
+ {
903
+ "completion_length": 85.3,
904
+ "epoch": 7.3,
905
+ "grad_norm": 0.03076171875,
906
+ "kl": 0.06852807998657226,
907
+ "learning_rate": 7.57e-07,
908
+ "loss": 0.0,
909
+ "reward": 0.09980509513407014,
910
+ "reward_std": 0.022045876948665465,
911
+ "rewards/DCR_reward": 0.09980509513407014,
912
+ "step": 730
913
+ },
914
+ {
915
+ "completion_length": 60.525,
916
+ "epoch": 7.4,
917
+ "grad_norm": 0.0,
918
+ "kl": 0.146484375,
919
+ "learning_rate": 7.536666666666667e-07,
920
+ "loss": 0.0,
921
+ "reward": 0.17459992747753858,
922
+ "reward_std": 0.04686348429240752,
923
+ "rewards/DCR_reward": 0.17459992747753858,
924
+ "step": 740
925
+ },
926
+ {
927
+ "completion_length": 146.6375,
928
+ "epoch": 7.5,
929
+ "grad_norm": 5.125,
930
+ "kl": 0.13475341796875,
931
+ "learning_rate": 7.503333333333332e-07,
932
+ "loss": -0.0,
933
+ "reward": 0.12549885590560733,
934
+ "reward_std": 0.003405572484291497,
935
+ "rewards/DCR_reward": 0.12549885590560733,
936
+ "step": 750
937
+ },
938
+ {
939
+ "epoch": 7.5,
940
+ "eval_completion_length": 85.11125,
941
+ "eval_kl": 0.133615665435791,
942
+ "eval_loss": -8.860533853294328e-07,
943
+ "eval_reward": 0.13431876484770328,
944
+ "eval_reward_std": 0.05950616509797925,
945
+ "eval_rewards/DCR_reward": 0.13431876484770328,
946
+ "eval_runtime": 2234.9394,
947
+ "eval_samples_per_second": 0.045,
948
+ "eval_steps_per_second": 0.006,
949
+ "step": 750
950
+ },
951
+ {
952
+ "completion_length": 79.35,
953
+ "epoch": 7.6,
954
+ "grad_norm": 17.875,
955
+ "kl": 0.166937255859375,
956
+ "learning_rate": 7.47e-07,
957
+ "loss": -0.0,
958
+ "reward": 0.04600536972284317,
959
+ "reward_std": 0.030051297834233992,
960
+ "rewards/DCR_reward": 0.04600536972284317,
961
+ "step": 760
962
+ },
963
+ {
964
+ "completion_length": 59.425,
965
+ "epoch": 7.7,
966
+ "grad_norm": 18.875,
967
+ "kl": 0.144036865234375,
968
+ "learning_rate": 7.436666666666667e-07,
969
+ "loss": -0.0,
970
+ "reward": 0.25486029861494897,
971
+ "reward_std": 0.06726857685171125,
972
+ "rewards/DCR_reward": 0.25486029861494897,
973
+ "step": 770
974
+ },
975
+ {
976
+ "completion_length": 59.25,
977
+ "epoch": 7.8,
978
+ "grad_norm": 13.875,
979
+ "kl": 0.1705810546875,
980
+ "learning_rate": 7.403333333333332e-07,
981
+ "loss": 0.0,
982
+ "reward": 0.18612688397988678,
983
+ "reward_std": 0.060024100821465254,
984
+ "rewards/DCR_reward": 0.18612688397988678,
985
+ "step": 780
986
+ },
987
+ {
988
+ "completion_length": 91.675,
989
+ "epoch": 7.9,
990
+ "grad_norm": 8.3125,
991
+ "kl": 0.1150146484375,
992
+ "learning_rate": 7.37e-07,
993
+ "loss": -0.0,
994
+ "reward": 0.22636549319140614,
995
+ "reward_std": 0.09487768108156161,
996
+ "rewards/DCR_reward": 0.22636549319140614,
997
+ "step": 790
998
+ },
999
+ {
1000
+ "completion_length": 89.375,
1001
+ "epoch": 8.0,
1002
+ "grad_norm": 13.25,
1003
+ "kl": 0.1668212890625,
1004
+ "learning_rate": 7.336666666666667e-07,
1005
+ "loss": -0.0,
1006
+ "reward": 0.11243776695337146,
1007
+ "reward_std": 0.0434065388621093,
1008
+ "rewards/DCR_reward": 0.11243776695337146,
1009
+ "step": 800
1010
+ },
1011
+ {
1012
+ "completion_length": 60.225,
1013
+ "epoch": 8.1,
1014
+ "grad_norm": 0.0,
1015
+ "kl": 0.144354248046875,
1016
+ "learning_rate": 7.303333333333332e-07,
1017
+ "loss": 0.0,
1018
+ "reward": 0.17428544777212665,
1019
+ "reward_std": 0.03378924725689103,
1020
+ "rewards/DCR_reward": 0.17428544777212665,
1021
+ "step": 810
1022
+ },
1023
+ {
1024
+ "completion_length": 67.4375,
1025
+ "epoch": 8.2,
1026
+ "grad_norm": 29.625,
1027
+ "kl": 0.2034912109375,
1028
+ "learning_rate": 7.27e-07,
1029
+ "loss": -0.0,
1030
+ "reward": 0.09552836455404759,
1031
+ "reward_std": 0.0525470721883039,
1032
+ "rewards/DCR_reward": 0.09552836455404759,
1033
+ "step": 820
1034
+ },
1035
+ {
1036
+ "completion_length": 78.75,
1037
+ "epoch": 8.3,
1038
+ "grad_norm": 0.0,
1039
+ "kl": 0.185302734375,
1040
+ "learning_rate": 7.236666666666666e-07,
1041
+ "loss": -0.0,
1042
+ "reward": 0.10288139216136187,
1043
+ "reward_std": 0.07548805264668772,
1044
+ "rewards/DCR_reward": 0.10288139216136187,
1045
+ "step": 830
1046
+ },
1047
+ {
1048
+ "completion_length": 104.2875,
1049
+ "epoch": 8.4,
1050
+ "grad_norm": 0.0,
1051
+ "kl": 0.1417724609375,
1052
+ "learning_rate": 7.203333333333333e-07,
1053
+ "loss": -0.0,
1054
+ "reward": 0.12569777632597834,
1055
+ "reward_std": 0.0003777303310926072,
1056
+ "rewards/DCR_reward": 0.12569777632597834,
1057
+ "step": 840
1058
+ },
1059
+ {
1060
+ "completion_length": 89.9625,
1061
+ "epoch": 8.5,
1062
+ "grad_norm": 0.0,
1063
+ "kl": 0.124444580078125,
1064
+ "learning_rate": 7.17e-07,
1065
+ "loss": 0.0,
1066
+ "reward": 0.08354408431332558,
1067
+ "reward_std": 0.016342163346146778,
1068
+ "rewards/DCR_reward": 0.08354408431332558,
1069
+ "step": 850
1070
+ },
1071
+ {
1072
+ "completion_length": 126.1625,
1073
+ "epoch": 8.6,
1074
+ "grad_norm": 5.75,
1075
+ "kl": 0.15697021484375,
1076
+ "learning_rate": 7.136666666666666e-07,
1077
+ "loss": -0.0,
1078
+ "reward": 0.18000225534196942,
1079
+ "reward_std": 0.11962818971369416,
1080
+ "rewards/DCR_reward": 0.18000225534196942,
1081
+ "step": 860
1082
+ },
1083
+ {
1084
+ "completion_length": 66.125,
1085
+ "epoch": 8.7,
1086
+ "grad_norm": 15.8125,
1087
+ "kl": 0.131787109375,
1088
+ "learning_rate": 7.103333333333333e-07,
1089
+ "loss": 0.0,
1090
+ "reward": 0.14025002100970596,
1091
+ "reward_std": 0.07934097726297296,
1092
+ "rewards/DCR_reward": 0.14025002100970596,
1093
+ "step": 870
1094
+ },
1095
+ {
1096
+ "completion_length": 85.55,
1097
+ "epoch": 8.8,
1098
+ "grad_norm": 30.875,
1099
+ "kl": 0.14227294921875,
1100
+ "learning_rate": 7.07e-07,
1101
+ "loss": 0.0,
1102
+ "reward": 0.32092891409993174,
1103
+ "reward_std": 0.12094202971202321,
1104
+ "rewards/DCR_reward": 0.32092891409993174,
1105
+ "step": 880
1106
+ },
1107
+ {
1108
+ "completion_length": 80.625,
1109
+ "epoch": 8.9,
1110
+ "grad_norm": 0.0,
1111
+ "kl": 0.11814746856689454,
1112
+ "learning_rate": 7.036666666666666e-07,
1113
+ "loss": 0.0,
1114
+ "reward": 0.09874425530433655,
1115
+ "reward_std": 0.020111887441453292,
1116
+ "rewards/DCR_reward": 0.09874425530433655,
1117
+ "step": 890
1118
+ },
1119
+ {
1120
+ "completion_length": 74.525,
1121
+ "epoch": 9.0,
1122
+ "grad_norm": 0.0,
1123
+ "kl": 0.107318115234375,
1124
+ "learning_rate": 7.003333333333333e-07,
1125
+ "loss": -0.0,
1126
+ "reward": 0.17968311300501227,
1127
+ "reward_std": 0.03413876986596733,
1128
+ "rewards/DCR_reward": 0.17968311300501227,
1129
+ "step": 900
1130
+ },
1131
+ {
1132
+ "completion_length": 57.0625,
1133
+ "epoch": 9.1,
1134
+ "grad_norm": 17.625,
1135
+ "kl": 0.158935546875,
1136
+ "learning_rate": 6.97e-07,
1137
+ "loss": 0.0,
1138
+ "reward": 0.1742305759107694,
1139
+ "reward_std": 0.07847042060523109,
1140
+ "rewards/DCR_reward": 0.1742305759107694,
1141
+ "step": 910
1142
+ },
1143
+ {
1144
+ "completion_length": 64.8625,
1145
+ "epoch": 9.2,
1146
+ "grad_norm": 0.0,
1147
+ "kl": 0.17060546875,
1148
+ "learning_rate": 6.936666666666666e-07,
1149
+ "loss": -0.0,
1150
+ "reward": 0.18635233133099974,
1151
+ "reward_std": 0.07983016533326008,
1152
+ "rewards/DCR_reward": 0.18635233133099974,
1153
+ "step": 920
1154
+ },
1155
+ {
1156
+ "completion_length": 77.625,
1157
+ "epoch": 9.3,
1158
+ "grad_norm": 23.25,
1159
+ "kl": 0.218511962890625,
1160
+ "learning_rate": 6.903333333333333e-07,
1161
+ "loss": 0.0,
1162
+ "reward": 0.19767590372357519,
1163
+ "reward_std": 0.043497299042064695,
1164
+ "rewards/DCR_reward": 0.19767590372357519,
1165
+ "step": 930
1166
+ },
1167
+ {
1168
+ "completion_length": 47.325,
1169
+ "epoch": 9.4,
1170
+ "grad_norm": 21.375,
1171
+ "kl": 0.189459228515625,
1172
+ "learning_rate": 6.87e-07,
1173
+ "loss": 0.0,
1174
+ "reward": 0.22449640462873505,
1175
+ "reward_std": 0.0027420094997694378,
1176
+ "rewards/DCR_reward": 0.22449640462873505,
1177
+ "step": 940
1178
+ },
1179
+ {
1180
+ "completion_length": 74.6,
1181
+ "epoch": 9.5,
1182
+ "grad_norm": 0.0,
1183
+ "kl": 0.14364013671875,
1184
+ "learning_rate": 6.836666666666666e-07,
1185
+ "loss": -0.0,
1186
+ "reward": 0.1431873946392443,
1187
+ "reward_std": 0.001265111715017042,
1188
+ "rewards/DCR_reward": 0.1431873946392443,
1189
+ "step": 950
1190
+ },
1191
+ {
1192
+ "completion_length": 95.7625,
1193
+ "epoch": 9.6,
1194
+ "grad_norm": 0.0,
1195
+ "kl": 0.1601806640625,
1196
+ "learning_rate": 6.803333333333333e-07,
1197
+ "loss": -0.0,
1198
+ "reward": 0.1265127849765122,
1199
+ "reward_std": 0.06392315039120149,
1200
+ "rewards/DCR_reward": 0.1265127849765122,
1201
+ "step": 960
1202
+ },
1203
+ {
1204
+ "completion_length": 79.325,
1205
+ "epoch": 9.7,
1206
+ "grad_norm": 0.019287109375,
1207
+ "kl": 0.1714111328125,
1208
+ "learning_rate": 6.77e-07,
1209
+ "loss": 0.0,
1210
+ "reward": 0.1620770814595744,
1211
+ "reward_std": 0.07664856179035269,
1212
+ "rewards/DCR_reward": 0.1620770814595744,
1213
+ "step": 970
1214
+ },
1215
+ {
1216
+ "completion_length": 113.475,
1217
+ "epoch": 9.8,
1218
+ "grad_norm": 40.5,
1219
+ "kl": 0.09014892578125,
1220
+ "learning_rate": 6.736666666666666e-07,
1221
+ "loss": -0.0,
1222
+ "reward": 0.12962240994675084,
1223
+ "reward_std": 0.08673616686003242,
1224
+ "rewards/DCR_reward": 0.12962240994675084,
1225
+ "step": 980
1226
+ },
1227
+ {
1228
+ "completion_length": 132.1875,
1229
+ "epoch": 9.9,
1230
+ "grad_norm": 3.171875,
1231
+ "kl": 0.1203857421875,
1232
+ "learning_rate": 6.703333333333333e-07,
1233
+ "loss": 0.0,
1234
+ "reward": 0.0844914206303656,
1235
+ "reward_std": 0.04573890994070098,
1236
+ "rewards/DCR_reward": 0.0844914206303656,
1237
+ "step": 990
1238
+ },
1239
+ {
1240
+ "completion_length": 91.7,
1241
+ "epoch": 10.0,
1242
+ "grad_norm": 0.0,
1243
+ "kl": 0.14872217178344727,
1244
+ "learning_rate": 6.67e-07,
1245
+ "loss": -0.0,
1246
+ "reward": 0.22451465255580844,
1247
+ "reward_std": 0.03193944031372666,
1248
+ "rewards/DCR_reward": 0.22451465255580844,
1249
+ "step": 1000
1250
+ },
1251
+ {
1252
+ "epoch": 10.0,
1253
+ "eval_completion_length": 88.34125,
1254
+ "eval_kl": 0.16021126747131348,
1255
+ "eval_loss": -2.1401815786248335e-07,
1256
+ "eval_reward": 0.170399680956034,
1257
+ "eval_reward_std": 0.052541274107022674,
1258
+ "eval_rewards/DCR_reward": 0.170399680956034,
1259
+ "eval_runtime": 2514.5153,
1260
+ "eval_samples_per_second": 0.04,
1261
+ "eval_steps_per_second": 0.005,
1262
+ "step": 1000
1263
+ },
1264
+ {
1265
+ "completion_length": 69.625,
1266
+ "epoch": 10.1,
1267
+ "grad_norm": 0.0,
1268
+ "kl": 0.16669921875,
1269
+ "learning_rate": 6.636666666666666e-07,
1270
+ "loss": -0.0,
1271
+ "reward": 0.18533597313798963,
1272
+ "reward_std": 0.06167281661574009,
1273
+ "rewards/DCR_reward": 0.18533597313798963,
1274
+ "step": 1010
1275
+ },
1276
+ {
1277
+ "completion_length": 60.6875,
1278
+ "epoch": 10.2,
1279
+ "grad_norm": 0.0,
1280
+ "kl": 0.1216064453125,
1281
+ "learning_rate": 6.603333333333333e-07,
1282
+ "loss": -0.0,
1283
+ "reward": 0.12192644038586878,
1284
+ "reward_std": 0.01787745998954051,
1285
+ "rewards/DCR_reward": 0.12192644038586878,
1286
+ "step": 1020
1287
+ },
1288
+ {
1289
+ "completion_length": 76.7625,
1290
+ "epoch": 10.3,
1291
+ "grad_norm": 7.5,
1292
+ "kl": 0.17220458984375,
1293
+ "learning_rate": 6.57e-07,
1294
+ "loss": 0.0,
1295
+ "reward": 0.11648766156286001,
1296
+ "reward_std": 0.0692232246074127,
1297
+ "rewards/DCR_reward": 0.11648766156286001,
1298
+ "step": 1030
1299
+ },
1300
+ {
1301
+ "completion_length": 88.8375,
1302
+ "epoch": 10.4,
1303
+ "grad_norm": 38.5,
1304
+ "kl": 0.130328369140625,
1305
+ "learning_rate": 6.536666666666666e-07,
1306
+ "loss": 0.0,
1307
+ "reward": 0.21984734574798495,
1308
+ "reward_std": 0.008847105817403644,
1309
+ "rewards/DCR_reward": 0.21984734574798495,
1310
+ "step": 1040
1311
+ },
1312
+ {
1313
+ "completion_length": 68.9125,
1314
+ "epoch": 10.5,
1315
+ "grad_norm": 0.0,
1316
+ "kl": 0.168310546875,
1317
+ "learning_rate": 6.503333333333332e-07,
1318
+ "loss": 0.0,
1319
+ "reward": 0.0662006882019341,
1320
+ "reward_std": 0.089736894213479,
1321
+ "rewards/DCR_reward": 0.0662006882019341,
1322
+ "step": 1050
1323
+ },
1324
+ {
1325
+ "completion_length": 92.8875,
1326
+ "epoch": 10.6,
1327
+ "grad_norm": 21.625,
1328
+ "kl": 0.1709716796875,
1329
+ "learning_rate": 6.47e-07,
1330
+ "loss": 0.0,
1331
+ "reward": 0.3175446430454031,
1332
+ "reward_std": 0.04590535781462677,
1333
+ "rewards/DCR_reward": 0.3175446430454031,
1334
+ "step": 1060
1335
+ },
1336
+ {
1337
+ "completion_length": 104.0875,
1338
+ "epoch": 10.7,
1339
+ "grad_norm": 0.0013275146484375,
1340
+ "kl": 0.1356201171875,
1341
+ "learning_rate": 6.436666666666667e-07,
1342
+ "loss": 0.0,
1343
+ "reward": 0.17272857704083436,
1344
+ "reward_std": 0.04932637963789972,
1345
+ "rewards/DCR_reward": 0.17272857704083436,
1346
+ "step": 1070
1347
+ },
1348
+ {
1349
+ "completion_length": 73.5875,
1350
+ "epoch": 10.8,
1351
+ "grad_norm": 0.0,
1352
+ "kl": 0.17826080322265625,
1353
+ "learning_rate": 6.403333333333332e-07,
1354
+ "loss": 0.0,
1355
+ "reward": 0.19967932105064393,
1356
+ "reward_std": 0.012677951477235183,
1357
+ "rewards/DCR_reward": 0.19967932105064393,
1358
+ "step": 1080
1359
+ },
1360
+ {
1361
+ "completion_length": 61.4375,
1362
+ "epoch": 10.9,
1363
+ "grad_norm": 14.9375,
1364
+ "kl": 0.2052734375,
1365
+ "learning_rate": 6.37e-07,
1366
+ "loss": 0.0,
1367
+ "reward": 0.21544204794627148,
1368
+ "reward_std": 0.016153086804820305,
1369
+ "rewards/DCR_reward": 0.21544204794627148,
1370
+ "step": 1090
1371
+ },
1372
+ {
1373
+ "completion_length": 89.9125,
1374
+ "epoch": 11.0,
1375
+ "grad_norm": 18.875,
1376
+ "kl": 0.18170166015625,
1377
+ "learning_rate": 6.336666666666667e-07,
1378
+ "loss": 0.0,
1379
+ "reward": 0.09034715148736723,
1380
+ "reward_std": 0.07943847334618112,
1381
+ "rewards/DCR_reward": 0.09034715148736723,
1382
+ "step": 1100
1383
+ },
1384
+ {
1385
+ "completion_length": 64.05,
1386
+ "epoch": 11.1,
1387
+ "grad_norm": 15.0625,
1388
+ "kl": 0.154913330078125,
1389
+ "learning_rate": 6.303333333333332e-07,
1390
+ "loss": 0.0,
1391
+ "reward": 0.24353665355592966,
1392
+ "reward_std": 0.08834277796122478,
1393
+ "rewards/DCR_reward": 0.24353665355592966,
1394
+ "step": 1110
1395
+ },
1396
+ {
1397
+ "completion_length": 61.9,
1398
+ "epoch": 11.2,
1399
+ "grad_norm": 0.1220703125,
1400
+ "kl": 0.15966796875,
1401
+ "learning_rate": 6.27e-07,
1402
+ "loss": -0.0,
1403
+ "reward": 0.08767885738052428,
1404
+ "reward_std": 0.05409979920323167,
1405
+ "rewards/DCR_reward": 0.08767885738052428,
1406
+ "step": 1120
1407
+ },
1408
+ {
1409
+ "completion_length": 54.05,
1410
+ "epoch": 11.3,
1411
+ "grad_norm": 0.0,
1412
+ "kl": 0.251708984375,
1413
+ "learning_rate": 6.236666666666667e-07,
1414
+ "loss": -0.0,
1415
+ "reward": 0.1877214941661805,
1416
+ "reward_std": 0.014956248462726762,
1417
+ "rewards/DCR_reward": 0.1877214941661805,
1418
+ "step": 1130
1419
+ },
1420
+ {
1421
+ "completion_length": 79.475,
1422
+ "epoch": 11.4,
1423
+ "grad_norm": 16.5,
1424
+ "kl": 0.163287353515625,
1425
+ "learning_rate": 6.203333333333333e-07,
1426
+ "loss": 0.0,
1427
+ "reward": 0.18192651600111276,
1428
+ "reward_std": 0.04434406632919945,
1429
+ "rewards/DCR_reward": 0.18192651600111276,
1430
+ "step": 1140
1431
+ },
1432
+ {
1433
+ "completion_length": 139.775,
1434
+ "epoch": 11.5,
1435
+ "grad_norm": 2.921875,
1436
+ "kl": 0.231005859375,
1437
+ "learning_rate": 6.17e-07,
1438
+ "loss": -0.0,
1439
+ "reward": 0.22696539172902702,
1440
+ "reward_std": 0.03848955475841649,
1441
+ "rewards/DCR_reward": 0.22696539172902702,
1442
+ "step": 1150
1443
+ },
1444
+ {
1445
+ "completion_length": 95.55,
1446
+ "epoch": 11.6,
1447
+ "grad_norm": 23.25,
1448
+ "kl": 0.170263671875,
1449
+ "learning_rate": 6.136666666666666e-07,
1450
+ "loss": -0.0,
1451
+ "reward": 0.24207095536403359,
1452
+ "reward_std": 0.07178211783611914,
1453
+ "rewards/DCR_reward": 0.24207095536403359,
1454
+ "step": 1160
1455
+ },
1456
+ {
1457
+ "completion_length": 89.1,
1458
+ "epoch": 11.7,
1459
+ "grad_norm": 15.5,
1460
+ "kl": 0.126556396484375,
1461
+ "learning_rate": 6.103333333333333e-07,
1462
+ "loss": 0.0,
1463
+ "reward": 0.11935207918286324,
1464
+ "reward_std": 0.07003376996144653,
1465
+ "rewards/DCR_reward": 0.11935207918286324,
1466
+ "step": 1170
1467
+ },
1468
+ {
1469
+ "completion_length": 63.4375,
1470
+ "epoch": 11.8,
1471
+ "grad_norm": 28.625,
1472
+ "kl": 0.18448925018310547,
1473
+ "learning_rate": 6.07e-07,
1474
+ "loss": -0.0,
1475
+ "reward": 0.11633564964868129,
1476
+ "reward_std": 0.026135455832263687,
1477
+ "rewards/DCR_reward": 0.11633564964868129,
1478
+ "step": 1180
1479
+ },
1480
+ {
1481
+ "completion_length": 89.6125,
1482
+ "epoch": 11.9,
1483
+ "grad_norm": 0.0299072265625,
1484
+ "kl": 0.12894287109375,
1485
+ "learning_rate": 6.036666666666666e-07,
1486
+ "loss": 0.0,
1487
+ "reward": 0.16700884115416556,
1488
+ "reward_std": 0.036743837507117405,
1489
+ "rewards/DCR_reward": 0.16700884115416556,
1490
+ "step": 1190
1491
+ },
1492
+ {
1493
+ "completion_length": 89.675,
1494
+ "epoch": 12.0,
1495
+ "grad_norm": 0.0009307861328125,
1496
+ "kl": 0.133929443359375,
1497
+ "learning_rate": 6.003333333333334e-07,
1498
+ "loss": -0.0,
1499
+ "reward": 0.2129200980300084,
1500
+ "reward_std": 0.017186377505953487,
1501
+ "rewards/DCR_reward": 0.2129200980300084,
1502
+ "step": 1200
1503
+ },
1504
+ {
1505
+ "completion_length": 65.125,
1506
+ "epoch": 12.1,
1507
+ "grad_norm": 0.0,
1508
+ "kl": 0.1867431640625,
1509
+ "learning_rate": 5.97e-07,
1510
+ "loss": 0.0,
1511
+ "reward": 0.2869118741014972,
1512
+ "reward_std": 0.05356231460464187,
1513
+ "rewards/DCR_reward": 0.2869118741014972,
1514
+ "step": 1210
1515
+ },
1516
+ {
1517
+ "completion_length": 61.1875,
1518
+ "epoch": 12.2,
1519
+ "grad_norm": 19.125,
1520
+ "kl": 0.209033203125,
1521
+ "learning_rate": 5.936666666666666e-07,
1522
+ "loss": -0.0,
1523
+ "reward": 0.2106538300169632,
1524
+ "reward_std": 0.01354630084197197,
1525
+ "rewards/DCR_reward": 0.2106538300169632,
1526
+ "step": 1220
1527
+ },
1528
+ {
1529
+ "completion_length": 122.4125,
1530
+ "epoch": 12.3,
1531
+ "grad_norm": 3.703125,
1532
+ "kl": 0.15078125,
1533
+ "learning_rate": 5.903333333333334e-07,
1534
+ "loss": 0.0,
1535
+ "reward": 0.19904457703232764,
1536
+ "reward_std": 0.015228879620144653,
1537
+ "rewards/DCR_reward": 0.19904457703232764,
1538
+ "step": 1230
1539
+ },
1540
+ {
1541
+ "completion_length": 60.0625,
1542
+ "epoch": 12.4,
1543
+ "grad_norm": 0.263671875,
1544
+ "kl": 0.17626953125,
1545
+ "learning_rate": 5.87e-07,
1546
+ "loss": -0.0,
1547
+ "reward": 0.06604900020174682,
1548
+ "reward_std": 0.015421837849709163,
1549
+ "rewards/DCR_reward": 0.06604900020174682,
1550
+ "step": 1240
1551
+ },
1552
+ {
1553
+ "completion_length": 77.025,
1554
+ "epoch": 12.5,
1555
+ "grad_norm": 0.0,
1556
+ "kl": 0.1548828125,
1557
+ "learning_rate": 5.836666666666666e-07,
1558
+ "loss": 0.0,
1559
+ "reward": 0.24018247241619975,
1560
+ "reward_std": 0.11031100240943488,
1561
+ "rewards/DCR_reward": 0.24018247241619975,
1562
+ "step": 1250
1563
+ },
1564
+ {
1565
+ "epoch": 12.5,
1566
+ "eval_completion_length": 79.80125,
1567
+ "eval_kl": 0.1788983154296875,
1568
+ "eval_loss": -2.466062483108544e-07,
1569
+ "eval_reward": 0.18051586833898908,
1570
+ "eval_reward_std": 0.04664378996535504,
1571
+ "eval_rewards/DCR_reward": 0.18051586833898908,
1572
+ "eval_runtime": 2637.5023,
1573
+ "eval_samples_per_second": 0.038,
1574
+ "eval_steps_per_second": 0.005,
1575
+ "step": 1250
1576
+ },
1577
+ {
1578
+ "completion_length": 109.5375,
1579
+ "epoch": 12.6,
1580
+ "grad_norm": 0.0,
1581
+ "kl": 0.1450439453125,
1582
+ "learning_rate": 5.803333333333334e-07,
1583
+ "loss": 0.0,
1584
+ "reward": 0.18088525887578727,
1585
+ "reward_std": 0.07833153888532252,
1586
+ "rewards/DCR_reward": 0.18088525887578727,
1587
+ "step": 1260
1588
+ },
1589
+ {
1590
+ "completion_length": 93.2375,
1591
+ "epoch": 12.7,
1592
+ "grad_norm": 24.625,
1593
+ "kl": 0.15348119735717775,
1594
+ "learning_rate": 5.769999999999999e-07,
1595
+ "loss": -0.0,
1596
+ "reward": 0.22176313707605005,
1597
+ "reward_std": 0.08856261784967501,
1598
+ "rewards/DCR_reward": 0.22176313707605005,
1599
+ "step": 1270
1600
+ },
1601
+ {
1602
+ "completion_length": 82.95,
1603
+ "epoch": 12.8,
1604
+ "grad_norm": 0.0,
1605
+ "kl": 0.184228515625,
1606
+ "learning_rate": 5.736666666666666e-07,
1607
+ "loss": 0.0,
1608
+ "reward": 0.1458639702643268,
1609
+ "reward_std": 0.0540214991623742,
1610
+ "rewards/DCR_reward": 0.1458639702643268,
1611
+ "step": 1280
1612
+ },
1613
+ {
1614
+ "completion_length": 108.1875,
1615
+ "epoch": 12.9,
1616
+ "grad_norm": 25.125,
1617
+ "kl": 0.19346923828125,
1618
+ "learning_rate": 5.703333333333334e-07,
1619
+ "loss": -0.0,
1620
+ "reward": 0.09487569569610059,
1621
+ "reward_std": 0.09456775116559583,
1622
+ "rewards/DCR_reward": 0.09487569569610059,
1623
+ "step": 1290
1624
+ },
1625
+ {
1626
+ "completion_length": 60.0375,
1627
+ "epoch": 13.0,
1628
+ "grad_norm": 12.1875,
1629
+ "kl": 0.14573974609375,
1630
+ "learning_rate": 5.669999999999999e-07,
1631
+ "loss": 0.0,
1632
+ "reward": 0.09252699612407014,
1633
+ "reward_std": 0.0609970541823742,
1634
+ "rewards/DCR_reward": 0.09252699612407014,
1635
+ "step": 1300
1636
+ },
1637
+ {
1638
+ "completion_length": 87.975,
1639
+ "epoch": 13.1,
1640
+ "grad_norm": 8.375,
1641
+ "kl": 0.24951171875,
1642
+ "learning_rate": 5.636666666666666e-07,
1643
+ "loss": -0.0,
1644
+ "reward": 0.18639756126794965,
1645
+ "reward_std": 0.059623961660099666,
1646
+ "rewards/DCR_reward": 0.18639756126794965,
1647
+ "step": 1310
1648
+ },
1649
+ {
1650
+ "completion_length": 46.525,
1651
+ "epoch": 13.2,
1652
+ "grad_norm": 0.0,
1653
+ "kl": 0.183843994140625,
1654
+ "learning_rate": 5.603333333333334e-07,
1655
+ "loss": 0.0,
1656
+ "reward": 0.27727809102507306,
1657
+ "reward_std": 0.012234579344567464,
1658
+ "rewards/DCR_reward": 0.27727809102507306,
1659
+ "step": 1320
1660
+ },
1661
+ {
1662
+ "completion_length": 81.5375,
1663
+ "epoch": 13.3,
1664
+ "grad_norm": 15.25,
1665
+ "kl": 0.1978515625,
1666
+ "learning_rate": 5.57e-07,
1667
+ "loss": -0.0,
1668
+ "reward": 0.34632972672116014,
1669
+ "reward_std": 0.02853981898369966,
1670
+ "rewards/DCR_reward": 0.34632972672116014,
1671
+ "step": 1330
1672
+ },
1673
+ {
1674
+ "completion_length": 116.1875,
1675
+ "epoch": 13.4,
1676
+ "grad_norm": 7.46875,
1677
+ "kl": 0.14244384765625,
1678
+ "learning_rate": 5.536666666666666e-07,
1679
+ "loss": -0.0,
1680
+ "reward": 0.1364185765822185,
1681
+ "reward_std": 0.03053054096526466,
1682
+ "rewards/DCR_reward": 0.1364185765822185,
1683
+ "step": 1340
1684
+ },
1685
+ {
1686
+ "completion_length": 96.25,
1687
+ "epoch": 13.5,
1688
+ "grad_norm": 16.875,
1689
+ "kl": 0.1332763671875,
1690
+ "learning_rate": 5.503333333333334e-07,
1691
+ "loss": -0.0,
1692
+ "reward": 0.174199710926041,
1693
+ "reward_std": 0.06987538231981034,
1694
+ "rewards/DCR_reward": 0.174199710926041,
1695
+ "step": 1350
1696
+ },
1697
+ {
1698
+ "completion_length": 73.625,
1699
+ "epoch": 13.6,
1700
+ "grad_norm": 0.01190185546875,
1701
+ "kl": 0.145501708984375,
1702
+ "learning_rate": 5.47e-07,
1703
+ "loss": 0.0,
1704
+ "reward": 0.1467902946518734,
1705
+ "reward_std": 0.021008528914126145,
1706
+ "rewards/DCR_reward": 0.1467902946518734,
1707
+ "step": 1360
1708
+ },
1709
+ {
1710
+ "completion_length": 102.9,
1711
+ "epoch": 13.7,
1712
+ "grad_norm": 0.0,
1713
+ "kl": 0.150146484375,
1714
+ "learning_rate": 5.436666666666666e-07,
1715
+ "loss": -0.0,
1716
+ "reward": 0.14278438028413803,
1717
+ "reward_std": 0.06553829507544151,
1718
+ "rewards/DCR_reward": 0.14278438028413803,
1719
+ "step": 1370
1720
+ },
1721
+ {
1722
+ "completion_length": 67.125,
1723
+ "epoch": 13.8,
1724
+ "grad_norm": 34.75,
1725
+ "kl": 0.18953857421875,
1726
+ "learning_rate": 5.403333333333333e-07,
1727
+ "loss": -0.0,
1728
+ "reward": 0.04754302315413952,
1729
+ "reward_std": 0.0033664418617263435,
1730
+ "rewards/DCR_reward": 0.04754302315413952,
1731
+ "step": 1380
1732
+ },
1733
+ {
1734
+ "completion_length": 80.1875,
1735
+ "epoch": 13.9,
1736
+ "grad_norm": 27.875,
1737
+ "kl": 0.1549560546875,
1738
+ "learning_rate": 5.37e-07,
1739
+ "loss": 0.0,
1740
+ "reward": 0.06470744522521273,
1741
+ "reward_std": 0.06195358677759941,
1742
+ "rewards/DCR_reward": 0.06470744522521273,
1743
+ "step": 1390
1744
+ },
1745
+ {
1746
+ "completion_length": 76.525,
1747
+ "epoch": 14.0,
1748
+ "grad_norm": 20.5,
1749
+ "kl": 0.18204002380371093,
1750
+ "learning_rate": 5.336666666666666e-07,
1751
+ "loss": -0.0,
1752
+ "reward": 0.25884356582537293,
1753
+ "reward_std": 0.07922300189136422,
1754
+ "rewards/DCR_reward": 0.25884356582537293,
1755
+ "step": 1400
1756
+ },
1757
+ {
1758
+ "completion_length": 61.475,
1759
+ "epoch": 14.1,
1760
+ "grad_norm": 0.0,
1761
+ "kl": 0.1508544921875,
1762
+ "learning_rate": 5.303333333333333e-07,
1763
+ "loss": 0.0,
1764
+ "reward": 0.20719661605544387,
1765
+ "reward_std": 0.0746672638963446,
1766
+ "rewards/DCR_reward": 0.20719661605544387,
1767
+ "step": 1410
1768
+ },
1769
+ {
1770
+ "completion_length": 79.1875,
1771
+ "epoch": 14.2,
1772
+ "grad_norm": 0.0,
1773
+ "kl": 0.1884521484375,
1774
+ "learning_rate": 5.27e-07,
1775
+ "loss": 0.0,
1776
+ "reward": 0.1095911561860703,
1777
+ "reward_std": 0.02852085893282492,
1778
+ "rewards/DCR_reward": 0.1095911561860703,
1779
+ "step": 1420
1780
+ },
1781
+ {
1782
+ "completion_length": 99.4875,
1783
+ "epoch": 14.3,
1784
+ "grad_norm": 8.5625,
1785
+ "kl": 0.183203125,
1786
+ "learning_rate": 5.236666666666666e-07,
1787
+ "loss": 0.0,
1788
+ "reward": 0.26417584040900693,
1789
+ "reward_std": 0.03152382288160993,
1790
+ "rewards/DCR_reward": 0.26417584040900693,
1791
+ "step": 1430
1792
+ },
1793
+ {
1794
+ "completion_length": 81.4375,
1795
+ "epoch": 14.4,
1796
+ "grad_norm": 11.8125,
1797
+ "kl": 0.147418212890625,
1798
+ "learning_rate": 5.203333333333333e-07,
1799
+ "loss": 0.0,
1800
+ "reward": 0.27334287738776764,
1801
+ "reward_std": 0.059623092689435,
1802
+ "rewards/DCR_reward": 0.27334287738776764,
1803
+ "step": 1440
1804
+ },
1805
+ {
1806
+ "completion_length": 64.7375,
1807
+ "epoch": 14.5,
1808
+ "grad_norm": 22.0,
1809
+ "kl": 0.223583984375,
1810
+ "learning_rate": 5.17e-07,
1811
+ "loss": -0.0,
1812
+ "reward": 0.10534097602358088,
1813
+ "reward_std": 0.01799371653714843,
1814
+ "rewards/DCR_reward": 0.10534097602358088,
1815
+ "step": 1450
1816
+ },
1817
+ {
1818
+ "completion_length": 69.275,
1819
+ "epoch": 14.6,
1820
+ "grad_norm": 0.0,
1821
+ "kl": 0.224200439453125,
1822
+ "learning_rate": 5.136666666666666e-07,
1823
+ "loss": -0.0,
1824
+ "reward": 0.2516032636165619,
1825
+ "reward_std": 0.07302998011000454,
1826
+ "rewards/DCR_reward": 0.2516032636165619,
1827
+ "step": 1460
1828
+ },
1829
+ {
1830
+ "completion_length": 99.375,
1831
+ "epoch": 14.7,
1832
+ "grad_norm": 19.5,
1833
+ "kl": 0.16717529296875,
1834
+ "learning_rate": 5.103333333333333e-07,
1835
+ "loss": -0.0,
1836
+ "reward": 0.10548254007007926,
1837
+ "reward_std": 0.02692217687581433,
1838
+ "rewards/DCR_reward": 0.10548254007007926,
1839
+ "step": 1470
1840
+ },
1841
+ {
1842
+ "completion_length": 79.325,
1843
+ "epoch": 14.8,
1844
+ "grad_norm": 26.375,
1845
+ "kl": 0.181884765625,
1846
+ "learning_rate": 5.07e-07,
1847
+ "loss": 0.0,
1848
+ "reward": 0.07880783905275165,
1849
+ "reward_std": 0.05647614029903707,
1850
+ "rewards/DCR_reward": 0.07880783905275165,
1851
+ "step": 1480
1852
+ },
1853
+ {
1854
+ "completion_length": 99.9125,
1855
+ "epoch": 14.9,
1856
+ "grad_norm": 18.25,
1857
+ "kl": 0.09996337890625,
1858
+ "learning_rate": 5.036666666666666e-07,
1859
+ "loss": -0.0,
1860
+ "reward": 0.15730613842606544,
1861
+ "reward_std": 0.0505793450953206,
1862
+ "rewards/DCR_reward": 0.15730613842606544,
1863
+ "step": 1490
1864
+ },
1865
+ {
1866
+ "completion_length": 74.3375,
1867
+ "epoch": 15.0,
1868
+ "grad_norm": 0.0,
1869
+ "kl": 0.167626953125,
1870
+ "learning_rate": 5.003333333333333e-07,
1871
+ "loss": -0.0,
1872
+ "reward": 0.20553053587209433,
1873
+ "reward_std": 0.06437594342569355,
1874
+ "rewards/DCR_reward": 0.20553053587209433,
1875
+ "step": 1500
1876
+ },
1877
+ {
1878
+ "epoch": 15.0,
1879
+ "eval_completion_length": 83.67125,
1880
+ "eval_kl": 0.21793760299682619,
1881
+ "eval_loss": -1.1107649697805755e-06,
1882
+ "eval_reward": 0.1789407900039805,
1883
+ "eval_reward_std": 0.04753444435028598,
1884
+ "eval_rewards/DCR_reward": 0.1789407900039805,
1885
+ "eval_runtime": 2622.8316,
1886
+ "eval_samples_per_second": 0.038,
1887
+ "eval_steps_per_second": 0.005,
1888
+ "step": 1500
1889
+ },
1890
+ {
1891
+ "completion_length": 77.8125,
1892
+ "epoch": 15.1,
1893
+ "grad_norm": 8.75,
1894
+ "kl": 0.12496776580810547,
1895
+ "learning_rate": 4.97e-07,
1896
+ "loss": 0.0,
1897
+ "reward": 0.2007469806820154,
1898
+ "reward_std": 0.11496121380478144,
1899
+ "rewards/DCR_reward": 0.2007469806820154,
1900
+ "step": 1510
1901
+ },
1902
+ {
1903
+ "completion_length": 112.4375,
1904
+ "epoch": 15.2,
1905
+ "grad_norm": 17.375,
1906
+ "kl": 0.1972900390625,
1907
+ "learning_rate": 4.936666666666666e-07,
1908
+ "loss": -0.0,
1909
+ "reward": 0.1523078629281372,
1910
+ "reward_std": 0.0487046109745279,
1911
+ "rewards/DCR_reward": 0.1523078629281372,
1912
+ "step": 1520
1913
+ },
1914
+ {
1915
+ "completion_length": 106.325,
1916
+ "epoch": 15.3,
1917
+ "grad_norm": 33.75,
1918
+ "kl": 0.1178955078125,
1919
+ "learning_rate": 4.903333333333333e-07,
1920
+ "loss": -0.0,
1921
+ "reward": 0.14267266684328206,
1922
+ "reward_std": 0.13161113108944847,
1923
+ "rewards/DCR_reward": 0.14267266684328206,
1924
+ "step": 1530
1925
+ },
1926
+ {
1927
+ "completion_length": 70.2875,
1928
+ "epoch": 15.4,
1929
+ "grad_norm": 14.8125,
1930
+ "kl": 0.2324951171875,
1931
+ "learning_rate": 4.87e-07,
1932
+ "loss": 0.0,
1933
+ "reward": 0.2017348323017359,
1934
+ "reward_std": 0.04265181252852699,
1935
+ "rewards/DCR_reward": 0.2017348323017359,
1936
+ "step": 1540
1937
+ },
1938
+ {
1939
+ "completion_length": 45.4625,
1940
+ "epoch": 15.5,
1941
+ "grad_norm": 0.0,
1942
+ "kl": 0.15888671875,
1943
+ "learning_rate": 4.836666666666666e-07,
1944
+ "loss": -0.0,
1945
+ "reward": 0.2259118565125391,
1946
+ "reward_std": 0.026894852996608164,
1947
+ "rewards/DCR_reward": 0.2259118565125391,
1948
+ "step": 1550
1949
+ },
1950
+ {
1951
+ "completion_length": 88.6625,
1952
+ "epoch": 15.6,
1953
+ "grad_norm": 23.5,
1954
+ "kl": 0.164312744140625,
1955
+ "learning_rate": 4.803333333333333e-07,
1956
+ "loss": 0.0,
1957
+ "reward": 0.15437748426338657,
1958
+ "reward_std": 0.032754498023996347,
1959
+ "rewards/DCR_reward": 0.15437748426338657,
1960
+ "step": 1560
1961
+ },
1962
+ {
1963
+ "completion_length": 78.4,
1964
+ "epoch": 15.7,
1965
+ "grad_norm": 0.0,
1966
+ "kl": 0.164697265625,
1967
+ "learning_rate": 4.769999999999999e-07,
1968
+ "loss": -0.0,
1969
+ "reward": 0.09430048232898117,
1970
+ "reward_std": 0.03383164100494014,
1971
+ "rewards/DCR_reward": 0.09430048232898117,
1972
+ "step": 1570
1973
+ },
1974
+ {
1975
+ "completion_length": 79.2875,
1976
+ "epoch": 15.8,
1977
+ "grad_norm": 0.0,
1978
+ "kl": 0.176513671875,
1979
+ "learning_rate": 4.7366666666666666e-07,
1980
+ "loss": -0.0,
1981
+ "reward": 0.23642000226536766,
1982
+ "reward_std": 0.04199942027457837,
1983
+ "rewards/DCR_reward": 0.23642000226536766,
1984
+ "step": 1580
1985
+ },
1986
+ {
1987
+ "completion_length": 86.6875,
1988
+ "epoch": 15.9,
1989
+ "grad_norm": 28.875,
1990
+ "kl": 0.234124755859375,
1991
+ "learning_rate": 4.703333333333333e-07,
1992
+ "loss": 0.0,
1993
+ "reward": 0.20127527262666262,
1994
+ "reward_std": 0.026311779970637873,
1995
+ "rewards/DCR_reward": 0.20127527262666262,
1996
+ "step": 1590
1997
+ },
1998
+ {
1999
+ "completion_length": 70.875,
2000
+ "epoch": 16.0,
2001
+ "grad_norm": 0.0,
2002
+ "kl": 0.175665283203125,
2003
+ "learning_rate": 4.67e-07,
2004
+ "loss": -0.0,
2005
+ "reward": 0.14691403629258276,
2006
+ "reward_std": 0.04396011229930537,
2007
+ "rewards/DCR_reward": 0.14691403629258276,
2008
+ "step": 1600
2009
+ },
2010
+ {
2011
+ "completion_length": 57.625,
2012
+ "epoch": 16.1,
2013
+ "grad_norm": 17.0,
2014
+ "kl": 0.20440673828125,
2015
+ "learning_rate": 4.6366666666666665e-07,
2016
+ "loss": -0.0,
2017
+ "reward": 0.13289597362745553,
2018
+ "reward_std": 0.0672353014729822,
2019
+ "rewards/DCR_reward": 0.13289597362745553,
2020
+ "step": 1610
2021
+ },
2022
+ {
2023
+ "completion_length": 99.6375,
2024
+ "epoch": 16.2,
2025
+ "grad_norm": 0.0,
2026
+ "kl": 0.117718505859375,
2027
+ "learning_rate": 4.603333333333333e-07,
2028
+ "loss": -0.0,
2029
+ "reward": 0.26933112973347306,
2030
+ "reward_std": 0.06138738352313169,
2031
+ "rewards/DCR_reward": 0.26933112973347306,
2032
+ "step": 1620
2033
+ },
2034
+ {
2035
+ "completion_length": 85.85,
2036
+ "epoch": 16.3,
2037
+ "grad_norm": 0.0,
2038
+ "kl": 0.206988525390625,
2039
+ "learning_rate": 4.57e-07,
2040
+ "loss": -0.0,
2041
+ "reward": 0.1808759123814525,
2042
+ "reward_std": 0.023974006343632937,
2043
+ "rewards/DCR_reward": 0.1808759123814525,
2044
+ "step": 1630
2045
+ },
2046
+ {
2047
+ "completion_length": 74.9625,
2048
+ "epoch": 16.4,
2049
+ "grad_norm": 19.125,
2050
+ "kl": 0.205908203125,
2051
+ "learning_rate": 4.5366666666666664e-07,
2052
+ "loss": -0.0,
2053
+ "reward": 0.2561442313250154,
2054
+ "reward_std": 0.05999695781356422,
2055
+ "rewards/DCR_reward": 0.2561442313250154,
2056
+ "step": 1640
2057
+ },
2058
+ {
2059
+ "completion_length": 51.425,
2060
+ "epoch": 16.5,
2061
+ "grad_norm": 0.0,
2062
+ "kl": 0.19951171875,
2063
+ "learning_rate": 4.503333333333333e-07,
2064
+ "loss": 0.0,
2065
+ "reward": 0.23342568413354456,
2066
+ "reward_std": 0.030879499143338762,
2067
+ "rewards/DCR_reward": 0.23342568413354456,
2068
+ "step": 1650
2069
+ },
2070
+ {
2071
+ "completion_length": 83.3625,
2072
+ "epoch": 16.6,
2073
+ "grad_norm": 0.0,
2074
+ "kl": 0.13411979675292968,
2075
+ "learning_rate": 4.4699999999999997e-07,
2076
+ "loss": 0.0,
2077
+ "reward": 0.10717827337794006,
2078
+ "reward_std": 0.055167136660065806,
2079
+ "rewards/DCR_reward": 0.10717827337794006,
2080
+ "step": 1660
2081
+ },
2082
+ {
2083
+ "completion_length": 80.0875,
2084
+ "epoch": 16.7,
2085
+ "grad_norm": 14.75,
2086
+ "kl": 0.2007080078125,
2087
+ "learning_rate": 4.4366666666666663e-07,
2088
+ "loss": 0.0,
2089
+ "reward": 0.21203166521154343,
2090
+ "reward_std": 0.05453803092241287,
2091
+ "rewards/DCR_reward": 0.21203166521154343,
2092
+ "step": 1670
2093
+ },
2094
+ {
2095
+ "completion_length": 106.975,
2096
+ "epoch": 16.8,
2097
+ "grad_norm": 16.625,
2098
+ "kl": 0.1846435546875,
2099
+ "learning_rate": 4.4033333333333335e-07,
2100
+ "loss": 0.0,
2101
+ "reward": 0.08938784400233998,
2102
+ "reward_std": 0.05009205757160089,
2103
+ "rewards/DCR_reward": 0.08938784400233998,
2104
+ "step": 1680
2105
+ },
2106
+ {
2107
+ "completion_length": 105.3875,
2108
+ "epoch": 16.9,
2109
+ "grad_norm": 0.0,
2110
+ "kl": 0.1232177734375,
2111
+ "learning_rate": 4.3699999999999996e-07,
2112
+ "loss": -0.0,
2113
+ "reward": 0.18415257817832753,
2114
+ "reward_std": 0.007105620090851516,
2115
+ "rewards/DCR_reward": 0.18415257817832753,
2116
+ "step": 1690
2117
+ },
2118
+ {
2119
+ "completion_length": 89.2375,
2120
+ "epoch": 17.0,
2121
+ "grad_norm": 17.625,
2122
+ "kl": 0.1628662109375,
2123
+ "learning_rate": 4.336666666666666e-07,
2124
+ "loss": 0.0,
2125
+ "reward": 0.10958491688361391,
2126
+ "reward_std": 0.041172702021503936,
2127
+ "rewards/DCR_reward": 0.10958491688361391,
2128
+ "step": 1700
2129
+ },
2130
+ {
2131
+ "completion_length": 90.1875,
2132
+ "epoch": 17.1,
2133
+ "grad_norm": 0.0,
2134
+ "kl": 0.15859375,
2135
+ "learning_rate": 4.3033333333333334e-07,
2136
+ "loss": 0.0,
2137
+ "reward": 0.09857021539355629,
2138
+ "reward_std": 0.02386656640956062,
2139
+ "rewards/DCR_reward": 0.09857021539355629,
2140
+ "step": 1710
2141
+ },
2142
+ {
2143
+ "completion_length": 63.25,
2144
+ "epoch": 17.2,
2145
+ "grad_norm": 0.0,
2146
+ "kl": 0.16219091415405273,
2147
+ "learning_rate": 4.2699999999999995e-07,
2148
+ "loss": -0.0,
2149
+ "reward": 0.054912311234511436,
2150
+ "reward_std": 0.021657621535587167,
2151
+ "rewards/DCR_reward": 0.054912311234511436,
2152
+ "step": 1720
2153
+ },
2154
+ {
2155
+ "completion_length": 75.025,
2156
+ "epoch": 17.3,
2157
+ "grad_norm": 23.875,
2158
+ "kl": 0.208575439453125,
2159
+ "learning_rate": 4.2366666666666666e-07,
2160
+ "loss": 0.0,
2161
+ "reward": 0.3105363720096648,
2162
+ "reward_std": 0.03136685772915371,
2163
+ "rewards/DCR_reward": 0.3105363720096648,
2164
+ "step": 1730
2165
+ },
2166
+ {
2167
+ "completion_length": 76.5875,
2168
+ "epoch": 17.4,
2169
+ "grad_norm": 11.875,
2170
+ "kl": 0.22567138671875,
2171
+ "learning_rate": 4.203333333333333e-07,
2172
+ "loss": 0.0,
2173
+ "reward": 0.1657306909793988,
2174
+ "reward_std": 0.06016184531727049,
2175
+ "rewards/DCR_reward": 0.1657306909793988,
2176
+ "step": 1740
2177
+ },
2178
+ {
2179
+ "completion_length": 76.5625,
2180
+ "epoch": 17.5,
2181
+ "grad_norm": 0.10302734375,
2182
+ "kl": 0.21397705078125,
2183
+ "learning_rate": 4.17e-07,
2184
+ "loss": 0.0,
2185
+ "reward": 0.3156662947498262,
2186
+ "reward_std": 0.017444778925437276,
2187
+ "rewards/DCR_reward": 0.3156662947498262,
2188
+ "step": 1750
2189
+ },
2190
+ {
2191
+ "epoch": 17.5,
2192
+ "eval_completion_length": 83.015,
2193
+ "eval_kl": 0.1820037841796875,
2194
+ "eval_loss": 1.8720327261689818e-06,
2195
+ "eval_reward": 0.1803470617614221,
2196
+ "eval_reward_std": 0.047306431781344714,
2197
+ "eval_rewards/DCR_reward": 0.1803470617614221,
2198
+ "eval_runtime": 2666.1674,
2199
+ "eval_samples_per_second": 0.038,
2200
+ "eval_steps_per_second": 0.005,
2201
+ "step": 1750
2202
+ },
2203
+ {
2204
+ "completion_length": 71.3125,
2205
+ "epoch": 17.6,
2206
+ "grad_norm": 18.25,
2207
+ "kl": 0.187255859375,
2208
+ "learning_rate": 4.1366666666666665e-07,
2209
+ "loss": 0.0,
2210
+ "reward": 0.20512793064117432,
2211
+ "reward_std": 0.047461129271408706,
2212
+ "rewards/DCR_reward": 0.20512793064117432,
2213
+ "step": 1760
2214
+ },
2215
+ {
2216
+ "completion_length": 59.45,
2217
+ "epoch": 17.7,
2218
+ "grad_norm": 17.0,
2219
+ "kl": 0.1356689453125,
2220
+ "learning_rate": 4.103333333333333e-07,
2221
+ "loss": 0.0,
2222
+ "reward": 0.14152073024306447,
2223
+ "reward_std": 0.09260614971240103,
2224
+ "rewards/DCR_reward": 0.14152073024306447,
2225
+ "step": 1770
2226
+ },
2227
+ {
2228
+ "completion_length": 66.0625,
2229
+ "epoch": 17.8,
2230
+ "grad_norm": 0.0,
2231
+ "kl": 0.1925048828125,
2232
+ "learning_rate": 4.07e-07,
2233
+ "loss": -0.0,
2234
+ "reward": 0.25652417142409834,
2235
+ "reward_std": 0.04226240784919355,
2236
+ "rewards/DCR_reward": 0.25652417142409834,
2237
+ "step": 1780
2238
+ },
2239
+ {
2240
+ "completion_length": 109.575,
2241
+ "epoch": 17.9,
2242
+ "grad_norm": 0.0,
2243
+ "kl": 0.165313720703125,
2244
+ "learning_rate": 4.0366666666666664e-07,
2245
+ "loss": 0.0,
2246
+ "reward": 0.21949415714479983,
2247
+ "reward_std": 0.08642580564569471,
2248
+ "rewards/DCR_reward": 0.21949415714479983,
2249
+ "step": 1790
2250
+ },
2251
+ {
2252
+ "completion_length": 64.35,
2253
+ "epoch": 18.0,
2254
+ "grad_norm": 20.875,
2255
+ "kl": 0.1558349609375,
2256
+ "learning_rate": 4.003333333333333e-07,
2257
+ "loss": -0.0,
2258
+ "reward": 0.16233385398518294,
2259
+ "reward_std": 0.04793143280548975,
2260
+ "rewards/DCR_reward": 0.16233385398518294,
2261
+ "step": 1800
2262
+ },
2263
+ {
2264
+ "completion_length": 47.7375,
2265
+ "epoch": 18.1,
2266
+ "grad_norm": 13.125,
2267
+ "kl": 0.280615234375,
2268
+ "learning_rate": 3.97e-07,
2269
+ "loss": -0.0,
2270
+ "reward": 0.1926513019599952,
2271
+ "reward_std": 0.022014413893526808,
2272
+ "rewards/DCR_reward": 0.1926513019599952,
2273
+ "step": 1810
2274
+ },
2275
+ {
2276
+ "completion_length": 86.2875,
2277
+ "epoch": 18.2,
2278
+ "grad_norm": 0.0,
2279
+ "kl": 0.19091796875,
2280
+ "learning_rate": 3.9366666666666663e-07,
2281
+ "loss": 0.0,
2282
+ "reward": 0.18852764302864672,
2283
+ "reward_std": 0.033248070168701814,
2284
+ "rewards/DCR_reward": 0.18852764302864672,
2285
+ "step": 1820
2286
+ },
2287
+ {
2288
+ "completion_length": 80.175,
2289
+ "epoch": 18.3,
2290
+ "grad_norm": 13.0,
2291
+ "kl": 0.12606201171875,
2292
+ "learning_rate": 3.903333333333333e-07,
2293
+ "loss": 0.0,
2294
+ "reward": 0.17205136871198193,
2295
+ "reward_std": 0.019140477599285076,
2296
+ "rewards/DCR_reward": 0.17205136871198193,
2297
+ "step": 1830
2298
+ },
2299
+ {
2300
+ "completion_length": 79.5125,
2301
+ "epoch": 18.4,
2302
+ "grad_norm": 0.96484375,
2303
+ "kl": 0.17064599990844725,
2304
+ "learning_rate": 3.87e-07,
2305
+ "loss": 0.0,
2306
+ "reward": 0.07664455490885302,
2307
+ "reward_std": 0.04274343762583612,
2308
+ "rewards/DCR_reward": 0.07664455490885302,
2309
+ "step": 1840
2310
+ },
2311
+ {
2312
+ "completion_length": 81.4375,
2313
+ "epoch": 18.5,
2314
+ "grad_norm": 0.0,
2315
+ "kl": 0.1732177734375,
2316
+ "learning_rate": 3.836666666666666e-07,
2317
+ "loss": 0.0,
2318
+ "reward": 0.23127210177481175,
2319
+ "reward_std": 0.04357012182008475,
2320
+ "rewards/DCR_reward": 0.23127210177481175,
2321
+ "step": 1850
2322
+ },
2323
+ {
2324
+ "completion_length": 75.625,
2325
+ "epoch": 18.6,
2326
+ "grad_norm": 3.03125,
2327
+ "kl": 0.159375,
2328
+ "learning_rate": 3.8033333333333334e-07,
2329
+ "loss": 0.0,
2330
+ "reward": 0.27262264720629903,
2331
+ "reward_std": 0.05521087486195313,
2332
+ "rewards/DCR_reward": 0.27262264720629903,
2333
+ "step": 1860
2334
+ },
2335
+ {
2336
+ "completion_length": 134.3125,
2337
+ "epoch": 18.7,
2338
+ "grad_norm": 0.36328125,
2339
+ "kl": 0.1446624755859375,
2340
+ "learning_rate": 3.77e-07,
2341
+ "loss": 0.0,
2342
+ "reward": 0.11476055827224627,
2343
+ "reward_std": 0.02939585350050038,
2344
+ "rewards/DCR_reward": 0.11476055827224627,
2345
+ "step": 1870
2346
+ },
2347
+ {
2348
+ "completion_length": 72.4125,
2349
+ "epoch": 18.8,
2350
+ "grad_norm": 23.5,
2351
+ "kl": 0.1951904296875,
2352
+ "learning_rate": 3.736666666666666e-07,
2353
+ "loss": -0.0,
2354
+ "reward": 0.14638857576064765,
2355
+ "reward_std": 0.04362176135448124,
2356
+ "rewards/DCR_reward": 0.14638857576064765,
2357
+ "step": 1880
2358
+ },
2359
+ {
2360
+ "completion_length": 70.75,
2361
+ "epoch": 18.9,
2362
+ "grad_norm": 0.0,
2363
+ "kl": 0.186572265625,
2364
+ "learning_rate": 3.7033333333333333e-07,
2365
+ "loss": 0.0,
2366
+ "reward": 0.1708667165134102,
2367
+ "reward_std": 0.07184823253192008,
2368
+ "rewards/DCR_reward": 0.1708667165134102,
2369
+ "step": 1890
2370
+ },
2371
+ {
2372
+ "completion_length": 52.675,
2373
+ "epoch": 19.0,
2374
+ "grad_norm": 19.625,
+ "kl": 0.183404541015625,
+ "learning_rate": 3.67e-07,
+ "loss": -0.0,
+ "reward": 0.2897725820541382,
+ "reward_std": 0.0576688679928111,
+ "rewards/DCR_reward": 0.2897725820541382,
+ "step": 1900
+ },
+ {
+ "completion_length": 81.7625,
+ "epoch": 19.1,
+ "grad_norm": 11.5,
+ "kl": 0.217041015625,
+ "learning_rate": 3.6366666666666665e-07,
+ "loss": 0.0,
+ "reward": 0.156359511311166,
+ "reward_std": 0.051926983702384175,
+ "rewards/DCR_reward": 0.156359511311166,
+ "step": 1910
+ },
+ {
+ "completion_length": 130.8125,
+ "epoch": 19.2,
+ "grad_norm": 9.8125,
+ "kl": 0.10965576171875,
+ "learning_rate": 3.603333333333333e-07,
+ "loss": 0.0,
+ "reward": 0.10657796601299196,
+ "reward_std": 0.0012149818532634527,
+ "rewards/DCR_reward": 0.10657796601299196,
+ "step": 1920
+ },
+ {
+ "completion_length": 63.2375,
+ "epoch": 19.3,
+ "grad_norm": 19.0,
+ "kl": 0.26494140625,
+ "learning_rate": 3.57e-07,
+ "loss": 0.0,
+ "reward": 0.25602573398500683,
+ "reward_std": 0.04811685611639405,
+ "rewards/DCR_reward": 0.25602573398500683,
+ "step": 1930
+ },
+ {
+ "completion_length": 71.2125,
+ "epoch": 19.4,
+ "grad_norm": 0.039794921875,
+ "kl": 0.2076171875,
+ "learning_rate": 3.5366666666666664e-07,
+ "loss": 0.0,
+ "reward": 0.2313489816733636,
+ "reward_std": 0.037926042393394255,
+ "rewards/DCR_reward": 0.2313489816733636,
+ "step": 1940
+ },
+ {
+ "completion_length": 80.075,
+ "epoch": 19.5,
+ "grad_norm": 4.40625,
+ "kl": 0.194873046875,
+ "learning_rate": 3.503333333333333e-07,
+ "loss": 0.0,
+ "reward": 0.2062260712031275,
+ "reward_std": 0.07047836606834608,
+ "rewards/DCR_reward": 0.2062260712031275,
+ "step": 1950
+ },
+ {
+ "completion_length": 47.3875,
+ "epoch": 19.6,
+ "grad_norm": 9.125,
+ "kl": 0.159039306640625,
+ "learning_rate": 3.4699999999999997e-07,
+ "loss": 0.0,
+ "reward": 0.3275539556518197,
+ "reward_std": 0.056122380661634,
+ "rewards/DCR_reward": 0.3275539556518197,
+ "step": 1960
+ },
+ {
+ "completion_length": 81.525,
+ "epoch": 19.7,
+ "grad_norm": 17.875,
+ "kl": 0.14603710174560547,
+ "learning_rate": 3.436666666666667e-07,
+ "loss": -0.0,
+ "reward": 0.08158219500910491,
+ "reward_std": 0.0475987725701998,
+ "rewards/DCR_reward": 0.08158219500910491,
+ "step": 1970
+ },
+ {
+ "completion_length": 90.35,
+ "epoch": 19.8,
+ "grad_norm": 0.0,
+ "kl": 0.16241455078125,
+ "learning_rate": 3.403333333333333e-07,
+ "loss": 0.0,
+ "reward": 0.14252315481426195,
+ "reward_std": 0.0510443922455579,
+ "rewards/DCR_reward": 0.14252315481426195,
+ "step": 1980
+ },
+ {
+ "completion_length": 77.0875,
+ "epoch": 19.9,
+ "grad_norm": 0.8359375,
+ "kl": 0.1235107421875,
+ "learning_rate": 3.37e-07,
+ "loss": -0.0,
+ "reward": 0.11888001729967072,
+ "reward_std": 0.041191360527292886,
+ "rewards/DCR_reward": 0.11888001729967072,
+ "step": 1990
+ },
+ {
+ "completion_length": 100.05,
+ "epoch": 20.0,
+ "grad_norm": 8.125,
+ "kl": 0.209521484375,
+ "learning_rate": 3.336666666666667e-07,
+ "loss": 0.0,
+ "reward": 0.16669638943858445,
+ "reward_std": 0.028858381987083702,
+ "rewards/DCR_reward": 0.16669638943858445,
+ "step": 2000
+ },
+ {
+ "epoch": 20.0,
+ "eval_completion_length": 84.2325,
+ "eval_kl": 0.17778059005737304,
+ "eval_loss": 6.65434640723106e-07,
+ "eval_reward": 0.18110041963984258,
+ "eval_reward_std": 0.0566114658890848,
+ "eval_rewards/DCR_reward": 0.18110041963984258,
+ "eval_runtime": 2663.7287,
+ "eval_samples_per_second": 0.038,
+ "eval_steps_per_second": 0.005,
+ "step": 2000
+ }
+ ],
+ "logging_steps": 10,
+ "max_steps": 3000,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 30,
+ "save_steps": 1000,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": false
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 8,
+ "trial_name": null,
+ "trial_params": null
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6b7551886b878bf15125a64102f57ce9d092201642bfc5655464456c12aebb8
+ size 5752
vocab.json ADDED
The diff for this file is too large to render. See raw diff