---
library_name: transformers
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-1.5B
license: apache-2.0
language:
- en
tags:
- qwen
- qwen2.5
- causal-lm
- continued-pretraining
- openwebtext
datasets:
- Skylion007/openwebtext
model-index:
- name: Qwen2.5 1.5B Continued Pretraining on OpenWebText
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: OpenWebText
      type: openwebtext
    metrics:
    - type: perplexity
      name: Validation Perplexity
      value:
        VAL_PPL: null
---

# Qwen2.5 1.5B Continued Pretraining on OpenWebText (BF16)

**Base model**: `Qwen/Qwen2.5-1.5B`
**Objective**: This model was continued-pretrained on *English OpenWebText* to examine whether the model's "thought process" and behavior shift when a Chinese-trained base is further trained on English-only data, i.e., to **reduce Chinese linguistic bias** and align the model more closely with English usage.

## What’s in this repo?

* **Architecture**: Qwen2.5 1.5B (causal LM)
* **Weights dtype**: bfloat16 (BF16)
* **Tokenizer**: the exact tokenizer used during training (saved alongside the checkpoint)

## Training data

* **Dataset**: OpenWebText (English)
* **Total training tokens**: **1,638,400,000**
* **Sequence length (block\_size)**: **1024**
* **Effective batch**: batch\_size=4, grad\_accumulation=8 ⇒ tokens/iter=32,768 (see the sketch below)
* **Portion of dataset used**: approximately **18.31%** of the tokenized corpus
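
As a quick sanity check, the token budget follows directly from the numbers above; a minimal sketch of the arithmetic (using only values stated in this card):

```python
# Token-budget arithmetic using the values reported in this card.
batch_size = 4
grad_accumulation = 8
block_size = 1024          # sequence length
total_steps = 50_000       # from the training setup below

tokens_per_iter = batch_size * grad_accumulation * block_size
total_train_tokens = tokens_per_iter * total_steps

print(f"{tokens_per_iter:,}")      # 32,768
print(f"{total_train_tokens:,}")   # 1,638,400,000
```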

## Training setup

* **Precision**: BF16 (`torch.bfloat16`, AMP autocast)
* **Optimizer**: AdamW (β1=0.9, β2=0.95, weight\_decay=0.1, eps=1e-8)
* **LR schedule**: Cosine, warmup\_steps=2,500, total\_steps=50,000
* **Learning rate**: 1e-4
* **Dropout**: 0.05
* **Grad clip**: 1.0
* **Gradient checkpointing**: disabled (match original Qwen run)
* **Hardware**: 1× NVIDIA A100 40GB (single GPU)

These settings mirror the completed Qwen run: batch\_size=4, grad\_accumulation=8, block\_size=1024; cosine schedule with max\_iters=50k and warmup\_iters=2.5k; AdamW with lr=1e-4, weight\_decay=0.1, betas=(0.9,0.95), eps=1e-8; BF16/AMP.
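
For reference, here is a minimal sketch of how these hyperparameters map onto a standard PyTorch/Transformers setup. This is an illustration of the stated settings, not the original training script:

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

# Illustrative setup mirroring the hyperparameters listed above.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B", torch_dtype=torch.bfloat16
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_500,
    num_training_steps=50_000,
)

# Inside the training loop: BF16 autocast and gradient clipping at 1.0.
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#     loss = model(**batch, labels=batch["input_ids"]).loss
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```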

## Intended use & limitations

* Intended for **English** text generation and research on distribution shift from Chinese‑centric pretraining to English‑only continued pretraining.
* Not a safety‑aligned instruction model; outputs can be inaccurate or unsafe in certain contexts. Please apply your own filtering/evaluation.

## How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt"
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Write a short paragraph about large language models."
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```

## Reproducibility

The **exact tokenizer used during training** is saved and used in this repo.
To recompute the token counts and dataset fraction:

```python
# TRAIN_TOKENS ≈ total_steps * gradient_accumulation * batch_size * block_size
# If you saved a checkpoint 'ckpt.pt' with args, you can parse it like this:

import os

import numpy as np
import torch

ckpt = torch.load("ckpt.pt", map_location="cpu")
args = ckpt.get("args", {})  # saved by the training script

# Fall back to the documented run settings if the checkpoint lacks them.
GAS = args.get("gradient_accumulation_steps", 8)
BATCH = args.get("batch_size", 4)
BLOCK = args.get("block_size", 1024)
iters = ckpt.get("iter_num") or 0  # last saved iteration

TRAIN_TOKENS = iters * GAS * BATCH * BLOCK
print("approx TRAIN_TOKENS:", TRAIN_TOKENS)

# To estimate the dataset fraction, point to the memmap built at
# data/<dataset>/<sanitized_model_name>/train.bin
san = "{SANITIZED_MODEL_NAME}"  # e.g., "Qwen2.5-1.5B"
data_dir = os.path.join("data", "{DATASET}", san)
train_bin = np.memmap(os.path.join(data_dir, "train.bin"), dtype=np.uint32, mode="r")
val_bin = np.memmap(os.path.join(data_dir, "val.bin"), dtype=np.uint32, mode="r")

TOTAL_DATASET_TOKENS = len(train_bin) + len(val_bin)
DATASET_FRACTION = TRAIN_TOKENS / max(1, TOTAL_DATASET_TOKENS)
print("dataset fraction ~", DATASET_FRACTION)
```

## License

* Base model license: Apache‑2.0 (see base model card)
* This fine‑tuned checkpoint is released under **Apache‑2.0**. Please verify compliance for your use case.

## Citation

If you use this model, please cite:

```
@misc{MoreTrain_Qwen2.5_1.5B_owt,
  title  = {MoreTrain_Qwen2.5_1.5B_owt},
  author = {Seunghan Kim},
  year   = {2025},
  url    = {https://huggingface.co/ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt}
}
```