---
library_name: transformers
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-1.5B
license: apache-2.0
language:
- en
tags:
- qwen
- qwen2.5
- causal-lm
- continued-pretraining
- openwebtext
datasets:
- Skylion007/openwebtext
model-index:
- name: Qwen2.5 1.5B Continued Pretraining on OpenWebText
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: OpenWebText
      type: openwebtext
    metrics:
    - type: perplexity
      name: Validation Perplexity
      value: null
---
# Qwen2.5 1.5B Continued Pretraining on OpenWebText (BF16)
**Base model**: `Qwen/Qwen2.5-1.5B`
**Objective**: This model was continued-pretrained on *English OpenWebText* to examine whether the "thought process" and behavior of a Chinese-centric base model shift when it is further trained on English-only data, i.e., to **reduce Chinese linguistic bias** and align the model more closely with English usage.
## What’s in this repo?
* **Architecture**: Qwen2.5 1.5B (causal LM)
* **Weights dtype**: bfloat16 (BF16)
* **Tokenizer**: the exact tokenizer used during training (saved alongside the checkpoint)
## Training data
* **Dataset**: OpenWebText (English)
* **Total training tokens**: **1,638,400,000**
* **Sequence length (block\_size)**: **1024**
* **Effective batch**: batch\_size=4, grad\_accumulation=8 ⇒ tokens/iter=32,768
* **Portion of dataset used**: ~**18.31%** of the tokenized corpus (see the token-count check below)
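As a quick consistency check, the token budget above follows directly from the settings listed in the next section:

```python
# Token-budget arithmetic from the run settings documented in this card.
batch_size = 4
grad_accumulation = 8
block_size = 1024
total_steps = 50_000

tokens_per_iter = batch_size * grad_accumulation * block_size   # 32,768
total_train_tokens = tokens_per_iter * total_steps               # 1,638,400,000
print(tokens_per_iter, total_train_tokens)
```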
## Training setup
* **Precision**: BF16 (`torch.bfloat16`, AMP autocast)
* **Optimizer**: AdamW (β1=0.9, β2=0.95, weight\_decay=0.1, eps=1e-8)
* **LR schedule**: Cosine, warmup\_steps=2,500, total\_steps=50,000
* **Learning rate**: 1e-4
* **Dropout**: 0.05
* **Grad clip**: 1.0
* **Gradient checkpointing**: disabled (matches the original Qwen run)
* **Hardware**: 1× NVIDIA A100 40GB (single GPU)
These settings mirror the completed Qwen run: batch\_size=4, grad\_accumulation=8, block\_size=1024; cosine schedule with max\_iters=50k and warmup\_iters=2.5k; AdamW with lr=1e-4, weight\_decay=0.1, betas=(0.9,0.95), eps=1e-8; BF16/AMP.
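For reference, here is a minimal sketch of how these hyperparameters fit together in PyTorch, using `torch.optim.AdamW` and the `get_cosine_schedule_with_warmup` helper from `transformers`. This is not the original training script, and the commented loop body is only an outline:

```python
# Sketch of the optimizer / LR-schedule configuration described above.
# `model` stands in for the Qwen2.5-1.5B causal LM being continued-pretrained.
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B", torch_dtype=torch.bfloat16
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_500,
    num_training_steps=50_000,
)

# Inside the training loop (BF16 autocast, grad clipping at 1.0), roughly:
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#     loss = model(**batch, labels=batch["input_ids"]).loss / grad_accumulation
# loss.backward()
# ... every `grad_accumulation` micro-steps:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad(set_to_none=True)
```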
## Intended use & limitations
* Intended for **English** text generation and research on distribution shift from Chinese‑centric pretraining to English‑only continued pretraining.
* Not a safety‑aligned instruction model; outputs can be inaccurate or unsafe in certain contexts. Please apply your own filtering/evaluation.
## How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
repo = "ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt"
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")
prompt = "Write a short paragraph about large language models."
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```
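Because this is a continued-pretrained base model (not instruction-tuned), completion-style prompts work best. To reduce repetition you can switch from greedy decoding to sampling; the snippet below continues from the example above, and the sampling parameters are illustrative rather than tuned for this checkpoint:

```python
# Continues from the snippet above (reuses `model`, `tok`, and `ids`).
out = model.generate(
    **ids,
    max_new_tokens=128,
    do_sample=True,        # sample instead of greedy decoding
    temperature=0.8,       # illustrative values, not tuned for this checkpoint
    top_p=0.95,
)
print(tok.decode(out[0], skip_special_tokens=True))
```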
## Reproducibility
The **exact tokenizer used during training** is saved in this repo and used here.
To recompute token counts:
```python
# TRAIN_TOKENS ≈ total_steps * gradient_accumulation * batch_size * block_size
# If you saved a checkpoint 'ckpt.pt' with args, you can parse it like this:
import os, torch
ckpt = torch.load("ckpt.pt", map_location="cpu", weights_only=False)  # args dict is pickled
args = ckpt.get("args", {})  # saved by the training script
# Fall back to the documented run settings if a field is missing.
GAS = args.get("gradient_accumulation_steps", 8)
BATCH = args.get("batch_size", 4)
BLOCK = args.get("block_size", 1024)
iters = ckpt.get("iter_num")  # last saved iteration
TRAIN_TOKENS = (iters or 0) * GAS * BATCH * BLOCK
print("approx TRAIN_TOKENS:", TRAIN_TOKENS)
# To estimate dataset fraction:
# If you built memmap at data/<dataset>/<sanitized_model_name>/train.bin
import numpy as np
san = "{SANITIZED_MODEL_NAME}" # e.g., "Qwen2.5-1.5B"
data_dir = os.path.join("data", "{DATASET}", san)  # {DATASET} e.g. "openwebtext"
train_bin = np.memmap(os.path.join(data_dir, "train.bin"), dtype=np.uint32, mode="r")
val_bin = np.memmap(os.path.join(data_dir, "val.bin"), dtype=np.uint32, mode="r")
TOTAL_DATASET_TOKENS = len(train_bin) + len(val_bin)
DATASET_FRACTION = TRAIN_TOKENS / max(1, TOTAL_DATASET_TOKENS)
print("dataset fraction ~", DATASET_FRACTION)
```
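The model-index lists validation perplexity without a recorded value. If you want to estimate it yourself, the following is a rough sketch over any held-out English text; the file path is a placeholder, and this does not reproduce the exact evaluation protocol of the original run:

```python
# Rough validation-perplexity estimate over a held-out text file.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt"
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

block_size = 1024                       # matches the training block_size
text = open("val.txt").read()           # placeholder: any held-out English text
ids = tok(text, return_tensors="pt").input_ids[0]

nll, n_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, len(ids) - 1, block_size):
        chunk = ids[start : start + block_size].unsqueeze(0).to(model.device)
        if chunk.size(1) < 2:
            continue  # too short to form a (context, target) pair
        # labels == input_ids -> model returns mean cross-entropy over the chunk
        loss = model(chunk, labels=chunk).loss
        nll += loss.item() * chunk.numel()
        n_tokens += chunk.numel()

print("perplexity ~", math.exp(nll / n_tokens))
```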
## License
* Base model license: Apache‑2.0 (see base model card)
* This fine‑tuned checkpoint is released under **Apache‑2.0**. Please verify compliance for your use case.
## Citation
If you use this model, please cite:
```
@misc{MoreTrain_Qwen2.5_1.5B_owt,
  title  = {MoreTrain_Qwen2.5_1.5B_owt},
  author = {Seunghan Kim},
  year   = {2025},
  url    = {https://huggingface.co/ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt}
}
```