Qwen2.5 1.5B Continued Pretraining on OpenWebText (BF16)
Base model: Qwen/Qwen2.5-1.5B
Objective: This model was continued-pretrained on English OpenWebText to examine whether the model's "thought process" and behavior shift when a Chinese-trained base is overwritten with English-only data, i.e., to reduce Chinese linguistic bias and align the model more closely with English usage.
What’s in this repo?
- Architecture: Qwen2.5 1.5B (causal LM)
- Weights dtype: bfloat16 (BF16)
- Tokenizer: the exact tokenizer used during training (saved alongside the checkpoint)
Training data
- Dataset: OpenWebText (English)
- Total training tokens: 1,638,400,000
- Sequence length (block_size): 1024
- Effective batch: batch_size=4, grad_accumulation=8 ⇒ tokens/iter=32,768
- Portion of dataset used: ~18.31% of the tokenized corpus (approx.)
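As a quick check, the total token count above follows directly from these batch settings and the 50,000 optimizer steps documented under Training setup below; a minimal sketch in Python:

batch_size = 4
grad_accumulation = 8
block_size = 1024
total_steps = 50_000  # max_iters from the training setup below

tokens_per_iter = batch_size * grad_accumulation * block_size  # 32,768
total_tokens = tokens_per_iter * total_steps                   # 1,638,400,000
print(tokens_per_iter, total_tokens)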
Training setup
- Precision: BF16 (torch.bfloat16, AMP autocast)
- Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1, eps=1e-8)
- LR schedule: Cosine, warmup_steps=2,500, total_steps=50,000
- Learning rate: 1e-4
- Dropout: 0.05
- Grad clip: 1.0
- Gradient checkpointing: disabled (to match the original Qwen run)
- Hardware: 1× NVIDIA A100 40GB (single GPU)
These settings mirror the completed Qwen run: batch_size=4, grad_accumulation=8, block_size=1024; cosine schedule with max_iters=50k and warmup_iters=2.5k; AdamW with lr=1e-4, weight_decay=0.1, betas=(0.9,0.95), eps=1e-8; BF16/AMP.
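For reference, below is a minimal sketch of the learning-rate schedule described above (linear warmup into cosine decay). The floor min_lr = 0.0 is an assumption, as the card does not state a minimum learning rate, and this is an illustrative function rather than the exact training code:

import math

max_lr = 1e-4          # peak learning rate (from the settings above)
min_lr = 0.0           # ASSUMPTION: LR floor not stated in this card
warmup_iters = 2_500
max_iters = 50_000

def get_lr(it: int) -> float:
    # linear warmup for the first warmup_iters steps
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    # hold at the floor once past max_iters
    if it >= max_iters:
        return min_lr
    # cosine decay from max_lr down to min_lr in between
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)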
Intended use & limitations
- Intended for English text generation and research on distribution shift from Chinese‑centric pretraining to English‑only continued pretraining.
- Not a safety‑aligned instruction model; outputs can be inaccurate or unsafe in certain contexts. Please apply your own filtering/evaluation.
How to use
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt"

# Load the tokenizer saved with the checkpoint and the BF16 weights
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

# Generate a continuation of the prompt
prompt = "Write a short paragraph about large language models."
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
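Unless the checkpoint's generation_config specifies otherwise, generate decodes greedily; for more varied output you can pass sampling options such as do_sample=True, temperature, and top_p. Since this is a base (non-instruct) checkpoint, prompts are continued as plain text rather than interpreted as chat turns.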
Reproducibility
The exact training tokenizer is logged and shipped in this repo. To recompute token counts:
# TRAIN_TOKENS ≈ total_steps * gradient_accumulation * batch_size * block_size
# If you saved a checkpoint 'ckpt.pt' with args, you can parse it like this:
import os, torch

ckpt = torch.load("ckpt.pt", map_location="cpu")
args = ckpt.get("args", {})  # hyperparameters saved by the training script
# Fall back to this run's documented values if a field is missing
GAS = args.get("gradient_accumulation_steps", 8)
BATCH = args.get("batch_size", 4)
BLOCK = args.get("block_size", 1024)
iters = ckpt.get("iter_num", 0)  # last saved iteration
TRAIN_TOKENS = iters * GAS * BATCH * BLOCK
print("approx TRAIN_TOKENS:", TRAIN_TOKENS)
# To estimate dataset fraction:
# If you built memmap at data/<dataset>/<sanitized_model_name>/train.bin
import numpy as np
san = "{SANITIZED_MODEL_NAME}" # e.g., "Qwen2.5-1.5B"
data_dir = os.path.join("data", "{DATASET}", san)
train_bin = np.memmap(os.path.join(data_dir, "train.bin"), dtype=np.uint32, mode="r")
val_bin = np.memmap(os.path.join(data_dir, "val.bin"), dtype=np.uint32, mode="r")
TOTAL_DATASET_TOKENS = len(train_bin) + len(val_bin)
DATASET_FRACTION = TRAIN_TOKENS / max(1, TOTAL_DATASET_TOKENS)
print("dataset fraction ~", DATASET_FRACTION)
License
- Base model license: Apache‑2.0 (see base model card)
- This fine‑tuned checkpoint is released under Apache‑2.0. Please verify compliance for your use case.
Citation
If you use this model, please cite:
@misc{MoreTrain_Qwen2.5_1.5B_owt,
  title  = {MoreTrain_Qwen2.5_1.5B_owt},
  author = {Seunghan Kim},
  year   = {2025},
  url    = {https://huggingface.co/ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt}
}