Qwen2.5 1.5B Continued Pretraining on OpenWebText (BF16)
Base model: Qwen/Qwen2.5-1.5B
Objective: This model was continued-pretrained on English OpenWebText to examine whether the model's "thought process" and behavior shift when a Chinese-trained base is overwritten with English-only data, i.e., to reduce Chinese linguistic bias and align the model more closely with English usage.
What’s in this repo?
- Architecture: Qwen2.5 1.5B (causal LM)
- Weights dtype: bfloat16 (BF16)
- Tokenizer: the exact tokenizer used during training (saved alongside the checkpoint)
Training data
- Dataset: OpenWebText (English)
- Total training tokens: 1,638,400,000
- Sequence length (block_size): 1024
- Effective batch: batch_size=4, grad_accumulation=8 ⇒ tokens/iter=32,768
- Portion of dataset used: ~18.31% of the tokenized corpus (approx.)
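As a quick check, the total token count above follows directly from these batch settings and the 50,000 optimizer steps documented under Training setup below; a minimal sketch in Python:

batch_size = 4
grad_accumulation = 8
block_size = 1024
total_steps = 50_000  # max_iters from the training setup below

tokens_per_iter = batch_size * grad_accumulation * block_size  # 32,768
total_tokens = tokens_per_iter * total_steps                   # 1,638,400,000
print(tokens_per_iter, total_tokens)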
Training setup
- Precision: BF16 (torch.bfloat16, AMP autocast)
- Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1, eps=1e-8)
- LR schedule: Cosine, warmup_steps=2,500, total_steps=50,000
- Learning rate: 1e-4
- Dropout: 0.05
- Grad clip: 1.0
- Gradient checkpointing: disabled (to match the original Qwen run)
- Hardware: 1× NVIDIA A100 40GB (single GPU)
These settings mirror the completed Qwen run: batch_size=4, grad_accumulation=8, block_size=1024; cosine schedule with max_iters=50k and warmup_iters=2.5k; AdamW with lr=1e-4, weight_decay=0.1, betas=(0.9,0.95), eps=1e-8; BF16/AMP.
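For reference, below is a minimal sketch of the learning-rate schedule described above (linear warmup into cosine decay). The floor min_lr = 0.0 is an assumption, as the card does not state a minimum learning rate, and this is an illustrative function rather than the exact training code:

import math

max_lr = 1e-4          # peak learning rate (from the settings above)
min_lr = 0.0           # ASSUMPTION: LR floor not stated in this card
warmup_iters = 2_500
max_iters = 50_000

def get_lr(it: int) -> float:
    # linear warmup for the first warmup_iters steps
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    # hold at the floor once past max_iters
    if it >= max_iters:
        return min_lr
    # cosine decay from max_lr down to min_lr in between
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)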
Intended use & limitations
- Intended for English text generation and research on distribution shift from Chinese‑centric pretraining to English‑only continued pretraining.
- Not a safety‑aligned instruction model; outputs can be inaccurate or unsafe in certain contexts. Please apply your own filtering/evaluation.
How to use
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt"

# Load the tokenizer saved with the checkpoint and the BF16 weights
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

# Generate a continuation of the prompt
prompt = "Write a short paragraph about large language models."
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
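Unless the checkpoint's generation_config specifies otherwise, generate decodes greedily; for more varied output you can pass sampling options such as do_sample=True, temperature, and top_p. Since this is a base (non-instruct) checkpoint, prompts are continued as plain text rather than interpreted as chat turns.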
Reproducibility
The exact training tokenizer is logged and shipped in this repo. To recompute token counts:
# TRAIN_TOKENS ≈ total_steps * gradient_accumulation * batch_size * block_size
# If you saved a checkpoint 'ckpt.pt' with args, you can parse it like this:
import os, torch

ckpt = torch.load("ckpt.pt", map_location="cpu")
args = ckpt.get("args", {})  # hyperparameters saved by the training script
# Fall back to this run's documented values if a field is missing
GAS = args.get("gradient_accumulation_steps", 8)
BATCH = args.get("batch_size", 4)
BLOCK = args.get("block_size", 1024)
iters = ckpt.get("iter_num", 0)  # last saved iteration
TRAIN_TOKENS = iters * GAS * BATCH * BLOCK
print("approx TRAIN_TOKENS:", TRAIN_TOKENS)
# To estimate dataset fraction:
# If you built memmap at data/<dataset>/<sanitized_model_name>/train.bin
import numpy as np
san = "{SANITIZED_MODEL_NAME}" # e.g., "Qwen2.5-1.5B"
data_dir = os.path.join("data", "{DATASET}", san)
train_bin = np.memmap(os.path.join(data_dir, "train.bin"), dtype=np.uint32, mode="r")
val_bin = np.memmap(os.path.join(data_dir, "val.bin"), dtype=np.uint32, mode="r")
TOTAL_DATASET_TOKENS = len(train_bin) + len(val_bin)
DATASET_FRACTION = TRAIN_TOKENS / max(1, TOTAL_DATASET_TOKENS)
print("dataset fraction ~", DATASET_FRACTION)
License
- Base model license: Apache‑2.0 (see base model card)
- This fine‑tuned checkpoint is released under Apache‑2.0. Please verify compliance for your use case.
Citation
If you use this model, please cite:
@misc{MoreTrain_Qwen2.5_1.5B_owt,
  title  = {MoreTrain_Qwen2.5_1.5B_owt},
  author = {Seunghan Kim},
  year   = {2025},
  url    = {https://huggingface.co/ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt}
}