---
library_name: transformers
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-1.5B
license: apache-2.0
language:
- en
tags:
- qwen
- qwen2.5
- causal-lm
- continued-pretraining
- openwebtext
datasets:
- Skylion007/openwebtext
model-index:
- name: Qwen2.5 1.5B Continued Pretraining on OpenWebText
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: OpenWebText
      type: openwebtext
    metrics:
    - type: perplexity
      name: Validation Perplexity
      value: null
---

# Qwen2.5 1.5B Continued Pretraining on OpenWebText (BF16)

**Base model**: `Qwen/Qwen2.5-1.5B`

**Objective**: This model was continued-pretrained on *English OpenWebText* to examine whether its “thought process” and behavior shift when a Chinese-trained base is overwritten with English-only data, i.e., to **reduce Chinese linguistic bias** and align the model more toward English usage.

## What’s in this repo?

* **Architecture**: Qwen2.5 1.5B (causal LM)
* **Weights dtype**: bfloat16 (BF16)
* **Tokenizer**: the exact tokenizer used during training (saved alongside the checkpoint)

## Training data

* **Dataset**: OpenWebText (English)
* **Total training tokens**: **1,638,400,000**
* **Sequence length (block_size)**: **1024**
* **Effective batch**: batch_size=4, grad_accumulation=8 ⇒ tokens/iter = 4 × 8 × 1024 = 32,768
* **Portion of dataset used**: approximately **18.31%** of the tokenized corpus

## Training setup

* **Precision**: BF16 (`torch.bfloat16`, AMP autocast)
* **Optimizer**: AdamW (β1=0.9, β2=0.95, weight_decay=0.1, eps=1e-8)
* **LR schedule**: cosine, warmup_steps=2,500, total_steps=50,000
* **Learning rate**: 1e-4
* **Dropout**: 0.05
* **Grad clip**: 1.0
* **Gradient checkpointing**: disabled (to match the original Qwen run)
* **Hardware**: 1× NVIDIA A100 40GB (single GPU)

These settings mirror the completed Qwen run: batch_size=4, grad_accumulation=8, block_size=1024; cosine schedule with max_iters=50k and warmup_iters=2.5k; AdamW with lr=1e-4, weight_decay=0.1, betas=(0.9, 0.95), eps=1e-8; BF16/AMP. A minimal PyTorch sketch of this configuration appears under *Reproducibility* below.

## Intended use & limitations

* Intended for **English** text generation and for research on distribution shift from Chinese-centric pretraining to English-only continued pretraining.
* Not a safety-aligned instruction model; outputs can be inaccurate or unsafe in certain contexts. Please apply your own filtering/evaluation.

## How to use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt"
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Write a short paragraph about large language models."
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```

## Reproducibility

We log the **exact training tokenizer** and use it in this repo.
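As a quick sanity check, you can confirm that the tokenizer shipped with this checkpoint matches the base `Qwen/Qwen2.5-1.5B` tokenizer. This is a minimal sketch, not part of the training script; the equality checks assume the vocabulary was left unchanged by continued pretraining.

```python
from transformers import AutoTokenizer

repo = "ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt"

# Tokenizer saved alongside this checkpoint (the one used during training) ...
tok_repo = AutoTokenizer.from_pretrained(repo, use_fast=True)
# ... and the base model's tokenizer, for comparison.
tok_base = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B", use_fast=True)

sample = "Continued pretraining on OpenWebText."
# Both should be True if the vocabulary was not modified (assumption).
print(tok_repo.get_vocab() == tok_base.get_vocab())
print(tok_repo(sample)["input_ids"] == tok_base(sample)["input_ids"])
```

For reference, the hyperparameters listed under *Training setup* correspond roughly to the PyTorch/`transformers` configuration sketched below. This is a hedged re-creation for reproducibility, not the original training script; the random batch merely stands in for real OpenWebText data read from the memmaps.

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

device = "cuda" if torch.cuda.is_available() else "cpu"

# Base model in BF16, matching the precision used for continued pretraining.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B", torch_dtype=torch.bfloat16
).to(device)
model.train()

# AdamW + cosine schedule with warmup, using the hyperparameters listed above.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_500, num_training_steps=50_000
)

# One optimizer step: 8 gradient-accumulation micro-steps of shape (4, 1024),
# i.e. 32,768 tokens per step, with BF16 autocast and gradient clipping at 1.0.
grad_accumulation, batch_size, block_size = 8, 4, 1024
for _ in range(grad_accumulation):
    # Random token ids stand in for a real OpenWebText batch (illustration only).
    batch = torch.randint(0, model.config.vocab_size, (batch_size, block_size), device=device)
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = model(input_ids=batch, labels=batch).loss / grad_accumulation
    loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad(set_to_none=True)
```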
To recompute token counts:

```python
# TRAIN_TOKENS ≈ total_steps * gradient_accumulation * batch_size * block_size
# If you saved a checkpoint 'ckpt.pt' with args, you can parse it like this:
import os

import numpy as np
import torch

ckpt = torch.load("ckpt.pt", map_location="cpu")
args = ckpt.get("args", {})  # saved by the training script
GAS = args.get("gradient_accumulation_steps")
BATCH = args.get("batch_size")
BLOCK = args.get("block_size")
iters = ckpt.get("iter_num")  # last saved iter
TRAIN_TOKENS = (iters or 0) * GAS * BATCH * BLOCK
print("approx TRAIN_TOKENS:", TRAIN_TOKENS)

# To estimate the dataset fraction:
# assumes the memmaps were built at data/<DATASET>/<SANITIZED_MODEL_NAME>/train.bin
san = "{SANITIZED_MODEL_NAME}"  # e.g., "Qwen2.5-1.5B"
data_dir = os.path.join("data", "{DATASET}", san)
train_bin = np.memmap(os.path.join(data_dir, "train.bin"), dtype=np.uint32, mode="r")
val_bin = np.memmap(os.path.join(data_dir, "val.bin"), dtype=np.uint32, mode="r")
TOTAL_DATASET_TOKENS = len(train_bin) + len(val_bin)
DATASET_FRACTION = TRAIN_TOKENS / max(1, TOTAL_DATASET_TOKENS)
print("dataset fraction ~", DATASET_FRACTION)
```

## License

* Base model license: Apache-2.0 (see the base model card)
* This fine-tuned checkpoint is released under **Apache-2.0**. Please verify compliance for your use case.

## Citation

If you use this model, please cite:

```
@misc{MoreTrain_Qwen2.5_1.5B_owt,
  title  = {MoreTrain_Qwen2.5_1.5B_owt},
  author = {Seunghan Kim},
  year   = {2025},
  url    = {https://huggingface.co/ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt}
}
```