---
library_name: transformers
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-1.5B
license: apache-2.0
language:
- en
tags:
- qwen
- qwen2.5
- causal-lm
- continued-pretraining
- openwebtext
datasets:
- Skylion007/openwebtext
model-index:
- name: Qwen2.5 1.5B Continued Pretraining on OpenWebText
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: OpenWebText
      type: openwebtext
    metrics:
    - type: perplexity
      name: Validation Perplexity
      value: null
---
# Qwen2.5 1.5B Continued Pretraining on OpenWebText (BF16)
**Base model**: `Qwen/Qwen2.5-1.5B`
**Objective**: This model was continued-pretrained on *English OpenWebText* to examine whether the "thought process" and behavior of a Chinese-centric base model shift when it is further trained on English-only data, i.e., to **reduce Chinese linguistic bias** and align the model more closely with English usage.
## What’s in this repo?
* **Architecture**: Qwen2.5 1.5B (causal LM)
* **Weights dtype**: bfloat16 (BF16)
* **Tokenizer**: the exact tokenizer used during training (saved alongside the checkpoint)
## Training data
* **Dataset**: OpenWebText (English)
* **Total training tokens**: **1,638,400,000**
* **Sequence length (block\_size)**: **1024**
* **Effective batch**: batch\_size=4, grad\_accumulation=8 ⇒ tokens/iter=32,768
* **Portion of dataset used**: ~**18.31%** of the tokenized corpus (see the token-count check below)
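As a quick consistency check, the token budget above follows directly from the settings listed in the next section:

```python
# Token-budget arithmetic from the run settings documented in this card.
batch_size = 4
grad_accumulation = 8
block_size = 1024
total_steps = 50_000

tokens_per_iter = batch_size * grad_accumulation * block_size   # 32,768
total_train_tokens = tokens_per_iter * total_steps               # 1,638,400,000
print(tokens_per_iter, total_train_tokens)
```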
## Training setup
* **Precision**: BF16 (`torch.bfloat16`, AMP autocast)
* **Optimizer**: AdamW (β1=0.9, β2=0.95, weight\_decay=0.1, eps=1e-8)
* **LR schedule**: Cosine, warmup\_steps=2,500, total\_steps=50,000
* **Learning rate**: 1e-4
* **Dropout**: 0.05
* **Grad clip**: 1.0
* **Gradient checkpointing**: disabled (matches the original Qwen run)
* **Hardware**: 1× NVIDIA A100 40GB (single GPU)
These settings mirror the completed Qwen run: batch\_size=4, grad\_accumulation=8, block\_size=1024; cosine schedule with max\_iters=50k and warmup\_iters=2.5k; AdamW with lr=1e-4, weight\_decay=0.1, betas=(0.9,0.95), eps=1e-8; BF16/AMP.
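For reference, here is a minimal sketch of how these hyperparameters fit together in PyTorch, using `torch.optim.AdamW` and the `get_cosine_schedule_with_warmup` helper from `transformers`. This is not the original training script, and the commented loop body is only an outline:

```python
# Sketch of the optimizer / LR-schedule configuration described above.
# `model` stands in for the Qwen2.5-1.5B causal LM being continued-pretrained.
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B", torch_dtype=torch.bfloat16
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_500,
    num_training_steps=50_000,
)

# Inside the training loop (BF16 autocast, grad clipping at 1.0), roughly:
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#     loss = model(**batch, labels=batch["input_ids"]).loss / grad_accumulation
# loss.backward()
# ... every `grad_accumulation` micro-steps:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad(set_to_none=True)
```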
## Intended use & limitations
* Intended for **English** text generation and research on distribution shift from Chinese‑centric pretraining to English‑only continued pretraining.
* Not a safety‑aligned instruction model; outputs can be inaccurate or unsafe in certain contexts. Please apply your own filtering/evaluation.
## How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
repo = "ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt"
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, device_map="auto")
prompt = "Write a short paragraph about large language models."
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```
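Because this is a continued-pretrained base model (not instruction-tuned), completion-style prompts work best. To reduce repetition you can switch from greedy decoding to sampling; the snippet below continues from the example above, and the sampling parameters are illustrative rather than tuned for this checkpoint:

```python
# Continues from the snippet above (reuses `model`, `tok`, and `ids`).
out = model.generate(
    **ids,
    max_new_tokens=128,
    do_sample=True,        # sample instead of greedy decoding
    temperature=0.8,       # illustrative values, not tuned for this checkpoint
    top_p=0.95,
)
print(tok.decode(out[0], skip_special_tokens=True))
```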
## Reproducibility
The **exact tokenizer used during training** is saved in this repo and used here.
To recompute token counts:
```python
# TRAIN_TOKENS ≈ total_steps * gradient_accumulation * batch_size * block_size
# If you saved a checkpoint 'ckpt.pt' with args, you can parse it like this:
import os, torch
ckpt = torch.load("ckpt.pt", map_location="cpu", weights_only=False)  # args dict is pickled
args = ckpt.get("args", {})  # saved by the training script
# Fall back to the documented run settings if a field is missing.
GAS = args.get("gradient_accumulation_steps", 8)
BATCH = args.get("batch_size", 4)
BLOCK = args.get("block_size", 1024)
iters = ckpt.get("iter_num")  # last saved iteration
TRAIN_TOKENS = (iters or 0) * GAS * BATCH * BLOCK
print("approx TRAIN_TOKENS:", TRAIN_TOKENS)
# To estimate dataset fraction:
# If you built memmap at data/<dataset>/<sanitized_model_name>/train.bin
import numpy as np
san = "{SANITIZED_MODEL_NAME}" # e.g., "Qwen2.5-1.5B"
data_dir = os.path.join("data", "{DATASET}", san)  # {DATASET} e.g. "openwebtext"
train_bin = np.memmap(os.path.join(data_dir, "train.bin"), dtype=np.uint32, mode="r")
val_bin = np.memmap(os.path.join(data_dir, "val.bin"), dtype=np.uint32, mode="r")
TOTAL_DATASET_TOKENS = len(train_bin) + len(val_bin)
DATASET_FRACTION = TRAIN_TOKENS / max(1, TOTAL_DATASET_TOKENS)
print("dataset fraction ~", DATASET_FRACTION)
```
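The model-index lists validation perplexity without a recorded value. If you want to estimate it yourself, the following is a rough sketch over any held-out English text; the file path is a placeholder, and this does not reproduce the exact evaluation protocol of the original run:

```python
# Rough validation-perplexity estimate over a held-out text file.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt"
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

block_size = 1024                       # matches the training block_size
text = open("val.txt").read()           # placeholder: any held-out English text
ids = tok(text, return_tensors="pt").input_ids[0]

nll, n_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, len(ids) - 1, block_size):
        chunk = ids[start : start + block_size].unsqueeze(0).to(model.device)
        if chunk.size(1) < 2:
            continue  # too short to form a (context, target) pair
        # labels == input_ids -> model returns mean cross-entropy over the chunk
        loss = model(chunk, labels=chunk).loss
        nll += loss.item() * chunk.numel()
        n_tokens += chunk.numel()

print("perplexity ~", math.exp(nll / n_tokens))
```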
## License
* Base model license: Apache‑2.0 (see base model card)
* This fine‑tuned checkpoint is released under **Apache‑2.0**. Please verify compliance for your use case.
## Citation
If you use this model, please cite:
```
@misc{MoreTrain_Qwen2.5_1.5B_owt,
  title  = {MoreTrain_Qwen2.5_1.5B_owt},
  author = {Seunghan Kim},
  year   = {2025},
  url    = {https://huggingface.co/ShrimpPotato/MoreTrain_Qwen2.5_1.5B_owt}
}
```