reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000

This model was fine-tuned using PPO (Proximal Policy Optimization) as part of RLHF (Reinforcement Learning from Human Feedback).

Model Details

  • Model Path: data/olmo_reading_level_pairwise_reward_chosen_12th_grade_rejected_7th_grade_-1_steps_1000/best_model
  • Upload Date: 2025-07-17
  • Training Method: PPO/RLHF
  • Base Model: OLMo (inferred from path)
  • Model Size: 1.48B parameters
  • Tensor Type: F32

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000")
model = AutoModelForCausalLM.from_pretrained("Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000")

# Generate text
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
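The snippet above uses greedy decoding with a hard cap on total sequence length. For longer or more varied outputs, the variant below is a minimal sketch that moves the model to a GPU when one is available and samples instead of decoding greedily; the sampling parameters (temperature, top_p, max_new_tokens) are illustrative defaults, not values tuned for this model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Move the model to a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer("Explain photosynthesis in one paragraph.", return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,  # cap newly generated tokens rather than total sequence length
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))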

Files

This repository contains the best checkpoint from the training run, including:

  • Model weights (.safetensors format)
  • Tokenizer configuration
  • Model configuration
  • Generation configuration (if available)
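To inspect the checkpoint before loading the full model, the sketch below downloads the weight file and lists its tensors. It assumes the weights are stored under the usual default name model.safetensors; that file name is an assumption, not confirmed by the listing above.

from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download only the weight file (assumed to be named model.safetensors)
weights_path = hf_hub_download(
    repo_id="Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000",
    filename="model.safetensors",
)

# List tensor names, shapes, and dtypes without instantiating the model
with safe_open(weights_path, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)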

Training Details

This model is the best-performing checkpoint from a PPO training run. For more details about the training process, please refer to the original training logs.
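The training code itself is not part of this repository. For orientation only, the sketch below shows what a single PPO step with a reading-level-based reward could look like using the TRL library (older PPOTrainer API, pre-0.8). The base checkpoint name, the textstat-based reward stand-in, and all hyperparameters are assumptions for illustration, not values from this training run.

import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

base_model = "allenai/OLMo-1B-hf"  # assumed base checkpoint, not confirmed
config = PPOConfig(model_name=base_model, learning_rate=1.41e-5, batch_size=1, mini_batch_size=1)

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # PPOTrainer needs a pad token
model = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)
ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

def reading_level_reward(text: str) -> float:
    # Hypothetical stand-in for the pairwise reward model: score text by how
    # close its Flesch-Kincaid grade is to a 12th-grade reading level.
    import textstat
    return -abs(textstat.flesch_kincaid_grade(text) - 12.0)

query = tokenizer("Explain photosynthesis.", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate([query], return_prompt=False, max_new_tokens=64)[0]
reward = torch.tensor(reading_level_reward(tokenizer.decode(response, skip_special_tokens=True)))

# One PPO optimization step on a single (query, response, reward) triple
stats = ppo_trainer.step([query], [response], [reward])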
