reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000

This model was fine-tuned using PPO (Proximal Policy Optimization) as part of RLHF (Reinforcement Learning from Human Feedback).

Model Details

  • Model Path: data/olmo_reading_level_pairwise_reward_chosen_12th_grade_rejected_7th_grade_-1_steps_1000/best_model
  • Upload Date: 2025-07-17
  • Training Method: PPO/RLHF
  • Base Model: OLMo (inferred from path)
  • Model Size: 1.48B parameters
  • Tensor Type: F32

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000")
model = AutoModelForCausalLM.from_pretrained("Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000")

# Generate text
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
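The snippet above uses greedy decoding with a hard cap on total sequence length. For longer or more varied outputs, the variant below is a minimal sketch that moves the model to a GPU when one is available and samples instead of decoding greedily; the sampling parameters (temperature, top_p, max_new_tokens) are illustrative defaults, not values tuned for this model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Move the model to a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer("Explain photosynthesis in one paragraph.", return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,  # cap newly generated tokens rather than total sequence length
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))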

Files

This repository contains the best checkpoint from the training run, including:

  • Model weights (.safetensors format)
  • Tokenizer configuration
  • Model configuration
  • Generation configuration (if available)
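To inspect the checkpoint before loading the full model, the sketch below downloads the weight file and lists its tensors. It assumes the weights are stored under the usual default name model.safetensors; that file name is an assumption, not confirmed by the listing above.

from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download only the weight file (assumed to be named model.safetensors)
weights_path = hf_hub_download(
    repo_id="Yuhan123/reading-level-pairwise-reward-chosen-12th-grade-rejected-7th-grade-1-steps-1000",
    filename="model.safetensors",
)

# List tensor names, shapes, and dtypes without instantiating the model
with safe_open(weights_path, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)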

Training Details

This model is the best-performing checkpoint from a PPO training run. For more details about the training process, please refer to the original training logs.
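The training code itself is not part of this repository. For orientation only, the sketch below shows what a single PPO step with a reading-level-based reward could look like using the TRL library (older PPOTrainer API, pre-0.8). The base checkpoint name, the textstat-based reward stand-in, and all hyperparameters are assumptions for illustration, not values from this training run.

import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

base_model = "allenai/OLMo-1B-hf"  # assumed base checkpoint, not confirmed
config = PPOConfig(model_name=base_model, learning_rate=1.41e-5, batch_size=1, mini_batch_size=1)

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # PPOTrainer needs a pad token
model = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)
ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

def reading_level_reward(text: str) -> float:
    # Hypothetical stand-in for the pairwise reward model: score text by how
    # close its Flesch-Kincaid grade is to a 12th-grade reading level.
    import textstat
    return -abs(textstat.flesch_kincaid_grade(text) - 12.0)

query = tokenizer("Explain photosynthesis.", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate([query], return_prompt=False, max_new_tokens=64)[0]
reward = torch.tensor(reading_level_reward(tokenizer.decode(response, skip_special_tokens=True)))

# One PPO optimization step on a single (query, response, reward) triple
stats = ppo_trainer.step([query], [response], [reward])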
