Qwen2 RLOO Countdown (Step 250)

This model is a Qwen2-based language model fine-tuned with RLOO (REINFORCE Leave-One-Out) on countdown math problems.

Training Details

  • Base Model: thomasjhuang/qwen2-sft-warmup
  • Method: RLOO (REINFORCE Leave-One-Out, an RLHF-style policy-gradient method)
  • Dataset: Jiayi-Pan/Countdown-Tasks-3to4
  • Training Steps: 250 optimizer steps
  • Learning Rate: 3e-6
  • Temperature: 0.1
  • Batch Size: 2
  • K Samples: 8
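With K samples per prompt, RLOO scores each completion against the mean reward of the other K-1 completions drawn for the same prompt, which serves as a per-prompt baseline. A minimal sketch of that leave-one-out advantage (a hypothetical helper, not the repository's actual training code):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for K sampled completions of one prompt.

    Each sample's baseline is the mean reward of the other K-1 samples,
    so above-average completions get a positive advantage.
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# With K=4 and two successes, the successful samples are pushed up
# and the failed ones pushed down by the same magnitude.
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```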

Key Fixes Applied

  1. Prompt Format: Updated to match SFT evaluation format with detailed instructions
  2. Token Length: Increased to 250 tokens for complete reasoning
  3. Temperature: Reduced to 0.1 for more deterministic generation
  4. Extraction: Fixed to work with vLLM outputs
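The extraction step operates on the raw generated string (e.g. the `text` field of a vLLM output) and pulls the expression out of the `<answer>` tags. A hedged sketch of what such a helper could look like; the exact implementation in the training code may differ:

```python
import re

def extract_answer(text):
    """Return the expression inside the first <answer>...</answer> pair,
    or None when no well-formed answer tag is present."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return m.group(1).strip() if m else None

print(extract_answer("<think>80 - 8 = 72</think><answer> 80 - 8 </answer>"))
```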

Performance

During training at step 250, the model achieved:

  • Average rewards ranging from 0.05 to 0.50 across batches
  • Successful generation of proper <think> and <answer> tags
  • Correct solutions to various countdown math problems

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step250")
model = AutoModelForCausalLM.from_pretrained("thomasjhuang/qwen2-rloo-countdown-step250")

prompt = '''Using the numbers [8, 16, 80], create an equation that equals 72. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.'''

inputs = tokenizer(prompt, return_tensors="pt")
# temperature only takes effect when sampling is enabled; max_new_tokens
# matches the 250-token budget used during training
outputs = model.generate(**inputs, max_new_tokens=250, do_sample=True, temperature=0.1)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
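A countdown answer is correct when the extracted expression uses each given number at most once and evaluates to the target. A minimal checker along those lines (a hypothetical sketch, not the reward function used in training):

```python
import re

def check_solution(expr, numbers, target):
    """Verify a countdown expression: each provided number may be used
    at most once, and the result must equal the target."""
    used = [int(t) for t in re.findall(r"\d+", expr)]
    pool = list(numbers)
    for n in used:
        if n not in pool:
            return False  # number missing or used more often than allowed
        pool.remove(n)
    try:
        # expr is model-generated arithmetic; eval with no builtins for brevity
        value = eval(expr, {"__builtins__": {}})
    except Exception:
        return False
    return abs(value - target) < 1e-6

print(check_solution("80 - 8", [8, 16, 80], 72))   # valid solution
print(check_solution("8 * 8", [8, 16, 80], 72))    # reuses 8, rejected
```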

Training Progress

This checkpoint represents an intermediate state in RLOO training where:

  • The model learned to follow the correct prompt format
  • Success rates improved from 0% to 10-50% on various problems
  • The model generates structured reasoning in <think> tags
  • Solutions are properly formatted in <answer> tags

For the latest checkpoint, see: thomasjhuang/qwen2-rloo-countdown-final
