Proximal Policy Optimization

This is a Proximal Policy Optimization (PPO) model trained as part of the LLM course from VK.

Details

Model description

PPO is a classic RL approach to many tasks. It is widely used because of its versatility, and production models today, when they use RL at all, are based on PPO.

Unlike DPO, PPO comes with a number of problems specific to RL. In particular, a lot depends on the quality of the available reward model. In our case, the reward model was HuggingFaceTB/SmolLM-135M-Instruct. In our experiments, the results were slightly worse than with DPO.
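For context, the core of PPO as defined in the original paper (linked in the references) is a clipped surrogate objective that limits how far one update can move the policy away from the policy that generated the samples. The snippet below is a minimal sketch of that objective in plain Python, not the training code used for this model. The function name `ppo_clipped_loss`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions.

```python
import math


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean clipped surrogate loss (to be minimized) over a batch of actions.

    new_logprobs: log-probabilities of the sampled actions under the current policy
    old_logprobs: log-probabilities of the same actions under the sampling policy
    advantages:   advantage estimates for those actions
    clip_eps:     clipping range epsilon (assumed value, not taken from this card)
    """
    if not (len(new_logprobs) == len(old_logprobs) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    if not advantages:
        raise ValueError("inputs must not be empty")

    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Ratio restricted to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (lower) bound of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped_ratio * adv)

    # The objective is maximized, so the loss is its negated mean.
    return -total / len(advantages)
```

With a positive advantage, raising the probability ratio above `1 + clip_eps` no longer improves the objective, so the update has no incentive to move further. With a negative advantage the unclipped term is kept whenever it is the more pessimistic one, so moves that make a bad action more likely are still penalized in full.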

The effect of the method on the model is illustrated below with its response to the user question "What's your morning routine like?"

Example: model response before fine-tuning

I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like. Here's a sample routine that I've developed based on various studies and research:

Morning Routine (10-15 minutes)

  1. Hydrate: Drink a full glass of water or a herbal tea (e.g., chamomile, peppermint) to start the day.
  2. Eat a nutritious breakfast: Prepare a healthy breakfast, such as oatmeal with fruit, scrambled eggs with spinach, or Greek yogurt with berries.
  3. Get some morning sunlight: Spend 10-15 minutes outside in natural light to help regulate your circadian rhythms.
  4. Take a few deep breaths: Inhale for 1-2 minutes, hold for 2-3 minutes, and exhale for 2-3 minutes.
  5. Stretch or move: Engage in some light stretching or movement to get your blood flowing and your muscles moving.

Morning Routine (15-30 minutes)

  1. Get some morning sunlight: Spend 10-1...

Example: model response after fine-tuning

I'm a morning person, so I've got a lot on my plate. Here's my morning routine:

Morning Snack (10-15 minutes)

  • Fresh fruit (banana, apple, or orange)
  • Yogurt or a protein smoothie (1 scoop protein powder, 1 scoop yogurt, and 1 scoop protein powder)
  • Carrot sticks with hummus (1/2 cup cooked chickpeas, 1/2 cup chopped veggies, and 1 tablespoon hummus)

Lunch Break (15-20 minutes)

  • A healthy lunch option (breakfasts, lunches, or a combination)
  • A salad with mixed greens, cherry tomatoes, cucumbers, and avocado
  • A protein shake or smoothie with protein powder, banana, and almond milk

Afternoon Snack (10-15 minutes)

  • Carrot sticks with hummus
  • Fresh fruit (banana, apple, or orange)
  • Yogurt or a protein smoothie (1 scoop protein powder, 1 scoop yogurt, and 1 scoop protein powder)

Evening Snack (10-15 minutes)

  • A healthy snack (...

References

  • Model: HuggingFaceTB/SmolLM-135M-Instruct
  • Dataset: HumanLLMs/Human-Like-DPO-Dataset
  • Original paper: https://arxiv.org/abs/1707.06347