Proximal Policy Optimization

Здесь представлена Proximal Policy (PPO) модель, обученная в рамках курса по LLM от VK.

Детали

Описание модели

PPO является классическим RL-ным подходом ко многим задачам. Этот подход широко распространён за счёт своей универсальности, и в настоящее время продовые модели, если и используют RL, то в базируются именно на PPO.

В отличие от DPO, PPO имеет множество проблем, связанных именно с RL. В частности, многое зависит от того, насколько хороша имеющаяся у нас Reward модель. Конкретно в нашем случае для реварда использовалась модель HuggingFaceTB/SmolLM-135M-Instruct. По сравнению с DPO в наших экспериментах результаты получились чуть хуже.

Ниже показано влияние метода на модель на примере ответа на пользовательский вопрос "What's your morning routine like?"

Пример: ответ модели до файнтюна

I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like. Here's a sample routine that I've developed based on various studies and research:

Morning Routine (10-15 minutes)

Hydrate: Drink a full glass of water or a herbal tea (e.g., chamomile, peppermint) to start the day.
Eat a nutritious breakfast: Prepare a healthy breakfast, such as oatmeal with fruit, scrambled eggs with spinach, or Greek yogurt with berries.
Get some morning sunlight: Spend 10-15 minutes outside in natural light to help regulate your circadian rhythms.
Take a few deep breaths: Inhale for 1-2 minutes, hold for 2-3 minutes, and exhale for 2-3 minutes.
Stretch or move: Engage in some light stretching or movement to get your blood flowing and your muscles moving.

Morning Routine (15-30 minutes)

Get some morning sunlight: Spend 10-1...

Пример: ответ модели после файнтюна

I'm a morning person, so I've got a lot on my plate. Here's my morning routine:

Morning Snack (10-15 minutes)

Fresh fruit (banana, apple, or orange)
Yogurt or a protein smoothie (1 scoop protein powder, 1 scoop yogurt, and 1 scoop protein powder)
Carrot sticks with hummus (1/2 cup cooked chickpeas, 1/2 cup chopped veggies, and 1 tablespoon hummus)

Lunch Break (15-20 minutes)

A healthy lunch option (breakfasts, lunches, or a combination)
A salad with mixed greens, cherry tomatoes, cucumbers, and avocado
A protein shake or smoothie with protein powder, banana, and almond milk

Afternoon Snack (10-15 minutes)

Carrot sticks with hummus
Fresh fruit (banana, apple, or orange)
Yogurt or a protein smoothie (1 scoop protein powder, 1 scoop yogurt, and 1 scoop protein powder)

Evening Snack (10-15 minutes)

A healthy snack (...

Референсы

Модель: HuggingFaceTB/SmolLM-135M-Instruct
Датасет: HumanLLMs/Human-Like-DPO-Dataset
Оригинальная статья: https://arxiv.org/abs/1707.06347