llama3-8b-mypo3_sim-full-beta7.5-lr4e-7

This model is a fine-tuned version of princeton-nlp/Llama-3-Base-8B-SFT on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set (a loading sketch follows the results):

  • Loss: 1.3631
  • Rewards/chosen: 0.0450
  • Rewards/rejected: -0.3543
  • Rewards/accuracies: 0.7579
  • Rewards/margins: 0.3992
  • Logps/rejected: -1.5366
  • Logps/chosen: -1.2652
  • Logits/rejected: -1.1234
  • Logits/chosen: -1.0985
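
Rewards/margins is the difference between the chosen and rejected rewards: 0.0450 - (-0.3543) ≈ 0.3992. As referenced above, here is a minimal sketch for loading the checkpoint with the transformers library; the prompt and generation settings are illustrative assumptions, and device_map="auto" additionally requires accelerate.

```python
# Minimal loading sketch (assumes a bfloat16-capable GPU; the prompt and
# generation settings below are illustrative, not tuned for this model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aaaalongaa/llama3-8b-mypo3_sim-full-beta7.5-lr4e-7"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the checkpoint is stored in BF16
    device_map="auto",           # requires the accelerate package
)

prompt = "Explain the difference between supervised fine-tuning and preference optimization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Strip the prompt tokens and print only the completion.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```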

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch mirroring them follows the list):

  • learning_rate: 4e-07
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
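
As noted above, this is a minimal sketch mapping the listed values onto transformers.TrainingArguments. The card does not document the mypo3_sim objective or its trainer, so only the hyperparameter mapping is shown; output_dir is an assumed name. The totals follow from the per-device values: 4 per device × 4 GPUs × 2 accumulation steps = 32 for training, and 8 × 4 GPUs = 32 for evaluation.

```python
# Hyperparameter mapping sketch using transformers.TrainingArguments.
# The actual mypo3_sim training objective/trainer is not documented in this
# card; this only shows how the listed values map onto standard arguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-8b-mypo3_sim-full-beta7.5-lr4e-7",  # assumed name
    learning_rate=4e-7,
    per_device_train_batch_size=4,   # x 4 GPUs x 2 accumulation = 32 total
    per_device_eval_batch_size=8,    # x 4 GPUs = 32 total eval batch
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,                       # weights are stored in BF16
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```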

Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 1.3793        | 0.0523 | 100  | 1.3815          | -0.0364        | -0.0923          | 0.6270             | 0.0559          | -1.5017        | -1.2760      | -1.0412         | -1.0098       |
| 1.3797        | 0.1047 | 200  | 1.3835          | -0.0667        | -0.2448          | 0.7103             | 0.1781          | -1.5220        | -1.2800      | -1.0404         | -1.0114       |
| 1.3748        | 0.1570 | 300  | 1.3803          | 0.0532         | -0.2001          | 0.7341             | 0.2534          | -1.5161        | -1.2640      | -1.0513         | -1.0243       |
| 1.3639        | 0.2094 | 400  | 1.3851          | 0.0649         | -0.2251          | 0.7302             | 0.2900          | -1.5194        | -1.2625      | -1.0536         | -1.0279       |
| 1.3736        | 0.2617 | 500  | 1.3799          | 0.0384         | -0.3073          | 0.7282             | 0.3457          | -1.5304        | -1.2660      | -1.0693         | -1.0442       |
| 1.3698        | 0.3141 | 600  | 1.3888          | -0.0230        | -0.3563          | 0.7361             | 0.3333          | -1.5369        | -1.2742      | -1.0838         | -1.0584       |
| 1.3417        | 0.3664 | 700  | 1.3778          | 0.0230         | -0.3367          | 0.7302             | 0.3597          | -1.5343        | -1.2681      | -1.0931         | -1.0674       |
| 1.413         | 0.4187 | 800  | 1.3758          | -0.0158        | -0.3821          | 0.7401             | 0.3663          | -1.5404        | -1.2732      | -1.1025         | -1.0780       |
| 1.3989        | 0.4711 | 900  | 1.3793          | 0.0086         | -0.3610          | 0.7460             | 0.3696          | -1.5375        | -1.2700      | -1.1075         | -1.0822       |
| 1.3566        | 0.5234 | 1000 | 1.3717          | 0.1015         | -0.2903          | 0.7440             | 0.3917          | -1.5281        | -1.2576      | -1.1044         | -1.0813       |
| 1.39          | 0.5758 | 1100 | 1.3751          | 0.1356         | -0.2313          | 0.7341             | 0.3669          | -1.5202        | -1.2531      | -1.1474         | -1.1210       |
| 1.3829        | 0.6281 | 1200 | 1.3682          | 0.0202         | -0.3839          | 0.7619             | 0.4041          | -1.5406        | -1.2684      | -1.1289         | -1.1032       |
| 1.3495        | 0.6805 | 1300 | 1.3676          | 0.0081         | -0.3882          | 0.7540             | 0.3963          | -1.5412        | -1.2701      | -1.0956         | -1.0722       |
| 1.349         | 0.7328 | 1400 | 1.3697          | 0.1064         | -0.2902          | 0.7421             | 0.3966          | -1.5281        | -1.2570      | -1.1047         | -1.0809       |
| 1.3702        | 0.7851 | 1500 | 1.3645          | 0.0567         | -0.3406          | 0.7560             | 0.3973          | -1.5348        | -1.2636      | -1.1163         | -1.0916       |
| 1.3753        | 0.8375 | 1600 | 1.3645          | 0.0555         | -0.3434          | 0.7520             | 0.3988          | -1.5352        | -1.2638      | -1.1144         | -1.0900       |
| 1.3577        | 0.8898 | 1700 | 1.3632          | 0.0357         | -0.3637          | 0.7540             | 0.3994          | -1.5379        | -1.2664      | -1.1254         | -1.1003       |
| 1.3568        | 0.9422 | 1800 | 1.3634          | 0.0453         | -0.3518          | 0.7520             | 0.3971          | -1.5363        | -1.2651      | -1.1305         | -1.1050       |
| 1.3632        | 0.9945 | 1900 | 1.3634          | 0.0445         | -0.3542          | 0.7540             | 0.3986          | -1.5366        | -1.2652      | -1.1235         | -1.0986       |
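
Evaluation runs every 100 optimizer steps. With a total train batch size of 32, the final row (step 1900 at epoch 0.9945) implies roughly 1,910 steps per epoch, i.e. about 32 × 1,910 ≈ 61k training pairs, consistent with the size of the ultrafeedback_binarized preference-training split.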

Framework versions

  • Transformers 4.43.1
  • Pytorch 2.1.2+cu121
  • Datasets 2.18.0
  • Tokenizers 0.19.1