---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
tags:
- alignment-handbook
- generated_from_trainer
datasets:
- princeton-nlp/llama3-ultrafeedback-armorm
model-index:
- name: tpo-alignment/Instruct-Llama-3-8B-TPO-y2
results: []
license: mit
---
# Instruct-Llama-3-8B-TPO-y2 Model Card
TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our [preprint](https://arxiv.org/abs/2405.16681) and [GitHub repository](https://github.com/sahsaeedi/TPO/).
## Model Details
### Model Description
We fine-tuned [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on [princeton-nlp/llama3-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm) with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the second-best response as the preferred response, and the lowest-scoring response as the rejected response.
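To make the selection rule concrete, here is a minimal sketch of how such (gold, preferred, rejected) triples could be built by ranking each prompt's candidate responses by their reward scores. The column names `all_generated_responses` and `all_rm_scores` are assumptions about the dataset layout, not its confirmed schema; the actual preprocessing lives in the GitHub repository.
```python
# Minimal sketch of the triple-selection step described above.
# The column names below are assumptions, not the confirmed dataset schema.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/llama3-ultrafeedback-armorm", split="train")

def make_triple(example):
    responses = example["all_generated_responses"]  # assumed field name
    scores = example["all_rm_scores"]               # assumed field name
    # Rank candidate responses from highest to lowest reward score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {
        "gold": responses[order[0]],       # highest-scoring response
        "preferred": responses[order[1]],  # second-best response
        "rejected": responses[order[-1]],  # lowest-scoring response
    }

triples = ds.map(make_triple)
```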
- **Developed by:** Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- **Model type:** Causal Language Model
- **License:** MIT
- **Finetuned from model:** [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/sahsaeedi/TPO
- **Paper:** https://arxiv.org/abs/2405.16681
## How to Get Started with the Model
```python
import torch
from transformers import pipeline

model_id = "tpo-alignment/Instruct-Llama-3-8B-TPO-y2"

generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# Llama 3 Instruct models mark the end of a turn with the <|eot_id|> token.
outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,
    eos_token_id=[
        generator.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        generator.tokenizer.eos_token_id,
    ],
    max_new_tokens=200,
)
print(outputs[0]["generated_text"][-1]["content"])
```
## Training Details
### Training Data
We use [princeton-nlp/llama3-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm) as the preference optimization dataset.
#### Training Hyperparameters
The hyperparameters used can be found in the [repository](https://github.com/sahsaeedi/TPO).
## Technical Specifications
### Model Architecture and Objective
The model architecture is based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). We use the TPO training objective proposed in our [preprint](https://arxiv.org/abs/2405.16681).
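For intuition only, the sketch below shows one way a triple-preference-style loss could combine a likelihood term on the gold response with a DPO-style preference term between the preferred and rejected responses. This is an assumption-laden illustration, not the exact TPO objective; the true loss, the role of a reference model, and the coefficients `beta` and `alpha` should be taken from the preprint and repository.
```python
# Illustrative-only sketch: an SFT-style term on the gold response plus a
# DPO-style term between preferred and rejected responses. NOT guaranteed to
# match the exact TPO objective from the preprint.
import torch
import torch.nn.functional as F

def triple_preference_loss(
    logp_gold,      # policy log-prob of the gold response (summed over tokens)
    logp_pref,      # policy log-prob of the preferred response
    logp_rej,       # policy log-prob of the rejected response
    ref_logp_pref,  # reference-model log-prob of the preferred response
    ref_logp_rej,   # reference-model log-prob of the rejected response
    beta=0.1,       # assumed scale on the preference term
    alpha=1.0,      # assumed weight on the gold (SFT-style) term
):
    # DPO-style margin between preferred and rejected responses.
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    preference_term = -F.logsigmoid(margin)
    # Likelihood term that pulls the policy toward the gold response.
    sft_term = -alpha * logp_gold
    return (preference_term + sft_term).mean()
```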
#### Hardware
We used 8xA100 GPUs for model training.
## Citation
TPO paper:
```
@misc{saeidi2025triplepreferenceoptimizationachieving,
      title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
      author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
      year={2025},
      eprint={2405.16681},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.16681},
}
```