---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
tags:
- alignment-handbook
- generated_from_trainer
datasets:
- princeton-nlp/llama3-ultrafeedback-armorm
model-index:
- name: tpo-alignment/Instruct-Llama-3-8B-TPO-y2
results: []
license: mit
---
# Instruct-Llama-3-8B-TPO-y2 Model Card
TPO (Triple Preference Optimization) is a novel preference optimization algorithm aimed at enhancing the instruction-following and reasoning capabilities of large language models through a one-step optimization process. Additionally, we introduce TPO-L, a length-controlled variant of TPO that significantly boosts performance by incorporating a reward margin into TPO’s structure. For more details, refer to our [preprint](https://arxiv.org/abs/2405.16681) and [GitHub repository](https://github.com/sahsaeedi/TPO/).
## Model Details
### Model Description
We fine-tuned [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on [princeton-nlp/llama3-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm) with the TPO objective. For fine-tuning, we selected the highest-scoring response as the gold response, the second-best response as the preferred response, and the lowest-scoring response as the rejected response.
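To make the selection rule concrete, here is a minimal sketch of how such (gold, preferred, rejected) triples could be built by ranking each prompt's candidate responses by their reward scores. The column names `all_generated_responses` and `all_rm_scores` are assumptions about the dataset layout, not its confirmed schema; the actual preprocessing lives in the GitHub repository.
```python
# Minimal sketch of the triple-selection step described above.
# The column names below are assumptions, not the confirmed dataset schema.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/llama3-ultrafeedback-armorm", split="train")

def make_triple(example):
    responses = example["all_generated_responses"]  # assumed field name
    scores = example["all_rm_scores"]               # assumed field name
    # Rank candidate responses from highest to lowest reward score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {
        "gold": responses[order[0]],       # highest-scoring response
        "preferred": responses[order[1]],  # second-best response
        "rejected": responses[order[-1]],  # lowest-scoring response
    }

triples = ds.map(make_triple)
```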
- **Developed by:** Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
- **Model type:** Causal Language Model
- **License:** MIT
- **Finetuned from model:** [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/sahsaeedi/TPO
- **Paper:** https://arxiv.org/abs/2405.16681
## How to Get Started with the Model
```python
import torch
from transformers import pipeline

model_id = "tpo-alignment/Instruct-Llama-3-8B-TPO-y2"

generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# Llama 3 Instruct models mark the end of a turn with the <|eot_id|> token.
outputs = generator(
    [{"role": "user", "content": "What's the difference between llamas and alpacas?"}],
    do_sample=False,
    eos_token_id=[
        generator.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        generator.tokenizer.eos_token_id,
    ],
    max_new_tokens=200,
)
print(outputs[0]["generated_text"][-1]["content"])
```
## Training Details
### Training Data
We use [princeton-nlp/llama3-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm) as the preference optimization dataset.
#### Training Hyperparameters
The hyperparameters used can be found in the [repository](https://github.com/sahsaeedi/TPO).
## Technical Specifications
### Model Architecture and Objective
The model architecture is based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). We use the TPO training objective proposed in our [preprint](https://arxiv.org/abs/2405.16681).
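For intuition only, the sketch below shows one way a triple-preference-style loss could combine a likelihood term on the gold response with a DPO-style preference term between the preferred and rejected responses. This is an assumption-laden illustration, not the exact TPO objective; the true loss, the role of a reference model, and the coefficients `beta` and `alpha` should be taken from the preprint and repository.
```python
# Illustrative-only sketch: an SFT-style term on the gold response plus a
# DPO-style term between preferred and rejected responses. NOT guaranteed to
# match the exact TPO objective from the preprint.
import torch
import torch.nn.functional as F

def triple_preference_loss(
    logp_gold,      # policy log-prob of the gold response (summed over tokens)
    logp_pref,      # policy log-prob of the preferred response
    logp_rej,       # policy log-prob of the rejected response
    ref_logp_pref,  # reference-model log-prob of the preferred response
    ref_logp_rej,   # reference-model log-prob of the rejected response
    beta=0.1,       # assumed scale on the preference term
    alpha=1.0,      # assumed weight on the gold (SFT-style) term
):
    # DPO-style margin between preferred and rejected responses.
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    preference_term = -F.logsigmoid(margin)
    # Likelihood term that pulls the policy toward the gold response.
    sft_term = -alpha * logp_gold
    return (preference_term + sft_term).mean()
```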
#### Hardware
We used 8xA100 GPUs for model training.
## Citation
TPO paper:
```
@misc{saeidi2025triplepreferenceoptimizationachieving,
      title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
      author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
      year={2025},
      eprint={2405.16681},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.16681},
}
```