SPY Lab - ETH Zurich

https://spylab.ai

ethz-spylab

AI & ML interests

Security, privacy, and trustworthiness of machine learning systems.

ethz-spylab's collections (3)

The Jailbreak Tax (Jailbreak Utility)

Models and dataset used in the paper "The Jailbreak Tax: How Useful Are Your Jailbreak Outputs?"

ethz-spylab/Llama-3.1-70B-Instruct_refuse_math
Text Generation • Updated Apr 16

ethz-spylab/Llama-3.1-70B-Instruct_refuse_biology
Text Generation • Updated Apr 16

ethz-spylab/Llama-3.1-70B-Instruct_do_math_again
Updated Feb 18

ethz-spylab/Llama-3.1-8B-Instruct_do_bio_again
Updated Mar 7

RLHF Trojan Competition

Datasets and models used for the trojan detection competition co-located with SaTML 2024: https://github.com/ethz-spylab/rlhf_trojan_competition

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Paper • 2404.14461 • Published Apr 22, 2024 • 2

Universal Jailbreak Backdoors from Poisoned Human Feedback
Paper • 2311.14455 • Published Nov 24, 2023 • 1

ethz-spylab/poisoned_generation_trojan1
Text Generation • Updated Apr 29, 2024 • 86 • 5

ethz-spylab/poisoned_generation_trojan2
Text Generation • Updated Apr 29, 2024 • 6 • 1

Models and datasets used for our paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"

Universal Jailbreak Backdoors from Poisoned Human Feedback
Paper • 2311.14455 • Published Nov 24, 2023 • 1

ethz-spylab/poisoned-rlhf-7b-SUDO-10
Text Generation • 7B • Updated Feb 7, 2024 • 231 • 2

ethz-spylab/poisoned-rlhf-7b-SUDO-3-topic
Text Generation • 7B • Updated Feb 7, 2024

ethz-spylab/poisoned-reward-7b-SUDO-03
7B • Updated Feb 7, 2024
