🤗 Hugging Face | 🤖 ModelScope | 📄 Arxiv
简体中文 | English
🦪 Introduction
Currently, large language models (LLMs) predominantly employ simple refusal mechanisms to prevent generating harmful content. However, outright refusals can lead users to repeatedly attempt to bypass restrictions or migrate to less-regulated platforms, thereby increasing overall risk. To address this, we propose Constructive Safety Alignment (CSA), which not only prevents malicious misuse but also actively guides non-malicious users towards safe and beneficial outcomes. This approach is implemented in Oyster-1 (Oy1). To evaluate CSA, we have developed a dedicated constructive benchmark that encompasses various risk types and user roles, simulating real-world user interactions. Oy1 achieves leading constructive alignment scores in both automated and manual evaluations, effectively rejecting adversarial queries and providing constructive guidance in complex risk scenarios.
🧩 Constructive Safety Alignment (CSA)
The goal of CSA is to go beyond simple refusals:
- Prevent malicious misuse
- Guide non-malicious users towards safety and positivity
Core Technologies
Game-Theoretic Interaction Modeling
- Models the model-user interaction as a hierarchical Stackelberg game.
- The model acts as the leader, determining strategies based on predicted user responses.
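A minimal, purely illustrative sketch of this leader-follower selection is shown below; the strategy set, predicted reactions, and utility weights are hypothetical assumptions, not the paper's actual formulation:

# Illustrative only: toy Stackelberg-style strategy selection.
CANDIDATE_STRATEGIES = ["hard_refusal", "refusal_with_resources", "constructive_guidance"]

def predict_user_reaction(strategy: str) -> str:
    """Follower step: predict how a non-malicious user reacts to each response strategy."""
    return {
        "hard_refusal": "retries_or_migrates",           # bypass attempts or less-regulated platforms
        "refusal_with_resources": "partially_satisfied",
        "constructive_guidance": "accepts_safe_path",
    }[strategy]

def leader_utility(strategy: str, reaction: str) -> float:
    """Leader step: score the induced outcome, balancing safety against helpfulness."""
    safety = {"retries_or_migrates": 0.2, "partially_satisfied": 0.7, "accepts_safe_path": 0.95}[reaction]
    helpfulness = {"hard_refusal": 0.1, "refusal_with_resources": 0.5, "constructive_guidance": 0.9}[strategy]
    return 0.6 * safety + 0.4 * helpfulness

best = max(CANDIDATE_STRATEGIES, key=lambda s: leader_utility(s, predict_user_reaction(s)))
print(best)  # constructive_guidance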
Multidimensional Risk Assessment
- Evaluates various types of risks and dynamically optimizes response strategies.
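A toy example of what such a multidimensional score could look like; the risk dimensions and weights below are assumptions for illustration only:

# Hypothetical risk dimensions scored in [0, 1]; weights are illustrative, not from the paper.
risk = {"physical_harm": 0.9, "illegality": 0.1, "psychological_harm": 0.8, "privacy": 0.0}
weights = {"physical_harm": 0.4, "illegality": 0.3, "psychological_harm": 0.2, "privacy": 0.1}
overall_risk = sum(weights[k] * v for k, v in risk.items())  # 0.55 -> favor a supportive, non-refusal strategy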
Structured Reasoning Chains + Linguistic Backpropagation (Lingo-BP)
- Explicitly decomposes the safety decision into key decision nodes.
- Generates semantic signals from the alignment target and backpropagates them to adjust intermediate judgments.
- Precisely balances safety and usefulness along an interpretable pathway.
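The sketch below loosely illustrates the idea of a structured decision chain revised by natural-language feedback; the node names, chain contents, and revision step are assumptions, not the released Lingo-BP implementation:

from dataclasses import dataclass

@dataclass
class DecisionNode:
    name: str       # e.g. "user_intent", "risk_level", "response_strategy"
    judgment: str   # the model's current intermediate judgment at this node

def lingo_backprop(chain: list[DecisionNode], target_feedback: str) -> list[DecisionNode]:
    """Walk the chain backwards from the alignment target, revising each intermediate
    judgment in light of the semantic feedback (in practice an LLM would rewrite it)."""
    revised = [DecisionNode(n.name, f"{n.judgment} | revised w.r.t.: {target_feedback}")
               for n in reversed(chain)]
    return list(reversed(revised))

chain = [
    DecisionNode("user_intent", "non-malicious, seeking help"),
    DecisionNode("risk_level", "high: self-harm context"),
    DecisionNode("response_strategy", "empathetic support plus resources"),
]
chain = lingo_backprop(chain, "be safe AND constructive; do not refuse")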
Oyster-1 Model Training
- Conducts preference learning based on generated safety reasoning paths.
- Enhances the model's ability for safe and constructive interactions.
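For illustration, a preference pair of the kind this step could consume, where the constructive, reasoning-grounded reply is preferred over a blunt refusal; the field names and response texts are hypothetical:

# Hypothetical preference example for safety-oriented preference learning.
preference_example = {
    "prompt": "I feel like ending it all. Nothing matters anymore.",
    "chosen": "(an empathetic, constructive reply: acknowledges the distress, offers support, "
              "and points to professional help resources)",
    "rejected": "I'm sorry, but I can't help with that.",
}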
📊 Constructive Benchmark
The Constructive Benchmark is a comprehensive dataset designed to evaluate the safety of large language models in real-world conversations. It includes 383 queries sourced from actual interactions and safety tests, covering 3 risk levels, 32 risk categories, and 257 diverse user roles such as professionals, students, and entrepreneurs. Each query offers up to five response options for comparison. This benchmark helps researchers and developers identify subtle risks in complex queries, assess the quality and safety of model responses, and enhance risk mitigation across various scenarios, thereby effectively improving the safety and alignment of conversational AI systems.
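As a rough sketch of what a single benchmark entry could look like; the field names and the example query are assumptions, not the released data format:

# Hypothetical schema for one Constructive Benchmark item.
example_item = {
    "query": "As a high-school chemistry teacher, how should I explain why certain household cleaners must never be mixed?",
    "risk_level": "low",                      # one of 3 risk levels
    "risk_category": "hazardous substances",  # one of 32 risk categories
    "user_role": "teacher",                   # one of 257 user roles
    "responses": [                            # up to five candidate responses per query
        {"text": "...", "label": "constructive"},
        {"text": "...", "label": "over-refusal"},
    ],
}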
🚀 Quick Start
0. Download
# Download the model weights from the Hugging Face Hub
from huggingface_hub import snapshot_download
snapshot_download(repo_id="OysterAI/Oyster_1_Qwen_14B")
or, from the command line:
huggingface-cli download OysterAI/Oyster_1_Qwen_14B
1. Chat with Oyster-1
You can interact with Oyster-1 using just a few lines of code with the `transformers` library. Load the tokenizer and model with `from_pretrained`, then produce a reply with `generate`. Below is an example of how to converse with Oyster-1:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "OysterAI/Oyster_1_Qwen_14B"

# Load the tokenizer and model (weights are placed automatically across available devices)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "I feel like ending it all. Nothing matters anymore."

# Format the request with the chat template shipped with the tokenizer, then generate
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
💡 Expected Output: A compassionate and constructive response offering emotional support and resources, not a refusal.
📚 Citation
If you use Oyster-1 in your research, please cite the following paper:
@article{duan2025oyster,
  title={Oyster-I: Beyond Refusal--Constructive Safety Alignment for Responsible Language Models},
  author={Duan, Ranjie and Liu, Jiexi and Jia, Xiaojun and Zhao, Shiji and Cheng, Ruoxi and Wang, Fengxiang and Wei, Cheng and Xie, Yong and Liu, Chang and Li, Defeng and others},
  journal={arXiv preprint arXiv:2509.01909},
  year={2025}
}
🤝 Contributing
We welcome collaboration and discussions in the area of safety alignment:
- Submit Issues to report problems
- Submit Pull Requests to improve the model or evaluations
- Share ideas in Discussions
📄 License
This project is licensed under the Apache 2.0 License.
🙏 Acknowledgements
We thank the open-source community and the researchers advancing AI safety. Oyster-1 is part of Alibaba AAIG's commitment to responsible AI.
The world is your oyster. Let's build AI that helps everyone find the pearl within.