internlm/POLAR-1_8B-Base

RowitZou committed on Jul 8

Commit a99151f · verified · 1 Parent(s): f88c870

Update README_zh-CN.md

Files changed (1)

README_zh-CN.md +8 -3

@@ -11,7 +11,7 @@
 [💻 Github](https://github.com/InternLM/POLAR) |
-[📜 论文](https://arxiv.org/abs/xxxxxx)<br>
+[📜 论文](https://arxiv.org/abs/2507.05197)<br>
 [English](./README.md) |
 [简体中文](./README_zh-CN.md)
@@ -37,7 +37,7 @@ POLAR 是一个经过大规模预训练的奖励模型，在训练范式和模
 **POLAR-1.8B-Base** 是仅经过预训练阶段的权重，适合根据特定需求进行微调。**POLAR-1.8B** 是经过偏好微调的奖励模型，可开箱即用，适用于大部分通用场景。
-我们通过 Proximal Policy Optimization（PPO）算法对 POLAR 的使用效果进行了验证，评测了四种语言模型的下游强化学习性能，评测工具是 [OpenCompass](https://github.com/internLM/OpenCompass/) 。详细信息请参阅[论文](https://arxiv.org/abs/xxxxxx)。
+我们通过 Proximal Policy Optimization（PPO）算法对 POLAR 的使用效果进行了验证，评测了四种语言模型的下游强化学习性能，评测工具是 [OpenCompass](https://github.com/internLM/OpenCompass/) 。详细信息请参阅[论文](https://arxiv.org/abs/2507.05197)。
 <img src="./misc/result.png"/><br>
@@ -382,5 +382,10 @@ Reward: -7.23046875
 # 引用
 ```
-TBC
+@article{dou2025pretrained,
+  title={Pre-Trained Policy Discriminators are General Reward Models},
+  author={Dou, Shihan and Liu, Shichun and Yang, Yuming and Zou, Yicheng and Zhou, Yunhua and Xing, Shuhao and Huang, Chenhao and Ge, Qiming and Song, Demin and Lv, Haijun and others},
+  journal={arXiv preprint arXiv:2507.05197},
+  year={2025}
+}
 ```