Add metadata and link to paper and GitHub repo
This PR adds the `library_name` and `pipeline_tag` metadata to the model card, as well as a link to the paper [Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification](https://huggingface.co/papers/2502.14133) and the GitHub repository.
README.md CHANGED

```diff
@@ -1,3 +1,14 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: feature-extraction
 ---
+
+# SelfReg: Regularizing LLM-based Classifiers on Unintended Features
+
+This repository contains the files related to the model presented in the paper [Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification](https://huggingface.co/papers/2502.14133).
+The official implementation can be found at [wuxs/SelfReg](https://github.com/wuxs/SelfReg).
+
+### Introduction
+
+This is the official implementation of the paper [Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification](https://arxiv.org/abs/2502.14133) (accepted by KDD 2025). In the paper, we introduce a novel regularization strategy that restricts the use of "unintended features" by LLM-based text classifiers, where the unintended features can be sensitive attributes (for privacy/fairness purposes) or shortcut patterns (for generalizability). We evaluate our proposed method on three real-world datasets, namely "ToxicChat", "RewardBench", and "Dxy". We use `Mistral-7B-inst-v0.2` as our backbone LLM and pre-train our SAEs for it with 113 million tokens over 5 epochs.
```
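To illustrate the idea behind the regularization (not the repository's actual code), the sketch below shows how a pre-trained sparse autoencoder's encoder could score how much a classifier's hidden states activate a set of flagged "unintended" features; that score would then be added as a penalty to the task loss. All names (`sae_encode`, `unintended_feature_penalty`, the weight shapes) are illustrative assumptions.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    # ReLU encoder of a (pre-trained) sparse autoencoder: maps hidden
    # states h of shape (batch, d_model) to sparse feature activations
    # of shape (batch, n_features).
    return np.maximum(h @ W_enc + b_enc, 0.0)

def unintended_feature_penalty(h, W_enc, b_enc, unintended_idx):
    # Mean activation mass on the flagged "unintended" SAE features.
    # In training this term would be weighted and added to the
    # classification loss to discourage reliance on those features.
    feats = sae_encode(h, W_enc, b_enc)
    return float(feats[:, unintended_idx].mean())
```

Because the encoder is ReLU-gated, the penalty is non-negative and vanishes exactly when the flagged features are never activated, so minimizing it steers the classifier's representations away from those directions without touching the rest of the feature space.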