---
license: apache-2.0
datasets:
- Derify/augmented_canonical_druglike_QED_43M
- Derify/druglike
metrics:
- roc_auc
- rmse
library_name: transformers
tags:
- ChemBERTa
- cheminformatics
pipeline_tag: fill-mask
model-index:
- name: Derify/ChemBERTa-druglike
  results:
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BACE
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.8114
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BBBP
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.7399
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: TOX21
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.7522
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: HIV
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.7527
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: SIDER
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.6577
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: CLINTOX
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.9660
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ESOL
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 0.8241
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: FREESOLV
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 0.5350
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: LIPO
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 0.6663
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: BACE
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 1.0105
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: CLEARANCE
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 43.4499
---

# ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES

## Model Description

ChemBERTa-druglike is a RoBERTa-style ChemBERTa model designed for downstream molecular property prediction and embedding-based similarity tasks on drug-like molecules.

## Training Procedure
The model was pretrained with a two-phase curriculum learning strategy that progressively increases the difficulty of the masked-language-modeling task. The first phase uses a simpler dataset with a lower masking probability; the second uses a more complex dataset with a higher masking probability. This lets the model first learn robust representations of drug-like molecules and then adapt to a more challenging objective.

### Phase 1 – “easy” pretraining
- Dataset: [augmented_canonical_druglike_QED_43M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_43M)
- Masking probability: 15%
- Training duration: 9 epochs (stopped once the loss plateaued)
- Training procedure: Following established ChemBERTa and ChemBERTa-2 methodologies

### Phase 2 – “advanced” pretraining
- Dataset: [druglike dataset](https://huggingface.co/datasets/Derify/druglike)
- Masking probability: 40%
- Training duration: until the early-stopping callback triggered (best validation loss at ~18,000 steps). Training beyond this point degraded the Chem-MRL evaluation score.
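The two masking regimes above can be sketched with a toy whole-token masking routine (illustrative only; the actual pretraining uses the Hugging Face MLM data collator, and the tokenization shown here is a simplification):

```python
import random

def mask_tokens(tokens, mask_prob, rng):
    """Randomly replace tokens with <mask> at the given probability."""
    return ["<mask>" if rng.random() < mask_prob else tok for tok in tokens]

rng = random.Random(0)
# A toy SMILES token sequence (real tokenization is model-specific),
# repeated to make the masked fraction stable.
tokens = list("CC(=O)Oc1ccccc1C(=O)O") * 500

easy = mask_tokens(tokens, 0.15, rng)  # phase 1: 15% masking
hard = mask_tokens(tokens, 0.40, rng)  # phase 2: 40% masking

frac_easy = easy.count("<mask>") / len(easy)
frac_hard = hard.count("<mask>") / len(hard)
```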

### Training Configuration
- Optimizer: NVIDIA Apex's FusedAdam optimizer
- Scheduler: Constant with warmup (10% of steps)
- Batch size: 144 sequences
- Precision: mixed-precision (fp16) and tf32 enabled
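For reference, the configuration above maps roughly to the following hyperparameters (a plain-dict summary; the exact `TrainingArguments` used for pretraining are not published, so the field names here are illustrative):

```python
# Illustrative summary of the stated pretraining configuration.
# The real run used NVIDIA Apex's FusedAdam, which is not reproduced here.
training_config = {
    "optimizer": "FusedAdam",             # NVIDIA Apex fused Adam variant
    "lr_scheduler": "constant_with_warmup",
    "warmup_ratio": 0.10,                 # warmup over 10% of total steps
    "per_device_batch_size": 144,         # sequences per batch
    "fp16": True,                         # mixed precision
    "tf32": True,                         # TensorFloat-32 matmuls enabled
}
```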

## Model Objective

This model serves as a specialized backbone for drug-like molecular representation learning, specifically optimized for:
- Molecular similarity tasks
- Drug-like compound analysis
- Chemical space exploration in pharmaceutical contexts
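Molecule embeddings for similarity tasks are typically derived by pooling the encoder's token states; a common choice is attention-mask-weighted mean pooling, sketched below with NumPy (the pooling strategy is an assumption for illustration, not a documented detail of this model):

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings, ignoring padding positions.

    hidden_states:  (seq_len, dim) float array of encoder outputs
    attention_mask: (seq_len,) 0/1 array marking real tokens
    """
    mask = attention_mask[:, None].astype(float)  # (seq_len, 1)
    summed = (hidden_states * mask).sum(axis=0)   # sum over real tokens only
    count = mask.sum()                            # number of real tokens
    return summed / count

# Toy example: 3 tokens of dim 2, last position is padding.
h = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
m = np.array([1, 1, 0])
emb = mean_pool(h, m)  # averages only the first two rows
```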

## Evaluation

The model's effectiveness was validated through downstream Chem-MRL training on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset, measuring Spearman correlation coefficients between transformer embedding similarities and 2048-bit Morgan fingerprint Tanimoto similarities.
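The evaluation metric above pairs two similarity measures: Tanimoto similarity on binary fingerprints, and Spearman rank correlation between the two similarity lists. A minimal pure-Python sketch (the real evaluation uses RDKit 2048-bit Morgan fingerprints and the full Chem-MRL pipeline):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two equal-length 0/1 bit vectors."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 1.0

def spearman(x, y):
    """Spearman rank correlation for lists without ties."""
    n = len(x)
    rank = lambda v: {val: i for i, val in enumerate(sorted(v))}
    rx, ry = rank(x), rank(y)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

fp1 = [1, 1, 0, 1]
fp2 = [1, 0, 0, 1]
sim = tanimoto(fp1, fp2)  # 2 shared on-bits / 3 total on-bits
```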

W&B report on [ChemBERTa-druglike evaluation](https://api.wandb.ai/links/ecortes/afh508m3).

## Benchmarks
### Classification Datasets (ROC AUC - Higher is better)

| Model                     | BACE↑  | BBBP↑  | TOX21↑ | HIV↑   | SIDER↑ | CLINTOX↑ |
| ------------------------- | ------ | ------ | ------ | ------ | ------ | -------- |
| **Tasks**                 | 1      | 1      | 12     | 1      | 27     | 2        |
| Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660   |

### Regression Datasets (RMSE - Lower is better)

| Model                     | ESOL↓  | FREESOLV↓ | LIPO↓  | BACE↓  | CLEARANCE↓ |
| ------------------------- | ------ | --------- | ------ | ------ | ---------- |
| **Tasks**                 | 1      | 1         | 1      | 1      | 1          |
| Derify/ChemBERTa-druglike | 0.8241 | 0.5350    | 0.6663 | 1.0105 | 43.4499    |

Benchmarks were conducted using the [chemberta3](https://github.com/deepforestsci/chemberta3) framework. 
Datasets were split with DeepChem’s scaffold splits and filtered to include only molecules with SMILES length ≤128, matching the model’s maximum input length. 
The ChemBERTa-druglike model was fine-tuned for 100 epochs with a learning rate of 3e-5 and batch size of 32. 
Each task was run with 3 different random seeds, and the mean performance is reported.
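The length filter described above amounts to a simple predicate (assuming plain character length of the SMILES string as the proxy for the 128-token limit):

```python
MAX_LEN = 128  # model's maximum input length

smiles = [
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin, length 21 -> kept
    "C" * 200,                # oversized chain -> dropped
]

kept = [s for s in smiles if len(s) <= MAX_LEN]
```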

## Use Cases

- Molecular property prediction
- Drug discovery and development
- Chemical similarity analysis

## Limitations

- Optimized specifically for drug-like molecules
- Performance may vary on non-drug-like chemical compounds

## References
### ChemBERTa Series
```
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
      title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction}, 
      author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
      year={2020},
      eprint={2010.09885},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2010.09885}, 
}
```
```
@misc{ahmad2022chemberta2chemicalfoundationmodels,
      title={ChemBERTa-2: Towards Chemical Foundation Models}, 
      author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
      year={2022},
      eprint={2209.01712},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2209.01712}, 
}
```
```
@misc{singh2025chemberta3opensource,
  title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models},
  author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others},
  year={2025},
  howpublished={ChemRxiv},
  doi={10.26434/chemrxiv-2025-4glrl-v2},
  note={This content is a preprint and has not been peer-reviewed},
  url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2}
}
```