---
license: apache-2.0
datasets:
- Derify/augmented_canonical_druglike_QED_43M
- Derify/druglike
metrics:
- roc_auc
- rmse
library_name: transformers
tags:
- ChemBERTa
- cheminformatics
pipeline_tag: fill-mask
model-index:
- name: Derify/ChemBERTa-druglike
results:
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: BACE
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.8114
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: BBBP
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.7399
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: TOX21
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.7522
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: HIV
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.7527
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: SIDER
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.6577
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: CLINTOX
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.9660
- task:
type: regression
name: Regression (RMSE)
dataset:
name: ESOL
type: Derify/druglike
metrics:
- type: rmse
value: 0.8241
- task:
type: regression
name: Regression (RMSE)
dataset:
name: FREESOLV
type: Derify/druglike
metrics:
- type: rmse
value: 0.5350
- task:
type: regression
name: Regression (RMSE)
dataset:
name: LIPO
type: Derify/druglike
metrics:
- type: rmse
value: 0.6663
- task:
type: regression
name: Regression (RMSE)
dataset:
name: BACE
type: Derify/druglike
metrics:
- type: rmse
value: 1.0105
- task:
type: regression
name: Regression (RMSE)
dataset:
name: CLEARANCE
type: Derify/druglike
metrics:
- type: rmse
value: 43.4499
---
# ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES
## Model Description
ChemBERTa-druglike is a ChemBERTa-style masked-language model pretrained on drug-like molecules, intended as a backbone for downstream molecular property prediction and embedding-based similarity tasks.
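As a minimal usage sketch (not an official snippet from the model authors), molecule embeddings for similarity tasks can be extracted by mean-pooling the final hidden states over non-padding tokens. Mean pooling is one common choice and is an assumption here, not necessarily the pooling used in the Chem-MRL pipeline; `max_length=128` follows the benchmark setup described below.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts


def embed_smiles(smiles, model_name="Derify/ChemBERTa-druglike"):
    """Embed a batch of SMILES strings (downloads weights on first call)."""
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    enc = tok(list(smiles), padding=True, truncation=True,
              max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return mean_pool(out.last_hidden_state, enc["attention_mask"])
```

The resulting vectors can then be compared with cosine similarity for nearest-neighbor search over a compound library.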
## Training Procedure
The model was pretrained with a two-phase curriculum that progressively increases the difficulty of the masked-language-modeling objective. The first phase uses a simpler dataset with a lower masking probability; the second uses a more complex dataset with a higher masking probability. This lets the model first learn robust representations of drug-like molecules and then adapt to the harder task.
### Phase 1 – “easy” pretraining
- Dataset: [augmented_canonical_druglike_QED_43M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_43M)
- Masking probability: 15%
- Training duration: 9 epochs (chosen due to loss plateauing)
- Training procedure: Following established ChemBERTa and ChemBERTa-2 methodologies
### Phase 2 – “advanced” pretraining
- Dataset: [druglike dataset](https://huggingface.co/datasets/Derify/druglike)
- Masking probability: 40%
- Training duration: Until the early stopping callback triggered (best validation loss at ~18,000 steps). Further training negatively impacted the Chem-MRL evaluation score.
### Training Configuration
- Optimizer: NVIDIA Apex's FusedAdam optimizer
- Scheduler: Constant with warmup (10% of steps)
- Batch size: 144 sequences
- Precision: mixed-precision (fp16) and tf32 enabled
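The constant-with-warmup schedule has the same shape as `transformers`' `get_constant_schedule_with_warmup`: a linear ramp over the first 10% of steps, then a flat learning rate. The base learning rate for pretraining is not stated on this card, so it is a parameter in this sketch:

```python
def constant_with_warmup(step, total_steps, base_lr, warmup_frac=0.1):
    """Linear warmup over the first `warmup_frac` of steps, then constant."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```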
## Model Objective
This model serves as a specialized backbone for drug-like molecular representation learning, specifically optimized for:
- Molecular similarity tasks
- Drug-like compound analysis
- Chemical space exploration in pharmaceutical contexts
## Evaluation
The model's effectiveness was validated through downstream Chem-MRL training on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset, measuring Spearman correlation coefficients between transformer embedding similarities and 2048-bit Morgan fingerprint Tanimoto similarities.
See the [W&B report on the ChemBERTa-druglike evaluation](https://api.wandb.ai/links/ecortes/afh508m3) for details.
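The two ingredients of this evaluation can be sketched in plain Python: Tanimoto similarity over fingerprints represented as sets of on-bit indices, and Spearman correlation computed as the Pearson correlation of ranks (no tie correction). A production setup would instead use RDKit Morgan fingerprints and `scipy.stats.spearmanr`; this sketch just shows the arithmetic.

```python
import math


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks (ties ignored)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In the evaluation, `x` would be the embedding cosine similarities for molecule pairs and `y` the corresponding Tanimoto similarities; a high Spearman coefficient means the embedding space preserves the fingerprint similarity ordering.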
## Benchmarks
### Classification Datasets (ROC AUC - Higher is better)
| Model | BACE↑ | BBBP↑ | TOX21↑ | HIV↑ | SIDER↑ | CLINTOX↑ |
| ------------------------- | ------ | ------ | ------ | ------ | ------ | -------- |
| **Tasks** | 1 | 1 | 12 | 1 | 27 | 2 |
| Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660 |
### Regression Datasets (RMSE - Lower is better)
| Model | ESOL↓ | FREESOLV↓ | LIPO↓ | BACE↓ | CLEARANCE↓ |
| ------------------------- | ------ | --------- | ------ | ------ | ---------- |
| **Tasks** | 1 | 1 | 1 | 1 | 1 |
| Derify/ChemBERTa-druglike | 0.8241 | 0.5350 | 0.6663 | 1.0105 | 43.4499 |
Benchmarks were conducted using the [chemberta3](https://github.com/deepforestsci/chemberta3) framework.
Datasets were split with DeepChem’s scaffold splits and filtered to include only molecules with SMILES length ≤128, matching the model’s maximum input length.
The ChemBERTa-druglike model was fine-tuned for 100 epochs with a learning rate of 3e-5 and batch size of 32.
Each task was run with 3 different random seeds, and the mean performance is reported.
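The length filter described above amounts to dropping any benchmark molecule whose SMILES string exceeds the model's 128-token input limit; a minimal sketch (record layout is illustrative):

```python
def filter_by_smiles_length(records, max_len=128):
    """Keep only (smiles, label) pairs whose SMILES string fits within the
    model's maximum input length, as done for the benchmarks above."""
    return [(smiles, label) for smiles, label in records if len(smiles) <= max_len]
```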
## Use Cases
- Molecular property prediction
- Drug discovery and development
- Chemical similarity analysis
## Limitations
- Optimized specifically for drug-like molecules
- Performance may vary on non-drug-like chemical compounds
## References
### ChemBERTa Series
```
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
year={2020},
eprint={2010.09885},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2010.09885},
}
```
```
@misc{ahmad2022chemberta2chemicalfoundationmodels,
title={ChemBERTa-2: Towards Chemical Foundation Models},
author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
year={2022},
eprint={2209.01712},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2209.01712},
}
```
```
@misc{singh2025chemberta3opensource,
title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models},
author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others},
year={2025},
howpublished={ChemRxiv},
doi={10.26434/chemrxiv-2025-4glrl-v2},
note={This content is a preprint and has not been peer-reviewed},
url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2}
}
``` |