|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Derify/augmented_canonical_druglike_QED_43M |
|
- Derify/druglike |
|
metrics: |
|
- roc_auc |
|
- rmse |
|
library_name: transformers |
|
tags: |
|
- ChemBERTa |
|
- cheminformatics |
|
pipeline_tag: fill-mask |
|
model-index: |
|
- name: Derify/ChemBERTa-druglike |
|
results: |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: BACE |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8114 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: BBBP |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7399 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: TOX21 |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7522 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: HIV |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7527 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: SIDER |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.6577 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: CLINTOX |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.9660 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ESOL |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 0.8241 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: FREESOLV |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 0.5350 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: LIPO |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 0.6663 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: BACE |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 1.0105 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: CLEARANCE |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 43.4499 |
|
--- |
|
|
|
# ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES |
|
|
|
## Model Description |
|
|
|
ChemBERTa-druglike is a ChemBERTa-style masked-language model pretrained on drug-like SMILES, intended for downstream molecular property prediction and embedding-based similarity tasks on drug-like molecules.
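
A quick-start sketch using the 🤗 `pipeline` API (this assumes the checkpoint is published under `Derify/ChemBERTa-druglike` with a standard fill-mask head; the exact mask-token syntax depends on the tokenizer, so we read it from `fill.tokenizer` rather than hard-coding it):

```python
from transformers import pipeline

# Load the checkpoint as a fill-mask pipeline (downloads weights on first use).
fill = pipeline("fill-mask", model="Derify/ChemBERTa-druglike")

# Mask one position of a SMILES string and ask the model to recover it.
masked = f"c1ccccc1{fill.tokenizer.mask_token}"  # benzene ring with one masked token
preds = fill(masked)

for pred in preds:
    print(pred["token_str"], round(pred["score"], 4))
```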
|
|
|
## Training Procedure |
|
The model was pretrained with a two-phase curriculum learning strategy in which the difficulty of the pretraining task increases over time: the first phase uses a simpler dataset with a lower masking probability, and the second phase a more complex dataset with a higher masking probability. This lets the model learn robust representations of drug-like molecules while gradually adapting to a harder objective.
|
|
|
### Phase 1 – “easy” pretraining |
|
- Dataset: [augmented_canonical_druglike_QED_43M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_43M) |
|
- Masking probability: 15% |
|
- Training duration: 9 epochs (training stopped once the loss plateaued)
|
- Training procedure: Following established ChemBERTa and ChemBERTa-2 methodologies |
|
|
|
### Phase 2 – “advanced” pretraining |
|
- Dataset: [druglike dataset](https://huggingface.co/datasets/Derify/druglike) |
|
- Masking probability: 40%
|
- Training duration: Until the early-stopping callback triggered (best validation loss at ~18,000 steps). Further training degraded the Chem-MRL evaluation score.
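
The effect of the two masking probabilities can be illustrated with a simplified BERT-style masking routine. This is an illustrative sketch, not the exact data collator used for training; real training operates on tokenizer IDs and the full vocabulary, but follows the same select-then-replace (80% `[MASK]`, 10% random, 10% unchanged) rule:

```python
import random

def mask_tokens(tokens, mask_prob, vocab, seed=0):
    """Return (masked_tokens, labels): each token is selected with probability
    mask_prob; selected tokens become [MASK] 80% of the time, a random vocab
    token 10% of the time, and stay unchanged 10% of the time."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # this position enters the MLM loss
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            labels.append(None)           # ignored by the MLM loss
            masked.append(tok)
    return masked, labels

# Phase 1 masks ~15% of tokens; phase 2 raises this to ~40%.
tokens = list("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, character-level for illustration
vocab = ["C", "O", "N", "c", "1", "(", ")", "="]
easy, _ = mask_tokens(tokens, 0.15, vocab)
hard, _ = mask_tokens(tokens, 0.40, vocab)
```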
|
|
|
### Training Configuration |
|
- Optimizer: NVIDIA Apex's FusedAdam optimizer |
|
- Scheduler: Constant with warmup (10% of steps) |
|
- Batch size: 144 sequences |
|
- Precision: mixed precision (fp16) with TF32 enabled
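
The constant-with-warmup schedule ramps the learning rate linearly from zero over the first 10% of steps and then holds it constant. A minimal sketch (the actual run used the equivalent scheduler from `transformers`; the base learning rate here is a placeholder, not a stated pretraining hyperparameter):

```python
def constant_with_warmup(step, total_steps, base_lr, warmup_frac=0.10):
    """Linear warmup over the first warmup_frac of steps, then constant."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

# Example: 10,000 total steps -> 1,000 warmup steps.
lrs = [constant_with_warmup(s, 10_000, 3e-5) for s in (0, 500, 1_000, 9_999)]
```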
|
|
|
## Model Objective |
|
|
|
This model serves as a specialized backbone for drug-like molecular representation learning, specifically optimized for: |
|
- Molecular similarity tasks |
|
- Drug-like compound analysis |
|
- Chemical space exploration in pharmaceutical contexts |
|
|
|
## Evaluation |
|
|
|
The model's effectiveness was validated through downstream Chem-MRL training on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset, measuring Spearman correlation coefficients between transformer embedding similarities and 2048-bit Morgan fingerprint Tanimoto similarities. |
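
The metric can be sketched as follows: for pairs of molecules, compare the similarity of transformer embeddings against the Tanimoto similarity of fingerprints via Spearman correlation. This is a pure-Python illustration with toy similarity values and fingerprints as sets of on-bit indices; the real evaluation uses 2048-bit Morgan fingerprints from a cheminformatics toolkit:

```python
def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints (sets of on-bit indices)."""
    a, b = set(a), set(b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if (a or b) else 1.0

def spearman(x, y):
    """Spearman rank correlation (ties not handled, for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Toy check: perfectly monotone similarity lists give correlation 1.0.
embed_sims = [0.91, 0.40, 0.75, 0.10]   # cosine similarities of embeddings
fp_sims    = [0.85, 0.30, 0.60, 0.05]   # Tanimoto similarities of fingerprints
rho = spearman(embed_sims, fp_sims)
```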
|
|
|
A W&B report on the [ChemBERTa-druglike evaluation](https://api.wandb.ai/links/ecortes/afh508m3) is available.
|
|
|
## Benchmarks |
|
### Classification Datasets (ROC AUC - Higher is better) |
|
|
|
| Model | BACE↑ | BBBP↑ | TOX21↑ | HIV↑ | SIDER↑ | CLINTOX↑ | |
|
| ------------------------- | ------ | ------ | ------ | ------ | ------ | -------- | |
|
| **Tasks** | 1 | 1 | 12 | 1 | 27 | 2 | |
|
| Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660 | |
|
|
|
### Regression Datasets (RMSE - Lower is better) |
|
|
|
| Model | ESOL↓ | FREESOLV↓ | LIPO↓ | BACE↓ | CLEARANCE↓ | |
|
| ------------------------- | ------ | --------- | ------ | ------ | ---------- | |
|
| **Tasks** | 1 | 1 | 1 | 1 | 1 | |
|
| Derify/ChemBERTa-druglike | 0.8241 | 0.5350 | 0.6663 | 1.0105 | 43.4499 | |
|
|
|
Benchmarks were conducted using the [chemberta3](https://github.com/deepforestsci/chemberta3) framework. |
|
Datasets were split with DeepChem’s scaffold splits and filtered to include only molecules with SMILES length ≤128, matching the model’s maximum input length. |
|
The ChemBERTa-druglike model was fine-tuned for 100 epochs with a learning rate of 3e-5 and a batch size of 32.
|
Each task was run with 3 different random seeds, and the mean performance is reported. |
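
The preprocessing and aggregation described above amount to the following (an illustrative sketch; dataset loading, scaffold splitting, and fine-tuning itself are handled by the benchmark framework):

```python
from statistics import mean

MAX_LEN = 128  # the model's maximum input length

def filter_by_smiles_length(records, max_len=MAX_LEN):
    """Keep only molecules whose SMILES string fits the model's context window."""
    return [r for r in records if len(r["smiles"]) <= max_len]

def aggregate_seeds(scores_per_seed):
    """Report the mean metric over the runs (3 random seeds in this card)."""
    return mean(scores_per_seed)

data = [{"smiles": "CCO"}, {"smiles": "C" * 200}]
kept = filter_by_smiles_length(data)          # drops the over-long molecule
task_score = aggregate_seeds([0.80, 0.81, 0.82])
```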
|
|
|
## Use Cases |
|
|
|
- Molecular property prediction |
|
- Drug discovery and development |
|
- Chemical similarity analysis |
|
|
|
## Limitations |
|
|
|
- Optimized specifically for drug-like molecules |
|
- Performance may vary on non-drug-like chemical compounds |
|
|
|
## References |
|
### ChemBERTa Series |
|
``` |
|
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining, |
|
title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction}, |
|
author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar}, |
|
year={2020}, |
|
eprint={2010.09885}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG}, |
|
url={https://arxiv.org/abs/2010.09885}, |
|
} |
|
``` |
|
``` |
|
@misc{ahmad2022chemberta2chemicalfoundationmodels, |
|
title={ChemBERTa-2: Towards Chemical Foundation Models}, |
|
author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar}, |
|
year={2022}, |
|
eprint={2209.01712}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG}, |
|
url={https://arxiv.org/abs/2209.01712}, |
|
} |
|
``` |
|
``` |
|
@misc{singh2025chemberta3opensource, |
|
title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models}, |
|
author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others}, |
|
year={2025}, |
|
howpublished={ChemRxiv}, |
|
doi={10.26434/chemrxiv-2025-4glrl-v2}, |
|
note={This content is a preprint and has not been peer-reviewed}, |
|
url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2} |
|
} |
|
``` |