---
license: apache-2.0
datasets:
- Derify/augmented_canonical_druglike_QED_43M
- Derify/druglike
metrics:
- roc_auc
- rmse
library_name: transformers
tags:
- ChemBERTa
- cheminformatics
pipeline_tag: fill-mask
model-index:
- name: Derify/ChemBERTa-druglike
  results:
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BACE
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.8114
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BBBP
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.7399
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: TOX21
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.7522
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: HIV
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.7527
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: SIDER
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.6577
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: CLINTOX
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.966
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ESOL
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 0.8241
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: FREESOLV
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 0.535
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: LIPO
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 0.6663
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: BACE
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 1.0105
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: CLEARANCE
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 43.4499
---
# ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES
## Model Description

ChemBERTa-druglike is a ChemBERTa model designed for downstream molecular property prediction and embedding-based similarity tasks on drug-like molecules.
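A minimal usage sketch, assuming the standard `transformers` fill-mask loading pattern; the mean-pooling step for embeddings is an illustrative assumption, not necessarily the pooling used in the Chem-MRL evaluation.

```python
import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer, pipeline

# Fill-mask on a SMILES string (the model's pipeline tag is fill-mask).
tokenizer = AutoTokenizer.from_pretrained("Derify/ChemBERTa-druglike")
mlm = AutoModelForMaskedLM.from_pretrained("Derify/ChemBERTa-druglike")
fill = pipeline("fill-mask", model=mlm, tokenizer=tokenizer)
print(fill(f"c1ccccc1{tokenizer.mask_token}"))  # predict the masked SMILES token

# Molecule embedding: mean-pool the last hidden states over non-padding tokens.
encoder = AutoModel.from_pretrained("Derify/ChemBERTa-druglike")
inputs = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")  # aspirin
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, dim)
mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, dim)
```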
## Training Procedure

The model was pretrained with a two-phase curriculum that progressively increases the difficulty of the masked-language-modeling objective: the first phase uses a simpler dataset with a lower masking probability, and the second phase uses a more complex dataset with a higher masking probability. This lets the model first learn robust representations of drug-like molecules and then adapt to a more challenging reconstruction task (a masking-configuration sketch follows the phase summaries below).
### Phase 1 – “easy” pretraining

- Dataset: augmented_canonical_druglike_QED_43M
- Masking probability: 15%
- Training duration: 9 epochs (stopped once the training loss plateaued)
- Training procedure: follows the established ChemBERTa and ChemBERTa-2 methodologies
### Phase 2 – “advanced” pretraining

- Dataset: druglike dataset
- Masking probability: 40%
- Training duration: until the early-stopping callback triggered (best validation loss at ~18,000 steps); further training degraded the downstream Chem-MRL evaluation score
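The phase-specific masking probabilities map directly onto the masked-LM data collator in `transformers`; the sketch below shows only that configuration (dataset loading and the training loop are omitted).

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Derify/ChemBERTa-druglike")

# Phase 1: standard 15% token masking on the augmented QED dataset.
phase1_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Phase 2: harder objective, 40% token masking on the druglike dataset.
phase2_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.40
)
```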
### Training Configuration

- Optimizer: NVIDIA Apex FusedAdam
- Scheduler: constant with warmup (warmup over the first 10% of steps)
- Batch size: 144 sequences
- Precision: mixed precision (fp16) with TF32 enabled
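Expressed as `transformers` `TrainingArguments`, the reported settings would look roughly like the sketch below; the output directory is a placeholder, and any argument not listed above is an assumption.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="chemberta-druglike-pretrain",  # placeholder path
    per_device_train_batch_size=144,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.10,          # warmup over the first 10% of steps
    fp16=True,                  # mixed-precision training
    tf32=True,                  # TF32 matmuls on Ampere+ GPUs
    optim="adamw_apex_fused",   # NVIDIA Apex FusedAdam
)
```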
## Model Objective
This model serves as a specialized backbone for drug-like molecular representation learning, specifically optimized for:
- Molecular similarity tasks
- Drug-like compound analysis
- Chemical space exploration in pharmaceutical contexts
## Evaluation
The model's effectiveness was validated through downstream Chem-MRL training on the pubchem_10m_genmol_similarity dataset, measuring Spearman correlation coefficients between transformer embedding similarities and 2048-bit Morgan fingerprint Tanimoto similarities.
See the W&B report on the ChemBERTa-druglike evaluation for details.
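A sketch of this evaluation metric, assuming mean-pooled embeddings compared with cosine similarity and RDKit Morgan fingerprints; the actual Chem-MRL pipeline may differ in pooling and pairing details.

```python
import numpy as np
import torch
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Derify/ChemBERTa-druglike")
model = AutoModel.from_pretrained("Derify/ChemBERTa-druglike")

def embed(smiles: str) -> np.ndarray:
    """Mean-pooled embedding of one SMILES string (pooling is an assumption)."""
    inputs = tokenizer(smiles, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).squeeze(0).numpy()

def spearman_vs_tanimoto(pairs):
    """Spearman correlation of embedding cosine similarity vs. Morgan/Tanimoto."""
    cos, tan = [], []
    for a, b in pairs:
        ea, eb = embed(a), embed(b)
        cos.append(float(ea @ eb) / (np.linalg.norm(ea) * np.linalg.norm(eb)))
        fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(a), 2, nBits=2048)
        fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(b), 2, nBits=2048)
        tan.append(DataStructs.TanimotoSimilarity(fp_a, fp_b))
    return spearmanr(cos, tan).correlation
```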
## Benchmarks
### Classification Datasets (ROC AUC - Higher is better)
| Model | BACE↑ | BBBP↑ | TOX21↑ | HIV↑ | SIDER↑ | CLINTOX↑ |
|---|---|---|---|---|---|---|
| Tasks | 1 | 1 | 12 | 1 | 27 | 2 |
| Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660 |
### Regression Datasets (RMSE - Lower is better)
| Model | ESOL↓ | FREESOLV↓ | LIPO↓ | BACE↓ | CLEARANCE↓ |
|---|---|---|---|---|---|
| Tasks | 1 | 1 | 1 | 1 | 1 |
| Derify/ChemBERTa-druglike | 0.8241 | 0.5350 | 0.6663 | 1.0105 | 43.4499 |
Benchmarks were conducted with the chemberta3 framework. Datasets were split using DeepChem’s scaffold splitter and filtered to molecules whose SMILES are at most 128 characters long, matching the model’s maximum input length. ChemBERTa-druglike was fine-tuned for 100 epochs with a learning rate of 3e-5 and a batch size of 32. Each task was run with three random seeds, and the mean performance is reported.
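A sketch of that preprocessing using DeepChem’s MolNet loaders; the BACE example and the loader arguments are illustrative assumptions, not the exact benchmark harness.

```python
import deepchem as dc

# Scaffold-split a MolNet dataset (BACE classification shown as an example).
tasks, (train, valid, test), transformers = dc.molnet.load_bace_classification(
    featurizer="Raw", splitter="scaffold"
)

# Keep only molecules whose SMILES fit the model's 128-token context window;
# in MolNet datasets, .ids holds the SMILES strings.
train_smiles = [s for s in train.ids if len(s) <= 128]
```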
## Use Cases
- Molecular property prediction
- Drug discovery and development
- Chemical similarity analysis
## Limitations
- Optimized specifically for drug-like molecules
- Performance may vary on non-drug-like chemical compounds
## References

### ChemBERTa Series
```bibtex
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
  title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
  author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
  year={2020},
  eprint={2010.09885},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2010.09885},
}

@misc{ahmad2022chemberta2chemicalfoundationmodels,
  title={ChemBERTa-2: Towards Chemical Foundation Models},
  author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
  year={2022},
  eprint={2209.01712},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2209.01712},
}

@misc{singh2025chemberta3opensource,
  title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models},
  author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others},
  year={2025},
  howpublished={ChemRxiv},
  doi={10.26434/chemrxiv-2025-4glrl-v2},
  note={This content is a preprint and has not been peer-reviewed},
  url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2}
}
```