---
license: apache-2.0
datasets:
- Derify/augmented_canonical_druglike_QED_43M
- Derify/druglike
metrics:
- roc_auc
- rmse
library_name: transformers
tags:
- ChemBERTa
- cheminformatics
pipeline_tag: fill-mask
model-index:
- name: Derify/ChemBERTa-druglike
results:
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: BACE
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.8114
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: BBBP
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.7399
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: TOX21
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.7522
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: HIV
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.7527
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: SIDER
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.6577
- task:
type: text-classification
name: Classification (ROC AUC)
dataset:
name: CLINTOX
type: Derify/druglike
metrics:
- type: roc_auc
value: 0.9660
- task:
type: regression
name: Regression (RMSE)
dataset:
name: ESOL
type: Derify/druglike
metrics:
- type: rmse
value: 0.8241
- task:
type: regression
name: Regression (RMSE)
dataset:
name: FREESOLV
type: Derify/druglike
metrics:
- type: rmse
value: 0.5350
- task:
type: regression
name: Regression (RMSE)
dataset:
name: LIPO
type: Derify/druglike
metrics:
- type: rmse
value: 0.6663
- task:
type: regression
name: Regression (RMSE)
dataset:
name: BACE
type: Derify/druglike
metrics:
- type: rmse
value: 1.0105
- task:
type: regression
name: Regression (RMSE)
dataset:
name: CLEARANCE
type: Derify/druglike
metrics:
- type: rmse
value: 43.4499
---
# ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES
## Model Description
ChemBERTa-druglike is a ChemBERTa-style masked-language model pretrained on drug-like molecules, intended as a backbone for downstream molecular property prediction and embedding-based similarity tasks.
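As a minimal usage sketch (not an official snippet from the model authors), molecule embeddings for similarity tasks can be extracted by mean-pooling the final hidden states over non-padding tokens. Mean pooling is one common choice and is an assumption here, not necessarily the pooling used in the Chem-MRL pipeline; `max_length=128` follows the benchmark setup described below.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts


def embed_smiles(smiles, model_name="Derify/ChemBERTa-druglike"):
    """Embed a batch of SMILES strings (downloads weights on first call)."""
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    enc = tok(list(smiles), padding=True, truncation=True,
              max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return mean_pool(out.last_hidden_state, enc["attention_mask"])
```

The resulting vectors can then be compared with cosine similarity for nearest-neighbor search over a compound library.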
## Training Procedure
The model was pretrained with a two-phase curriculum that progressively increases the difficulty of the masked-language-modeling objective. The first phase uses a simpler dataset with a lower masking probability; the second uses a more complex dataset with a higher masking probability. This lets the model first learn robust representations of drug-like molecules and then adapt to the harder task.
### Phase 1 – “easy” pretraining
- Dataset: [augmented_canonical_druglike_QED_43M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_43M)
- Masking probability: 15%
- Training duration: 9 epochs (chosen due to loss plateauing)
- Training procedure: Following established ChemBERTa and ChemBERTa-2 methodologies
### Phase 2 – “advanced” pretraining
- Dataset: [druglike dataset](https://huggingface.co/datasets/Derify/druglike)
- Masking probability: 40%
- Training duration: Until the early stopping callback triggered (best validation loss at ~18,000 steps). Further training negatively impacted the Chem-MRL evaluation score.
### Training Configuration
- Optimizer: NVIDIA Apex's FusedAdam optimizer
- Scheduler: Constant with warmup (10% of steps)
- Batch size: 144 sequences
- Precision: mixed-precision (fp16) and tf32 enabled
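The constant-with-warmup schedule has the same shape as `transformers`' `get_constant_schedule_with_warmup`: a linear ramp over the first 10% of steps, then a flat learning rate. The base learning rate for pretraining is not stated on this card, so it is a parameter in this sketch:

```python
def constant_with_warmup(step, total_steps, base_lr, warmup_frac=0.1):
    """Linear warmup over the first `warmup_frac` of steps, then constant."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```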
## Model Objective
This model serves as a specialized backbone for drug-like molecular representation learning, specifically optimized for:
- Molecular similarity tasks
- Drug-like compound analysis
- Chemical space exploration in pharmaceutical contexts
## Evaluation
The model's effectiveness was validated through downstream Chem-MRL training on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset, measuring Spearman correlation coefficients between transformer embedding similarities and 2048-bit Morgan fingerprint Tanimoto similarities.
See the [W&B report on the ChemBERTa-druglike evaluation](https://api.wandb.ai/links/ecortes/afh508m3) for details.
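The two ingredients of this evaluation can be sketched in plain Python: Tanimoto similarity over fingerprints represented as sets of on-bit indices, and Spearman correlation computed as the Pearson correlation of ranks (no tie correction). A production setup would instead use RDKit Morgan fingerprints and `scipy.stats.spearmanr`; this sketch just shows the arithmetic.

```python
import math


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks (ties ignored)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In the evaluation, `x` would be the embedding cosine similarities for molecule pairs and `y` the corresponding Tanimoto similarities; a high Spearman coefficient means the embedding space preserves the fingerprint similarity ordering.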
## Benchmarks
### Classification Datasets (ROC AUC - Higher is better)
| Model | BACE↑ | BBBP↑ | TOX21↑ | HIV↑ | SIDER↑ | CLINTOX↑ |
| ------------------------- | ------ | ------ | ------ | ------ | ------ | -------- |
| **Tasks** | 1 | 1 | 12 | 1 | 27 | 2 |
| Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660 |
### Regression Datasets (RMSE - Lower is better)
| Model | ESOL↓ | FREESOLV↓ | LIPO↓ | BACE↓ | CLEARANCE↓ |
| ------------------------- | ------ | --------- | ------ | ------ | ---------- |
| **Tasks** | 1 | 1 | 1 | 1 | 1 |
| Derify/ChemBERTa-druglike | 0.8241 | 0.5350 | 0.6663 | 1.0105 | 43.4499 |
Benchmarks were conducted using the [chemberta3](https://github.com/deepforestsci/chemberta3) framework.
Datasets were split with DeepChem’s scaffold splits and filtered to include only molecules with SMILES length ≤128, matching the model’s maximum input length.
The ChemBERTa-druglike model was fine-tuned for 100 epochs with a learning rate of 3e-5 and batch size of 32.
Each task was run with 3 different random seeds, and the mean performance is reported.
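The length filter described above amounts to dropping any benchmark molecule whose SMILES string exceeds the model's 128-token input limit; a minimal sketch (record layout is illustrative):

```python
def filter_by_smiles_length(records, max_len=128):
    """Keep only (smiles, label) pairs whose SMILES string fits within the
    model's maximum input length, as done for the benchmarks above."""
    return [(smiles, label) for smiles, label in records if len(smiles) <= max_len]
```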
## Use Cases
- Molecular property prediction
- Drug discovery and development
- Chemical similarity analysis
## Limitations
- Optimized specifically for drug-like molecules
- Performance may vary on non-drug-like chemical compounds
## References
### ChemBERTa Series
```
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
year={2020},
eprint={2010.09885},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2010.09885},
}
```
```
@misc{ahmad2022chemberta2chemicalfoundationmodels,
title={ChemBERTa-2: Towards Chemical Foundation Models},
author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
year={2022},
eprint={2209.01712},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2209.01712},
}
```
```
@misc{singh2025chemberta3opensource,
title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models},
author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others},
year={2025},
howpublished={ChemRxiv},
doi={10.26434/chemrxiv-2025-4glrl-v2},
note={This content is a preprint and has not been peer-reviewed},
url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2}
}
``` |