---
license: apache-2.0
datasets:
- Derify/augmented_canonical_druglike_QED_43M
- Derify/druglike
metrics:
- roc_auc
- rmse
library_name: transformers
tags:
- ChemBERTa
- cheminformatics
pipeline_tag: fill-mask
model-index:
- name: Derify/ChemBERTa-druglike
  results:
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BACE
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.8114
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BBBP
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.7399
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: TOX21
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.7522
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: HIV
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.7527
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: SIDER
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.6577
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: CLINTOX
      type: Derify/druglike
    metrics:
    - type: roc_auc
      value: 0.966
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ESOL
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 0.8241
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: FREESOLV
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 0.535
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: LIPO
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 0.6663
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: BACE
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 1.0105
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: CLEARANCE
      type: Derify/druglike
    metrics:
    - type: rmse
      value: 43.4499
---
# ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES
## Model Description

ChemBERTa-druglike is a ChemBERTa model designed for downstream molecular property prediction and embedding-based similarity tasks on drug-like molecules.
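A minimal usage sketch, assuming the standard `transformers` fill-mask loading pattern; the mean-pooling step for embeddings is an illustrative assumption, not necessarily the pooling used in the Chem-MRL evaluation.

```python
import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer, pipeline

# Fill-mask on a SMILES string (the model's pipeline tag is fill-mask).
tokenizer = AutoTokenizer.from_pretrained("Derify/ChemBERTa-druglike")
mlm = AutoModelForMaskedLM.from_pretrained("Derify/ChemBERTa-druglike")
fill = pipeline("fill-mask", model=mlm, tokenizer=tokenizer)
print(fill(f"c1ccccc1{tokenizer.mask_token}"))  # predict the masked SMILES token

# Molecule embedding: mean-pool the last hidden states over non-padding tokens.
encoder = AutoModel.from_pretrained("Derify/ChemBERTa-druglike")
inputs = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")  # aspirin
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, dim)
mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, dim)
```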
## Training Procedure

The model was pretrained with a two-phase curriculum that progressively increases the difficulty of the masked-language-modeling objective: the first phase uses a simpler dataset with a lower masking probability, and the second phase uses a more complex dataset with a higher masking probability. This lets the model first learn robust representations of drug-like molecules and then adapt to a more challenging reconstruction task (a masking-configuration sketch follows the phase summaries below).
### Phase 1 – “easy” pretraining

- Dataset: augmented_canonical_druglike_QED_43M
- Masking probability: 15%
- Training duration: 9 epochs (stopped once the training loss plateaued)
- Training procedure: follows the established ChemBERTa and ChemBERTa-2 methodologies
### Phase 2 – “advanced” pretraining

- Dataset: druglike dataset
- Masking probability: 40%
- Training duration: until the early-stopping callback triggered (best validation loss at ~18,000 steps); further training degraded the downstream Chem-MRL evaluation score
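The phase-specific masking probabilities map directly onto the masked-LM data collator in `transformers`; the sketch below shows only that configuration (dataset loading and the training loop are omitted).

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Derify/ChemBERTa-druglike")

# Phase 1: standard 15% token masking on the augmented QED dataset.
phase1_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Phase 2: harder objective, 40% token masking on the druglike dataset.
phase2_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.40
)
```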
### Training Configuration

- Optimizer: NVIDIA Apex FusedAdam
- Scheduler: constant with warmup (warmup over the first 10% of steps)
- Batch size: 144 sequences
- Precision: mixed precision (fp16) with TF32 enabled
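Expressed as `transformers` `TrainingArguments`, the reported settings would look roughly like the sketch below; the output directory is a placeholder, and any argument not listed above is an assumption.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="chemberta-druglike-pretrain",  # placeholder path
    per_device_train_batch_size=144,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.10,          # warmup over the first 10% of steps
    fp16=True,                  # mixed-precision training
    tf32=True,                  # TF32 matmuls on Ampere+ GPUs
    optim="adamw_apex_fused",   # NVIDIA Apex FusedAdam
)
```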
## Model Objective
This model serves as a specialized backbone for drug-like molecular representation learning, specifically optimized for:
- Molecular similarity tasks
- Drug-like compound analysis
- Chemical space exploration in pharmaceutical contexts
## Evaluation
The model's effectiveness was validated through downstream Chem-MRL training on the pubchem_10m_genmol_similarity dataset, measuring Spearman correlation coefficients between transformer embedding similarities and 2048-bit Morgan fingerprint Tanimoto similarities.
See the W&B report on the ChemBERTa-druglike evaluation for details.
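A sketch of this evaluation metric, assuming mean-pooled embeddings compared with cosine similarity and RDKit Morgan fingerprints; the actual Chem-MRL pipeline may differ in pooling and pairing details.

```python
import numpy as np
import torch
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Derify/ChemBERTa-druglike")
model = AutoModel.from_pretrained("Derify/ChemBERTa-druglike")

def embed(smiles: str) -> np.ndarray:
    """Mean-pooled embedding of one SMILES string (pooling is an assumption)."""
    inputs = tokenizer(smiles, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).squeeze(0).numpy()

def spearman_vs_tanimoto(pairs):
    """Spearman correlation of embedding cosine similarity vs. Morgan/Tanimoto."""
    cos, tan = [], []
    for a, b in pairs:
        ea, eb = embed(a), embed(b)
        cos.append(float(ea @ eb) / (np.linalg.norm(ea) * np.linalg.norm(eb)))
        fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(a), 2, nBits=2048)
        fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(b), 2, nBits=2048)
        tan.append(DataStructs.TanimotoSimilarity(fp_a, fp_b))
    return spearmanr(cos, tan).correlation
```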
## Benchmarks
### Classification Datasets (ROC AUC - Higher is better)
| Model | BACE↑ | BBBP↑ | TOX21↑ | HIV↑ | SIDER↑ | CLINTOX↑ |
|---|---|---|---|---|---|---|
| Tasks | 1 | 1 | 12 | 1 | 27 | 2 |
| Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660 |
### Regression Datasets (RMSE - Lower is better)
| Model | ESOL↓ | FREESOLV↓ | LIPO↓ | BACE↓ | CLEARANCE↓ |
|---|---|---|---|---|---|
| Tasks | 1 | 1 | 1 | 1 | 1 |
| Derify/ChemBERTa-druglike | 0.8241 | 0.5350 | 0.6663 | 1.0105 | 43.4499 |
Benchmarks were conducted with the chemberta3 framework. Datasets were split using DeepChem’s scaffold splitter and filtered to molecules whose SMILES are at most 128 characters long, matching the model’s maximum input length. ChemBERTa-druglike was fine-tuned for 100 epochs with a learning rate of 3e-5 and a batch size of 32. Each task was run with three random seeds, and the mean performance is reported.
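A sketch of that preprocessing using DeepChem’s MolNet loaders; the BACE example and the loader arguments are illustrative assumptions, not the exact benchmark harness.

```python
import deepchem as dc

# Scaffold-split a MolNet dataset (BACE classification shown as an example).
tasks, (train, valid, test), transformers = dc.molnet.load_bace_classification(
    featurizer="Raw", splitter="scaffold"
)

# Keep only molecules whose SMILES fit the model's 128-token context window;
# in MolNet datasets, .ids holds the SMILES strings.
train_smiles = [s for s in train.ids if len(s) <= 128]
```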
## Use Cases
- Molecular property prediction
- Drug discovery and development
- Chemical similarity analysis
## Limitations
- Optimized specifically for drug-like molecules
- Performance may vary on non-drug-like chemical compounds
## References

### ChemBERTa Series
```bibtex
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
  title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
  author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
  year={2020},
  eprint={2010.09885},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2010.09885},
}

@misc{ahmad2022chemberta2chemicalfoundationmodels,
  title={ChemBERTa-2: Towards Chemical Foundation Models},
  author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
  year={2022},
  eprint={2209.01712},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2209.01712},
}

@misc{singh2025chemberta3opensource,
  title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models},
  author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others},
  year={2025},
  howpublished={ChemRxiv},
  doi={10.26434/chemrxiv-2025-4glrl-v2},
  note={This content is a preprint and has not been peer-reviewed},
  url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2}
}
```