|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Derify/augmented_canonical_druglike_QED_43M |
|
- Derify/druglike |
|
metrics: |
|
- roc_auc |
|
- rmse |
|
library_name: transformers |
|
tags: |
|
- ChemBERTa |
|
- cheminformatics |
|
pipeline_tag: fill-mask |
|
model-index: |
|
- name: Derify/ChemBERTa-druglike |
|
results: |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: BACE |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.8114 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: BBBP |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7399 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: TOX21 |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7522 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: HIV |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.7527 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: SIDER |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.6577 |
|
- task: |
|
type: text-classification |
|
name: Classification (ROC AUC) |
|
dataset: |
|
name: CLINTOX |
|
type: Derify/druglike |
|
metrics: |
|
- type: roc_auc |
|
value: 0.9660 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: ESOL |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 0.8241 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: FREESOLV |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 0.5350 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: LIPO |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 0.6663 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: BACE |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 1.0105 |
|
- task: |
|
type: regression |
|
name: Regression (RMSE) |
|
dataset: |
|
name: CLEARANCE |
|
type: Derify/druglike |
|
metrics: |
|
- type: rmse |
|
value: 43.4499 |
|
--- |
|
|
|
# ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES |
|
|
|
## Model Description |
|
|
|
ChemBERTa-druglike is a ChemBERTa-style masked-language model pretrained on drug-like SMILES, intended for downstream molecular property prediction and embedding-based similarity tasks on drug-like molecules.
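
A quick-start sketch using the 🤗 `pipeline` API (this assumes the checkpoint is published under `Derify/ChemBERTa-druglike` with a standard fill-mask head; the exact mask-token syntax depends on the tokenizer, so we read it from `fill.tokenizer` rather than hard-coding it):

```python
from transformers import pipeline

# Load the checkpoint as a fill-mask pipeline (downloads weights on first use).
fill = pipeline("fill-mask", model="Derify/ChemBERTa-druglike")

# Mask one position of a SMILES string and ask the model to recover it.
masked = f"c1ccccc1{fill.tokenizer.mask_token}"  # benzene ring with one masked token
preds = fill(masked)

for pred in preds:
    print(pred["token_str"], round(pred["score"], 4))
```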
|
|
|
## Training Procedure |
|
The model was pretrained with a two-phase curriculum learning strategy in which the difficulty of the pretraining task increases over time: the first phase uses a simpler dataset with a lower masking probability, and the second phase a more complex dataset with a higher masking probability. This lets the model learn robust representations of drug-like molecules while gradually adapting to a harder objective.
|
|
|
### Phase 1 – “easy” pretraining |
|
- Dataset: [augmented_canonical_druglike_QED_43M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_43M) |
|
- Masking probability: 15% |
|
- Training duration: 9 epochs (training stopped once the loss plateaued)
|
- Training procedure: Following established ChemBERTa and ChemBERTa-2 methodologies |
|
|
|
### Phase 2 – “advanced” pretraining |
|
- Dataset: [druglike dataset](https://huggingface.co/datasets/Derify/druglike) |
|
- Masking probability: 40%
|
- Training duration: Until the early-stopping callback triggered (best validation loss at ~18,000 steps). Further training degraded the Chem-MRL evaluation score.
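
The effect of the two masking probabilities can be illustrated with a simplified BERT-style masking routine. This is an illustrative sketch, not the exact data collator used for training; real training operates on tokenizer IDs and the full vocabulary, but follows the same select-then-replace (80% `[MASK]`, 10% random, 10% unchanged) rule:

```python
import random

def mask_tokens(tokens, mask_prob, vocab, seed=0):
    """Return (masked_tokens, labels): each token is selected with probability
    mask_prob; selected tokens become [MASK] 80% of the time, a random vocab
    token 10% of the time, and stay unchanged 10% of the time."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # this position enters the MLM loss
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            labels.append(None)           # ignored by the MLM loss
            masked.append(tok)
    return masked, labels

# Phase 1 masks ~15% of tokens; phase 2 raises this to ~40%.
tokens = list("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, character-level for illustration
vocab = ["C", "O", "N", "c", "1", "(", ")", "="]
easy, _ = mask_tokens(tokens, 0.15, vocab)
hard, _ = mask_tokens(tokens, 0.40, vocab)
```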
|
|
|
### Training Configuration |
|
- Optimizer: NVIDIA Apex's FusedAdam optimizer |
|
- Scheduler: Constant with warmup (10% of steps) |
|
- Batch size: 144 sequences |
|
- Precision: mixed precision (fp16) with TF32 enabled
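
The constant-with-warmup schedule ramps the learning rate linearly from zero over the first 10% of steps and then holds it constant. A minimal sketch (the actual run used the equivalent scheduler from `transformers`; the base learning rate here is a placeholder, not a stated pretraining hyperparameter):

```python
def constant_with_warmup(step, total_steps, base_lr, warmup_frac=0.10):
    """Linear warmup over the first warmup_frac of steps, then constant."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

# Example: 10,000 total steps -> 1,000 warmup steps.
lrs = [constant_with_warmup(s, 10_000, 3e-5) for s in (0, 500, 1_000, 9_999)]
```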
|
|
|
## Model Objective |
|
|
|
This model serves as a specialized backbone for drug-like molecular representation learning, specifically optimized for: |
|
- Molecular similarity tasks |
|
- Drug-like compound analysis |
|
- Chemical space exploration in pharmaceutical contexts |
|
|
|
## Evaluation |
|
|
|
The model's effectiveness was validated through downstream Chem-MRL training on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset, measuring Spearman correlation coefficients between transformer embedding similarities and 2048-bit Morgan fingerprint Tanimoto similarities. |
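
The metric can be sketched as follows: for pairs of molecules, compare the similarity of transformer embeddings against the Tanimoto similarity of fingerprints via Spearman correlation. This is a pure-Python illustration with toy similarity values and fingerprints as sets of on-bit indices; the real evaluation uses 2048-bit Morgan fingerprints from a cheminformatics toolkit:

```python
def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints (sets of on-bit indices)."""
    a, b = set(a), set(b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if (a or b) else 1.0

def spearman(x, y):
    """Spearman rank correlation (ties not handled, for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Toy check: perfectly monotone similarity lists give correlation 1.0.
embed_sims = [0.91, 0.40, 0.75, 0.10]   # cosine similarities of embeddings
fp_sims    = [0.85, 0.30, 0.60, 0.05]   # Tanimoto similarities of fingerprints
rho = spearman(embed_sims, fp_sims)
```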
|
|
|
A W&B report on the [ChemBERTa-druglike evaluation](https://api.wandb.ai/links/ecortes/afh508m3) is available.
|
|
|
## Benchmarks |
|
### Classification Datasets (ROC AUC - Higher is better) |
|
|
|
| Model | BACE↑ | BBBP↑ | TOX21↑ | HIV↑ | SIDER↑ | CLINTOX↑ | |
|
| ------------------------- | ------ | ------ | ------ | ------ | ------ | -------- | |
|
| **Tasks** | 1 | 1 | 12 | 1 | 27 | 2 | |
|
| Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660 | |
|
|
|
### Regression Datasets (RMSE - Lower is better) |
|
|
|
| Model | ESOL↓ | FREESOLV↓ | LIPO↓ | BACE↓ | CLEARANCE↓ | |
|
| ------------------------- | ------ | --------- | ------ | ------ | ---------- | |
|
| **Tasks** | 1 | 1 | 1 | 1 | 1 | |
|
| Derify/ChemBERTa-druglike | 0.8241 | 0.5350 | 0.6663 | 1.0105 | 43.4499 | |
|
|
|
Benchmarks were conducted using the [chemberta3](https://github.com/deepforestsci/chemberta3) framework. |
|
Datasets were split with DeepChem’s scaffold splits and filtered to include only molecules with SMILES length ≤128, matching the model’s maximum input length. |
|
The ChemBERTa-druglike model was fine-tuned for 100 epochs with a learning rate of 3e-5 and a batch size of 32.
|
Each task was run with 3 different random seeds, and the mean performance is reported. |
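
The preprocessing and aggregation described above amount to the following (an illustrative sketch; dataset loading, scaffold splitting, and fine-tuning itself are handled by the benchmark framework):

```python
from statistics import mean

MAX_LEN = 128  # the model's maximum input length

def filter_by_smiles_length(records, max_len=MAX_LEN):
    """Keep only molecules whose SMILES string fits the model's context window."""
    return [r for r in records if len(r["smiles"]) <= max_len]

def aggregate_seeds(scores_per_seed):
    """Report the mean metric over the runs (3 random seeds in this card)."""
    return mean(scores_per_seed)

data = [{"smiles": "CCO"}, {"smiles": "C" * 200}]
kept = filter_by_smiles_length(data)          # drops the over-long molecule
task_score = aggregate_seeds([0.80, 0.81, 0.82])
```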
|
|
|
## Use Cases |
|
|
|
- Molecular property prediction |
|
- Drug discovery and development |
|
- Chemical similarity analysis |
|
|
|
## Limitations |
|
|
|
- Optimized specifically for drug-like molecules |
|
- Performance may vary on non-drug-like chemical compounds |
|
|
|
## References |
|
### ChemBERTa Series |
|
``` |
|
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining, |
|
title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction}, |
|
author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar}, |
|
year={2020}, |
|
eprint={2010.09885}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG}, |
|
url={https://arxiv.org/abs/2010.09885}, |
|
} |
|
``` |
|
``` |
|
@misc{ahmad2022chemberta2chemicalfoundationmodels, |
|
title={ChemBERTa-2: Towards Chemical Foundation Models}, |
|
author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar}, |
|
year={2022}, |
|
eprint={2209.01712}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG}, |
|
url={https://arxiv.org/abs/2209.01712}, |
|
} |
|
``` |
|
``` |
|
@misc{singh2025chemberta3opensource, |
|
title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models}, |
|
author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others}, |
|
year={2025}, |
|
howpublished={ChemRxiv}, |
|
doi={10.26434/chemrxiv-2025-4glrl-v2}, |
|
note={This content is a preprint and has not been peer-reviewed}, |
|
url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2} |
|
} |
|
``` |