
# 🧪 IbnSinna-2B-Pharma: Drug Discovery Language Model

**Advanced Pharmaceutical AI for Molecular Discovery**

🔬 Binary Classification • 📈 Regression • 🧬 Conditional Generation

> 📌 Note: This model supersedes the earlier IbnSinna-2B-Drug model, which has been deprecated. Please use this version for all applications.

## 🎨 Model Overview
IbnSinna-2B-Pharma is a specialized language model fine-tuned for three core drug discovery tasks:
- Binary Classification: Predicting a wide range of molecular properties with Yes/No answers, including ADMET properties (e.g., clinical toxicity, BBB permeability), bioactivity (e.g., BACE and HIV inhibition), and pathway interactions (e.g., stress response and nuclear receptor binding).
- Regression: Predicting key quantitative physicochemical values, specifically aqueous solubility (logS) and hydration free energy.
- Conditional Generation: Designing novel molecules by providing a chemical scaffold as a structural constraint.
Built on Google's TXGemma-2B architecture and fine-tuned on comprehensive pharmaceutical datasets, this model serves as a powerful tool for computational drug discovery and molecular property prediction.
**Important:** This model was trained as a plain-text generator (not chat-based). Prompts should end with `\nAnswer:`, and the model will complete the answer directly.
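For example, a classification prompt follows the `Molecule:` / `Question:` format used throughout this card (the molecule here, ethanol, is just an illustration); the model completes the text after `Answer:` with `Yes` or `No`:

```text
Molecule: CCO
Question: Is this molecule toxic?
Answer:
```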
## 📊 Evaluation
The model was evaluated on samples from standard benchmark datasets (TDC: ClinTox, ESOL) and a custom generation task.
### 1. Binary Classification (ClinTox)

- Performance: The model identifies positive cases reasonably well (Recall: 66.7%) but does so at the cost of a high false-positive rate (Precision: 22.2%). This makes it suitable for initial screening, where missing a potentially toxic compound is more critical than misidentifying a safe one (see the metric sketch after this list).
- Metrics:
- Accuracy: 73.3%
- F1 Score: 33.3%
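These numbers can be reproduced from the raw `Yes`/`No` completions with standard scikit-learn metrics. A minimal sketch, where `raw_outputs` and `labels` are placeholders for a real evaluation run:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Placeholder completions (text after "Answer:") and ground-truth labels
raw_outputs = ["Yes", "No", "Yes", "No"]
labels = [1, 0, 0, 0]  # 1 = toxic, 0 = non-toxic

# Map the first word of each completion to a binary prediction
preds = [1 if out.strip().split()[0].lower().startswith("yes") else 0
         for out in raw_outputs]

print(f"Accuracy:  {accuracy_score(labels, preds):.3f}")
print(f"F1:        {f1_score(labels, preds):.3f}")
print(f"Precision: {precision_score(labels, preds):.3f}")
print(f"Recall:    {recall_score(labels, preds):.3f}")
```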
### 2. Regression (ESOL Solubility)

- Performance: The model understands the task and produces well-formed numerical output, but its quantitative accuracy is low (MAE: 2.57) compared to specialized regression models. Use it for qualitative estimation rather than precise value prediction (see the sketch after this list).
- Metrics:
- RMSE: 3.05
- MAE: 2.57
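RMSE and MAE can be computed after parsing the model's free-text completions into floats; completions that fail to parse should be skipped or counted separately. A minimal sketch with placeholder values:

```python
import numpy as np

# Placeholder parsed predictions and reference logS values
preds = np.array([-2.1, -0.5, -3.9])
targets = np.array([-1.2, -0.8, -2.5])

errors = preds - targets
rmse = float(np.sqrt(np.mean(errors ** 2)))
mae = float(np.mean(np.abs(errors)))
print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}")
```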
### 3. Conditional Generation (Scaffold-based)

- Performance: The model excels at this task when using sampling-based decoding. It generates a high rate of chemically valid and unique molecules, demonstrating a strong ability to explore chemical space (see the validity check sketched after this list).
- Metrics:
- SMILES Validity Rate: 90.0%
- Molecule Uniqueness Rate: 100.0%
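Validity and uniqueness of generated molecules can be checked with RDKit; a minimal sketch, with a placeholder list of sampled SMILES:

```python
from rdkit import Chem

# Placeholder samples; in practice, collect these from the generation loop
generated = ["c1ccccc1O", "c1ccccc1CC(=O)N", "not-a-smiles"]

valid = [s for s in generated if Chem.MolFromSmiles(s) is not None]
# Canonicalize before deduplicating so equivalent SMILES count once
unique = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in valid}

print(f"Validity:   {len(valid) / len(generated):.1%}")
print(f"Uniqueness: {len(unique) / len(valid):.1%}")
```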
## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "OussamaEL/IbnSinna-2B-Pharma",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("OussamaEL/IbnSinna-2B-Pharma")

def ask(prompt, max_new_tokens=32, do_sample=False):
    """Queries the model. Use do_sample=True for generative tasks."""
    text = f"{prompt}\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=tokenizer.eos_token_id,
        temperature=0.8 if do_sample else None,
        top_k=50 if do_sample else None
    )
    # Decode the full sequence and extract the completion after "Answer:"
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    try:
        answer = generated_text.split("Answer:")[1].strip()
    except IndexError:
        answer = "Could not parse answer."
    return answer

# Example: HIV inhibitor classification (greedy decoding)
prompt = "Molecule: COc1ccc(N2C(=O)C3c4[nH]c5ccc(C)cc5c4C4CCC(C(C)(C)C)CC4C3C2=O)cc1\nQuestion: Is this molecule an HIV inhibitor?"
print(ask(prompt, max_new_tokens=5, do_sample=False))

# Example: molecule generation (sampling)
prompt_gen = "Molecule Scaffold: c1ccccc1\nQuestion: Generate a novel molecule containing this scaffold."
print(ask(prompt_gen, max_new_tokens=100, do_sample=True))
```
## 🎯 Model Capabilities
The model excels at three primary tasks:
### 1️⃣ Binary Classification
Predicting molecular properties with Yes/No outcomes:
- Clinical Toxicity: Predicting adverse effects and safety profiles (from ClinTox).
- BACE Inhibition: β-secretase 1 inhibitor prediction (from BACE).
- HIV Inhibition: Antiviral activity prediction (from HIV).
- Stress Response Pathways: Activation of pathways like ARE, HSE, and p53 (from Tox21).
- Nuclear Receptor Binding: Ligand prediction for receptors like ER, AR, and AhR (from Tox21).
- ADMET Properties: Blood-Brain Barrier (BBB) permeability (from BBBP).
### 2️⃣ Regression
Predicting quantitative molecular properties:
- Aqueous Solubility (logS): Water solubility prediction (from ESOL).
- Hydration Free Energy: Calculating the free energy of solvation in water (from FreeSolv).
### 3️⃣ Conditional Generation
Designing novel molecules based on constraints:
- Scaffold-based Design: Generating novel molecules that contain a specific user-provided core structure or scaffold (a verification sketch follows below).
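Because prompt adherence is not guaranteed (see Known Limitations below), it is worth verifying that generated molecules actually contain the requested scaffold. A minimal sketch using RDKit substructure matching, with hypothetical SMILES:

```python
from rdkit import Chem

scaffold = Chem.MolFromSmiles("c1ccccc1")          # requested core structure
candidate = Chem.MolFromSmiles("c1ccccc1CC(=O)O")  # hypothetical generated molecule

if candidate is not None and candidate.HasSubstructMatch(scaffold):
    print("Generated molecule contains the scaffold.")
else:
    print("Invalid SMILES or scaffold missing; discard or regenerate.")
```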
## 💻 Detailed Usage Examples

### Classification Tasks

```python
classification_prompts = [
    "Analyze Oc1ccc(Cl)c(Cl)c1 and determine if it is a stress response pathway ARE activator.",
    "Analyze COc1ccccc1NC(=O)CC(C)=O and determine if it is a nuclear receptor AhR ligand.",
    "Analyze CC[C@H]1C[C@H]2[C@@H]3CCC4=CC(=O)CC[C@@H]4[C@H]3CC[C@]2(C)[C@H]1O and determine if it is a nuclear receptor AR ligand."
]

for prompt in classification_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=5,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(gen, skip_special_tokens=True).strip().split()[0]  # Get first word (Yes/No)
    print(f"Q: {prompt[:50]}...")
    print(f"A: {answer}\n")
```
### Molecule Generation

```python
generation_prompts = [
    "Design a molecule containing the core structure O=C1Nc2ccccc2Oc2ccccc21.",
    "Design a molecule containing the core structure c1ccc2c(c1)Cc1ccccc1N2.",
    "Generate a molecule based on the scaffold c1ccccc1."
]

for prompt in generation_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        temperature=0.7,
        do_sample=True,
        top_p=0.95
    )
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    smiles = tokenizer.decode(gen, skip_special_tokens=True).strip()
    print(f"Scaffold: {prompt}")
    print(f"Generated: {smiles}\n")
```
### Property Prediction

```python
property_prompts = [
    "What is the predicted aqueous solubility (logS) of CCCCCCCO?",
    "What is the predicted aqueous solubility (logS) of Cc1ccc(O)cc1?"
]

for prompt in property_prompts:
    text = prompt + "\nAnswer:"
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    gen = outputs[0, inputs["input_ids"].shape[1]:]
    value = tokenizer.decode(gen, skip_special_tokens=True).strip()
    print(f"Molecule: {prompt.split('of ')[-1].replace('?', '')}")
    print(f"Predicted logS: {value}\n")
```
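Since regression answers come back as free text, downstream code should parse them defensively. A small helper sketch (`parse_logS` is not part of the model's API, just an illustration; `value` is the decoded string from the loop above):

```python
import re

def parse_logS(value):
    """Extract the first numeric token from a completion, or None if absent."""
    match = re.search(r"-?\d+(?:\.\d+)?", value)
    return float(match.group()) if match else None

print(parse_logS("-1.23"))        # -1.23
print(parse_logS("unparseable"))  # None
```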
## 🔧 Technical Specifications

### Model Architecture

- Base Model: Google's TXGemma-2B-predict
- Parameters: ~2.6B (2,635,108,608 total)
- Fine-tuning Method: QLoRA (4-bit NF4 + bf16 compute) → full model merge to FP16
- LoRA Config: r=8, alpha=16, dropout=0.05
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Training Precision: 4-bit NF4 with bf16 compute dtype
- Released Precision: FP16 (merged model)
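For reference, the configuration above corresponds roughly to the following PEFT and bitsandbytes setup. This is an illustrative reconstruction from the listed hyperparameters, not the exact training script:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with bf16 compute, as listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter configuration, as listed above
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```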
### Training Details
- Optimizer: Paged AdamW 8-bit
- Learning Rate: 1e-4 with cosine scheduler
- Effective Batch Size: 4
- Warmup Steps: 20
- Max Sequence Length: 256-512 tokens (recommend 256 for short prompts)
- Training Framework: HuggingFace Transformers + PEFT + TRL
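In HuggingFace Transformers terms, these settings map roughly onto arguments like the following. Again an illustrative sketch, not the original script; in particular, the effective batch size of 4 could be any combination of per-device batch size and gradient accumulation:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ibnsinna-2b-pharma",  # hypothetical output path
    optim="paged_adamw_8bit",         # Paged AdamW 8-bit
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=20,
    per_device_train_batch_size=1,    # assumption: 1 x 4 accumulation = effective 4
    gradient_accumulation_steps=4,
    bf16=True,                        # bf16 compute dtype
)
```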
### Dataset Information
The model was trained on a custom-built dataset aggregated from MoleculeNet.
- Source: MoleculeNet (Tox21, ClinTox, BBBP, HIV, BACE, ESOL, FreeSolv)
- Training Samples: 33,179
- Training Task Distribution (approximate):
- ~53% Classification
- ~44% Generation
- ~3% Regression
## ⚠️ Limitations and Ethical Considerations

### Intended Use

- ✅ Research and Development: Drug discovery research, lead optimization
- ✅ Educational Purposes: Teaching molecular modeling concepts
- ✅ Screening Tools: Initial compound screening and prioritization
- ❌ NOT for Clinical Decisions: Do not use for patient treatment decisions
- ❌ NOT for Final Validation: Always require wet-lab validation
### Known Limitations

- Low Quantitative Accuracy: The model's predictions on regression tasks (e.g., logS) are not precise and should be treated as qualitative estimates.
- High False-Positive Rate in Classification: The model tends to be overly cautious in toxicity prediction, leading to a high rate of false positives.
- Imperfect Prompt Adherence: In generation tasks, the model may occasionally produce molecules that do not satisfy all structural constraints (e.g., failing to include the required scaffold).
- Dataset Biases: The model's knowledge is limited to the chemical space and biases present in the MoleculeNet datasets it was trained on.
- SMILES-based: The model operates on 1D SMILES strings and does not reason about 3D molecular conformations.
### Ethical Guidelines
- Always validate predictions experimentally
- Consider potential biases in drug discovery pipelines
- Ensure equitable application across different populations
- Respect intellectual property in molecular design
## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{ibnsinna_2b_pharma_2025,
  title={IbnSinna-2B-Pharma: Drug Discovery Language Model},
  author={Oussama El Allam},
  year={2025},
  month={August},
  url={https://huggingface.co/OussamaEL/IbnSinna-2B-Pharma},
  note={Fine-tuned TXGemma-2B for pharmaceutical and drug discovery applications},
  publisher={Hugging Face}
}
```
## 🤗 Acknowledgments
- Google DeepMind for the TXGemma base model
- The drug discovery and cheminformatics community
- Contributors to the training datasets
## 🔗 Links

- Model: [OussamaEL/IbnSinna-2B-Pharma](https://huggingface.co/OussamaEL/IbnSinna-2B-Pharma)
- Base Model: [google/txgemma-2b-predict](https://huggingface.co/google/txgemma-2b-predict)
- Dataset: [OussamaEL/drug-discovery-dataset](https://huggingface.co/datasets/OussamaEL/drug-discovery-dataset)
*Made with 🧬 for the Drug Discovery Community*

*Advanced Pharmaceutical AI for Next-Generation Drug Development*