Built With Llama!
Built With Axolotl!
Overview
We fine-tuned SmileyLlama with DPO (Direct Preference Optimization) to optimize the SMILES strings it generates for a combination of drug-likeness and binding affinity to the SARS-CoV-2 Main Protease (Mpro), as assessed by AutoDock Vina. As a result, this model generates molecules with much higher predicted binding affinity than those generated by the original SmileyLlama. It also inherits SmileyLlama's ability to follow other directions in its prompt. For instance, prompting it with "High SARS2PRO, <= 5 H-bond donors, <= 10 H-bond acceptors, <= 500 molecular weight, <= 5 logP" yields molecules that are predicted to bind Mpro and have an improved likelihood of obeying Lipinski's rule of five.
For more details, read the arXiv preprint here: https://arxiv.org/abs/2409.02231
How to use
This model can be loaded the same way as Llama 3.1, and its memory requirements are the same as Llama-3.1-8B.
Options for "properties" that SmileyLlama was trained on are:
- (<= 3, <= 4, <= 5, <= 7, > 7) H-bond donors
- (<= 3, <= 4, <= 5, <= 10, <= 15) H-bond acceptors
- (<= 300, <= 400, <= 500, <= 600, > 600) molecular weight
- (<= 3, <= 4, <= 5, <= 10, <= 15, > 15) logP
- (<= 7, <= 10, > 10) rotatable bonds
- (< 0.4, > 0.4, > 0.5, > 0.6) fraction sp3
- (<= 90, <= 140, <= 200, > 200) TPSA
- (a macrocycle, no macrocycles)
- (has, lacks) bad SMARTS
- lacks covalent warheads
- has covalent warheads: (sulfonyl fluorides, acrylamides, ...) (see below for details)
- a substructure of *SMILES_STRING*
- a chemical of *CHEMICAL_FORMULA*
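As a sanity check, the property thresholds above can be recomputed on generated molecules with RDKit. This is a sketch assuming RDKit is installed; the descriptor choices here (e.g. Crippen logP, Lipinski donor/acceptor counts) are common defaults and may differ slightly from the exact definitions used during training.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors

def check_properties(smiles: str) -> dict:
    """Compute the prompt-controllable properties for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return {
        "h_bond_donors": Lipinski.NumHDonors(mol),
        "h_bond_acceptors": Lipinski.NumHAcceptors(mol),
        "molecular_weight": Descriptors.MolWt(mol),
        "logP": Crippen.MolLogP(mol),
        "rotatable_bonds": Lipinski.NumRotatableBonds(mol),
        "fraction_sp3": rdMolDescriptors.CalcFractionCSP3(mol),
        "tpsa": rdMolDescriptors.CalcTPSA(mol),
    }

# Example: aspirin comfortably satisfies the Lipinski-style thresholds above
props = check_properties("CC(=O)Oc1ccccc1C(=O)O")
```

Comparing these values against the thresholds you put in the prompt gives a quick measure of how often the model actually obeys its instructions.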
List of possible warheads:
- sulfonyl fluorides:
[#16](=[#8])(=[#8])-[#9]
- chloroacetamides:
[#8]=[#6](-[#6]-[#17])-[#7]
- cyanoacrylamides:
[#7]-[#6](=[#8])-[#6](-[#6]#[#7])=[#6]
- epoxides:
[#6]1-[#6]-[#8]-1
- aziridines:
[#6]1-[#6]-[#7]-1
- disulfides:
[#16]-[#16]
- aldehydes:
[#6](=[#8])-[#1]
- vinyl sulfones:
[#6]=[#6]-[#16](=[#8])(=[#8])-[#7]
- boronic acids/esters:
[#6]-[#5](-[#8])-[#8]
- acrylamides:
[#6]=[#6]-[#6](=[#8])-[#7]
- cyanamides:
[#6]-[#7](-[#6]#[#7])-[#6]
- chlorofluoroacetamides:
[#7]-[#6](=[#8])-[#6](-[#9])-[#17]
- butynamides:
[#6]#[#6]-[#6](=[#8])-[#7]-[#6]
- chloropropionamides:
[#7]-[#6](=[#8])-[#6](-[#6])-[#17]
- fluorosulfates:
[#8]=[#16](=[#8])(-[#9])-[#8]
- beta lactams:
[#7]1-[#6]-[#6]-[#6]-1=[#8]
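These SMARTS patterns can be applied directly with RDKit to check whether a generated molecule contains a given warhead. A minimal sketch, assuming RDKit is installed; the WARHEADS dict restates a representative subset of the patterns above.

```python
from rdkit import Chem

# SMARTS patterns copied from the warhead list above (a representative subset)
WARHEADS = {
    "sulfonyl fluoride": "[#16](=[#8])(=[#8])-[#9]",
    "acrylamide": "[#6]=[#6]-[#6](=[#8])-[#7]",
    "epoxide": "[#6]1-[#6]-[#8]-1",
    "disulfide": "[#16]-[#16]",
}

def find_warheads(smiles: str) -> list[str]:
    """Return the names of every warhead pattern the molecule matches."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []  # unparseable SMILES matches nothing
    return [name for name, smarts in WARHEADS.items()
            if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))]

# Acrylamide itself matches the acrylamide pattern; benzene matches none
```

The same check is useful for filtering generations when the prompt asked for "lacks covalent warheads".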
Generating a drug-like molecule which obeys the Lipinski rule of five and has a better likelihood of binding to SARS-CoV-2 Main Protease
import torch
import transformers

# Local path to the model, or the Hub ID THGLab/Llama-3.1-8B-SmileyLlama-1.1-Mpro
model_id = "/path/to/your/model"

system_txt = "You love and excel at generating SMILES strings of drug-like molecules"
user_txt = "Output a SMILES string for a drug like molecule with the following properties: High SARS2PRO, <= 5 H-bond donors, <= 10 H-bond acceptors, <= 500 molecular weight, <= 5 logP"
prompt = f"### Instruction:\n{system_txt}\n\n### Input:\n{user_txt}\n\n### Response:\n"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

outputs = pipeline(
    prompt,
    max_new_tokens=128,
    do_sample=True,
    temperature=1.0,
    num_return_sequences=4,
)

for out in outputs:
    # Strip the prompt so only the generated SMILES string is printed
    print(out["generated_text"][len(prompt):])
You can increase num_return_sequences to generate many SMILES strings in a single call, though the effective batch size is limited by available GPU memory.
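Sampled outputs can include duplicates and occasional invalid strings, so a common post-processing step is to parse, canonicalize, and deduplicate them with RDKit. A sketch, assuming RDKit is installed; the function name is illustrative.

```python
from rdkit import Chem

def unique_valid_smiles(raw_smiles: list[str]) -> list[str]:
    """Keep only parseable SMILES, canonicalized and deduplicated (order-preserving)."""
    seen: set[str] = set()
    result: list[str] = []
    for s in raw_smiles:
        mol = Chem.MolFromSmiles(s.strip())
        if mol is None:
            continue  # skip invalid generations
        canon = Chem.MolToSmiles(mol)  # canonical form, so duplicates collapse
        if canon not in seen:
            seen.add(canon)
            result.append(canon)
    return result
```

Canonicalization maps different SMILES spellings of the same molecule (e.g. "CCO" and "OCC") to one string, which makes duplicate counting reliable.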
Model tree for THGLab/Llama-3.1-8B-SmileyLlama-1.1-Mpro
Base model
meta-llama/Llama-3.1-8B