
Built With Llama!

Built With Axolotl!

Overview

We fine-tuned SmileyLlama with DPO to optimize the SMILES strings it generates for a combination of drug-likeness and binding affinity to the SARS-CoV-2 Main Protease (Mpro), as assessed by AutoDock Vina. As a result, this model generates molecules with much higher predicted binding affinity than those generated by the original SmileyLlama. It also inherits SmileyLlama's ability to follow other directions in its prompt. For instance, one can prompt it with "High SARS2PRO, <= 5 H-bond donors, <= 10 H-bond acceptors, <= 500 molecular weight, <= 5 logP" to get molecules that are predicted to bind Mpro and have an improved likelihood of obeying Lipinski's rule of five.

For more details, read the arXiv preprint: https://arxiv.org/abs/2409.02231

How to use

This model can be loaded with the same method as Llama 3.1, and its memory requirements are the same as Llama-3.1-8B.

The property options that SmileyLlama was trained on are:

  • ( <= 3, <= 4, <= 5, <= 7, > 7) H-bond donors
  • ( <= 3, <= 4, <= 5, <= 10, <= 15) H-bond acceptors
  • ( <= 300, <= 400, <= 500, <= 600, > 600) Molecular weight
  • ( <= 3, <= 4, <= 5, <= 10, <= 15, > 15) logP
  • ( <= 7, <= 10, > 10) Rotatable bonds
  • ( < 0.4, > 0.4, > 0.5, > 0.6) Fraction sp3
  • ( <= 90, <= 140, <= 200, > 200) TPSA
  • (a macrocycle, no macrocycles)
  • (has, lacks) bad SMARTS
  • lacks covalent warheads
  • has covalent warheads: (sulfonyl fluorides, acrylamides, ...) (see below for details)
  • A substructure of *SMILES_STRING*
  • A chemical of *CHEMICAL_FORMULA*
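Properties are combined as a comma-separated list in the user prompt. As a sketch (the `build_prompt` helper is hypothetical, not part of the model's API; the template matches the example later in this card):

```python
# Hypothetical helper: assemble a SmileyLlama prompt from a list of
# property constraints drawn from the options above.
def build_prompt(properties):
    system_text = "You love and excel at generating SMILES strings of drug-like molecules"
    user_text = ("Output a SMILES string for a drug like molecule "
                 "with the following properties: " + ", ".join(properties))
    return f"### Instruction:\n{system_text}\n\n### Input:\n{user_text}\n\n### Response:\n"

prompt = build_prompt(["<= 5 H-bond donors", "<= 10 H-bond acceptors",
                       "<= 500 molecular weight", "<= 5 logP"])
print(prompt)
```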

List of possible warheads:

  • sulfonyl fluorides: [#16](=[#8])(=[#8])-[#9]
  • chloroacetamides: [#8]=[#6](-[#6]-[#17])-[#7]
  • cyanoacrylamides: [#7]-[#6](=[#8])-[#6](-[#6]#[#7])=[#6]
  • epoxides: [#6]1-[#6]-[#8]-1
  • aziridines: [#6]1-[#6]-[#7]-1
  • disulfides: [#16]-[#16]
  • aldehydes: [#6](=[#8])-[#1]
  • vinyl sulfones: [#6]=[#6]-[#16](=[#8])(=[#8])-[#7]
  • boronic acids/esters: [#6]-[#5](-[#8])-[#8]
  • acrylamides: [#6]=[#6]-[#6](=[#8])-[#7]
  • cyanamides: [#6]-[#7](-[#6]#[#7])-[#6]
  • chlorofluoroacetamides: [#7]-[#6](=[#8])-[#6](-[#9])-[#17]
  • butynamides: [#6]#[#6]-[#6](=[#8])-[#7]-[#6]
  • chloropropionamides: [#7]-[#6](=[#8])-[#6](-[#6])-[#17]
  • fluorosulfates: [#8]=[#16](=[#8])(-[#9])-[#8]
  • beta lactams: [#7]1-[#6]-[#6]-[#6]-1=[#8]
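These SMARTS patterns can also be used to verify that generated molecules actually contain (or lack) a requested warhead, for example with RDKit. This is only an illustration, not a dependency of the model; two of the patterns above are shown:

```python
from rdkit import Chem

# Two of the warhead SMARTS patterns listed above
WARHEADS = {
    "acrylamides": "[#6]=[#6]-[#6](=[#8])-[#7]",
    "epoxides": "[#6]1-[#6]-[#8]-1",
}

def find_warheads(smiles):
    """Return the names of the listed warheads present in a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []  # unparseable SMILES
    return [name for name, smarts in WARHEADS.items()
            if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))]

print(find_warheads("C=CC(=O)NC1CCCCC1"))  # an acrylamide
```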

Generating a drug-like molecule which obeys the Lipinski rule of five and has a better likelihood of binding to SARS-CoV-2 Main Protease

import torch
import transformers

model_id = "/path/to/your/model"

system_text = "You love and excel at generating SMILES strings of drug-like molecules"
user_text = "Output a SMILES string for a drug like molecule with the following properties: High SARS2PRO, <= 5 H-bond donors, <= 10 H-bond acceptors, <= 500 molecular weight, <= 5 logP"
prompt = f"### Instruction:\n{system_text}\n\n### Input:\n{user_text}\n\n### Response:\n"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

outputs = pipeline(
    prompt,
    do_sample=True,  # sampling is needed for temperature and multiple sequences
    temperature=1.0,
    max_new_tokens=128,
    num_return_sequences=4,
)
for out in outputs:
    # generated_text contains the prompt followed by the completion;
    # strip the prompt to keep only the generated SMILES string
    print(out["generated_text"][len(prompt):].strip())

You can increase num_return_sequences to generate many SMILES strings in a single call, though the batch size is limited by available GPU memory.
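Sampled completions can repeat or come back with stray whitespace, so a small post-processing step is useful. A minimal sketch, assuming the standard transformers text-generation output format (a list of dicts with a "generated_text" key):

```python
def extract_smiles(outputs, prompt):
    """Collect unique, non-empty completions from a text-generation
    pipeline result, stripping the echoed prompt from each one."""
    seen = set()
    smiles = []
    for out in outputs:
        s = out["generated_text"][len(prompt):].strip()
        if s and s not in seen:
            seen.add(s)
            smiles.append(s)
    return smiles

# Example with mock pipeline output; real outputs come from pipeline(...)
fake = [{"generated_text": "PROMPT CCO"},
        {"generated_text": "PROMPT CCO"},
        {"generated_text": "PROMPT c1ccccc1"}]
print(extract_smiles(fake, "PROMPT"))  # ['CCO', 'c1ccccc1']
```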

Model: THGLab/Llama-3.1-8B-SmileyLlama-1.1-Mpro