SmileyLlama
We fine-tuned Llama-3.1-8B-Instruct on the task of generating SMILES string representations of molecules, using a training set of a few million molecules. This gives us a model, SmileyLlama, which can generate SMILES strings of drug-like molecules on demand.
For more details, read the arXiv preprint here: https://arxiv.org/abs/2409.02231
SmileyLlama can be loaded the same way as Llama-3.1, and its memory requirements are the same as Llama-3.1-8B.
Options for the "properties" that SmileyLlama was trained on are:
( <= 3, <= 4, <= 5, <= 7, > 7) H-bond donors
( <= 3, <= 4, <= 5, <= 10, <= 15) H-bond acceptors
( <= 300, <= 400, <= 500, <= 600, > 600) Molecular weight
( <= 3, <= 4, <= 5, <= 10, <= 15, > 15) logP
( <= 7, <= 10, > 10) Rotatable bonds
( < 0.4, > 0.4, > 0.5, > 0.6) Fraction sp3
( <= 90, <= 140, <= 200, > 200) TPSA
(a macrocycle, no macrocycles)
(has, lacks) bad SMARTS
lacks covalent warheads
has covalent warheads: (sulfonyl fluorides, acrylamides, ...) (see below for details)
A substructure of *SMILES_STRING*
A chemical of *CHEMICAL_FORMULA*
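As a sketch of how these options compose into a prompt, the helper below joins a list of property strings into the Alpaca-style template used in the example later in this card (the function name is illustrative, not part of the model's API):

```python
def build_prompt(properties):
    """Assemble a SmileyLlama-style prompt from a list of property strings."""
    system_txt = ("You love and excel at generating SMILES strings "
                  "of drug-like molecules")
    user_txt = ("Output a SMILES string for a drug like molecule "
                "with the following properties: " + ", ".join(properties) + ":")
    return f"### Instruction:\n{system_txt}\n\n### Input:\n{user_txt}\n\n### Response:\n"

prompt = build_prompt(["<= 5 H-bond donors", "<= 10 H-bond acceptors",
                       "<= 500 molecular weight", "<= 5 logP"])
print(prompt)
```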
The covalent warhead SMARTS patterns referenced above are:
[#16](=[#8])(=[#8])-[#9]
[#8]=[#6](-[#6]-[#17])-[#7]
[#7]-[#6](=[#8])-[#6](-[#6]#[#7])=[#6]
[#6]1-[#6]-[#8]-1
[#6]1-[#6]-[#7]-1
[#16]-[#16]
[#6](=[#8])-[#1]
[#6]=[#6]-[#16](=[#8])(=[#8])-[#7]
[#6]-[#5](-[#8])-[#8]
[#6]=[#6]-[#6](=[#8])-[#7]
[#6]-[#7](-[#6]#[#7])-[#6]
[#7]-[#6](=[#8])-[#6](-[#9])-[#17]
[#6]#[#6]-[#6](=[#8])-[#7]-[#6]
[#7]-[#6](=[#8])-[#6](-[#6])-[#17]
[#8]=[#16](=[#8])(-[#9])-[#8]
[#7]1-[#6]-[#6]-[#6]-1=[#8]
import torch
import transformers

model_id = "/path/to/your/model"

# Alpaca-style prompt format used during fine-tuning
system_txt = "You love and excel at generating SMILES strings of drug-like molecules"
user_txt = ("Output a SMILES string for a drug like molecule with the following "
            "properties: <= 5 H-bond donors, <= 10 H-bond acceptors, "
            "<= 500 molecular weight, <= 5 logP:")
prompt = f"### Instruction:\n{system_txt}\n\n### Input:\n{user_txt}\n\n### Response:\n"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

outputs = pipeline(
    prompt,
    max_new_tokens=128,
    do_sample=True,          # sampling is required for multiple distinct sequences
    temperature=1.0,
    num_return_sequences=4,
)

for out in outputs:
    # generated_text echoes the prompt; the SMILES string follows "### Response:"
    print(out["generated_text"].split("### Response:\n")[-1])
You can use num_return_sequences to generate many SMILES strings in parallel, though throughput is limited by available memory.
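Because each returned sequence echoes the full prompt, a small post-processing step can extract just the completions and drop duplicates. This is a sketch that assumes the "### Response:" template shown above; the function name is illustrative:

```python
def extract_smiles(outputs):
    """Pull the SMILES string out of each pipeline result and drop duplicates.

    Assumes each result's 'generated_text' contains the full prompt followed
    by the model's completion after the '### Response:' marker.
    """
    seen, smiles = set(), []
    for out in outputs:
        text = out["generated_text"].split("### Response:")[-1].strip()
        s = text.splitlines()[0].strip() if text else ""
        if s and s not in seen:
            seen.add(s)
            smiles.append(s)
    return smiles

# Example with mock pipeline output:
mock = [{"generated_text": "### Response:\nCCO"},
        {"generated_text": "### Response:\nCCO"},
        {"generated_text": "### Response:\nc1ccccc1"}]
print(extract_smiles(mock))  # ['CCO', 'c1ccccc1']
```

Note that this de-duplicates on the raw string only; two different SMILES strings can still denote the same molecule, so canonicalization with a cheminformatics toolkit such as RDKit is a sensible extra step.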