SmileyLlama
We fine-tuned Llama-3.1-8B-Instruct on the task of generating SMILES string representations of molecules, using a training set of a few million molecules. This gives us a model, SmileyLlama, which can generate SMILES strings of drug-like molecules on demand.
For more details, read the arXiv preprint here: https://arxiv.org/abs/2409.02231
SmileyLlama can be loaded the same way as Llama-3.1, and its memory requirements are the same as Llama-3.1-8B.
Options for the "properties" that SmileyLlama was trained on are:
( <= 3, <= 4, <= 5, <= 7, > 7) H-bond donors
( <= 3, <= 4, <= 5, <= 10, <= 15) H-bond acceptors
( <= 300, <= 400, <= 500, <= 600, > 600) Molecular weight
( <= 3, <= 4, <= 5, <= 10, <= 15, > 15) logP
( <= 7, <= 10, > 10) Rotatable bonds
( < 0.4, > 0.4, > 0.5, > 0.6) Fraction sp3
( <= 90, <= 140, <= 200, > 200) TPSA
(a macrocycle, no macrocycles)
(has, lacks) bad SMARTS
lacks covalent warheads
has covalent warheads: (sulfonyl fluorides, acrylamides, ...) (see below for details)
A substructure of *SMILES_STRING*
A chemical of *CHEMICAL_FORMULA*
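As a sketch of how these options compose into a prompt, the helper below joins a list of property strings into the Alpaca-style template used in the example later in this card (the function name is illustrative, not part of the model's API):

```python
def build_prompt(properties):
    """Assemble a SmileyLlama-style prompt from a list of property strings."""
    system_txt = ("You love and excel at generating SMILES strings "
                  "of drug-like molecules")
    user_txt = ("Output a SMILES string for a drug like molecule "
                "with the following properties: " + ", ".join(properties) + ":")
    return f"### Instruction:\n{system_txt}\n\n### Input:\n{user_txt}\n\n### Response:\n"

prompt = build_prompt(["<= 5 H-bond donors", "<= 10 H-bond acceptors",
                       "<= 500 molecular weight", "<= 5 logP"])
print(prompt)
```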
The covalent warhead SMARTS patterns referenced above are:
[#16](=[#8])(=[#8])-[#9]
[#8]=[#6](-[#6]-[#17])-[#7]
[#7]-[#6](=[#8])-[#6](-[#6]#[#7])=[#6]
[#6]1-[#6]-[#8]-1
[#6]1-[#6]-[#7]-1
[#16]-[#16]
[#6](=[#8])-[#1]
[#6]=[#6]-[#16](=[#8])(=[#8])-[#7]
[#6]-[#5](-[#8])-[#8]
[#6]=[#6]-[#6](=[#8])-[#7]
[#6]-[#7](-[#6]#[#7])-[#6]
[#7]-[#6](=[#8])-[#6](-[#9])-[#17]
[#6]#[#6]-[#6](=[#8])-[#7]-[#6]
[#7]-[#6](=[#8])-[#6](-[#6])-[#17]
[#8]=[#16](=[#8])(-[#9])-[#8]
[#7]1-[#6]-[#6]-[#6]-1=[#8]
import torch
import transformers

model_id = "/path/to/your/model"

# Alpaca-style prompt format used during fine-tuning
system_txt = "You love and excel at generating SMILES strings of drug-like molecules"
user_txt = ("Output a SMILES string for a drug like molecule with the following "
            "properties: <= 5 H-bond donors, <= 10 H-bond acceptors, "
            "<= 500 molecular weight, <= 5 logP:")
prompt = f"### Instruction:\n{system_txt}\n\n### Input:\n{user_txt}\n\n### Response:\n"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

outputs = pipeline(
    prompt,
    max_new_tokens=128,
    do_sample=True,          # sampling is required for multiple distinct sequences
    temperature=1.0,
    num_return_sequences=4,
)

for out in outputs:
    # generated_text echoes the prompt; the SMILES string follows "### Response:"
    print(out["generated_text"].split("### Response:\n")[-1])
You can use num_return_sequences to generate many SMILES strings in parallel, though throughput is limited by available memory.
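Because each returned sequence echoes the full prompt, a small post-processing step can extract just the completions and drop duplicates. This is a sketch that assumes the "### Response:" template shown above; the function name is illustrative:

```python
def extract_smiles(outputs):
    """Pull the SMILES string out of each pipeline result and drop duplicates.

    Assumes each result's 'generated_text' contains the full prompt followed
    by the model's completion after the '### Response:' marker.
    """
    seen, smiles = set(), []
    for out in outputs:
        text = out["generated_text"].split("### Response:")[-1].strip()
        s = text.splitlines()[0].strip() if text else ""
        if s and s not in seen:
            seen.add(s)
            smiles.append(s)
    return smiles

# Example with mock pipeline output:
mock = [{"generated_text": "### Response:\nCCO"},
        {"generated_text": "### Response:\nCCO"},
        {"generated_text": "### Response:\nc1ccccc1"}]
print(extract_smiles(mock))  # ['CCO', 'c1ccccc1']
```

Note that this de-duplicates on the raw string only; two different SMILES strings can still denote the same molecule, so canonicalization with a cheminformatics toolkit such as RDKit is a sensible extra step.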