
Built With Llama!

Built With Axolotl!

Overview

We fine-tuned Llama-3.1-8B-Instruct to generate SMILES string representations of molecules, training on a few million molecules. The result, SmileyLlama, generates SMILES strings of drug-like molecules on demand.

For more details, see the arXiv preprint: https://arxiv.org/abs/2409.02231

How to use

SmileyLlama is loaded the same way as Llama-3.1, and its memory requirements are the same as Llama-3.1-8B.

Options for "properties" that SmileyLlama was trained on are listed below (a sketch for assembling them into a prompt follows the list):

  • ( <= 3, <= 4, <= 5, <= 7, > 7) H-bond donors
  • ( <= 3, <= 4, <= 5, <= 10, <= 15) H-bond acceptors
  • ( <= 300, <= 400, <= 500, <= 600, > 600) Molecular weight
  • ( <= 3, <= 4, <= 5, <= 10, <= 15, > 15) logP
  • ( <= 7, <= 10, > 10) Rotatable bonds
  • ( < 0.4, > 0.4, > 0.5, > 0.6) Fraction sp3
  • ( <= 90, <= 140, <= 200, > 200) TPSA
  • (a macrocycle, no macrocycles)
  • (has, lacks) bad SMARTS
  • lacks covalent warheads
  • has covalent warheads: (sulfonyl fluorides, acrylamides, ...) (see below for details)
  • A substructure of *SMILES_STRING*
  • A chemical of *CHEMICAL_FORMULA*
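Each prompt combines the chosen property phrases into the Input line of the Alpaca-style format shown in the Lipinski example below. As a minimal sketch (the build_property_prompt helper and the comma-joining convention are illustrative assumptions, not part of the released card):

def build_property_prompt(properties):
    # Hypothetical helper: join the chosen property phrases into the
    # Alpaca-style prompt format used in the example below.
    system_text = "You love and excel at generating SMILES strings of drug-like molecules"
    user_text = ("Output a SMILES string for a drug like molecule "
                 "with the following properties: " + ", ".join(properties) + ":")
    return f"### Instruction:\n{system_text}\n\n### Input:\n{user_text}\n\n### Response:\n"

prompt = build_property_prompt(
    ["<= 5 H-bond donors", "<= 10 H-bond acceptors",
     "<= 500 molecular weight", "<= 5 logP"]
)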

List of possible warheads, given as SMARTS patterns (a sketch for matching them follows the list):

  • sulfonyl fluorides: [#16](=[#8])(=[#8])-[#9]
  • chloroacetamides: [#8]=[#6](-[#6]-[#17])-[#7]
  • cyanoacrylamides: [#7]-[#6](=[#8])-[#6](-[#6]#[#7])=[#6]
  • epoxides: [#6]1-[#6]-[#8]-1
  • aziridines: [#6]1-[#6]-[#7]-1
  • disulfides: [#16]-[#16]
  • aldehydes: [#6](=[#8])-[#1]
  • vinyl sulfones: [#6]=[#6]-[#16](=[#8])(=[#8])-[#7]
  • boronic acids/esters: [#6]-[#5](-[#8])-[#8]
  • acrylamides: [#6]=[#6]-[#6](=[#8])-[#7]
  • cyanamides: [#6]-[#7](-[#6]#[#7])-[#6]
  • chloroFluoroAcetamides: [#7]-[#6](=[#8])-[#6](-[#9])-[#17]
  • butynamides: [#6]#[#6]-[#6](=[#8])-[#7]-[#6]
  • chloropropionamides: [#7]-[#6](=[#8])-[#6](-[#6])-[#17]
  • fluorosulfates: [#8]=[#16](=[#8])(-[#9])-[#8]
  • beta lactams: [#7]1-[#6]-[#6]-[#6]-1=[#8]
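Because the warheads are plain SMARTS patterns, a generated molecule can be checked against them with any cheminformatics toolkit. A minimal sketch using RDKit (RDKit is an assumption here; the card does not prescribe a toolkit):

from rdkit import Chem

# Warhead SMARTS from the list above (two shown for brevity).
WARHEADS = {
    "sulfonyl fluorides": "[#16](=[#8])(=[#8])-[#9]",
    "acrylamides": "[#6]=[#6]-[#6](=[#8])-[#7]",
}

def matched_warheads(smiles):
    # Return the warhead names whose pattern occurs in the molecule,
    # or None if the SMILES string does not parse.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return [name for name, smarts in WARHEADS.items()
            if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))]

print(matched_warheads("C=CC(=O)N1CCCC1"))  # an acrylamide -> ['acrylamides']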

Generating a drug-like molecule that obeys the Lipinski Rule of Five

import torch
import transformers

model_id = "THGLab/Llama-3.1-8B-SmileyLlama-1.1"  # or a local path to the weights

system_text = "You love and excel at generating SMILES strings of drug-like molecules"
user_text = "Output a SMILES string for a drug like molecule with the following properties: <= 5 H-bond donors, <= 10 H-bond acceptors, <= 500 molecular weight, <= 5 logP:"
# Alpaca-style prompt format used during fine-tuning
prompt = f"### Instruction:\n{system_text}\n\n### Input:\n{user_text}\n\n### Response:\n"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
    temperature=1.0
)

outputs = pipeline(
    prompt,
    do_sample=True,  # sampling must be enabled for temperature to take effect
    max_new_tokens=128,
    num_return_sequences=4
)
# Each output contains the prompt followed by the response; strip the
# prompt so that only the generated SMILES string is printed.
for out in outputs:
    print(out["generated_text"][len(prompt):].strip())

You can increase num_return_sequences to generate many SMILES strings in a single batch, though the batch size is ultimately limited by available GPU memory.
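Not every sample is guaranteed to parse, so it is worth filtering a batch down to valid, unique molecules. A minimal sketch, again assuming RDKit for validation (the unique_valid_smiles helper is hypothetical):

from rdkit import Chem

def unique_valid_smiles(raw_outputs, prompt):
    # Keep only outputs RDKit can parse, canonicalize them, drop duplicates.
    seen = set()
    for out in raw_outputs:
        smiles = out["generated_text"][len(prompt):].strip()
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            canonical = Chem.MolToSmiles(mol)
            if canonical not in seen:
                seen.add(canonical)
                yield canonical

for s in unique_valid_smiles(outputs, prompt):
    print(s)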
