
SMILES-based Transformer Encoder-Decoder (SMI-TED)


This repository provides a HuggingFace-compatible version of the SMI-TED model, a SMILES-based Transformer Encoder-Decoder for chemical language modeling.


📦 Forked Resources

🏷️ Original Resources


🚀 Usage

Install the package from the shell:

pip install smi-ted

Then, in Python:

import torch
import smi_ted
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Load config, tokenizer, and model from HuggingFace Hub
config = AutoConfig.from_pretrained("bisectgroup/materials-smi-ted-fork")
tokenizer = AutoTokenizer.from_pretrained("bisectgroup/materials-smi-ted-fork")
model = AutoModel.from_pretrained("bisectgroup/materials-smi-ted-fork")

# Link tokenizer to model (required for SMILES reconstruction)
model.smi_ted.tokenizer = tokenizer
model.smi_ted.set_padding_idx_from_tokenizer()

# Example SMILES strings
smiles = [
    'CC1C2CCC(C2)C1CN(CCO)C(=O)c1ccc(Cl)cc1',
    'COc1ccc(-c2cc(=O)c3c(O)c(OC)c(OC)cc3o2)cc1O',
    'CCOC(=O)c1ncn2c1CN(C)C(=O)c1cc(F)ccc1-2',
    'Clc1ccccc1-c1nc(-c2ccncc2)no1',
    'CC(C)(Oc1ccc(Cl)cc1)C(=O)OCc1cccc(CO)n1'
]

# Encode and decode SMILES
with torch.no_grad():
    encoder_outputs = model.encode(smiles)
    decoded_smiles = model.decode(encoder_outputs)

print(decoded_smiles)
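
To judge how faithfully the model reconstructs its inputs, the decoded strings can be compared with the originals. The helper below is a minimal sketch, not part of the `smi_ted` package: `roundtrip_report` is a hypothetical name, and it assumes `model.decode` returns one SMILES string per input, in the same order. It uses exact string matching, so a chemically equivalent but differently written SMILES counts as a mismatch.

```python
def roundtrip_report(original, decoded):
    """Compare input SMILES with their reconstructions by exact string match.

    Hypothetical helper for illustration; not part of smi_ted or transformers.
    Returns (accuracy, mismatches), where mismatches is a list of
    (index, original, decoded) tuples.
    """
    if len(original) != len(decoded):
        raise ValueError("original and decoded must have the same length")
    mismatches = [
        (i, src, out)
        for i, (src, out) in enumerate(zip(original, decoded))
        if src.strip() != out.strip()
    ]
    total = len(original)
    accuracy = (total - len(mismatches)) / total if total else 0.0
    return accuracy, mismatches


# Stand-in data; in practice pass `smiles` and `decoded_smiles` from the snippet above.
example_original = ["CCO", "c1ccccc1", "CC(=O)O"]
example_decoded = ["CCO", "c1ccccc1", "CC(O)=O"]

accuracy, mismatches = roundtrip_report(example_original, example_decoded)
print(f"exact-match accuracy: {accuracy:.2f}")
for index, src, out in mismatches:
    print(f"  [{index}] {src} -> {out}")
```

In the stand-in data, the third pair describes the same molecule (acetic acid) written two ways, which shows why exact matching can undercount. Canonicalizing both strings with a cheminformatics toolkit such as RDKit before comparing would likely give a fairer measure.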

πŸ“ Citation

If you use this model, please cite:

@article{soares2025open,
  title={An open-source family of large encoder-decoder foundation models for chemistry},
  author={Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin},
  journal={Communications Chemistry},
  volume={8},
  number={1},
  pages={193},
  year={2025},
  publisher={Nature Publishing Group UK London}
}
@article{soares2024large,
  title={A large encoder-decoder family of foundation models for chemical language},
  author={Soares, Eduardo and Shirasuna, Victor and Brazil, Emilio Vital and Cerqueira, Renato and Zubarev, Dmitry and Schmidt, Kristin},
  journal={arXiv preprint arXiv:2407.20267},
  year={2024}
}

📧 Contact

For questions or collaborations, contact:


Note:
This fork adapts the original SMI-TED codebase for seamless integration with HuggingFace's AutoModel and AutoTokenizer interfaces. For full source code and training scripts, see the original IBM repo.
