ScentLLaMA

A tiny LLaMA-based language model with ~600k parameters, pretrained on the synthetic ScentSet dataset (572k entries, ~15M tokens).
It is designed exclusively to describe and classify smells and aromas.

Model Details

  • Parameters: ~607,000 (F32 safetensors; see the quick check after this list)
  • Task: Text generation of smell descriptions
  • Training data: ScentSet (synthetic dataset of smell descriptions)
  • Training date: July 2025
  • License: CC BY 4.0
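
A minimal sketch for verifying the parameter count locally. It uses only the model name from this card and standard transformers/PyTorch calls; nothing model-specific is assumed:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sixf0ur/ScentLLaMA")

# Sum element counts over all weight tensors; should print roughly 607k.
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")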

📉 Training & Evaluation Loss

The following plot shows the training and evaluation loss over time.
Training was performed for approximately 160,000 steps.

The evaluation loss remains consistently close to the training loss throughout training (within ~0.01),
indicating that the model generalizes well and shows no signs of overfitting.

[Figure: training and evaluation loss curves]
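
For reference, a minimal sketch of how a comparable evaluation loss can be computed on held-out text. The sample sentence below is illustrative, not taken from ScentSet:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sixf0ur/ScentLLaMA")
model = AutoModelForCausalLM.from_pretrained("sixf0ur/ScentLLaMA")
model.eval()

text = "A warm, smoky aroma with notes of cedar."  # hypothetical held-out sample
enc = tokenizer(text, return_token_type_ids=False, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss,
    # the same quantity plotted in the curves above.
    out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
print(f"eval loss: {out.loss.item():.4f}")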

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sixf0ur/ScentLLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Encode the prompt and generate a short continuation (greedy decoding by default).
prompt = "A fresh and fruity aroma with hints of"
inputs = tokenizer(prompt, return_token_type_ids=False, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=25)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# > A fresh and fruity aroma with hints of green leaves and a hint of something earthy. It is a ripe plum.
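
The call above decodes greedily. For more varied descriptions, sampling can be enabled with the standard generate() flags; the specific values below are illustrative, not tuned for this model:

# Sampled generation: temperature and nucleus (top-p) filtering add variety.
outputs = model.generate(
    **inputs,
    max_new_tokens=25,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))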

Citation

@misc{ScentLLaMA_2025,
  author       = {David S.},
  title        = {ScentLLaMA: A tiny LLaMA Model for Smell Description Generation},
  year         = {2025},
  publisher    = {Hugging Face Models},
  howpublished = {\url{https://huggingface.co/sixf0ur/ScentLLaMA}},
  note         = {Pretrained on the ScentSet dataset to generate natural language descriptions of smells}
}