The SentencePiece tokenizer was trained on a corpus of 269 million Russian search queries.
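As a minimal sketch (the tokenizer is published together with the model checkpoint, as the usage example further below shows), the subword segmentation of a short Russian query can be inspected directly:

```python
from transformers import AutoTokenizer

# The SentencePiece-based tokenizer shipped with the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained('fkrasnov2/SBE')

# Subword pieces and token ids for a short Russian query ("black dress").
print(tokenizer.tokenize("чёрное платье"))
print(tokenizer.encode("чёрное платье"))
```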

The encoder model was trained for the e-commerce search-query similarity task; the search queries are short.

The manually annotated validation dataset comprises 362,000 instances.

Fedor Krasnov, Fedor Kurushin, and Egor Mogilevich, "Custom shared encoder for enhanced recall in e-commerce product search task," Proc. SPIE 13730, Second International Conference on Computing, Machine Learning, and Data Science (CMLDS 2025), 137300B (21 July 2025).

Validation results


```python
# don't forget:
# pip install protobuf sentencepiece

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('fkrasnov2/SBE')
tokenizer = AutoTokenizer.from_pretrained('fkrasnov2/SBE')

# Encode a short Russian query ("black dress"), truncated to the model's maximum length.
input_ids = tokenizer.encode(
    "чёрное платье",
    max_length=model.config.max_position_embeddings,
    truncation=True,
    return_tensors='pt',
)

model.eval()
# The query embedding is the first token's last-layer hidden state.
vector = model(input_ids=input_ids, attention_mask=input_ids != tokenizer.pad_token_id)[0][0, 0]

assert model.config.hidden_size == vector.shape[0]
```
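For embedding many queries at once, a batched variant of the same call can be sketched as follows (a minimal sketch; the pooling, taking the first token's hidden state, mirrors the single-query example above):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('fkrasnov2/SBE')
tokenizer = AutoTokenizer.from_pretrained('fkrasnov2/SBE')
model.eval()

queries = ["чёрное платье", "красное платье"]  # "black dress", "red dress"

# Pad to the longest query in the batch; attention_mask marks real tokens only.
batch = tokenizer(
    queries,
    padding=True,
    truncation=True,
    max_length=model.config.max_position_embeddings,
    return_tensors='pt',
)

with torch.no_grad():
    outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])

# First-token hidden state as the query vector, as in the single-query example.
vectors = outputs[0][:, 0]  # shape: (batch_size, hidden_size)
assert vectors.shape == (len(queries), model.config.hidden_size)
```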

This model is intended for e-commerce information retrieval (IR); its embeddings help differentiate products, as the cosine-similarity examples below illustrate (a code sketch follows the examples).

Same products:

  • cos ( SBE("apple 16 синий про макс 256"), SBE("iphone 16 синий pro max 256") ) = 0.96

  • cos ( SBE("iphone 15 pro max"), SBE("айфон 15 про макс") ) = 0.98

Different products:

  • cos ( SBE("iphone 15 pro max"), SBE("iphone 16 pro max") ) = 0.85
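A minimal sketch of how such cosine scores could be reproduced: the `sbe` helper below is a hypothetical shorthand for the first-token embedding from the usage example above, which is what the `SBE(...)` notation denotes.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('fkrasnov2/SBE')
tokenizer = AutoTokenizer.from_pretrained('fkrasnov2/SBE')
model.eval()

def sbe(query: str) -> torch.Tensor:
    # Hypothetical helper: the first-token hidden state is used as the query
    # embedding, mirroring the single-query usage example above.
    input_ids = tokenizer.encode(
        query,
        max_length=model.config.max_position_embeddings,
        truncation=True,
        return_tensors='pt',
    )
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=input_ids != tokenizer.pad_token_id)
    return out[0][0, 0]

# Same product written in Latin and Cyrillic script.
print(float(F.cosine_similarity(sbe("iphone 15 pro max"), sbe("айфон 15 про макс"), dim=0)))
```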