Tokenizer error when trying to run pipeline on CPU

#2
by mapto - opened

Hello, thanks for making this model available. I'm doing research on models for Latin and yours seems to be an important contribution in this area.
However, I have encountered a problem: using transformers==4.53.2, I get the error below. Any idea how to overcome it?

Device set to use cpu
Traceback (most recent call last):
  File "fill.py", line 163, in <module>
    pipeline("fill-mask", model="HPLT/hplt_bert_base_la", token=transformers_token, trust_remote_code=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mapto/work/islab/mlm-latin/venv/lib/python3.12/site-packages/transformers/pipelines/__init__.py", line 1268, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mapto/work/islab/mlm-latin/venv/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1109, in __init__
    self._preprocess_params, self._forward_params, self._postprocess_params = self._sanitize_parameters(**kwargs)
                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mapto/work/islab/mlm-latin/venv/lib/python3.12/site-packages/transformers/pipelines/fill_mask.py", line 242, in _sanitize_parameters
    if self.tokenizer.mask_token_id is None:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'mask_token_id'

This initialisation got rid of the errors, but the model now predicts weird tokens:

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_la")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_la", trust_remote_code=True)
pipeline("fill-mask", model, tokenizer=tokenizer, token=transformers_token, trust_remote_code=True)
HPLT org

Hi @mapto
Can you try it now? It looks like the HF auto-conversion of the model to safetensors broke something. We have removed the auto-converted model, so it should work now.

HPLT org

Note that you should be careful not to put a space in front of the mask token when using HF pipelines. If you now use the fixed model like this:

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_la")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_la", trust_remote_code=True)
pip = pipeline("fill-mask", model, tokenizer=tokenizer, trust_remote_code=True)
pip("Ubi autem[MASK] planitie potuerint reperiri,")

You should get this output:

[{'score': 0.6775742173194885,
  'token': 402,
  'token_str': 'in',
  'sequence': 'Ubi autem in planitie potuerint reperiri,'},
 {'score': 0.11163018643856049,
  'token': 516,
  'token_str': 'de',
  'sequence': 'Ubi autem de planitie potuerint reperiri,'},
 {'score': 0.06605511158704758,
  'token': 479,
  'token_str': 'ex',
  'sequence': 'Ubi autem ex planitie potuerint reperiri,'},
 {'score': 0.04487193748354912,
  'token': 365,
  'token_str': 'a',
  'sequence': 'Ubi autem a planitie potuerint reperiri,'},
 {'score': 0.013039215467870235,
  'token': 364,
  'token_str': 'e',
  'sequence': 'Ubi autem e planitie potuerint reperiri,'}]

Hello, and thank you for your response. I have tried it again, and the model now generates meaningful predictions.
Your remark about the whitespace is also much appreciated, as many examples online disregard this.

mapto changed discussion status to closed
