Tokenizer error when trying to run pipeline on CPU
#2 · opened by mapto
Hello, thanks for making this model available. I'm doing research on models for Latin and yours seems to be an important contribution in this area.
However, I encounter a problem. Using transformers==4.53.2 I get the error below. Any idea how to overcome it?
```
Device set to use cpu
Traceback (most recent call last):
  File "fill.py", line 163, in <module>
    pipeline("fill-mask", model="HPLT/hplt_bert_base_la", token=transformers_token, trust_remote_code=True)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mapto/work/islab/mlm-latin/venv/lib/python3.12/site-packages/transformers/pipelines/__init__.py", line 1268, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mapto/work/islab/mlm-latin/venv/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1109, in __init__
    self._preprocess_params, self._forward_params, self._postprocess_params = self._sanitize_parameters(**kwargs)
                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mapto/work/islab/mlm-latin/venv/lib/python3.12/site-packages/transformers/pipelines/fill_mask.py", line 242, in _sanitize_parameters
    if self.tokenizer.mask_token_id is None:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'mask_token_id'
```
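For context, the traceback means the pipeline never got a tokenizer: when one cannot be auto-resolved for the repo, `self.tokenizer` stays `None`, and the mask-token check then dereferences it. A minimal stdlib-only sketch of that failure mode (illustrative names, not the real transformers internals):

```python
# Sketch (not the actual transformers source) of why a missing tokenizer
# surfaces as an AttributeError rather than a clearer error message.
class FillMaskSketch:
    def __init__(self, tokenizer=None):
        # If the pipeline cannot auto-load a tokenizer for the model repo,
        # this attribute remains None instead of a tokenizer object.
        self.tokenizer = tokenizer

    def sanitize(self):
        # Dereferencing None here raises:
        # AttributeError: 'NoneType' object has no attribute 'mask_token_id'
        if self.tokenizer.mask_token_id is None:
            raise ValueError("tokenizer has no mask token")


try:
    FillMaskSketch().sanitize()
except AttributeError as exc:
    print(exc)
```

This is why passing an explicitly loaded tokenizer to `pipeline(...)`, as in the workaround below, avoids the crash.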
This initialisation got rid of the errors, but the model now predicts weird tokens:
```python
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_la")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_la", trust_remote_code=True)
pipeline("fill-mask", model, tokenizer=tokenizer, token=transformers_token, trust_remote_code=True)
```
Note that you should be careful not to put a space in front of the mask token when using HF pipelines. Now, when you use the fixed model like this:
```python
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_la")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_la", trust_remote_code=True)
pip = pipeline("fill-mask", model, tokenizer=tokenizer, trust_remote_code=True)
pip("Ubi autem[MASK] planitie potuerint reperiri,")
```
You should get this output:
```python
[{'score': 0.6775742173194885,
  'token': 402,
  'token_str': 'in',
  'sequence': 'Ubi autem in planitie potuerint reperiri,'},
 {'score': 0.11163018643856049,
  'token': 516,
  'token_str': 'de',
  'sequence': 'Ubi autem de planitie potuerint reperiri,'},
 {'score': 0.06605511158704758,
  'token': 479,
  'token_str': 'ex',
  'sequence': 'Ubi autem ex planitie potuerint reperiri,'},
 {'score': 0.04487193748354912,
  'token': 365,
  'token_str': 'a',
  'sequence': 'Ubi autem a planitie potuerint reperiri,'},
 {'score': 0.013039215467870235,
  'token': 364,
  'token_str': 'e',
  'sequence': 'Ubi autem e planitie potuerint reperiri,'}]
```
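The fill-mask pipeline returns its candidates as a list of dicts sorted by score, so post-processing is plain Python. A small sketch of picking the top fill, using data abridged from the output above:

```python
# Candidates abridged from the pipeline output above; each dict carries the
# candidate's probability ('score'), vocabulary id ('token'), and surface
# form ('token_str').
predictions = [
    {"score": 0.6775742173194885, "token": 402, "token_str": "in"},
    {"score": 0.11163018643856049, "token": 516, "token_str": "de"},
    {"score": 0.06605511158704758, "token": 479, "token_str": "ex"},
]

# Select the highest-scoring candidate (the list is already sorted, but
# max() makes the intent explicit and is robust to unsorted input).
best = max(predictions, key=lambda p: p["score"])
print(best["token_str"])  # in
```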
Hello and thank you for your response. I've tried it again and the model now generates meaningful responses.
Your remark about the whitespace is also much appreciated, as many examples online disregard this.
mapto changed discussion status to closed