Tokenizer error when trying to run pipeline on CPU
#2 · opened by mapto
Hello, thanks for making this model available. I'm doing research on models for Latin and yours seems to be an important contribution in this area.
However, I encounter a problem. Using transformers==4.53.2 I get the error below. Any idea how to overcome it?
```
Device set to use cpu
Traceback (most recent call last):
  File "fill.py", line 163, in <module>
    pipeline("fill-mask", model="HPLT/hplt_bert_base_la", token=transformers_token, trust_remote_code=True)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mapto/work/islab/mlm-latin/venv/lib/python3.12/site-packages/transformers/pipelines/__init__.py", line 1268, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mapto/work/islab/mlm-latin/venv/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1109, in __init__
    self._preprocess_params, self._forward_params, self._postprocess_params = self._sanitize_parameters(**kwargs)
                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mapto/work/islab/mlm-latin/venv/lib/python3.12/site-packages/transformers/pipelines/fill_mask.py", line 242, in _sanitize_parameters
    if self.tokenizer.mask_token_id is None:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'mask_token_id'
```
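For context, the traceback means the pipeline never got a tokenizer: when one cannot be auto-resolved for the repo, `self.tokenizer` stays `None`, and the mask-token check then dereferences it. A minimal stdlib-only sketch of that failure mode (illustrative names, not the real transformers internals):

```python
# Sketch (not the actual transformers source) of why a missing tokenizer
# surfaces as an AttributeError rather than a clearer error message.
class FillMaskSketch:
    def __init__(self, tokenizer=None):
        # If the pipeline cannot auto-load a tokenizer for the model repo,
        # this attribute remains None instead of a tokenizer object.
        self.tokenizer = tokenizer

    def sanitize(self):
        # Dereferencing None here raises:
        # AttributeError: 'NoneType' object has no attribute 'mask_token_id'
        if self.tokenizer.mask_token_id is None:
            raise ValueError("tokenizer has no mask token")


try:
    FillMaskSketch().sanitize()
except AttributeError as exc:
    print(exc)
```

This is why passing an explicitly loaded tokenizer to `pipeline(...)`, as in the workaround below, avoids the crash.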
This initialisation got rid of the errors, but the model now predicts weird tokens:
```python
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_la")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_la", trust_remote_code=True)
pipeline("fill-mask", model, tokenizer=tokenizer, token=transformers_token, trust_remote_code=True)
```
Note that you should be careful not to put a space in front of the mask token when using HF pipelines. Now, when you use the fixed model like this:
```python
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_la")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_la", trust_remote_code=True)
pip = pipeline("fill-mask", model, tokenizer=tokenizer, trust_remote_code=True)
pip("Ubi autem[MASK] planitie potuerint reperiri,")
```
You should get this output:
```python
[{'score': 0.6775742173194885,
  'token': 402,
  'token_str': 'in',
  'sequence': 'Ubi autem in planitie potuerint reperiri,'},
 {'score': 0.11163018643856049,
  'token': 516,
  'token_str': 'de',
  'sequence': 'Ubi autem de planitie potuerint reperiri,'},
 {'score': 0.06605511158704758,
  'token': 479,
  'token_str': 'ex',
  'sequence': 'Ubi autem ex planitie potuerint reperiri,'},
 {'score': 0.04487193748354912,
  'token': 365,
  'token_str': 'a',
  'sequence': 'Ubi autem a planitie potuerint reperiri,'},
 {'score': 0.013039215467870235,
  'token': 364,
  'token_str': 'e',
  'sequence': 'Ubi autem e planitie potuerint reperiri,'}]
```
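The fill-mask pipeline returns its candidates as a list of dicts sorted by score, so post-processing is plain Python. A small sketch of picking the top fill, using data abridged from the output above:

```python
# Candidates abridged from the pipeline output above; each dict carries the
# candidate's probability ('score'), vocabulary id ('token'), and surface
# form ('token_str').
predictions = [
    {"score": 0.6775742173194885, "token": 402, "token_str": "in"},
    {"score": 0.11163018643856049, "token": 516, "token_str": "de"},
    {"score": 0.06605511158704758, "token": 479, "token_str": "ex"},
]

# Select the highest-scoring candidate (the list is already sorted, but
# max() makes the intent explicit and is robust to unsorted input).
best = max(predictions, key=lambda p: p["score"])
print(best["token_str"])  # in
```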
Hello and thank you for your response. I've tried it again and the model now generates meaningful responses.
Your remark about the whitespace is also much appreciated, as many examples online disregard this.
mapto changed discussion status to closed