Update config.json
Regardless of the modifications, lm_head is not visible as a separate tensor. Should this really be false?
It seems that either true is correct, or lm_head.weight went missing when the MoE was created.
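For what it's worth, a quick way to check is to look at the checkpoint's shard index (a minimal sketch, assuming a sharded checkpoint with a model.safetensors.index.json):

```python
import json

# Sketch: if lm_head.weight is absent from the shard index, the output head
# is tied to the input embeddings, which points to tie_word_embeddings: true.
with open("model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

print("lm_head.weight" in weight_map)            # False -> no separate lm_head
print("model.embed_tokens.weight" in weight_map) # the tied embedding tensor
```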
lgtm. i wonder if i may need to redo the whole thing due to the misconfig. thanks!
Check out the model I personally ported completely from scratch:
https://huggingface.co/minpeter/Voxtral-Mini-3B-Text-2507-hf
The code used for this can be found here: https://github.com/minpeter/Morphix/tree/master/voxtral-unplug-whisper
Thank you for always doing inspiring and fun experiments :)
@minpeter Oh hey! I vaguely recall bugging FriendliAI about something a few months ago.
Btw,
https://github.com/minpeter/Morphix/blob/master/voxtral-unplug-whisper/03-conv-hf-tokenizer.py
What does this part do? We weren't as thorough in converting Voxtral to Ministral, so I might use yours as a base moving forward.
It ports the mistral-common based tokenizer to Hugging Face's LlamaTokenizerFast.
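Roughly speaking (a minimal sketch of the idea, not the actual script): once a tokenizers-style tokenizer.json has been rebuilt from the tekken vocab, it just gets wrapped as a fast tokenizer. The special-token names below are assumptions, not necessarily what the script sets:

```python
from transformers import LlamaTokenizerFast

# Sketch: wrap a tokenizer.json rebuilt from the mistral-common (tekken)
# vocab as a Hugging Face fast tokenizer. Special tokens are assumptions.
tok = LlamaTokenizerFast(
    tokenizer_file="tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
)
tok.save_pretrained("Voxtral-Mini-3B-Text-2507-hf")
print(tok.tokenize("hello world"))
```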
I hope your inquiry was handled well. If there hasn't been any update, please reach out again and I'll take care of it quickly :)
Just started a training run. I'm getting better graphs with your base, thanks!
Good, I'm looking forward to the new model.
Oof
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:** There are 2 possible reasons for this:
WARNING:hf-to-gguf:** - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:** - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:** Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref: https://github.com/ggml-org/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh: caf2b48d95e818798cb565d97be5d194a283982f6b7a40c15d3655d510a8d24d
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:
Traceback (most recent call last):
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 1751, in set_vocab
self._set_vocab_sentencepiece()
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 768, in _set_vocab_sentencepiece
tokens, scores, toktypes = self._create_vocab_sentencepiece()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 785, in _create_vocab_sentencepiece
raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: Voxtral-RP-3B-v1g-Workspace/tokenizer.model
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 1754, in set_vocab
self._set_vocab_llama_hf()
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 870, in _set_vocab_llama_hf
vocab = gguf.LlamaHfVocab(self.dir_model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/llama.cpp/gguf-py/gguf/vocab.py", line 511, in __init__
raise TypeError('Llama 3 must be converted with BpeVocab')
TypeError: Llama 3 must be converted with BpeVocab
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 7856, in <module>
main()
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 7850, in main
model_instance.write()
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 411, in write
self.prepare_metadata(vocab_only=False)
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 524, in prepare_metadata
self.set_vocab()
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 1757, in set_vocab
self._set_vocab_gpt2()
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 704, in _set_vocab_gpt2
tokens, toktypes, tokpre = self.get_vocab_base()
^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 614, in get_vocab_base
tokpre = self.get_vocab_base_pre(tokenizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/./llama.cpp/convert_hf_to_gguf.py", line 692, in get_vocab_base_pre
raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
oh,, it seems like the problem is that GGUF conversion is impossible...
How do you handle it on other Mistral models?
I think it would be solved by leaving the weights as-is and replacing the tokenizer with a compatible one... Will the GGUF convert script accept tekken.json?
Fixed it by editing convert_hf_to_gguf.py and removing the condition:

# NOTE: if you get an error here, you need to update the convert_hf_to_gguf_update.py script
# or pull the latest version of the model from Huggingface
# don't edit the hashes manually!
if chkhsh == "63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e": # just this one
    # ref: https://huggingface.co/mistralai/Mistral-Nemo-Base-2407
    res = "tekken"

forcing res to be tekken regardless.
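In effect, the patched method just becomes (a sketch of the hack, not the exact edit):

```python
def get_vocab_base_pre(self, tokenizer) -> str:
    # HACK: bypass the chkhsh table and always report "tekken", since this
    # tokenizer's hash isn't recognized yet. Only safe because the tokenizer
    # really is Mistral's tekken-style BPE.
    return "tekken"
```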
Tried dropping in tekken.json, but that failed. I also added some tokens. Fortunately, the hack above was enough to get a GGUF out.
llama_model_load: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file
llama_model_load_from_file_impl: failed to load model
Nope, still broken lol
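That error usually means the converted tokenizer.json carries no BPE merges list for llama.cpp to copy into the GGUF. One way to check the source file (a quick sketch, assuming the standard tokenizers layout):

```python
import json

# Sketch: the GGUF converter copies BPE merges out of tokenizer.json's
# "model" section; if they're missing there, llama.cpp can't load the vocab.
with open("tokenizer.json") as f:
    model = json.load(f)["model"]

print(model.get("type"))             # expect "BPE"
print(len(model.get("merges", [])))  # 0 would explain the error above
```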
To be honest, I have absolutely no knowledge about GGUF...
I guess I'll have to look into how to port the tokenizer to GGUF later when I have time.
How is the tokenizer in this version (Mixtral-4x3B-v1)?
Still couldn't get it to work even after wrangling config.json and tokenizer.json. Might have been a mistake to add tokens. Ah well.
> How is the tokenizer in this version (Mixtral-4x3B-v1)?
This one doesn't have added tokens
I need to figure out how to convert my custom tokenizer to GGUF,, I'll leave a comment when I figure something out.
Wait... your model used a custom tokenizer?
In the process of converting the Mistral tokenizer, a new tokenizer was created, so I think it should be called a "custom tokenizer".
Could you share the new model weights? I think I can convert them to GGUF.
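For reference, the conversion itself would be something like this (the output filename is just a placeholder):

```
python llama.cpp/convert_hf_to_gguf.py ./Voxtral-RP-3B-v1g --outtype f16 --outfile voxtral-rp-3b-f16.gguf
```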
Oh, I forgot that there are additional tokens in the tokenizer,, oh my :(
https://huggingface.co/TheDrummer/Voxtral-RP-3B-v1g
here you go. could you req access? not ideal to be doing a workaround quant every time though.