
Error when converting tokenizer from Mistral Large Instruct 2411: "Exception: Cannot resolve bosId or eosIds" #185

Open
philigrale opened this issue Mar 9, 2025 · 8 comments

@philigrale

Hi,
I want to try this program because I love the concept. (I am using version 0.12.8)

I have already converted Mistral Large Instruct 2411 into the model file, but when I try to convert the tokenizer, I get the following error:

line 109, in <module>
    raise Exception('Cannot resolve bosId or eosIds')
Exception: Cannot resolve bosId or eosIds

I am not sure what is causing this. Has anyone else had this problem?

Thanks a lot for any help!

@b4rtaz
Owner

b4rtaz commented Mar 9, 2025

Hello @philigrale,

I pushed a fix. I was able to convert the tokenizer for this model, but I haven't checked if it works with the converted model.

@philigrale
Author

Hello @b4rtaz,

thank you very much for the quick reply and fix. I tried it just now, and the conversion works. Thank you!

Unfortunately, when I run the model I get the following error:

dllama: src/tokenizer.cpp:262: void Tokenizer::encode(char*, int*, int*, bool, bool): Assertion `strLen == 0' failed.

These are my start arguments:

./dllama chat --model /path/to/my/dllama_model_mistral-large_q40.m --tokenizer /path/to/my/dllama_tokenizer_mistral-large.t --buffer-float-type q80 --max-seq-len 4096 --nthreads 8 --workers (my worker ip address and port)

Here is the full output:

📄 BosId: 1 (<s>)
📄 EosId: 2 (</s>) 2 (</s>) 
📄 RegularVocabSize: 1
📄 SpecialVocabSize: 32767
💡 Arch: Llama
💡 HiddenAct: Silu
💡 Dim: 12288
💡 HiddenDim: 28672
💡 VocabSize: 32768
💡 nLayers: 88
💡 nHeads: 96
💡 nKvHeads: 8
💡 OrigSeqLen: 131072
💡 SeqLen: 4096
💡 NormEpsilon: 0.000010
💡 RopeType: Llama
💡 RopeTheta: 1000000
📀 RequiredMemory: 36704073 kB
⭕ Socket[0]: connecting to 192.168.178.84:9998 worker
⭕ Socket[0]: connected
⭕ Network is initialized
🧠 CPU: avx2
💿 Loading weights...
💿 Weights loaded
⭐ Chat template: llama2
🛑 Stop: </s>
🛑 Stop: </s>
💻 System prompt (optional): you are an IT-Expert

👱 User
> Hi there
dllama: src/tokenizer.cpp:262: void Tokenizer::encode(char*, int*, int*, bool, bool): Assertion `strLen == 0' failed.
Abgebrochen

Thank you.

@b4rtaz
Copy link
Owner

b4rtaz commented Mar 9, 2025

Thanks for checking it. Unfortunately, it seems that determining what is wrong requires more effort. Mistral is a low priority for now, so this problem will be addressed much later.

@philigrale
Author

Thanks. That's unfortunate to hear; I was looking forward to using this model. But thank you very much for your efforts!

@antoine-sac
Contributor

The error comes from the vocabulary not being parsed correctly, which you can see from these two lines:

📄 RegularVocabSize: 1
📄 SpecialVocabSize: 32767

This seems to be caused by the tokenizer expecting the bosId to separate the regular vocab from the special vocab. When the bosId is the first element of the vocab, this assumption does not hold, so the vocab is not parsed correctly.

This is fairly common and not specific to Mistral; I had the same issue trying to run TinyLlama.

There is actually a TODO line about this in the code:

// TODO: this is very unstable assumption that bosId splits regular and special vocab
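
To make the failure mode concrete, here is a minimal Python sketch of that assumption (names and indexing are illustrative, not the converter's actual code):

def split_vocab(tokens, bos_id):
    # Assumed current behaviour: everything before the BOS token is
    # treated as regular vocab, everything from it onward as special.
    regular = tokens[:bos_id]
    special = tokens[bos_id:]
    return regular, special

In Mistral Large 2411 the special tokens come first (typically <unk>=0, <s>=1, </s>=2), so splitting at bos_id == 1 yields a regular vocab of size 1 and a special vocab of size 32767, exactly the RegularVocabSize and SpecialVocabSize values in the output above. The encoder is then left with almost no regular tokens, which is what trips the `strLen == 0' assertion.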

@philigrale
Author

Thank you for the explanation!
For me personally this is too complex; my knowledge of model architectures and their usage is not yet deep enough.
Would you consider this difficult to fix, or has it simply not been relevant enough until now?
Thanks for your help and time.

@antoine-sac
Contributor

antoine-sac commented Mar 22, 2025

I patched the vocabulary parsing for my use case yesterday. It's not very pretty nor at all generic, but it should work for you as well. You may also need to patch the token converter to handle byte tokens correctly (sketched below).

You may need to change the specialVocabSize, which is currently hardcoded to 3 (for bos, eos, and unk as the first three tokens).
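
By "byte tokens" I mean SentencePiece entries like <0x0A> that stand for a single raw byte. A minimal sketch of the conversion I have in mind (illustrative, not the converter's actual code):

import re

def token_to_bytes(token: str) -> bytes:
    # SentencePiece byte-fallback tokens are written as "<0xNN>"; map
    # them to the raw byte they represent, and encode everything else
    # as plain UTF-8.
    m = re.fullmatch(r'<0x([0-9A-Fa-f]{2})>', token)
    if m:
        return bytes([int(m.group(1), 16)])
    return token.encode('utf-8')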

@b4rtaz I think it'd be reasonable to assume the special vocab precedes the regular vocab when bosId is 1 and automatically parse the vocab correctly in both cases.
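
A minimal sketch of that heuristic (the special-token count of 3 matches the hardcoded value above; this is illustrative, not a tested generic fix):

def split_vocab(tokens, bos_id, special_vocab_size=3):
    # Heuristic: if the BOS token sits at the very start of the vocab,
    # assume the special tokens (<unk>, <s>, </s>) come first, followed
    # by the regular vocab; otherwise keep the original assumption that
    # the regular vocab precedes the special vocab.
    if bos_id == 1:
        special = tokens[:special_vocab_size]
        regular = tokens[special_vocab_size:]
    else:
        regular = tokens[:bos_id]
        special = tokens[bos_id:]
    return regular, special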

@philigrale
Author

Thanks. Since I am not deep enough into the subject, I am not capable of doing this right now, but I will try to work it out through research.

Thank you very much for diagnosing the problem. I will try my best.
