
Error when converting tokenizer from Mistral Large Instruct 2411: "Exception: Cannot resolve bosId or eosIds" #185

Open
philigrale opened this issue Mar 9, 2025 · 8 comments

@philigrale

Hi,
I want to try this program because I love the concept. (I am using version 0.12.8)

I have already converted Mistral Large Instruct 2411 into the model file, but when I try to convert the tokenizer, I get the following error:

line 109, in <module>
    raise Exception('Cannot resolve bosId or eosIds')
Exception: Cannot resolve bosId or eosIds

I am not sure what is causing this. Has anyone else had this problem?

Thanks a lot for any help!

@b4rtaz
Owner

b4rtaz commented Mar 9, 2025

Hello @philigrale,

I pushed a fix. I was able to convert the tokenizer for this model, but I haven't checked if it works with the converted model.

@philigrale
Author

Hello @b4rtaz,

thank you very much for the quick reply and fix. I tried it just now, and the conversion works. Thank you!

Unfortunately, when I run the model I get the following error:

dllama: src/tokenizer.cpp:262: void Tokenizer::encode(char*, int*, int*, bool, bool): Assertion `strLen == 0' failed.

These are my start arguments:

./dllama chat --model /path/to/my/dllama_model_mistral-large_q40.m --tokenizer /path/to/my/dllama_tokenizer_mistral-large.t --buffer-float-type q80 --max-seq-len 4096 --nthreads 8 --workers (my worker ip address and port)

Here is the full output:

📄 BosId: 1 (<s>)
📄 EosId: 2 (</s>) 2 (</s>) 
📄 RegularVocabSize: 1
📄 SpecialVocabSize: 32767
💡 Arch: Llama
💡 HiddenAct: Silu
💡 Dim: 12288
💡 HiddenDim: 28672
💡 VocabSize: 32768
💡 nLayers: 88
💡 nHeads: 96
💡 nKvHeads: 8
💡 OrigSeqLen: 131072
💡 SeqLen: 4096
💡 NormEpsilon: 0.000010
💡 RopeType: Llama
💡 RopeTheta: 1000000
📀 RequiredMemory: 36704073 kB
⭕ Socket[0]: connecting to 192.168.178.84:9998 worker
⭕ Socket[0]: connected
⭕ Network is initialized
🧠 CPU: avx2
💿 Loading weights...
💿 Weights loaded
⭐ Chat template: llama2
🛑 Stop: </s>
🛑 Stop: </s>
💻 System prompt (optional): you are an IT-Expert

👱 User
> Hi there
dllama: src/tokenizer.cpp:262: void Tokenizer::encode(char*, int*, int*, bool, bool): Assertion `strLen == 0' failed.
Abgebrochen

Thank you.

@b4rtaz
Copy link
Owner

b4rtaz commented Mar 9, 2025

Thanks for checking it. Unfortunately, it seems that determining what is wrong requires more effort. Mistral is a low priority for now, so this problem will be addressed much later.

@philigrale
Author

Thanks. That's unfortunate to hear; I was looking forward to using this model. But thank you very much for your efforts!

@antoine-sac
Contributor

The error comes from the vocabulary not being parsed correctly, which you can see from these two lines:

📄 RegularVocabSize: 1
📄 SpecialVocabSize: 32767

This seems to be caused by the tokenizer expecting the bosId to separate the regular vocab from the special vocab. When the bosId is the first element of the vocab, this assumption does not hold, so the vocab is not parsed correctly.

This is fairly common and not specific to Mistral; I had the same issue trying to run TinyLlama.

There is actually a TODO line about this in the code:

// TODO: this is very unstable assumption that bosId splits regular and special vocab
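
To make the failure mode concrete, here is a minimal Python sketch of that assumption (names and indexing are illustrative, not the converter's actual code):

def split_vocab(tokens, bos_id):
    # Assumed current behaviour: everything before the BOS token is
    # treated as regular vocab, everything from it onward as special.
    regular = tokens[:bos_id]
    special = tokens[bos_id:]
    return regular, special

In Mistral Large 2411 the special tokens come first (typically <unk>=0, <s>=1, </s>=2), so splitting at bos_id == 1 yields a regular vocab of size 1 and a special vocab of size 32767, exactly the RegularVocabSize and SpecialVocabSize values in the output above. The encoder is then left with almost no regular tokens, which is what trips the `strLen == 0' assertion.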

@philigrale
Author

Thank you for the explanation!
For me personally this is too complex; my knowledge of model architectures and their usage is not yet deep enough.
Would you consider this difficult to fix, or has it simply not been relevant enough until now?
Thanks for your help and time.

@antoine-sac
Contributor

antoine-sac commented Mar 22, 2025

I patched the vocabulary parsing for my use case yesterday. It's not very pretty nor at all generic, but it should work for you as well. You may also need to patch the token converter to handle byte tokens correctly (sketched below).

You may need to change the specialVocabSize, which is currently hardcoded to 3 (for bos, eos, and unk as the first three tokens).
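
By "byte tokens" I mean SentencePiece entries like <0x0A> that stand for a single raw byte. A minimal sketch of the conversion I have in mind (illustrative, not the converter's actual code):

import re

def token_to_bytes(token: str) -> bytes:
    # SentencePiece byte-fallback tokens are written as "<0xNN>"; map
    # them to the raw byte they represent, and encode everything else
    # as plain UTF-8.
    m = re.fullmatch(r'<0x([0-9A-Fa-f]{2})>', token)
    if m:
        return bytes([int(m.group(1), 16)])
    return token.encode('utf-8')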

@b4rtaz I think it'd be reasonable to assume the special vocab precedes the regular vocab when bosId is 1 and automatically parse the vocab correctly in both cases.
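
A minimal sketch of that heuristic (the special-token count of 3 matches the hardcoded value above; this is illustrative, not a tested generic fix):

def split_vocab(tokens, bos_id, special_vocab_size=3):
    # Heuristic: if the BOS token sits at the very start of the vocab,
    # assume the special tokens (<unk>, <s>, </s>) come first, followed
    # by the regular vocab; otherwise keep the original assumption that
    # the regular vocab precedes the special vocab.
    if bos_id == 1:
        special = tokens[:special_vocab_size]
        regular = tokens[special_vocab_size:]
    else:
        regular = tokens[:bos_id]
        special = tokens[bos_id:]
    return regular, special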

@philigrale
Author

Thanks. Since I am not deep enough into the subject, I am not capable of doing this right now, but I will try to work it out through research.

Thank you very much for diagnosing the problem. I will try my best.
