Avoid duplicate special tokens in chat formats #1439
Conversation
This is to ensure that we don't duplicate any special tokens. Hopefully I amended the existing formats correctly?
@CISC thank you (also for the other PRs), this looks good. I may update the warning to use the python
me too
Hi, Matt from Hugging Face here. Just a note that we expect chat templates to contain all the special tokens the model expects, and therefore in Transformers, whenever a chat has been formatted with a chat template, it should be tokenized without adding additional special tokens. We really had no choice but to do it this way - many base models add EOS at the end of sequences, but this will always cause problems if we want to generate a chat completion following a user message, as user messages often don't end with EOS in many chat formats.
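To illustrate the Transformers convention Matt describes, a rough sketch (the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

# Any chat model works here; the checkpoint name is an arbitrary example.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]

# The chat template itself emits BOS/EOS and role markers...
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# ...so the formatted prompt is tokenized WITHOUT adding special tokens again.
input_ids = tokenizer(prompt, add_special_tokens=False).input_ids
```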
@Rocketknight1 Thanks for checking in. :) Yeah, I perfectly understand the reasoning, it just wasn't obvious until recently that this and some built-in chat formats were causing duplicate tokens because of it.
Having multiple BOS tokens can ruin generation. This can occur in several ways, usually through the user adding them unnecessarily; in that case, output a warning if we detect two in a row (see the sketch below).
However, several formats/templates also include BOS themselves, and another one would be added later during completion; therefore, format/template prompts are now tokenized before completion to avoid this.
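A minimal sketch of the duplicate-BOS check described above (the helper name and warning text are illustrative, not the PR's actual code):

```python
import warnings
from typing import List

def warn_on_duplicate_bos(tokens: List[int], bos_token_id: int) -> None:
    # Hypothetical helper: warn if the prompt starts with two BOS tokens in a row,
    # e.g. because the user added one on top of the one the template already emits.
    if len(tokens) >= 2 and tokens[0] == bos_token_id and tokens[1] == bos_token_id:
        warnings.warn(
            "Detected duplicate leading BOS tokens in the prompt; "
            "generation quality may suffer.",
            RuntimeWarning,
        )
```

For example, one could run this on the tokens of the already-formatted prompt, passing the model's BOS id (e.g. `llama.token_bos()`).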
Added a new `added_special` property to `ChatFormatterResponse` to detect when BOS/EOS has already been added to the prompt.

Also added missing `token_cls`, `token_sep`, `add_bos_token` and `add_eos_token` to `_LlamaModel` because I thought I needed them (turns out I didn't).

Fixes #1501
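Roughly, the shape of the `added_special` change looks like the abbreviated sketch below; the consumer function is hypothetical and the dataclass is trimmed to the relevant fields, so this is not the exact diff:

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class ChatFormatterResponse:
    # Abbreviated: only the fields relevant to this change.
    prompt: str
    stop: Optional[Union[str, List[str]]] = None
    added_special: bool = False  # True when the chat format already inserted BOS/EOS

def tokens_for_completion(llama, response: ChatFormatterResponse) -> List[int]:
    # Hypothetical consumer: if the chat format already added BOS/EOS,
    # don't let tokenize() prepend another BOS.
    return llama.tokenize(
        response.prompt.encode("utf-8"),
        add_bos=not response.added_special,
        special=True,
    )
```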