Skip to content

Avoid duplicate special tokens in chat formats #1439

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jun 4, 2024

Conversation

CISC
Copy link
Contributor

@CISC CISC commented May 9, 2024

Having multiple BOS can ruin generation, this would occur in several ways, usually through user adding them unnecessarily, in this case output a warning if we detect two in a row.

However, several formats/templates also have BOS in them, and we would add another one later in completion, therefore tokenize format/template prompts before completion to avoid this.

Added a new added_special property to ChatFormatterResponse to detect when BOS/EOS has already been added to prompt.

Also added missing token_cls, token_sep, add_bos_token and add_eos_token to _LlamaModel because I thought I needed them (turns out I didn't).

Fixes #1501

CISC added 3 commits May 9, 2024 20:04
This is to ensure that we don't duplicate any special tokens.

Hopefully I amended the existing formats correctly?
@CISC CISC changed the title Templates sometimes have BOS in them, remove duplicate Avoid duplicate special tokens in chat formats May 10, 2024
@CISC CISC mentioned this pull request May 29, 2024
@CISC CISC mentioned this pull request May 31, 2024
4 tasks
@abetlen
Copy link
Owner

abetlen commented Jun 4, 2024

@CISC thank you (also for the other PRs), this looks good. I may update the warning to use the python warnings library (would be useful for deprecations as well) so it can be silenced in a standard way.

@abetlen abetlen closed this Jun 4, 2024
@abetlen abetlen deleted the remove-unwanted-bos branch June 4, 2024 14:13
@abetlen abetlen restored the remove-unwanted-bos branch June 4, 2024 14:13
@abetlen abetlen deleted the remove-unwanted-bos branch June 4, 2024 14:14
@abetlen abetlen restored the remove-unwanted-bos branch June 4, 2024 14:14
@abetlen abetlen reopened this Jun 4, 2024
@abetlen abetlen merged commit 027f7bc into abetlen:main Jun 4, 2024
16 checks passed
@CISC CISC deleted the remove-unwanted-bos branch June 4, 2024 17:03
@BigCatGit
Copy link

me too

@Rocketknight1
Copy link

Rocketknight1 commented Jun 6, 2024

Hi, Matt from Hugging Face here. Just a note that we expect chat templates to contain all the special tokens the model expects, and therefore in Transformers, whenever a chat has been formatted with a chat template, it should be tokenized without adding additional special tokens.

We really had no choice but to do it this way - many base models add EOS at the end of sequences, but this will always cause problems if we want to generate a chat completion following a user message, as user messages often don't end with EOS in many chat formats.

@CISC
Copy link
Contributor Author

CISC commented Jun 6, 2024

@Rocketknight1 Thanks for checking in. :)

Yeah, I perfectly understand the reasoning, it just wasn't obvious until recently that this and some built in chat formats were causing duplicate tokens due to this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Llama 3 Double BOS
4 participants