Fix tokenization edge case where llama output does not start with a space #1375
Created following investigation in this issue:
noamgat/lm-format-enforcer#92
See this notebook for a reproduction of the problem:
https://colab.research.google.com/drive/1Ooz11nFPk19zyJdMDx42CeesU8aWZMdI#scrollTo=oKpHw5PZ30uC
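The same mismatch can also be reproduced directly against the high-level API with a few lines (a sketch only; the GGUF filename pattern below is an assumption about which quantization is available in the repo):

```python
from llama_cpp import Llama

# Illustrative reproduction; adjust the filename pattern if this quant
# is not present in the repo.
llm = Llama.from_pretrained(
    repo_id="TheBloke/tinyllama-1.1b-chat-v1.0-GGUF",
    filename="*Q4_K_M.gguf",
    verbose=False,
)

print(llm.detokenize([6377]))     # b'{"'
print(llm.detokenize([1, 6377]))  # b'"' before this fix: the '{' is wrongly stripped
```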
When using the model `TheBloke/tinyllama-1.1b-chat-v1.0-GGUF`, the token sequence `[6377]` currently decodes to `{"`, while the token sequence `[1, 6377]` decodes to `"`. This is because the Llama tokenizer doesn't add a leading space when decoding this sequence, but the llama-cpp-python code that wraps it assumes that it does and strips the first character accordingly. This change removes that assumption and only returns `output[1:]` instead of `output` when the first character actually is a space.
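The core of the change can be sketched as a small helper (illustrative only; the real code applies this conditional inline in the detokenize path and may operate on `bytes` rather than `str`):

```python
def _maybe_strip_leading_space(output: str) -> str:
    """Return output[1:] only when the detokenized text really starts with
    the artificial leading space; otherwise return it unchanged."""
    # A slice (output[0:1]) is safe even when output is empty, unlike output[0].
    return output[1:] if output[0:1] == " " else output


print(repr(_maybe_strip_leading_space(" hello")))  # 'hello' -- space stripped as before
print(repr(_maybe_strip_leading_space('{"')))      # '{"'    -- previously lost its '{'
print(repr(_maybe_strip_leading_space("")))        # ''      -- empty output handled
```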
Implementation note: I made the check `output[0:1] == ' '` and not `output[0] == ' '` to avoid edge cases where the output is empty (maybe if the first tokens are partial unicode characters).
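For clarity, this is the difference between the two checks on an empty output (plain Python, nothing specific to this codebase):

```python
empty = ""

# Slicing an empty string yields another empty string, so the comparison
# simply evaluates to False:
print(empty[0:1] == " ")  # False

# Indexing position 0 of an empty string raises instead:
try:
    _ = empty[0] == " "
except IndexError as exc:
    print(exc)  # string index out of range
```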