Fix tokenization edge case where llama output does not start with a space #1375
Created following investigation in this issue:
noamgat/lm-format-enforcer#92
See this notebook for a reproduction of the problem:
https://colab.research.google.com/drive/1Ooz11nFPk19zyJdMDx42CeesU8aWZMdI#scrollTo=oKpHw5PZ30uC
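The same mismatch can also be reproduced directly against the high-level API with a few lines (a sketch only; the GGUF filename pattern below is an assumption about which quantization is available in the repo):

```python
from llama_cpp import Llama

# Illustrative reproduction; adjust the filename pattern if this quant
# is not present in the repo.
llm = Llama.from_pretrained(
    repo_id="TheBloke/tinyllama-1.1b-chat-v1.0-GGUF",
    filename="*Q4_K_M.gguf",
    verbose=False,
)

print(llm.detokenize([6377]))     # b'{"'
print(llm.detokenize([1, 6377]))  # b'"' before this fix: the '{' is wrongly stripped
```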
When using the model `TheBloke/tinyllama-1.1b-chat-v1.0-GGUF`, the token sequence `[6377]` currently decodes to `{"`, while the token sequence `[1, 6377]` decodes to `"`. This is because the Llama tokenizer doesn't add a leading space when decoding this sequence, but the llama-cpp-python code that wraps it assumes that it does and strips the first character accordingly. This change removes that assumption and only returns `output[1:]` instead of `output` when the first character actually is a space.
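The core of the change can be sketched as a small helper (illustrative only; the real code applies this conditional inline in the detokenize path and may operate on `bytes` rather than `str`):

```python
def _maybe_strip_leading_space(output: str) -> str:
    """Return output[1:] only when the detokenized text really starts with
    the artificial leading space; otherwise return it unchanged."""
    # A slice (output[0:1]) is safe even when output is empty, unlike output[0].
    return output[1:] if output[0:1] == " " else output


print(repr(_maybe_strip_leading_space(" hello")))  # 'hello' -- space stripped as before
print(repr(_maybe_strip_leading_space('{"')))      # '{"'    -- previously lost its '{'
print(repr(_maybe_strip_leading_space("")))        # ''      -- empty output handled
```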
Implementation note: I made the check `output[0:1] == ' '` and not `output[0] == ' '` to avoid edge cases where the output is empty (maybe if the first tokens are partial unicode characters).
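For clarity, this is the difference between the two checks on an empty output (plain Python, nothing specific to this codebase):

```python
empty = ""

# Slicing an empty string yields another empty string, so the comparison
# simply evaluates to False:
print(empty[0:1] == " ")  # False

# Indexing position 0 of an empty string raises instead:
try:
    _ = empty[0] == " "
except IndexError as exc:
    print(exc)  # string index out of range
```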