Commit e0d7674

noamgat and abetlen authored

fix: detokenization case where first token does not start with a leading space (#1375)

* Fix tokenization edge case where llama output does not start with a space. See this notebook: https://colab.research.google.com/drive/1Ooz11nFPk19zyJdMDx42CeesU8aWZMdI#scrollTo=oKpHw5PZ30uC

* Update _internals.py: compare to b' ' instead of (str) ' '

Co-authored-by: Andrei <abetlen@gmail.com>
1 parent 1f56c64 commit e0d7674

File tree

1 file changed: +2 additions, -2 deletions


llama_cpp/_internals.py

Lines changed: 2 additions & 2 deletions

@@ -203,7 +203,7 @@ def detokenize(self, tokens: List[int], special: bool = False) -> bytes:
         # NOTE: Llama1 models automatically added a space at the start of the prompt
         # this line removes a leading space if the first token is a beginning of sentence token
         return (
-            output[1:] if len(tokens) > 0 and tokens[0] == self.token_bos() else output
+            output[1:] if len(tokens) > 0 and tokens[0] == self.token_bos() and output[0:1] == b' ' else output
         )

     # Extra
@@ -812,4 +812,4 @@ def sample(
     def accept(self, ctx_main: _LlamaContext, id: int, apply_grammar: bool):
         if apply_grammar and self.grammar is not None:
             ctx_main.grammar_accept_token(self.grammar, id)
-        self.prev.append(id)
+        self.prev.append(id)
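The condition added in the first hunk can be sketched in isolation. The helper below mirrors the fixed guard: strip the auto-inserted leading space only when the first token is the beginning-of-sentence token AND the raw output actually begins with a space byte (compared as `b' '`, not the str `' '`). The function name and the example token IDs are illustrative assumptions, not part of the llama-cpp-python API.

```python
def strip_leading_space(output: bytes, tokens: list, bos_token: int) -> bytes:
    # Only strip the space that Llama1-style models auto-insert when the
    # first token is BOS; before this commit the output was sliced even
    # when it did not start with a space, dropping a real first character.
    # output[0:1] slicing (rather than output[0]) yields bytes, so the
    # comparison against b' ' is type-correct and safe on empty output.
    if len(tokens) > 0 and tokens[0] == bos_token and output[0:1] == b' ':
        return output[1:]
    return output


# Hypothetical token stream where 1 is the BOS id:
print(strip_leading_space(b' Hello', [1, 15043], 1))  # b'Hello' (space stripped)
print(strip_leading_space(b'Hello', [1, 15043], 1))   # b'Hello' (nothing to strip)
print(strip_leading_space(b' Hello', [15043], 1))     # b' Hello' (no BOS, unchanged)
```

The last two cases are exactly the edge this commit fixes: without the `output[0:1] == b' '` check, an output that begins with BOS but no space would lose its first byte.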
