
<think> tags are not removed when set to be removed if the closing </think> isn't generated in the same pass. #1359

Open
wh33t opened this issue Feb 8, 2025 · 6 comments


wh33t commented Feb 8, 2025

Describe the Issue
[think] tags are not stripped from the output, even when think tags are set to "remove", if the closing [/think] tag is not output in the same pass.

Sometimes thinking requires more than 512 tokens of output, and if the opening [think] and the closing [/think] are not delivered within the same inference pass, kcpp fails to remove the [think]block[/think] from the context window.

However, they are removed properly if [think] and [/think] are delivered in the same pass.
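My guess at why (purely a hypothetical sketch on my part, not kcpp's actual code): if the removal is done with a non-greedy regex over complete pairs, an unterminated block never matches and so survives in the context:

```python
import re

# Hypothetical sketch of the failure mode, not kcpp's actual implementation.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(text: str) -> str:
    # Only complete <think>...</think> pairs match; an opening tag whose
    # closing tag arrives in a later pass is left in the context untouched.
    return THINK_RE.sub("", text)

print(strip_think("<think>short</think> answer"))   # -> " answer"
print(strip_think("<think>truncated mid-thought"))  # unchanged: no closing tag
```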

Additional Information:
Kcpp 1.83

PS. I have intentionally used the [] (square brackets) in this post so that they show up correctly.

Also, I would like to propose added functionality:

  1. A settable parameter, "[Think] tag auto-removal threshold", set to an integer. Say we set it to 1000: after 1000 tokens/words of output AFTER a closing [/think], that whole [think]block[/think] is removed. Clearly the think block needs to exist to help predict the output, but after a while it just eats up precious context window and VRAM. Also keep in mind that you wouldn't necessarily want to strip every think block in the context window after 1000 tokens/words; you'd want to strip only the earliest think block in the context window (roughly as sketched below).
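Something like this rough sketch (all names hypothetical, with a crude whitespace word count standing in for a real tokenizer):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def prune_earliest_think(context: str, threshold: int = 1000) -> str:
    """Hypothetical: drop the earliest complete think block once
    `threshold` words of output have accumulated after its closing tag."""
    match = THINK_BLOCK.search(context)  # earliest block only
    if match is None:
        return context
    trailing = context[match.end():]
    if len(trailing.split()) >= threshold:  # crude word count
        return context[:match.start()] + trailing
    return context
```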
@LostRuins
Owner

Hmm, it's a little tricky to take this approach. For now, you can instead try increasing the max output "number of tokens" to above 512. This is actually possible - in Lite, just manually edit the value above the slider.

[Image: the "max output" value field above the slider in Lite]

Note that it's not advisable to increase this value beyond 25% of your max context size. So if your max context size is 8192, you can safely set your "max output" to 2048, which should ensure that the closing </think> is captured within the same request.
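As a quick sanity check on that rule of thumb (plain arithmetic, nothing kcpp-specific):

```python
max_context = 8192
max_output = max_context // 4  # 25% of context -> 2048
print(max_output)
```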

Does this help?


x-legion commented Feb 9, 2025

Hey @LostRuins, I was wondering if it might be possible to implement the s1: Simple test-time scaling technique from this paper, using this implementation, for DeepSeek R1-type models? It seems like it could make even smaller models perform better!

@LostRuins
Owner

You could simulate it manually in story mode: when you start getting a response, delete it and replace it with a "Wait" instead.
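If you'd rather script it than do it by hand, a rough sketch against the generate endpoint might look like this (the endpoint and field names here are assumed from the standard KoboldAI API that kcpp exposes; adjust to your setup):

```python
import requests

# Assumed default local KoboldCpp endpoint; change host/port as needed.
API = "http://localhost:5001/api/v1/generate"

def generate(prompt: str, max_length: int = 256) -> str:
    r = requests.post(API, json={"prompt": prompt, "max_length": max_length})
    r.raise_for_status()
    return r.json()["results"][0]["text"]

# Crude s1-style budget forcing: each time the model tries to close its
# reasoning, cut the closing tag and append "Wait," to extend the thinking.
prompt = "<think>Let me work through this step by step."
for _ in range(3):
    out = generate(prompt)
    if "</think>" in out:
        out = out.split("</think>", 1)[0] + "\nWait,"
    prompt += out
print(prompt)
```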

Author

wh33t commented Feb 9, 2025

> Hmm, it's a little tricky to take this approach. For now, you can instead try increasing the max output "number of tokens" to above 512. This is actually possible - in Lite, just manually edit the value above the slider.
>
> Note that it's not advisable to increase this value beyond 25% of your max context size. So if your max context size is 8192, you can safely set your "max output" to 2048, which should ensure that the closing </think> is captured within the same request.
>
> Does this help?

Aye, that will work. Or I can just keep manually removing them (feels clunky though).

Thanks!

@LostRuins
Owner

Hi, this feature has been added as a toggle in the latest release.

Author

wh33t commented Mar 1, 2025

I tested it out. Pretty impressive how not submitting the [think] blocks keeps the context nice and efficient.

I still think it would be better to somehow set a specific number of words that must exist between the closing think tag and the most recent submission from the user.

While testing this update out I encountered a weird quirk: there was about 240 tokens of thinking before the model began to output its actual response, then it ran out of tokens to use, and then, because I had it set so that the think tags weren't submitted, it basically started its [think] phase all over again. This will be a recurring issue if the model runs out of tokens near the closing [/think].

Somehow telling kcpp to incorporate only the last or most recent think block would be the best solution imo, roughly like the sketch below. I know it would cause delays while kcpp updates the context window, but I think it would be worth it for people like me who want to use kcpp for long-form writing.
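For illustration only (a hypothetical sketch, not kcpp code):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def keep_last_think(context: str) -> str:
    """Hypothetical: strip every complete think block except the most
    recent one before resubmitting the context."""
    blocks = list(THINK_BLOCK.finditer(context))
    # Delete from the back so earlier match offsets stay valid.
    for match in reversed(blocks[:-1]):
        context = context[:match.start()] + context[match.end():]
    return context
```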

Either way though, it's much better with these new features. Great work and thank you.
