Editing a long session causes a long recompute #1383

Open
Jookia opened this issue Feb 20, 2025 · 9 comments

Jookia commented Feb 20, 2025

Describe the Issue
I created a new session with a bot on KoboldLite and wrote enough to go far over the context limit. Many context shifts happened and tokens were erased. I then undid a few messages and wrote something. Generation then took a long time to compute.

I'm not exactly sure what the intended behaviour is here, or how to fix it. I'm guessing this is a natural result of the frontend passing as many tokens as it can while the context shifts, so undoing effectively prepends text at the start of the context and forces a recompute. It would be nice to work around that somehow.
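(As a toy sketch of the shifting behavior described above, not KoboldCpp's actual logic, which also handles pinned sections like memory differently:)

```js
// Toy sketch of context shifting, not KoboldCpp's actual logic.
// A fixed-size window drops the oldest tokens as new ones arrive,
// so undoing back past the window start touches erased tokens and
// forces a reprocess.
function shiftContext(windowTokens, newTokens, maxLen) {
  const combined = windowTokens.concat(newTokens);
  return combined.slice(Math.max(0, combined.length - maxLen));
}

let ctx = [1, 2, 3, 4, 5, 6];
ctx = shiftContext(ctx, [7, 8], 6);
console.log(ctx); // [3, 4, 5, 6, 7, 8] -- tokens 1 and 2 are gone
```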

Additional Information:

I use this unit file to run koboldcpp:

[Unit]
Description=koboldcpp daemon

[Service]
AmbientCapabilities=
CapabilityBoundingSet=
DeviceAllow=
DynamicUser=yes
ExecStart=koboldcpp --quiet --whispermodel whisper.gguf --ttsmodel tts.gguf --ttswavtokenizer ttswavtokenizer.gguf model.gguf
IPAddressAllow=127.0.0.1
IPAddressDeny=any
LockPersonality=yes
MemoryDenyWriteExecute=yes
PrivateDevices=yes
PrivateMounts=yes
PrivatePIDs=yes
PrivateUsers=yes
ProcSubset=pid
ProtectClock=yes
ProtectControlGroups=yes
ProtectHome=yes
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectProc=invisible
RemoveIPC=yes
RestrictAddressFamilies=AF_INET
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SecureBits=
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@privileged
SystemCallFilter=~@resources
Type=simple
WorkingDirectory=/var/local/koboldcpp

[Install]
WantedBy=multi-user.target

I'm using the Cydonia-22B-v2k-Q4_K_M, OuteTTS-0.3-1B-Q4_0, and whisper-large-v3-f16 models.
I'm using Arch Linux with an AMD Ryzen 7 3700X processor. No GPU acceleration is used.

Log and story textdata:

LOG.txt
STORY_TEXTDATA.txt

@LostRuins (Owner)

Yes, that's how it works. The old stuff in the context is gone, so when you revert to it again it must be reprocessed.

Jookia (Author) commented Feb 20, 2025 via email

@LostRuins (Owner)

Yes, the solution is to manually truncate your story to keep it shorter, moving the excess somewhere else, like notes or a different file. This only happens because you exceed the max context length, so text shifts out of the context and then comes back in later.

@MrReplikant

@LostRuins is there any way to automate that process? A user script, perhaps? That'd be cool; we could even feed it to RAG if we wanted to...

-Darth

@LostRuins (Owner)

Doing it automatically would cause the same issue you are facing now: the start of the context keeps changing, causing a reprocess every time. Doing it manually only causes a reprocess when you modify the story.
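(The reprocessing cost comes down to KV-cache prefix reuse. A rough illustration, not KoboldCpp's actual code: cached state survives only for the longest unchanged token prefix, so anything that rewrites the start of the prompt invalidates nearly everything.)

```js
// Rough illustration of KV-cache prefix reuse, not KoboldCpp's
// actual code: only tokens in the longest common prefix of the
// cached prompt and the new prompt keep their cached state.
function reusablePrefixLength(cachedTokens, newTokens) {
  const limit = Math.min(cachedTokens.length, newTokens.length);
  let n = 0;
  while (n < limit && cachedTokens[n] === newTokens[n]) {
    n++;
  }
  return n;
}

// An edit in the middle keeps the prefix before it...
console.log(reusablePrefixLength([1, 2, 3, 4], [1, 2, 9, 4])); // 2
// ...but changing the very first token means reprocessing it all.
console.log(reusablePrefixLength([5, 2, 3, 4], [1, 2, 3, 4])); // 0
```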

Jookia (Author) commented Feb 21, 2025

Manually truncating is a problem because I'd have to keep doing it, and I don't know how many tokens my text uses. It would also mean a recompute whenever I fold information into memory or something similar, since that puts something at the start of the context. I'd be interested in hearing how other people deal with this; do they just guess how many tokens they've used?
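(One way to avoid guessing: KoboldCpp exposes a token-count endpoint in its extra API. A minimal sketch, assuming a default local instance on port 5001 and the /api/extra/tokencount route; the response fields may differ across versions.)

```js
// Minimal sketch: ask a local KoboldCpp instance how many tokens a
// piece of text uses. Assumes the default port 5001 and the
// /api/extra/tokencount route; response fields may vary by version.
async function countTokens(text) {
  const res = await fetch("http://127.0.0.1:5001/api/extra/tokencount", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: text }),
  });
  const data = await res.json();
  return data.value; // total token count for the text
}

countTokens("Once upon a time...").then((n) => console.log(`${n} tokens`));
```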

Having some kind of ratchet mechanism for shifting in the UI, where undoing doesn't move the context window back, seems like something that could work.

@LostRuins (Owner)

Hmm, outside of a mod I don't think there's any good solution right now.

You could try adding an extra function before submit_generation is called that does something like: if the total length of the story exceeds the max context, truncate away the first half of the story. That would give a kind of ratcheting effect.

Most people are fine with an occasional recompute; it only takes a few seconds on GPU at most.
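(A minimal sketch of that pre-submit hook, assuming hypothetical getStoryText/setStoryText accessors; the real Lite internals around submit_generation differ.)

```js
// Minimal sketch of the ratcheting truncation idea, not actual
// KoboldAI Lite code. getStoryText/setStoryText are hypothetical
// stand-ins for however a mod would read and write the story.
const MAX_CONTEXT_CHARS = 8192 * 4; // crude chars-per-token estimate

function ratchetTruncate(getStoryText, setStoryText) {
  const story = getStoryText();
  if (story.length > MAX_CONTEXT_CHARS) {
    // Cut away the first half, so the context start then stays
    // stable for many turns instead of shifting on every submit.
    setStoryText(story.slice(Math.floor(story.length / 2)));
  }
}
```

Because the cut happens only when the story outgrows the limit, the context start stays fixed between cuts, which is what gives the ratcheting effect.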

Jookia (Author) commented Feb 21, 2025

Huh, really? It takes about a minute or two on my CPU. 😅 Maybe I should buy a GPU.

@LostRuins (Owner)

Yes, that would greatly speed up inference.
