Beam Search Implementation #84
Hello Exllama friends,

I was curious what the thoughts are on implementing beam search in v2. In v1, beam search was implemented in the core generator. What would the requirements be to migrate the same source over to v2, and is there anything I should be mindful of when creating a PR that migrates the v1 beam search to v2?

Comments
It definitely needs to be adapted for the new version, so expect it to need some minor changes at least. But I'm not sure I'd do it the same way. In V1 I avoided using batches so the beam search wouldn't have VRAM overhead, but then of course there was extra latency instead. I think you should be able to get the best of both worlds with a slightly different approach, though. Just haven't quite figured it out yet.
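For illustration, here is a minimal sketch of the batched approach being weighed against above: all beams are stacked into one batch, so each step costs a single forward pass (more VRAM) instead of one pass per beam (more latency). The `model` callable, its shapes, and the absence of any cache handling are hypothetical stand-ins, not the actual ExLlama API:

```python
import torch
import torch.nn.functional as F

def beam_step_batched(model, beams, beam_scores, beam_width):
    # `beams` is a list of equal-length 1-D LongTensors (current prefixes);
    # `beam_scores` holds their cumulative log-probabilities.
    # `model` is a hypothetical callable returning next-token logits of
    # shape (num_beams, vocab_size) for a batch of prefixes.
    logits = model(torch.stack(beams))               # one forward pass
    log_probs = F.log_softmax(logits, dim=-1)

    # Score every (beam, next-token) continuation, then keep the global
    # top `beam_width` candidates across all beams.
    scores = beam_scores.unsqueeze(-1) + log_probs   # (num_beams, vocab)
    top = scores.flatten().topk(beam_width)
    vocab_size = log_probs.size(-1)
    beam_idx = top.indices // vocab_size
    token_idx = top.indices % vocab_size

    new_beams = [torch.cat([beams[b], token_idx[i].view(1)])
                 for i, b in enumerate(beam_idx.tolist())]
    return new_beams, top.values
```

The sequential V1-style variant would instead loop over the beams and run `beam_width` forward passes at batch size one, trading that VRAM back for latency.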
@turboderp have you put more thought into this? I barely understand any of it, but the way I see it, the different strategies are an either/or choice: if one gives you a benefit, you lose the benefits of the others by not using them, correct? With that in mind, would it not be possible to combine different strategies, gaining the benefits of all of them while mitigating the disadvantages of some, like those that give generic and monotone outputs?

Oh, and by the way: when autosplitting across GPUs, would it not make more sense to always load GPUs in last-to-first order (or at least have a parameter for it)? That way we reserve the left-aligned space for the system and the right-aligned space for model loading, so we would only see an overload when all GPUs are at full capacity.

I'm currently working on a chatbot application where I want to simplify some of the more complicated processes, so this would be a great addition if such an implementation is possible 😇
Hey @cohan8999, @turboderp has done a ton of work, there is still tons of work to do, and it's my bad for suggesting I'd commit to creating this. I'll be honest with you: I haven't been doing a lot of Llama-based SFT lately and am mostly working with Claude and GPT-4 SFT these days.

In terms of strategies, @cohan8999: no, this does not impact top-k or top-p sampling; it actually enhances the token sampling process. As for combining multiple algorithms, I'm not sure what you mean. I'm not fully up to date on the latest token sampling techniques, and I highly recommend you do a deep dive into the current ecosystem for token sampling. It's not black and white. You don't pick one or the other. They can often work in conjunction... and sometimes cannot.

Cheers,
Chris
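As a concrete example of strategies working in conjunction, here is a generic sketch chaining temperature, top-k, and top-p filtering in a single sampling step. The function and its default parameter values are illustrative only, not ExLlama's actual sampler:

```python
import torch
import torch.nn.functional as F

def sample_combined(logits, temperature=0.8, top_k=50, top_p=0.9):
    # Temperature first: flatten or sharpen the distribution.
    logits = logits / temperature

    # Top-k: restrict to the k highest-scoring tokens.
    k_logits, k_idx = logits.topk(top_k)
    probs = F.softmax(k_logits, dim=-1)

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches top_p; the most likely token always survives.
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = (cumulative - sorted_probs) < top_p
    filtered = sorted_probs * keep

    # Sample from what remains and map back to the original vocab ids.
    choice = torch.multinomial(filtered / filtered.sum(), num_samples=1)
    return k_idx[sorted_idx[choice]]
```

Here `logits` is a 1-D tensor over the vocabulary; a beam search could in principle apply the same kind of filtering to each beam's candidate set before expansion.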
Part of the motivation for the dynamic generator is to have a better framework for sampling strategies like beam search, so it's probably coming at some point. It's not in particularly high demand, though, as it's a super-greedy algorithm, and everyone's looking away from that towards more creative random sampling approaches.
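To make that contrast concrete: at each step, beam search deterministically keeps the k highest-scoring continuations, while random sampling draws from the distribution, which is what lets lower-probability tokens through. A toy single-step comparison (the vocabulary size and logits are arbitrary placeholders):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32000)               # stand-in next-token logits
log_probs = F.log_softmax(logits, dim=-1)

# Beam search expansion: always the same k best tokens, ranked by
# log-probability; nothing unlikely is ever picked.
beam_width = 4
best_scores, best_tokens = log_probs.topk(beam_width)

# Random sampling: draw one token from the full distribution, so
# lower-probability but more "creative" tokens remain reachable.
sampled_token = torch.multinomial(log_probs.exp(), num_samples=1)
```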