
Beam Search Implementation #84

Open
ChrisCates opened this issue Oct 3, 2023 · 4 comments

@ChrisCates

Hello Exllama friends,

I was curious what your thoughts are on implementing beam search in v2.
In v1, beam search was implemented in the core generator.

I was also curious what the requirements would be to migrate the same source over to v2,
and whether there is anything I should be mindful of if I create a PR migrating the v1 beam search to v2.

@turboderp
Member

It definitely needs to be adapted for the new version, so expect it to need some minor changes at least. But I'm not sure I'd do it the same way. In V1 I avoided using batches so the beam search wouldn't have VRAM overhead, but then of course there was extra latency instead. I think you should be able to get the best of both worlds with a slightly different approach, though. Just haven't quite figured it out yet.
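
For reference, a minimal, framework-agnostic sketch of the basic beam search loop under discussion (this is not ExLlamaV2 code; `next_token_logprobs` is a placeholder for whatever interface the generator ends up exposing). Running all live beams as a single batch would turn the inner loop into one forward pass per step, which is exactly the VRAM-versus-latency trade described above:

```python
from typing import Callable, List, Sequence, Tuple


def beam_search(
    prompt: Sequence[int],
    next_token_logprobs: Callable[[Sequence[int]], List[float]],  # placeholder, not ExLlamaV2 API
    beam_width: int = 4,
    max_new_tokens: int = 32,
    eos_token: int = 2,
) -> List[int]:
    # Each beam is a (token sequence, cumulative log-probability) pair.
    beams: List[Tuple[List[int], float]] = [(list(prompt), 0.0)]
    for _ in range(max_new_tokens):
        candidates: List[Tuple[List[int], float]] = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos_token:
                candidates.append((tokens, score))  # finished beams carry over unchanged
                continue
            logprobs = next_token_logprobs(tokens)
            # Expand this beam by its `beam_width` most likely continuations.
            top = sorted(range(len(logprobs)), key=lambda t: logprobs[t], reverse=True)[:beam_width]
            candidates.extend((tokens + [t], score + logprobs[t]) for t in top)
        # Keep only the `beam_width` best candidates overall.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == eos_token for tokens, _ in beams):
            break
    return beams[0][0]
```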

@cohan8999

cohan8999 commented Apr 26, 2024

@turboderp have you put more thought into this? I barely understand any of it, but the way I see it, there are different strategies where you have to pick "either/or" in terms of advantages, correct? Meaning that if one gives you a benefit, you lose other benefits by not using the others.

With that being said, would it not be possible to combine different strategies, gaining the benefits of all of them while mitigating the disadvantages of some, like those that give generic and monotone outputs?

Oh, and by the way: when autosplitting across GPUs, would it not make more sense to always load in last-to-first order (or at least have a parameter for it)? That way the first GPUs are reserved for the system and the last ones for model loading, meaning we'd only see an overload when all GPUs are at full capacity.

I'm currently working on a chatbot application where I want to simplify some of the more complicated processes, so this would be a great addition if such an implementation is possible 😇

@ChrisCates
Author

ChrisCates commented Apr 27, 2024

Hey @cohan8999, @turboderp has done a ton of work and there is still a ton left to do; my bad for suggesting I'd commit to creating this.

I'll be honest with you: I haven't been doing a lot of Llama-based SFT lately and am mostly doing Claude and GPT-4 SFT these days.

In terms of strategies, @cohan8999: no, this does not impact top-k and top-p sampling. It actually enhances the token sampling process.

In regards to multiple algorithms, I'm not sure what you mean. I'm not fully up to date on the latest token sampling methods, and I highly recommend you do a deep dive on the current ecosystem for token sampling. It's not black and white; you don't have to pick one or the other. They can often work in conjunction... and sometimes cannot.
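
As a toy illustration of that last point (the names here are made up for the example and are not ExLlamaV2's API): truncation samplers like top-k/top-p just prune the next-token distribution, and whatever comes next, whether a beam expansion or a random draw, only considers the tokens that survive:

```python
import math
from typing import Dict


def top_k_top_p_filter(logprobs: Dict[int, float], top_k: int = 50, top_p: float = 0.9) -> Dict[int, float]:
    # Rank tokens by log-probability and keep at most `top_k` of them.
    ranked = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept: Dict[int, float] = {}
    cumulative = 0.0
    for token, lp in ranked:
        kept[token] = lp
        cumulative += math.exp(lp)
        if cumulative >= top_p:  # stop once the nucleus covers enough probability mass
            break
    return kept  # the beam (or the sampler) only expands over these survivors


# Toy next-token distribution over five tokens:
logprobs = {t: math.log(p) for t, p in enumerate([0.5, 0.3, 0.1, 0.07, 0.03])}
print(top_k_top_p_filter(logprobs, top_k=4, top_p=0.85))  # keeps tokens 0, 1 and 2
```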

Cheers, Chris

@turboderp
Member

Part of the motivation for the dynamic generator is to have a better framework for sampling strategies like beam search, so it's probably coming at some point. It's not in particularly high demand, though, as it's a super-greedy algorithm, and everyone's looking away from that towards more creative random sampling approaches.
