
Beam Search Implementation #84

Open
ChrisCates opened this issue Oct 3, 2023 · 4 comments

@ChrisCates

Hello Exllama friends,

I was curious what your thoughts are on implementing beam search in v2.
In v1, beam search was implemented in the core generator.

I was also curious what the requirements would be to migrate the same source over to v2,
and whether there is anything I should be mindful of if I create a PR migrating the v1 beam search to v2.

@turboderp
Member

It definitely needs to be adapted for the new version, so expect it to need some minor changes at least. But I'm not sure I'd do it the same way. In V1 I avoided using batches so the beam search wouldn't have VRAM overhead, but then of course there was extra latency instead. I think you should be able to get the best of both worlds with a slightly different approach, though. Just haven't quite figured it out yet.
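
For reference, a minimal, framework-agnostic sketch of the basic beam search loop under discussion (this is not ExLlamaV2 code; `next_token_logprobs` is a placeholder for whatever interface the generator ends up exposing). Running all live beams as a single batch would turn the inner loop into one forward pass per step, which is exactly the VRAM-versus-latency trade described above:

```python
from typing import Callable, List, Sequence, Tuple


def beam_search(
    prompt: Sequence[int],
    next_token_logprobs: Callable[[Sequence[int]], List[float]],  # placeholder, not ExLlamaV2 API
    beam_width: int = 4,
    max_new_tokens: int = 32,
    eos_token: int = 2,
) -> List[int]:
    # Each beam is a (token sequence, cumulative log-probability) pair.
    beams: List[Tuple[List[int], float]] = [(list(prompt), 0.0)]
    for _ in range(max_new_tokens):
        candidates: List[Tuple[List[int], float]] = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos_token:
                candidates.append((tokens, score))  # finished beams carry over unchanged
                continue
            logprobs = next_token_logprobs(tokens)
            # Expand this beam by its `beam_width` most likely continuations.
            top = sorted(range(len(logprobs)), key=lambda t: logprobs[t], reverse=True)[:beam_width]
            candidates.extend((tokens + [t], score + logprobs[t]) for t in top)
        # Keep only the `beam_width` best candidates overall.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == eos_token for tokens, _ in beams):
            break
    return beams[0][0]
```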

@cohan8999

cohan8999 commented Apr 26, 2024

@turboderp have you put more thought into this? I barely understand any of it, but the way I see it, there are different strategies where you have to pick "either/or" in terms of advantages, correct? Meaning that if one gives you a benefit, you lose other benefits by not using the others.

With that being said, would it not be possible to combine different strategies, gaining the benefits of all of them while mitigating the disadvantages of some, like those that give generic and monotone outputs?

Oh, and by the way: when autosplitting across GPUs, would it not make more sense to always load in last-to-first order (or at least have a parameter for it)? That way the first GPUs are reserved for the system and the last ones for model loading, meaning we'd only see an overload when all GPUs are at full capacity.

I'm currently working on a chatbot application where I want to simplify some of the more complicated processes, so this would be a great addition if such an implementation is possible 😇

@ChrisCates
Author

ChrisCates commented Apr 27, 2024

Hey @cohan8999, @turboderp has done a ton of work and there is still a ton left to do; my bad for suggesting I'd commit to creating this.

I'll be honest with you: I haven't been doing a lot of Llama-based SFT lately and am mostly doing Claude and GPT-4 SFT these days.

In terms of strategies, @cohan8999: no, this does not impact top-k and top-p sampling. It actually enhances the token sampling process.

In regards to multiple algorithms, I'm not sure what you mean. I'm not fully up to date on the latest token sampling methods, and I highly recommend you do a deep dive on the current ecosystem for token sampling. It's not black and white; you don't have to pick one or the other. They can often work in conjunction... and sometimes cannot.
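
As a toy illustration of that last point (the names here are made up for the example and are not ExLlamaV2's API): truncation samplers like top-k/top-p just prune the next-token distribution, and whatever comes next, whether a beam expansion or a random draw, only considers the tokens that survive:

```python
import math
from typing import Dict


def top_k_top_p_filter(logprobs: Dict[int, float], top_k: int = 50, top_p: float = 0.9) -> Dict[int, float]:
    # Rank tokens by log-probability and keep at most `top_k` of them.
    ranked = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept: Dict[int, float] = {}
    cumulative = 0.0
    for token, lp in ranked:
        kept[token] = lp
        cumulative += math.exp(lp)
        if cumulative >= top_p:  # stop once the nucleus covers enough probability mass
            break
    return kept  # the beam (or the sampler) only expands over these survivors


# Toy next-token distribution over five tokens:
logprobs = {t: math.log(p) for t, p in enumerate([0.5, 0.3, 0.1, 0.07, 0.03])}
print(top_k_top_p_filter(logprobs, top_k=4, top_p=0.85))  # keeps tokens 0, 1 and 2
```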

Cheers, Chris

@turboderp
Member

Part of the motivation for the dynamic generator is to have a better framework for sampling strategies like beam search, so it's probably coming at some point. It's not in particularly high demand, though, as it's a super-greedy algorithm, and everyone's looking away from that towards more creative random sampling approaches.
