Having used automatic1111 and sdnext with P40s extensively, I have found that only the calculations have to be upcast to FP32 to get most of the performance back.
How many places would such a change have to be made to do this here? And do any of the ops use tensor cores? Tall order, or easy enough change?
Tall order for sure. Once the hardware I've ordered starts coming in, perhaps I could add a P40 and do some profiling and maybe provide some alternative kernels with upcasting. But it would be an extensive change and a lot more code to maintain, so I don't know if it's really feasible with my time budget.
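For context, here is a minimal sketch (not code from this repo, just an illustration with hypothetical names) of what such an upcasting kernel boils down to: weights and activations stay in FP16 in memory, but every multiply-accumulate is upcast to FP32, since Pascal cards like the P40 only run full-rate math in FP32.

```cuda
// Hypothetical sketch of FP16-storage / FP32-math, the pattern the P40 needs:
// values are stored as half to keep memory bandwidth low, but each element is
// upcast with __half2float before the multiply-accumulate, so no half-precision
// arithmetic units are involved.
#include <cuda_fp16.h>

__global__ void dot_half_storage_float_math(const half* __restrict__ x,
                                            const half* __restrict__ w,
                                            float* __restrict__ out,
                                            int n)
{
    float acc = 0.0f;  // FP32 accumulator
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        // Upcast only for the calculation; storage stays FP16.
        acc += __half2float(x[i]) * __half2float(w[i]);
    }
    atomicAdd(out, acc);  // crude reduction, fine enough for a sketch
}
```

No tensor cores are involved in a kernel like this; those only come into play for matrix-multiply-accumulate on Volta and later, which both the P40 and P100 predate.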
I was hoping it was as simple as with AutoGPTQ, where it could be done in only a few places. At least for that part, and not the use of tensor cores. The P100 works surprisingly well and also lacks them. It kills flash attention, but it is definitely bearable.