Having used automatic1111 and sdnext with P40s extensively, I have found that only the calculations have to be upcast to FP32 to get most of the performance back.
How many places would such a change have to be made to do this here? And do any of the ops use tensor cores? Tall order, or easy enough change?
Tall order for sure. Once the hardware I've ordered starts coming in, perhaps I could add a P40 and do some profiling and maybe provide some alternative kernels with upcasting. But it would be an extensive change and a lot more code to maintain, so I don't know if it's really feasible with my time budget.
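For context, here is a minimal sketch (not code from this repo, just an illustration with hypothetical names) of what such an upcasting kernel boils down to: weights and activations stay in FP16 in memory, but every multiply-accumulate is upcast to FP32, since Pascal cards like the P40 only run full-rate math in FP32.

```cuda
// Hypothetical sketch of FP16-storage / FP32-math, the pattern the P40 needs:
// values are stored as half to keep memory bandwidth low, but each element is
// upcast with __half2float before the multiply-accumulate, so no half-precision
// arithmetic units are involved.
#include <cuda_fp16.h>

__global__ void dot_half_storage_float_math(const half* __restrict__ x,
                                            const half* __restrict__ w,
                                            float* __restrict__ out,
                                            int n)
{
    float acc = 0.0f;  // FP32 accumulator
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        // Upcast only for the calculation; storage stays FP16.
        acc += __half2float(x[i]) * __half2float(w[i]);
    }
    atomicAdd(out, acc);  // crude reduction, fine enough for a sketch
}
```

No tensor cores are involved in a kernel like this; those only come into play for matrix-multiply-accumulate on Volta and later, which both the P40 and P100 predate.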
I was hoping it was as simple as with AutoGPTQ, where it could be done in only a few places. At least for that part, and not the use of tensor cores. The P100 works surprisingly well and also lacks them. It kills flash attention, but it is definitely bearable.