Tesla P40 performance is still very low. #40

Closed
siriume opened this issue Sep 17, 2023 · 4 comments

Comments


siriume commented Sep 17, 2023

Tesla P40 performance is still very low, only drawing about 80 W under load.
Any progress on the exllama TODO item "Look into improving P40 performance"?
env:

kernel: 6.1.53-x64v3-xanmod1
system: "Linux Mint 21.2 Victoria"
cuda: cuda_11.8.r11.8
nvidia-drivers: NVIDIA-SMI 535.104.05, Driver Version: 535.104.05, CUDA Version: 12.2
nvidia-driver-520 caused my 3090 to slow down to 36 t/s; with the current driver it's back to 39 t/s

Test: a 4096-token run on the P40 takes too long, so I use 1024 instead.

python test_inference.py -m models/CodeLlama-34B-instruct-4.0bpw-h6-exl2/ -s -l 1024
 -- Model: models/CodeLlama-34B-instruct-4.0bpw-h6-exl2/
 -- Options: ['length: 1024', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Measuring token speed...
 ** Position     1 + 127 tokens:    1.1905 t/s
 ** Position   128 + 128 tokens:    1.1948 t/s
 ** Position   256 + 128 tokens:    1.1928 t/s
 ** Position   384 + 128 tokens:    1.1925 t/s
 ** Position   512 + 128 tokens:    1.1884 t/s
 ** Position   768 + 128 tokens:    1.1811 t/s
 ** Position   896 + 128 tokens:    1.1774 t/s

IMbackK commented Jan 22, 2024

This is just #185: the P40 is incredibly slow at fp16.

turboderp (Member) commented Jun 14, 2024

Closing this as stale. Better P40 performance is somewhere on the list of priorities, but there's too much going on right now.

turboderp closed this as not planned on Jun 14, 2024

lee-b commented Aug 26, 2024

@turboderp , could you summarise the known (and unknown) parts of this issue, so that others can consider taking it on?


IMbackK commented Aug 26, 2024

It's really quite simple: exllama's kernels do all calculations on half floats, and Pascal GPUs other than GP100 (P100) are very slow at fp16 because only a tiny fraction of the device's shaders can do fp16 (1/64th of the fp32 rate).

To work around this, you would have to upcast to 32-bit, do the calculation, and then downcast back to 16-bit for storage in every kernel when compiling for Pascal, as sketched below.
This would greatly improve performance on Pascal, but of course Pascal is still poorly suited for LLMs overall either way.
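
A minimal standalone sketch of that pattern (hypothetical example code, not exllama's actual kernels): an elementwise half-precision fused multiply-add where, when compiling for consumer Pascal parts (sm_61/sm_62), the operands are promoted to fp32 for the arithmetic and converted back to fp16 only for storage.

// Hypothetical sketch of the fp32-upcast workaround, not code from exllama.
// Assumed build command: nvcc -arch=sm_61 fp16_upcast_sketch.cu -o fp16_upcast_sketch
#include <cuda_fp16.h>
#include <cstdio>

__global__ void fma_half(const __half* a, const __half* b, const __half* c,
                         __half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530) && (__CUDA_ARCH__ != 610) && (__CUDA_ARCH__ != 620)
    // GPUs with usable fp16 throughput (GP100, Volta, Turing and newer):
    // stay in half precision end to end.
    out[i] = __hfma(a[i], b[i], c[i]);
#else
    // Consumer Pascal (sm_61/sm_62, e.g. the P40): promote to fp32, compute
    // at the full fp32 rate, then convert back to fp16 only for storage.
    float af = __half2float(a[i]);
    float bf = __half2float(b[i]);
    float cf = __half2float(c[i]);
    out[i] = __float2half(fmaf(af, bf, cf));
#endif
}

int main()
{
    const int n = 1 << 20;
    __half *a, *b, *c, *out;
    cudaMallocManaged(&a, n * sizeof(__half));
    cudaMallocManaged(&b, n * sizeof(__half));
    cudaMallocManaged(&c, n * sizeof(__half));
    cudaMallocManaged(&out, n * sizeof(__half));
    for (int i = 0; i < n; ++i) {
        a[i] = __float2half(1.5f);
        b[i] = __float2half(2.0f);
        c[i] = __float2half(0.25f);
    }
    fma_half<<<(n + 255) / 256, 256>>>(a, b, c, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", __half2float(out[0]));   // expect 3.25
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(out);
    return 0;
}

The same load-as-half, compute-as-float, store-as-half structure would have to be applied throughout exllama's kernels when targeting these parts, which is why it is a real porting effort rather than a single switch.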

GPUs better suited would be any NVIDIA GPU from Turing (2018) onward, or any AMD GPU from GCN3/gfx803 (2015) onward, as these devices natively support fp16 at full rate or better (dual issue).
