Tesla P40 performance is still very low. #40

Closed
siriume opened this issue Sep 17, 2023 · 4 comments

Comments


siriume commented Sep 17, 2023

Tesla P40 performance is still very low, only drawing about 80 W under load.
Any progress on the exllama TODO item "Look into improving P40 performance"?
env:

kernel: 6.1.53-x64v3-xanmod1
system: "Linux Mint 21.2 Victoria"
cuda: cuda_11.8.r11.8
nvidia-drivers: NVIDIA-SMI 535.104.05, Driver Version: 535.104.05, CUDA Version: 12.2
nvidia-driver-520 caused my 3090 to slow down to 36 t/s; with the current driver it's back to 39 t/s

Test: a 4096-token run on the P40 takes too long, so I use 1024 instead.

python test_inference.py -m models/CodeLlama-34B-instruct-4.0bpw-h6-exl2/ -s -l 1024
 -- Model: models/CodeLlama-34B-instruct-4.0bpw-h6-exl2/
 -- Options: ['length: 1024', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Measuring token speed...
 ** Position     1 + 127 tokens:    1.1905 t/s
 ** Position   128 + 128 tokens:    1.1948 t/s
 ** Position   256 + 128 tokens:    1.1928 t/s
 ** Position   384 + 128 tokens:    1.1925 t/s
 ** Position   512 + 128 tokens:    1.1884 t/s
 ** Position   768 + 128 tokens:    1.1811 t/s
 ** Position   896 + 128 tokens:    1.1774 t/s

IMbackK commented Jan 22, 2024

This is just #185: the P40 is incredibly slow at fp16.

turboderp (Member) commented Jun 14, 2024

Closing this as stale. Better P40 performance is somewhere on the list of priorities, but there's too much going on right now.

turboderp closed this as not planned on Jun 14, 2024

lee-b commented Aug 26, 2024

@turboderp , could you summarise the known (and unknown) parts of this issue, so that others can consider taking it on?


IMbackK commented Aug 26, 2024

It's really quite simple: exllama's kernels do all calculations on half floats, and Pascal GPUs other than GP100 (P100) are very slow at fp16 because only a tiny fraction of the device's shaders can do fp16 (1/64th of the fp32 rate).

To work around this, you would have to upcast to 32-bit, do the calculation, and then downcast back to 16-bit for storage in every kernel when compiling for Pascal, as sketched below.
This would greatly improve performance on Pascal, but of course Pascal is still poorly suited for LLMs overall either way.
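
A minimal standalone sketch of that pattern (hypothetical example code, not exllama's actual kernels): an elementwise half-precision fused multiply-add where, when compiling for consumer Pascal parts (sm_61/sm_62), the operands are promoted to fp32 for the arithmetic and converted back to fp16 only for storage.

// Hypothetical sketch of the fp32-upcast workaround, not code from exllama.
// Assumed build command: nvcc -arch=sm_61 fp16_upcast_sketch.cu -o fp16_upcast_sketch
#include <cuda_fp16.h>
#include <cstdio>

__global__ void fma_half(const __half* a, const __half* b, const __half* c,
                         __half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530) && (__CUDA_ARCH__ != 610) && (__CUDA_ARCH__ != 620)
    // GPUs with usable fp16 throughput (GP100, Volta, Turing and newer):
    // stay in half precision end to end.
    out[i] = __hfma(a[i], b[i], c[i]);
#else
    // Consumer Pascal (sm_61/sm_62, e.g. the P40): promote to fp32, compute
    // at the full fp32 rate, then convert back to fp16 only for storage.
    float af = __half2float(a[i]);
    float bf = __half2float(b[i]);
    float cf = __half2float(c[i]);
    out[i] = __float2half(fmaf(af, bf, cf));
#endif
}

int main()
{
    const int n = 1 << 20;
    __half *a, *b, *c, *out;
    cudaMallocManaged(&a, n * sizeof(__half));
    cudaMallocManaged(&b, n * sizeof(__half));
    cudaMallocManaged(&c, n * sizeof(__half));
    cudaMallocManaged(&out, n * sizeof(__half));
    for (int i = 0; i < n; ++i) {
        a[i] = __float2half(1.5f);
        b[i] = __float2half(2.0f);
        c[i] = __float2half(0.25f);
    }
    fma_half<<<(n + 255) / 256, 256>>>(a, b, c, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", __half2float(out[0]));   // expect 3.25
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(out);
    return 0;
}

The same load-as-half, compute-as-float, store-as-half structure would have to be applied throughout exllama's kernels when targeting these parts, which is why it is a real porting effort rather than a single switch.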

GPUs better suited would be any NVIDIA GPU from Turing (2018) onward, or any AMD GPU from GCN3/gfx803 (2015) onward, as these devices natively support fp16 at full rate or better (dual issue).
