[pytorch upstream] pointwise is slow compared to cuda #4001


Open

jianyizh opened this issue Apr 24, 2025 · 0 comments


@jianyizh
Contributor

Describe the issue

In inference of the torchbench model pyhpc_equation_of_state, the whole model is fused into one large pointwise kernel, and it is slow compared to A100. The model is faster than A100 in eager mode. FP64 may be one reason, but I manually changed it to FP32 and performance is still low.
pointwise_test.zip
| dtype | 1550 | A100 |
|-------|------|------|
| fp64 | 0.77 ms | 0.43 ms |
| fp32 | 0.30 ms | 0.08 ms |
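
For context, here is a minimal sketch of how a fused pointwise kernel can be timed with torch.compile (inductor). This is not the torchbench harness behind the numbers above; the device selection, the `pointwise_chain` function, the tensor size, and the iteration count are illustrative assumptions.

```python
# Minimal timing sketch, not the torchbench harness. The "xpu" vs "cuda"
# selection, pointwise_chain, tensor size, and iteration count are all
# illustrative assumptions.
import time
import torch

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cuda"
dtype = torch.float32  # switch to torch.float64 to compare fp64 vs fp32


def pointwise_chain(x, y, z):
    # A chain of elementwise ops that inductor fuses into a single pointwise kernel.
    return torch.sqrt(x * x + y * y) * torch.exp(-z) + torch.tanh(x + y + z)


def sync():
    # Device-side synchronization; a stricter benchmark would use device events.
    (torch.xpu if device == "xpu" else torch.cuda).synchronize()


compiled = torch.compile(pointwise_chain)  # default inductor backend

x, y, z = (torch.randn(4_000_000, device=device, dtype=dtype) for _ in range(3))
compiled(x, y, z)  # warm-up: triggers compilation

sync()
t0 = time.perf_counter()
for _ in range(100):
    compiled(x, y, z)
sync()
print(f"{(time.perf_counter() - t0) / 100 * 1e3:.3f} ms per call")
```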

Environment details

triton: 3.3.0+git0bcc8265
pytorch: 3ed5f1fb77669c8ac5d02e7acc0218e31b71c0b6
