[Issue]: Enabling runtime PM causes failure to resume SMU on MI100 #183
Comments
Hi @IMbackK. An internal ticket has been created to investigate this issue. Thanks!
Hi @IMbackK, can you provide the full kernel log? Also, please check whether this issue is present with
The issue is not present with
@IMbackK Thanks. Some other questions to help reproduce this issue:
Can you also run rocminfo with some environment variables to get more info:
The MI100 devices have entered a failed state before the machine has booted far enough to be interactive. Runtime suspend/resume happens fairly often; I understand this is normal. On this machine a daemon monitors the hardware sensors of the devices, and presumably it causes the wakeups.
I tried 6.6.75 with amdgpu-dkms (i.e. amdgpu from this repo) and 6.13.1 with upstream amdgpu.
Unfortunately, this doesn't result in any additional prints:
dmesg: dmesg.log
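To check how often the devices runtime suspend and resume, as described in the comment above, the kernel's per-device runtime PM attributes in sysfs can be read. The following is a minimal sketch, not something posted in the thread: the function names `read_attr` and `amd_runtime_pm` are made up for illustration, and it assumes the standard `power/` attributes under `/sys/bus/pci/devices` and the PCI vendor ID `0x1002` for AMD GPUs.

```python
from pathlib import Path

# Assumption: AMD/ATI GPUs report PCI vendor ID 0x1002 in sysfs.
AMD_VENDOR_ID = "0x1002"


def read_attr(path: Path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def amd_runtime_pm(sysfs_root="/sys/bus/pci/devices"):
    """Collect runtime PM attributes for every PCI device with the AMD vendor ID."""
    rows = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return rows
    for dev in sorted(root.iterdir()):
        if read_attr(dev / "vendor") != AMD_VENDOR_ID:
            continue
        power = dev / "power"
        rows.append({
            "device": dev.name,
            # "auto" means runtime PM is allowed, "on" means it is blocked.
            "control": read_attr(power / "control"),
            # Typically "active", "suspended", "suspending", "resuming" or "error".
            "runtime_status": read_attr(power / "runtime_status"),
            # Both times are reported by the kernel in milliseconds.
            "runtime_active_time_ms": read_attr(power / "runtime_active_time"),
            "runtime_suspended_time_ms": read_attr(power / "runtime_suspended_time"),
        })
    return rows
```

Calling `amd_runtime_pm()` twice, a few seconds apart, and comparing the two time counters gives a rough picture of how frequently each device changes state. The vendor filter also matches non-GPU functions such as the HDMI audio device of a consumer card.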
Why are you using upstream kernel builds (from https://wiki.ubuntu.com/Kernel/MainlineBuilds, I'm guessing) instead of the ones provided by Ubuntu (they aren't officially supported by ROCm either)? Did you try the default 6.8.0 kernel with Ubuntu 24.04 and amdgpu-dkms from https://repo.radeon.com, as included in the ROCm install docs?
I am building kernels myself from kernel.org; I do this because I also do kernel development in other modules. I can try 6.8 too, but I do expect the upstream amdgpu to work. Maybe not for ROCm, where the ioctls provided by the downstream kernel in this repo are used, but at the very least there should be no crash just from booting. Since the issue occurs both with the kernel provided in this repo and with the latest upstream stable, I guess I could also report this to the mailing list.
Oh I see. Yes, do try the default 6.8.0 kernel from Ubuntu with both upstream amdgpu and the dkms module on https://repo.radeon.com.
Yeah, that's a good idea.
@IMbackK I tried reproducing with a 7900xtx and an upstream 6.12 kernel + upstream amdgpu on Ubuntu 24.04 but was unable to. Do you have any other info (HW/SW) about your system config that might be relevant? Did you build the kernel with any specific parameters or use any kernel boot parameters?
It definitely only happens on the MI100 specifically; I also have a 6800xt in the same system that runtime suspends fine. My kconfig only differs from the Ubuntu default in that I enabled CONFIG_DMABUF_MOVE_NOTIFY=y and CONFIG_HSA_AMD_P2P=y, both of which I keep around to allow testing ROCR's dmabuf mode.
This also happens on my Mi100 on Ubuntu 24.04 + standard ROCm install instructions. It is also fixed by I might be in a better place to try experiments than the submitter, since my card is not in production. |
I'm able to reproduce this issue with a system with multiple MI100s. Running kernel and amdgpu built from the
This issue is MI100 specific. Any cause of resume triggers it, including a simple
amdgpu_firmware_info report from two MI100 cards, in case the issue is firmware-version specific. Cards with both VBIOS versions hang on resume.
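When comparing firmware reports from two cards like this, a line-by-line diff keyed on the component name makes version differences easier to spot. The sketch below is illustrative only: the helper names are hypothetical, it assumes each report line has the form `name: value` (as in the debugfs file `/sys/kernel/debug/dri/<N>/amdgpu_firmware_info`, which usually requires root to read), and the sample strings in the usage note are made up, not taken from the attached reports.

```python
def parse_firmware_info(text: str) -> dict:
    """Map the text before the first ':' on each line to the rest of that line.

    Assumption: every relevant line looks like 'name: value'; lines without
    a colon are ignored.
    """
    entries = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            entries[name.strip()] = value.strip()
    return entries


def diff_firmware_info(report_a: str, report_b: str) -> dict:
    """Return {name: (value_a, value_b)} for every entry that differs."""
    left = parse_firmware_info(report_a)
    right = parse_firmware_info(report_b)
    return {
        name: (left.get(name), right.get(name))
        for name in sorted(set(left) | set(right))
        if left.get(name) != right.get(name)
    }
```

For example, `diff_firmware_info(open("card0.txt").read(), open("card1.txt").read())` (hypothetical file names) returns only the entries whose values differ between the two cards. An entry present in just one report shows up with `None` on the other side.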
@sohaibnd It looks like the conditions under which the crash occurs are pretty well established now. Have you been able to reproduce this problem locally?
Problem Description
Booting with runtime PM enabled causes the MI100 devices to fail to appear because the SMU fails to resume. dmesg:
rocm-smi:
Operating System
Ubuntu 24.04
CPU
EPYC 7552
GPU
MI100
ROCm Version
ROCm 6.3.1
ROCm Component
No response
Steps to Reproduce
Add amdgpu.runpm=1 to the kernel command line.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
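Before attempting the reproduction step above, it can help to confirm that the parameter actually reached the driver. This is a rough sketch under stated assumptions: `cmdline_param` and `runpm_state` are hypothetical helper names, and the sketch assumes the booted command line is readable from `/proc/cmdline` and that the loaded amdgpu module exposes the value at `/sys/module/amdgpu/parameters/runpm`.

```python
from pathlib import Path


def cmdline_param(cmdline: str, key: str):
    """Return the last value given for `key` on a kernel command line, or None.

    Simplification: tokens are split on whitespace, so quoted values that
    contain spaces are not handled.
    """
    value = None
    for token in cmdline.split():
        name, sep, val = token.partition("=")
        if sep and name == key:
            value = val
    return value


def runpm_state(proc_cmdline="/proc/cmdline",
                module_param="/sys/module/amdgpu/parameters/runpm"):
    """Report amdgpu.runpm as requested at boot and as seen by the loaded module."""
    cmdline_path = Path(proc_cmdline)
    param_path = Path(module_param)
    cmdline = cmdline_path.read_text() if cmdline_path.exists() else ""
    return {
        "cmdline": cmdline_param(cmdline, "amdgpu.runpm"),
        "module": param_path.read_text().strip() if param_path.exists() else None,
    }
```

`runpm_state()` returns a dictionary with both values. A `None` under `"module"` means the parameter file was not found, for example because amdgpu is not loaded.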
Additional Information
No response