[Issue]: Enabling runtime PM causes failure to resume SMU on MI100 #183

Open
IMbackK opened this issue Feb 5, 2025 · 13 comments

Comments

@IMbackK

IMbackK commented Feb 5, 2025

Problem Description

Booting with runtime PM enabled causes the MI100 devices to fail to appear because the SMU fails to resume. dmesg:

[   33.711163] [drm] PCIE GART of 512M enabled.
[   33.716881] [drm] PTB located at 0x00000087FEF00000
[   33.723056] amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
[   33.778774] amdgpu 0000:03:00.0: amdgpu: reserve 0x400000 from 0x87fe800000 for PSP TMR
[   33.850011] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   33.858894] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[   33.865392] amdgpu 0000:03:00.0: amdgpu: SMC is not ready
[   33.871308] amdgpu 0000:03:00.0: amdgpu: SMC engine is not correctly up!
[   33.878965] amdgpu 0000:03:00.0: amdgpu: resume of IP block <smu> failed -5
[   33.886517] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).
[   37.878379] [drm] PCIE GART of 512M enabled.
[   37.883438] [drm] PTB located at 0x00000087FEF00000
[   37.889037] amdgpu 0000:83:00.0: amdgpu: PSP is resuming...
[   37.945322] amdgpu 0000:83:00.0: amdgpu: reserve 0x400000 from 0x87fe800000 for PSP TMR
[   38.016581] amdgpu 0000:83:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   38.024903] amdgpu 0000:83:00.0: amdgpu: SMU is resuming...
[   38.030956] amdgpu 0000:83:00.0: amdgpu: SMC is not ready
[   38.036682] amdgpu 0000:83:00.0: amdgpu: SMC engine is not correctly up!
[   38.044093] amdgpu 0000:83:00.0: amdgpu: resume of IP block <smu> failed -5
[   38.051452] amdgpu 0000:83:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).
[   38.416529] amdgpu 0000:c3:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none

rocm-smi:

Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK    MCLK    Fan  Perf     PwrCap       VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)                                                          
==========================================================================================================================
0       3     0x738c,   4106   N/A     N/A    N/A, N/A, 0         None    None    0%   unknown  Unsupported  0%     0%    

Operating System

ubuntu 24.04

CPU

Epyc 7552

GPU

MI100

ROCm Version

ROCm 6.3.1

ROCm Component

No response

Steps to Reproduce

Add amdgpu.runpm=1 to the kernel command line.
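To confirm that the parameter from this step actually reached the kernel, something like the following sketch may help. It assumes the standard `/proc/cmdline` file and the `/sys/module/amdgpu/parameters/runpm` attribute; the helper names are hypothetical, and treating the last occurrence on the command line as the effective one is an assumption.

```python
from pathlib import Path
from typing import Optional


def runpm_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the last amdgpu.runpm=<n> value on a kernel command line, or None.

    Assumption: if the parameter is given more than once, the last one wins.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == "amdgpu.runpm" and sep:
            try:
                value = int(raw)
            except ValueError:
                continue  # ignore malformed values such as "amdgpu.runpm=foo"
    return value


def read_text_or_none(path: str) -> Optional[str]:
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("amdgpu.runpm on command line:", runpm_from_cmdline(cmdline))
    # Only present while the amdgpu module is loaded.
    print("value seen by loaded module:",
          read_text_or_none("/sys/module/amdgpu/parameters/runpm"))
```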

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

/opt/rocm/bin/rocminfo --support
ROCk module is loaded
hsa api call failure at: /usr/src/debug/rocminfo/rocminfo-rocm-6.2.4/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

Additional Information

No response

@ppanchad-amd

Hi @IMbackK. Internal ticket has been created to investigate this issue. Thanks!

@sohaibnd

sohaibnd commented Feb 5, 2025

Hi @IMbackK, can you provide the full kernel log? Also, please check whether this issue is present with amdgpu.runpm=0.

@IMbackK
Author

IMbackK commented Feb 5, 2025

The issue is not present with amdgpu.runpm=0, which is logical since the device is never suspended in that configuration and the error occurs on resume. I will provide a full log later, as the machine is currently in use. Is there anything else I can gather for you at the same time?

@sohaibnd

sohaibnd commented Feb 5, 2025

@IMbackK Thanks, some other questions to help repro this issue:

  • What is triggering the resume? Are you running some workload?
  • Which kernel version are you using (use uname -a command to check)?

Can you also run rocminfo with some environment variables to get more info: AMD_LOG_LEVEL=5 HSAKMT_DEBUG_LEVEL=7 rocminfo

@IMbackK
Author

IMbackK commented Feb 6, 2025

What is triggering the resume? Are you running some workload?

The MI100s have already entered a failed state before the machine has booted far enough to be interactive. Runtime suspend/resume happens fairly often; I understand this is normal. On this machine there is a daemon that monitors the hardware sensors of the devices; presumably it causes the wakeup.
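One way to observe the suspend/resume cycling described here is to poll the generic PCI runtime PM attributes in sysfs. The sketch below assumes the standard `power/control` and `power/runtime_status` files under `/sys/bus/pci/devices/<BDF>/`; the function name is hypothetical, and the two addresses are the MI100s from the dmesg output above. To my understanding, reading these attributes should not itself wake the device, but that is not verified here.

```python
from pathlib import Path
from typing import Dict, Optional


def runtime_pm_state(bdf: str,
                     sysfs_root: str = "/sys/bus/pci/devices") -> Dict[str, Optional[str]]:
    """Return the runtime PM 'control' and 'runtime_status' values for a PCI device.

    A value of None means the attribute could not be read (device absent,
    or runtime PM attributes not exposed).
    """
    power_dir = Path(sysfs_root) / bdf / "power"
    state: Dict[str, Optional[str]] = {}
    for attr in ("control", "runtime_status"):
        try:
            state[attr] = (power_dir / attr).read_text().strip()
        except OSError:
            state[attr] = None
    return state


if __name__ == "__main__":
    # PCI addresses of the two MI100s as they appear in the dmesg excerpt.
    for bdf in ("0000:03:00.0", "0000:83:00.0"):
        print(bdf, runtime_pm_state(bdf))
```

Running this in a loop alongside the sensor daemon would show whether the devices move between `suspended` and `active` before the failure appears.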

Which kernel version are you using (use uname -a command to check)?

I tried 6.6.75 with amdgpu-dkms (i.e. amdgpu from this repo) and 6.13.1 with upstream amdgpu.

Can you also run rocminfo with some environment variables to get more info: AMD_LOG_LEVEL=5 HSAKMT_DEBUG_LEVEL=7 rocminfo

Unfortunately this doesn't result in any additional prints:

% AMD_LOG_LEVEL=5 HSAKMT_DEBUG_LEVEL=7 rocminfo
ROCk module is loaded
Failed to open /dev/dri/renderD129: Invalid argument
hsa api call failure at: /usr/src/debug/rocminfo/rocminfo-rocm-6.2.4/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
# amdvbflash -i
adapter seg  bn dn dID       asic           flash      romsize test    bios p/n    
======= ==== == == ==== =============== ============== ======= ==== ================
   0    0000 03 00 738C MI100(Slave)    GD25Q80C        100000 pass 113-D3431500-101
   1    0000 83 00 738C MI100(Slave)    GD25Q80C        100000 pass 113-D3431500-100
   2    0000 C3 00 73BF Navi21          P25Q80H         100000 pass 113-5N21XT_200801

Dmesg: dmesg.log

@sohaibnd

sohaibnd commented Feb 7, 2025

Why are you using upstream kernel builds (from https://wiki.ubuntu.com/Kernel/MainlineBuilds, I'm guessing) instead of the ones provided by Ubuntu? (Upstream builds aren't officially supported by ROCm either.) Did you try the default 6.8.0 kernel with Ubuntu 24.04 and amdgpu-dkms from https://repo.radeon.com, as included in the ROCm install docs?

@IMbackK
Author

IMbackK commented Feb 7, 2025

I build my kernels myself from kernel.org; I do this because I also do kernel development in other modules.

I can try 6.8 too, but I do expect upstream amdgpu to work. Maybe not for ROCm, which uses the ioctls that the downstream kernel in this repo provides, but at the very least there should be no crash just from booting.

I guess since the issue occurs both in the kernel provided in this repo and in the latest upstream stable, I could also report this to the mailing list.

@sohaibnd

sohaibnd commented Feb 7, 2025

Oh I see. Yes, do try the default 6.8.0 kernel from Ubuntu with both upstream amdgpu and the dkms module on https://repo.radeon.com.

I guess since the issue occurs both in the kernel provided in this repo and in the latest upstream stable, I could also report this to the mailing list.

Yeah that's a good idea.

@sohaibnd

sohaibnd commented Feb 7, 2025

@IMbackK I tried reproducing this with a 7900 XTX and an upstream 6.12 kernel + upstream amdgpu on Ubuntu 24.04 but was unable to. Do you have any other info (HW/SW) about your system config that might be relevant? Did you build the kernel with any specific parameters or use any kernel boot parameters?

@IMbackK
Author

IMbackK commented Feb 8, 2025

It definitely only happens on the MI100 specifically; I also have a 6800 XT in the same system that runtime suspends fine.
Besides amdgpu.runpm=1 I also have iommu=pt, which may be relevant.

My kconfig only differs from the Ubuntu default in that I enabled CONFIG_DMABUF_MOVE_NOTIFY=y and CONFIG_HSA_AMD_P2P=y, both of which I keep around to allow testing ROCR's dmabuf mode.
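For anyone comparing their own kernel configuration against the two options named here, a minimal sketch could look like this. It assumes a conventional `.config`-style text (for example `/boot/config-<release>` on Ubuntu, which is an assumption about where the file lives); the function name is hypothetical.

```python
import platform
from pathlib import Path
from typing import Dict, Iterable, Optional


def kconfig_values(config_text: str, options: Iterable[str]) -> Dict[str, Optional[str]]:
    """Look up kernel config options in .config-style text.

    Returns the literal value ("y", "m", ...) for set options, "n" for
    "# CONFIG_X is not set" lines, and None for options not mentioned at all.
    """
    wanted: Dict[str, Optional[str]] = {opt: None for opt in options}
    suffix = " is not set"
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("# ") and line.endswith(suffix):
            name = line[2:-len(suffix)]
            if name in wanted:
                wanted[name] = "n"
        elif "=" in line and not line.startswith("#"):
            name, value = line.split("=", 1)
            if name in wanted:
                wanted[name] = value
    return wanted


if __name__ == "__main__":
    # Assumed location; many distributions install the config here.
    path = Path("/boot") / f"config-{platform.release()}"
    try:
        text = path.read_text()
    except OSError:
        text = ""
    print(kconfig_values(text, ["CONFIG_DMABUF_MOVE_NOTIFY", "CONFIG_HSA_AMD_P2P"]))
```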

@bjj

bjj commented Feb 26, 2025

This also happens on my MI100 on Ubuntu 24.04 with the standard ROCm install instructions. It is also fixed by amdgpu.runpm=0. This is on an MSI Z370M motherboard + i7-8700K.

I might be in a better place to try experiments than the submitter, since my card is not in production.

@LunNova

LunNova commented Mar 3, 2025

I'm able to reproduce this issue with a system with multiple MI100s.

Running kernel and amdgpu built from the rocm-6.3.3 tag of this repository and amdgpu.runpm=1 results in this error when the card tries to resume:

[ 1075.338832] amdgpu 0000:c8:00.0: amdgpu: reserve 0x400000 from 0x87fe800000 for PSP TMR
[ 1075.408586] amdgpu 0000:c8:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 1075.408601] amdgpu 0000:c8:00.0: amdgpu: SMU is resuming...
[ 1075.408610] amdgpu 0000:c8:00.0: amdgpu: SMC is not ready
[ 1075.408646] amdgpu 0000:c8:00.0: amdgpu: SMC engine is not correctly up!
[ 1075.408670] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -5
[ 1075.409439] amdgpu 0000:c8:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).

This issue is MI100 specific. Any cause of resume triggers it, including a simple cat /sys/class/drm/card6/device/gpu_busy_percent. Adding env vars for logging in AMD userspace utilities is of no benefit.
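Since the logs from different kernels word the SMU error slightly differently, a small parser keyed on the final `amdgpu_device_ip_resume failed` line (which appears in both excerpts in this thread) can list the affected devices. This is only a sketch; the function name is hypothetical and the pattern is derived solely from the log lines quoted here.

```python
import re
from typing import List, Tuple

# Matches e.g. "amdgpu 0000:c8:00.0: amdgpu: amdgpu_device_ip_resume failed (-5)."
_RESUME_FAILED = re.compile(
    r"amdgpu (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): amdgpu: "
    r"amdgpu_device_ip_resume failed \((?P<errno>-?\d+)\)"
)


def failed_resumes(log_text: str) -> List[Tuple[str, int]]:
    """Return (PCI address, error code) for every failed amdgpu resume in a log.

    In the logs above the code is -5, which corresponds to -EIO on Linux.
    """
    return [(m.group("bdf"), int(m.group("errno")))
            for m in _RESUME_FAILED.finditer(log_text)]


if __name__ == "__main__":
    import sys
    # Usage: dmesg | python3 this_script.py
    for bdf, err in failed_resumes(sys.stdin.read()):
        print(f"{bdf}: resume failed with {err}")
```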

$ sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 0, firmware version: 0x00000000
PFP feature version: 0, firmware version: 0x00000000
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x00000018
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 0, firmware version: 0x00000000
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 48, firmware version: 0x00000047
IMU feature version: 0, firmware version: 0x00000000
SOS feature version: 0, firmware version: 0x00170050
ASD feature version: 0, firmware version: 0x21000059
TA XGMI feature version: 0x00000000, firmware version: 0x20000014
TA RAS feature version: 0x00000000, firmware version: 0x1b00013e
TA HDCP feature version: 0x00000000, firmware version: 0x00000000
TA DTM feature version: 0x00000000, firmware version: 0x00000000
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x00361d00 (54.29.0)
SDMA0 feature version: 44, firmware version: 0x00000012
SDMA1 feature version: 44, firmware version: 0x00000012
SDMA2 feature version: 44, firmware version: 0x00000012
SDMA3 feature version: 44, firmware version: 0x00000012
SDMA4 feature version: 44, firmware version: 0x00000012
SDMA5 feature version: 44, firmware version: 0x00000012
SDMA6 feature version: 44, firmware version: 0x00000012
SDMA7 feature version: 44, firmware version: 0x00000012
VCN feature version: 0, firmware version: 0x01101015
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x00000000
TOC feature version: 0, firmware version: 0x00000000
MES_KIQ feature version: 0, firmware version: 0x00000000
MES feature version: 0, firmware version: 0x00000000
VPE feature version: 0, firmware version: 0x00000000
VBIOS version: 113-D3432400-100
$ sudo cat /sys/kernel/debug/dri/6/amdgpu_firmware_info
<snipped exact same value for all other FW versions>
VBIOS version: 113-D3431500-101

amdgpu_firmware_info report from two MI100 cards in case the issue is firmware version specific. Cards with both VBIOS versions hang on resume.
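To compare firmware reports like these across cards without reading them by eye, a sketch along the following lines could be used. It assumes the `amdgpu_firmware_info` line layout shown above (`<NAME> feature version: ..., firmware version: ...` plus a `VBIOS version:` line); the function names are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

_FW_LINE = re.compile(
    r"^(?P<name>.+?) feature version: .*?firmware version: (?P<fw>.+)$"
)


def parse_firmware_info(text: str) -> Dict[str, str]:
    """Map each firmware component name to its reported firmware version string."""
    versions: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("VBIOS version:"):
            versions["VBIOS"] = line.split(":", 1)[1].strip()
            continue
        match = _FW_LINE.match(line)
        if match:
            versions[match.group("name")] = match.group("fw").strip()
    return versions


def differing_entries(a: Dict[str, str],
                      b: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return only the components whose versions differ between two reports."""
    return {key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)}
```

Feeding it the contents of two `amdgpu_firmware_info` files (read as root) would show at a glance whether anything other than the VBIOS differs between cards.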

@IMbackK
Author

IMbackK commented Mar 24, 2025

@sohaibnd It looks like the situation in which the crash occurs is pretty well established now. Have you been able to reproduce this problem locally?
