[Issue]: Enabling runtime PM causes failure to resume SMU on MI100 #183

IMbackK · 2025-02-05T17:47:26Z

Problem Description

Booting with runtime PM enabled causes the devices to fail to appear, due to a failure to resume the SMU on MI100 devices. dmesg:

[   33.711163] [drm] PCIE GART of 512M enabled.
[   33.716881] [drm] PTB located at 0x00000087FEF00000
[   33.723056] amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
[   33.778774] amdgpu 0000:03:00.0: amdgpu: reserve 0x400000 from 0x87fe800000 for PSP TMR
[   33.850011] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   33.858894] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[   33.865392] amdgpu 0000:03:00.0: amdgpu: SMC is not ready
[   33.871308] amdgpu 0000:03:00.0: amdgpu: SMC engine is not correctly up!
[   33.878965] amdgpu 0000:03:00.0: amdgpu: resume of IP block <smu> failed -5
[   33.886517] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).
[   37.878379] [drm] PCIE GART of 512M enabled.
[   37.883438] [drm] PTB located at 0x00000087FEF00000
[   37.889037] amdgpu 0000:83:00.0: amdgpu: PSP is resuming...
[   37.945322] amdgpu 0000:83:00.0: amdgpu: reserve 0x400000 from 0x87fe800000 for PSP TMR
[   38.016581] amdgpu 0000:83:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   38.024903] amdgpu 0000:83:00.0: amdgpu: SMU is resuming...
[   38.030956] amdgpu 0000:83:00.0: amdgpu: SMC is not ready
[   38.036682] amdgpu 0000:83:00.0: amdgpu: SMC engine is not correctly up!
[   38.044093] amdgpu 0000:83:00.0: amdgpu: resume of IP block <smu> failed -5
[   38.051452] amdgpu 0000:83:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).
[   38.416529] amdgpu 0000:c3:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
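
As a rough aid for triaging logs like the one above, the following Python sketch pulls out the PCI addresses whose resume failed. The function name `failed_resumes` and the regular expression are illustrative assumptions, not part of any AMD tool; the pattern is written against the `amdgpu_device_ip_resume failed` lines shown in this report and may need adjusting for other kernel versions. The code -5 corresponds to -EIO.

```python
import re

# Matches lines such as:
#   [   33.886517] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).
# The pattern is an assumption derived from the log lines in this report.
RESUME_FAILED = re.compile(
    r"amdgpu (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"amdgpu: amdgpu_device_ip_resume failed \((?P<errno>-?\d+)\)"
)


def failed_resumes(dmesg_text):
    """Return {pci_address: error_code} for every device whose resume failed."""
    failures = {}
    for line in dmesg_text.splitlines():
        match = RESUME_FAILED.search(line)
        if match:
            failures[match.group("bdf")] = int(match.group("errno"))
    return failures


# Sample lines copied from the dmesg output in this report.
sample = """\
[   33.878965] amdgpu 0000:03:00.0: amdgpu: resume of IP block <smu> failed -5
[   33.886517] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).
[   38.051452] amdgpu 0000:83:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).
[   38.416529] amdgpu 0000:c3:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
"""
print(failed_resumes(sample))
```

On a live system the input could come from the saved output of `dmesg`; the third device (0000:c3:00.0) is not reported because it logged no resume failure.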

rocm-smi:

Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK    MCLK    Fan  Perf     PwrCap       VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)                                                          
==========================================================================================================================
0       3     0x738c,   4106   N/A     N/A    N/A, N/A, 0         None    None    0%   unknown  Unsupported  0%     0%

Operating System

Ubuntu 24.04

CPU

Epyc 7552

GPU

MI100

ROCm Version

ROCm 6.3.1

ROCm Component

No response

Steps to Reproduce

Add amdgpu.runpm=1 to the kernel command line and boot.
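
To confirm which setting a given boot actually used, a small Python sketch like the one below can parse a kernel command line string. The helper name `runpm_from_cmdline` is hypothetical, and treating the last occurrence as the effective one is an assumption about how repeated module parameters are applied.

```python
def runpm_from_cmdline(cmdline):
    """Return the amdgpu.runpm value found on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "amdgpu.runpm" and sep:
            # Assumption: a later occurrence overrides an earlier one.
            value = val
    return value


# Example command line; only amdgpu.runpm=1 and iommu=pt come from this report.
print(runpm_from_cmdline("root=/dev/sda1 iommu=pt amdgpu.runpm=1"))
```

On a running machine the input would typically be the contents of `/proc/cmdline`. The value the loaded module ended up with should also be readable from `/sys/module/amdgpu/parameters/runpm`, assuming the driver exposes that parameter on the kernel in use.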

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

/opt/rocm/bin/rocminfo --support
ROCk module is loaded
hsa api call failure at: /usr/src/debug/rocminfo/rocminfo-rocm-6.2.4/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

Additional Information

No response

ppanchad-amd · 2025-02-05T21:54:24Z

Hi @IMbackK. Internal ticket has been created to investigate this issue. Thanks!

sohaibnd · 2025-02-05T22:05:43Z

Hi @IMbackK, can you provide the full kernel log? Also, please check whether this issue is present with amdgpu.runpm=0.

IMbackK · 2025-02-05T22:07:28Z

The issue is not present with amdgpu.runpm=0, which is logical since the device is never suspended in that configuration and the error occurs on resume. I will provide a full log later, as the machine is currently in use. Is there anything else I can gather for you at the same time?

sohaibnd · 2025-02-05T23:34:38Z

@IMbackK Thanks, some other questions to help repro this issue:

What is triggering the resume? Are you running some workload?
Which kernel version are you using (use uname -a command to check)?

Can you also run rocminfo with some environment variables to get more info: AMD_LOG_LEVEL=5 HSAKMT_DEBUG_LEVEL=7 rocminfo

IMbackK · 2025-02-06T21:30:05Z

> What is triggering the resume? Are you running some workload?

The MI100s have already entered a failed state before the machine has booted far enough to be interactive. Runtime suspend/resume happens fairly often; I understand this is normal. On this machine there is a daemon that monitors the hardware sensors of the devices, and presumably it causes the wakeup.

> Which kernel version are you using (use uname -a command to check)?

I tried 6.6.75 with amdgpu-dkms (i.e. amdgpu from this repo) and 6.13.1 with upstream amdgpu.

> Can you also run rocminfo with some environment variables to get more info: AMD_LOG_LEVEL=5 HSAKMT_DEBUG_LEVEL=7 rocminfo

Unfortunately this doesn't result in any additional output:

% AMD_LOG_LEVEL=5 HSAKMT_DEBUG_LEVEL=7 rocminfo
ROCk module is loaded
Failed to open /dev/dri/renderD129: Invalid argument
hsa api call failure at: /usr/src/debug/rocminfo/rocminfo-rocm-6.2.4/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

# amdvbflash -i
adapter seg  bn dn dID       asic           flash      romsize test    bios p/n    
======= ==== == == ==== =============== ============== ======= ==== ================
   0    0000 03 00 738C MI100(Slave)    GD25Q80C        100000 pass 113-D3431500-101
   1    0000 83 00 738C MI100(Slave)    GD25Q80C        100000 pass 113-D3431500-100
   2    0000 C3 00 73BF Navi21          P25Q80H         100000 pass 113-5N21XT_200801

dmesg: dmesg.log

sohaibnd · 2025-02-07T17:23:59Z

Why are you using upstream kernel builds (from https://wiki.ubuntu.com/Kernel/MainlineBuilds, I'm guessing) instead of the ones provided by Ubuntu (the upstream builds aren't officially supported by ROCm either)? Did you try the default 6.8.0 kernel for Ubuntu 24.04 with amdgpu-dkms from https://repo.radeon.com, as described in the ROCm install docs?

IMbackK · 2025-02-07T19:35:21Z

I build my own kernels from kernel.org; I do this because I also do kernel development in other modules.

I can try 6.8 too, but I do expect upstream amdgpu to work. Maybe not for ROCm, which uses the ioctls that the downstream kernel in this repo provides, but at the very least there should be no crash just from booting.

I guess since the issue occurs both with the kernel provided in this repo and with the latest upstream stable, I could also report this to the mailing list.

sohaibnd · 2025-02-07T21:20:42Z

Oh I see. Yes, do try the default 6.8.0 kernel from Ubuntu with both upstream amdgpu and the dkms module on https://repo.radeon.com.

> I guess since the issue occurs both with the kernel provided in this repo and with the latest upstream stable, I could also report this to the mailing list.

Yeah that's a good idea.

sohaibnd · 2025-02-07T23:21:20Z

@IMbackK I tried reproducing this with a 7900 XTX and an upstream 6.12 kernel with upstream amdgpu on Ubuntu 24.04, but was unable to. Do you have any other info (HW/SW) about your system config that might be relevant? Did you build the kernel with any specific parameters or use any kernel boot parameters?

IMbackK · 2025-02-08T07:10:29Z

It definitely only happens on the MI100 specifically; I also have a 6800 XT in the same system that runtime suspends fine.
Besides amdgpu.runpm=1 I also have iommu=pt, which may be relevant.

My kconfig only differs from the Ubuntu default in that I enabled CONFIG_DMABUF_MOVE_NOTIFY=y and CONFIG_HSA_AMD_P2P=y, both of which I keep around to allow testing ROCR's dmabuf mode.

bjj · 2025-02-26T23:55:24Z

This also happens on my MI100 on Ubuntu 24.04 with the standard ROCm install instructions. It is also fixed by amdgpu.runpm=0. This is on an MSI Z370M motherboard with an i7-8700K.

I might be in a better place to try experiments than the submitter, since my card is not in production.

LunNova · 2025-03-03T01:57:39Z

I'm able to reproduce this issue with a system with multiple MI100s.

Running kernel and amdgpu built from the rocm-6.3.3 tag of this repository and amdgpu.runpm=1 results in this error when the card tries to resume:

[ 1075.338832] amdgpu 0000:c8:00.0: amdgpu: reserve 0x400000 from 0x87fe800000 for PSP TMR
[ 1075.408586] amdgpu 0000:c8:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 1075.408601] amdgpu 0000:c8:00.0: amdgpu: SMU is resuming...
[ 1075.408610] amdgpu 0000:c8:00.0: amdgpu: SMC is not ready
[ 1075.408646] amdgpu 0000:c8:00.0: amdgpu: SMC engine is not correctly up!
[ 1075.408670] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -5
[ 1075.409439] amdgpu 0000:c8:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).

This issue is MI100 specific. Any cause of resume triggers it, including a simple cat /sys/class/drm/card6/device/gpu_busy_percent. Adding env vars for logging in AMD userspace utilities is of no benefit.
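
For anyone trying to observe the suspended state before poking the card, the sketch below reads the kernel's generic runtime PM attribute `power/runtime_status` for a PCI device. The helper name `runtime_status` and the `sysfs_root` parameter (added so the function can be pointed at a test directory) are illustrative assumptions. To my understanding, reading this attribute does not itself wake the device, unlike reading `gpu_busy_percent`, but that has not been verified on an MI100 here.

```python
from pathlib import Path


def runtime_status(bdf, sysfs_root="/sys"):
    """Return the runtime PM state string for a PCI device, or None if unreadable."""
    status_file = (
        Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "power" / "runtime_status"
    )
    try:
        # Typical values include "active" and "suspended".
        return status_file.read_text().strip()
    except OSError:
        # Device not present, or attribute not readable on this system.
        return None


# 0000:c8:00.0 is the address from the log above; prints None if it does not exist.
print(runtime_status("0000:c8:00.0"))
```

Checking the state first makes it easier to tell whether a later sysfs read or workload is what triggered the failing resume.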

$ sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 0, firmware version: 0x00000000
PFP feature version: 0, firmware version: 0x00000000
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x00000018
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 0, firmware version: 0x00000000
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 48, firmware version: 0x00000047
IMU feature version: 0, firmware version: 0x00000000
SOS feature version: 0, firmware version: 0x00170050
ASD feature version: 0, firmware version: 0x21000059
TA XGMI feature version: 0x00000000, firmware version: 0x20000014
TA RAS feature version: 0x00000000, firmware version: 0x1b00013e
TA HDCP feature version: 0x00000000, firmware version: 0x00000000
TA DTM feature version: 0x00000000, firmware version: 0x00000000
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x00361d00 (54.29.0)
SDMA0 feature version: 44, firmware version: 0x00000012
SDMA1 feature version: 44, firmware version: 0x00000012
SDMA2 feature version: 44, firmware version: 0x00000012
SDMA3 feature version: 44, firmware version: 0x00000012
SDMA4 feature version: 44, firmware version: 0x00000012
SDMA5 feature version: 44, firmware version: 0x00000012
SDMA6 feature version: 44, firmware version: 0x00000012
SDMA7 feature version: 44, firmware version: 0x00000012
VCN feature version: 0, firmware version: 0x01101015
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x00000000
TOC feature version: 0, firmware version: 0x00000000
MES_KIQ feature version: 0, firmware version: 0x00000000
MES feature version: 0, firmware version: 0x00000000
VPE feature version: 0, firmware version: 0x00000000
VBIOS version: 113-D3432400-100
$ sudo cat /sys/kernel/debug/dri/6/amdgpu_firmware_info
<snipped exact same value for all other FW versions>
VBIOS version: 113-D3431500-101

amdgpu_firmware_info report from two MI100 cards in case the issue is firmware version specific. Cards with both VBIOS versions hang on resume.

IMbackK · 2025-03-24T11:20:17Z

@sohaibnd it looks like the situation in which the crash occurs is pretty well established now. Have you been able to reproduce this problem locally?

ppanchad-amd added the Under Investigation label Feb 5, 2025
