[Issue]: RX 9060 XT ring_timeout with invalid shader crashes card. #202

@harikattar

Problem Description

Originally reported on mesa, but this is a deeper problem.
https://gitlab.freedesktop.org/mesa/mesa/-/issues/14392

Trivially reproducible with apitrace:

Cursemark.zip

Note that recent Mesa has added a workaround for this specific game. The shader in question uses uninitialized variables, and the workaround zero-initializes every variable when compiling shaders. It only applies if the executable is named Cursemark, so replaying the .trace file will still reproduce this error. Please note that the workaround is not the interesting part. I'm not reporting an error with this game demo; it is just the fastest and most reliable way to reproduce the hard fault on the card. I've seen ring timeouts repeatedly in a variety of games and applications, but they take time to trigger. This trace kills the card within seconds.

This is, most likely, a firmware issue. But at minimum the driver should be able to recover without crashing the entire graphics stack.

Operating System

Debian 13 (Trixie)

CPU

AMD Ryzen Threadripper 2920X 12-Core Processor

GPU

AMD Radeon RX 9060 XT (gfx1200)

ROCm Version

7.1.1

ROCm Component

No response

Steps to Reproduce

Run apitrace replay Cursemark.trace under Xorg. I haven't tested under Wayland.
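For anyone who wants to script the reproduction, here is a minimal sketch in Python. It assumes the apitrace CLI is on PATH and that the coredump appears at the card0 path printed in the kernel log further down; the card index, the 120-second timeout, and the helper names are illustrative assumptions, not part of the original report.

```python
import shutil
import subprocess
from pathlib import Path

# Path as printed in the kernel log of this report; the card index may differ elsewhere.
DEVCOREDUMP = Path("/sys/class/drm/card0/device/devcoredump/data")


def replay_command(trace):
    """Build the argv for replaying a trace with the apitrace CLI."""
    return ["apitrace", "replay", str(trace)]


def coredump_present(path=DEVCOREDUMP):
    """Return True if a device coredump file exists at the given path."""
    return Path(path).exists()


def reproduce(trace, timeout_s=120):
    """Replay the trace once and report whether a new devcoredump appeared.

    Returns False if a coredump was already present before the replay,
    because a stale dump cannot be told apart from a new one here.
    """
    if shutil.which("apitrace") is None:
        raise RuntimeError("apitrace not found in PATH")
    before = coredump_present()
    try:
        subprocess.run(replay_command(trace), timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        pass  # a replay that never returns is itself a symptom worth noting
    return (not before) and coredump_present()
```

Calling reproduce("Cursemark.trace") performs one replay. Since the fault takes down the graphics stack, it may be more practical to run it from an SSH session than from a terminal on the affected display.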

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded

HSA System Attributes

Runtime Version: 1.18
Runtime Ext Version: 1.14
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES

==========
HSA Agents


Agent 1


Name: AMD Ryzen Threadripper 2920X 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen Threadripper 2920X 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3500
BDFID: 0
Internal Node ID: 0
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32802540(0x1f486ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 32802540(0x1f486ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32802540(0x1f486ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32802540(0x1f486ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: AMD Ryzen Threadripper 2920X 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen Threadripper 2920X 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3500
BDFID: 0
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 16511036(0xfbf03c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 16511036(0xfbf03c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16511036(0xfbf03c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16511036(0xfbf03c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 3


Name: gfx1200
Uuid: GPU-35349d058425a41a
Marketing Name: AMD Radeon RX 9060 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 32768(0x8000) KB
Chip ID: 30096(0x7590)
ASIC Revision: 1(0x1)
Cacheline Size: 256(0x100)
Max Clock Freq. (MHz): 2700
BDFID: 17664
Internal Node ID: 2
Compute Unit: 32
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 128
SDMA engine uCode:: 662
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16695296(0xfec000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1200
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx12-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

[   87.853679] amdgpu 0000:45:00.0: amdgpu: Dumping IP State
[   87.854542] amdgpu 0000:45:00.0: amdgpu: Dumping IP State Completed
[   87.854603] amdgpu 0000:45:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[   87.854606] amdgpu 0000:45:00.0: amdgpu: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
[   87.854609] amdgpu 0000:45:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=9261, emitted seq=9264
[   87.854615] amdgpu 0000:45:00.0: amdgpu:  Process glretrace pid 12755 thread glretrace:cs0 pid 12756
[   87.854618] amdgpu 0000:45:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[   90.096230] amdgpu 0000:45:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[   90.096238] amdgpu 0000:45:00.0: amdgpu: GPU reset begin!. Source:  1
[   90.125450] iommu ivhd1: AMD-Vi: Event logged [INVALID_DEVICE_REQUEST device=0000:00:00.0 pasid=0x00000 address=0xfffffffdf8000000 flags=0x0a00]
[   92.344425] amdgpu 0000:45:00.0: amdgpu: MES(1) failed to respond to msg=REMOVE_QUEUE
[   92.344432] amdgpu 0000:45:00.0: amdgpu: failed to unmap legacy queue
[   92.586063] [drm:gfx_v12_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[   92.612452] amdgpu 0000:45:00.0: amdgpu: MODE1 reset
[   92.612456] amdgpu 0000:45:00.0: amdgpu: GPU mode1 reset
[   92.612526] amdgpu 0000:45:00.0: amdgpu: GPU smu mode1 reset
[   93.613596] amdgpu 0000:45:00.0: amdgpu: GPU reset succeeded, trying to resume
[   93.613688] amdgpu 0000:45:00.0: amdgpu: PCIE GART of 512M enabled (table at 0x0000008000000000).
[   93.613797] amdgpu 0000:45:00.0: amdgpu: VRAM is lost due to GPU reset!
[   93.613799] amdgpu 0000:45:00.0: amdgpu: PSP is resuming...
[   93.851610] amdgpu 0000:45:00.0: amdgpu: RAS: optional ras ta ucode is not available
[   93.854920] amdgpu 0000:45:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   93.854923] amdgpu 0000:45:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[   93.854925] amdgpu 0000:45:00.0: amdgpu: SMU is resuming...
[   93.854928] amdgpu 0000:45:00.0: amdgpu: smu driver if version = 0x0000002e, smu fw if version = 0x00000032, smu fw program = 0, smu fw version = 0x00664500 (102.69.0)
[   93.854931] amdgpu 0000:45:00.0: amdgpu: SMU driver if version not matched
[   93.889179] amdgpu 0000:45:00.0: amdgpu: SMU is resumed successfully!
[   93.889394] amdgpu 0000:45:00.0: amdgpu: program CP_MES_CNTL : 0x4000000
[   93.889397] amdgpu 0000:45:00.0: amdgpu: program CP_MES_CNTL : 0xc000000
[   93.899636] amdgpu 0000:45:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x0A000700
[   94.078609] amdgpu 0000:45:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[   94.078614] amdgpu 0000:45:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   94.078616] amdgpu 0000:45:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   94.078618] amdgpu 0000:45:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[   94.078620] amdgpu 0000:45:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[   94.078621] amdgpu 0000:45:00.0: amdgpu: ring sdma0 uses VM inv eng 9 on hub 0
[   94.078623] amdgpu 0000:45:00.0: amdgpu: ring sdma1 uses VM inv eng 10 on hub 0
[   94.078624] amdgpu 0000:45:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[   94.078626] amdgpu 0000:45:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[   94.083435] amdgpu 0000:45:00.0: amdgpu: GPU reset(1) succeeded!
[   94.096708] amdgpu 0000:45:00.0: [drm] device wedged, but recovered through reset
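To make the sequence in this log easier to compare across kernels and firmware sets, here is a small sketch that pulls out the key facts: which ring timed out, how far the emitted sequence number was ahead of the signaled one, whether the per-ring reset failed, and whether the MODE1 reset recovered the device. The patterns only match the message wording seen above, which is an assumption about other kernel versions, and the function and field names are my own.

```python
import re

# Patterns follow the message wording in the log above; other kernels may differ.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")
RING_TIMEOUT = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)
RING_RESET_FAILED = re.compile(r"Ring \S+ reset failed")


def summarize_reset(log_text):
    """Collect the key facts of one timeout-and-reset sequence from dmesg text."""
    summary = {
        "ring": None,
        "unsignaled": None,  # emitted seq minus signaled seq
        "ring_reset_failed": False,
        "mode1_reset": False,
        "vram_lost": False,
        "recovered": False,
        "elapsed_s": None,  # last timestamp minus first timestamp
    }
    stamps = []
    for line in log_text.splitlines():
        ts = TIMESTAMP.match(line)
        if ts:
            stamps.append(float(ts.group(1)))
        hit = RING_TIMEOUT.search(line)
        if hit:
            summary["ring"] = hit.group(1)
            summary["unsignaled"] = int(hit.group(3)) - int(hit.group(2))
        if RING_RESET_FAILED.search(line):
            summary["ring_reset_failed"] = True
        if "mode1 reset" in line.lower():
            summary["mode1_reset"] = True
        if "VRAM is lost" in line:
            summary["vram_lost"] = True
        if "recovered through reset" in line:
            summary["recovered"] = True
    if stamps:
        summary["elapsed_s"] = round(stamps[-1] - stamps[0], 6)
    return summary


# A few lines copied from the log in this report.
SAMPLE = """\
[   87.854609] amdgpu 0000:45:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=9261, emitted seq=9264
[   90.096230] amdgpu 0000:45:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[   92.612456] amdgpu 0000:45:00.0: amdgpu: GPU mode1 reset
[   93.613797] amdgpu 0000:45:00.0: amdgpu: VRAM is lost due to GPU reset!
[   94.096708] amdgpu 0000:45:00.0: [drm] device wedged, but recovered through reset
"""

if __name__ == "__main__":
    print(summarize_reset(SAMPLE))
```

On the excerpt above this reports a failed ring reset followed by a MODE1 reset with VRAM loss, which is the escalation this issue is about.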

Tested kernels 6.16.6, 6.17.10, 6.18.1, and 6.19-rc1, with both upstream amdgpu and amdgpu-dkms, and with all firmware releases from your internal git commits:

23451b6e5c1bdcf2183495a8d35bf7e0fb9a8ea4
82c3ef8fd13470434a51b8323a7b33ede62e5c1a
1fb4bba0382a2e9a83831de1cfda9a0097857e86

Firmware was updated by checking out linux-firmware at the appropriate commit and rebooting. I verified that it changed via /sys/class/drm/device/fw_version, which showed the following combinations:

asd_fw_version 0x210000fc (33, 0, 0, 252)
dmcub_fw_version 0x0a000400 (10, 0, 4, 0)
imu_fw_version 0x0c2c2500 (12, 44, 37, 0)
me_fw_version 0x00000b36 (11, 54)
mec_fw_version 0x00000c6c (12, 108)
mes_fw_version 0x00000083 (131,)
mes_kiq_fw_version 0x00000083 (131,)
pfp_fw_version 0x00000b7c (11, 124)
rlc_fw_version 0x00bde160 (189, 225, 96)
sdma2_fw_version 0x00798e96 (121, 142, 150)
sdma_fw_version 0x00798e96 (121, 142, 150)
smc_fw_version 0x00664500 (102, 69, 0)
sos_fw_version 0x003b0f0d (59, 15, 13)
vcn_fw_version 0x0910902e (9, 16, 144, 46)

*reboot*

asd_fw_version 0x21000104 (33, 0, 1, 4)
dmcub_fw_version 0x0a000700 (10, 0, 7, 0)
imu_fw_version 0x0c2c2500 (12, 44, 37, 0)
me_fw_version 0x00000b40 (11, 64)
mec_fw_version 0x00000c80 (12, 128)
mes_fw_version 0x00000084 (132,)
mes_kiq_fw_version 0x00000084 (132,)
pfp_fw_version 0x00000b86 (11, 134)
rlc_fw_version 0x00bde160 (189, 225, 96)
sdma2_fw_version 0x00798e96 (121, 142, 150)
sdma_fw_version 0x00798e96 (121, 142, 150)
smc_fw_version 0x00664500 (102, 69, 0)
sos_fw_version 0x003b0f0d (59, 15, 13)
vcn_fw_version 0x0910b001 (9, 16, 176, 1)

*reboot*

asd_fw_version 0x21000104 (33, 0, 1, 4)
dmcub_fw_version 0x0a000700 (10, 0, 7, 0)
imu_fw_version 0x0c2c2500 (12, 44, 37, 0)
me_fw_version 0x00000b40 (11, 64)
mec_fw_version 0x00000c80 (12, 128)
mes_fw_version 0x00000084 (132,)
mes_kiq_fw_version 0x00000084 (132,)
pfp_fw_version 0x00000b86 (11, 134)
rlc_fw_version 0x00bde160 (189, 225, 96)
sdma2_fw_version 0x00798e96 (121, 142, 150)
sdma_fw_version 0x00798e96 (121, 142, 150)
smc_fw_version 0x00664500 (102, 69, 0)
sos_fw_version 0x003b0f0d (59, 15, 13)
vcn_fw_version 0x0910b001 (9, 16, 176, 1)

I believe the middle set is from the amdgpu-dkms-firmware package 30.20.1.0.30200100-2255209.24.04
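For readers comparing these listings: the tuples in parentheses appear to be the bytes of each 32-bit version word, most significant first, with leading zero bytes dropped. That reading is inferred from the values above, not from driver documentation. A minimal sketch that reproduces the decoding and lists which components changed between the first and second sets (only a subset of the entries is copied here, and the helper names are my own):

```python
def decode_fw_version(value):
    """Split a 32-bit version word into bytes, most significant first,
    dropping leading zero bytes (interpretation inferred from the listing)."""
    return tuple(value.to_bytes(4, "big").lstrip(b"\x00"))


def changed_components(before, after):
    """Return the sorted names whose version word differs between two sets."""
    return sorted(name for name in before if before[name] != after.get(name))


# Subset of the first and second listings from this report.
FIRST = {
    "asd": 0x210000FC,
    "dmcub": 0x0A000400,
    "imu": 0x0C2C2500,
    "me": 0x00000B36,
    "mes": 0x00000083,
    "smc": 0x00664500,
    "vcn": 0x0910902E,
}
SECOND = {
    "asd": 0x21000104,
    "dmcub": 0x0A000700,
    "imu": 0x0C2C2500,
    "me": 0x00000B40,
    "mes": 0x00000084,
    "smc": 0x00664500,
    "vcn": 0x0910B001,
}

if __name__ == "__main__":
    for name in changed_components(FIRST, SECOND):
        print(name, decode_fw_version(FIRST[name]), "->", decode_fw_version(SECOND[name]))
```

In this subset the IMU and SMC words are identical in both sets, while ASD, DMCUB, ME, MES, and VCN differ.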

devcoredump:

devcoredump_full_crash_2025122021.txt
