Description
Problem Description
Originally reported on mesa, but this is a deeper problem.
https://gitlab.freedesktop.org/mesa/mesa/-/issues/14392
Trivially reproducible with apitrace:
Note that recent Mesa has added a workaround for this specific game. The shader in question uses uninitialized variables, and the workaround zero-initializes every variable when compiling shaders. It only applies when the executable is named Cursemark, so replaying the .trace file still reproduces the error. Please note that the workaround itself is not the point: I'm not reporting a bug in this game demo; it is simply the fastest and most reliable way to reproduce the hard fault on the card. I've seen ring timeouts repeatedly in a variety of games and applications, but they take time to trigger. This trace kills the GPU within seconds.
This is, most likely, a firmware issue. But at minimum the driver should be able to recover without crashing the entire graphics stack.
Operating System
Debian 13 (Trixie)
CPU
AMD Ryzen Threadripper 2920X 12-Core Processor
GPU
AMD Radeon RX 9060 XT (gfx1200)
ROCm Version
7.1.1
ROCm Component
No response
Steps to Reproduce
Run apitrace replay Cursemark.trace under Xorg. I haven't tested under Wayland.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.18
Runtime Ext Version: 1.14
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
Agent 1
Name: AMD Ryzen Threadripper 2920X 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen Threadripper 2920X 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3500
BDFID: 0
Internal Node ID: 0
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32802540(0x1f486ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 32802540(0x1f486ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32802540(0x1f486ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32802540(0x1f486ec) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: AMD Ryzen Threadripper 2920X 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen Threadripper 2920X 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3500
BDFID: 0
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 16511036(0xfbf03c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 16511036(0xfbf03c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16511036(0xfbf03c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16511036(0xfbf03c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 3
Name: gfx1200
Uuid: GPU-35349d058425a41a
Marketing Name: AMD Radeon RX 9060 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 32768(0x8000) KB
Chip ID: 30096(0x7590)
ASIC Revision: 1(0x1)
Cacheline Size: 256(0x100)
Max Clock Freq. (MHz): 2700
BDFID: 17664
Internal Node ID: 2
Compute Unit: 32
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 128
SDMA engine uCode:: 662
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16695296(0xfec000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1200
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx12-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
[ 87.853679] amdgpu 0000:45:00.0: amdgpu: Dumping IP State
[ 87.854542] amdgpu 0000:45:00.0: amdgpu: Dumping IP State Completed
[ 87.854603] amdgpu 0000:45:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[ 87.854606] amdgpu 0000:45:00.0: amdgpu: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
[ 87.854609] amdgpu 0000:45:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=9261, emitted seq=9264
[ 87.854615] amdgpu 0000:45:00.0: amdgpu: Process glretrace pid 12755 thread glretrace:cs0 pid 12756
[ 87.854618] amdgpu 0000:45:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[ 90.096230] amdgpu 0000:45:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[ 90.096238] amdgpu 0000:45:00.0: amdgpu: GPU reset begin!. Source: 1
[ 90.125450] iommu ivhd1: AMD-Vi: Event logged [INVALID_DEVICE_REQUEST device=0000:00:00.0 pasid=0x00000 address=0xfffffffdf8000000 flags=0x0a00]
[ 92.344425] amdgpu 0000:45:00.0: amdgpu: MES(1) failed to respond to msg=REMOVE_QUEUE
[ 92.344432] amdgpu 0000:45:00.0: amdgpu: failed to unmap legacy queue
[ 92.586063] [drm:gfx_v12_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 92.612452] amdgpu 0000:45:00.0: amdgpu: MODE1 reset
[ 92.612456] amdgpu 0000:45:00.0: amdgpu: GPU mode1 reset
[ 92.612526] amdgpu 0000:45:00.0: amdgpu: GPU smu mode1 reset
[ 93.613596] amdgpu 0000:45:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 93.613688] amdgpu 0000:45:00.0: amdgpu: PCIE GART of 512M enabled (table at 0x0000008000000000).
[ 93.613797] amdgpu 0000:45:00.0: amdgpu: VRAM is lost due to GPU reset!
[ 93.613799] amdgpu 0000:45:00.0: amdgpu: PSP is resuming...
[ 93.851610] amdgpu 0000:45:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 93.854920] amdgpu 0000:45:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 93.854923] amdgpu 0000:45:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ 93.854925] amdgpu 0000:45:00.0: amdgpu: SMU is resuming...
[ 93.854928] amdgpu 0000:45:00.0: amdgpu: smu driver if version = 0x0000002e, smu fw if version = 0x00000032, smu fw program = 0, smu fw version = 0x00664500 (102.69.0)
[ 93.854931] amdgpu 0000:45:00.0: amdgpu: SMU driver if version not matched
[ 93.889179] amdgpu 0000:45:00.0: amdgpu: SMU is resumed successfully!
[ 93.889394] amdgpu 0000:45:00.0: amdgpu: program CP_MES_CNTL : 0x4000000
[ 93.889397] amdgpu 0000:45:00.0: amdgpu: program CP_MES_CNTL : 0xc000000
[ 93.899636] amdgpu 0000:45:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x0A000700
[ 94.078609] amdgpu 0000:45:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 94.078614] amdgpu 0000:45:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 94.078616] amdgpu 0000:45:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 94.078618] amdgpu 0000:45:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 94.078620] amdgpu 0000:45:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 94.078621] amdgpu 0000:45:00.0: amdgpu: ring sdma0 uses VM inv eng 9 on hub 0
[ 94.078623] amdgpu 0000:45:00.0: amdgpu: ring sdma1 uses VM inv eng 10 on hub 0
[ 94.078624] amdgpu 0000:45:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 94.078626] amdgpu 0000:45:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 94.083435] amdgpu 0000:45:00.0: amdgpu: GPU reset(1) succeeded!
[ 94.096708] amdgpu 0000:45:00.0: [drm] device wedged, but recovered through reset
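The sequence in the log above (ring timeout, failed ring reset, fallback to a MODE1 reset, VRAM loss, recovery) can be pulled out of a kernel log mechanically. The following is a minimal sketch, not a driver tool: the function name `summarize_reset` and the summary keys are made up for illustration, the patterns are matched only against the message wording shown in this report, and other kernel versions may word these messages differently.

```python
import re

# A few lines copied from the log in this report.
SAMPLE_LOG = """\
[ 87.854609] amdgpu 0000:45:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=9261, emitted seq=9264
[ 87.854618] amdgpu 0000:45:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[ 90.096230] amdgpu 0000:45:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[ 92.612452] amdgpu 0000:45:00.0: amdgpu: MODE1 reset
[ 93.613797] amdgpu 0000:45:00.0: amdgpu: VRAM is lost due to GPU reset!
[ 94.083435] amdgpu 0000:45:00.0: amdgpu: GPU reset(1) succeeded!
"""

LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
TIMEOUT_RE = re.compile(r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)")
RING_RESET_FAILED_RE = re.compile(r"Ring \S+ reset failed")
RESET_OK_RE = re.compile(r"GPU reset\(\d+\) succeeded")


def summarize_reset(log_text):
    """Summarize the first ring timeout and what it took to recover from it."""
    summary = {
        "ring": None,                # name of the ring that timed out
        "unsignaled_fences": None,   # emitted seq minus signaled seq
        "ring_reset_failed": False,  # per-ring reset did not work
        "mode1_reset": False,        # driver fell back to a full MODE1 reset
        "vram_lost": False,          # VRAM contents were lost
        "recovery_seconds": None,    # timeout message to "reset succeeded"
    }
    timeout_at = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, msg = float(match.group(1)), match.group(2)
        timeout = TIMEOUT_RE.search(msg)
        if timeout and timeout_at is None:
            timeout_at = stamp
            summary["ring"] = timeout.group(1)
            summary["unsignaled_fences"] = int(timeout.group(3)) - int(timeout.group(2))
        elif RING_RESET_FAILED_RE.search(msg):
            summary["ring_reset_failed"] = True
        elif "MODE1 reset" in msg:
            summary["mode1_reset"] = True
        elif "VRAM is lost" in msg:
            summary["vram_lost"] = True
        elif RESET_OK_RE.search(msg) and timeout_at is not None:
            summary["recovery_seconds"] = stamp - timeout_at
    return summary


if __name__ == "__main__":
    for key, value in summarize_reset(SAMPLE_LOG).items():
        print(f"{key}: {value}")
```

Feeding it the full dmesg output instead of `SAMPLE_LOG` should work the same way, since lines that match none of the patterns are ignored.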
Tested kernels 6.16.6, 6.17.10, 6.18.1, and 6.19-rc1, with both the upstream driver and amdgpu-dkms, and with all firmware releases from your internal git commits:
23451b6e5c1bdcf2183495a8d35bf7e0fb9a8ea4
82c3ef8fd13470434a51b8323a7b33ede62e5c1a
1fb4bba0382a2e9a83831de1cfda9a0097857e86
Firmware was updated by checking out linux-firmware at the appropriate commit and rebooting. I verified the change via /sys/class/drm/device/fw_version, with the following combinations:
asd_fw_version 0x210000fc (33, 0, 0, 252)
dmcub_fw_version 0x0a000400 (10, 0, 4, 0)
imu_fw_version 0x0c2c2500 (12, 44, 37, 0)
me_fw_version 0x00000b36 (11, 54)
mec_fw_version 0x00000c6c (12, 108)
mes_fw_version 0x00000083 (131,)
mes_kiq_fw_version 0x00000083 (131,)
pfp_fw_version 0x00000b7c (11, 124)
rlc_fw_version 0x00bde160 (189, 225, 96)
sdma2_fw_version 0x00798e96 (121, 142, 150)
sdma_fw_version 0x00798e96 (121, 142, 150)
smc_fw_version 0x00664500 (102, 69, 0)
sos_fw_version 0x003b0f0d (59, 15, 13)
vcn_fw_version 0x0910902e (9, 16, 144, 46)
*reboot*
asd_fw_version 0x21000104 (33, 0, 1, 4)
dmcub_fw_version 0x0a000700 (10, 0, 7, 0)
imu_fw_version 0x0c2c2500 (12, 44, 37, 0)
me_fw_version 0x00000b40 (11, 64)
mec_fw_version 0x00000c80 (12, 128)
mes_fw_version 0x00000084 (132,)
mes_kiq_fw_version 0x00000084 (132,)
pfp_fw_version 0x00000b86 (11, 134)
rlc_fw_version 0x00bde160 (189, 225, 96)
sdma2_fw_version 0x00798e96 (121, 142, 150)
sdma_fw_version 0x00798e96 (121, 142, 150)
smc_fw_version 0x00664500 (102, 69, 0)
sos_fw_version 0x003b0f0d (59, 15, 13)
vcn_fw_version 0x0910b001 (9, 16, 176, 1)
*reboot*
asd_fw_version 0x21000104 (33, 0, 1, 4)
dmcub_fw_version 0x0a000700 (10, 0, 7, 0)
imu_fw_version 0x0c2c2500 (12, 44, 37, 0)
me_fw_version 0x00000b40 (11, 64)
mec_fw_version 0x00000c80 (12, 128)
mes_fw_version 0x00000084 (132,)
mes_kiq_fw_version 0x00000084 (132,)
pfp_fw_version 0x00000b86 (11, 134)
rlc_fw_version 0x00bde160 (189, 225, 96)
sdma2_fw_version 0x00798e96 (121, 142, 150)
sdma_fw_version 0x00798e96 (121, 142, 150)
smc_fw_version 0x00664500 (102, 69, 0)
sos_fw_version 0x003b0f0d (59, 15, 13)
vcn_fw_version 0x0910b001 (9, 16, 176, 1)
I believe the middle set is from the amdgpu-dkms-firmware package 30.20.1.0.30200100-2255209.24.04.
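To see at a glance which firmware components actually changed between boots, the listings above can be diffed. This is a rough sketch under two assumptions: the helper names (`decode_version`, `parse_fw_versions`, `changed_firmware`) are invented for illustration, and the rule for splitting a hex value into fields (big-endian bytes with leading zero bytes dropped) is inferred from the tuples printed in the listings, not taken from driver documentation.

```python
import re

FW_LINE_RE = re.compile(r"^(\w+)_fw_version\s+(0x[0-9a-fA-F]+)")

# First and second listings from this report, copied verbatim.
FIRST_BOOT = """\
asd_fw_version 0x210000fc (33, 0, 0, 252)
dmcub_fw_version 0x0a000400 (10, 0, 4, 0)
imu_fw_version 0x0c2c2500 (12, 44, 37, 0)
me_fw_version 0x00000b36 (11, 54)
mec_fw_version 0x00000c6c (12, 108)
mes_fw_version 0x00000083 (131,)
mes_kiq_fw_version 0x00000083 (131,)
pfp_fw_version 0x00000b7c (11, 124)
rlc_fw_version 0x00bde160 (189, 225, 96)
sdma2_fw_version 0x00798e96 (121, 142, 150)
sdma_fw_version 0x00798e96 (121, 142, 150)
smc_fw_version 0x00664500 (102, 69, 0)
sos_fw_version 0x003b0f0d (59, 15, 13)
vcn_fw_version 0x0910902e (9, 16, 144, 46)
"""

SECOND_BOOT = """\
asd_fw_version 0x21000104 (33, 0, 1, 4)
dmcub_fw_version 0x0a000700 (10, 0, 7, 0)
imu_fw_version 0x0c2c2500 (12, 44, 37, 0)
me_fw_version 0x00000b40 (11, 64)
mec_fw_version 0x00000c80 (12, 128)
mes_fw_version 0x00000084 (132,)
mes_kiq_fw_version 0x00000084 (132,)
pfp_fw_version 0x00000b86 (11, 134)
rlc_fw_version 0x00bde160 (189, 225, 96)
sdma2_fw_version 0x00798e96 (121, 142, 150)
sdma_fw_version 0x00798e96 (121, 142, 150)
smc_fw_version 0x00664500 (102, 69, 0)
sos_fw_version 0x003b0f0d (59, 15, 13)
vcn_fw_version 0x0910b001 (9, 16, 176, 1)
"""


def decode_version(value):
    """Split a 32-bit version into big-endian bytes, dropping leading zero bytes."""
    return tuple(value.to_bytes(4, "big").lstrip(b"\x00")) or (0,)


def parse_fw_versions(text):
    """Map component name (e.g. 'mes_kiq') to its raw integer version."""
    versions = {}
    for line in text.splitlines():
        match = FW_LINE_RE.match(line.strip())
        if match:
            versions[match.group(1)] = int(match.group(2), 16)
    return versions


def changed_firmware(before, after):
    """Return {name: (old, new)} for components whose version differs."""
    return {
        name: (before[name], after[name])
        for name in sorted(before)
        if name in after and before[name] != after[name]
    }


if __name__ == "__main__":
    diff = changed_firmware(parse_fw_versions(FIRST_BOOT), parse_fw_versions(SECOND_BOOT))
    for name, (old, new) in diff.items():
        print(f"{name}: {decode_version(old)} -> {decode_version(new)}")
```

Running the same comparison on the second and third listings returns an empty result, since those two listings are identical.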
devcoredump: