-
Notifications
You must be signed in to change notification settings - Fork 116
Description
Problem Description
I am rather robustly getting a kernel error and hung system with any kind of unclean shutdown of Pytorch. A syntax error in the python file after rocm init, doing python -m pydoc transformers and then a quit from the pager, etc. It is quite trivial to reproduce.
Operating System
Ubuntu 24.04.3 LTS (Noble Numbat)
CPU
AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
GPU
gfx1151
ROCm Version
ROCm 7.1.1
ROCm Component
No response
Steps to Reproduce
Any non-trivial model running that crashes seems to produce this. For example: download the mnist example from the pytorch examples repo and run it:
git clone https://github.com/pytorch/examples.git
cd examples/mnist
python3 main.py
Run sudo dmesg -w to see new kernel messages in another window, then start the mnist training.
Let it run for an epoch or two, then ^C to interrupt. You will almost certainly get the null-pointer deference and system instability. If you are lucky, you will see it reported by dmesg before things crash.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.18
Runtime Ext Version: 1.14
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
Uuid: CPU-XX
Marketing Name: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5187
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65468656(0x3e6f8f0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 65468656(0x3e6f8f0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65468656(0x3e6f8f0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65468656(0x3e6f8f0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1151
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
L3: 32768(0x8000) KB
Chip ID: 5510(0x1586)
ASIC Revision: 0(0x0)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 2900
BDFID: 50432
Internal Node ID: 1
Compute Unit: 40
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties: APU
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 32
SDMA engine uCode:: 14
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67108864(0x4000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 67108864(0x4000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1151
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx11-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
*******
Agent 3
*******
Name: aie2
Uuid: AIE-XX
Marketing Name: AIE-ML
Vendor Name: AMD
Feature: AGENT_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 1(0x1)
Queue Min Size: 64(0x40)
Queue Max Size: 64(0x40)
Queue Type: SINGLE
Node: 0
Device Type: DSP
Cache Info:
L2: 2048(0x800) KB
L3: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 0(0x0)
Max Clock Freq. (MHz): 0
BDFID: 0
Internal Node ID: 0
Compute Unit: 0
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:0
Memory Properties:
Features: AGENT_DISPATCH
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, COARSE GRAINED
Size: 65468656(0x3e6f8f0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65536(0x10000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:0KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65468656(0x3e6f8f0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*** Done ***
Additional Information
Install (vanilla as of today, on a GMKTek evo X2, bios vers. 1.9 09/13/2025.
Kernel version 6.14.0-1016-oem (linux-oem-24.04c), driver/rocm from amdgpu-install_7.1.1.70101-1_all.deb,
torch-2.9.1+rocm7.1.1.lw.git351ff442-cp312-cp312-linux_x86_64.whl
torchaudio-2.9.0+rocm7.1.1.gite3c6ee2b-cp312-cp312-linux_x86_64.whl
torchvision-0.24.0+rocm7.1.1.gitb919bd0c-cp312-cp312-linux_x86_64.whl
triton-3.5.1+rocm7.1.1.gita272dfa8-cp312-cp312-linux_x86_64.whl
Here is what I see in dmesg in the cases where I don't get a 100% lockup:
[ 412.030231] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 412.030242] #PF: supervisor read access in kernel mode
[ 412.030244] #PF: error_code(0x0000) - not-present page
[ 412.030247] PGD 0 P4D 0
[ 412.030250] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 412.030254] CPU: 12 UID: 0 PID: 237 Comm: kworker/12:1 Tainted: G OE 6.14.0-1016-oem #16-Ubuntu
[ 412.030259] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 412.030260] Hardware name: GMKtec NucBox_EVO-X2/GMKtec, BIOS EVO-X2 1.09 09/13/2025
[ 412.030262] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[ 412.030485] RIP: 0010:interval_tree_remove+0x98/0x2a0
Here's another crash:
[Nov28 09:18] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ +0.000011] #PF: supervisor read access in kernel mode
[ +0.000002] #PF: error_code(0x0000) - not-present page
[ +0.000003] PGD 0 P4D 0
[ +0.000004] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ +0.000003] CPU: 8 UID: 0 PID: 250 Comm: kworker/8:1 Tainted: G OE 6.14.0-1016-oem #16-Ubuntu
[ +0.000005] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ +0.000001] Hardware name: GMKtec NucBox_EVO-X2/GMKtec, BIOS EVO-X2 1.09 09/13/2025
[ +0.000001] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[ +0.000262] RIP: 0010:__rb_erase_color+0x99/0x270
The kernel stacktrace is not consistent from crash to crash nor the location that it happens from.