Description
Problem Description
This issue arises when using multiple MI300A APUs in various scenarios, such as under high GPU memory usage or after loading AI model tensors. We observe a large performance drop (by a factor of 2 to 10), primarily because some hipMalloc allocations on one device are partially served from the HBM of another device. We expect every allocation on a GPU device to be served exclusively from the memory of the selected device, or else to fail.
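To make the "factor of 2 to 10" concrete, the short sketch below recomputes the throughput from the timings reported in Test 1 of the reproducer output further down (GPU 1 versus GPU 2). It uses the same FLOP count as reproducer.py with its defaults (2000 products of 2048 x 2048 matrices). The helper name `tflops` is illustrative and not part of the reproducer.

```python
# Illustrative recomputation from the timings reported in the reproducer output.
MATRIX_SIZE = 2048   # default --matrix_size
N_MM = 2000          # default --perf_nb_loop

def tflops(elapsed_sec, n=MATRIX_SIZE, n_mm=N_MM):
    """Throughput in TFLOPs for n_mm products of n x n matrices (same count as reproducer.py)."""
    flops = n_mm * (n * n * (2 * n - 1))
    return flops / (elapsed_sec * 1e12)

# Elapsed times (seconds) copied from Test 1 of the output
local_only = tflops(0.631399)   # GPU 2, memory served locally
mixed = tflops(1.451455)        # GPU 1, "any bank"
remote_only = tflops(5.949122)  # GPU 1, "wrong bank"

print(f"local  : {local_only:.2f} TFLOPs")
print(f"mixed  : {mixed:.2f} TFLOPs ({local_only / mixed:.1f}x slower)")
print(f"remote : {remote_only:.2f} TFLOPs ({local_only / remote_only:.1f}x slower)")
```

With these timings, tensors picked at random on the affected GPU run roughly 2x slower than on a GPU whose memory is local, and tensors placed on the other device's HBM run roughly 9x slower.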
Operating System
Red Hat Enterprise Linux 9.4 (Plow)
CPU
AMD Instinct MI300A Accelerator
GPU
4 * AMD Instinct MI300A Accelerator
ROCm Version
ROCm 6.4.0
ROCm Component
No response
Steps to Reproduce
To reproduce, we used a Python environment with torch 2.7.0 and numpy (reproducer.py is below), but you can also reproduce with plain hipMalloc by monitoring how memory usage is distributed across NUMA nodes in /sys/devices/system/node/node*/meminfo.
python reproducer.py --first_gpu_alloc_ratio 0.95 --next_gpus_relative_alloc_ratio 0.90
Output :
Number of available GPUs: 4
-------------------------------
Test 1: This pass will execute the following steps:
1. Show the current memory usage on each NUMA nodes.
2. Allocate tensors on GPU 1 to fill its memory capacity.
3. Check the actual location of the allocated memory by checking in which NUMA node's memory usage has increased.
4. Evaluate TFLOPs by selecting 3 random tensors (A, B, C) on the GPU and compute C += A . B (dot product).
5. Repeat the same process on each GPU without releasing the memory from the previous GPU.
--
Free memory layout at startup :
Numa node 0 free memory : 121 GiB, (of which pagecache memory : 1 GiB)
Numa node 1 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 2 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 3 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
--
GPU 1 : alloc 95.0% - 121.6 GiB of the GPU memory in 7782 tensors
Allocation tooks 13 seconds and comes at 93% (10.09 GiB still free of which 0.45 GiB of pagecache) from the right numa and from :
NUMA node 2: 7.72 GiB
Bench (any bank) :: 23.666840 TFLOPs (during 1.451455 sec)
Bench (right bank) :: 28.834836 TFLOPs (during 1.191314 sec)
Bench (wrong bank) :: 5.774188 TFLOPs (during 5.949122 sec)
--
GPU 2 : alloc 85.5% - 109.44 GiB of the GPU memory in 7004 tensors
Allocation tooks 13 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.405166 TFLOPs (during 0.631399 sec)
Bench (right bank) :: 54.367535 TFLOPs (during 0.631836 sec)
--
GPU 3 : alloc 76.95% - 98.5 GiB of the GPU memory in 6303 tensors
Allocation tooks 12 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.410138 TFLOPs (during 0.631341 sec)
Bench (right bank) :: 54.393623 TFLOPs (during 0.631533 sec)
--
GPU 0 : alloc 69.25% - 88.65 GiB of the GPU memory in 5673 tensors
Allocation tooks 6 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.396970 TFLOPs (during 0.631494 sec)
Bench (right bank) :: 54.373855 TFLOPs (during 0.631762 sec)
-------------------------------
Test 2: This pass will execute the following steps:
1. Fill the pagecache with a dummy 100 GiB file to demonstrate the effect of pagecache on memory allocation.
2. Retry the allocation and performance evaluation after partially filling the pagecache.
107374182400 bytes (107 GB, 100 GiB) copied, 102 s, 1.1 GB/s
100+0 records in
100+0 records out
107374182400 bytes (107 GB, 100 GiB) copied, 101.898 s, 1.1 GB/s
Successfully created 100 GiB file filled with zeros: ./zero_file.bin
Successfully synchronized file system buffers.
--
Free memory layout at startup :
Numa node 0 free memory : 111 GiB, (of which pagecache memory : 101 GiB)
Numa node 1 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 2 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 3 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
--
GPU 1 : alloc 95.0% - 121.6 GiB of the GPU memory in 7782 tensors
Allocation tooks 13 seconds and comes at 93% (9.93 GiB still free of which 0.45 GiB of pagecache) from the right numa and from :
NUMA node 2: 7.55 GiB
Bench (any bank) :: 24.073996 TFLOPs (during 1.426907 sec)
Bench (right bank) :: 28.760786 TFLOPs (during 1.194381 sec)
Bench (wrong bank) :: 5.542506 TFLOPs (during 6.197801 sec)
--
GPU 2 : alloc 85.5% - 109.44 GiB of the GPU memory in 7004 tensors
Allocation tooks 13 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.412645 TFLOPs (during 0.631312 sec)
Bench (right bank) :: 54.356111 TFLOPs (during 0.631968 sec)
--
GPU 3 : alloc 76.95% - 98.5 GiB of the GPU memory in 6303 tensors
Allocation tooks 11 seconds and comes mainly from the right memory bank
Bench (any bank) :: 54.441244 TFLOPs (during 0.630980 sec)
Bench (right bank) :: 54.393767 TFLOPs (during 0.631531 sec)
--
GPU 0 : alloc 69.25% - 88.65 GiB of the GPU memory in 5673 tensors
Allocation tooks 67 seconds and comes at 75% (49.43 GiB still free of which 0.41 GiB of pagecache) from the right numa and from :
NUMA node 3: 19.7 GiB
Bench (any bank) :: 22.395341 TFLOPs (during 1.533861 sec)
Bench (right bank) :: 18.161142 TFLOPs (during 1.891475 sec)
Bench (wrong bank) :: 54.232145 TFLOPs (during 0.633413 sec)
Successfully removed: ./zero_file.bin
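The allocation sizes and tensor counts printed above follow directly from the two command-line ratios. The sketch below reproduces that arithmetic, assuming each GPU reports 128 GiB of total memory (consistent with the 121.6 GiB printed for 95%) and the default 2048 x 2048 float32 tensors (16 MiB each). The function name `allocation_plan` is illustrative and does not exist in reproducer.py.

```python
def allocation_plan(total_bytes, first_ratio, next_ratio, num_gpus, matrix_size=2048):
    """Return (ratio, GiB, tensor count) per GPU, mirroring reproducer.py's arithmetic."""
    block_bytes = matrix_size * matrix_size * 4  # one float32 matrix
    plan = []
    ratio = first_ratio
    for _ in range(num_gpus):
        nb_blocks = round((total_bytes * ratio) // block_bytes)
        plan.append((ratio, total_bytes * ratio / 1024 ** 3, nb_blocks))
        ratio *= next_ratio  # each following GPU allocates a smaller share
    return plan

# Assumption: 128 GiB of total memory per GPU
plan = allocation_plan(128 * 1024 ** 3, 0.95, 0.90, 4)
for ratio, gib, nb_blocks in plan:
    print(f"{ratio * 100:.2f}% -> {gib:.2f} GiB in {nb_blocks} tensors")
```

Under that assumption the tensor counts come out as 7782, 7004, 6303 and 5673, matching the four "alloc" lines in the output.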
reproducer.py :
import os
import torch
import subprocess
import time
import random
import argparse
def main(args):
    # Get the number of available GPUs
    num_gpus = torch.cuda.device_count()
    print(f"Number of available GPUs: {num_gpus}")
    # Execute 2 passes of:
    #  --> Show the memory
    #  --> Allocate tensors on GPU 1 to fill its memory
    #  --> Check the real location of the allocated memory (which NUMA node's memory usage increased?)
    #  --> Evaluate TFLOPs by selecting 3 random tensors (A, B, C) on the GPU and computing C += A . B (dot product)
    #  ... same thing on each GPU (without releasing the memory of the previous GPUs)
    #
    # The second pass differs only in that we start by partially filling the pagecache
    # with evictable pages to show the effect of the pagecache on memory allocation
    for l in range(2) :
        alloc_ratio = args.first_gpu_alloc_ratio
        torch.cuda.empty_cache()
        print("-------------------------------")
        if l == 0 :
            print(f"Test 1: This pass will execute the following steps:")
            print("1. Show the current memory usage on each NUMA nodes.")
            print(f"2. Allocate tensors on GPU {args.first_gpu_index % num_gpus} to fill its memory capacity.")
            print("3. Check the actual location of the allocated memory by checking in which NUMA node's memory usage has increased.")
            print("4. Evaluate TFLOPs by selecting 3 random tensors (A, B, C) on the GPU and compute C += A . B (dot product).")
            print("5. Repeat the same process on each GPU without releasing the memory from the previous GPU.")
        else :
            if not args.do_pagecache_test :
                break
            print(f"Test 2: This pass will execute the following steps:")
            print(f"1. Fill the pagecache with a dummy {args.pagecache_fill_size} GiB file to demonstrate the effect of pagecache on memory allocation.")
            print("2. Retry the allocation and performance evaluation after partially filling the pagecache.")
            create_zero_file('./zero_file.bin', args.pagecache_fill_size)
        print("--")
        print("Free memory layout at startup :")
        memfree_start, pagecache_start = get_memfree()
        for node_id, memfree_bytes in memfree_start.items() :
            print(f"Numa node {node_id} free memory : {round(memfree_bytes/1024/1024)} GiB, (of which pagecache memory : {round(pagecache_start[node_id]/1024/1024)} GiB)")
        tensors = [None] * num_gpus
        for i in range(num_gpus) :
            print("--")
            device_idx = (i + args.first_gpu_index) % num_gpus
            total_memory = torch.cuda.get_device_properties(device_idx).total_memory
            # Get initial numa node memory info
            memfree_before, pagecache_before = get_memfree()
            # Allocate a bunch of tensors (nb_blocks) that fill alloc_ratio of the GPU
            nb_blocks = round((total_memory * alloc_ratio) // (args.matrix_size * args.matrix_size * 4))
            print(f"GPU {device_idx} : alloc {round(alloc_ratio * 100, 2)}% - {round(total_memory*alloc_ratio / 1024 ** 3, 2)} GiB of the GPU memory in {nb_blocks} tensors")
            start_time = time.time()
            tensors[device_idx] = [torch.zeros(args.matrix_size, args.matrix_size, dtype=torch.float32, device=f"cuda:{device_idx}") for j in range(nb_blocks)]
            torch.cuda.synchronize(device_idx)
            duration = time.time() - start_time
            # Get numa node memory info after allocation
            memfree_after, pagecache_after = get_memfree()
            # Calculate the difference in memory
            mem_diff = {}
            total_diff = 0
            secondary_pools = ""
            for node in memfree_before.keys():
                diff = memfree_after[node] - memfree_before[node]
                mem_diff[node] = diff
                total_diff += diff
                if node != device_idx and diff < - 1024 * 1024:
                    # Print additional nodes where the allocation difference is greater than 1024 MiB -- which is enough to skip the python memory usage increase
                    secondary_pools += f"\n NUMA node {node}: {-round(diff/1024/1024, 2)} GiB"
            ratio_in_right_numa = mem_diff[device_idx] / total_diff
            if len(secondary_pools) == 0 :
                print(f" Allocation tooks {round(duration)} seconds and comes mainly from the right memory bank")
            else :
                print(f" Allocation tooks {round(duration)} seconds and comes at {round(ratio_in_right_numa*100)}% ({round(memfree_after[device_idx] / 1024 / 1024, 2)} GiB still free of which {round(pagecache_after[device_idx]/1024/1024, 2)} GiB of pagecache) from the right numa and from :{secondary_pools}")
            # Perform matrix multiplication on each GPU
            if args.do_perf_test :
                execute_perf_test(args, tensors, device_idx, nb_blocks, ratio_in_right_numa)
            # Reduce the % of the memory to show different behaviours
            alloc_ratio *= args.next_gpus_relative_alloc_ratio
        # Release the tensors of every GPU before the next pass
        tensors = None
        print()
        print()
    remove_file('./zero_file.bin')
# Get the free and pagecache memory (in kB) for each numa node from /sys/devices/system/node/node*/meminfo.
def get_memfree():
    memfree = {}
    page_cache = {}
    try:
        output = subprocess.check_output("grep -hi 'MemFree' /sys/devices/system/node/node*/meminfo", shell=True)
        for line in output.decode().strip().split('\n'):
            # assume format is : Node 0 MemFree: 28046496 kB
            parts = line.split()
            node_id = int(parts[1])
            memfree[node_id] = int(parts[3])
        output = subprocess.check_output("grep -hi 'FilePages' /sys/devices/system/node/node*/meminfo", shell=True)
        for line in output.decode().strip().split('\n'):
            # assume format is : Node 0 FilePages: 28046496 kB
            parts = line.split()
            node_id = int(parts[1])
            page_cache[node_id] = int(parts[3])
            # pagecache pages are evictable, so count them as free memory
            memfree[node_id] += page_cache[node_id]
    except Exception as e:
        print(f"Error reading memory info: {e}")
    return memfree, page_cache
def pick_random_square_matrix(tensors, mnp, n_blocks) :
    if n_blocks < 0 :
        idx = random.randint(n_blocks, -1) # pick a random index for tensor
    else :
        idx = random.randint(0, n_blocks - 1) # pick a random index for tensor
    return tensors[idx].view(mnp, mnp) # use the tensor as a square matrix

def random_mm_sum(tensors, mnp, n_blocks, n_mm) :
    for l in range(n_mm):
        A = pick_random_square_matrix(tensors, mnp, n_blocks)
        B = pick_random_square_matrix(tensors, mnp, n_blocks)
        C = pick_random_square_matrix(tensors, mnp, n_blocks)
        C += torch.mm(A,B)
    flops = n_mm * (mnp * mnp * (2*mnp - 1))
    return C, flops
def execute_perf_test(args, tensors, device_idx, nb_blocks, ratio_in_right_numa) :
    results = []
    def bench(test_name, n_blocks, n_mm, from_end = False) :
        torch.cuda.synchronize(device=f'cuda:{device_idx}')
        start_time = time.time()
        C, total_flops = random_mm_sum(tensors[device_idx], args.matrix_size, n_blocks, n_mm)
        torch.cuda.synchronize(device=f'cuda:{device_idx}')
        end_time = time.time()
        elapsed_time = end_time - start_time # Time in seconds
        tflops = total_flops / (elapsed_time * 1e12) # Convert to TFLOPs
        if test_name is not None :
            print(f"{test_name}: {tflops:.6f} TFLOPs (during {elapsed_time:.6f} sec)")
        return C
    # warmup bench (discarded)
    results.append(bench(None, nb_blocks, args.perf_nb_warmup_loop))
    # any tensor is candidate
    results.append(bench(" Bench (any bank) :", nb_blocks, args.perf_nb_loop))
    # assume that the allocation started to use the right memory bank and then fallback to the wrong one
    nb_block_in_right_numa = round(ratio_in_right_numa * nb_blocks) - 1
    if nb_block_in_right_numa > 100 :
        results.append(bench(" Bench (right bank) :", nb_block_in_right_numa, args.perf_nb_loop))
    nb_block_in_wrong_numa = round((1 - ratio_in_right_numa) * nb_blocks) - 1
    if nb_block_in_wrong_numa > 100 :
        results.append(bench(" Bench (wrong bank) :", -nb_block_in_wrong_numa, args.perf_nb_loop))
# Write a file full of zeros that will fill the pagecache
# you can replace it by any non O_DIRECT file read (for instance cat filename > /dev/null)
# we ensure the file is properly sync at the end to be sure pages can be evicted from pagecache easily
def create_zero_file(filename, size_gb):
    # Execute the dd command
    process = subprocess.run(f'dd if=/dev/zero of={filename} bs=1G count={size_gb} status=progress', shell=True)
    # Check if the command was successful
    if process.returncode == 0:
        print(f"Successfully created {size_gb} GiB file filled with zeros: {filename}")
    else:
        print(f"Error occurred while creating the file: {process.returncode}")
    # Ensure there are no dirty pages
    sync_process = subprocess.run('sync', shell=True)
    if sync_process.returncode == 0:
        print("Successfully synchronized file system buffers.")
    else:
        print(f"Error occurred while synchronizing: {sync_process.returncode}")
# At the end we want to empty the pagecache to restore the original state
# it is done by destroying the file created by create_zero_file
def remove_file(filename):
    try:
        os.remove(filename)
        print(f"Successfully removed: {filename}")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except PermissionError:
        print(f"Permission denied: {filename}")
    except Exception as e:
        print(f"Error occurred while removing the file: {e}")
# argparse's type=bool treats any non-empty string (even "False") as True,
# so boolean flags are parsed explicitly
def str_to_bool(value):
    return str(value).strip().lower() in ("1", "true", "yes", "y")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="GPU Memory Allocation Parameters")
    parser.add_argument('--first_gpu_alloc_ratio', type=float, required=True,
                        help='Fraction (0 to 1) of the GPU capacity to allocate on the first GPU')
    parser.add_argument('--next_gpus_relative_alloc_ratio', type=float, required=True,
                        help='Each GPU allocates the ratio of the previous one multiplied by this value')
    parser.add_argument('--first_gpu_index', type=int, default=1,
                        help='Index of the first GPU to fill (to start with a GPU other than 0)')
    parser.add_argument('--matrix_size', type=int, default=2048,
                        help='Size of the tensors to allocate (matrix_size x matrix_size)')
    parser.add_argument('--do_perf_test', type=str_to_bool, default=True,
                        help='Flag to enable performance test (true/false)')
    parser.add_argument('--perf_nb_loop', type=int, default=2000,
                        help='Number of dot products with 2 matrices of shape (matrix_size, matrix_size)')
    parser.add_argument('--perf_nb_warmup_loop', type=int, default=round(2000 * 0.2),
                        help='Number of warmup loops for performance test')
    parser.add_argument('--do_pagecache_test', type=str_to_bool, default=True,
                        help='Flag to enable pagecache test (true/false)')
    parser.add_argument('--pagecache_fill_size', type=int, default=100,
                        help='Size in GiB to fill in the pagecache for the test')
    args = parser.parse_args()
    main(args)
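The per-node accounting done by get_memfree can also be performed without spawning grep, which makes it easier to reuse when monitoring a plain hipMalloc test. The following is a minimal sketch, assuming the usual `Node <id> <Field>: <value> kB` layout of /sys/devices/system/node/node*/meminfo. `parse_node_meminfo` and `read_all_nodes` are hypothetical helper names, not part of reproducer.py.

```python
import glob
import re

# Matches lines such as "Node 0 MemFree:        28046496 kB"
LINE_RE = re.compile(r"^Node\s+(\d+)\s+(\w+):\s+(\d+)\s*kB", re.MULTILINE)

def parse_node_meminfo(text, fields=("MemFree", "FilePages")):
    """Parse the text of one node's meminfo file into {node_id: {field: kB}}."""
    result = {}
    for node_id, field, value in LINE_RE.findall(text):
        if field in fields:
            result.setdefault(int(node_id), {})[field] = int(value)
    return result

def read_all_nodes(pattern="/sys/devices/system/node/node*/meminfo"):
    """Read every NUMA node's meminfo (Linux only); returns {} if none are found."""
    merged = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            merged.update(parse_node_meminfo(f.read()))
    return merged

if __name__ == "__main__":
    for node_id, info in sorted(read_all_nodes().items()):
        # As in get_memfree, evictable pagecache is counted together with free memory
        available_kb = info.get("MemFree", 0) + info.get("FilePages", 0)
        print(f"node {node_id}: {available_kb / 1024 / 1024:.2f} GiB free or in pagecache")
```

Calling `read_all_nodes()` before and after an allocation and subtracting the two results gives the same per-node difference that the reproducer prints.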
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module version 6.7.0 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.15
Runtime Ext Version: 1.7
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: YES
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Instinct MI300A Accelerator
Uuid: CPU-XX
Marketing Name: AMD Instinct MI300A Accelerator
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3700
BDFID: 0
Internal Node ID: 0
Compute Unit: 48
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 130847776(0x7cc9420) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 130847776(0x7cc9420) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 130847776(0x7cc9420) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 130847776(0x7cc9420) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: AMD Instinct MI300A Accelerator
Uuid: CPU-XX
Marketing Name: AMD Instinct MI300A Accelerator
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3700
BDFID: 0
Internal Node ID: 1
Compute Unit: 48
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131809200(0x7db3fb0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 131809200(0x7db3fb0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131809200(0x7db3fb0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131809200(0x7db3fb0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 3
*******
Name: AMD Instinct MI300A Accelerator
Uuid: CPU-XX
Marketing Name: AMD Instinct MI300A Accelerator
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 2
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3700
BDFID: 0
Internal Node ID: 2
Compute Unit: 48
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131809208(0x7db3fb8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 131809208(0x7db3fb8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131809208(0x7db3fb8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131809208(0x7db3fb8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 4
*******
Name: AMD Instinct MI300A Accelerator
Uuid: CPU-XX
Marketing Name: AMD Instinct MI300A Accelerator
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 3
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3700
BDFID: 0
Internal Node ID: 3
Compute Unit: 48
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131795632(0x7db0ab0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 131795632(0x7db0ab0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131795632(0x7db0ab0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131795632(0x7db0ab0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 5
*******
Name: gfx942
Uuid: GPU-6cdb410c2f219ac0
Marketing Name: AMD Instinct MI300A
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 4
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 24576(0x6000) KB
L3: 262144(0x40000) KB
Chip ID: 29856(0x74a0)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 512
Internal Node ID: 4
Compute Unit: 228
SIMDs per CU: 4
Shader Engines: 24
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: TRUE
Memory Properties: APU
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 138
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack+
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 6
*******
Name: gfx942
Uuid: GPU-a5b43667c73cac87
Marketing Name: AMD Instinct MI300A
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 5
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 24576(0x6000) KB
L3: 262144(0x40000) KB
Chip ID: 29856(0x74a0)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 512
Internal Node ID: 5
Compute Unit: 228
SIMDs per CU: 4
Shader Engines: 24
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: TRUE
Memory Properties: APU
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 138
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack+
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 7
*******
Name: gfx942
Uuid: GPU-655f5d89689c7f67
Marketing Name: AMD Instinct MI300A
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 6
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 24576(0x6000) KB
L3: 262144(0x40000) KB
Chip ID: 29856(0x74a0)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 512
Internal Node ID: 6
Compute Unit: 228
SIMDs per CU: 4
Shader Engines: 24
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: TRUE
Memory Properties: APU
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 138
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack+
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 8
*******
Name: gfx942
Uuid: GPU-5f8fb5cf1f014be1
Marketing Name: AMD Instinct MI300A
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 7
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 24576(0x6000) KB
L3: 262144(0x40000) KB
Chip ID: 29856(0x74a0)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 512
Internal Node ID: 7
Compute Unit: 228
SIMDs per CU: 4
Shader Engines: 24
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: TRUE
Memory Properties: APU
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 138
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 134217728(0x8000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack+
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
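For reference, the pool sizes in the rocminfo output above are given in KB. The small conversion below uses values copied from that output and assumes KB means 1024 bytes, which is consistent with the round hexadecimal value 0x8000000 of the GPU pools. The dictionary keys are labels chosen here for readability.

```python
# Pool sizes in KB, copied from the rocminfo output
pool_sizes_kb = {
    "CPU agent 1 (node 0)": 130847776,
    "CPU agent 2 (node 1)": 131809200,
    "CPU agent 3 (node 2)": 131809208,
    "CPU agent 4 (node 3)": 131795632,
    "GPU agents 5-8": 134217728,
}

def kb_to_gib(size_kb):
    """Convert a size in KB (assumed to be 1024 bytes) to GiB."""
    return size_kb / 1024 / 1024

pool_sizes_gib = {name: kb_to_gib(kb) for name, kb in pool_sizes_kb.items()}
for name, gib in pool_sizes_gib.items():
    print(f"{name}: {gib:.2f} GiB")
```

Under that assumption each GPU pool is exactly 128 GiB, while the CPU-side pools are slightly smaller (about 124.8 to 125.7 GiB per node).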
There is no specific message regarding the location of memory in dmesg, nor with AMD_LOG_LEVEL enabled.