[Issue]: MI300A : hipMalloc falls in wrong GPU HBM under memory pressure #186

@cboillot

Problem Description

This issue arises when using multiple MI300A APUs in various scenarios, such as under high GPU memory usage or after loading AI model tensors. We observe a large performance drop (by a factor of 2 to 10), primarily because some hipMalloc allocations made on one device are partially served from the HBM of another device. We expect every allocation on a GPU device to be served exclusively from the memory of the selected device, or else to fail.

Operating System

Red Hat Enterprise Linux 9.4 (Plow)

CPU

AMD Instinct MI300A Accelerator

GPU

4 * AMD Instinct MI300A Accelerator

ROCm Version

ROCm 6.4.0

ROCm Component

No response

Steps to Reproduce

To reproduce, we used a Python environment with torch 2.7.0 and numpy (reproducer.py is below), but you can also reproduce with plain hipMalloc by monitoring how memory usage is distributed across NUMA nodes in /sys/devices/system/node/node*/meminfo.
python reproducer.py --first_gpu_alloc_ratio 0.95 --next_gpus_relative_alloc_ratio 0.90
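
For readers who want to watch the NUMA distribution without PyTorch, the monitoring step can be sketched as below. This is a minimal sketch and not part of the reproducer: the helper names (`parse_node_meminfo`, `read_all_nodes`, `allocation_share`) are hypothetical, the sample values are made up for illustration, and it assumes, as the reproducer does, that GPU i is backed by NUMA node i.

```python
import glob
import re

# Matches lines such as "Node 0 MemFree:        28046496 kB"
_LINE = re.compile(r"Node[ \t]+(\d+)[ \t]+(\w+):[ \t]+(\d+)[ \t]+kB")

def parse_node_meminfo(text, fields=("MemFree", "FilePages")):
    """Return {node_id: summed kB of the requested fields} from meminfo text."""
    totals = {}
    for match in _LINE.finditer(text):
        node, key, value = int(match.group(1)), match.group(2), int(match.group(3))
        if key in fields:
            totals[node] = totals.get(node, 0) + value
    return totals

def read_all_nodes():
    """Read every /sys/devices/system/node/node*/meminfo file (Linux only)."""
    text = ""
    for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
        with open(path) as f:
            text += f.read()
    return parse_node_meminfo(text)

def allocation_share(before, after, device_node):
    """Fraction of the consumed memory that came from device_node."""
    consumed = {n: before[n] - after[n] for n in before}
    total = sum(consumed.values())
    return consumed[device_node] / total if total else float("nan")

# Illustrative (made-up) snapshots in kB around an allocation on GPU/node 1
before = parse_node_meminfo("Node 1 MemFree: 1000 kB\nNode 2 MemFree: 1000 kB\n")
after = parse_node_meminfo("Node 1 MemFree: 100 kB\nNode 2 MemFree: 900 kB\n")
print(f"share from node 1: {allocation_share(before, after, 1):.2f}")
```

On a real system, call `read_all_nodes()` once before and once after the hipMalloc calls; a share noticeably below 1.0 for the device's own node indicates the spill described in this issue.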

Output :

Number of available GPUs: 4
-------------------------------
Test 1: This pass will execute the following steps:
1. Show the current memory usage on each NUMA nodes.
2. Allocate tensors on GPU 1 to fill its memory capacity.
3. Check the actual location of the allocated memory by checking in which NUMA node's memory usage has increased.
4. Evaluate TFLOPs by selecting 3 random tensors (A, B, C) on the GPU and compute C += A . B (dot product).
5. Repeat the same process on each GPU without releasing the memory from the previous GPU.
--
Free memory layout at startup :
Numa node 0 free memory : 121 GiB, (of which pagecache memory : 1 GiB)
Numa node 1 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 2 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 3 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
--
GPU 1 : alloc 95.0% - 121.6 GiB of the GPU memory in 7782 tensors
 Allocation tooks 13 seconds and comes at 93% (10.09 GiB still free of which 0.45 GiB of pagecache) from the right numa and from :
  NUMA node 2: 7.72 GiB
 Bench (any bank) :: 23.666840 TFLOPs (during 1.451455 sec)
 Bench (right bank) :: 28.834836 TFLOPs (during 1.191314 sec)
 Bench (wrong bank) :: 5.774188 TFLOPs (during 5.949122 sec)
--
GPU 2 : alloc 85.5% - 109.44 GiB of the GPU memory in 7004 tensors
 Allocation tooks 13 seconds and comes mainly from the right memory bank
 Bench (any bank) :: 54.405166 TFLOPs (during 0.631399 sec)
 Bench (right bank) :: 54.367535 TFLOPs (during 0.631836 sec)
--
GPU 3 : alloc 76.95% - 98.5 GiB of the GPU memory in 6303 tensors
 Allocation tooks 12 seconds and comes mainly from the right memory bank
 Bench (any bank) :: 54.410138 TFLOPs (during 0.631341 sec)
 Bench (right bank) :: 54.393623 TFLOPs (during 0.631533 sec)
--
GPU 0 : alloc 69.25% - 88.65 GiB of the GPU memory in 5673 tensors
 Allocation tooks 6 seconds and comes mainly from the right memory bank
 Bench (any bank) :: 54.396970 TFLOPs (during 0.631494 sec)
 Bench (right bank) :: 54.373855 TFLOPs (during 0.631762 sec)


-------------------------------
Test 2: This pass will execute the following steps:
1. Fill the pagecache with a dummy 100 GiB file to demonstrate the effect of pagecache on memory allocation.
2. Retry the allocation and performance evaluation after partially filling the pagecache.
107374182400 bytes (107 GB, 100 GiB) copied, 102 s, 1.1 GB/s
100+0 records in
100+0 records out
107374182400 bytes (107 GB, 100 GiB) copied, 101.898 s, 1.1 GB/s
Successfully created 100 GiB file filled with zeros: ./zero_file.bin
Successfully synchronized file system buffers.
--
Free memory layout at startup :
Numa node 0 free memory : 111 GiB, (of which pagecache memory : 101 GiB)
Numa node 1 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 2 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
Numa node 3 free memory : 124 GiB, (of which pagecache memory : 0 GiB)
--
GPU 1 : alloc 95.0% - 121.6 GiB of the GPU memory in 7782 tensors
 Allocation tooks 13 seconds and comes at 93% (9.93 GiB still free of which 0.45 GiB of pagecache) from the right numa and from :
  NUMA node 2: 7.55 GiB
 Bench (any bank) :: 24.073996 TFLOPs (during 1.426907 sec)
 Bench (right bank) :: 28.760786 TFLOPs (during 1.194381 sec)
 Bench (wrong bank) :: 5.542506 TFLOPs (during 6.197801 sec)
--
GPU 2 : alloc 85.5% - 109.44 GiB of the GPU memory in 7004 tensors
 Allocation tooks 13 seconds and comes mainly from the right memory bank
 Bench (any bank) :: 54.412645 TFLOPs (during 0.631312 sec)
 Bench (right bank) :: 54.356111 TFLOPs (during 0.631968 sec)
--
GPU 3 : alloc 76.95% - 98.5 GiB of the GPU memory in 6303 tensors
 Allocation tooks 11 seconds and comes mainly from the right memory bank
 Bench (any bank) :: 54.441244 TFLOPs (during 0.630980 sec)
 Bench (right bank) :: 54.393767 TFLOPs (during 0.631531 sec)
--
GPU 0 : alloc 69.25% - 88.65 GiB of the GPU memory in 5673 tensors
 Allocation tooks 67 seconds and comes at 75% (49.43 GiB still free of which 0.41 GiB of pagecache) from the right numa and from :
  NUMA node 3: 19.7 GiB
 Bench (any bank) :: 22.395341 TFLOPs (during 1.533861 sec)
 Bench (right bank) :: 18.161142 TFLOPs (during 1.891475 sec)
 Bench (wrong bank) :: 54.232145 TFLOPs (during 0.633413 sec)


Successfully removed: ./zero_file.bin
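
The figures in the output above can be cross-checked with a little arithmetic. The sketch below assumes a device capacity of 128 GiB (inferred from 121.6 GiB being 95% of the GPU memory; the reproducer does not print it) and the default `matrix_size` of 2048 and `perf_nb_loop` of 2000. The helper names are hypothetical.

```python
GIB = 1024 ** 3

def blocks_for(total_memory, alloc_ratio, matrix_size=2048, dtype_bytes=4):
    """Number of float32 matrix_size x matrix_size tensors that fit in the requested share."""
    return int((total_memory * alloc_ratio) // (matrix_size * matrix_size * dtype_bytes))

def tflops(n_mm, elapsed_sec, mnp=2048):
    """Same FLOP count as the reproducer: n_mm products of two mnp x mnp matrices."""
    flops = n_mm * (mnp * mnp * (2 * mnp - 1))
    return flops / (elapsed_sec * 1e12)

total_memory = 128 * GIB  # assumption inferred from the log, not reported directly

print(blocks_for(total_memory, 0.95))         # tensor count for the first GPU
print(blocks_for(total_memory, 0.95 * 0.90))  # tensor count for the second GPU
print(tflops(2000, 0.631399))                 # timing of a run served from the right bank
print(tflops(2000, 5.949122))                 # timing of the "wrong bank" run in Test 1
```

Under these assumptions the block counts come out to 7782 and 7004, matching the log, and the two timings correspond to roughly 54.4 and 5.77 TFLOPs, a slowdown of about 9.4x for tensors served from the wrong bank.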

reproducer.py :

import os
import torch
import subprocess
import time
import random
import argparse

def main(args):

    # Get the number of available GPUs
    num_gpus = torch.cuda.device_count()
    print(f"Number of available GPUs: {num_gpus}")

    # Execute 2 passes of:
    #  --> Show the memory usage on each NUMA node
    #  --> Allocate tensors on GPU 1 to fill its memory
    #  --> Check the real location of the allocated memory (which NUMA node's memory usage increased?)
    #  --> Evaluate TFLOPs by selecting 3 random tensors (A, B, C) on the GPU and computing C += A . B (dot product)
    #  ... same thing on each GPU (without releasing the memory of the previous GPUs)
    #
    # The second pass differs only in that we start by partially filling the pagecache
    # with evictable pages to show the effect of the pagecache on memory allocation
    for l in range(2) :
        alloc_ratio = args.first_gpu_alloc_ratio
        torch.cuda.empty_cache()
        print("-------------------------------")
        if l == 0 :
            print(f"Test 1: This pass will execute the following steps:")
            print("1. Show the current memory usage on each NUMA nodes.")
            print(f"2. Allocate tensors on GPU {args.first_gpu_index % num_gpus} to fill its memory capacity.")
            print("3. Check the actual location of the allocated memory by checking in which NUMA node's memory usage has increased.")
            print("4. Evaluate TFLOPs by selecting 3 random tensors (A, B, C) on the GPU and compute C += A . B (dot product).")
            print("5. Repeat the same process on each GPU without releasing the memory from the previous GPU.")
        else :
            if not args.do_pagecache_test :
                break
            print(f"Test 2: This pass will execute the following steps:")
            print(f"1. Fill the pagecache with a dummy {args.pagecache_fill_size} GiB file to demonstrate the effect of pagecache on memory allocation.")
            print("2. Retry the allocation and performance evaluation after partially filling the pagecache.")
            create_zero_file('./zero_file.bin', args.pagecache_fill_size)
        print("--")

        print("Free memory layout at startup :")
        memfree_start, pagecache_start = get_memfree()
        for node_id, memfree_kb in memfree_start.items() :
            # meminfo values are in kB, so dividing by 1024 twice yields GiB
            print(f"Numa node {node_id} free memory : {round(memfree_kb/1024/1024)} GiB, (of which pagecache memory : {round(pagecache_start[node_id]/1024/1024)} GiB)")
        tensors = [None] * num_gpus


        for i in range(num_gpus) :
            print("--")
            device_idx = (i + args.first_gpu_index) % num_gpus
            total_memory = torch.cuda.get_device_properties(device_idx).total_memory

            # Get initial numa node memory info
            memfree_before, pagecache_before = get_memfree()
            
            # Allocate a bunch of tensors (nb_blocks) that fill alloc_ratio of the GPU
            nb_blocks = round((total_memory * alloc_ratio) // (args.matrix_size * args.matrix_size * 4))

            print(f"GPU {device_idx} : alloc {round(alloc_ratio * 100, 2)}% - {round(total_memory*alloc_ratio / 1024 ** 3, 2)} GiB of the GPU memory in {nb_blocks} tensors")
            start_time = time.time()
            tensors[device_idx] = [torch.zeros(args.matrix_size, args.matrix_size, dtype=torch.float32, device=f"cuda:{device_idx}") for j in range(nb_blocks)]
            torch.cuda.synchronize(device_idx)
            duration = time.time() - start_time
            
            # Get numa node memory info after allocation
            memfree_after, pagecache_after = get_memfree()

            # Calculate the difference in memory 
            mem_diff = {}
            total_diff = 0
            secondary_pools = ""
            for node in memfree_before.keys():
                diff = memfree_after[node] - memfree_before[node]
                mem_diff[node] = diff
                total_diff += diff
                if node != device_idx and diff < - 1024 * 1024:
                    # Print additional nodes where the allocation difference is greater than 1024 MiB -- which is enough to skip the python memory usage increase
                    secondary_pools += f"\n  NUMA node {node}: {-round(diff/1024/1024, 2)} GiB"

            ratio_in_right_numa = mem_diff[device_idx] / total_diff

            if len(secondary_pools) == 0 :
                print(f" Allocation tooks {round(duration)} seconds and comes mainly from the right memory bank")
            else :
                print(f" Allocation tooks {round(duration)} seconds and comes at {round(ratio_in_right_numa*100)}% ({round(memfree_after[device_idx] / 1024 / 1024, 2)} GiB still free of which {round(pagecache_after[device_idx]/1024/1024, 2)} GiB of pagecache) from the right numa and from :{secondary_pools}")

            # Perform matrix multiplication on each GPU
            if args.do_perf_test :
                execute_perf_test(args, tensors, device_idx, nb_blocks, ratio_in_right_numa)
            
            # reduce the % of the memory to show different behaviours
            alloc_ratio *= args.next_gpus_relative_alloc_ratio
        tensors = None
        print()
        print()

    remove_file('./zero_file.bin')

# Get the free and pagecache memory for each numa node from /sys/devices/system/node/node*/meminfo.
def get_memfree():
    memfree = {}
    page_cache = {}
    try:
        output = subprocess.check_output("grep -hi 'MemFree' /sys/devices/system/node/node*/meminfo", shell=True)
        for line in output.decode().strip().split('\n'):
            # assume format is : Node 0 MemFree:        28046496 kB 
            parts = line.split()
            node_id = int(parts[1])  
            memfree[node_id] = int(parts[3])

        output = subprocess.check_output("grep -hi 'FilePages' /sys/devices/system/node/node*/meminfo", shell=True)
        for line in output.decode().strip().split('\n'):
            # assume format is : Node 0 FilePages:      28046496 kB
            parts = line.split()
            node_id = int(parts[1])  
            page_cache[node_id] = int(parts[3])
            memfree[node_id] += page_cache[node_id]

    except Exception as e:
        print(f"Error reading memory info: {e}")
    return memfree, page_cache

def pick_random_square_matrix(tensors, mnp, n_blocks) : 
    if n_blocks < 0 :
        idx = random.randint(n_blocks, -1) # pick a random index for tensor
    else :
        idx = random.randint(0, n_blocks - 1) # pick a random index for tensor
    return tensors[idx].view(mnp, mnp) # use the tensor as a square matrix


def random_mm_sum(tensors, mnp, n_blocks, n_mm) :
    for l in range(n_mm):
        A = pick_random_square_matrix(tensors, mnp, n_blocks)
        B = pick_random_square_matrix(tensors, mnp, n_blocks)
        C = pick_random_square_matrix(tensors, mnp, n_blocks)
        C += torch.mm(A,B)
    flops = n_mm * (mnp * mnp * (2*mnp - 1)) 
    return C, flops

def execute_perf_test(args, tensors, device_idx, nb_blocks, ratio_in_right_numa) :
    results = []
    def bench(test_name, n_blocks, n_mm, from_end = False) :
        torch.cuda.synchronize(device=f'cuda:{device_idx}')
        start_time = time.time()
        C, total_flops = random_mm_sum(tensors[device_idx], args.matrix_size, n_blocks, n_mm)
        torch.cuda.synchronize(device=f'cuda:{device_idx}')
        end_time = time.time()

        elapsed_time = end_time - start_time  # Time in seconds
        tflops = total_flops / (elapsed_time * 1e12)  # Convert to TFLOPs
        if test_name is not None :
            print(f"{test_name}: {tflops:.6f} TFLOPs (during {elapsed_time:.6f} sec)")
        return C

    # warmup bench (discarded)
    results.append(bench(None, nb_blocks, args.perf_nb_warmup_loop))

    # any tensor is candidate
    results.append(bench(" Bench (any bank) :", nb_blocks, args.perf_nb_loop))
    
    # assume that the allocation started to use the right memory bank and then fallback to the wrong one
    nb_block_in_right_numa = round(ratio_in_right_numa * nb_blocks) - 1
    if nb_block_in_right_numa > 100 :
        results.append(bench(" Bench (right bank) :", nb_block_in_right_numa, args.perf_nb_loop))

    nb_block_in_wrong_numa = round((1 - ratio_in_right_numa) * nb_blocks) - 1
    if nb_block_in_wrong_numa > 100 :
        results.append(bench(" Bench (wrong bank) :", -nb_block_in_wrong_numa, args.perf_nb_loop))


# Write a file full of zeros that will fill the pagecache.
# You can replace it with any non-O_DIRECT file read (for instance cat filename > /dev/null).
# We sync the file at the end so that its pages are clean and can easily be evicted from the pagecache.
def create_zero_file(filename, size_gb):
    # Execute the dd command
    process = subprocess.run(f'dd if=/dev/zero of={filename} bs=1G count={size_gb} status=progress', shell=True)

    # Check if the command was successful
    if process.returncode == 0:
        print(f"Successfully created {size_gb} GiB file filled with zeros: {filename}")
    else:
        print(f"Error occurred while creating the file: {process.returncode}")
    
    # Ensure there are no dirty pages
    sync_process = subprocess.run('sync', shell=True)
    if sync_process.returncode == 0:
        print("Successfully synchronized file system buffers.")
    else:
        print(f"Error occurred while synchronizing: {sync_process.returncode}")


# At the end we want to empty the pagecache to restore the original state
# it is done by destroying the file created by create_zero_file
def remove_file(filename):
    try:
        os.remove(filename)
        print(f"Successfully removed: {filename}")
    except FileNotFoundError:
        print(f"File not found: {filename}")
    except PermissionError:
        print(f"Permission denied: {filename}")
    except Exception as e:
        print(f"Error occurred while removing the file: {e}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="GPU Memory Allocation Parameters")

    parser.add_argument('--first_gpu_alloc_ratio', type=float, required=True,
                        help='Fraction (0 to 1) of the GPU capacity to allocate on the first GPU')
    parser.add_argument('--next_gpus_relative_alloc_ratio', type=float, required=True,
                        help='Each GPU allocates memory of the previous one multiplied by this ratio')
    parser.add_argument('--first_gpu_index', type=int, default=1,
                        help='Index of the first GPU to allocate on (lets the test start on a GPU other than 0)')
    parser.add_argument('--matrix_size', type=int, default=2048,
                        help='Size of the tensors to allocate (matrix_size x matrix_size)')

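    # type=bool would treat any non-empty string (including "False") as True, so parse the value explicitly
    parser.add_argument('--do_perf_test', type=lambda s: s.lower() in ('1', 'true', 'yes'), default=True,
                        help='Flag to enable performance test (true/false)')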
    parser.add_argument('--perf_nb_loop', type=int, default=2000,
                        help='Number of dot products with 2 matrices of shape (matrix_size, matrix_size)')
    parser.add_argument('--perf_nb_warmup_loop', type=int, default=round(2000 * 0.2),
                        help='Number of warmup loops for performance test')

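    # type=bool would treat any non-empty string (including "False") as True, so parse the value explicitly
    parser.add_argument('--do_pagecache_test', type=lambda s: s.lower() in ('1', 'true', 'yes'), default=True,
                        help='Flag to enable pagecache test (true/false)')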
    parser.add_argument('--pagecache_fill_size', type=int, default=100,
                        help='Size in GiB to fill in the pagecache for the test')

    args = parser.parse_args()
    main(args)
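
The "right bank" and "wrong bank" benchmarks depend on how tensor indices are split. The sketch below restates that selection logic in isolation. It relies on the reproducer's own assumption that allocation fills the local bank first and only then spills to a remote one, which may not hold on every run. `split_blocks` is a hypothetical helper, and the ratio in the example is illustrative.

```python
def split_blocks(ratio_in_right_numa, nb_blocks, min_blocks=100):
    """Return (right, wrong) index ranges; a side is None when it has too few tensors.

    Mirrors the reproducer: the first tensors are assumed to sit in the local
    bank and the last ones in a remote bank, with one block subtracted from
    each side's count, as the reproducer does.
    """
    nb_right = round(ratio_in_right_numa * nb_blocks) - 1
    nb_wrong = round((1 - ratio_in_right_numa) * nb_blocks) - 1
    right = range(0, nb_right) if nb_right > min_blocks else None
    wrong = range(nb_blocks - nb_wrong, nb_blocks) if nb_wrong > min_blocks else None
    return right, wrong

# Illustrative values: 93% of 7782 tensors assumed to be in the local bank
right, wrong = split_blocks(0.93, 7782)
print(right, wrong)
```

When the whole allocation is local (ratio 1.0), the wrong-bank range is `None`, which is why those runs print no "wrong bank" line.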

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.7.0 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.15
Runtime Ext Version:     1.7
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
XNACK enabled:           YES
DMAbuf Support:          YES
VMM Support:             YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Instinct MI300A Accelerator    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Instinct MI300A Accelerator    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            48                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    130847776(0x7cc9420) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    130847776(0x7cc9420) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    130847776(0x7cc9420) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    130847776(0x7cc9420) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    AMD Instinct MI300A Accelerator    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Instinct MI300A Accelerator    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            48                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131809200(0x7db3fb0) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    131809200(0x7db3fb0) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131809200(0x7db3fb0) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131809200(0x7db3fb0) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 3                  
*******                  
  Name:                    AMD Instinct MI300A Accelerator    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Instinct MI300A Accelerator    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        2                                  
  Compute Unit:            48                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131809208(0x7db3fb8) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    131809208(0x7db3fb8) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131809208(0x7db3fb8) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131809208(0x7db3fb8) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 4                  
*******                  
  Name:                    AMD Instinct MI300A Accelerator    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Instinct MI300A Accelerator    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    3                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        3                                  
  Compute Unit:            48                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131795632(0x7db0ab0) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    131795632(0x7db0ab0) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131795632(0x7db0ab0) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 4                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131795632(0x7db0ab0) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 5                  
*******                  
  Name:                    gfx942                             
  Uuid:                    GPU-6cdb410c2f219ac0               
  Marketing Name:          AMD Instinct MI300A                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    4                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      24576(0x6000) KB                   
    L3:                      262144(0x40000) KB                 
  Chip ID:                 29856(0x74a0)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2100                               
  BDFID:                   512                                
  Internal Node ID:        4                                  
  Compute Unit:            228                                
  SIMDs per CU:            4                                  
  Shader Engines:          24                                 
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    TRUE                               
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    2048(0x800)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 138                                
  SDMA engine uCode::      19                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 4                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
    ISA 2                    
      Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack+
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 6                  
*******                  
  Name:                    gfx942                             
  Uuid:                    GPU-a5b43667c73cac87               
  Marketing Name:          AMD Instinct MI300A                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    5                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      24576(0x6000) KB                   
    L3:                      262144(0x40000) KB                 
  Chip ID:                 29856(0x74a0)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2100                               
  BDFID:                   512                                
  Internal Node ID:        5                                  
  Compute Unit:            228                                
  SIMDs per CU:            4                                  
  Shader Engines:          24                                 
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    TRUE                               
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    2048(0x800)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 138                                
  SDMA engine uCode::      19                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 4                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
    ISA 2                    
      Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack+
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 7                  
*******                  
  Name:                    gfx942                             
  Uuid:                    GPU-655f5d89689c7f67               
  Marketing Name:          AMD Instinct MI300A                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    6                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      24576(0x6000) KB                   
    L3:                      262144(0x40000) KB                 
  Chip ID:                 29856(0x74a0)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2100                               
  BDFID:                   512                                
  Internal Node ID:        6                                  
  Compute Unit:            228                                
  SIMDs per CU:            4                                  
  Shader Engines:          24                                 
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    TRUE                               
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    2048(0x800)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 138                                
  SDMA engine uCode::      19                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 4                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
    ISA 2                    
      Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack+
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 8                  
*******                  
  Name:                    gfx942                             
  Uuid:                    GPU-5f8fb5cf1f014be1               
  Marketing Name:          AMD Instinct MI300A                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    7                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      24576(0x6000) KB                   
    L3:                      262144(0x40000) KB                 
  Chip ID:                 29856(0x74a0)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2100                               
  BDFID:                   512                                
  Internal Node ID:        7                                  
  Compute Unit:            228                                
  SIMDs per CU:            4                                  
  Shader Engines:          24                                 
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    TRUE                               
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    2048(0x800)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 138                                
  SDMA engine uCode::      19                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    134217728(0x8000000) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 4                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack+
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
    ISA 2                    
      Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack+
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done *** 
### Additional Information

No specific message about the memory location appears in dmesg, nor in the output produced with AMD_LOG_LEVEL set.
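
Since neither log reports where the pages end up, placement can be checked by reading the per-node counters under /sys/devices/system/node/node*/meminfo before and after an allocation. The snippet below is a minimal sketch of that check, not part of reproducer.py: the helper names `parse_node_meminfo` and `free_gib_per_node` are hypothetical, and it assumes the usual kernel line format `Node <id> <Field>: <value> kB`.

```python
import glob
import re

# Matches lines such as "Node 0 MemFree:       126877696 kB".
# Lines without a kB unit (e.g. HugePages_Total) are deliberately skipped.
_LINE = re.compile(r"Node\s+(\d+)\s+([\w()]+):\s+(\d+)\s*kB")


def parse_node_meminfo(text):
    """Return {field: value_in_kB} parsed from one node's meminfo text."""
    fields = {}
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if m:
            fields[m.group(2)] = int(m.group(3))
    return fields


def free_gib_per_node(pattern="/sys/devices/system/node/node*/meminfo"):
    """Return {node_id: free memory in GiB} for each NUMA node found."""
    result = {}
    for path in glob.glob(pattern):
        node_id = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as f:
            info = parse_node_meminfo(f.read())
        # MemFree is reported in kB; 1024**2 kB make one GiB.
        result[node_id] = info.get("MemFree", 0) / (1024 ** 2)
    return result


if __name__ == "__main__":
    for node, gib in sorted(free_gib_per_node().items()):
        print(f"NUMA node {node}: {gib:.1f} GiB free")
```

Calling `free_gib_per_node()` once before and once after the allocations on a given GPU, then comparing the two dictionaries, shows which node's free memory actually dropped.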
