
Conversation

@edwintorok
Contributor

See individual commits.

Draft PR, because this is still being tested together with the Xen-side changes to make the allocator more reliable.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Do not mix using claims with not using claims.
Xen cannot currently guarantee that it will honour a VM's memory claim
unless all other VMs also use claims.

Global claims have existed in Xen for a long time,
so this should be safe to do on both XS8 and XS9.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
On XS8 we always raise an exception when attempting to claim from a single
node.
We wanted to use soft affinity only when the single-node claim succeeded (which
is the correct fix on XS9, where this API is available).
However, this meant that we had effectively disabled NUMA support on XS8,
with no way to turn it back on.

Always use soft affinity when the single-node claim API is unavailable; this
should keep NUMA working on XS8.

On XS9 Xen itself would never raise ENOSYS (it has `err = errno = 0` on
ENOSYS).

Fixes: fb66dfc ("CA-421847: set vcpu affinity if node claim succeeded")

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok
Contributor Author

We've found some bugs: the claim is made too late, and Xen has already allocated some memory (vCPUs, shadow allocation, ...).
This means that when we make the global claim we should not include the entire footprint of the VM, because that will fail even if the host has enough memory (see the sketch below).

I'll try to:

  • move the claim earlier
  • claim only the VM's actual memory for the global claim, not the extra estimates on top, since a failed global claim is a hard failure
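
A minimal sketch of the failure mode, with entirely hypothetical numbers (in MiB); the real values come from XAPI's memory accounting:

HOST_FREE=10240                 # free memory when the VM build starts
VM_RAM=10040; OVERHEAD=200      # estimated footprint: 10240 MiB in total
ALREADY_ALLOCATED=200           # vCPU/shadow memory Xen has already taken
echo $(( VM_RAM + OVERHEAD ))               # 10240 MiB requested by the claim
echo $(( HOST_FREE - ALREADY_ALLOCATED ))   # 10040 MiB actually still free, so the claim fails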

@edwintorok edwintorok force-pushed the private/edvint/hardclaim branch from 6d07929 to ad41d56 on December 18, 2025 at 16:46
Xen may have already allocated some memory for the domain, and the overhead is
only an estimate.
A global claim failing is a hard failure, so instead use a more conservative
estimate: `memory.build_start_mib`.
This is similar to `required_host_free_mib`, but doesn't take overhead into
account.
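
In other words (hypothetical values in MiB, only to show the relationship between the two estimates; the names below are illustrative, not the xenopsd fields themselves):

OVERHEAD_ESTIMATE=64                              # estimated per-VM overhead
BUILD_START_MIB=10240                             # guest memory the build starts with
echo $(( BUILD_START_MIB + OVERHEAD_ESTIMATE ))   # ~required_host_free_mib: the full footprint estimate
echo $(( BUILD_START_MIB ))                       # the more conservative amount to claim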

Eventually we'd want another argument to the create hypercall that tells it
which NUMA node(s) to use; then we could include all of the overhead there as
well.

For the single-node claim, keep the amount as it was; it is only a best-effort
claim.

Fixes: 060d792 ("CA-422188: either always use claims or never use claims")

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok edwintorok force-pushed the private/edvint/hardclaim branch from ad41d56 to 95367e1 on December 18, 2025 at 16:50
@edwintorok
Contributor Author

edwintorok commented Dec 18, 2025

Test script:

#!/bin/sh
set -eu
DIV=3

. /etc/xensource-inventory

ID=$(date +%s)
HOST_FREE=$(xe host-param-get uuid="${INSTALLATION_UUID}" param-name=memory-free-computed)
VM_MEM=$(( "${HOST_FREE}" / ${DIV} ))
UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-0")
xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
MEM_OVERHEAD=$(xe vm-param-get "uuid=${UUID}" param-name=memory-overhead)
VM_MEM=$(( "${VM_MEM}" - "${MEM_OVERHEAD}" ))
for i in $(seq 1 "${DIV}"); do
        UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-${i}")
        xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
        xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
done

while true; do
        echo "Start seq ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true
        done
        wait

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait

        echo "Start all ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true &
        done
        wait

        echo "Reboot 1"
        xe vm-reboot name-label="test-${ID}-1"  --force

        echo "Reboot ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-reboot name-label="test-${ID}-${i}"  --force &
        done
        wait

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait
done

@edwintorok
Contributor Author

This still doesn't work: we're now getting hard failures on a parallel restart of 3 VMs.
With NUMA and claims disabled, everything works fine.

When rebooting lots of VMs in parallel we might run out of memory
and fail to boot all the VMs again.
This is because we overestimate the amount of memory required and claim too
much. That memory is released when the domain build finishes, but when building
domains in parallel the over-claim temporarily causes an out-of-memory error.

Instead try to claim only what is left to be allocated: the p2m map and shadow
map have already been allocated by this point.

Fixes: 95367e1 ("CA-422187: safer defaults for global claims")

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok
Contributor Author

We still exhaust all the memory on the system when claims are used, but everything worked without claims. So we probably still claim too much (and Xen now correctly refuses to let one domain use memory claimed by another).
Some debugging code is being added to Xen to print out just how much unused memory is released from the claim when the domain build finishes.

When a domain build finishes Xen releases any extra unused memory from the
claim.
In my tests that is ~544 pages, which is about the amount that got added here,
so we're double counting something.

Remove the hack, so we allocate just the bare minimum.
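
For scale, assuming 4 KiB pages, the ~544 double-counted pages come to roughly 2 MiB:

echo $(( 544 * 4 ))   # 2176 KiB, i.e. about 2.1 MiB released back at the end of the build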

Fixes: 02c6ed1 ("CA-422187: do not claim shadow_mib, it has already been allocated")

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok
Contributor Author

edwintorok commented Dec 19, 2025

Updated script; it is very good at finding bugs when run on a host with 1 TiB of RAM.
Everything is fine with NUMA off; with NUMA on, there are all sorts of out-of-memory errors from xenguest/qemu:

#!/bin/sh
set -eu
DIV=3

. /etc/xensource-inventory

ID=$(date +%s)
HOST_FREE=$(xe host-param-get uuid="${INSTALLATION_UUID}" param-name=memory-free-computed)
VM_MEM=$(( "${HOST_FREE}" / ${DIV} ))
VM_MEM=$(( 2 * 1024 * 1024 * ($VM_MEM / (2 * 1024 * 1024)) ))  # round down to a 2 MiB multiple

UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-0")
xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
MEM_OVERHEAD=$(xe vm-param-get "uuid=${UUID}" param-name=memory-overhead)
VM_MEM=$(( "${VM_MEM}" - "${MEM_OVERHEAD}" ))
VM_MEM=$(( 2 * 1024 * 1024 * ($VM_MEM / (2 * 1024 * 1024)) ))  # round down again after subtracting the overhead
for i in $(seq 1 "${DIV}"); do
        UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-${i}")
        xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
        xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
done

while true; do
        echo "Start seq ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true
        done
        wait

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait


        echo "Start all ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true &
        done
        wait

        echo "Reboot 1"
        xe vm-reboot name-label="test-${ID}-1"  --force

        echo "Reboot ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-reboot name-label="test-${ID}-${i}"  --force &
        done
        wait

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait
done

We believe there is some memory that is released asynchronously after a domain is destroyed, although even adding code in Xen to fully wait for that is not enough to free up all memory.

Note how avail_pages is lower than what we started with, even after all 3 domains have been fully destroyed:

(XEN) [  759.978882] d20: avail_pages 262827592 claimed_pages 0 request 87139296 domain_pages 0 req_node 1
(XEN) [  769.486970] d20: depleted claim
(XEN) [  769.487173] d20: released claim node 255 pages 0
(XEN) [  771.561975] d21: avail_pages 175004680 claimed_pages 0 request 87139296 domain_pages 0 req_node 0
(XEN) [  780.754931] d21: depleted claim
(XEN) [  780.755102] d21: released claim node 255 pages 0
(XEN) [  783.632261] d22: avail_pages 87181765 claimed_pages 0 request 87139296 domain_pages 0 req_node 255
(XEN) [  793.952242] ../common/memory.c:279:d0v8 Could not allocate order=18 extent: id=22 memflags=0xc0 (0 of 1)
(XEN) [  793.981446] ../common/memory.c:279:d0v8 Could not allocate order=18 extent: id=22 memflags=0xc0 (0 of 1)
(XEN) [  794.028533] d22: depleted claim
(XEN) [  794.028727] d22: released claim node 255 pages 0
(XEN) [  794.803838] d22: released claim node 255 pages 0
(XEN) [  794.821349] d21: released claim node 255 pages 0
(XEN) [  794.821604] d20: released claim node 255 pages 0
(XEN) [  798.033391] ../arch/x86/mm/paging.c:687:d0v0 Tried to do a paging op on dying d20
(XEN) [  798.040826] ../arch/x86/mm/paging.c:687:d0v0 Tried to do a paging op on dying d21
(XEN) [  798.041407] ../arch/x86/mm/paging.c:687:d0v0 Tried to do a paging op on dying d22
(XEN) [  808.045542] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  808.053436] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  808.060035] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  818.065691] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  818.069649] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  818.089679] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  828.093839] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  828.097795] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  828.101887] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  838.117973] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  838.134610] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  838.135777] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  848.142115] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  848.151931] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  848.158714] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  851.887321] d20: fully destroyed
(XEN) [  856.555583] d22: fully destroyed
(XEN) [  857.552832] d21: fully destroyed
(XEN) [  867.602388] ../arch/x86/mm/paging.c:693:d0v2 Paging op on a domain (24) with no vcpus
(XEN) [  869.801491] d23: avail_pages 262161085 claimed_pages 0 request 87139296 domain_pages 0 req_node 1
(XEN) [  872.305139] d24: avail_pages 244769513 claimed_pages 70427648 request 87139296 domain_pages 0 req_node 0
(XEN) [  872.378717] d25: avail_pages 244060138 claimed_pages 156876800 request 87139296 domain_pages 0 req_node 1
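
For scale (assuming 4 KiB pages), the numbers above work out roughly as follows:

echo $(( 262827592 * 4 / 1024 / 1024 ))         # initial avail_pages: ~1002 GiB, the 1 TiB host
echo $(( 87139296 * 4 / 1024 / 1024 ))          # per-domain request: ~332 GiB, about a third of the host
echo $(( (262827592 - 262161085) * 4 / 1024 ))  # shortfall once all 3 domains are destroyed: ~2603 MiB (~2.5 GiB)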

Waiting for scrubbing doesn't help either, because Xen has `scrub_pages = 0` hardcoded, so `wait_xen_free_mem` doesn't actually do anything. But that shouldn't matter, because Xen should wait for scrubbing to finish when populating the guest's physical memory.

@edwintorok
Contributor Author

Inserting a 60s sleep into the test script after all VMs have been shut down increases the number of available pages, although still not by quite enough:

(XEN) [ 1707.317515] d30: avail_pages 262178370 claimed_pages 0 request 87139296 domain_pages 0 req_node 0
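
The tweak amounts to adding a delay after the force-shutdown loop in the script above, e.g.:

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait
        sleep 60   # give Xen time to finish releasing the destroyed domains' memory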

We noticed that xenguest releases 32 unused pages from the domain's claim.
These are from the low 1 MiB video range, so avoid requesting that range.
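
As a cross-check (assuming 4 KiB pages), 32 pages matches the legacy VGA window below 1 MiB:

echo $(( 32 * 4 ))   # 128 KiB, i.e. the 0xA0000-0xBFFFF video range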

Also, always print free-memory statistics when `wait_xen_free_mem` is called.
It turns out `scrub_pages` is always 0, since this was never implemented in Xen
(it is hardcoded to 0).

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Do not let domains fully use up all available memory on the host; we have too
many unexplained bugs in this area.

As a workaround, try to reserve some amount (e.g. 256 MiB) that domains cannot
normally use from XAPI's point of view.
Then during parallel domain construction this emergency reserve can be used by
Xen.
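
A rough sketch of the accounting only, not the actual XAPI change (variable names are illustrative, and the script environment from earlier in this thread is assumed, i.e. /etc/xensource-inventory has been sourced):

RESERVE=$(( 256 * 1024 * 1024 ))   # emergency reserve, in bytes
HOST_FREE=$(xe host-param-get uuid="${INSTALLATION_UUID}" param-name=memory-free-computed)
echo $(( ${HOST_FREE} - RESERVE )) # what domains may use normally; during parallel builds Xen can dip into the reserve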

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok edwintorok force-pushed the private/edvint/hardclaim branch from 49d2d37 to 577e3a6 on December 19, 2025 at 17:42