
Conversation

@edwintorok
Contributor

See individual commits.

Draft PR, because this is still being tested together with the Xen-side changes to make the allocator more reliable.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Do not mix using claims with not using claims.
Xen cannot currently guarantee that it will honour a VM's memory claim
unless all other VMs also use claims.

Global claims have existed in Xen for a long time,
so this should be safe to do on both XS8 and XS9.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
On XS8 we always raise an exception when attempting to claim from a single
node.
We wanted to use soft affinity only when the single-node claim succeeded (which
is the correct fix on XS9, where this API is available).
However, this meant that we had effectively disabled NUMA support on XS8,
with no way to turn it back on.

Always use soft affinity when the single-node claim API is unavailable; this
should keep NUMA working on XS8.

On XS9 Xen itself would never raise ENOSYS (it has `err = errno = 0` on
ENOSYS).

Fixes: fb66dfc ("CA-421847: set vcpu affinity if node claim succeeded")

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok
Contributor Author

We've found some bugs: the claim is made too late, and Xen has already allocated some memory (vCPUs, shadow allocation, ...).
This means that when we make the global claim we should not include the entire footprint of the VM, because that will fail even if the host has enough memory (see the sketch below).

I'll try to:

  • move the claim earlier
  • claim only the VM's actual memory for the global claim, not the extra estimates on top, since a failed global claim is a hard failure
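
A minimal sketch of the failure mode, with entirely hypothetical numbers (in MiB); the real values come from XAPI's memory accounting:

HOST_FREE=10240                 # free memory when the VM build starts
VM_RAM=10040; OVERHEAD=200      # estimated footprint: 10240 MiB in total
ALREADY_ALLOCATED=200           # vCPU/shadow memory Xen has already taken
echo $(( VM_RAM + OVERHEAD ))               # 10240 MiB requested by the claim
echo $(( HOST_FREE - ALREADY_ALLOCATED ))   # 10040 MiB actually still free, so the claim fails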

@edwintorok edwintorok force-pushed the private/edvint/hardclaim branch from 6d07929 to ad41d56 on December 18, 2025 at 16:46
Xen may have already allocated some memory for the domain, and the overhead is
only an estimate.
A global claim failing is a hard failure, so instead use a more conservative
estimate: `memory.build_start_mib`.
This is similar to `required_host_free_mib`, but doesn't take overhead into
account.
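
In other words (hypothetical values in MiB, only to show the relationship between the two estimates; the names below are illustrative, not the xenopsd fields themselves):

OVERHEAD_ESTIMATE=64                              # estimated per-VM overhead
BUILD_START_MIB=10240                             # guest memory the build starts with
echo $(( BUILD_START_MIB + OVERHEAD_ESTIMATE ))   # ~required_host_free_mib: the full footprint estimate
echo $(( BUILD_START_MIB ))                       # the more conservative amount to claim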

Eventually we'd want another argument to the create hypercall that tells it
which NUMA node(s) to use; then we could include all of the overhead there as
well.

For the single-node claim, keep the amount as it was; it is only a best-effort
claim.

Fixes: 060d792 ("CA-422188: either always use claims or never use claims")

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok edwintorok force-pushed the private/edvint/hardclaim branch from ad41d56 to 95367e1 on December 18, 2025 at 16:50
@edwintorok
Contributor Author

edwintorok commented Dec 18, 2025

Test script:

#!/bin/sh
set -eu
DIV=3

. /etc/xensource-inventory

ID=$(date +%s)
HOST_FREE=$(xe host-param-get uuid="${INSTALLATION_UUID}" param-name=memory-free-computed)
VM_MEM=$(( "${HOST_FREE}" / ${DIV} ))
UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-0")
xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
MEM_OVERHEAD=$(xe vm-param-get "uuid=${UUID}" param-name=memory-overhead)
VM_MEM=$(( "${VM_MEM}" - "${MEM_OVERHEAD}" ))
for i in $(seq 1 "${DIV}"); do
        UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-${i}")
        xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
        xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
done

while true; do
        echo "Start seq ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true
        done
        wait

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait

        echo "Start all ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true &
        done
        wait

        echo "Reboot 1"
        xe vm-reboot name-label="test-${ID}-1"  --force

        echo "Reboot ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-reboot name-label="test-${ID}-${i}"  --force &
        done
        wait

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait
done

@edwintorok
Contributor Author

This still doesn't work: we're now getting hard failures on a parallel restart of 3 VMs.
With NUMA and claims disabled, everything works fine.

When rebooting lots of VMs in parallel we might run out of memory
and fail to boot all the VMs again.
This is because we overestimate the amount of memory required and claim too
much. That memory is released when the domain build finishes, but when building
domains in parallel the over-claim temporarily causes an out-of-memory error.

Instead try to claim only what is left to be allocated: the p2m map and shadow
map have already been allocated by this point.

Fixes: 95367e1 ("CA-422187: safer defaults for global claims")

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok
Contributor Author

We still exhaust all the memory on the system when claims are used, but everything worked without claims. So we probably still claim too much (and Xen now correctly refuses to let one domain use memory claimed by another).
Some debugging code is being added to Xen to print out just how much unused memory is released from the claim when the domain build finishes.

When a domain build finishes Xen releases any extra unused memory from the
claim.
In my tests that is ~544 pages, which is about the amount that got added here,
so we're double counting something.

Remove the hack, so we allocate just the bare minimum.
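
For scale, assuming 4 KiB pages, the ~544 double-counted pages come to roughly 2 MiB:

echo $(( 544 * 4 ))   # 2176 KiB, i.e. about 2.1 MiB released back at the end of the build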

Fixes: 02c6ed1 ("CA-422187: do not claim shadow_mib, it has already been allocated")

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok
Contributor Author

edwintorok commented Dec 19, 2025

Updated script; it is very good at finding bugs when run on a host with 1 TiB of RAM.
Everything is fine with NUMA off; with NUMA on, there are all sorts of out-of-memory errors from xenguest/qemu:

#!/bin/sh
set -eu
DIV=3

. /etc/xensource-inventory

ID=$(date +%s)
HOST_FREE=$(xe host-param-get uuid="${INSTALLATION_UUID}" param-name=memory-free-computed)
VM_MEM=$(( "${HOST_FREE}" / ${DIV} ))
VM_MEM=$(( 2 * 1024 * 1024 * ($VM_MEM / (2 * 1024 * 1024)) ))  # round down to a 2 MiB multiple

UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-0")
xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
MEM_OVERHEAD=$(xe vm-param-get "uuid=${UUID}" param-name=memory-overhead)
VM_MEM=$(( "${VM_MEM}" - "${MEM_OVERHEAD}" ))
VM_MEM=$(( 2 * 1024 * 1024 * ($VM_MEM / (2 * 1024 * 1024)) ))  # round down again after subtracting the overhead
for i in $(seq 1 "${DIV}"); do
        UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-${i}")
        xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
        xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
done

while true; do
        echo "Start seq ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true
        done
        wait

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait


        echo "Start all ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true &
        done
        wait

        echo "Reboot 1"
        xe vm-reboot name-label="test-${ID}-1"  --force

        echo "Reboot ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-reboot name-label="test-${ID}-${i}"  --force &
        done
        wait

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait
done

We believe there is some memory that is released asynchronously after a domain is destroyed, although even adding code in Xen to fully wait for that is not enough to free up all memory.

Note how avail_pages is lower than what we started with, even after all 3 domains have been fully destroyed:

(XEN) [  759.978882] d20: avail_pages 262827592 claimed_pages 0 request 87139296 domain_pages 0 req_node 1
(XEN) [  769.486970] d20: depleted claim
(XEN) [  769.487173] d20: released claim node 255 pages 0
(XEN) [  771.561975] d21: avail_pages 175004680 claimed_pages 0 request 87139296 domain_pages 0 req_node 0
(XEN) [  780.754931] d21: depleted claim
(XEN) [  780.755102] d21: released claim node 255 pages 0
(XEN) [  783.632261] d22: avail_pages 87181765 claimed_pages 0 request 87139296 domain_pages 0 req_node 255
(XEN) [  793.952242] ../common/memory.c:279:d0v8 Could not allocate order=18 extent: id=22 memflags=0xc0 (0 of 1)
(XEN) [  793.981446] ../common/memory.c:279:d0v8 Could not allocate order=18 extent: id=22 memflags=0xc0 (0 of 1)
(XEN) [  794.028533] d22: depleted claim
(XEN) [  794.028727] d22: released claim node 255 pages 0
(XEN) [  794.803838] d22: released claim node 255 pages 0
(XEN) [  794.821349] d21: released claim node 255 pages 0
(XEN) [  794.821604] d20: released claim node 255 pages 0
(XEN) [  798.033391] ../arch/x86/mm/paging.c:687:d0v0 Tried to do a paging op on dying d20
(XEN) [  798.040826] ../arch/x86/mm/paging.c:687:d0v0 Tried to do a paging op on dying d21
(XEN) [  798.041407] ../arch/x86/mm/paging.c:687:d0v0 Tried to do a paging op on dying d22
(XEN) [  808.045542] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  808.053436] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  808.060035] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  818.065691] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  818.069649] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  818.089679] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  828.093839] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  828.097795] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  828.101887] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  838.117973] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  838.134610] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  838.135777] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  848.142115] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d20
(XEN) [  848.151931] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d21
(XEN) [  848.158714] ../arch/x86/mm/paging.c:687:d0v1 Tried to do a paging op on dying d22
(XEN) [  851.887321] d20: fully destroyed
(XEN) [  856.555583] d22: fully destroyed
(XEN) [  857.552832] d21: fully destroyed
(XEN) [  867.602388] ../arch/x86/mm/paging.c:693:d0v2 Paging op on a domain (24) with no vcpus
(XEN) [  869.801491] d23: avail_pages 262161085 claimed_pages 0 request 87139296 domain_pages 0 req_node 1
(XEN) [  872.305139] d24: avail_pages 244769513 claimed_pages 70427648 request 87139296 domain_pages 0 req_node 0
(XEN) [  872.378717] d25: avail_pages 244060138 claimed_pages 156876800 request 87139296 domain_pages 0 req_node 1
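
For scale (assuming 4 KiB pages), the numbers above work out roughly as follows:

echo $(( 262827592 * 4 / 1024 / 1024 ))         # initial avail_pages: ~1002 GiB, the 1 TiB host
echo $(( 87139296 * 4 / 1024 / 1024 ))          # per-domain request: ~332 GiB, about a third of the host
echo $(( (262827592 - 262161085) * 4 / 1024 ))  # shortfall once all 3 domains are destroyed: ~2603 MiB (~2.5 GiB)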

Waiting for scrubbing doesn't help either, because Xen has `scrub_pages = 0` hardcoded, so `wait_xen_free_mem` doesn't actually do anything. But that shouldn't matter, because Xen should wait for scrubbing to finish when populating the guest's physical memory.

@edwintorok
Contributor Author

Inserting a 60s sleep into the test script after all VMs have been shut down increases the number of available pages, although still not by quite enough:

(XEN) [ 1707.317515] d30: avail_pages 262178370 claimed_pages 0 request 87139296 domain_pages 0 req_node 0
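
The tweak amounts to adding a delay after the force-shutdown loop in the script above, e.g.:

        echo "Force shutdown ${DIV}"
        for i in $(seq 1 "${DIV}"); do
                xe vm-shutdown name-label="test-${ID}-${i}" --force &
        done
        wait
        sleep 60   # give Xen time to finish releasing the destroyed domains' memory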

We noticed that xenguest releases 32 unused pages from the domain's claim.
These are from the low 1 MiB video range, so avoid requesting that range.
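
As a cross-check (assuming 4 KiB pages), 32 pages matches the legacy VGA window below 1 MiB:

echo $(( 32 * 4 ))   # 128 KiB, i.e. the 0xA0000-0xBFFFF video range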

Also, always print free-memory statistics when `wait_xen_free_mem` is called.
It turns out `scrub_pages` is always 0, since this was never implemented in Xen
(it is hardcoded to 0).

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Do not let domains fully use up all available memory on the host; we have too
many unexplained bugs in this area.

As a workaround, try to reserve some amount (e.g. 256 MiB) that domains cannot
normally use from XAPI's point of view.
Then during parallel domain construction this emergency reserve can be used by
Xen.
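
A rough sketch of the accounting only, not the actual XAPI change (variable names are illustrative, and the script environment from earlier in this thread is assumed, i.e. /etc/xensource-inventory has been sourced):

RESERVE=$(( 256 * 1024 * 1024 ))   # emergency reserve, in bytes
HOST_FREE=$(xe host-param-get uuid="${INSTALLATION_UUID}" param-name=memory-free-computed)
echo $(( ${HOST_FREE} - RESERVE )) # what domains may use normally; during parallel builds Xen can dip into the reserve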

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok edwintorok force-pushed the private/edvint/hardclaim branch from 49d2d37 to 577e3a6 on December 19, 2025 at 17:42