NUMA claim handling improvements #6809
base: master
Conversation
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Do not mix using claims with not using claims. Xen cannot currently guarantee that it'll honour a VM's memory claim unless all other VMs also use claims. Global claims have existed in Xen for a long time, so this should be safe to do on both XS8 and XS9. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
On XS8 we always raise an exception when attempting to claim from a single node. We wanted to only use soft affinity when the single-node claim succeeded (which is the correct fix on XS9, where this API is available). However, this meant that we had effectively disabled NUMA support on XS8 entirely, without any way to turn it on. Always use soft affinity when the single-node claim API is unavailable; this should keep NUMA working on XS8. On XS9, Xen itself would never raise ENOSYS (it has an `err = errno = 0` on ENOSYS). Fixes: fb66dfc ("CA-421847: set vcpu affinity if node claim succeeded") Signed-off-by: Edwin Török <edwin.torok@citrix.com>
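For illustration only, a minimal OCaml sketch of the fallback described above; the helper names (`claim_node_pages`, `set_soft_affinity`) are hypothetical stand-ins, not the actual xenopsd functions.

(* Hypothetical helpers: [claim_node_pages] stands in for the single-node
   claim call, [set_soft_affinity] for pinning vCPUs softly to a node. *)
exception Claim_api_unavailable

let place_on_node ~claim_node_pages ~set_soft_affinity node pages =
  match claim_node_pages node pages with
  | () ->
      (* Claim succeeded (XS9): memory is reserved on [node], pin softly to it. *)
      set_soft_affinity node
  | exception Claim_api_unavailable ->
      (* XS8: the claim API does not exist; still use soft affinity so NUMA
         placement keeps working. *)
      set_soft_affinity node
  | exception _ ->
      (* Claim refused (e.g. not enough free memory on the node):
         fall back to the default placement. *)
      ()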
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
We've found some bugs: the claim is made too late, and Xen has already allocated some memory (vCPUs, shadow allocation, ...). I'll try to:
force-pushed from 6d07929 to ad41d56
Xen may have already allocated some memory for the domain, and the overhead is only an estimate. A global claim failing is a hard failure, so instead use a more conservative estimate: `memory.build_start_mib`. This is similar to `required_host_free_mib`, but doesn't take overhead into account. Eventually we'd want to have another argument to the create hypercall that tells it which NUMA node(s) to use, and then we can include all the overhead there too. For the single-node claim, keep the amount as it was; it is only a best-effort claim. Fixes: 060d792 ("CA-422188: either always use claims or never use claims") Signed-off-by: Edwin Török <edwin.torok@citrix.com>
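A small OCaml sketch of the sizing policy, for illustration; the record and field names mirror the values mentioned in the commit message but are hypothetical stand-ins, with amounts in MiB.

(* Hypothetical figures, in MiB: [build_start_mib] is what the domain builder
   will actually populate; [required_host_free_mib] additionally includes the
   estimated overhead. *)
type memory = {
  build_start_mib : int;
  required_host_free_mib : int;
}

(* A failed global claim is a hard failure, so size it conservatively;
   the single-node claim stays best-effort and keeps the larger estimate. *)
let global_claim_mib m = m.build_start_mib
let node_claim_mib m = m.required_host_free_mib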
force-pushed from ad41d56 to 95367e1
Test script:
#!/bin/sh
set -eu
DIV=3
. /etc/xensource-inventory
ID=$(date +%s)
HOST_FREE=$(xe host-param-get uuid="${INSTALLATION_UUID}" param-name=memory-free-computed)
VM_MEM=$(( HOST_FREE / DIV ))
UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-0")
xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
MEM_OVERHEAD=$(xe vm-param-get "uuid=${UUID}" param-name=memory-overhead)
VM_MEM=$(( VM_MEM - MEM_OVERHEAD ))
for i in $(seq 1 3); do
UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-${i}")
xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
done
while true; do
echo "Start seq ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true
done
wait
echo "Force shutdown ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-shutdown name-label="test-${ID}-${i}" --force &
done
wait
echo "Start all ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true &
done
wait
echo "Reboot 1"
xe vm-reboot name-label="test-${ID}-1" --force
echo "Reboot ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-reboot name-label="test-${ID}-${i}" --force &
done
wait
echo "Force shutdown ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-shutdown name-label="test-${ID}-${i}" --force &
done
wait
done
Still doesn't work; we're now getting hard failures on a parallel restart of 3 VMs.
When rebooting lots of VMs in parallel we might run out of memory and fail to boot all the VMs again. This is because we overestimate the amount of memory required, and claim too much. That memory is released when the domain build finishes, but when building domains in parallel it'll temporarily result in an out of memory error. Instead try to claim only what is left to be allocated: the p2m map and shadow map have already been allocated by this point. Fixes: 95367e1 ("CA-422187: safer defaults for global claims") Signed-off-by: Edwin Török <edwin.torok@citrix.com>
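For illustration, the revised global claim comes down to the arithmetic below (a sketch with hypothetical parameter names, values in MiB), rather than the full `required_host_free_mib` estimate.

(* Memory already given to the domain (p2m map, shadow allocation, ...) must
   not be claimed again, otherwise parallel builds transiently run the host
   out of memory. *)
let remaining_claim_mib ~build_start_mib ~already_allocated_mib =
  max 0 (build_start_mib - already_allocated_mib)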
We still exhaust all the memory on the system when claims are used, but it worked without claims. So we probably still claim too much (and now Xen will correctly refuse to use memory claimed by another domain).
When a domain build finishes Xen releases any extra unused memory from the claim. In my tests that is ~544 pages, which is about the amount that got added here, so we're double counting something. Remove the hack, so we allocate just the bare minimum. Fixes: 02c6ed1 ("CA-422187: do not claim shadow_mib, it has already been allocated") Signed-off-by: Edwin Török <edwin.torok@citrix.com>
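As a back-of-the-envelope check (not taken from the patch), ~544 pages of 4 KiB each is about 2.1 MiB:

(* 544 pages * 4096 bytes = 2,228,224 bytes = 2.125 MiB *)
let mib_of_pages pages = float_of_int (pages * 4096) /. (1024. *. 1024.)
let () = Printf.printf "%d pages = %.3f MiB\n" 544 (mib_of_pages 544)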
Updated script; it is very good at finding bugs when run on a host with 1 TiB of RAM.
#!/bin/sh
set -eu
DIV=3
. /etc/xensource-inventory
ID=$(date +%s)
HOST_FREE=$(xe host-param-get uuid="${INSTALLATION_UUID}" param-name=memory-free-computed)
VM_MEM=$(( HOST_FREE / DIV ))
# round VM memory down to a multiple of 2 MiB
VM_MEM=$(( 2 * 1024 * 1024 * ($VM_MEM / (2 * 1024 * 1024)) ))
UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-0")
xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
MEM_OVERHEAD=$(xe vm-param-get "uuid=${UUID}" param-name=memory-overhead)
VM_MEM=$(( VM_MEM - MEM_OVERHEAD ))
# keep the adjusted value 2 MiB-aligned
VM_MEM=$(( 2 * 1024 * 1024 * ($VM_MEM / (2 * 1024 * 1024)) ))
for i in $(seq 1 3); do
UUID=$(xe vm-install template="Other install media" new-name-label="test-${ID}-${i}")
xe vm-param-set uuid="${UUID}" VCPUs-max=6 VCPUs-at-startup=6
xe vm-memory-set uuid="${UUID}" memory="${VM_MEM}"
done
while true; do
echo "Start seq ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true
done
wait
echo "Force shutdown ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-shutdown name-label="test-${ID}-${i}" --force &
done
wait
echo "Start all ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-start name-label="test-${ID}-${i}" on="${INSTALLATION_UUID}" paused=true &
done
wait
echo "Reboot 1"
xe vm-reboot name-label="test-${ID}-1" --force
echo "Reboot ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-reboot name-label="test-${ID}-${i}" --force &
done
wait
echo "Force shutdown ${DIV}"
for i in $(seq 1 "${DIV}"); do
xe vm-shutdown name-label="test-${ID}-${i}" --force &
done
wait
done
We believe there is some memory that is released asynchronously after a domain is destroyed, although even adding code in Xen to fully wait for that is not enough to free up all memory. See how waiting for scrubbing also doesn't help at all, because Xen has
Inserting a 60s sleep in the test script after shutting down all VMs increases the number of available pages, although still not quite enough:
We noticed that xenguest releases 32 unused pages from the domain's claim. These are from the low 1 MiB video range, so avoid requesting it. Also, always print memory-free statistics when `wait_xen_free_mem` is called. It turns out `scrub_pages` is always 0, since this was never implemented in Xen (it is hardcoded to 0). Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Do not let domains fully use up all available memory on the host; we have too many unexplained bugs in this area. As a workaround, reserve some amount (e.g. 256 MiB) that domains cannot normally use from XAPI's point of view. Then, during parallel domain construction, this emergency reserve can be used by Xen. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
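A minimal sketch of the reservation arithmetic, with hypothetical names and values in MiB; the point is only that the free memory XAPI advertises for domains is Xen's free memory minus the fixed reserve.

(* Hypothetical accounting, in MiB: domains are sized against this figure,
   so the reserve stays free from XAPI's point of view while remaining
   available to Xen during parallel domain construction. *)
let emergency_reserve_mib = 256

let host_free_for_vms_mib ~xen_free_mib =
  max 0 (xen_free_mib - emergency_reserve_mib)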
force-pushed from 49d2d37 to 577e3a6
See individual commits.
This is a draft PR because it is still being tested, together with the Xen-side changes to make the allocator more reliable.