
Conversation

@cplaursen (Contributor)

Proof of concept for XAPI throttling.
This allows users to specify rate-limited clients by user-agent in xapi.conf. Each such client consumes from a token bucket whenever it makes a request, and has to wait for the bucket to refill if it exceeds its capacity.
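For context, a minimal sketch of the token-bucket mechanics described above (the names, the use of Unix.gettimeofday instead of a monotonic clock, and starting the bucket full are simplifications for illustration, not the PR's actual code):

(* Tokens refill at [fill_rate] per second, capped at [burst_size]; a
   request consumes tokens or learns how long to wait. Sketch only. *)
type bucket = {
    burst_size: float
  ; fill_rate: float            (* tokens per second *)
  ; mutable tokens: float
  ; mutable last_refill: float  (* seconds, ideally from a monotonic clock *)
}

let refill tb =
  let now = Unix.gettimeofday () in
  let elapsed = now -. tb.last_refill in
  tb.tokens <- Float.min tb.burst_size (tb.tokens +. (elapsed *. tb.fill_rate)) ;
  tb.last_refill <- now

(* Returns [None] if the request may proceed now, or [Some seconds] the
   caller should wait until the bucket has enough tokens. *)
let consume tb amount =
  refill tb ;
  if tb.tokens >= amount then (
    tb.tokens <- tb.tokens -. amount ;
    None
  ) else
    Some ((amount -. tb.tokens) /. tb.fill_rate)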

psafont and others added 8 commits November 6, 2025 12:13
Power management is an optional feature and not always available, even if the
devices are present.

Commands may fail with different error codes, each corresponding to a reason;
it's good to log this so the failure can be referenced against the IPMI 2.0 spec.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
…ode set

Could also compute it by multiplying it with [threads_per_core],
but I'm not sure how that'd interact with [smt=false] in Xen.
Also to future-proof this I wouldn't want to rely on an entirely
symmetrical architecture
(although it'd be very rare to have anything other than 2 on x86-64,
 or to have hyperthreading on in one socket, and off in another).

Note that core ids are not unique (there is a core `0` on both socket 0 and
socket 1, for example), so only work with the number of cores in the topology code.

Could've created a CoreSocketSet instead (assuming that no higher grouping than
sockets would exist in the future), but for now don't make too many assumptions
about topology.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
The planner explicitly looks at the NUMARequest fields and checks that they are
non-zero.
However if more fields get added in the future this leads to an assertion
failure, where the planner thinks it has found a solution, but NUMARequest.fits
returns false.

Ensure consistency: use `fits` in the planner to check that we've reached a
solution. If the remaining request doesn't fit into an empty node, then the
request is not empty.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
The requested number of cores is still 0, so no functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
…io_mem_only

The current NUMA policy prioritizes reducing cross-NUMA node memory traffic by
picking the smallest set of NUMA nodes that fit a VM.
It doesn't look at how this affects CPU overload within a NUMA node, or whether
the local bandwidth of each NUMA node is balanced or not.

Give this policy an explicit name, `Prio_mem_only`, and when the "compat" setting
in `xenopsd.conf` is used (`numa-placement=true`), then explicitly use this
policy instead of Best-effort.

Currently Best-effort is still equivalent to this policy, but that'll change in
a follow-up commit.
Introduce a new xenopsd.conf entry `numa-best-effort-prio-mem-only`,
which can be used to explicitly revert best effort to the current policy.
(currently this is a no-op, because there is only one best-effort policy).

Future policies should also look at CPU overload.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
NUMA optimized placement can have a large performance hit on machines with
small NUMA nodes and VMs with a large number of vCPUs.
For example a machine that has 2 sockets, which can run at most 32 vCPUs in a
single socket (NUMA node), and a VM with 32 vCPUs.

Usually Xen would try to spread the load across actual cores, and avoid the
hyperthread siblings, e.g. using CPUs 0,2,4,etc.
But when NUMA placement is used all the vCPUs must be in the same NUMA node.
If that NUMA node doesn't have enough cores, then Xen will have no choice but
to use CPUs 0,1,2,3,etc.

Hyperthread siblings share resources, and if you try to use both at the same
time you get a big performance hit, depending on the workload.

Avoid this by "requesting" cores=vcpus for each VM,
which will make the placement algorithm choose the next size up in terms of
NUMA nodes (i.e. instead of 1 NUMA node, use 2,3 as needed, falling back to using
all nodes if needed).

The potential gain from reducing memory latency with a NUMA optimized placement
(~20% on Intel Memory Latency Checker: Idle latency) is outweighed by
the potential loss due to reduced CPU capacity (40%-75% on OpenSSL, POV-Ray, and
OpenVINO), so this is the correct trade-off.

If the NUMA node is large enough, or if the VMs have a small number of vCPUs
then we still try to use a single NUMA node as we did previously.

The performance difference can be reproduced and verified easily by running
`openssl speed -multi 32 rsa4096` on a 32 vCPU VM on a host that has 2 NUMA
nodes, with 32 PCPUs each, and 2 threads per core.
This introduces a policy that can control whether we want to filter out
NUMA nodes with too few cores.

Although we want to enable this filter by default, we still want
an "escape hatch" to turn it off if we find problems with it.
That is why the "compat" setting (numa_placement=true) in xenopsd.conf
reverts back to the old policy, which is now named explicitly as Prio_mem_only.

There could still be workloads where optimizing for memory bandwidth makes more
sense (although that is a property of the NUMA node, not of individual VMs),
so although it might be desirable for this to be a VM policy, it cannot,
because it affects other VMs too.

TODO: when sched-gran=core this should be turned off. That always has the
performance hit, so might as well use smaller NUMA nodes if available.

For now this isn't exposed yet as a XAPI-level policy, because that requires
more changes (to also sort by free cores on a node, and to also sort at the
pool level by free cpus on a host).
Once we have those changes we can introduce a new policy `prio_core_mem`
that sorts by free cores first, then by free memory, and requires cores>=vcpus
(i.e. cpus>=vcpus*threads_per_core) when choosing a node.

This changes the default to the new setting, which should be equal or an
improvement in the general case.
An "escape hatch" to revert to the previous behaviour is to set
`numa-placement=true` in xenopsd.conf, and the XAPI host-level policy to
'default_policy'.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Guillaume <guillaume.thouvenin@vates.tech>
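For illustration of the core-count filter described in the commit message above, a hedged sketch (the record fields and function names are hypothetical, not the xenopsd planner's actual code): a candidate set of NUMA nodes only fits a VM if it offers at least as many physical cores as the VM has vCPUs, in addition to enough memory.

(* Hypothetical types, for illustration only. *)
type node = {cores: int; free_mem_mib: int64}

let candidate_fits ~vcpus ~mem_mib nodes =
  let cores = List.fold_left (fun acc n -> acc + n.cores) 0 nodes in
  let mem = List.fold_left (fun acc n -> Int64.add acc n.free_mem_mib) 0L nodes in
  cores >= vcpus && Int64.compare mem mem_mib >= 0

(* Example: a 32-vCPU VM on a host with two 16-core (32-thread) nodes
   cannot fit on a single node's cores, so the planner moves up to both. *)
let () =
  let node = {cores= 16; free_mem_mib= 65536L} in
  assert (not (candidate_fits ~vcpus:32 ~mem_mib:8192L [node])) ;
  assert (candidate_fits ~vcpus:32 ~mem_mib:8192L [node; node])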
This partially applies the following commit to reduce the complexity of a
Xen-4.20 patch:

> xenopsd-xc: do not try keep track of free memory when planning NUMA nodes (CA-411684)
>
> Free memory is now properly accounted for because the memory pages are claimed
> within the NUMA mutex, so there's no need to have double tracking.
>
> On top of that, this code never increased the free memory, which means that it
> always reached a point where it was impossible to allocate a domain into a
> single numa node.
> Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>

However it doesn't actually drop the free memory accounting code, so:
No functional change

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@cplaursen cplaursen changed the title Token bucket XAPI throttling proof of concept Dec 1, 2025
@cplaursen cplaursen requested a review from robhoes December 1, 2025 13:39
@psafont psafont (Member) left a comment:

It's a bit tough to understand how the data structure does rate-limiting. I had to find documentation online to understand it. As such, I would appreciate a more in-depth explanation at the header of the mli file. For example, it would be good to understand when the "refill" timestamp is changed, and to explain that some of the methods are only meant to be used for testing.

(** Create token bucket with given parameters and supplied initial timestamp
@param timestamp Initial timestamp
@param burst_size Maximum number of tokens that can fit in the bucket
@param fill_rate Number of tokens added to the bucket per second
Member commented:

fill_rate has an implicit unit (Hz). I think it could be worth representing the fill rate as the amount of time it takes the bucket to go from empty to full. This is a timespan, and we can use a known datatype with an explicit unit here: Mtime.span.

How some operations would be changed:

let peek_with_delta time_delta tb =
  let fill_time = Mtime.Span.to_float_ns tb.fill_time in
  let time_delta = Mtime.Span.to_float_ns time_delta in
  min tb.burst_size (tb.tokens +. (time_delta /. fill_time))

let delay_until_available_with_delta delta tb amount =
  let current_tokens = peek_with_delta delta tb in
  let required_tokens = max 0. (amount -. current_tokens) in
  required_tokens *. (Mtime.Span.to_float_ns tb.fill_time)
  |> Float.to_int64
  |> Mtime.Span.of_uint64_ns

Contributor Author commented:

I can see the appeal of adding units to the fill_rate but I'd rather keep the burst size decoupled from fill rate, and I think it's more intuitive as tokens per second than seconds per token.

Contributor commented:

How can this fail, i.e., return None?

Contributor commented:

I think only in some extreme cases, such as an arithmetic overflow, although that would require an Mtime span of ~584 years.
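A small illustration of the overflow case (assuming the delay is converted via Mtime.Span.of_float_ns, which rejects values that do not fit in the unsigned 64-bit nanosecond range, i.e. roughly 584 years):

(* Requires the mtime library. *)
let delay_of_seconds s : Mtime.Span.t option = Mtime.Span.of_float_ns (s *. 1e9)

let () =
  assert (delay_of_seconds 1.0 <> None) ;
  (* ~10^12 years expressed in nanoseconds overflows the span representation. *)
  assert (delay_of_seconds 3.2e19 = None)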

burst_size: float
; fill_rate: float
; mutable tokens: float
; mutable last_refill: Mtime.span
Member commented:

Is there any reason this cannot be a counter? They return the time difference directly: https://ocaml.org/p/mtime/latest/doc/mtime.clock/Mtime_clock/index.html#counters

This would mean that the peek functions would turn into:

let peek_with_delta time_delta tb =
  let time_delta_seconds = Mtime.Span.to_float_ns time_delta *. 1e-9 in
  min tb.burst_size (tb.tokens +. (time_delta_seconds *. tb.fill_rate))

let peek tb = peek_with_delta (Mtime_clock.count tb.last_refill) tb

Contributor Author commented:

Keeping a counter does simplify things a bit, but if I understand correctly it forces us to make two system time calls when consuming tokens: one to obtain the difference from the counter, and another to produce a new counter. It probably won't make much of a difference; I can try profiling.
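A sketch of the two clock reads in question (hedged; this is the hypothetical counter-based variant of the record, not the code in this PR):

(* Requires mtime.clock.os. Each marked line reads the monotonic clock once. *)
type bucket = {
    burst_size: float
  ; fill_rate: float  (* tokens per second *)
  ; mutable tokens: float
  ; mutable last_refill: Mtime_clock.counter
}

let refill tb =
  let elapsed = Mtime_clock.count tb.last_refill in  (* clock read 1 *)
  let elapsed_s = Mtime.Span.to_float_ns elapsed *. 1e-9 in
  tb.tokens <- Float.min tb.burst_size (tb.tokens +. (elapsed_s *. tb.fill_rate)) ;
  tb.last_refill <- Mtime_clock.counter ()           (* clock read 2 *)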

@cplaursen (Contributor Author)

Added some documentation to the token bucket module to explain how rate limiting is applied.

Lin Liu and others added 2 commits December 2, 2025 02:36
- Introduce https_only argument for Host.create
- Set https_only from configuration for installation
- Keep https_only from joining host during pool join

Signed-off-by: Lin Liu <lin.liu01@citrix.com>
- Introduce https_only argument for Host.create
- Set https_only from configuration for installation
- Keep https_only from joining host during pool join
@cplaursen cplaursen force-pushed the token-bucket branch 4 times, most recently from dd10c1c to a540ca7 Compare December 2, 2025 10:56
…ect#6763)

NUMA optimized placement can have a large performance hit on machines
with small NUMA nodes and VMs with a large number of vCPUs. For example
a machine that has 2 sockets, which can run at most 32 vCPUs in a single
socket (NUMA node), and a VM with 32 vCPUs.

Usually Xen would try to spread the load across actual cores, and avoid
the hyperthread siblings (when the machine is sufficiently idle, or the
workload is bursty), e.g. using CPUs 0,2,4,etc.
But when NUMA placement is used all the vCPUs must be in the same NUMA
node. If that NUMA node doesn't have enough cores, then Xen will have no
choice but to use CPUs 0,1,2,3,etc.

Hyperthread siblings share resources, and if you try to use both at the
same time you get a big performance hit, depending on the workload.
We've also seen this previously with Xen's core-scheduling support
(which is off by default)

Avoid this by "requesting" `threads_per_core` times more vCPUs for each
VM, which will make the placement algorithm choose the next size up in
terms of NUMA nodes (i.e. instead of a single NUMA node use 2,3 as
needed, falling back to using all nodes if needed).

The potential gain from reducing memory latency with a NUMA optimized
placement (~20% on Intel Memory Latency Checker: Idle latency) is
outweighed by the potential loss due to reduced CPU capacity (40%-75% on
OpenSSL, POV-Ray, and OpenVINO), so this is the correct tradeoff.

If the NUMA node is large enough, or if the VMs have a small number of
vCPUs then we still try to use a single NUMA node as we did previously.

The performance difference can be reproduced and verified easily by
running `openssl speed -multi 32 rsa4096` on a 32 vCPU VM on a host that
has 2 NUMA nodes, with 32 PCPUs each, and 2 threads per core.
edwintorok and others added 4 commits December 2, 2025 15:09
This partially applies the following commit to reduce the complexity of
a Xen-4.20 patch:

> xenopsd-xc: do not try keep track of free memory when planning NUMA
> nodes (CA-411684)
>
> Free memory is now properly accounted for because the memory pages are
> claimed within the NUMA mutex, so there's no need to have double tracking.
>
> On top of that, this code never increased the free memory, which means
> that it always reached a point where it was impossible to allocate a
> domain into a single numa node.
> Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>

However it doesn't actually drop the free memory accounting code, so: No
functional change
Define and implement an operation that uses Xen's fast resume to resume a
domain. This operation is currently not used but has been tested. It is
accessible from the xenopsd CLI ("xenops-cli") for experiments.

Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
The current default value for the NVMe MDTS parameter exposed in QEMU emulated
NVMe devices is 7 (max 512KiB requests).  However there seems to be an
internal Windows Server 2025 issue that possibly triggers when splitting
bigger requests into smaller ones in the NVMe Windows driver.

Increase the exposed MDTS value on the emulated QEMU NVMe device to 9 (max 2MiB
request size), as that seems to drop the reproduction rate of the issue.

Discussion is ongoing with Microsoft to get the issue identified and
possibly sorted on their end.  For the time being apply this mitigation in
qemu-wrapper as a workaround.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
The current default value for the NVMe MDTS parameter exposed in QEMU
emulated NVMe devices is 7 (max 512KiB requests). However there seems to
be an internal Windows Server 2025 issue that possibly triggers when
splitting bigger requests into smaller ones in the NVMe Windows driver.

Increase the exposed MDTS value on the emulated QEMU NVMe device to 9
(max 2MiB request size), as that seems to drop the reproduction rate of
the issue.

Discussion is ongoing with Microsoft to get the issue identified and
possibly sorted on their end. For the time being apply this mitigation
in qemu-wrapper as a workaround.
@cplaursen cplaursen force-pushed the token-bucket branch 2 times, most recently from 532a6df to d6b68df Compare December 4, 2025 11:25
To be replaced with a proper datamodel. Bucket tables are used for
mapping requests to their respective token bucket so that they can
be rate limited.

Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Zero or negative rate limits can cause issues in the behaviour of rate
limiting. In particular, a zero fill rate leads to a division by zero in time
calculations. Rather than account for this, we forbid the creation of token
buckets with a bad fill rate by returning None.

Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
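A hedged sketch of the validation described in this commit (the actual record fields and signature in the PR may differ; starting the bucket full is an assumption):

type t = {
    burst_size: float
  ; fill_rate: float
  ; mutable tokens: float
  ; mutable last_refill: Mtime.span
}

(* Reject non-positive parameters at construction time so later time
   calculations never divide by zero or go negative. *)
let create ~timestamp ~burst_size ~fill_rate : t option =
  if fill_rate <= 0. || burst_size <= 0. then
    None
  else
    Some {burst_size; fill_rate; tokens= burst_size; last_refill= timestamp}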
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Make token bucket type abstract to hide Hashtbl.t
Use `replace` rather than `add` for adding a new bucket

Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
The current implementation of rate limiting had severe fairness issues.
These have been resolved through the addition of a request queue, to
which rate limited requests are added. A worker thread sleeps until its
associated token bucket has enough tokens to handle the request at the
head of the queue, calls it, and sleeps until the next request is ready.

Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
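A rough sketch of the queue-plus-worker shape described above (hedged: the request and limiter types, and the delay_until_available/consume helpers, are assumptions, not the PR's actual definitions):

(* Requires the threads library. Requests are served strictly in FIFO
   order; the worker sleeps until the head request can be paid for. *)
type request = {run: unit -> unit; cost: float}

type limiter = {queue: request Queue.t; mutex: Mutex.t; nonempty: Condition.t}

let enqueue lim req =
  Mutex.lock lim.mutex ;
  Queue.push req lim.queue ;
  Condition.signal lim.nonempty ;
  Mutex.unlock lim.mutex

let worker lim ~delay_until_available ~consume =
  while true do
    Mutex.lock lim.mutex ;
    while Queue.is_empty lim.queue do
      Condition.wait lim.nonempty lim.mutex
    done ;
    let req = Queue.pop lim.queue in
    Mutex.unlock lim.mutex ;
    let delay = delay_until_available req.cost in
    if delay > 0. then Thread.delay delay ;
    consume req.cost ;
    req.run ()
  done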
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Creating a token bucket fails if the supplied rate limit is 0 or negative,
since such values can lead to unexpected and undesirable behaviour, such as
division by 0 or negative token counts.

Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
The rate limiting can no longer be set from xapi_globs. Instead, the rate
limiter is now initialised from the database at startup.

Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Synchronous requests can be long-running, which can cause issues if they are all
processed on the same worker thread.

This commit updates the code to process synchronous requests on the original
caller thread - the worker thread is now only responsible for signalling on a
provided channel to wake up the caller.

Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
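A hedged sketch of the hand-off described in this commit, using the stdlib Event module as the channel (the PR's actual channel type and enqueue function may differ):

(* The worker only signals readiness; the long-running handler then runs
   on the caller's own thread. Requires the threads library. *)
let rate_limited_call enqueue (handler : unit -> 'a) : 'a =
  let ready : unit Event.channel = Event.new_channel () in
  enqueue (fun () -> Event.sync (Event.send ready ())) ;  (* worker wakes us *)
  Event.sync (Event.receive ready) ;                      (* caller sleeps here *)
  handler ()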
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
@cplaursen (Contributor Author)

Rebased and added some changes I've been working on. Users can now add and remove rate limiters from the command line, and the RPC handler is rate limited differently in the sync and async cases.

Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
When an async request is rate limited, we confirm receipt immediately
but enqueue the actual request, rather than rate limiting the first
response too.

Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>
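A hedged sketch of the asynchronous case described above ([enqueue] and [create_task] are assumed names, not the PR's actual API):

(* The task reference is handed back to the client immediately; only the
   execution of the task body waits behind the token bucket. *)
let rate_limited_async enqueue create_task (body : unit -> unit) =
  let task_ref = create_task () in  (* receipt confirmed right away *)
  enqueue body ;                    (* actual work is queued and rate limited *)
  task_ref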
Signed-off-by: Christian Pardillo Laursen <christian.pardillolaursen@citrix.com>