
Conversation

@Keno
Member

@Keno Keno commented Nov 30, 2025

Introduction

This commit is a first sketch for what I would like to do for robust cancellation
(i.e. "Making ^C just work"). At this point it's more of a sketch than a real PR,
but I think I've done enough of the design for a design discussion.

The first thing I should say is that the goal of this PR is very narrowly to
make ^C work well. As part of that, we're taking a bit of a step towards
structured concurrency, but I am not intending this PR to be a full implementation
of that.

Given that some of this has been beaten to death in previous issues, I will also
not do my usual motivation overview, instead jumping straight into the implementation.
As I said, the motivation is just to make ^C work reliably at this point.

Setting the stage

When we're trying to cancel a task, it will be in one of two broad categories:

  1. Waiting for some other operation to complete (e.g. an IO operation, another task,
    an external event, etc.). Here, the actual cancellation itself is not so difficult
    (after all, the task is not running, but suspended in a somewhat well-defined place).
    However, robust cancellation requires us to potentially propagate the cancellation
    signal down the wait tree, since the operation we actually want to cancel may not
    be the root task, but may instead be some operation being performed by the task
    we're waiting on (and we'd prefer not to leak those operations and have rogue tasks
    going around performing potentially side-effecting operations).

  2. Currently running and doing some computation. The core problem is not really one of
    propagation (after all the long-running computation is probably what we're wanting
    to cancel), but rather how to do the cancellation without state corruption. A lot of
    the crashiness of our existing ^C implementation is simply that we inject
    an exception in places that are not expecting to handle it.

For a full solution to the problem, we need to have an answer for both of these points.
I will begin with the second, since the first builds upon it.

Cancellation points

This PR introduces the concept of a cancellation request and a cancellation point.
Each task has a cancellation_request field that can be set externally (e.g. by ^C).
Any task performing computation should regularly check this field and abort its
computation if a cancellation request is pending.

For this purpose, the PR provides the @cancel_check macro. This macro turns a pending
cancellation request into a well-modeled exception. Package authors should insert a
call to the macro into any long-running loops. However, there is of course some overhead
to the check, and it is therefore inappropriate for tight inner loops.
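
For illustration, the intended usage is simply to drop the macro into the body of a
long-running loop (a minimal sketch; some_expensive_step is a hypothetical placeholder):

function churn(n)
    acc = 0
    for i in 1:n
        @Base.cancel_check             # a pending cancellation request becomes an exception here
        acc += some_expensive_step(i)  # hypothetical unit of work
    end
    return acc
end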

We attempt to address this with compiler support. Note that this part is currently
incompletely implemented, so the following describes the design rather than the current
state of the PR. Consider the cancel_check macro:

macro cancel_check()
    quote
        local req = Core.cancellation_point!()
        if req !== nothing
            throw(conform_cancellation_request(req))
        end
    end
end

where cancellation_point! is a new intrinsic that defines a cancellation point. The
compiler is semantically permitted to extend the cancellation point across any following
effect_free calls (note for transitivity reasons, the effect is not exactly the same,
but is morally equivalent). Upon passing a cancellation_point!, the system will
set the current task's reset_ctx to this cancellation point. If a cancellation request
occurs before the reset_ctx is cleared, the task's execution will be reset to the
nearest cancellation point. I proposed this mechanism in #52291.

Additionally, the reset_ctx can in principle be used to establish scoped cancellation
handlers for external C libraries as well, although I suspect that there are not many
C libraries that are actually reset-safe in the required manner (since allocation is not).

Note that cancellation_point! is also intended to be a yield point in order to facilitate
the ^C mechanism described below. However, this is not currently implemented.

Structured cancellation

Turning our attention now to the first of the two cases mentioned above, we tweak the task's
existing queue reference to become a generic (atomic) "waitee" reference. The queue is
required to be obtainable from this object via the new waitqueue generic function.
To cancel a waiter waiting for a waitable waitee object, we:

  1. Set the waiter's cancellation request
  2. Load the waitee and call a new generic function cancel_wait!,
    which shall do whatever synchronization and internal bookkeeping is
    required to remove the task from the wait-queue and then resume the
    task.
  3. The waiter resumes in the wait code. It may now decide how and whether to
    propagate the cancellation to the object it was just waiting on. Note that
    this may involve re-queuing a wait (to wait for the cancellation of waitee
    to complete).

The idea here is that this provides a well-defined context for cancellation-propagation
logic to run. I wanted to avoid having any cancellation propagation logic run in parallel
with actual wait code.
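
For illustration (this is not the PR's actual code), a cancel_wait! method for a
hypothetical lock-protected waitable could look roughly like the following; MyWaitable
and its fields are purely illustrative:

function cancel_wait!(waitee::MyWaitable, waiter::Task)
    removed = lock(waitee.lock) do
        q = waitqueue(waitee)      # obtain the wait queue from the waitee
        if waiter in q
            delete!(q, waiter)     # bookkeeping: drop the task from the wait queue
            true
        else
            false                  # lost the race: the waiter was already woken normally
        end
    end
    removed && schedule(waiter)    # resume the waiter inside its wait code (step 3 above)
    return removed
end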

How the cancellation propagates is a bit of a policy question and not one that I fully
intend to address in this PR. My plan is to implement a basic state machine that works
well for ^C (by requesting safe cancellation immediately and then requesting increasingly
unsafe modes of cancellation upon timeout or repeated ^C), but I anticipate that external
libraries will want to create their own cancellation request state machines, which the
system supports. The implementation is incomplete, so I will not describe it here yet.

One may note that there are a significant number of additional fully dynamic dispatches
in this scheme (at least waitqueue and cancel_wait!, and possibly more in the future).
However, note that these dynamic dispatches are confined to the cancellation path, which
is not throughput-sensitive (but is latency sensitive).

^C handling

The handling of ^C is delegated to a dedicated task that gets notified from the
signal handler when a SIGINT is received (similar to the existing profile listener
task). There is a little bit of an additional wrinkle in that we need some logic to
kick out a computational task to its nearest cancellation point if we do not have
any idle threads. This logic is not yet implemented.
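
For a rough idea of the shape of that dedicated listener task (a sketch only; the
AsyncCondition plumbing and the exact cancel! signature are assumptions here, not
what this PR implements):

const sigint_wakeup = Base.AsyncCondition()   # notified from the C signal handler via uv_async_send

function sigint_listener(foreground::Task)
    while true
        wait(sigint_wakeup)     # blocks until a SIGINT is forwarded
        cancel!(foreground)     # set the cancellation request and start propagation
    end
end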

Examples to try

julia> sleep(1000)
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
 [1] macro expansion
   @ ./condition.jl:134 [inlined]
 [2] _trywait(t::Timer)
   @ Base ./asyncevent.jl:195
 [3] wait
   @ ./asyncevent.jl:204 [inlined]
 [4] sleep(sec::Int64)
   @ Base ./asyncevent.jl:322
 [5] top-level scope
   @ REPL[1]:1

julia> collatz(n) = (n & 1) == 1 ? (3n + 1) : (n÷2)
collatz (generic function with 1 method)

julia> function find_collatz_counterexample()
          i = 1
          while true
             j = i
             while true
                @Base.cancel_check 
                j = collatz(j)
                j == 1 && break
                j == i && error("$j is a collatz counterexample")
             end
             i += 1
          end
       end
find_collatz_counterexample (generic function with 1 method)

julia> find_collatz_counterexample()
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
 [1] macro expansion
   @ ./condition.jl:134 [inlined]
 [2] find_collatz_counterexample()
   @ Main ./REPL[2]:6
 [3] top-level scope
   @ REPL[3]:1

julia> wait(@async sleep(100))
^CERROR: TaskFailedException
Stacktrace:
 [1] wait(t::Task; throw::Bool)
   @ Base ./task.jl:367
 [2] wait(t::Task)
   @ Base ./task.jl:360
 [3] top-level scope
   @ REPL[4]:0
 [4] macro expansion
   @ task.jl:729 [inlined]

    nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
    Stacktrace:
     [1] macro expansion
       @ ./condition.jl:134 [inlined]
     [2] _trywait(t::Timer)
       @ Base ./asyncevent.jl:195
     [3] wait
       @ ./asyncevent.jl:204 [inlined]
     [4] sleep
       @ ./asyncevent.jl:322 [inlined]
     [5] (::var"#2#3")()
       @ Main ./REPL[4]:1

julia> @sync begin
         @async sleep(100)
         @async find_collatz_counterexample()
     end
^CERROR:     nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
    Stacktrace:
     [1] macro expansion
       @ ./task.jl:1234 [inlined]
     [2] _trywait(t::Timer)
       @ Base ~/julia-cancel/usr/share/julia/base/asyncevent.jl:195
     [3] wait
       @ ./asyncevent.jl:203 [inlined]
     [4] sleep
       @ ./asyncevent.jl:321 [inlined]
     [5] (::var"#45#46")()
       @ Main ./REPL[26]:3

...and 1 more exception.

Stacktrace:
 [1] sync_cancel!(c::Channel{Any}, t::Task, cr::Any, c_ex::CompositeException)
   @ Base ~/julia-cancel/usr/share/julia/base/task.jl:1454
 [2] sync_end(c::Channel{Any})
   @ Base ~/julia-cancel/usr/share/julia/base/task.jl:608
 [3] macro expansion
   @ ./task.jl:663 [inlined]
 [4] (::var"#43#44")()
   @ Main ./REPL[5]

As noted above, the @Base.cancel_check is not intended to be required in the inner loop.
Rather, the compiler is expected to extend the cancellation point from the start of the loop
to the entire function. However, this is not yet implemented.

I hope this will fix (more of a checklist for me at this point): #4037 #6283 #25790 #29369 #36379 #42072 #43451 #45055 #47839 #50045 #56462 #56545 #58105 #58849 #58689
Closes #49541
Part of #33248 #52291
Refs, but yet to be decided #58259 #35026 #39699

TODO

  • Audit all uses of wait()
  • Look into libuv write cancellation
  • Implement libuv write cancellation on Windows
  • Submit libuv write cancellation PR upstream
  • Look into BLAS cancellation (punted)
  • (Maybe in a future PR) More compiler optimizations for cancellation_point!
  • Merge asymmetric fences
  • (Maybe in a future PR) Optimize reset_ctx establishment speed
  • (Separate PR) Pointer-int-union optimizations
  • Implement the "unfriendly" cancellation types
  • Early interruption of inference
  • Outlined cancellation handlers


@jpsamaroo
Member

Just want to say that I really like this approach! ❤️

I especially like the design of making Base.@cancel_check a point that the task can longjmp back to from effect-free code, as it (presumably) doesn't require waiting until said code finishes to hit a cancellation exit (as shown with the Collatz example). I am a bit confused how this will be implemented - will the ^C handling logic detect when a task is executing code within an effect-free region? Is the idea that reset_ctx (which is stored on the task) is cleared by the compiler once the effect-free region is left (which is also presumably a point where the task cooperatively checks for a cancellation request), and so the ^C handling logic knows when it can just forcibly suspend and reset the task via longjmp to reset_ctx?

Regarding async logic implemented in Base, what is the intended policy for whether to cancel just the waiter, or waiter+waitee? Should we expect that all async resources from Base will cancel on/within any async call, for predictability? That is to say, if we have a producer-consumer setup on a Channel, can I expect that both sides will receive an exception once/while they interact with the Channel? Similarly wondering about this for Threads.Condition, Base.Event, etc. If the answer is "yes", will there be a way to opt-out of this (in the case that the resource needs to continue operating normally to allow surrounding library logic to cancel itself)?

Regarding more than just ^C, can we expect that SIGTERM (and maybe also SIGSTOP and other fun signals) will one day invoke this logic as well? I can imagine that when trying to terminate a complex application, having the first course of action be to cancel ongoing work is conducive to a safe and expedient shutdown. We would of course want to then do the finalizer and atexit dance, which should hopefully now be able to do their jobs without concern for resources being locked or otherwise unavailable for cleanup.

Regarding timeouts and other structured cancellation, will it be possible to target a cancellation request at a particular task? I can imagine that this will avoid having to implement APIs like wait(obj; timeout), as we can just wrap the wait(obj) call with some logic that will send a cancellation request to just that task if the timeout expires before wait returns. This would also make it easy for libraries like Dagger to request arbitrary user code (running within a Dagger-launched task) to cancel when Dagger decides it's desirable (possibly without direct user input).

Aside: I do think it's worth thinking more on whether we want users to have a way to target the cancellation at a library/task/arbitrary machinery, but as mentioned, this is mostly an orthogonal concern.

@Keno
Member Author

Keno commented Nov 30, 2025

> I am a bit confused how this will be implemented - will the ^C handling logic detect when a task is executing code within an effect-free region?

Kind of. The canceler checks reset_ctx. If non-null, it sends a signal to the thread, which then longjmps to reset_ctx if still non-null and if cancellation_request is set on the currently running task.

> Is the idea that reset_ctx (which is stored on the task) is cleared by the compiler once the effect-free region is left (which is also presumably a point where the task cooperatively checks for a cancellation request)

Yes

> Regarding async logic implemented in Base, what is the intended policy for whether to cancel just the waiter, or waiter+waitee?

Policy decision by the async library, so I'm not really expressing a preference. For now, wait cancels the waitee and there's
wait_nocancel to opt out. In the future there could be fancier APIs for cancellation scope.

> Should we expect that all async resources from Base will cancel on/within any async call, for predictability?

I was at this point not expecting cancellation to propagate through channels and conditions - rather I was expecting that it would cancel the wait on those objects and then the thrown exception might potentially cancel the expected producer in its cleanup scope - however, that's a bit of an orthogonal API design question that I don't have a strong opinion on.

> Regarding more than just ^C, can we expect that SIGTERM (and maybe also SIGSTOP and other fun signals)

Maybe - I could imagine SIGTERM trying to cancel all tasks in the system simultaneously with this mechanism - I don't know if the tree-based cancellation makes sense there, but it could be useful for graceful shutdown.

> Regarding timeouts and other structured cancellation, will it be possible to target a cancellation request at a particular task?

The PR provides a cancel! API.
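
For illustration, the timeout case could then be a small wrapper around cancel!
(an untested sketch, assuming a one-argument cancel!(task) form):

function wait_with_timeout(obj, timeout_s)
    waiter = @async wait(obj)
    timer = Timer(timeout_s) do _
        istaskdone(waiter) || cancel!(waiter)   # cancel only this waiter when time runs out
    end
    try
        return fetch(waiter)    # a cancellation surfaces as a TaskFailedException
    finally
        close(timer)
    end
end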

@Keno
Member Author

Keno commented Nov 30, 2025

I guess I should have said that I want the cancellation point to be a preemption point rather than a yield point. We don't currently have that concept, so it's a bit of an open question whether those are different, but I wanted to be precise.

@Keno
Member Author

Keno commented Dec 2, 2025

Capturing some Slack discussion with @vtjnash. This reflects my best understanding, but @vtjnash was trying to make a larger point that I don't quite understand.

  1. Are plain wait, yield, yieldto, etc. cancellation points?

Probably not. The correctness of these functions depends on being paired with a unique schedule that resumes it and has correctness guarantees that need to be enforced at a higher level. There does not seem to be any good way to make these automatic cancellation points.

@vtjnash provided the example ct = current_task(); t = @task(yieldto(ct); nothing); yieldto(t); wait(t). This task t cannot be canceled, because it's not at a cancellation point.

  2. Are locks cancellation points?

I think we need to have both versions, with the user selecting the appropriate one.

  3. How is the cancellation guaranteed without seq_cst on the waitee field (which would be expensive)?

I think some variant of the following works:

Cancelling thread:

while (true)
    jl_atomic_store_relaxed(&t->cancellation_request, req);
    SYS_membarrier
    if cancel_wait!(jl_atomic_load_acquire(&t->waitee), t)
        break
    end
end

Waiting thread:

jl_atomic_store_release(&t->waitee, wait);
barrier();
if (cancelled(jl_atomic_load_relaxed(&t->cancellation_request))) throw();
wait();
  4. What happens to threads that haven't started yet?

I think this is a cancellation point and the thread dies. This is different from @async wait() because you can't put try/catch around it.

  5. Does having the waitee field prevent GC of events that will never fire?

I think it can be a weakref.

  6. Libuv does not support write cancellation.

Probably should be fixed, ignore for now.

  7. The term "safe cancellation" is inappropriate, because it can fail and hang, which doesn't feel very safe.

Not attached to the term. The naming was due to the possibility of introducing more unsafe cancellation variants (to be used on timeout or repeated ^C) that, while being more likely to succeed, could leave the system in an inconsistent state. Useful for looking around for debugging, but not semantically sound.

@Keno
Member Author

Keno commented Dec 4, 2025

Now with compiler and reset_ctx support, courtesy of Claude (note the absence of explicit cancellation points inside the inner loop):

julia> collatz(n) = (n & 1) == 1 ? (3n + 1) : (n÷2)
collatz (generic function with 1 method)

julia> function find_collatz_counterexample_inner()
           i = 1
           while true
               j = i
               while true
                   j = collatz(j)
                   j == 1 && break
                   j == i && return j
               end
               i += 1
           end
       end
find_collatz_counterexample_inner (generic function with 1 method)

julia> function find_collatz_counterexample2()
           @Base.cancel_check
           return find_collatz_counterexample_inner()
       end
find_collatz_counterexample2 (generic function with 1 method)

julia> find_collatz_counterexample2()
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
 [1] handle_cancellation!(_req::Any)
   @ Base ./task.jl:1423
 [2] macro expansion
   @ ./condition.jl:133 [inlined]
 [3] find_collatz_counterexample2()
   @ Main ./REPL[2]:2
 [4] top-level scope
   @ REPL[3]:1

@vtjnash vtjnash added needs nanosoldier run This PR should have benchmarks run on it needs pkgeval Tests for all registered packages should be run with this change labels Dec 4, 2025
Keno added a commit that referenced this pull request Dec 4, 2025
Asymmetric atomic fences are a performance optimization of regular
atomic fences (the seq_cst version of which we expose as
`Base.Threads.atomic_fence`). The problem with these regular fences
is that they require a CPU fence instruction, which can be very
expensive and is thus unsuitable for code in the hot path.
Asymmetric fences on the other hand split an ordinary fence into
two: A `light` side where the fence is extremely cheap (only a
compiler reordering barrier) and a `heavy` side where the fence
is very expensive.

Basically the way it works is that the heavy side does a system call
that issues an inter-processor-interrupt (IPI) which then issues
the appropriate barrier instruction on the other CPU (i.e. both
CPUs will have issued a barrier instruction, one of them
just does it asynchronously due to the interrupt).

The `light` and `heavy` naming here is taken from C++ P1202R5 [1],
which is the proposal for the same feature in the C++ standard
library (to appear in the next iteration of the C++ concurrency
spec).

On the julia side, these functions are exposed as
`Threads.atomic_fence_light` and `Threads.atomic_fence_heavy`.
The light side lowers to `fence singlethread` in LLVM IR (the
Core.Intrinsic atomic_fence is adjusted appropriately to facilitate
this). The heavy side has OS-specific implementations, where:

1. Linux/FreeBSD try to use the `membarrier` syscall or a
   fallback to `mprotect` for systems that don't have it.
2. Windows uses the `FlushProcessWriteBuffers` syscall.
3. macOS uses an implementation from the dotnet runtime
   (dotnet/runtime#44670), which
   the dotnet folks have checked with Apple does the right
   thing by happenstance (i.e. an IPI/memory barrier is needed
   to execute the syscall), but looks a little nonsensical by itself.
   However, since it's what Apple recommended to dotnet, I don't
   see much risk here, though I wouldn't be surprised if Apple added
   a proper syscall for this in the future (since FreeBSD has it now).

Note that unlike the C++ spec, I have specified that
`atomic_fence_heavy` does synchronize with `atomic_fence`. This
matches the underlying system call. I suspect C++ chose to omit
this for a hypothetical future architecture that has instruction
support for doing this from userspace that would then not
synchronize with ordinary barriers, but I think I would rather
cross that bridge when we get there.

I intend to use this in #60281, but it's an independently useful
feature.

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1202r5.pdf
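
For illustration (not part of this commit), the intended usage pattern pairs the cheap
fence on the frequently-executed side with the expensive fence on the rare side, so the
two sides still order against each other as with a pair of ordinary fences. The `Flags`
struct and helper functions below are purely illustrative:

mutable struct Flags
    @atomic fast::Bool
    @atomic slow::Bool
end

# Hot path (e.g. the waiting task): pays only a compiler reordering barrier.
function fast_side!(f::Flags)
    @atomic :monotonic f.fast = true
    Threads.atomic_fence_light()
    return @atomic :monotonic f.slow
end

# Cold path (e.g. the canceler): pays for the membarrier/IPI.
function slow_side!(f::Flags)
    @atomic :monotonic f.slow = true
    Threads.atomic_fence_heavy()
    return @atomic :monotonic f.fast
end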
Keno added a commit that referenced this pull request Dec 4, 2025
The `preserve_none` calling convention is a new calling convention in
clang (>= 19) and gcc that preserves a more minimal set of registers
(rsp, rbp on x86_64; lr, fp on aarch64). As a result, if this calling
convention is used with setjmp, those registers do not need to be stored
in the setjmp buffer, allowing us to reduce the size of this buffer and
use fewer instructions to save the buffer. The tradeoff of course is
that these registers may need to be saved anyway, in which case
both the stack usage and the instructions just move to the caller
(which is strictly worse). It is not clear that this is useful for
exceptions (which already have a fair bit of state anyway, so even
in the happy path the savings are not necessarily that big), but
I am thinking about using it for #60281, which has different
characteristics, so this is an easy way to try out whether there
are any unexpected challenges.

Note that preserve_none is a very recent compiler feature, so most
compilers out there do not have it yet. For compatibility, this PR
supports using different jump buffer formats in the runtime and
the generated code.
A reviewer commented on the codegen change (which marks a call as ReturnsTwice): all stores in the function after this call but before a matching longjmp need to be changed to use volatile=true if they can be observed after the longjmp.

Keno added 7 commits December 4, 2025 22:45