
Conversation

@Keno
Member

@Keno Keno commented Nov 30, 2025

Introduction

This commit is a first sketch for what I would like to do for robust cancellation
(i.e. "Making ^C just work"). At this point it's more of a sketch than a real PR,
but I think I've done enough of the design for a design discussion.

The first thing I should say is that the goal of this PR is very narrowly to
make ^C work well. As part of that, we're taking a bit of a step towards
structured concurrency, but I am not intending this PR to be a full implementation
of that.

Given that some of this has been beaten to death in previous issues, I will also
not do my usual motivation overview, instead jumping straight into the implementation.
As I said, the motivation is just to make ^C work reliably at this point.

Setting the stage

When we're trying to cancel a task, it will be in one of two broad categories:

  1. Waiting for some other operation to complete (e.g. an IO operation, another task,
    an external event, etc.). Here, the actual cancellation itself is not so difficult
    (after all, the task is not running, but suspended in a somewhat well-defined place).
    However, robust cancellation requires us to potentially propagate the cancellation
    signal down the wait tree, since the operation we actually want to cancel may not
    be the root task, but may instead be some operation being performed by the task
    we're waiting on (and we'd prefer not to leak those operations and have rogue tasks
    going around performing potentially side-effecting operations).

  2. Currently running and doing some computation. The core problem is not really one of
    propagation (after all the long-running computation is probably what we're wanting
    to cancel), but rather how to do the cancellation without state corruption. A lot of
    the crashiness of our existing ^C implementation is simply that we inject
    an exception in places that are not expecting to handle it.

For a full solution to the problem, we need to have an answer for both of these points.
I will begin with the second, since the first builds upon it.

Cancellation points

This PR introduces the concept of a cancellation request and a cancellation point.
Each task has a cancellation_request field that can be set externally (e.g. by ^C).
Any task performing computation should regularly check this field and abort its
computation if a cancellation request is pending.

For this purpose, the PR provides the @cancel_check macro. This macro turns a pending
cancellation request into a well-modeled exception. Package authors should insert a
call to the macro into any long-running loops. However, there is of course some overhead
to the check, and it is therefore inappropriate for tight inner loops.
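
For illustration, the intended usage is simply to drop the macro into the body of a
long-running loop (a minimal sketch; some_expensive_step is a hypothetical placeholder):

function churn(n)
    acc = 0
    for i in 1:n
        @Base.cancel_check             # a pending cancellation request becomes an exception here
        acc += some_expensive_step(i)  # hypothetical unit of work
    end
    return acc
end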

We attempt to address this with compiler support. Note that this part is currently
incompletely implemented, so the following describes the design rather than the current
state of the PR. Consider the cancel_check macro:

macro cancel_check()
    quote
        local req = Core.cancellation_point!()
        if req !== nothing
            throw(conform_cancellation_request(req))
        end
    end
end

where cancellation_point! is a new intrinsic that defines a cancellation point. The
compiler is semantically permitted to extend the cancellation point across any following
effect_free calls (note for transitivity reasons, the effect is not exactly the same,
but is morally equivalent). Upon passing a cancellation_point!, the system will
set the current task's reset_ctx to this cancellation point. If a cancellation request
occurs before the reset_ctx is cleared, the task's execution will be reset to the
nearest cancellation point. I proposed this mechanism in #52291.

Additionally, the reset_ctx can in principle be used to establish scoped cancellation
handlers for external C libraries as well, although I suspect that there are not many
C libraries that are actually reset-safe in the required manner (since allocation is not).

Note that cancellation_point! is also intended to be a yield point in order to facilitate
the ^C mechanism described below. However, this is not currently implemented.

Structured cancellation

Turning our attention now to the first of the two cases mentioned above, we tweak the task's
existing queue reference to become a generic (atomic) "waitee" reference. The queue is
required to be obtainable from this object via the new waitqueue generic function.
To cancel a waiter waiting for a waitable waitee object, we:

  1. Set the waiter's cancellation request
  2. Load the waitee and call a new generic function cancel_wait!,
    which shall do whatever synchronization and internal bookkeeping is
    required to remove the task from the wait-queue and then resume the
    task.
  3. The waiter resumes in the wait code. It may now decide how and whether to
    propagate the cancellation to the object it was just waiting on. Note that
    this may involve re-queuing a wait (to wait for the cancellation of waitee
    to complete).

The idea here is that this provides a well-defined context for cancellation-propagation
logic to run. I wanted to avoid having any cancellation propagation logic run in parallel
with actual wait code.
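
For illustration (this is not the PR's actual code), a cancel_wait! method for a
hypothetical lock-protected waitable could look roughly like the following; MyWaitable
and its fields are purely illustrative:

function cancel_wait!(waitee::MyWaitable, waiter::Task)
    removed = lock(waitee.lock) do
        q = waitqueue(waitee)      # obtain the wait queue from the waitee
        if waiter in q
            delete!(q, waiter)     # bookkeeping: drop the task from the wait queue
            true
        else
            false                  # lost the race: the waiter was already woken normally
        end
    end
    removed && schedule(waiter)    # resume the waiter inside its wait code (step 3 above)
    return removed
end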

How the cancellation propagates is a bit of a policy question and not one that I fully
intend to address in this PR. My plan is to implement a basic state machine that works
well for ^C (by requesting safe cancellation immediately and then requesting increasingly
unsafe modes of cancellation upon timeout or repeated ^C), but I anticipate that external
libraries will want to create their own cancellation request state machines, which the
system supports. The implementation is incomplete, so I will not describe it here yet.

One may note that there are a significant number of additional fully dynamic dispatches
in this scheme (at least waitqueue and cancel_wait!, and possibly more in the future).
However, note that these dynamic dispatches are confined to the cancellation path, which
is not throughput-sensitive (but is latency sensitive).

^C handling

The handling of ^C is delegated to a dedicated task that gets notified from the
signal handler when a SIGINT is received (similar to the existing profile listener
task). There is a little bit of an additional wrinkle in that we need some logic to
kick out a computational task to its nearest cancellation point if we do not have
any idle threads. This logic is not yet implemented.
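
For a rough idea of the shape of that dedicated listener task (a sketch only; the
AsyncCondition plumbing and the exact cancel! signature are assumptions here, not
what this PR implements):

const sigint_wakeup = Base.AsyncCondition()   # notified from the C signal handler via uv_async_send

function sigint_listener(foreground::Task)
    while true
        wait(sigint_wakeup)     # blocks until a SIGINT is forwarded
        cancel!(foreground)     # set the cancellation request and start propagation
    end
end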

Examples to try

julia> sleep(1000)
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
 [1] macro expansion
   @ ./condition.jl:134 [inlined]
 [2] _trywait(t::Timer)
   @ Base ./asyncevent.jl:195
 [3] wait
   @ ./asyncevent.jl:204 [inlined]
 [4] sleep(sec::Int64)
   @ Base ./asyncevent.jl:322
 [5] top-level scope
   @ REPL[1]:1

julia> collatz(n) = (n & 1) == 1 ? (3n + 1) : (n÷2)
collatz (generic function with 1 method)

julia> function find_collatz_counterexample()
          i = 1
          while true
             j = i
             while true
                @Base.cancel_check 
                j = collatz(j)
                j == 1 && break
                j == i && error("$j is a collatz counterexample")
             end
             i += 1
          end
       end
find_collatz_counterexample (generic function with 1 method)

julia> find_collatz_counterexample()
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
 [1] macro expansion
   @ ./condition.jl:134 [inlined]
 [2] find_collatz_counterexample()
   @ Main ./REPL[2]:6
 [3] top-level scope
   @ REPL[3]:1

julia> wait(@async sleep(100))
^CERROR: TaskFailedException
Stacktrace:
 [1] wait(t::Task; throw::Bool)
   @ Base ./task.jl:367
 [2] wait(t::Task)
   @ Base ./task.jl:360
 [3] top-level scope
   @ REPL[4]:0
 [4] macro expansion
   @ task.jl:729 [inlined]

    nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
    Stacktrace:
     [1] macro expansion
       @ ./condition.jl:134 [inlined]
     [2] _trywait(t::Timer)
       @ Base ./asyncevent.jl:195
     [3] wait
       @ ./asyncevent.jl:204 [inlined]
     [4] sleep
       @ ./asyncevent.jl:322 [inlined]
     [5] (::var"#2#3")()
       @ Main ./REPL[4]:1

julia> @sync begin
         @async sleep(100)
         @async find_collatz_counterexample()
     end
^CERROR:     nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
    Stacktrace:
     [1] macro expansion
       @ ./task.jl:1234 [inlined]
     [2] _trywait(t::Timer)
       @ Base ~/julia-cancel/usr/share/julia/base/asyncevent.jl:195
     [3] wait
       @ ./asyncevent.jl:203 [inlined]
     [4] sleep
       @ ./asyncevent.jl:321 [inlined]
     [5] (::var"#45#46")()
       @ Main ./REPL[26]:3

...and 1 more exception.

Stacktrace:
 [1] sync_cancel!(c::Channel{Any}, t::Task, cr::Any, c_ex::CompositeException)
   @ Base ~/julia-cancel/usr/share/julia/base/task.jl:1454
 [2] sync_end(c::Channel{Any})
   @ Base ~/julia-cancel/usr/share/julia/base/task.jl:608
 [3] macro expansion
   @ ./task.jl:663 [inlined]
 [4] (::var"#43#44")()
   @ Main ./REPL[5]

As noted above, the @Base.cancel_check is not intended to be required in the inner loop.
Rather, the compiler is expected to extend the cancellation point from the start of the loop
to the entire function. However, this is not yet implemented.

I hope this will fix (more of a checklist for me at this point): #4037 #6283 #25790 #29369 #36379 #42072 #43451 #45055 #47839 #50045 #56462 #56545 #58105 #58849 #58689
Closes #49541
Part of #33248 #52291
Refs, but yet to be decided #58259 #35026 #39699

TODO

  • Audit all uses of wait()
  • Look into libuv write cancellation
  • Implement libuv write cancellation on Windows
  • Submit libuv write cancellation PR upstream
  • Look into BLAS cancellation (punted)
  • (Maybe in a future PR) More compiler optimizations for cancellation_point!
  • Merge asymmetric fences
  • (Maybe in a future PR) Optimize reset_ctx establishment speed
  • (Separate PR) Pointer-int-union optimizations
  • Implement the "unfriendly" cancellation types
  • Early interruption of inference
  • Outlined cancellation handlers


@jpsamaroo
Member

Just want to say that I really like this approach! ❤️

I especially like the design of making Base.@cancel_check a point that the task can longjmp back to from effect-free code, as it (presumably) doesn't require waiting until said code finishes to hit a cancellation exit (as shown with the Collatz example). I am a bit confused how this will be implemented - will the ^C handling logic detect when a task is executing code within an effect-free region? Is the idea that reset_ctx (which is stored on the task) is cleared by the compiler once the effect-free region is left (which is also presumably a point where the task cooperatively checks for a cancellation request), and so the ^C handling logic knows when it can just forcibly suspend and reset the task via longjmp to reset_ctx?

Regarding async logic implemented in Base, what is the intended policy for whether to cancel just the waiter, or waiter+waitee? Should we expect that all async resources from Base will cancel on/within any async call, for predictability? That is to say, if we have a producer-consumer setup on a Channel, can I expect that both sides will receive an exception once/while they interact with the Channel? Similarly wondering about this for Threads.Condition, Base.Event, etc. If the answer is "yes", will there be a way to opt-out of this (in the case that the resource needs to continue operating normally to allow surrounding library logic to cancel itself)?

Regarding more than just ^C, can we expect that SIGTERM (and maybe also SIGSTOP and other fun signals) will one day invoke this logic as well? I can imagine that when trying to terminate a complex application, having the first course of action be to cancel ongoing work is conducive to a safe and expedient shutdown. We would of course want to then do the finalizer and atexit dance, which should hopefully now be able to do their jobs without concern for resources being locked or otherwise unavailable for cleanup.

Regarding timeouts and other structured cancellation, will it be possible to target a cancellation request at a particular task? I can imagine that this will avoid having to implement APIs like wait(obj; timeout), as we can just wrap the wait(obj) call with some logic that will send a cancellation request to just that task if the timeout expires before wait returns. This would also make it easy for libraries like Dagger to request arbitrary user code (running within a Dagger-launched task) to cancel when Dagger decides it's desirable (possibly without direct user input).

Aside: I do think it's worth thinking more on whether we want users to have a way to target the cancellation at a library/task/arbitrary machinery, but as mentioned, this is mostly an orthogonal concern.

@Keno
Member Author

Keno commented Nov 30, 2025

> I am a bit confused how this will be implemented - will the ^C handling logic detect when a task is executing code within an effect-free region?

Kind of. The canceler checks reset_ctx. If non-null, it sends a signal to the thread, which then longjmps to reset_ctx if still non-null and if cancellation_request is set on the currently running task.

> Is the idea that reset_ctx (which is stored on the task) is cleared by the compiler once the effect-free region is left (which is also presumably a point where the task cooperatively checks for a cancellation request)

Yes

> Regarding async logic implemented in Base, what is the intended policy for whether to cancel just the waiter, or waiter+waitee?

Policy decision by the async library, so I'm not really expressing a preference. For now, wait cancels the waitee and there's
wait_nocancel to opt out. In the future there could be fancier APIs for cancellation scope.

> Should we expect that all async resources from Base will cancel on/within any async call, for predictability?

I was at this point not expecting cancellation to propagate through channels and conditions - rather I was expecting that it would cancel the wait on those objects and then the thrown exception might potentially cancel the expected producer in its cleanup scope - however, that's a bit of an orthogonal API design question that I don't have a strong opinion on.

> Regarding more than just ^C, can we expect that SIGTERM (and maybe also SIGSTOP and other fun signals)

Maybe - I could imagine SIGTERM trying to cancel all tasks in the system simultaneously with this mechanism - I don't know if the tree-based cancellation makes sense there, but it could be useful for graceful shutdown.

> Regarding timeouts and other structured cancellation, will it be possible to target a cancellation request at a particular task?

The PR provides a cancel! API.
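
For illustration, the timeout case could then be a small wrapper around cancel!
(an untested sketch, assuming a one-argument cancel!(task) form):

function wait_with_timeout(obj, timeout_s)
    waiter = @async wait(obj)
    timer = Timer(timeout_s) do _
        istaskdone(waiter) || cancel!(waiter)   # cancel only this waiter when time runs out
    end
    try
        return fetch(waiter)    # a cancellation surfaces as a TaskFailedException
    finally
        close(timer)
    end
end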

@Keno
Member Author

Keno commented Nov 30, 2025

I guess I should have said that I want the cancellation point to be a preemption point rather than a yield point. We don't currently have that concept, so it's a bit of an open question whether those are different, but I wanted to be precise.

@Keno
Member Author

Keno commented Dec 2, 2025

Capturing some Slack discussion with @vtjnash. This reflects my best understanding, but @vtjnash was trying to make a larger point that I don't quite understand.

  1. Are plain wait, yield, yieldto, etc. cancellation points?

Probably not. The correctness of these functions depends on being paired with a unique schedule that resumes it and has correctness guarantees that need to be enforced at a higher level. There does not seem to be any good way to make these automatic cancellation points.

@vtjnash provided the example ct = current_task(); t = @task(yieldto(ct); nothing); yieldto(t); wait(t). This task t cannot be canceled, because it's not at a cancellation point.

  2. Are locks cancellation points?

I think we need to have both versions, with the user selecting the appropriate one.

  3. How is the cancellation guaranteed without seq_cst on the waitee field (which would be expensive)?

I think some variant of the following works:

Cancelling thread:

while (true)
    jl_atomic_store_relaxed(&t->cancellation_request, req);
    SYS_membarrier
    if cancel_wait!(jl_atomic_load_acquire(&t->waitee), t)
        break
    end
end

Waiting thread:

jl_atomic_store_release(&t->waitee, wait);
barrier();
if (cancelled(jl_atomic_load_relaxed(&t->cancellation_request))) throw();
wait();
  4. What happens to threads that haven't started yet?

I think this is a cancellation point and the thread dies. This is different from @async wait() because you can't put try/catch around it.

  5. Does having the waitee field prevent GC of events that will never fire?

I think it can be a weakref.

  6. Libuv does not support write cancellation.

Probably should be fixed, ignore for now.

  7. The term "safe cancellation" is inappropriate, because it can fail and hang, which doesn't feel very safe.

Not attached to the term. The naming was due to the possibility of introducing more unsafe cancellation variants (to be used on timeout or repeated ^C) that, while being more likely to succeed, could leave the system in an inconsistent state. Useful for looking around for debugging, but not semantically sound.

@Keno
Member Author

Keno commented Dec 4, 2025

Now with compiler and reset_ctx support, courtesy of Claude (note the absence of explicit cancellation points inside the inner loop):

julia> collatz(n) = (n & 1) == 1 ? (3n + 1) : (n÷2)
collatz (generic function with 1 method)

julia> function find_collatz_counterexample_inner()
           i = 1
           while true
               j = i
               while true
                   j = collatz(j)
                   j == 1 && break
                   j == i && return j
               end
               i += 1
           end
       end
find_collatz_counterexample_inner (generic function with 1 method)

julia> function find_collatz_counterexample2()
           @Base.cancel_check
           return find_collatz_counterexample_inner()
       end
find_collatz_counterexample2 (generic function with 1 method)

julia> find_collatz_counterexample2()
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
 [1] handle_cancellation!(_req::Any)
   @ Base ./task.jl:1423
 [2] macro expansion
   @ ./condition.jl:133 [inlined]
 [3] find_collatz_counterexample2()
   @ Main ./REPL[2]:2
 [4] top-level scope
   @ REPL[3]:1

@vtjnash vtjnash added needs nanosoldier run This PR should have benchmarks run on it needs pkgeval Tests for all registered packages should be run with this change labels Dec 4, 2025
Keno added a commit that referenced this pull request Dec 4, 2025
Asymmetric atomic fences are a performance optimization of regular
atomic fences (the seq_cst version of which we expose as
`Base.Threads.atomic_fence`). The problem with these regular fences
is that they require a CPU fence instruction, which can be very
expensive and is thus unsuitable for code in the hot path.
Asymmetric fences on the other hand split an ordinary fence into
two: A `light` side where the fence is extremely cheap (only a
compiler reordering barrier) and a `heavy` side where the fence
is very expensive.

Basically the way it works is that the heavy side does a system call
that issues an inter-processor-interrupt (IPI) which then issues
the appropriate barrier instruction on the other CPU (i.e. both
CPUs will have issued a barrier instruction, one of them
just does it asynchronously due to the interrupt).

The `light` and `heavy` naming here is taken from C++ P1202R5 [1],
which is the proposal for the same feature in the C++ standard
library (to appear in the next iteration of the C++ concurrency
spec).

On the julia side, these functions are exposed as
`Threads.atomic_fence_light` and `Threads.atomic_fence_heavy`.
The light side lowers to `fence singlethread` in LLVM IR (the
Core.Intrinsic atomic_fence is adjusted appropriately to facilitate
this). The heavy side has OS-specific implementations, where:

1. Linux/FreeBSD try to use the `membarrier` syscall or a
   fallback to `mprotect` for systems that don't have it.
2. Windows uses the `FlushProcessWriteBuffers` syscall.
3. macOS uses an implementation from the dotnet runtime
   (dotnet/runtime#44670), which
   the dotnet folks have checked with Apple does the right
   thing by happenstance (i.e. an IPI/memory barrier is needed
   to execute the syscall), but looks a little nonsensical by itself.
   However, since it's what Apple recommended to dotnet, I don't
   see much risk here, though I wouldn't be surprised if Apple added
   a proper syscall for this in the future (since FreeBSD has it now).

Note that unlike the C++ spec, I have specified that
`atomic_fence_heavy` does synchronize with `atomic_fence`. This
matches the underlying system call. I suspect C++ chose to omit
this for a hypothetical future architecture that has instruction
support for doing this from userspace that would then not
synchronize with ordinary barriers, but I think I would rather
cross that bridge when we get there.

I intend to use this in #60281, but it's an independently useful
feature.

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1202r5.pdf
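
For illustration (not part of this commit), the intended usage pattern pairs the cheap
fence on the frequently-executed side with the expensive fence on the rare side, so the
two sides still order against each other as with a pair of ordinary fences. The `Flags`
struct and helper functions below are purely illustrative:

mutable struct Flags
    @atomic fast::Bool
    @atomic slow::Bool
end

# Hot path (e.g. the waiting task): pays only a compiler reordering barrier.
function fast_side!(f::Flags)
    @atomic :monotonic f.fast = true
    Threads.atomic_fence_light()
    return @atomic :monotonic f.slow
end

# Cold path (e.g. the canceler): pays for the membarrier/IPI.
function slow_side!(f::Flags)
    @atomic :monotonic f.slow = true
    Threads.atomic_fence_heavy()
    return @atomic :monotonic f.fast
end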
Keno added a commit that referenced this pull request Dec 4, 2025
The `preserve_none` calling convention is a new calling convention in
clang (>= 19) and gcc that preserves a more minimal set of registers
(rsp, rbp on x86_64; lr, fp on aarch64). As a result, if this calling
convention is used with setjmp, those registers do not need to be stored
in the setjmp buffer, allowing us to reduce the size of this buffer and
use fewer instructions to save the buffer. The tradeoff of course is
that these registers may need to be saved anyway, in which case
both the stack usage and the instructions just move to the caller
(which is strictly worse). It is not clear that this is useful for
exceptions (which already have a fair bit of state anyway, so even
in the happy path the savings are not necessarily that big), but
I am thinking about using it for #60281, which has different
characteristics, so this is an easy way to try out whether there
are any unexpected challenges.

Note that preserve_none is a very recent compiler feature, so most
compilers out there do not have it yet. For compatibility, this PR
supports using different jump buffer formats in the runtime and
the generated code.
A reviewer commented on the codegen change (which marks a call as ReturnsTwice): all stores in the function after this call but before a matching longjmp need to be changed to use volatile=true if they can be observed after the longjmp.

Keno added 7 commits December 4, 2025 22:45