Very WIP: Architecture for robust cancellation #60281
Conversation
Just want to say that I really like this approach! ❤️ I especially like the design of making […].

Regarding async logic implemented in Base: what is the intended policy for whether to cancel just the waiter, or waiter+waitee? Should we expect that all async resources from Base will cancel on/within any async call, for predictability? That is to say, if we have a producer-consumer setup on a […]?

Regarding more than just ^C: can we expect that […]?

Regarding timeouts and other structured cancellation: will it be possible to target a cancellation request at a particular task? I can imagine that this will avoid having to implement APIs like […].

Aside: I do think it's worth thinking more on whether we want users to have a way to target the cancellation at a library/task/arbitrary machinery, but as mentioned, this is mostly an orthogonal concern.
Kind of. The canceler checks […].

Yes.

Policy decision by the async library, so I'm not really expressing a preference. For now, […].

I was at this point not expecting cancellation to propagate through channels and conditions; rather, I was expecting that it would cancel the wait on those objects, and then the thrown exception might potentially cancel the expected producer in its cleanup scope. However, that's a bit of an orthogonal API design question that I don't have a strong opinion on.

Maybe. I could imagine SIGTERM trying to cancel all tasks in the system simultaneously with this mechanism. I don't know if the tree-based cancellation makes sense there, but it could be useful for graceful shutdown.

The PR provides a […].
I guess I should have said that I want the cancellation point to be a preemption point rather than a yield point. We don't currently have that concept, so it is a bit of an open question whether those are different, but I wanted to be precise.
Capturing some Slack discussion with @vtjnash. This reflects my best understanding, but @vtjnash was trying to make a larger point that I don't quite understand.

Probably not. The correctness of these functions depends on being paired with a unique […]. @vtjnash provided the example […].
I think we need to have both versions, with the user selecting the appropriate one.
I think some variant of the following works. Cancelling thread: […] Waiting thread: […]
I think this is a cancellation point and the thread dies. This is different from […].
I think it can be weakref.
Probably should be fixed, ignore for now.
Not attached to the term. The naming was due to the possibility of introducing more unsafe cancellation variants (to be used on timeout or repeated ^C) that, while being more likely to succeed, could leave the system in an inconsistent state. Useful for looking around for debugging, but not semantically sound.
Now with compiler and reset_ctx support, courtesy of Claude (note the absence of explicit cancellation points inside the inner loop): […]
Asymmetric atomic fences are a performance optimization of regular atomic fences (the seq_cst version of which we expose as `Base.Threads.atomic_fence`). The problem with these regular fences is that they require a CPU fence instruction, which can be very expensive and is thus unsuitable for code in the hot path. Asymmetric fences, on the other hand, split an ordinary fence into two: a `light` side where the fence is extremely cheap (only a compiler reordering barrier) and a `heavy` side where the fence is very expensive. The way it works is that the heavy side does a system call that issues an inter-processor interrupt (IPI), which then issues the appropriate barrier instruction on the other CPU (i.e. both CPUs will have issued a barrier instruction; one of them just does it asynchronously due to the interrupt). The `light` and `heavy` naming here is taken from C++ P1202R5 [1], which is the proposal for the same feature in the C++ standard library (to appear in the next iteration of the C++ concurrency spec).

On the Julia side, these functions are exposed as `Threads.atomic_fence_light` and `Threads.atomic_fence_heavy`. The light side lowers to `fence singlethread` in LLVM IR (the Core.Intrinsic atomic_fence is adjusted appropriately to facilitate this). The heavy side has OS-specific implementations, where:

1. Linux/FreeBSD try to use the `membarrier` syscall, with a fallback to `mprotect` for systems that don't have it.
2. Windows uses the `FlushProcessWriteBuffers` syscall.
3. macOS uses an implementation from the dotnet runtime (dotnet/runtime#44670), which the dotnet folks have checked with Apple does the right thing by happenstance (i.e. an IPI/memory barrier is needed to execute the syscall), but looks a little nonsensical by itself. However, since it's what Apple recommended to dotnet, I don't see much risk here, though I wouldn't be surprised if Apple added a proper syscall for this in the future (since FreeBSD has it now).

Note that unlike the C++ spec, I have specified that `atomic_fence_heavy` does synchronize with `atomic_fence`. This matches the underlying system call. I suspect C++ chose to omit this for a hypothetical future architecture that has instruction support for doing this from userspace that would then not synchronize with ordinary barriers, but I think I would rather cross that bridge when we get there.

I intend to use this in #60281, but it's an independently useful feature.

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1202r5.pdf
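To make the split concrete, here is a minimal sketch of the Dekker-style store-then-load handshake that asymmetric fences enable, assuming the `Threads.atomic_fence_light`/`Threads.atomic_fence_heavy` names from this commit; the flag names and helper functions are purely illustrative and not part of the PR.

```
using Base.Threads

const cancel_requested = Atomic{Bool}(false)  # written by the (cold) canceler
const in_computation   = Atomic{Bool}(false)  # written by the (hot) worker

# Hot side: runs every iteration, so only the cheap fence is acceptable here.
function worker_step()
    in_computation[] = true
    Threads.atomic_fence_light()      # compiler barrier only; lowers to `fence singlethread`
    if cancel_requested[]
        in_computation[] = false
        error("cancellation requested")
    end
    # ... one chunk of actual work ...
    in_computation[] = false
end

# Cold side: the canceler pays for the expensive fence (membarrier / IPI), so the
# handshake still works without a full CPU fence on the hot side.
function request_cancellation!()
    cancel_requested[] = true
    Threads.atomic_fence_heavy()
    # After the heavy fence, any worker that subsequently passes its light fence
    # is guaranteed to observe cancel_requested; the return value indicates
    # whether a worker had already announced itself before the fence.
    return in_computation[]
end
```

The point of the asymmetry is that the per-iteration cost on the worker side is just a compiler barrier, while all the expense is concentrated in the rarely executed cancellation path.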
The `preserve_none` calling convention is a new calling convention in clang (>= 19) and gcc that preserves a more minimal set of registers (rsp, rbp on x86_64; lr, fp on aarch64). As a result, if this calling convention is used with setjmp, those registers do not need to be stored in the setjmp buffer, allowing us to reduce the size of this buffer and use fewer instructions to save the buffer. The tradeoff of course is that these registers may need to be saved anyway, in which case both the stack usage and the instructions just move to the caller (which is strictly worse).

It is not clear that this is useful for exceptions (which already have a fair bit of state anyway, so even in the happy path the savings are not necessarily that big), but I am thinking about using it for #60281, which has different characteristics, so this is an easy way to try out whether there are any unexpected challenges.

Note that preserve_none is a very recent compiler feature, so most compilers out there do not have it yet. For compatibility, this PR supports using different jump buffer formats in the runtime and the generated code.
```
        return FunctionType::get(getInt32Ty(C), {}, false);
    },
    [](LLVMContext &C) { return AttributeList::get(C,
                Attributes(C, {Attribute::ReturnsTwice}),
```
All stores in the function after this call but before a matching longjmp need to be changed to use volatile=true if they can be observed after the longjmp.
Introduction
This commit is a first sketch for what I would like to do for robust cancellation
(i.e. "Making ^C just work"). At this point it's more of a sketch than a real PR,
but I think I've done enough of the design for a design discussion.
The first thing I should say is that the goal of this PR is very narrowly to
make ^C work well. As part of that, we're taking a bit of a step towards
structured concurrency, but I am not intending this PR to be a full implementation
of that.
Given that some of this has been beaten to death in previous issues, I will also
not do my usual motivation overview, instead jumping straight into the implementation.
As I said, the motivation is just to make ^C work reliably at this point.
Setting the stage
Broadly, when we're trying to cancel a task, it will be in one of two categories:
1. Waiting for some other operation to complete (e.g. an IO operation, another task,
an external event, etc.). Here, the actual cancellation itself is not so difficult
(after all, the task is not running, but suspended in a somewhat well-defined place).
However, robust cancellation requires us to potentially propagate the cancellation
signal down the wait tree, since the operation we actually want to cancel may not
be the root task, but may instead be some operation being performed by the task
we're waiting on (and we'd prefer not to leak those operations and have rogue tasks
going around performing potentially side-effecting operations).
2. Currently running and doing some computation. The core problem is not really one of
propagation (after all the long-running computation is probably what we're wanting
to cancel), but rather how to do the cancellation without state corruption. A lot of
the crashiness of our existing ^C implementation is just that we would simply inject
an exception in places that are not expecting to handle it.
For a full solution to the problem, we need to have an answer for both of these points.
I will begin with the second, since the first builds upon it.
Cancellation points
This PR introduces the concepts of a `cancellation request` and a `cancellation point`.
Each task has a `cancellation_request` field that can be set externally (e.g. by ^C).
Any task performing computation should regularly check this field and abort its
computation if a cancellation request is pending.
For this purpose, the PR provides the `@cancel_check` macro. This macro turns a pending
cancellation request into a well-modeled exception. Package authors should insert a
call to the macro into any long-running loops. However, there is of course some overhead
to the check and it is therefore inappropriate for tight inner loops.
We attempt to address this with compiler support. Note that this part is currently
incompletely implemented, so the following describes the design rather than the current
state of the PR. Consider the cancel_check macro:
```
macro cancel_check()
    quote
        local req = Core.cancellation_point!()
        if req !== nothing
            throw(conform_cancellation_request(req))
        end
    end
end
```
where `cancellation_point!` is a new intrinsic that defines a cancellation point. The
compiler is semantically permitted to extend the cancellation point across any following
effect_free calls (note for transitivity reasons, the effect is not exactly the same,
but is morally equivalent). Upon passing a `cancellation_point!`, the system will
set the current task's `reset_ctx` to this cancellation point. If a cancellation request
occurs before the `reset_ctx` is cleared, the task's execution will be reset to the
nearest cancellation point. I proposed this mechanism in #52291.
Additionally, the `reset_ctx` can in principle be used to establish scoped cancellation
handlers for external C libraries as well, although I suspect that there are not many
C libraries that are actually reset-safe in the required manner (since allocation is not).
Note that `cancellation_point!` is also intended to be a yield point in order to facilitate
the ^C mechanism described below. However, this is not currently implemented.
Structured cancellation
Turning our attention now to the first of the two cases mentioned above, we tweak the task's
existing `queue` reference to become a generic (atomic) "waitee" reference. The queue is
required to be obtainable from this object via the new `waitqueue` generic function.
To cancel a `waiter` waiting for a waitable `waitee` object, we
1. Set the waiter's cancellation request
2. Load the `waitee` and call a new generic function `cancel_wait!`,
   which shall do whatever synchronization and internal bookkeeping is
   required to remove the task from the wait-queue and then resume the
   task.
3. The `waiter` resumes in the wait code. It may now decide how and whether to
propagate the cancellation to the object it was just waiting on. Note that
this may involve re-queueing a wait (to wait for the cancellation of `waitee`
to complete).
The idea here is that this provides a well-defined context for cancellation-propagation
logic to run. I wanted to avoid having any cancellation propagation logic run in parallel
with actual wait code.
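As a purely illustrative sketch of the protocol above: the type, its fields, and the exact signatures of `waitqueue`/`cancel_wait!` below are my assumptions based on this description, not the PR's actual code.

```
# Hypothetical waitable object, used only to illustrate the three steps above.
mutable struct ToyEvent
    const lock::ReentrantLock
    const waiters::Vector{Task}
    done::Bool
    ToyEvent() = new(ReentrantLock(), Task[], false)
end

# The cancellation machinery asks the waitee for its wait queue.
waitqueue(ev::ToyEvent) = ev.waiters

# Step 2: remove the waiter from the queue under the waitee's own lock and
# reschedule it, so that step 3 (deciding whether and how to propagate the
# cancellation) runs back inside the waiter's wait code rather than in
# parallel with it.
function cancel_wait!(ev::ToyEvent, waiter::Task)
    removed = lock(ev.lock) do
        idx = findfirst(t -> t === waiter, ev.waiters)
        idx === nothing && return false   # already woken normally; nothing to do
        deleteat!(ev.waiters, idx)
        return true
    end
    removed && schedule(waiter)           # waiter resumes in its wait code (step 3)
    return nothing
end
```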
How the cancellation propagates is a bit of a policy question and not one that I fully
intend to address in this PR. My plan is to implement a basic state machine that works
well for ^C (by requesting safe cancellation immediately and then requesting increasingly
unsafe modes of cancellation upon timeout or repeated ^C), but I anticipate that external
libraries will want to create their own cancellation request state machines, which the
system supports. The implementation is incomplete, so I will not describe it here yet.
One may note that there are a significant number of additional fully dynamic dispatches
in this scheme (at least `waitqueue` and `cancel_wait!`, and possibly more in the future).
However, note that these dynamic dispatches are confined to the cancellation path, which
is not throughput-sensitive (but is latency sensitive).
^C handling
The handling of ^C is delegated to a dedicated task that gets notified from the
signal handler when a SIGINT is received (similar to the existing profile listener
task). There is a little bit of an additional wrinkle in that we need some logic to
kick out a computational task to its nearest cancellation point if we do not have
any idle threads. This logic is not yet implemented.
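For a rough picture of the listener-task pattern, here is a hedged sketch; the `AsyncCondition` stand-in and the `foreground_task`/`cancel!` helpers are hypothetical placeholders, not the PR's implementation.

```
# Hypothetical sketch: a dedicated task that reacts to SIGINT notifications.
# In the real design the signal handler wakes the task through the runtime; an
# AsyncCondition stands in for that notification here, and foreground_task() /
# cancel!() are made-up helpers for "pick the task to cancel" / "set its request".
const sigint_cond = Base.AsyncCondition()

const sigint_listener = errormonitor(Threads.@spawn begin
    while true
        wait(sigint_cond)               # parked until the signal handler pings us
        t = foreground_task()           # hypothetical: the task currently owning the terminal
        t !== nothing && cancel!(t)     # hypothetical: sets t's cancellation_request
    end
end)
```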
Examples to try
```
julia> sleep(1000)
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
[1] macro expansion
@ ./condition.jl:134 [inlined]
[2] _trywait(t::Timer)
@ Base ./asyncevent.jl:195
[3] wait
@ ./asyncevent.jl:204 [inlined]
[4] sleep(sec::Int64)
@ Base ./asyncevent.jl:322
[5] top-level scope
@ REPL[1]:1
julia> function find_collatz_counterexample()
i = 1
while true
j = i
while true
@Base.cancel_check
j = collatz(j)
j == 1 && break
j == i && error("$j is a collatz counterexample")
end
i += 1
end
end
find_collatz_counterexample (generic function with 1 method)
julia> find_collatz_counterexample()
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
[1] macro expansion
@ ./condition.jl:134 [inlined]
[2] find_collatz_counterexample()
@ Main ./REPL[2]:6
[3] top-level scope
@ REPL[3]:1
julia> wait(@async sleep(100))
^CERROR: TaskFailedException
Stacktrace:
[1] wait(t::Task; throw::Bool)
@ Base ./task.jl:367
[2] wait(t::Task)
@ Base ./task.jl:360
[3] top-level scope
@ REPL[4]:0
[4] macro expansion
@ task.jl:729 [inlined]
nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
[1] macro expansion
@ ./condition.jl:134 [inlined]
[2] _trywait(t::Timer)
@ Base ./asyncevent.jl:195
[3] wait
@ ./asyncevent.jl:204 [inlined]
[4] sleep
@ ./asyncevent.jl:322 [inlined]
[5] (::var"#2#3")()
@ Main ./REPL[4]:1
julia> @sync begin
@async sleep(100)
@async find_collatz_counterexample()
end
^CERROR: nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
[1] macro expansion
@ ./task.jl:1234 [inlined]
[2] _trywait(t::Timer)
@ Base ~/julia-cancel/usr/share/julia/base/asyncevent.jl:195
[3] wait
@ ./asyncevent.jl:203 [inlined]
[4] sleep
@ ./asyncevent.jl:321 [inlined]
[5] (::var"#45#46")()
@ Main ./REPL[26]:3
...and 1 more exception.
Stacktrace:
[1] sync_cancel!(c::Channel{Any}, t::Task, cr::Any, c_ex::CompositeException)
@ Base ~/julia-cancel/usr/share/julia/base/task.jl:1454
[2] sync_end(c::Channel{Any})
@ Base ~/julia-cancel/usr/share/julia/base/task.jl:608
[3] macro expansion
@ ./task.jl:663 [inlined]
[4] (::var"#43#44")()
@ Main ./REPL[5]
```
As noted above, `@Base.cancel_check` is not intended to be required in the inner loop.
Rather, the compiler is expected to extend the cancellation point from the start of the loop
to the entire function. However, this is not yet implemented.
I hope this will fix (more of a checklist for me at this point) #4037 #6283 #25790 #29369 #36379 #42072 #43451 #45055 #47839 #50045 #56462 #56545 #58105 #58849 #58689
Closes #49541
Part of #33248 #52291
Refs, but yet to be decided #58259 #35026 #39699
TODO
Look into BLAS cancellation (punted)