Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -96,8 +96,7 @@ void ShenandoahGenerationalControlThread::stop_service() {
log_debug(gc, thread)("Stopping control thread");
MonitorLocker ml(&_control_lock, Mutex::_no_safepoint_check_flag);
_heap->cancel_gc(GCCause::_shenandoah_stop_vm);
_requested_gc_cause = GCCause::_shenandoah_stop_vm;
notify_cancellation(ml, GCCause::_shenandoah_stop_vm);
notify_control_thread(ml, GCCause::_shenandoah_stop_vm);
// We can't wait here because it may interfere with the active cycle's ability
// to reach a safepoint (this runs on a java thread).
}
Expand Down Expand Up @@ -140,7 +139,8 @@ void ShenandoahGenerationalControlThread::check_for_request(ShenandoahGCRequest&
}

ShenandoahGenerationalControlThread::GCMode ShenandoahGenerationalControlThread::prepare_for_allocation_failure_gc(ShenandoahGCRequest &request) {

// Important: not all paths update the request.generation. This is intentional.
// A degenerated cycle must use the same generation carried over from the previous request.
if (_degen_point == ShenandoahGC::_degenerated_unset) {
_degen_point = ShenandoahGC::_degenerated_outside_cycle;
request.generation = _heap->young_generation();
Expand Down Expand Up @@ -633,9 +633,7 @@ void ShenandoahGenerationalControlThread::service_stw_degenerated_cycle(const Sh

void ShenandoahGenerationalControlThread::request_gc(GCCause::Cause cause) {
if (ShenandoahCollectorPolicy::is_allocation_failure(cause)) {
// GC should already be cancelled. Here we are just notifying the control thread to
// wake up and handle the cancellation request, so we don't need to set _requested_gc_cause.
notify_cancellation(cause);
notify_control_thread(cause);
} else if (ShenandoahCollectorPolicy::should_handle_requested_gc(cause)) {
handle_requested_gc(cause);
}
Expand All @@ -661,7 +659,7 @@ bool ShenandoahGenerationalControlThread::request_concurrent_gc(ShenandoahGenera
log_info(gc)("Preempting old generation mark to allow %s GC", generation->name());
while (gc_mode() == servicing_old) {
ShenandoahHeap::heap()->cancel_gc(GCCause::_shenandoah_concurrent_gc);
notify_cancellation(ml, GCCause::_shenandoah_concurrent_gc);
notify_control_thread(ml, GCCause::_shenandoah_concurrent_gc);
ml.wait();
}
return true;
Expand Down Expand Up @@ -701,14 +699,15 @@ void ShenandoahGenerationalControlThread::notify_control_thread(MonitorLocker& m
ml.notify();
}

void ShenandoahGenerationalControlThread::notify_cancellation(GCCause::Cause cause) {
void ShenandoahGenerationalControlThread::notify_control_thread(GCCause::Cause cause) {
MonitorLocker ml(&_control_lock, Mutex::_no_safepoint_check_flag);
notify_cancellation(ml, cause);
notify_control_thread(ml, cause);
}

void ShenandoahGenerationalControlThread::notify_cancellation(MonitorLocker& ml, GCCause::Cause cause) {
assert(_heap->cancelled_gc(), "GC should already be cancelled");
log_debug(gc,thread)("Notify control (%s): %s", gc_mode_name(gc_mode()), GCCause::to_string(cause));
void ShenandoahGenerationalControlThread::notify_control_thread(MonitorLocker& ml, GCCause::Cause cause) {
assert(_control_lock.is_locked(), "Request lock must be held here");
log_debug(gc, thread)("Notify control (%s): %s", gc_mode_name(gc_mode()), GCCause::to_string(cause));
_requested_gc_cause = cause;
ml.notify();
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -135,16 +135,13 @@ class ShenandoahGenerationalControlThread: public ShenandoahController {
// Return printable name for the given gc mode.
static const char* gc_mode_name(GCMode mode);

// Takes the request lock and updates the requested cause and generation, then notifies the control thread.
// The overloaded variant should be used when the _control_lock is already held.
// These notify the control thread after updating _requested_gc_cause and (optionally) _requested_generation.
// Updating the requested generation is not necessary for allocation failures nor when stopping the thread.
void notify_control_thread(GCCause::Cause cause);
void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause);
void notify_control_thread(GCCause::Cause cause, ShenandoahGeneration* generation);
void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause, ShenandoahGeneration* generation);

// Notifies the control thread, but does not update the requested cause or generation.
// The overloaded variant should be used when the _control_lock is already held.
void notify_cancellation(GCCause::Cause cause);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods were the root cause here. ShenandoahHeap::_canceled_gc is read/written atomically, but ShenandoahGenerationalControlThread::_requested_gc_cause is read/written under a lock. These notify_cancellation methods did not update _requested_gc_cause at all. So, in the failure I observed we had:

  1. Control thread finishes cycle and sees no cancellation is requested (no lock used).
  2. Mutator thread fails allocation, cancels GC (again, no lock used), and does not change _requested_gc_cause.
  3. Control thread takes _control_lock and checks _requested_gc_cause and sees _no_gc (because notify_cancellation didn't change it) and waits forever now.

The fix here is to replace notify_cancellation with notify_control_thread which serializes updates to _requested_gc_cause under _control_lock.

Copy link

@pengxiaolong pengxiaolong Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at the places where ShenandoahHeap::clear_cancelled_gc is called, I feel the problem is more likely from op_final_update_refs:

void ShenandoahConcurrentGC::op_final_update_refs() {
  ShenandoahHeap* const heap = ShenandoahHeap::heap();
   ... 
  ...
  // Clear cancelled GC, if set. On cancellation path, the block before would handle
  // everything.
  if (heap->cancelled_gc()) {
    heap->clear_cancelled_gc();
  }
  ...
  ...
}

Let's say there is concurrent GC running, right before the final update refs safepoint, there is mutator allocation failure:

  1. The mutator tries to cancel the the concurrent GC and notify controller thread.
  2. The mutator block itself at _alloc_failure_waiters_lock, claiming safepoint safe as well.
  3. concurrent GC enter the final update refs (VM operation)
  4. in final update refs, VMThread sees cancelled_gc and clear it.
  5. concurrent GC finishes, but cancelled_gc has been cleared so it won't notify the mutator.

The fix seems to work in generational mode, but may not work in non-generational mode.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I was staring at the code ShenandoahController::handle_alloc_failure today, I found there is discrepancy between ShenandoahGenerationalControlThread and ShenandoahControlThread, I created a bug to unify the behavior, we could fix the issue in ShenandoahControlThread there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scenario I described wasn't supposition, that is actually what happened in the debugger. The scenario you describe with op_final_update_refs would also be fixed by this PR. The _requested_gc_cause field should always be accessed under a lock. The code change here fixes an issue where an allocation failure might not set _requested_gc_cause at all.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand the fix will solve the issue for genshen and also fix scenario I described.
I'll solve the potential issue in non-generational Shenandoah in the PR to fix the behavior differences in Genshen and non-generational Shenandoah.

void notify_cancellation(MonitorLocker& ml, GCCause::Cause cause);

// Configure the heap to age objects and regions if the aging period has elapsed.
void maybe_set_aging_cycle();

Expand Down