Skip to content

Conversation

@pengxiaolong
Copy link

@pengxiaolong pengxiaolong commented Dec 4, 2025

Chasing the root cause of JDK-8372498, I have narrowed down root cause to the commit f8cf9ca

It is caused by the behavior change from follow code:

Original:

  if (ShenandoahSATBBarrier) {
    T* array = dst;
    HeapWord* array_addr = reinterpret_cast<HeapWord*>(array);
    ShenandoahHeapRegion* r = _heap->heap_region_containing(array_addr);
    if (is_old_marking) {
      // Generational, old marking
      assert(_heap->mode()->is_generational(), "Invariant");
      if (r->is_old() && (array_addr < _heap->marking_context()->top_at_mark_start(r))) {
        arraycopy_work<T, false, false, true>(array, count);
      }
    } else if (_heap->mode()->is_generational()) {
      // Generational, young marking
      if (r->is_old() || (array_addr < _heap->marking_context()->top_at_mark_start(r))) {
        arraycopy_work<T, false, false, true>(array, count);
      }
    } else if (array_addr < _heap->marking_context()->top_at_mark_start(r)) {
      // Non-generational, marking
      arraycopy_work<T, false, false, true>(array, count);
    }
  }

New:

  if (ShenandoahSATBBarrier) {
    if (!_heap->marking_context()->allocated_after_mark_start(reinterpret_cast<HeapWord*>(dst))) {
      arraycopy_work<T, false, false, true>(dst, count);
    }
  }

With the new STAB barrier code for arraycopy_marking, if is it young GC and the array is in old region, but array is above TAMS(Old GC may not be started, TAMS of old region is not captured), arraycopy_work won't be applied anymore, so we may have missed some pointers in SATB in such case during concurrent young marking.

Test

  • hotspot_gc_shenandoah
  • repeat gc/TestAllocHumongousFragment.java#generational and sure it won't crash with the fix
  • GHA

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

  • JDK-8373116: Genshen: arraycopy_work should be always done for arrays in old gen during young concurrent marking (Bug - P3)
  • JDK-8372498: [genshen] gc/TestAllocHumongousFragment.java#generational causes intermittent SIGSEGV crashes (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28669/head:pull/28669
$ git checkout pull/28669

Update a local copy of the PR:
$ git checkout pull/28669
$ git pull https://git.openjdk.org/jdk.git pull/28669/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28669

View PR using the GUI difftool:
$ git pr show -t 28669

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28669.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 4, 2025

👋 Welcome back xpeng! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Dec 4, 2025

@pengxiaolong This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8373116: Genshen: arraycopy_work should be always done for arrays in old gen during young concurrent marking
8372498: [genshen] gc/TestAllocHumongousFragment.java#generational causes intermittent SIGSEGV crashes

Reviewed-by: wkemper

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 25 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added hotspot-gc hotspot-gc-dev@openjdk.org shenandoah shenandoah-dev@openjdk.org labels Dec 4, 2025
@openjdk
Copy link

openjdk bot commented Dec 4, 2025

@pengxiaolong The following labels will be automatically applied to this pull request:

  • hotspot-gc
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@pengxiaolong
Copy link
Author

/issue add JDK-8372498

@openjdk
Copy link

openjdk bot commented Dec 4, 2025

@pengxiaolong
Adding additional issue to issue list: 8372498: [genshen] gc/TestAllocHumongousFragment.java#generational causes intermittent SIGSEGV crashes.

@openjdk
Copy link

openjdk bot commented Dec 4, 2025

@pengxiaolong
Updating description of additional solved issue: 8372498: [genshen] gc/TestAllocHumongousFragment.java#generational causes intermittent SIGSEGV crashes.

@pengxiaolong pengxiaolong marked this pull request as ready for review December 5, 2025 00:21
@pengxiaolong
Copy link
Author

/issue add JDK-8372498

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 5, 2025
@openjdk
Copy link

openjdk bot commented Dec 5, 2025

@pengxiaolong
Updating description of additional solved issue: 8372498: [genshen] gc/TestAllocHumongousFragment.java#generational causes intermittent SIGSEGV crashes.

@mlbridge
Copy link

mlbridge bot commented Dec 5, 2025

@pengxiaolong pengxiaolong marked this pull request as draft December 5, 2025 00:28
@openjdk openjdk bot removed the rfr Pull request is ready for review label Dec 5, 2025
@pengxiaolong pengxiaolong marked this pull request as ready for review December 5, 2025 01:57
@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 5, 2025
@sendaoYan
Copy link
Member

After apply the proposed patch, the jvm crash do not observed by run the test 1000 times. But there is one "java.lang.OutOfMemoryError: Java heap space" test fails observed.

848.log

@pengxiaolong
Copy link
Author

pengxiaolong commented Dec 5, 2025

After apply the proposed patch, the jvm crash do not observed by run the test 1000 times. But there is one "java.lang.OutOfMemoryError: Java heap space" test fails observed.

848.log

Thank you Sendao for the test and verification!

The OOM should be an unrelated issue, I'm not sure if there is open bug for it or not, please create a new one if there isn't.

@pengxiaolong
Copy link
Author

pengxiaolong commented Dec 5, 2025

After apply the proposed patch, the jvm crash do not observed by run the test 1000 times. But there is one "java.lang.OutOfMemoryError: Java heap space" test fails observed.
848.log

Thank you Sendao for the test and verification!

The OOM should be an unrelated issue, I'm not sure if there is open bug for it or not, please create a new one if there isn't.

Found it, https://bugs.openjdk.org/browse/JDK-8298781

Copy link
Contributor

@earthling-amzn earthling-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue, as I understand it, is that mutators are racing with the concurrent remembered set scan. If a mutator changes a pointer covered by a dirty card, it could prevent the remembered set scan from tracing the original object that was reachable at the beginning of marking. Since we may not be marking old, we cannot rely on the TAMS for objects in old regions and must unconditionally enqueue all of the overwritten pointers in the old array. Should we only do this when young marking is in progress? Perhaps we should have a version of arraycopy_work that only enqueues young pointers here?

@pengxiaolong
Copy link
Author

pengxiaolong commented Dec 5, 2025

The issue, as I understand it, is that mutators are racing with the concurrent remembered set scan. If a mutator changes a pointer covered by a dirty card, it could prevent the remembered set scan from tracing the original object that was reachable at the beginning of marking. Since we may not be marking old, we cannot rely on the TAMS for objects in old regions and must unconditionally enqueue all of the overwritten pointers in the old array. Should we only do this when young marking is in progress? Perhaps we should have a version of arraycopy_work that only enqueues young pointers here?

I don't think it is related the any racing on remembered set, I got some GC logs from which I think we may know how it actually happens.

[15.653s][info][gc,start       ] GC(188) Pause Full
...
[15.763s][info][gc             ] GC(188) Pause Full 913M->707M(1024M) 109.213ms
[15.767s][info][gc,ergo        ] GC(189) Start GC cycle (Young)
...
[15.802s][info][gc             ] GC(189) Concurrent reset after collect (Young) 1.160ms
[15.802s][info][gc,ergo        ] GC(189) At end of Interrupted Concurrent Young GC: Young generation used: 874M, used regions: 874M, humongous waste: 7066K, soft capacity: 1024M, max capacity: 1022M, available: 99071K
[15.802s][info][gc,ergo        ] GC(189) At end of Interrupted Concurrent Young GC: Old generation used: 1273K, used regions: 1536K, humongous waste: 0B, soft capacity: 1024M, max capacity: 1536K, available: 262K
[15.803s][info][gc,metaspace   ] GC(189) Metaspace: 759K(960K)->759K(960K) NonClass: 721K(832K)->721K(832K) Class: 38K(128K)->38K(128K)
[15.803s][info][gc             ] Trigger (Young): Handle Allocation Failure
[15.803s][info][gc,start       ] GC(190) Pause Full
[15.803s][info][gc,task        ] GC(190) Using 8 of 8 workers for full gc
[15.803s][info][gc,phases,start] GC(190) Phase 1: Mark live objects
[15.806s][info][gc,ref         ] GC(190) Clearing All SoftReferences
<crash happend in full GC) 
  1. 188 was a full GC, after 188, all TAMS is reset to bottom in the ShenandoahPostCompactClosure, including all old regions.
  2. 189 was a concurrent young GC, there was array copy barrier executed for an array stored in old, but given TAMS is at the bottom as a result of 188 full GC, it was a no-op due to the bug.
  3. 189 finished making, claiming the garbage, but because of the no-op of array copy barrier, the copy was not marked live and was reclaimed.
  4. Allocation failure cancelled 189 and escalated to full GC again, but now we have corrupted heap and VM is going to crash.

Because the heap was already corrupted, it doesn't matter the last GC is full GC or not, it will alway crash when there is read/write on the object array which is gone.

@earthling-amzn
Copy link
Contributor

At step 2, we have an element in the old array pointing to young, correct? Why is it not represented in the remembered set at the beginning of young mark? If it is because the old -> young pointer was created after init mark, then the young pointer was either reachable when mark started, or it was created after mark started. Either way, the young pointer should have been found without this SATB modification. Unless, it was in the remembered set, but it didn't get scanned because a mutator modified it before it was scanned.

@pengxiaolong
Copy link
Author

At step 2, we have an element in the old array pointing to young, correct? Why is it not represented in the remembered set at the beginning of young mark? If it is because the old -> young pointer was created after init mark, then the young pointer was either reachable when mark started, or it was created after mark started. Either way, the young pointer should have been found without this SATB modification. Unless, it was in the remembered set, but it didn't get scanned because a mutator modified it before it was scanned.

array copy involves two array object, src and dst. The dst array is an old array, the src may not be an old, it could be young.

Copy link
Contributor

@earthling-amzn earthling-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked offline. The assertion must be weakened to account for dirty write cards because the young pointer could be put in the old array after init mark. We cannot expect the read card to be dirty in this case.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 6, 2025
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Dec 7, 2025
@pengxiaolong pengxiaolong changed the title 8373116: Genshen: arraycopy_work should be done unconditionally by arraycopy_marking if the array is in an old region 8373116: Genshen: arraycopy_work should be always done for arrays in old gen during young concurrent marking Dec 8, 2025
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-gc hotspot-gc-dev@openjdk.org ready Pull request is ready to be integrated rfr Pull request is ready for review shenandoah shenandoah-dev@openjdk.org

Development

Successfully merging this pull request may close these issues.

3 participants