Skip to content

Conversation

@marc-chevalier
Copy link
Member

@marc-chevalier marc-chevalier commented Dec 18, 2025

Let's proceed piece by piece.

G1BarrierSetAssembler::g1_write_barrier_pre in g1BarrierSetAssembler_aarch64.cpp

// TODO 8366717 This came with 8284161: Implementation of Virtual Threads (Preview) later in May 2022
// Check if it's sufficient
//__ push_call_clobbered_registers();
assert_different_registers(rscratch1, pre_val); // push_CPU_state trashes rscratch1
__ push_CPU_state(true);

push_call_clobbered_registers/pop_call_clobbered_registers should be enough around a runtime call, that's what clobbered registers are.

But let's dig a bit, that will be useful later!

   push_call_clobbered_registers()
=> push_call_clobbered_registers_except(exclude = empty set)
=> push(call_clobbered_gp_registers() - exclude, sp)  // with exclude = empty set

So, we save at least call_clobbered_gp_registers which is defined as:

RegSet MacroAssembler::call_clobbered_gp_registers() {
RegSet regs = RegSet::range(r0, r17) - RegSet::of(rscratch1, rscratch2);
#ifndef R18_RESERVED
regs += r18_tls;
#endif
return regs;
}

So, we save r0-r7 and r10-r17

CardTableBarrierSetAssembler::oop_store_at in cardTableBarrierSetAssembler_aarch64.cpp

if (tmp3 != noreg) {
// TODO 8366717 This change is from before the 'tmp3' arg was added to mainline, check if it's still needed. Same on x64. Also, this should be a __ lea
// Called by MacroAssembler::pack_inline_helper. We cannot corrupt the dst.base() register
__ mov(tmp3, dst.base());
store_check(masm, tmp3, dst);
} else {
// It's OK to corrupt the dst.base() register.
store_check(masm, dst.base(), dst);
}

Digging the history looks like it still makes sense, it doesn't look like an accident. If we remove the if-branch, tests are totally on fire. Let's keep it.

CardTableBarrierSetAssembler::oop_store_at in cardTableBarrierSetAssembler_x86.cpp

Same as before. But it's not even guarantee that InlineTypePassFieldsAsArgs is true. We can have such a backtrace:

CardTableBarrierSetAssembler::oop_store_at (cardTableBarrierSetAssembler_x86.cpp:184)
CardTableBarrierSetAssembler::store_at (cardTableBarrierSetAssembler_x86.cpp:90)
MacroAssembler::store_heap_oop (macroAssembler_x86.cpp:5515)
do_oop_store (templateTable_x86.cpp:148)
(called, for instance, from TemplateTable::putfield_or_static_helper (templateTable_x86.cpp:2964))

where the last give r8 for tmp3. It is not quite clear to me why we don't have a problem in mainline, because it looks like corrupting address base register is a bad idea given that we use it after.

MacroAssembler::unpack_inline_helper in macroAssembler_aarch64.cpp

First part

// Don't use r14 as tmp because it's used for spilling (see MacroAssembler::spill_reg_for)
// TODO 8366717 We need to make sure that r14 (and potentially other long-life regs) are kept live in slowpath runtime calls in GC barriers
Register tmp1 = r10;
Register tmp2 = r11;

Yes, these registers are saved, as we saw above! I've added some asserts to make sure that they are still in the call_clobbered_gp_registers() set. But it seems a bit redundant: that's what clobber are, registers that we cannot trust to be kept. If we use a register not in this set, the native callee shouldn't touch it. Better safe than sorry?

Second Part

// TODO 8366717 This is probably okay but looks fishy because stream is reset in the "Set null marker to zero" case just above. Same on x64.
sig_index = stream.sig_index();
to_index = stream.regs_index();

This is fine.

Indeed, we might reset the stream just before, but then we exhaust it in the loop.

stream.reset(sig_index, to_index);
while (stream.next(toReg, bt)) {
if (sig->at(stream.sig_index())._offset == -1 ||
bt == T_OBJECT || bt == T_ARRAY) {
if (toReg->is_stack()) {
int st_off = toReg->reg2stack() * VMRegImpl::stack_slot_size;
str(zr, Address(sp, st_off));
} else {
mov(toReg->as_Register(), zr);
}
}
}

This is already what we do before:

stream.reset(sig_index, to_index);
while (stream.next(toReg, bt)) {

(and this loop has no break).

So entering the if and running the second reset will just leave the stream in the same state.

About this piece, I also took the liberty of changing a tiny bit the ctor of ScalarizedInlineArgsStream as this class was not correct if step is not 1 or -1. We could also have fixed it by changing the line 79 in

void reset(int sig_idx, int regs_idx) {
_sig_idx = sig_idx;
_regs_idx = regs_idx;
assert(_sig->at(_sig_idx)._bt == (_step > 0) ? T_METADATA : T_VOID, "should be at inline type delimiter");
_depth = 1;
DEBUG_ONLY(_finished = false);
}

with _depth = abs(_step), but that looks like a recipe for a less explicit constructor, and more obscure code for a feature that is not desirable (skipping elements).

gen_c2i_adapter in sharedRuntime_aarch64.cpp

First part

// TODO 8366717 Is the comment about r13 correct? Isn't that r19_sender_sp?
// Name some registers to be used in the following code. We can use
// anything except r0-r7 which are arguments in the Java calling
// convention, rmethod (r12), and r13 which holds the outgoing sender
// SP for the interpreter.
// TODO 8366717 We need to make sure that buf_array, buf_oop (and potentially other long-life regs) are kept live in slowpath runtime calls in GC barriers
Register buf_array = r10; // Array of buffered inline types
Register buf_oop = r11; // Buffered inline type oop
Register tmp1 = r15;
Register tmp2 = r16;
Register tmp3 = r17;

After some digging, the comment about r13 doesn't seem correct to me. I've changed it.

The second part is again about trusting push_call_clobbered_registers to push enough things. I've added some asserts again.

Second part

// TODO 8366717 Do we need to save vectors here? They could be used as arg registers, right? Same on x64.
RegisterSaver reg_save(true /* save_vectors */);
OopMap* map = reg_save.save_live_registers(masm, 0, &frame_size_in_words);

Yes, but we save too much. save_live_registers calls push_CPU_state(_save_vectors, ...) defined there:

void MacroAssembler::push_CPU_state(bool save_vectors, bool use_sve,
int sve_vector_size_in_bytes, int total_predicate_in_bytes) {
push(RegSet::range(r0, r29), sp); // integer registers except lr & sp
if (save_vectors && use_sve && sve_vector_size_in_bytes > 16) {
sub(sp, sp, sve_vector_size_in_bytes * FloatRegister::number_of_registers);
for (int i = 0; i < FloatRegister::number_of_registers; i++) {
sve_str(as_FloatRegister(i), Address(sp, i));
}
} else {
int step = (save_vectors ? 8 : 4) * wordSize;
mov(rscratch1, -step);
sub(sp, sp, step);
for (int i = 28; i >= 4; i -= 4) {
st1(as_FloatRegister(i), as_FloatRegister(i+1), as_FloatRegister(i+2),
as_FloatRegister(i+3), save_vectors ? T2D : T1D, Address(post(sp, rscratch1)));
}
st1(v0, v1, v2, v3, save_vectors ? T2D : T1D, sp);
}
if (save_vectors && use_sve && total_predicate_in_bytes > 0) {
sub(sp, sp, total_predicate_in_bytes);
for (int i = 0; i < PRegister::number_of_registers; i++) {
sve_str(as_PRegister(i), Address(sp, i));
}
}
}

It will always save registers r0-r29, but also the floating point registers. The difference is that with save_vectors, we save the whole vectors, while without save_vectors, we save only 1 word (8 bytes) per float register (4 words for 4 registers). So, the question is whether we use more that the lower word of each register for arguments, and... I'm not sure. We can keep saving them as it consumes only a bit of stack space, but it runs the same amount of instructions as not saving them.

But I said we save too much, that's because we will save v0-v31 (or pieces of) anyway, while c_farg* and j_farg* are only the registers v0-v7. If I understand well, in this context (we are generating an adapter), around a call (for instance to allocate a buffer) we need to save only the registers that are relevant for the calling convention, like the ones that can hold arguments. If the register v25 (for instance) contained something useful for the caller, it should have been saved by the caller before doing the call, and so before entering the adapter. Since v25 is not used by the adapter or the calling convention, we don't need to save it. Is it worth refining save_live_registers?

gen_c2i_adapter in sharedRuntime_x86_64.cpp

The x86 version of the last thing:

// There is at least an inline type argument: we're coming from
// compiled code so we have no buffers to back the inline types.
// Allocate the buffers here with a runtime call.
OopMap* map = RegisterSaver::save_live_registers(masm, 0, &frame_size_in_words, /*save_vectors*/ false);

/*save_vectors*/ is actually save_wide_vectors. save_live_registers calls push_FPU_state anyway which does fxsave, which saves XMM0-XMM15. If save_wide_vectors, then save_live_registers makes sure to save all of ZMM (the rest of ZMM0-ZMM15, as well as the others ZMM). c_farg* and j_farg* are defined as xmm0-xmm7. XMM registers are already 128 bits (that's 16 bytes), so if we are not using the rest of zmm registers for arguments, it should be fine. Unlike on Aarch64, saving setting save_wide_vectors to true would create a lot more instructions: only one fxsave is enough for the xmm0-15, but the rest must be saved a lot more manually, once piece at a time.

While we probably are good with "only" 16 bytes saved, we need to save only what is used in the calling convention and in the adapter, so my guess would be that we actually only need to save xmm0-xmm7. I'm not sure it's worth improving save_live_registers since saving xmm0-xmm15 is done in a single instruction, we would trade a little of stack space for quite a lot more instructions. Maybe it's just fine to over-save.

G1BarrierSetAssembler::g1_write_barrier_pre in g1BarrierSetAssembler_x86

if (EnableValhalla && InlineTypePassFieldsAsArgs) {
// Barriers might be emitted when converting between (scalarized) calling conventions for inline
// types. Save all argument registers before calling into the runtime.
// TODO 8366717: use push_set() (see JDK-8283327 push/pop_call_clobbered_registers & aarch64 )
__ pusha();
__ subptr(rsp, 64);
__ movdbl(Address(rsp, 0), j_farg0);
__ movdbl(Address(rsp, 8), j_farg1);
__ movdbl(Address(rsp, 16), j_farg2);
__ movdbl(Address(rsp, 24), j_farg3);
__ movdbl(Address(rsp, 32), j_farg4);
__ movdbl(Address(rsp, 40), j_farg5);
__ movdbl(Address(rsp, 48), j_farg6);
__ movdbl(Address(rsp, 56), j_farg7);
} else {
// Determine and save the live input values
__ push_call_clobbered_registers();
}

This branch was introduced in 4c3231c "Merge tag 'jdk-20+8' into lworld" on 2022-08-31.

The non-Valhalla side:

  • was introduced in 9583e36 (8284161: Implementation of Virtual Threads (Preview)) on 2022-05-07
  • to replace what a push_set from eb4849e (8283327: Add methods to save/restore registers when calling into the VM from C1/interpreter barrier code) on 2022-03-21
  • that was itself replacing a few push from 2a0986b (8199417: Modularize interpreter GC barriers) on 2018-04-11 (which introduced the whole code)

The Valhalla side:

  • saving the j_farg* was added in 2455c8e (8251398: [lworld] TestCallingConvention::test36 spuriously fails due to incorrect field value) on 2020-08-11
  • the pusha comes from d9f3aaa (8242210: [lworld] TestCallingConvention::test36 spuriously fails) on 2020-04-09
  • which placed the code from 2a0986b (8199417: Modularize interpreter GC barriers) mentioned before.

So it seems the merge jdk was done in a very conservative way, keeping both sides that evolved separately. Yet, the Valhalla side is probably overkill.

   push_call_clobbered_registers(save_fpu = true)
=> push_call_clobbered_registers(exlude = empty set, save_fpu = true)
=> push_set(call_clobbered_xmm_registers(), gp_area_size) if save_fpu

call_clobbered_xmm_registers() is all the xmm on all os, except on windows where it's xmm0-xmm5 + xmm16-xmm_max. That covers the c_farg* (since on windows it only goes from 0 to 3), but it doesn't cover the j_farg* (that are from 0 to 7, even on windows). But on Windows, only xmm0-xmm5 are considered volatile (clobbers). Since we are calling a native function, it should be fine (that is call_clobbered_xmm_registers() and its comment are correct).

Thanks,
Marc


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JDK-8366717: [lworld] Cleanup defensive fixing of JDK-8365996 (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/valhalla.git pull/1824/head:pull/1824
$ git checkout pull/1824

Update a local copy of the PR:
$ git checkout pull/1824
$ git pull https://git.openjdk.org/valhalla.git pull/1824/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 1824

View PR using the GUI difftool:
$ git pr show -t 1824

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/valhalla/pull/1824.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 18, 2025

👋 Welcome back mchevalier! A progress list of the required criteria for merging this PR into lworld will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Dec 18, 2025

@marc-chevalier This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8366717: [lworld] Cleanup defensive fixing of JDK-8365996

Reviewed-by: thartmann

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 94 new commits pushed to the lworld branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the lworld branch, type /integrate in a new comment.

@marc-chevalier marc-chevalier marked this pull request as ready for review December 18, 2025 08:29
@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 18, 2025
@mlbridge
Copy link

mlbridge bot commented Dec 18, 2025

Webrevs

Copy link
Member

@TobiHartmann TobiHartmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed analysis Marc! Changes look good to me, I just added a few comments.

if (tmp3 != noreg) {
// TODO 8366717 This change is from before the 'tmp3' arg was added to mainline, check if it's still needed. Same on x64. Also, this should be a __ lea
// Called by MacroAssembler::pack_inline_helper. We cannot corrupt the dst.base() register
// When tmp3 is given, we cannot corrupt the dst.base() register (from MacroAssembler::pack_inline_helper or do_oop_store)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't look like an accident. If we remove the if-branch, tests are totally on fire. Let's keep it.

It's only the __ mov(tmp3, dst.base()); that could potentially be removed, right? Not the entire branch. If the mov is still needed, should it be guarded by InlineTypePassFieldsAsArgs?

Copy link
Member Author

@marc-chevalier marc-chevalier Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can just remove the mov part.

I don't really understand how to interpret that. We have

        __ mov(tmp3, dst.base());
        store_check(masm, tmp3, dst);

do you suggest we write

        store_check(masm, tmp3, dst);

? But then, tmp3 is not set to anything meaningful yet. Or

        store_check(masm, dst.base(), dst);

But that is exactly the else-branch. Moreover, I've tried to guard and it makes some test fails. For instance, we can come from

CardTableBarrierSetAssembler::oop_store_at (cardTableBarrierSetAssembler_x86.cpp:184)
CardTableBarrierSetAssembler::store_at (cardTableBarrierSetAssembler_x86.cpp:90)
MacroAssembler::store_heap_oop (macroAssembler_x86.cpp:5515)
do_oop_store (templateTable_x86.cpp:148)
(called, for instance, from TemplateTable::putfield_or_static_helper (templateTable_x86.cpp:2964))

that doesn't need InlineTypePassFieldsAsArgs but will give tmp3 and if we don't use it here (for instance, if I replace the condition by InlineTypePassFieldsAsArgs && tmp3 != noreg), we get various test failures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why it's not a problem on mainline, tho. And I agree it's weird. I've looked, but it's hard to investigate why something works rather than not (on a somewhat diverging codebase). I spent some time on that, but eventually decided it's not that critical, as long as we have a way to make it work for Valhalla too.

Copy link
Member

@TobiHartmann TobiHartmann Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was suggesting that we should do what mainline does if !InlineTypePassFieldsAsArgs, would that make sense?

I.e. do store_check(masm, dst.base(), dst);

Copy link
Member Author

@marc-chevalier marc-chevalier Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the else-branch, right? Right now we have

if (tmp3 != noreg) {
  __ mov(tmp3, dst.base());
  store_check(masm, tmp3, dst);
} else {
  // It's OK to corrupt the dst.base() register.
  store_check(masm, dst.base(), dst);
}

If I understand well, you're suggesting to write

if (InlineTypePassFieldsAsArgs) {
  if (tmp3 != noreg) {
    __ mov(tmp3, dst.base());
    store_check(masm, tmp3, dst);
  } else {
    // It's OK to corrupt the dst.base() register.
    store_check(masm, dst.base(), dst);
  }
} else {
  // as mainline
  store_check(masm, dst.base(), dst);
}

but to me, it looks equivalent to

if (InlineTypePassFieldsAsArgs && tmp3 != noreg) {
  __ mov(tmp3, dst.base());
  store_check(masm, tmp3, dst);
} else {
  // It's OK to corrupt the dst.base() register.
  store_check(masm, dst.base(), dst);
}

since both else-branch are identical.

And this, I experimentally found that it's failing quite a lot, with backtraces as I mentioned. Is there an obvious mistake in my logic I'm missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you are right. I missed the outer if (tmp3 != noreg) { which is Valhalla specific. As we discussed off-thread, it would be good to get a better understanding of why this fails in Valhalla. Thanks a lot!

RegSet clobbered_gp_regs = MacroAssembler::call_clobbered_gp_registers();
assert(clobbered_gp_regs.contains(tmp1), "tmp1 must be saved explicitly if it's not a clobber");
assert(clobbered_gp_regs.contains(tmp2), "tmp2 must be saved explicitly if it's not a clobber");
assert(clobbered_gp_regs.contains(r14), "r14 must be saved explicitly if it's not a clobber");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea with these asserts!

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready Pull request is ready to be integrated rfr Pull request is ready for review

Development

Successfully merging this pull request may close these issues.

2 participants