@nickelodean

  • Skip modifying/resetting parameters that are already pending-reboot
  • Prevents continuous loop when parameters are in transitional state
  • Applies to both DBClusterParameterGroup and DBParameterGroup
  • Fixes issue where ACK continuously tries to modify parameters after reboot

The loop occurred because:

  1. A parameter is modified or reset and enters pending-reboot status
  2. After the reboot, the parameter is at its default value, but its status still shows pending-reboot
  3. ACK sees a mismatch and tries to modify/reset the parameter again
  4. The loop continues indefinitely

Now ACK checks ParameterOverrideStatuses before modifying/resetting parameters and skips any that are already pending-reboot.
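
The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `paramStatus` and `skipPendingReboot` are hypothetical names, and the real `ParameterOverrideStatuses` entries have a different shape.

```go
package main

import "fmt"

// paramStatus is a hypothetical, simplified stand-in for one entry of
// ParameterOverrideStatuses; the real ACK type has different fields.
type paramStatus struct {
	Name          string
	PendingReboot bool
}

// skipPendingReboot returns a copy of params without the entries whose
// status is marked pending-reboot.
func skipPendingReboot(params map[string]string, statuses []paramStatus) map[string]string {
	pending := make(map[string]bool, len(statuses))
	for _, s := range statuses {
		if s.PendingReboot {
			pending[s.Name] = true
		}
	}
	out := make(map[string]string, len(params))
	for name, value := range params {
		if !pending[name] {
			out[name] = value
		}
	}
	return out
}

func main() {
	toModify := map[string]string{"binlog_format": "ROW", "slow_query_log": "1"}
	statuses := []paramStatus{{Name: "binlog_format", PendingReboot: true}}
	fmt.Println(skipPendingReboot(toModify, statuses)) // prints map[slow_query_log:1]
}
```

Note that a filter like this drops every pending-reboot parameter, including one whose desired value has changed since the pending modification was made.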

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ack-prow ack-prow bot requested review from jlbutler and knottnt December 16, 2025 16:33
@ack-prow

ack-prow bot commented Dec 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nickelodean
Once this PR has been reviewed and has the lgtm label, please assign a-hilaly for approval by writing /assign @a-hilaly in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 16, 2025
@ack-prow

ack-prow bot commented Dec 16, 2025

Hi @nickelodean. Thanks for your PR.

I'm waiting for an aws-controllers-k8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knottnt
Contributor

knottnt commented Dec 17, 2025

/ok-to-test

@ack-prow ack-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 17, 2025
@ack-prow

ack-prow bot commented Dec 17, 2025

@nickelodean: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name            Commit   Details  Required  Rerun command
rds-verify-code-gen  a205a83  link     false     /test rds-verify-code-gen
rds-kind-e2e         a205a83  link     true      /test rds-kind-e2e

Full PR test history. Your PR dashboard.

Contributor

@knottnt knottnt left a comment

@nickelodean Thanks for the contribution! Would you be able to share a manifest that reproduces the issue or logs (with sensitive info redacted) showing the delta detected by the ACK controller?

Comment on lines +245 to +249
// Filter out parameters that are already pending reboot from both toModify and toDelete.
// When a parameter is reset or modified, it remains as a user override with pending-reboot
// status until the DB cluster is rebooted. Attempting to modify or reset it again
// would cause a reconciliation loop. We skip modifying/resetting parameters that are
// already pending reboot until after the reboot completes.
Contributor

Q: When a parameter is pending-reboot can it not still have its value modified/reset? For example if I modify a static parameter to have value "x" and can I still change the value to "y" before rebooting? If we can, I think filtering the parameter from the toModify/toDelete will prevent that update.

Author

Below is the manifest I used to turn on binary logging. After applying it and rebooting, I can see that binary logging is enabled (log_bin is ON), so all is good at that point.
However, the portal then shows binlog_format=none and log_bin=off with a pending-reboot status. ACK keeps trying to reconcile and apply the parameters, the parameter group goes back into pending-reboot, and the AWS events show parameter group updates every 2 hours.
In addition, I tested another use case: I changed a dynamic parameter while the static parameters applied earlier were still in the manifest. ACK re-applied the static parameters, and the group went into pending-reboot again. The goal is to have both static and dynamic parameters applied once via the manifest (not via the portal), and for ACK to know their state after the reboot.

To answer your question: it depends on the parameter, but yes, we can change a value from x to y before the reboot. After the reboot, though, we don't want ACK to go into a loop.

dbClusterParameterGroup:
  name: "testdbclusterparametergroup"
  description: "test cluster parameters"
  family: "aurora-mysql8.0"
  parameterOverrides:
    binlog_format: "ROW"
    read_only: "{TrueIfClusterReplica}"
dbParameterGroup:
  name: "testdbclusterparametergroup-instance"
  description: "test Instance Parameters"
  family: "aurora-mysql8.0"
  parameterOverrides:
    max_allowed_packet: "1073741824"
    max_execution_time: "600000"
    slow_query_log: "1"
    long_query_time: "2"

Author

Yes, I agree with the comment above: filtering all pending-reboot parameters out of toModify might block legitimate updates, so this change may not be needed, or at least needs more testing.

latestOverrides = latest.ko.Spec.ParameterOverrides
}

toModify, _, toDelete := util.GetParametersDifference(
Contributor

I took a look at this function and noticed that it doesn't appear to compare the desired and latest parameters properly. From what I can tell, lo.Difference() and lo.Intersect() compare by value. However, since we're passing map[string]*string, the pointer address is what actually gets compared, so toModify and toDelete both end up containing all values. Is it possible this is what's causing the issue you're seeing?
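
The pointer-versus-value distinction described above can be reproduced in isolation. The sketch below uses only the standard library; `entry`, `sameByPointer`, and `sameByValue` are hypothetical names for illustration and do not reflect how util.GetParametersDifference is actually written.

```go
package main

import "fmt"

// entry pairs a parameter name with a pointer to its value, similar in
// shape to one key/value pair of a map[string]*string.
type entry struct {
	Key   string
	Value *string
}

// sameByPointer compares entries with ==, which compares the Value
// pointers (addresses), not the strings they point to.
func sameByPointer(a, b entry) bool { return a == b }

// sameByValue dereferences both pointers and compares the strings.
func sameByValue(a, b entry) bool {
	if a.Key != b.Key {
		return false
	}
	if a.Value == nil || b.Value == nil {
		return a.Value == b.Value
	}
	return *a.Value == *b.Value
}

func main() {
	desiredVal, latestVal := "ROW", "ROW"
	desired := entry{Key: "binlog_format", Value: &desiredVal}
	latest := entry{Key: "binlog_format", Value: &latestVal}
	fmt.Println(sameByPointer(desired, latest)) // false: different addresses
	fmt.Println(sameByValue(desired, latest))   // true: same string contents
}
```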

Author

Yes, that seems to be the issue, from what I can see.

Phase 1: Parameter was successfully applied (earlier)
ACK set binlog_format: ROW → pending-reboot
DB was rebooted
Parameter applied: DB now has binlog_format: ROW
Everything was correct

Phase 2: Bug resets it.

Next reconciliation cycle.
ACK reads desired: binlog_format: "ROW" (pointer 0x1000)
ACK reads latest: binlog_format: "ROW" (pointer 0x2000)
Bug: pointer comparison 0x1000 == 0x2000 → FALSE
Even though both are "ROW", ACK treats them as different
ACK calls ResetDBClusterParameterGroup → resets to default (OFF)
Event: "Updated parameter binlog_format to OFF with apply method pending-reboot"
Parameter group now has: OFF (pending-reboot)

Contributor

Okay, thanks for verifying that this is the root cause. I think fixing this comparison logic should allow us to resolve the reconciliation loop issue without blocking legitimate changes.
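
One way the comparison could be fixed is to dereference the values before diffing. The following is only a sketch under that assumption: `diffByValue` and `equalStringPtr` are hypothetical helpers, not the controller's GetParametersDifference, and the return shape is simplified for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// equalStringPtr reports whether two *string values hold the same
// string, treating two nil pointers as equal.
func equalStringPtr(a, b *string) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// diffByValue (hypothetical) returns the parameters that must be
// modified (new, or changed by value) and the names that must be reset
// because they no longer appear in desired.
func diffByValue(desired, latest map[string]*string) (toModify map[string]*string, toDelete []string) {
	toModify = map[string]*string{}
	for name, want := range desired {
		have, ok := latest[name]
		if !ok || !equalStringPtr(want, have) {
			toModify[name] = want
		}
	}
	for name := range latest {
		if _, ok := desired[name]; !ok {
			toDelete = append(toDelete, name)
		}
	}
	sort.Strings(toDelete) // stable order for callers and logs
	return toModify, toDelete
}

func strPtr(s string) *string { return &s }

func main() {
	desired := map[string]*string{"binlog_format": strPtr("ROW"), "slow_query_log": strPtr("1")}
	latest := map[string]*string{"binlog_format": strPtr("ROW"), "long_query_time": strPtr("2")}
	toModify, toDelete := diffByValue(desired, latest)
	// binlog_format is unchanged by value, so only slow_query_log is modified.
	fmt.Println(len(toModify), *toModify["slow_query_log"], toDelete)
}
```

With a value-based diff like this, a parameter that already matches the desired value produces no modify or reset call, while a genuine change from "x" to "y" before a reboot would still be detected.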
