fix: prevent reconciliation loop for RDS parameters pending reboot #259
- Skip modifying/resetting parameters that are already pending-reboot
- Prevents continuous loop when parameters are in transitional state
- Applies to both DBClusterParameterGroup and DBParameterGroup
- Fixes issue where ACK continuously tries to modify parameters after reboot

The loop occurred because:
1. Parameter set to pending-reboot status
2. After reboot, parameter at default but status still shows pending-reboot
3. ACK sees mismatch and tries to modify/reset again
4. Loop continues indefinitely

Now ACK checks ParameterOverrideStatuses before modifying/resetting parameters and skips any that are already pending-reboot.
/ok-to-test
knottnt
left a comment
@nickelodean Thanks for the contribution! Would you be able to share a manifest that reproduces the issue or logs (with sensitive info redacted) showing the delta detected by the ACK controller?
// Filter out parameters that are already pending reboot from both toModify and toDelete.
// When a parameter is reset or modified, it remains as a user override with pending-reboot
// status until the DB cluster is rebooted. Attempting to modify or reset it again
// would cause a reconciliation loop. We skip modifying/resetting parameters that are
// already pending reboot until after the reboot completes.
Q: When a parameter is pending-reboot, can its value not still be modified/reset? For example, if I modify a static parameter to have value "x", can I still change the value to "y" before rebooting? If we can, I think filtering the parameter from toModify/toDelete will prevent that update.
Below is the manifest I used to turn on binary logging. After the reboot I can see that binary logging (log_bin) is ON. All good.
However, the portal shows binlog_format=none and log_bin=off with a pending-reboot status, so ACK is continuously trying to reconcile/apply, the parameter goes back into pending reboot, and the AWS events show parameter group updates every 2 hours.
In addition, I tested another use case where I changed a dynamic parameter while the earlier applied static parameters were still in the manifest; ACK re-applies them and the group goes into pending reboot. The goal here is to have both static and dynamic parameters applied once via the manifest (not via the portal), and after the reboot ACK should know their state.
To your question: it depends on the parameter, but yes, we can change from x to y before the reboot. After the reboot, though, we don't want ACK to go into a loop.
dbClusterParameterGroup:
  name: "testdbclusterparametergroup"
  description: "test cluster parameters"
  family: "aurora-mysql8.0"
  parameterOverrides:
    binlog_format: "ROW"
    read_only: "{TrueIfClusterReplica}"
dbParameterGroup:
  name: "testdbclusterparametergroup-instance"
  description: "test Instance Parameters"
  family: "aurora-mysql8.0"
  parameterOverrides:
    max_allowed_packet: "1073741824"
    max_execution_time: "600000"
    slow_query_log: "1"
    long_query_time: "2"
Yes, I agree with the comment above: filtering all pending-reboot parameters from toModify might block legitimate updates, so maybe this change is not needed, or it needs more testing.
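The concern in this thread can be illustrated with a small sketch. Everything here is hypothetical: `skipPendingReboot` and the plain `map[string]string` status map are simplified stand-ins for the PR's filtering over ParameterOverrideStatuses, not the controller's actual code. It shows that a filter keyed only on the pending-reboot status also drops a legitimate change from "x" to "y" made before the reboot.

```go
package main

import "fmt"

// skipPendingReboot is a hypothetical filter: it drops every parameter whose
// status is "pending-reboot" from the set of parameters to modify.
func skipPendingReboot(toModify, status map[string]string) map[string]string {
	out := map[string]string{}
	for name, value := range toModify {
		if status[name] == "pending-reboot" {
			continue // skipped regardless of whether the desired value changed
		}
		out[name] = value
	}
	return out
}

func main() {
	// Assumed scenario: binlog_format was already modified once (now pending
	// reboot), and the user edits the manifest again before rebooting.
	toModify := map[string]string{"binlog_format": "MIXED", "long_query_time": "2"}
	status := map[string]string{"binlog_format": "pending-reboot"}

	filtered := skipPendingReboot(toModify, status)
	_, kept := filtered["binlog_format"]
	fmt.Println(kept)          // false: the second edit is silently dropped
	fmt.Println(len(filtered)) // 1: only long_query_time remains
}
```

A filter that also compared the pending value against the desired value would avoid this, but that requires a correct value comparison in the first place.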
	latestOverrides = latest.ko.Spec.ParameterOverrides
}

toModify, _, toDelete := util.GetParametersDifference(
Took a look at this function and noticed that it doesn't appear to properly compare the desired/latest parameters. From what I can tell, lo.Difference() and lo.Intersect() compare by value. However, since we're passing map[string]*string, the pointer address is what's actually being compared, so toModify and toDelete both end up containing all values. Is it possible this is what's causing the issue you're seeing?
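The pointer-versus-value distinction described above can be reproduced with the standard library alone. This is a minimal sketch, not the controller's code: `strPtr` and `sameValue` are hypothetical helpers introduced only for illustration.

```go
package main

import "fmt"

// strPtr is a hypothetical helper that returns a pointer to a copy of s.
func strPtr(s string) *string { return &s }

// sameValue compares two *string by the strings they point to.
// A nil pointer is considered equal only to another nil pointer.
func sameValue(a, b *string) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

func main() {
	desired := strPtr("ROW")
	latest := strPtr("ROW")

	fmt.Println(desired == latest)          // false: == on pointers compares addresses
	fmt.Println(sameValue(desired, latest)) // true: the pointed-to strings match
}
```

Any generic helper that relies on `==` for its element type behaves like the first comparison when that element type is `*string`.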
Yes, that seems to be the issue from what I see.
Phase 1: Parameter was successfully applied (earlier)
ACK set binlog_format: ROW → pending-reboot
DB was rebooted
Parameter applied: DB now has binlog_format: ROW
Everything was correct
Phase 2: Bug resets it.
Next reconciliation cycle.
ACK reads desired: binlog_format: "ROW" (pointer 0x1000)
ACK reads latest: binlog_format: "ROW" (pointer 0x2000)
Bug: pointer comparison 0x1000 == 0x2000 → FALSE
Even though both are "ROW", ACK treats them as different
ACK calls ResetDBClusterParameterGroup → resets to default (OFF)
Event: "Updated parameter binlog_format to OFF with apply method pending-reboot"
Parameter group now has: OFF (pending-reboot)
Okay, thanks for verifying that this is the root cause. I think fixing this comparison logic should allow us to resolve the reconciliation loop issue without blocking legitimate changes.
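One possible shape for a value-based comparison is sketched below. It is an assumption about the direction of the fix, not the change that was actually made: `diffOverrides`, `equalStr`, and `ptr` are hypothetical names, and the real `util.GetParametersDifference` has a different signature (it returns three values).

```go
package main

import (
	"fmt"
	"sort"
)

// ptr is a hypothetical helper that returns a pointer to a copy of s.
func ptr(s string) *string { return &s }

// equalStr compares two *string by value; nil equals only nil.
func equalStr(a, b *string) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// diffOverrides is a hypothetical value-based diff. toModify holds keys that
// are new in desired or whose value differs from latest; toDelete holds keys
// that exist only in latest (sorted for stable output).
func diffOverrides(desired, latest map[string]*string) (map[string]*string, []string) {
	toModify := map[string]*string{}
	var toDelete []string
	for k, dv := range desired {
		if lv, ok := latest[k]; !ok || !equalStr(dv, lv) {
			toModify[k] = dv
		}
	}
	for k := range latest {
		if _, ok := desired[k]; !ok {
			toDelete = append(toDelete, k)
		}
	}
	sort.Strings(toDelete)
	return toModify, toDelete
}

func main() {
	// Same values behind different pointers: nothing to do, so no loop.
	m, d := diffOverrides(
		map[string]*string{"binlog_format": ptr("ROW")},
		map[string]*string{"binlog_format": ptr("ROW")},
	)
	fmt.Println(len(m), len(d)) // 0 0

	// A changed value and a removed key are still detected.
	m, d = diffOverrides(
		map[string]*string{"binlog_format": ptr("MIXED")},
		map[string]*string{"binlog_format": ptr("ROW"), "slow_query_log": ptr("1")},
	)
	fmt.Println(len(m), d) // 1 [slow_query_log]
}
```

With this kind of comparison, a parameter that is pending reboot but already holds the desired value produces no modify or reset call, while a changed value is still detected.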
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.