Skip to content

Conversation

@mxm
Copy link
Contributor

@mxm mxm commented Dec 1, 2025

This commit adds support for strict 1:1 schema synchronization by allowing columns to be dropped from the table when they are not present in the input schema. This is controlled via a new dropUnusedColumns parameter in DynamicIcebergSink.

The default behavior (dropUnusedColumns=false) remains unchanged.

@pvary
Copy link
Contributor

pvary commented Dec 2, 2025

Before I go into the details of the PR, could you please help me understand what to expect in the interim period, when the sink receives records with and without the dropped column?

@mxm
Copy link
Contributor Author

mxm commented Dec 2, 2025

Before I go into the details of the PR, could you please help me understand what to expect in the interim period, when the sink receives records with and without the dropped column?

New records will write with the new schema with any columns not part of the input schema dropped. Old records will continue to write with the old schema, which still exists. If there are any previously unseen schemas which include removed columns, those columns will be re-added as new columns. This is a catch which users will have to accept. That's why the feature is opt-in and disabled by default.

@pvary
Copy link
Contributor

pvary commented Dec 2, 2025

Essentially, there’s a race condition between adding and dropping columns. For example, if a user does the following:

  1. Creates a new schema S1
  2. Sends a record R1 using S1
  3. Creates a new schema S2
  4. Sends a record R2 using S2

If these actions occur within a short time frame and the streams are skewed, the table could end up with either:

  • Schema S2, if R2 arrives later
  • Schema S1, if R1 arrives later

Afterward, querying the table with the “old” schema becomes difficult.

Additionally, users cannot revert the table to any previously created schema using DynamicSink. This behavior is consistent with the current implementation, but with column-dropping support, users might expect this capability.

@Guosmilesmile: Would these restrictions impact your use cases?

@mxm
Copy link
Contributor Author

mxm commented Dec 2, 2025

IMHO this is fine if the user opts in. We deliberately chose not to allow dropping columns because of this race condition; it's important that this remains the default setting. I agree that we should add more documentation around the semantics of removing columns.

@Guosmilesmile
Copy link
Contributor

  1. For my scenario, because there are many downstream consumers, the impact of removing fields is often unknown and hard to assess, so removals are rare — fields are basically deprecated instead.
  2. Since this feature is optional, it's acceptable to me; whether to use it should be left to users, but the limitations must be clearly documented. This is my personal opinion.

@pvary
Copy link
Contributor

pvary commented Dec 2, 2025

Cool! Seems like an ok feature.
Let's move forward with the review then.

@mxm
Copy link
Contributor Author

mxm commented Dec 5, 2025

Rebased to fix merge conflicts in CI after #14406.

mxm added 5 commits December 5, 2025 12:51
This commit adds support for strict 1:1 schema synchronization by allowing
columns to be dropped from the table when they are not present in the input
schema. This is controlled via a new dropUnusedColumns parameter in
DynamicIcebergSink.

The default behavior (dropUnusedColumns=false) remains unchanged.
Comment on lines 45 to 49
* We don't support:
*
* <ul>
* <li>Dropping columns
* <li>Renaming columns
* </ul>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove the list, as it is not needed anymore 😄

Copy link
Contributor Author

@mxm mxm Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't mind, I'll keep this list with one element 😄

@pvary
Copy link
Contributor

pvary commented Dec 5, 2025

Please update the documentation, and highlight the caveats, like what happens when the column is added back with the same name, and what happens when multiple schema changes happen simultaneously

@github-actions github-actions bot added the docs label Dec 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants