@rasifr rasifr commented Dec 10, 2025

This commit fixes several critical issues to make Spock more stable and prevent apply worker death loops (kill-restart-kill cycles). It also eliminates unhelpful errors such as "exception handling had no exception(s)".

  1. Fix TRANSDISCARD/SUB_DISABLE handling during commit phase in transaction retry mode

    Problem: When a transaction initially errors but succeeds on retry in these modes, committing it violates TRANSDISCARD semantics (all-or-nothing at the transaction level). Before this change, the apply worker committed the transaction and then terminated with an unhelpful error message, which led to a death loop.

    Solution: Detect this condition (use_try_block=true with no exceptions during replay) before commit. Abort the current transaction (all operations already rolled back in subtransactions), start a new transaction, log the discard to exception_log with the original error message and operation type, then commit the log entry.

    • SUB_DISABLE mode: Throw error to trigger subscription disable in parent PG_CATCH
    • TRANSDISCARD mode: Use goto transdiscard_skip_commit to update progress and continue, ensuring transaction is fully discarded with proper audit trail

  2. Prevent NULL error messages in exception_log

    Added a fallback to initial_error_message for INSERT/UPDATE/DELETE/SQL operations, so that context is logged even when an operation succeeds on retry in DISCARD mode.

  3. Eliminate "(unknown action)" in error contexts

    Set errcallback_arg.action_name in all protocol message handlers that were missing it: ORIGIN, COMMIT_ORDER, RELATION, DELETE, STARTUP, MESSAGE

  4. Track parent operation for transaction discards

    Added initial_operation field to SpockExceptionLog structure to capture the operation type (INSERT/UPDATE/DELETE/SQL) that caused the initial exception. Shows which specific DML caused the transaction discard in TRANSDISCARD mode.
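The commit-phase check from items 1, 2 and 4 can be sketched roughly as follows. This is a minimal standalone model, not the Spock source: only `use_try_block`, `initial_error_message`, `initial_operation` and the mode names come from the description above. The enums, structs, the `decide_commit_action` function and the placeholder strings are hypothetical, and the real code performs the abort, logging and `goto` inside the apply worker's commit handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Exception modes named in the description; the enum itself is hypothetical. */
typedef enum { MODE_DISCARD, MODE_TRANSDISCARD, MODE_SUB_DISABLE } ExceptionMode;

/* What the commit handler should do next (hypothetical names). */
typedef enum {
    ACT_UNCHANGED_PATH,        /* not the case described here */
    ACT_DISCARD_AND_CONTINUE,  /* TRANSDISCARD: skip commit, update progress */
    ACT_RAISE_TO_DISABLE       /* SUB_DISABLE: error out so the subscription is disabled */
} CommitAction;

typedef struct {
    bool        use_try_block;          /* replaying after an initial failure */
    int         replay_exceptions;      /* exceptions seen during the replay */
    const char *initial_error_message;  /* error from the first attempt, may be NULL */
    const char *initial_operation;      /* "INSERT", "UPDATE", "DELETE" or "SQL" */
} ReplayState;

/* Simplified stand-in for the fields written to exception_log. */
typedef struct {
    char operation[16];
    char error_message[128];
} DiscardLogEntry;

CommitAction
decide_commit_action(ExceptionMode mode, const ReplayState *st, DiscardLogEntry *entry)
{
    assert(st != NULL && entry != NULL);

    /* Only a replay that raised no exception is the problematic case. */
    if (!st->use_try_block || st->replay_exceptions > 0)
        return ACT_UNCHANGED_PATH;
    if (mode != MODE_TRANSDISCARD && mode != MODE_SUB_DISABLE)
        return ACT_UNCHANGED_PATH;

    /* Record the original failure; the placeholders are assumptions that
     * keep the logged fields from ever being empty. */
    snprintf(entry->operation, sizeof(entry->operation), "%s",
             st->initial_operation ? st->initial_operation : "UNKNOWN");
    snprintf(entry->error_message, sizeof(entry->error_message), "%s",
             st->initial_error_message ? st->initial_error_message
                                       : "(no error message recorded)");

    return mode == MODE_SUB_DISABLE ? ACT_RAISE_TO_DISABLE
                                    : ACT_DISCARD_AND_CONTINUE;
}
```

In this model the log entry is filled before the action is returned, mirroring the order in the description: the discard is logged first, then the worker either continues (TRANSDISCARD) or raises an error (SUB_DISABLE).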

Regression Test Improvements:
Make regression tests deterministic and cover new behavior

  • Normalize OIDs in spock.exception_log.error_message output using regexp_replace() so expected output is stable across test runs
  • Add TAP test 013_exception_handling to exercise TRANSDISCARD and SUB_DISABLE modes end-to-end, verify exception_log entries have non-NULL error messages, and assert a single SUB_DISABLE entry when subscription is disabled on conflict

Signed-off-by: Asif Rehman <asifr@pgedge.com>
@rasifr rasifr force-pushed the task/SPOC-363/exception_handling branch from b3eee26 to 1268356 Compare December 12, 2025 11:10
When a transaction was successfully skipped using skip_lsn in SUB_DISABLE
mode, the subscription would incorrectly get disabled again instead of
continuing to replicate. This happened because the exception handling state
was not cleared after a successful skip.

Additionally, there was an LSN mismatch when comparing skip_lsn:
- skip_lsn is set using replorigin_session_origin_lsn (BEGIN commit_lsn)
- But clear_subscription_skip_lsn() was called with end_lsn (COMMIT end_lsn)
- These LSNs are different, causing a mismatch warning.
@mason-sharp mason-sharp merged commit 3cf9758 into main Dec 17, 2025
5 checks passed
@mason-sharp mason-sharp deleted the task/SPOC-363/exception_handling branch December 17, 2025 00:32