@AndreasHolt AndreasHolt commented Dec 7, 2025

What changed?

  • Removed _heartbeatRefreshRate and the early-return block in service/sharddistributor/handler/executor.go, so executor heartbeats are always persisted (a rough sketch of the removed check follows this list).
  • Also updated TestHeartbeat in service/sharddistributor/handler/executor_test.go: replaced the “within/after refresh rate” cases that relied on _heartbeatRefreshRate with a single test that expects RecordHeartbeat to be called on a second heartbeat with the same status.
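
For context, here is a minimal sketch of the gating logic that was removed. This only illustrates the behavior described above; the function and variable names are made up and the real handler code differs:

```go
package main

import (
	"fmt"
	"time"
)

// shouldPersist mirrors the removed early return: with an unchanged status, a
// heartbeat arriving within refreshRate of the last persisted one was skipped.
// After this PR the handler always calls RecordHeartbeat instead.
func shouldPersist(prevStatus, newStatus string, lastPersisted, now time.Time, refreshRate time.Duration) bool {
	if prevStatus == newStatus && now.Sub(lastPersisted) < refreshRate {
		return false // inside the old _heartbeatRefreshRate cooldown: write skipped
	}
	return true
}

func main() {
	refreshRate := 2 * time.Second // the old _heartbeatRefreshRate
	last := time.Now()
	// Same status, 1s after the last persisted heartbeat: previously dropped.
	fmt.Println(shouldPersist("ACTIVE", "ACTIVE", last, last.Add(time.Second), refreshRate)) // false
}
```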

Why?

  • Previously, the handler skipped writing some heartbeats when:
    • the migration mode was 'onboarding', a previous heartbeat existed, the status was unchanged,
    • and most importantly the new heartbeat arrived within **_heartbeatRefreshRate** (2s) of the last one.
  • This meant the stored LastHeartbeat in etcd could lag behind the real heartbeat rate. With the heartbeat TTL set to the same value (2s), healthy executors could be misclassified as stale and removed, which matched what we saw in the canary (executors disappearing from GetState, and shards "collapsing" onto only a few executors).
  • Removing the write cooldown gives a simpler and safer behavior:
    • executors heartbeat roughly once a second (in development),
    • every heartbeat is persisted,
    • the staleness check is based on the real heartbeat frequency and can't be undermined by a separate setting elsewhere in the code that gates writes (a toy timing example after this list shows how the old coupling could trip the 2s TTL).
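
To make the failure mode concrete, here is a toy timing model. The 2s TTL, 2s cooldown, and ~1s heartbeat interval come from the description above; the model itself is an assumption about how the staleness check behaves, not the real code. A heartbeat that lands slightly inside the cooldown is skipped, so the stored LastHeartbeat can age past the TTL even though the executor is healthy:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ttl := 2 * time.Second          // heartbeat TTL used by the staleness check
	cooldown := 2 * time.Second     // the old _heartbeatRefreshRate write cooldown
	interval := 1 * time.Second     // executors heartbeat roughly once per second
	jitter := 50 * time.Millisecond // one heartbeat arrives slightly early

	// Heartbeats at ~1s, ~1.95s (just inside the cooldown), 3s, 4s after the last write.
	heartbeats := []time.Duration{interval, 2*interval - jitter, 3 * interval, 4 * interval}

	lastPersisted := time.Duration(0)
	maxStoredAge := time.Duration(0)
	for _, t := range heartbeats {
		// Age of the stored LastHeartbeat that a staleness check running just
		// before this heartbeat would observe.
		if age := t - lastPersisted; age > maxStoredAge {
			maxStoredAge = age
		}
		if t-lastPersisted >= cooldown {
			lastPersisted = t // only writes outside the cooldown were persisted
		}
	}
	fmt.Printf("max stored age %v vs ttl %v -> misclassified as stale: %v\n",
		maxStoredAge, ttl, maxStoredAge >= ttl)
	// Output: max stored age 3s vs ttl 2s -> misclassified as stale: true
}
```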

Two alternatives to removing the check and cooldown were considered:

  • Increasing the heartbeat TTL (e.g., to 5–10s) while keeping the cooldown.
  • Decreasing _heartbeatRefreshRate (e.g., to 1s).

Both of these alternatives would reduce the chance of misclassifying healthy executors as stale, but they keep a hidden coupling between heartbeat.TTL and the write cooldown. Removing the cooldown entirely makes the behavior easier to reason about and avoids this subtle configuration pitfall.

How did you test it?

  • Updated and ran the unit tests in service/sharddistributor/handler/executor_test.go (TestHeartbeat); a simplified sketch of what the updated test now asserts follows this list.
  • Ran the sharddistributor canary with multiple executors and observed stable executor counts and shard distribution.
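
For illustration, the intent of the updated test boils down to the following self-contained sketch. The real TestHeartbeat uses the project's actual handler and mocks; the types below are stand-ins:

```go
package handler_test

import "testing"

// fakeStore stands in for the real heartbeat store and just counts writes.
type fakeStore struct{ records int }

func (s *fakeStore) RecordHeartbeat(status string) { s.records++ }

// handler is a stripped-down stand-in: with the cooldown removed it always records.
type handler struct{ store *fakeStore }

func (h *handler) Heartbeat(status string) { h.store.RecordHeartbeat(status) }

func TestHeartbeat_SecondHeartbeatWithSameStatusIsPersisted(t *testing.T) {
	store := &fakeStore{}
	h := &handler{store: store}

	h.Heartbeat("ACTIVE")
	h.Heartbeat("ACTIVE") // same status; previously this could be dropped by the cooldown

	if store.records != 2 {
		t.Fatalf("expected 2 recorded heartbeats, got %d", store.records)
	}
}
```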

Potential risks

  • Higher etcd write load for heartbeats: we now persist every heartbeat instead of at most once per _heartbeatRefreshRate (see the rough estimate below).
  • Adopters without external rate limiting (not sure how relevant this is in practice) may see more frequent heartbeat writes than before; this change removes a local write throttle, so they need to rely on their own rate limiting.
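
As a rough estimate (using the ~1s development heartbeat interval mentioned above): with N executors, heartbeat writes go from at most one per executor per 2s (about N/2 writes per second under the old cooldown) to about N writes per second, i.e. roughly a 2x increase in heartbeat write rate.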

Release notes

Documentation Changes

Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>

@jakobht jakobht left a comment

Looks great, tested locally and it indeed looks like it fixes the problem.

Thanks again for the great investigation, it's not easy finding "ghost behaviour" like this in distributed systems :)

@jakobht jakobht merged commit efe044d into cadence-workflow:master Dec 10, 2025
80 of 81 checks passed
