@AndreasHolt AndreasHolt commented Dec 7, 2025

What changed?

  • Removed _heartbeatRefreshRate and the early-return block in service/sharddistributor/handler/executor.go, so executor heartbeats are always persisted (a rough sketch of the removed check follows this list).
  • Also updated TestHeartbeat in service/sharddistributor/handler/executor_test.go: replaced the “within/after refresh rate” cases that relied on _heartbeatRefreshRate with a single test that expects RecordHeartbeat to be called on a second heartbeat with the same status.
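
For context, here is a minimal sketch of the gating logic that was removed. This only illustrates the behavior described above; the function and variable names are made up and the real handler code differs:

```go
package main

import (
	"fmt"
	"time"
)

// shouldPersist mirrors the removed early return: with an unchanged status, a
// heartbeat arriving within refreshRate of the last persisted one was skipped.
// After this PR the handler always calls RecordHeartbeat instead.
func shouldPersist(prevStatus, newStatus string, lastPersisted, now time.Time, refreshRate time.Duration) bool {
	if prevStatus == newStatus && now.Sub(lastPersisted) < refreshRate {
		return false // inside the old _heartbeatRefreshRate cooldown: write skipped
	}
	return true
}

func main() {
	refreshRate := 2 * time.Second // the old _heartbeatRefreshRate
	last := time.Now()
	// Same status, 1s after the last persisted heartbeat: previously dropped.
	fmt.Println(shouldPersist("ACTIVE", "ACTIVE", last, last.Add(time.Second), refreshRate)) // false
}
```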

Why?

  • Previously, the handler skipped writing some heartbeats when:
    • the migration mode was 'onboarding', a previous heartbeat existed, the status was unchanged,
    • and most importantly the new heartbeat arrived within **_heartbeatRefreshRate** (2s) of the last one.
  • This meant the stored LastHeartbeat in etcd could lag behind the real heartbeat rate. With the heartbeat TTL set to the same value (2s), healthy executors could be misclassified as stale and removed, which matched what we saw in the canary (executors disappearing from GetState, and shards "collapsing" onto only a few executors).
  • Removing the write cooldown gives a simpler and safer behavior:
    • executors heartbeat roughly once a second (in development),
    • every heartbeat is persisted,
    • the staleness check is based on the real heartbeat frequency and can't be undermined by a separate setting elsewhere in the code that gates writes (a toy timing example after this list shows how the old coupling could trip the 2s TTL).
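
To make the failure mode concrete, here is a toy timing model. The 2s TTL, 2s cooldown, and ~1s heartbeat interval come from the description above; the model itself is an assumption about how the staleness check behaves, not the real code. A heartbeat that lands slightly inside the cooldown is skipped, so the stored LastHeartbeat can age past the TTL even though the executor is healthy:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ttl := 2 * time.Second          // heartbeat TTL used by the staleness check
	cooldown := 2 * time.Second     // the old _heartbeatRefreshRate write cooldown
	interval := 1 * time.Second     // executors heartbeat roughly once per second
	jitter := 50 * time.Millisecond // one heartbeat arrives slightly early

	// Heartbeats at ~1s, ~1.95s (just inside the cooldown), 3s, 4s after the last write.
	heartbeats := []time.Duration{interval, 2*interval - jitter, 3 * interval, 4 * interval}

	lastPersisted := time.Duration(0)
	maxStoredAge := time.Duration(0)
	for _, t := range heartbeats {
		// Age of the stored LastHeartbeat that a staleness check running just
		// before this heartbeat would observe.
		if age := t - lastPersisted; age > maxStoredAge {
			maxStoredAge = age
		}
		if t-lastPersisted >= cooldown {
			lastPersisted = t // only writes outside the cooldown were persisted
		}
	}
	fmt.Printf("max stored age %v vs ttl %v -> misclassified as stale: %v\n",
		maxStoredAge, ttl, maxStoredAge >= ttl)
	// Output: max stored age 3s vs ttl 2s -> misclassified as stale: true
}
```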

Two alternatives to removing the check and cooldown were considered:

  • Increasing the heartbeat TTL (e.g., to 5–10s) while keeping the cooldown.
  • Decreasing _heartbeatRefreshRate (e.g., to 1s).

Both of these alternatives would reduce the chance of misclassifying healthy executors as stale, but they keep a hidden coupling between heartbeat.TTL and the write cooldown. Removing the cooldown entirely makes the behavior easier to reason about and avoids this subtle configuration pitfall.

How did you test it?

  • Updated and ran the unit tests in service/sharddistributor/handler/executor_test.go (TestHeartbeat); a simplified sketch of what the updated test now asserts follows this list.
  • Ran the sharddistributor canary with multiple executors and observed stable executor counts and shard distribution.
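
For illustration, the intent of the updated test boils down to the following self-contained sketch. The real TestHeartbeat uses the project's actual handler and mocks; the types below are stand-ins:

```go
package handler_test

import "testing"

// fakeStore stands in for the real heartbeat store and just counts writes.
type fakeStore struct{ records int }

func (s *fakeStore) RecordHeartbeat(status string) { s.records++ }

// handler is a stripped-down stand-in: with the cooldown removed it always records.
type handler struct{ store *fakeStore }

func (h *handler) Heartbeat(status string) { h.store.RecordHeartbeat(status) }

func TestHeartbeat_SecondHeartbeatWithSameStatusIsPersisted(t *testing.T) {
	store := &fakeStore{}
	h := &handler{store: store}

	h.Heartbeat("ACTIVE")
	h.Heartbeat("ACTIVE") // same status; previously this could be dropped by the cooldown

	if store.records != 2 {
		t.Fatalf("expected 2 recorded heartbeats, got %d", store.records)
	}
}
```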

Potential risks

  • Higher etcd write load for heartbeats: we now persist every heartbeat instead of at most once per _heartbeatRefreshRate (see the rough estimate below).
  • Adopters without external rate limiting (not sure how relevant this is in practice) may see more frequent heartbeat writes than before; this change removes a local write throttle, so they need to rely on their own rate limiting.
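
As a rough estimate (using the ~1s development heartbeat interval mentioned above): with N executors, heartbeat writes go from at most one per executor per 2s (about N/2 writes per second under the old cooldown) to about N writes per second, i.e. roughly a 2x increase in heartbeat write rate.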

Release notes

Documentation Changes

Signed-off-by: Andreas Holt <6665487+AndreasHolt@users.noreply.github.com>

@jakobht jakobht left a comment

Looks great, tested locally and it indeed looks like it fixes the problem.

Thanks again for the great investigation, it's not easy finding "ghost behaviour" like this in distributed systems :)

@jakobht jakobht merged commit efe044d into cadence-workflow:master Dec 10, 2025
80 of 81 checks passed
