@gmarciani
Contributor

@gmarciani gmarciani commented Dec 16, 2025

Description of changes

Fix intermittent Image Builder failures on Ubuntu 22.04 and 24.04 where the build fails after the reboot step with SSM agent connectivity issues.

Failure

The image build intermittently fails on Ubuntu 22.04 and 24.04 because the reboot of the build instance fails.

Root Cause

The failures are caused by two versions of ssm-agent being mounted at the same time during the build. Both versions end up on the system because snap auto-refresh runs during the AMI build process and updates the SSM agent (installed via snap) in the background. When a reboot occurs while the snap refresh is in progress, or after it has left the system in a transitional state (multiple snap revisions mounted), the SSM agent fails to connect to SSM, so SSM marks the reboot as failed.
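The transitional state described above can be spotted by counting how many revisions of a snap are mounted. The following is a minimal sketch, not part of this change: `count_mounted_revisions` is a hypothetical helper, and it assumes that snap revisions are mounted under `/snap/<name>/<revision>` and listed in `/proc/mounts`.

```shell
# Hypothetical helper: count the distinct revisions of a snap that are mounted.
# $1: snap name (for example amazon-ssm-agent)
# $2: mounts table to read (assumed default: /proc/mounts)
count_mounted_revisions() {
    awk -v prefix="/snap/$1/" '
        # Field 2 of the mounts table is the mount point.
        index($2, prefix) == 1 {
            rev = substr($2, length(prefix) + 1)
            sub(/\/.*/, "", rev)   # keep only the revision path component
            seen[rev] = 1
        }
        END { n = 0; for (r in seen) n++; print n }
    ' "${2:-/proc/mounts}"
}
```

Running `count_mounted_revisions amazon-ssm-agent` before the reboot step and getting a value greater than 1 would suggest that a refresh is in progress or has left more than one revision mounted.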

Interesting details

The commands below were run on an instance running the AMI used as the parent image for the failed build.
They show that:

  • ssm-agent is installed via snap, so it can be updated in the background when the snap auto-refresh occurs
root@ip-172-31-42-119:~# snap list
Name              Version        Rev    Tracking         Publisher   Notes
amazon-ssm-agent  3.3.2299.0     11797  latest/stable/…  aws✓        classic
core20            20250822       2682   latest/stable    canonical✓  base
core22            20251009       2163   latest/stable    canonical✓  base
lxd               5.0.5-68251b5  36918  5.0/stable/…     canonical✓  -
snapd             2.72           25577  latest/stable    canonical✓  snapd
  • the snap auto-refresh is not managed by a systemd timer, but internally by the snap auto-refresh mechanism
root@ip-172-31-42-119:~# systemctl list-timers | grep snap
root@ip-172-31-42-119:~# snap get system refresh.timer
error: snap "core" has no "refresh.timer" configuration option
  • the snap refresh is scheduled to occur 4 times a day, and the first refresh could happen as soon as the pcluster build starts
root@ip-172-31-42-119:~# snap refresh --time
timer: 00:00~24:00/4
last: n/a
hold: today at 06:07 UTC
next: n/a

root@ip-172-31-42-119:~# snap get system -d
{
        "cloud": {
                "availability-zone": "us-east-1a",
                "name": "aws",
                "region": "us-east-1"
        },
        "refresh": {
                "hold": "2025-12-16T06:07:39.807634591Z"
        },
        "seed": {
                "loaded": true
        },
        "system": {
                "hostname": "ip-172-31-42-119",
                "network": {},
                "timezone": "UTC"
        }
}
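To check the refresh schedule from a script rather than by eye, the `snap refresh --time` output shown above can be parsed field by field. This is a small sketch under the assumption that the output keeps the `key: value` layout seen in the transcript; `refresh_field` is a hypothetical helper name.

```shell
# Hypothetical helper: print the value of one field from `snap refresh --time`.
# Reads the command output on stdin.
# $1: field name (timer, last, hold or next)
# Exits non-zero when the field is absent.
refresh_field() {
    awk -v key="$1" '
        # Match lines that start with "<key>: " and print the remainder.
        index($0, key ": ") == 1 {
            print substr($0, length(key) + 3)
            found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

For example, `snap refresh --time | refresh_field hold` prints the hold value when a hold is set, and fails when the `hold` line is missing.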

Tests

  • Verified fix on Ubuntu 22.04 and Ubuntu 24.04 builds that were previously failing
  • Confirmed snap refresh hold is applied and removed correctly
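
The hold-and-release pattern verified above can be sketched as follows. This is an illustration only and not necessarily how the change implements it: it assumes a snapd version that supports `snap refresh --hold` and `snap refresh --unhold` (2.58 or later) and root privileges. `SNAP_CMD` and the function names are hypothetical, and `SNAP_CMD` exists only so the sketch can be dry-run.

```shell
# Hypothetical override so the sketch can be dry-run (e.g. SNAP_CMD=echo).
SNAP_CMD="${SNAP_CMD:-snap}"

# Hold auto-refresh for all snaps; without a duration the hold is indefinite.
hold_snap_refresh() {
    $SNAP_CMD refresh --hold
}

# Remove the hold so that auto-refresh resumes.
release_snap_refresh() {
    $SNAP_CMD refresh --unhold
}

# Run a command with refreshes held, then release the hold and
# return the exit status of the command.
run_with_refresh_held() {
    hold_snap_refresh || return 1
    status=0
    "$@" || status=$?
    release_snap_refresh
    return "$status"
}
```

In an AMI build that reboots between steps, the hold and the release would more likely run as separate build steps, one at the start and one at the end, instead of wrapping a single command.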

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov

codecov bot commented Dec 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.22%. Comparing base (500faa0) to head (b82487f).
⚠️ Report is 73 commits behind head on release-3.14.

Additional details and impacted files
@@               Coverage Diff                @@
##           release-3.14    #7153      +/-   ##
================================================
+ Coverage         90.18%   90.22%   +0.03%     
================================================
  Files               182      183       +1     
  Lines             16472    16543      +71     
================================================
+ Hits              14856    14926      +70     
- Misses             1616     1617       +1     
Flag Coverage Δ
unittests 90.22% <ø> (+0.03%) ⬆️


…ground updates.

This is to prevent side effects, such as reboot failures caused by a background update of the SSM agent.
@gmarciani gmarciani force-pushed the wip/mgiacomo/3141/build-ubuntu-disable-snap-refresh-1216-1 branch from b82487f to 3f11daa Compare December 16, 2025 18:15
@gmarciani gmarciani changed the title [Build] Ubuntu: disable snap refresh during AMI build to prevent back… [Build] Fix Image Builder reboot failures on Ubuntu 22.04/24.04 by holding snap refreshes during build Dec 16, 2025
@gmarciani gmarciani changed the title [Build] Fix Image Builder reboot failures on Ubuntu 22.04/24.04 by holding snap refreshes during build [Build] Fix Image Builder reboot failures on Ubuntu by holding snap refreshes during build Dec 16, 2025
@gmarciani
Contributor Author

Closed in favor of #7159

@gmarciani gmarciani closed this Dec 18, 2025