@gmarciani
Contributor

Description of changes

  • Describe what you're changing and why you're doing these changes.

Tests

  • Describe the automated and/or manual tests executed to validate the patch.
  • Describe the added/modified tests.

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

github-actions bot and others added 30 commits September 17, 2025 19:38
Co-authored-by: hgreebe <141743196+hgreebe@users.noreply.github.com>
Ubuntu Server uses systemd-networkd by default. Installing ubuntu-desktop (as part of the DCV installation) also installs NetworkManager, which is more complex (it adds WiFi capabilities) and takes over control of the network interfaces. systemd-networkd then finds no interface that it manages and delays the boot by 2 minutes while waiting for one.

This commit instructs netplan to use systemd-networkd to manage network interfaces. The code is added at the end of the DCV installation because the mitigation is strictly related to the installation of ubuntu-desktop. Always using systemd-networkd also makes ParallelCluster handle single-NIC and multi-NIC instances consistently: for multi-NIC instances, ParallelCluster already instructs netplan to use systemd-networkd ([code](https://github.com/aws/aws-parallelcluster-cookbook/blob/develop/cookbooks/aws-parallelcluster-environment/files/ubuntu/network_interfaces/configure_nw_interface.sh#L62)).
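
As a rough illustration of what "instructing netplan to use systemd-networkd" amounts to, the sketch below writes a netplan document whose `renderer` key is set to `networkd`. The drop-in file name and the helper names are assumptions made for this example; the cookbook may write a different file or use a shell script instead.

```python
import tempfile
from pathlib import Path

# Hypothetical drop-in name; the real cookbook may use a different file.
NETPLAN_DROP_IN = "99-use-networkd.yaml"


def render_netplan_networkd() -> str:
    """Return a netplan document that selects systemd-networkd as the renderer."""
    return "network:\n  version: 2\n  renderer: networkd\n"


def write_netplan_drop_in(netplan_dir: Path) -> Path:
    """Write the drop-in into netplan_dir and return its path."""
    netplan_dir.mkdir(parents=True, exist_ok=True)
    target = netplan_dir / NETPLAN_DROP_IN
    target.write_text(render_netplan_networkd())
    # Restrictive permissions, since netplan files can hold sensitive settings.
    target.chmod(0o600)
    return target


if __name__ == "__main__":
    # Write into a temporary directory so the sketch is safe to run anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_netplan_drop_in(Path(tmp)).read_text(), end="")
```

On a real host the directory would be `/etc/netplan`, and the change only takes effect once netplan regenerates its backend configuration (for example with `netplan apply`).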

# Technical details:
## Output of `networkctl list`
### Prior to this commit
Base Ubuntu:
```
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 ens5 ether    routable    configured

2 links listed.
```
Ubuntu with ubuntu-desktop
```
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 ens5 ether    routable    unmanaged

2 links listed.
```
systemd-networkd got confused because it saw that no network interface was set up (NetworkManager had taken over control of all network interfaces), so it waited for the 2-minute timeout at the beginning of system boot:
```
$ journalctl -b | grep -i "ipv6\|timeout\|waiting"
...
Sep 18 14:51:23 systemd-networkd-wait-online[1602]: Timeout occurred while waiting for network connectivity.
...
Sep 18 14:53:31 systemd-networkd-wait-online[1891]: Timeout occurred while waiting for network connectivity.
...
```
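
The difference between the two `networkctl list` outputs above can be checked mechanically. The following is a minimal sketch, assuming the five-column layout shown in this description; the function name is made up for the example.

```python
def unmanaged_links(networkctl_output: str) -> list[str]:
    """Return non-loopback links whose SETUP column reads 'unmanaged'."""
    names = []
    for line in networkctl_output.splitlines():
        fields = line.split()
        # Data rows have five columns and start with a numeric index;
        # this skips the header, blank lines, and the "N links listed." footer.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        _idx, link, link_type, _operational, setup = fields
        if link_type != "loopback" and setup == "unmanaged":
            names.append(link)
    return names


BASE_UBUNTU = """\
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 ens5 ether    routable    configured

2 links listed.
"""

WITH_DESKTOP = """\
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 ens5 ether    routable    unmanaged

2 links listed.
"""

if __name__ == "__main__":
    print(unmanaged_links(BASE_UBUNTU))   # []
    print(unmanaged_links(WITH_DESKTOP))  # ['ens5']
```
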
### After this commit
Ubuntu with ubuntu-desktop has the same output as Base Ubuntu, and the delay is gone.
Signed-off-by: Hanwen <hanwenli@amazon.com>
* We remove the creation of the `/opt/parallelcluster/shared/nvidia-imex` directory.
* We keep the default paths `/etc/nvidia-imex/nodes_config.cfg` and `/etc/nvidia-imex/config.cfg` for the IMEX configuration.
* We override `/etc/nvidia-imex/nodes_config.cfg` only if it is missing, to avoid IMEX start failures.
* Update unit test
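
The "override only if missing" rule can be sketched as a create-if-absent helper. This is an illustration only: the function name and the default content are assumptions, and the cookbook implements the behavior in a Chef recipe rather than in Python.

```python
from pathlib import Path


def ensure_nodes_config(path: Path, default_content: str = "") -> bool:
    """Create the file only when it is missing.

    Returns True if the file was created, False if an existing file was
    left untouched.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(default_content)
    return True
```

Calling `ensure_nodes_config(Path("/etc/nvidia-imex/nodes_config.cfg"))` twice would create the file on the first call and leave it alone on the second, so a configuration already present on the node is never clobbered.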

Co-authored-by: Himani Anil Deshpande <himanidp@amazon.com>
…ws#3011)

* [Isolated] Install cfn-dependencies only for AL2

* Revert "[Isolated] Install cfn-dependencies only for AL2"

This reverts commit 976d479.

* [Isolated] Use latest cfn-dependencies

* [Isolated] Using Git REF for Uploading cookbook

* [Isolated] Rename the cfn-dependencies files

* [Isolated] Change the name of Cookbook Dependencies and the folder name inside the Tar

* [Isolated] Change the name of CFN Dependencies and the folder name inside the Tar

* [Isolated] Install Cfn-bootstrap using `--no-build-isolation`, because 3.12.8 uses a setup.py-based installation that otherwise runs an isolated build instead of looking at the existing site-packages

* [Isolated] Install efs-proxy cargo dependencies for isolated environment

* [Isolated] Install new node pypi dependencies and move efs-proxy installation

* [Isolated] Only install efs-proxy deps when in adc regions

* [Isolated] Only install efs-proxy-deps in adc

* [Isolated] Fix unit tests

* [Isolated] Test python packages are installed when in an ADC region
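
For context on the `--no-build-isolation` item above: by default pip builds a source package in a fresh, isolated environment and fetches the build dependencies into it, which cannot work without network access. The flag makes pip reuse the build tools already present in site-packages. A minimal sketch of assembling such a command follows; the helper name and the archive path are hypothetical.

```python
import sys


def pip_install_command(package_path: str, python: str = sys.executable) -> list[str]:
    """Build a pip command that reuses the build tools already installed.

    package_path is a hypothetical local archive; nothing is executed here.
    """
    return [python, "-m", "pip", "install", "--no-build-isolation", package_path]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(pip_install_command("./cfn-bootstrap.tar.gz")))
```

The list form can be handed directly to `subprocess.run(..., check=True)` once the archive actually exists.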

---------

Co-authored-by: Himani Anil Deshpande <himanidp@amazon.com>
…so affected by false positive.

Rule Description: Exception class with `__init__` should pass all args to `super().__init__()` in order to work with `copy.copy()`.

False Positive: PyCQA/flake8-bugbear#525
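
To make the rule description concrete, the sketch below shows why an exception whose `__init__` does not forward all arguments breaks `copy.copy()`: copying rebuilds the exception by calling the class with `self.args`, which only holds what was passed to `super().__init__()`. The class names are invented for the example.

```python
import copy


class LossyError(Exception):
    """Forwards only one of its two arguments to Exception.__init__."""

    def __init__(self, code, message):
        super().__init__(message)  # args == (message,)
        self.code = code


class FaithfulError(Exception):
    """Forwards every argument, so args mirrors the constructor signature."""

    def __init__(self, code, message):
        super().__init__(code, message)  # args == (code, message)
        self.code = code


if __name__ == "__main__":
    try:
        copy.copy(LossyError(1, "boom"))
    except TypeError as exc:
        # The copy calls LossyError("boom"), which lacks the second argument.
        print("copy failed:", exc)

    duplicate = copy.copy(FaithfulError(1, "boom"))
    print(duplicate.code, duplicate.args)
```
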
…abled` to disable in-place updates on compute and login nodes by disabling cfn-hup on those nodes.

As a consequence, it also disables the cluster readiness checks executed by the head node on cluster update.

Disabling cfn-hup mitigates a significant performance degradation that may occur with tightly coupled workloads at scale.
…ion of NVIDIA driver, if the module is available on the kernel.

Starting with kernel `5.14.0-611`, some DRM symbols required by the NVIDIA driver are exported by new client modules.
Himani Anil Deshpande and others added 7 commits November 20, 2025 13:42
* Fix DCV on Ubuntu 22.04+ on DLAMI by disabling Wayland

Disable Wayland protocol in GDM3 for Ubuntu 22.04+ to force the use of Xorg on GPU instances running without a display. Ubuntu 22.04+ defaults to Wayland which causes GDM startup issues with NVIDIA drivers and NICE DCV. Force Xorg by setting `WaylandEnable=false` in `/etc/gdm3/custom.conf`.

* Add kitchen test to check if GDM is using X11 session type
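
A minimal sketch of the `WaylandEnable=false` change, using Python's `configparser` on the text of a GDM configuration file. It relies on `WaylandEnable` living in the `[daemon]` section of `custom.conf`; the function name is an assumption, and the cookbook applies the setting through Chef rather than through this code.

```python
import configparser
import io


def disable_wayland(custom_conf_text: str) -> str:
    """Return the config text with WaylandEnable=false under [daemon]."""
    parser = configparser.ConfigParser(interpolation=None)
    # Preserve key case; configparser lowercases option names by default.
    parser.optionxform = str
    parser.read_string(custom_conf_text)
    if not parser.has_section("daemon"):
        parser.add_section("daemon")
    parser.set("daemon", "WaylandEnable", "false")
    out = io.StringIO()
    # Emit key=value without spaces, matching the stock file's style.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    stock = "[daemon]\n#WaylandEnable=false\n\n[security]\n"
    print(disable_wayland(stock), end="")
```

One trade-off of this approach is that `configparser` discards comments when rewriting the file, so a line-oriented edit may be preferable where the original comments must be kept.
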
and fix race condition making compute node deploy wrong cluster config version on update failure.

Ensure clustermgtd is running after an update completes, regardless of
whether the update succeeded or failed.

On success, restart clustermgtd unconditionally at the end of the update recipe,
regardless of whether the update includes queue changes.

On failure on the head node, execute recovery actions:
  - Clean up DNA files shared with compute nodes to prevent them from
    deploying a config version that is about to be rolled back
  - Restart clustermgtd if scontrol reconfigure succeeded, ensuring
    cluster management resumes after update/rollback failures
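
The decision logic described above for the head node can be summarized as a small pure function. This is only a sketch of the flow as stated in the commit message; the function name and the action labels are invented, and the real logic lives in the cookbook's update recipe.

```python
def head_node_post_update_actions(update_succeeded: bool,
                                  reconfigure_succeeded: bool) -> list[str]:
    """Return the ordered actions the head node takes after an update attempt."""
    if update_succeeded:
        # Success: always restart, whether or not queues changed.
        return ["restart_clustermgtd"]
    # Failure: stop compute nodes from picking up a config about to be rolled back.
    actions = ["cleanup_shared_dna_files"]
    if reconfigure_succeeded:
        # Only resume cluster management if scontrol reconfigure went through.
        actions.append("restart_clustermgtd")
    return actions


if __name__ == "__main__":
    for succeeded, reconfigured in [(True, True), (False, True), (False, False)]:
        print(succeeded, reconfigured,
              head_node_post_update_actions(succeeded, reconfigured))
```
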
…eck (aws#3067)

* Do not consider missing records as a cluster readiness check failure

(cherry picked from commit 75c5867)

* Update CHANGELOG

(cherry picked from commit 94943dd)

* Add note that missing records don't cause failure

(cherry picked from commit 16ad89f)
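
One way to picture "missing records are not a failure" is a check that flags only nodes reporting a wrong config version and ignores nodes with no record at all. The data model below (a mapping from node name to reported version, with `None` for a missing record) is an assumption for illustration, not the actual readiness check implementation.

```python
from typing import Optional


def nodes_with_wrong_version(expected: str,
                             records: dict[str, Optional[str]]) -> list[str]:
    """Return nodes that reported a version different from the expected one.

    A value of None stands for a missing record and is not counted.
    """
    return sorted(
        node for node, version in records.items()
        if version is not None and version != expected
    )


def cluster_is_ready(expected: str, records: dict[str, Optional[str]]) -> bool:
    """The check passes when no node reports a wrong version."""
    return not nodes_with_wrong_version(expected, records)
```

With `records = {"node-1": "v2", "node-2": None}` and `expected = "v2"`, the check passes, because the missing record for `node-2` is skipped rather than treated as a mismatch.
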
… CloudWatch Agent (aws#3068)

The CloudWatch Agent configuration was using the `default` timestamp format (%Y-%m-%d %H:%M:%S,%f) for chef-client.log, but Chef/Cinc outputs ISO 8601 timestamps in format: `[YYYY-MM-DDTHH:MM:SS+TZ]`.

This mismatch caused CloudWatch to fail parsing timestamps, resulting in log lines being associated with incorrect timestamps.

- Add new 'chef' timestamp format: `[%Y-%m-%dT%H:%M:%S`
  (Note: CloudWatch Agent's %z only supports timezone without colon like -0700, but Chef outputs +02:00 format. We only match up to seconds and let CloudWatch handle the rest.)
- Update chef-client.log configuration to use the new 'chef' format
- Rename timestamp format keys to use consistent naming convention (iso8610, default_seconds)
- Update CloudWatch agent config to use iso8610 format for JSON event logs (clustermgtd, slurm_resume)
- Consolidate Slurm log timestamp formats (slurmd, slurmctld, slurmdbd) to use iso8610
- Update SSSD log timestamp format from default to default_seconds for consistency
- Change DCV authenticator log format from bracket_default to default
- Add millisecond precision to PS4 prompt in generate_ssh_key.sh for better debug logging
- Add millisecond precision to pcluster_dcv_connect.sh log timestamps for improved log accuracy
- Improves log parsing consistency and debugging capabilities across all services
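
The mismatch between the `default` format and Chef's timestamps can be reproduced with Python's `strptime`, used here only as a stand-in for the CloudWatch Agent's own parser. The sample log line is made up; the two format strings are the ones quoted above.

```python
from datetime import datetime

DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
CHEF_FORMAT = "[%Y-%m-%dT%H:%M:%S"
# Length of "[YYYY-MM-DDTHH:MM:SS": match up to the seconds, ignore the offset.
CHEF_PREFIX_LEN = len("[YYYY-MM-DDTHH:MM:SS")


def parse_chef_timestamp(line: str) -> datetime:
    """Parse the leading timestamp of a chef-client.log line, up to seconds."""
    return datetime.strptime(line[:CHEF_PREFIX_LEN], CHEF_FORMAT)


def matches(text: str, fmt: str) -> bool:
    """Return True if text parses completely with the given format."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical log line in the format Chef/Cinc emits.
    line = "[2025-11-20T13:42:07+02:00] INFO: Starting Chef run"
    print(matches(line[:23], DEFAULT_FORMAT))  # the old format does not fit
    print(parse_chef_timestamp(line))
```

Truncating at the seconds sidesteps the `+02:00` offset entirely, which mirrors the choice made for the new `chef` format.
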
5 participants