Wip/mgiacomo/3150/fix cw log timestamp 1215 1 #3071
Draft
gmarciani
wants to merge
37
commits into
aws:release-3.14
from
gmarciani:wip/mgiacomo/3150/fix-cw-log-timestamp-1215-1
+1,408
−225
Co-authored-by: hgreebe <141743196+hgreebe@users.noreply.github.com>
systemd-networkd is used by default with Ubuntu Server. Installing ubuntu-desktop (as part of DCV installation) installs NetworkManager. NetworkManager is more complex (with WiFi capabilities) and confuses systemd-networkd. When systemd-networkd is confused, it delays the boot by 2 minutes.

This commit instructs NetPlan to use systemd-networkd to manage network interfaces. The code is added at the end of DCV installation because the mitigation is strictly related to the installation of ubuntu-desktop.

Always using systemd-networkd also improves consistency between how ParallelCluster handles single-nic instances vs multi-nic instances. With multi-nic instances ParallelCluster has been instructing netplan to use systemd-networkd ([code](https://github.com/aws/aws-parallelcluster-cookbook/blob/develop/cookbooks/aws-parallelcluster-environment/files/ubuntu/network_interfaces/configure_nw_interface.sh#L62)).

# Technical details:

## Output of `networkctl list`

### Prior to this commit

Base Ubuntu:

```
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 ens5 ether    routable    configured

2 links listed.
```

Ubuntu with ubuntu-desktop:

```
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 ens5 ether    routable    unmanaged

2 links listed.
```

systemd-networkd got confused because it saw that no network interface was set up (NetworkManager had taken control of all network interfaces) and waited until the 2-minute timeout at the beginning of system boot:

```
$ journalctl -b | grep -i "ipv6\|timeout\|waiting"
...
Sep 18 14:51:23 systemd-networkd-wait-online[1602]: Timeout occurred while waiting for network connectivity.
...
Sep 18 14:53:31 systemd-networkd-wait-online[1891]: Timeout occurred while waiting for network connectivity.
...
```

### After this commit

Ubuntu with ubuntu-desktop has the same output as Base Ubuntu and the delay is gone.

Signed-off-by: Hanwen <hanwenli@amazon.com>
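The symptom described in this commit message (a non-loopback link whose SETUP column reads `unmanaged`) can be checked mechanically. The following is only a minimal sketch, not part of the PR: the function name `unmanaged_links` is hypothetical, and it assumes the five-column `networkctl list` layout shown in the commit message.

```python
def unmanaged_links(networkctl_output: str) -> list:
    """Return non-loopback links whose SETUP column is 'unmanaged'."""
    links = []
    for line in networkctl_output.splitlines():
        fields = line.split()
        # Data rows have five columns and start with a numeric index;
        # this skips the header, blank lines, and the "N links listed." footer.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        _idx, link, link_type, _operational, setup = fields
        if link_type != "loopback" and setup == "unmanaged":
            links.append(link)
    return links


# The two outputs quoted in the commit message.
BASE_UBUNTU = """\
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 ens5 ether    routable    configured

2 links listed.
"""

WITH_UBUNTU_DESKTOP = """\
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 ens5 ether    routable    unmanaged

2 links listed.
"""

print(unmanaged_links(BASE_UBUNTU))
print(unmanaged_links(WITH_UBUNTU_DESKTOP))
```

On a live instance the input would come from running `networkctl list`; an empty result suggests that systemd-networkd manages every non-loopback interface.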
* Remove the creation of the `/opt/parallelcluster/shared/nvidia-imex` directory.
* Keep the default paths `/etc/nvidia-imex/nodes_config.cfg` and `/etc/nvidia-imex/config.cfg` for IMEX configuration.
* Override `/etc/nvidia-imex/nodes_config.cfg` only if it is missing, to avoid IMEX start failures.
* Update unit test.

Co-authored-by: Himani Anil Deshpande <himanidp@amazon.com>
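The "only if it is missing" rule for `nodes_config.cfg` amounts to a create-if-absent write. The sketch below illustrates that pattern only; the cookbook itself is not written in Python, the helper name `write_if_missing` is hypothetical, and the demo uses a temporary directory and a placeholder address rather than the real `/etc/nvidia-imex` path.

```python
import tempfile
from pathlib import Path


def write_if_missing(path: Path, content: str) -> bool:
    """Create `path` with `content` only when it does not exist yet.

    Returns True if the file was written, False if an existing file was kept.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "nvidia-imex" / "nodes_config.cfg"
    first = write_if_missing(cfg, "")   # file absent: an empty placeholder is created
    cfg.write_text("10.0.0.1\n")        # simulate later, real node entries (placeholder IP)
    second = write_if_missing(cfg, "")  # file present: left untouched
    preserved = cfg.read_text()

print(first, second, repr(preserved))
```

Keeping existing content intact is the point of the check: a file that already lists nodes is never replaced by an empty default.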
…ws#3011)

* [Isolated] Install cfn-dependencies only for AL2
* Revert "[Isolated] Install cfn-dependencies only for AL2"

  This reverts commit 976d479.
* [Isolated] Use latest cfn-dependencies
* [Isolated] Using Git REF for Uploading cookbook
* [Isolated] Rename the cfn-dependencies files
* [Isolated] Change the name of Cookbook Dependencies and the folder name inside the Tar
* [Isolated] Change the name of CFN Dependencies and the folder name inside the Tar
* [Isolated] Installing Cfn-bootstrap using `--no-build-isolation` as 3.12.8 uses setup.py based installation where it uses an isolated build instead of looking at existing site-packages
* [Isolated] Install efs-proxy cargo dependencies for isolated environment
* [Isolated] Install new node pypi dependencies and move efs-proxy installation
* [Isolated] Only install efs-proxy deps when in adc regions
* [Isolated] Only install efs-proxy-deps in adc
* [Isolated] Fix unit tests
* [Isolated] Test python packages are installed when in an ADC region

---------

Co-authored-by: Himani Anil Deshpande <himanidp@amazon.com>
… default value set
…so affected by false positive. Rule Description: Exception class with `__init__` should pass all args to `super().__init__()` in order to work with `copy.copy()`. False Positive: PyCQA/flake8-bugbear#525
… suffix where it was missing.
…abled` to disable in-place updates on compute and login nodes by disabling cfn-hup on those nodes. As a consequence, it also disables the cluster readiness checks executed by the head node on cluster update. Disabling cfn-hup mitigates a significant performance degradation that may occur with tightly coupled workloads at scale.
…ion of NVIDIA driver, if the module is available on the kernel. Starting with kernel `5.14.0-611`, some DRM symbols required by the NVIDIA driver are exported by new client modules.
…mmon rather than sssd.
* Fix DCV on Ubuntu 22.04+ on DLAMI by disabling Wayland

  Disable Wayland protocol in GDM3 for Ubuntu 22.04+ to force the use of Xorg on GPU instances running without a display. Ubuntu 22.04+ defaults to Wayland, which causes GDM startup issues with NVIDIA drivers and NICE DCV. Force Xorg by setting `WaylandEnable=false` in `/etc/gdm3/custom.conf`.
* Add kitchen test to check if GDM is using X11 session type
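The `WaylandEnable=false` setting can be expressed as an idempotent text transformation. This is a sketch under assumptions rather than the cookbook's actual implementation: the function name `disable_wayland` is hypothetical, it assumes the key belongs to the `[daemon]` section of `custom.conf`, and it assumes the key appears at most once, possibly commented out.

```python
import re

# Matches an active or commented-out WaylandEnable assignment.
_WAYLAND_RE = re.compile(r"^\s*#?\s*WaylandEnable\s*=.*$")
_SETTING = "WaylandEnable=false"


def disable_wayland(conf_text: str) -> str:
    """Return `conf_text` with WaylandEnable=false set (idempotent)."""
    lines = conf_text.splitlines()
    for i, line in enumerate(lines):
        if _WAYLAND_RE.match(line):
            lines[i] = _SETTING  # uncomment or overwrite the existing entry
            break
    else:
        # No existing entry: add it under [daemon], creating the section if needed.
        stripped = [entry.strip() for entry in lines]
        if "[daemon]" in stripped:
            lines.insert(stripped.index("[daemon]") + 1, _SETTING)
        else:
            lines += ["[daemon]", _SETTING]
    return "\n".join(lines) + "\n"


SAMPLE = "[daemon]\n#WaylandEnable=false\n\n[security]\n"
print(disable_wayland(SAMPLE))
```

Applying the function twice yields the same text as applying it once, so a configuration step built on it can be re-run safely.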
and fix race condition making compute node deploy wrong cluster config version on update failure.
Ensure clustermgtd is running after an update completes, regardless of
whether the update succeeded or failed.
On success, restart clustermgtd unconditionally at the end of the update recipe,
regardless of whether the update includes queue changes.
On failure on the head node, execute recovery actions:
- Clean up DNA files shared with compute nodes to prevent them from
deploying a config version that is about to be rolled back
- Restart clustermgtd if scontrol reconfigure succeeded, ensuring
cluster management resumes after update/rollback failures
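
The success and failure paths described in this commit message can be summarized as a small decision function. This is only an illustrative sketch for the head node: the function and action names are hypothetical and do not correspond to identifiers in the cookbook.

```python
def head_node_post_update_actions(
    update_succeeded: bool, scontrol_reconfigure_succeeded: bool
) -> list:
    """List the follow-up actions on the head node after a cluster update."""
    if update_succeeded:
        # Restart unconditionally, whether or not queues changed.
        return ["restart_clustermgtd"]

    # Failure: stop compute nodes from deploying a config version
    # that is about to be rolled back.
    actions = ["cleanup_shared_dna_files"]
    if scontrol_reconfigure_succeeded:
        # Resume cluster management after the failed update.
        actions.append("restart_clustermgtd")
    return actions


print(head_node_post_update_actions(True, True))
print(head_node_post_update_actions(False, True))
print(head_node_post_update_actions(False, False))
```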
… CloudWatch Agent (aws#3068)

The CloudWatch Agent configuration was using the `default` timestamp format (`%Y-%m-%d %H:%M:%S,%f`) for chef-client.log, but Chef/Cinc outputs ISO 8601 timestamps in the format `[YYYY-MM-DDTHH:MM:SS+TZ]`. This mismatch caused CloudWatch to fail parsing timestamps, resulting in log lines being associated with incorrect timestamps.

- Add new 'chef' timestamp format: `[%Y-%m-%dT%H:%M:%S`
  (Note: CloudWatch Agent's %z only supports timezone without colon like -0700, but Chef outputs +02:00 format. We only match up to seconds and let CloudWatch handle the rest.)
- Update chef-client.log configuration to use the new 'chef' format
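The format mismatch can be reproduced with Python's `datetime.strptime`, used here purely as a stand-in for the CloudWatch Agent's own parser, whose behavior may differ (Python's `%z`, for instance, does accept offsets with a colon). The sample log line and the helper `parse_prefix` are made up for illustration.

```python
from datetime import datetime

DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # the `default` format from the commit message
CHEF_FORMAT = "[%Y-%m-%dT%H:%M:%S"       # the new 'chef' format, matching up to seconds

# Prefix lengths covered by each format.
CHEF_PREFIX_LEN = len("[YYYY-MM-DDTHH:MM:SS")
DEFAULT_PREFIX_LEN = len("YYYY-MM-DD HH:MM:SS,ffffff")


def parse_prefix(line: str, fmt: str, prefix_len: int):
    """Parse the first `prefix_len` characters of `line`; None on mismatch."""
    try:
        return datetime.strptime(line[:prefix_len], fmt)
    except ValueError:
        return None


# Hypothetical line in the [YYYY-MM-DDTHH:MM:SS+TZ] shape described above.
LINE = "[2025-12-15T10:30:45+02:00] INFO: example message"

print(parse_prefix(LINE, DEFAULT_FORMAT, DEFAULT_PREFIX_LEN))  # mismatch
print(parse_prefix(LINE, CHEF_FORMAT, CHEF_PREFIX_LEN))        # parsed up to seconds
```

Matching only up to the seconds sidesteps the `+02:00` offset entirely, which is the trade-off the commit message describes.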
- Rename timestamp format keys to use consistent naming convention (iso8610, default_seconds)
- Update CloudWatch agent config to use iso8610 format for JSON event logs (clustermgtd, slurm_resume)
- Consolidate Slurm log timestamp formats (slurmd, slurmctld, slurmdbd) to use iso8610
- Update SSSD log timestamp format from default to default_seconds for consistency
- Change DCV authenticator log format from bracket_default to default
- Add millisecond precision to PS4 prompt in generate_ssh_key.sh for better debug logging
- Add millisecond precision to pcluster_dcv_connect.sh log timestamps for improved log accuracy
- Improves log parsing consistency and debugging capabilities across all services
Description of changes
Tests
References
Checklist
develop add the branch name as prefix in the PR title (e.g. [release-3.6]). Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.