Draft
37 commits
7d3ed57
Bump version to 3.15.0 (#3025)
github-actions[bot] Sep 17, 2025
84fe969
Update changelog to be inline with release notes (#3028)
hgreebe Sep 19, 2025
3850535
Instruct NetPlan to use systemd-networkd
hanwen-cluster Sep 19, 2025
8788fff
[Gb200] Support IMEX configuration to be local to a node (#3029)
himani2411 Sep 19, 2025
d52a3fd
[Isolated] Install Pypi dependencies for boto3 and cfn-bootstrap scripts
Apr 17, 2025
c5515b0
[Isolated] Update Pypi dependencies and install efs-proxy dependency …
hgreebe Aug 21, 2025
d4bd35a
[Isolated] Remove use of new_resource as this will be empty without a…
Oct 7, 2025
b62c816
[Isolated] Update unit tests
Oct 7, 2025
76f9a00
[Bug] Install cfn dependencies in all regions
Oct 16, 2025
b99018d
[Bug] Install node dependencies in all regions
Oct 16, 2025
468b742
[Bug] Install cookbook dependencies in all regions
Oct 16, 2025
2eace6f
[IMEX] Install Nvidia-imex in all regions
Oct 16, 2025
0113335
[CodeLinters] Disable Flake8 rule B042 as it is a minor, and it is al…
gmarciani Oct 28, 2025
c89f4a2
[CodeLinters] Addressed linter error about extra whitespaces.
gmarciani Oct 28, 2025
500d972
[Tools] In the utility to upload cookbook: include GitRef as artifact…
gmarciani Oct 24, 2025
6eda378
[Performance] Add chef attribute `cluster/in_place_update_on_fleet_en…
gmarciani Oct 28, 2025
c3c60a6
[SlurmDbd] Adding a message to make sure that we do not use # in Data…
Oct 31, 2025
b646045
[BuildImage] Load kernel module `drm_client_lib` before the installat…
gmarciani Nov 13, 2025
0bebf8a
[Docs] Created changelog entry for 3.14.1.
gmarciani Nov 14, 2025
97e1612
[Dependencies] Reduce dependency footprint by installing only sssd-co…
gmarciani Nov 4, 2025
4d1d0b4
Fix github system test on Ubuntu22 and 24
hanwen-cluster Nov 17, 2025
decf009
[GH] Update the version Bump workflow to mention how to run GH actions
Nov 18, 2025
bfbff9f
[EFS] Upgrade EFS utils from 2.3.1 to 2.4.0
Nov 11, 2025
cceeada
[EFS] Upgrade EFS and unit tests
Nov 11, 2025
7f39ed2
[PMIX] Upgrade PMIx from 5.0.6 to 5.0.9
Nov 11, 2025
5f281e6
[Libjwt] Upgrade libjwt from 1.17.0 to 1.18.4
Nov 11, 2025
1b7b21c
[Slurm] Upgrade Slurm from 24.11.6-1 to 24.11.7-1
Nov 11, 2025
6f59e39
[EfS-Utils] Add Go/GoLang which is efs-utils pre-requisite
Nov 12, 2025
a62e365
[LibJwt] Update libJWt version to v1.18.4 for all OS except for AL2
Nov 14, 2025
82f644e
[EFA] Upgrade EFA utils from 1.43.2 to 1.44.0
Nov 17, 2025
8140818
[Changelog] Update Changelog for 3.14.1
Nov 17, 2025
85a0bc7
[EFS] Add cmake and perl which are pre-requisite for efs-utils
Nov 17, 2025
48044c2
[develop] Fix DCV on Ubuntu 22.04+ by disabling Wayland (#3057)
hehe7318 Dec 5, 2025
8167f39
[UpdateWorkflow] Ensure clustermgtd runs after cluster update
gmarciani Dec 11, 2025
0854f39
Do not count missing records as a failure of the cluster readiness ch…
hgreebe Dec 15, 2025
695ff9a
[Develop][Bug] Fix incorrect timestamp parsing for chef-client.log in…
hehe7318 Dec 16, 2025
a50c468
[Logging] Standardize timestamp formats across log configurations
gmarciani Dec 16, 2025
3 changes: 3 additions & 0 deletions .flake8
@@ -15,6 +15,9 @@ ignore =
# B028: Consider replacing f"'{foo}'" with f"{foo!r}".
# Currently being disabled by flake8-bugbear. See https://github.com/PyCQA/flake8-bugbear/pull/333
B028
# B042: Exception class with `__init__` should pass all args to `super().__init__()` in order to work with `copy.copy()`.
# Affected by false positive, https://github.com/PyCQA/flake8-bugbear/issues/525
B042
exclude =
.tox,
.git,
1 change: 1 addition & 0 deletions .github/workflows/bump_version.yml
@@ -39,6 +39,7 @@ jobs:
title: 'Bump version to ${{ inputs.pcluster-version }}'
body: |
This PR contains version bump.
Please close and re-open the PR for Github Actions to run.
Auto-generated by Github Action
branch: versionbump${{ inputs.branch }}${{ inputs.pcluster-version }}
delete-branch: true
67 changes: 55 additions & 12 deletions CHANGELOG.md
@@ -3,22 +3,59 @@ aws-parallelcluster-cookbook CHANGELOG

This file is used to list changes made in each version of the AWS ParallelCluster cookbook.

3.15.0
------

3.14.1
------

**ENHANCEMENTS**
- Ensure clustermgtd runs after cluster update. On success, start it unconditionally. On failure, start it if the queue reconfiguration succeeded.

**CHANGES**
- Add chef attribute `cluster/in_place_update_on_fleet_enabled` to disable in-place updates on compute and login nodes
and achieve better performance at scale.
- Load kernel module `drm_client_lib` before installation of NVIDIA driver, if available on the kernel.
- Reduce dependency footprint by installing the package `sssd-common` rather than `sssd`.
- Disable Wayland protocol in GDM3 for Ubuntu 22.04+ to force the use of Xorg on GPU instances running without a display.
- Upgrade Slurm to version 24.11.7 (from 24.11.6).
- Upgrade PMIx to 5.0.9 (from 5.0.6).
- Upgrade libjwt to version 1.18.4 (from 1.17.0) for all OSs except Amazon Linux 2.
- Upgrade amazon-efs-utils to version 2.4.0 (from v2.3.1).
- Upgrade EFA installer to 1.44.0 (from 1.43.2).
- Efa-driver: efa-2.17.3-1
- Efa-config: efa-config-1.18-1
- Efa-profile: efa-profile-1.7-1
- Libfabric-aws: libfabric-aws-2.3.1-1
- Rdma-core: rdma-core-59.0-1
- Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.8-11

**BUG FIXES**
- Fix incorrect timestamp parsing for chef-client.log in CloudWatch Agent configuration.
- Prevent cluster readiness check failures due to instances launched while the check is in progress.
- Fix race condition where compute nodes could deploy the wrong cluster config version after an update failure.

3.14.0
------

**ENHANCEMENTS**
- Add support for P6e-GB200 instances. ParallelCluster sets up Slurm topology plugin to handle P6e-GB200 UltraServers. See limitations section for important additional setup requirements.
- Add support for P6-B200 instances for all OSs except AL2.
- Include drivers for P6e-GB200 and P6-B200 instances. ParallelCluster sets up Slurm topology plugin to handle P6e-GB200 UltraServers. See limitations section for important additional setup requirements.
- Support `prioritized` and `capacity-optimized-prioritized` Allocation Strategy. This allows users to prioritize subnets for instance placement to optimize costs and performance.
- Add `build-image` support for Amazon Linux 2023 AMIs based on kernel 6.12 (in addition to 6.1).
- Support DCV on Amazon Linux 2023.
- Echo chef-client logs in the instance console when a node fails to bootstrap. This helps with investigating bootstrap failures in cases CloudWatch logs are not available.

**LIMITATIONS**
- P6e-GB200 instances are only tested on Amazon Linux 2023, Ubuntu 22.04 and Ubuntu 24.04.
- Using IMEX on P6e-GB200 requires additional setup. Please refer to <PLACE_HOLDER for the tutorial link>.
- Using IMEX on P6e-GB200 requires additional setup. Please refer to the dedicated tutorial in our public documentation.
- P6-B200 instances are only tested on Amazon Linux 2023, RHEL9, Ubuntu 22.04 and Ubuntu 24.04.

**CHANGES**
- Install nvidia-imex for all OSs except AL2.
- Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.
- Install nvidia-imex for all OSs except Amazon Linux 2.
- Remove `UnkillableStepTimeout` from slurm.conf and let slurm set this value.
- Upgrade Python runtime used by Lambda functions to Python 3.12 (from 3.9). See Lambda Documentation for important information about Python 3.9 EOL: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html
- Support encryption of EFS file system used for the head node internal shared storage via a new configuration parameter `HeadNode/SharedStorageEfsSettings/Encrypted`
- Add validator that warns against using non GPU instances with DCV.
- Upgrade Slurm to version 24.11.6 (from 24.05.8).
- Upgrade EFA installer to 1.43.2 (from 1.41.0).
- Efa-driver: efa-2.17.2-1
@@ -28,20 +65,26 @@ This file is used to list changes made in each version of the AWS ParallelCluste
- Rdma-core: rdma-core-58.0-1
- Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
- Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
- Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.
- Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except AL2.
- Upgrade DCGM to version 4.4.1 (from 3.3.6) for all OSs except AL2.
- Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except AL2.
- Upgrade Python to 3.9.23 (from 3.9.20) for AL2.
- Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except Amazon Linux 2.
- Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except Amazon Linux 2.
- Upgrade DCGM to version 4.4.1 (from 3.3.6) for all OSs except Amazon Linux 2.
- Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except Amazon Linux 2.
- Upgrade Python to 3.9.23 (from 3.9.20) for Amazon Linux 2.
- Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).
- Upgrade DCV to version 2024.0-19030.
- Upgrade the official ParallelCluster Amazon Linux 2023 AMIs to kernel 6.12 (from 6.1).

**BUG FIXES**
- Fix a race condition in CloudWatch Agent startup that could cause nodes bootstrap failures.
- Fix cluster id mismatch issue by deleting the file `/var/spool/slurm.state/clustername` before configuring Slurm accounting.
- Prevent `build-image` stack deletion failures by deploying a global role that automatically deletes the `build-image` stack after images either succeed or fail the build.
The role is meant to exist even after the stack has been deleted. See https://github.com/aws/aws-parallelcluster/issues/5914.
- Fix an issue where Security Group validation failed when a rule contained both IPv4 ranges (IpRanges) and security group references (UserIdGroupPairs).
- Fix `build-image` failure on Rocky 9, occurring when the parent image does not ship the latest kernel version on the latest Rocky minor version.
- Fix cluster id mismatch issue which causes cluster update failures when slurm accounting is used.
- Fix a race condition in CloudWatch Agent startup that could cause node bootstrap failures.

**DEPRECATIONS**
- The configuration parameter `LoginNodes/Pools/Ssh/KeyName` has been deprecated, and it will be removed in future releases. The CLI now returns a warning message when it is used in the cluster configuration.
See https://github.com/aws/aws-parallelcluster/issues/6811.
- Ubuntu 20.04 is no longer supported.

3.13.2
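The 3.14.1 changelog entry on clustermgtd describes a two-branch rule: start the daemon unconditionally after a successful update, and after a failed update only if the queue reconfiguration succeeded. A minimal sketch of that rule follows. The helper name `start_clustermgtd?` and its two boolean parameters are assumptions made for illustration and are not taken from the cookbook.

```ruby
# Hypothetical helper, not part of the cookbook: decides whether clustermgtd
# should be started once a cluster update has finished.
def start_clustermgtd?(update_succeeded:, queue_reconfiguration_succeeded:)
  # On success, start unconditionally.
  return true if update_succeeded

  # On failure, start only if the queue reconfiguration went through.
  queue_reconfiguration_succeeded
end

# Example: the update failed, but the queues were already reconfigured.
puts start_clustermgtd?(update_succeeded: false, queue_reconfiguration_succeeded: true)
```

Under this reading, the only case in which clustermgtd stays stopped is a failed update whose queue reconfiguration also failed.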
4 changes: 2 additions & 2 deletions cookbooks/aws-parallelcluster-awsbatch/metadata.rb
@@ -7,12 +7,12 @@
issues_url 'https://github.com/aws/aws-parallelcluster/issues'
source_url 'https://github.com/aws/aws-parallelcluster-cookbook'
chef_version '>= 18'
-version '3.14.0'
+version '3.15.0'

depends 'iptables', '~> 8.0.0'
depends 'nfs', '~> 5.1.5'
depends 'line', '~> 4.5.21'
depends 'openssh', '~> 2.11.14'
depends 'yum', '~> 7.4.20'
depends 'yum-epel', '~> 5.0.8'
-depends 'aws-parallelcluster-shared', '~> 3.14.0'
+depends 'aws-parallelcluster-shared', '~> 3.15.0'
4 changes: 2 additions & 2 deletions cookbooks/aws-parallelcluster-computefleet/metadata.rb
@@ -7,6 +7,6 @@
issues_url 'https://github.com/aws/aws-parallelcluster-cookbook/issues'
source_url 'https://github.com/aws/aws-parallelcluster-cookbook'
chef_version '>= 18'
-version '3.14.0'
+version '3.15.0'

-depends 'aws-parallelcluster-shared', '~> 3.14.0'
+depends 'aws-parallelcluster-shared', '~> 3.15.0'
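Both metadata files pin `aws-parallelcluster-shared` with the pessimistic operator `~>`. As a rough illustration of which versions `~> 3.15.0` admits, the sketch below uses RubyGems' `Gem::Requirement` as a stand-in. Chef resolves cookbook constraints with its own classes, so treat this as an approximation of the same operator rather than the code path Chef runs.

```ruby
require 'rubygems'

# Stand-in for the cookbook constraint; Gem::Requirement is used here only
# because it implements the same pessimistic (~>) operator.
constraint = Gem::Requirement.new('~> 3.15.0')

# Check a few candidate versions of the shared cookbook.
matches = %w[3.14.0 3.15.0 3.15.4 3.16.0].to_h do |candidate|
  [candidate, constraint.satisfied_by?(Gem::Version.new(candidate))]
end

matches.each { |candidate, ok| puts "#{candidate}: #{ok}" }
```

With a three-part version on the right-hand side, only patch-level upgrades within 3.15 satisfy the constraint, which is why the `depends` line has to move together with every version bump.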
Original file line number Diff line number Diff line change
@@ -19,6 +19,34 @@

# TODO: once the pyenv Chef resource supports installing packages from a path (e.g. `pip install .`), convert the
# bash block to a recipe that uses the pyenv resource.
command = "pip install . --no-build-isolation"

dependency_package_name = "pypi-node-dependencies-#{node['cluster']['python-major-minor-version']}-#{node['kernel']['machine']}"
dependency_folder_name = dependency_package_name
if platform?('amazon') && node['platform_version'] == "2"
  dependency_package_name = "node-dependencies"
  dependency_folder_name = "node"
end

remote_file "#{node['cluster']['base_dir']}/node-dependencies.tgz" do
  source "#{node['cluster']['artifacts_s3_url']}/dependencies/PyPi/#{node['kernel']['machine']}/#{dependency_package_name}.tgz"
  mode '0644'
  retries 3
  retry_delay 5
  action :create_if_missing
end

bash 'pip install' do
  user 'root'
  group 'root'
  cwd "#{node['cluster']['base_dir']}"
  code <<-REQ
    set -e
    tar xzf node-dependencies.tgz
    cd #{dependency_folder_name}
    #{node_virtualenv_path}/bin/pip install * -f ./ --no-index
  REQ
end

bash "install custom aws-parallelcluster-node" do
cwd Chef::Config[:file_cache_path]
@@ -38,7 +66,7 @@
mkdir aws-parallelcluster-custom-node
tar -xzf aws-parallelcluster-node.tgz --directory aws-parallelcluster-custom-node
cd aws-parallelcluster-custom-node/*aws-parallelcluster-node*
-pip install .
+#{command}
deactivate
NODE
end
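The recipe hunk above picks a dependency tarball name and extraction folder per platform, then builds the download URL from the artifacts bucket. The naming rule can be sketched as plain Ruby outside Chef. The method names `node_dependency_names` and `node_dependency_source` and their keyword parameters are hypothetical and exist only for this illustration; the string formats are copied from the recipe.

```ruby
# Hypothetical helper mirroring the naming rule in the recipe: Amazon Linux 2
# uses the legacy "node-dependencies" tarball, every other platform uses a
# tarball named after the Python major.minor version and CPU architecture.
def node_dependency_names(platform:, platform_version:, python_major_minor:, arch:)
  if platform == 'amazon' && platform_version == '2'
    { package: 'node-dependencies', folder: 'node' }
  else
    name = "pypi-node-dependencies-#{python_major_minor}-#{arch}"
    { package: name, folder: name }
  end
end

# Hypothetical helper building the source URL used by the remote_file resource.
def node_dependency_source(artifacts_s3_url:, arch:, package:)
  "#{artifacts_s3_url}/dependencies/PyPi/#{arch}/#{package}.tgz"
end

# Example values are placeholders, not real bucket names or versions.
names = node_dependency_names(platform: 'ubuntu', platform_version: '22.04',
                              python_major_minor: '3.12', arch: 'x86_64')
puts node_dependency_source(artifacts_s3_url: 'https://example.invalid/artifacts',
                            arch: 'x86_64', package: names[:package])
```

Keeping the package name and the folder name as separate values matters only for Amazon Linux 2, where the tarball `node-dependencies.tgz` unpacks into a folder called `node`.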
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
require 'spec_helper'

describe 'aws-parallelcluster-computefleet::custom_parallelcluster_node' do
  for_all_oses do |platform, version|
    context "on #{platform}#{version}" do
      cached(:s3_url) { 's3://url' }
      cached(:base_dir) { 'base_dir' }
      cached(:arch) { 'x86_64' }
      cached(:region) { 'any-region' }
      cached(:python_version) { 'python_version' }
      cached(:dependency_pkg_name_suffix) do
        if platform == 'amazon' && version == '2'
          'node-dependencies'
        else
          "pypi-node-dependencies-#{python_version}-#{arch}"
        end
      end
      cached(:dependency_folder_name_suffix) do
        if platform == 'amazon' && version == '2'
          "node"
        else
          dependency_pkg_name_suffix
        end
      end
      cached(:virtualenv_path) { "#{base_dir}/pyenv/versions/#{python_version}/envs/node_virtualenv" }
      cached(:cookbook_virtualenv_path) { "#{base_dir}/pyenv/versions/#{python_version}/envs/cookbook_virtualenv" }
      cached(:custom_node_s3_url) { "#{s3_url}/pyenv/versions/#{python_version}/envs/node_virtualenv" }
      cached(:pip_install_bash_code) do
        <<-REQ
        set -e
        tar xzf node-dependencies.tgz
        cd #{dependency_folder_name_suffix}
        #{virtualenv_path}/bin/pip install * -f ./ --no-index
        REQ
      end
      cached(:node_bash_code) do
        <<-NODE
        set -e
        [[ ":$PATH:" != *":/usr/local/bin:"* ]] && PATH="/usr/local/bin:${PATH}"
        echo "PATH is $PATH"
        source #{virtualenv_path}/bin/activate
        pip uninstall --yes aws-parallelcluster-node
        if [[ "#{custom_node_s3_url}" =~ ^s3:// ]]; then
          custom_package_url=$(#{cookbook_virtualenv_path}/bin/aws s3 presign #{custom_node_s3_url} --region #{region})
        else
          custom_package_url=#{custom_node_s3_url}
        fi
        curl --retry 3 -L -o aws-parallelcluster-node.tgz ${custom_package_url}
        rm -fr aws-parallelcluster-custom-node
        mkdir aws-parallelcluster-custom-node
        tar -xzf aws-parallelcluster-node.tgz --directory aws-parallelcluster-custom-node
        cd aws-parallelcluster-custom-node/*aws-parallelcluster-node*
        pip install . --no-build-isolation
        deactivate
        NODE
      end
      cached(:chef_run) do
        runner = runner(platform: platform, version: version) do |node|
          node.override['kernel']['machine'] = arch
          node.override['cluster']['python-major-minor-version'] = python_version
          node.override['cluster']['python-version'] = python_version
          node.override['cluster']['base_dir'] = base_dir
          node.override['cluster']['region'] = region
          node.override['cluster']['artifacts_s3_url'] = s3_url
          node.override['cluster']['custom_node_package'] = custom_node_s3_url
        end
        allow(File).to receive(:exist?).with("#{virtualenv_path}/bin/activate").and_return(true)
        runner.converge(described_recipe)
      end

      it 'downloads tarball' do
        is_expected.to create_if_missing_remote_file("base_dir/node-dependencies.tgz")
          .with(source: "#{s3_url}/dependencies/PyPi/#{arch}/#{dependency_pkg_name_suffix}.tgz")
          .with(mode: '0644')
          .with(retries: 3)
          .with(retry_delay: 5)
      end

      it 'pip installs' do
        is_expected.to run_bash('pip install')
          .with(cwd: base_dir)
          .with(code: pip_install_bash_code.gsub(/^ /, ' '))
      end

      it 'install custom aws-parallelcluster-node' do
        is_expected.to run_bash('install custom aws-parallelcluster-node')
          .with(code: node_bash_code.gsub(/^ /, ' '))
      end
    end
  end
end
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# frozen_string_literal: true

#
# Copyright:: 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the
# License. A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
# limitations under the License.

module ErrorHandlers
  # Executes shell commands with retry logic and logging.
  class CommandRunner
    include Chef::Mixin::ShellOut

    DEFAULT_RETRIES = 10
    DEFAULT_RETRY_DELAY = 90
    DEFAULT_TIMEOUT = 30

    def initialize(log_prefix:)
      @log_prefix = log_prefix
    end

    def run_with_retries(command, description:, retries: DEFAULT_RETRIES, retry_delay: DEFAULT_RETRY_DELAY, timeout: DEFAULT_TIMEOUT)
      Chef::Log.info("#{@log_prefix} Executing: #{description}")
      max_attempts = retries + 1

      max_attempts.times do |attempt|
        attempt_num = attempt + 1
        Chef::Log.info("#{@log_prefix} Running command (attempt #{attempt_num}/#{max_attempts}): #{command}")
        result = shell_out(command, timeout: timeout)
        Chef::Log.info("#{@log_prefix} Command stdout: #{result.stdout}")
        Chef::Log.info("#{@log_prefix} Command stderr: #{result.stderr}")

        if result.exitstatus == 0
          Chef::Log.info("#{@log_prefix} Successfully executed: #{description}")
          return true
        end

        Chef::Log.warn("#{@log_prefix} Failed to #{description} (attempt #{attempt_num}/#{max_attempts})")

        if attempt_num < max_attempts
          Chef::Log.info("#{@log_prefix} Retrying in #{retry_delay} seconds...")
          sleep(retry_delay)
        end
      end

      Chef::Log.error("#{@log_prefix} Failed to #{description} after #{max_attempts} attempts")
      false
    end
  end
end
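The retry loop in `CommandRunner#run_with_retries` can be tried outside Chef with a plain-Ruby stand-in. The sketch below is an assumption-laden simplification, not the cookbook's code: the block replaces `shell_out` and must return an exit status, logging is omitted, and the `sleeper` parameter is a hypothetical addition so the example runs without real delays.

```ruby
# Plain-Ruby sketch of the retry loop. The block stands in for shell_out and
# returns an exit status; `sleeper` is injectable so no real waiting occurs.
def run_with_retries(retries:, retry_delay:, sleeper: ->(seconds) { sleep(seconds) })
  max_attempts = retries + 1 # the first attempt is not a retry

  max_attempts.times do |attempt|
    attempt_num = attempt + 1
    return true if yield(attempt_num).zero?

    # Wait only between attempts, never after the last one.
    sleeper.call(retry_delay) if attempt_num < max_attempts
  end

  false
end

# Fail on the first two attempts, then succeed; record the waits instead of sleeping.
delays = []
ok = run_with_retries(retries: 3, retry_delay: 90, sleeper: ->(seconds) { delays << seconds }) do |attempt_num|
  attempt_num < 3 ? 1 : 0
end
puts "succeeded=#{ok} waits=#{delays.inspect}"
```

With the class defaults (10 retries, 90-second delay, 30-second timeout), a command that never succeeds is attempted 11 times with 10 waits in between, so the call can block for roughly 900 seconds of sleeping plus up to 330 seconds of command timeouts before returning `false`.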