@hanwen-cluster
Contributor

Problem

It should be possible to create ParallelCluster clusters in a network without Internet access. However, cluster creation fails when all of the following are true:

  1. RHEL/Rocky
  2. x86 GPU instances for head node and/or login nodes
  3. DCV enabled

The failure can be seen in the chef-client log:

      ================================================================================
      Error executing action `install` on resource 'dnf_package[/opt/parallelcluster/sources/nice-dcv-2024.0-19030-el9-x86_64/nice-dcv-gl-2024.0.1096-1.el9.x86_64.rpm]'
      ================================================================================

      RuntimeError
      ------------
      dnf-helper.py had stderr/stdout output:

      Errors during downloading metadata for repository 'epel':
        - Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/mirrorlist?repo=epel-9&arch=x86_64 [Failed to connect to mirrors.fedoraproject.org port 443: Connection timed out]
      Error: Failed to download metadata for repo 'epel': Cannot prepare internal mirrorlist: Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/mirrorlist?repo=epel-9&arch=x86_64 [Failed to connect to mirrors.fedoraproject.org port 443: Connection timed out]
      Errors during downloading metadata for repository 'rhel-9-appstream-rhui-rpms':
        - Curl error (28): Timeout was reached for https://rhui.us-east-1.aws.ce.redhat.com/pulp/mirror/content/dist/rhel9/rhui/9/x86_64/appstream/os [Failed to connect to rhui.us-east-1.aws.ce.redhat.com port 443: Connection timed out]
      Error: Failed to download metadata for repo 'rhel-9-appstream-rhui-rpms': Cannot prepare internal mirrorlist: Curl error (28): Timeout was reached for https://rhui.us-east-1.aws.ce.redhat.com/pulp/mirror/content/dist/rhel9/rhui/9/x86_64/appstream/os [Failed to connect to rhui.us-east-1.aws.ce.redhat.com port 443: Connection timed out]
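To see at a glance which repositories dnf tried to reach, the log can be filtered for the "Errors during downloading metadata" lines shown above. The following is a minimal sketch, not part of this commit; the function name `failed_repos` is hypothetical, and it only assumes the message format visible in the excerpt.

```shell
# Print the unique repository IDs for which dnf could not download metadata.
# Reads log text from stdin and writes one repository ID per line.
failed_repos() {
  sed -n "s/.*Errors during downloading metadata for repository '\([^']*\)'.*/\1/p" | sort -u
}
```

For example, `failed_repos < /var/log/chef-client.log` (the log path is an assumption; adjust it to where chef-client writes its log on your node).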

Workaround

This commit creates a script to download any missing transitive dependencies of DCV GL. It also modifies the cookbook to install those dependencies and to use --disablerepo=* so that yum/dnf does not contact the Internet for repository metadata.
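The two phases described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script or cookbook change from this commit: the `DEPS_DIR` location and the function names are hypothetical, and the RPM path is taken from the error message above. It relies only on the standard dnf options `--downloadonly`, `--downloaddir` and `--disablerepo`.

```shell
# RPM path as it appears in the chef-client error above.
DCV_GL_RPM="${DCV_GL_RPM:-/opt/parallelcluster/sources/nice-dcv-2024.0-19030-el9-x86_64/nice-dcv-gl-2024.0.1096-1.el9.x86_64.rpm}"
# Hypothetical directory for the pre-downloaded dependency RPMs.
DEPS_DIR="${DEPS_DIR:-/opt/parallelcluster/sources/dcv-gl-deps}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Phase 1 (instance with Internet access): resolve the DCV GL RPM and
# download whatever is not installed yet, without installing anything.
download_deps() {
  run dnf install -y --downloadonly --downloaddir="$DEPS_DIR" "$DCV_GL_RPM"
}

# Phase 2 (no Internet access): install only from local files, with every
# repository disabled so dnf does not try to fetch repository metadata.
install_offline() {
  run dnf install -y --disablerepo='*' "$DEPS_DIR"/*.rpm "$DCV_GL_RPM"
}
```

Running with `DRY_RUN=1` prints the dnf commands instead of executing them, which is a cheap way to check the paths before baking an AMI.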

How to use the script:

  1. Launch an instance with an official ParallelCluster RHEL/Rocky AMI
  2. On the instance, run the script as root (e.g. ./fix_dcv_gl_offline_installation.gl)
  3. Create an image from the instance
  4. Use the created image as the CustomAmi (https://docs.aws.amazon.com/parallelcluster/latest/ug/Image-v3.html#yaml-Image-CustomAmi) when creating clusters
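Steps 3 and 4 can be done from the AWS CLI. The sketch below is one possible way, assuming the AWS CLI is configured; the instance ID, image name, AMI ID and function names are hypothetical placeholders.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 3: create an image from the instance and print the new AMI ID.
# $1 = instance ID, $2 = name for the new image.
create_ami() {
  run aws ec2 create-image --instance-id "$1" --name "$2" \
    --query ImageId --output text
}

# Step 4: print the Image section of a cluster configuration that
# points at the new AMI. $1 = OS (e.g. rhel9), $2 = AMI ID.
image_section() {
  printf 'Image:\n  Os: %s\n  CustomAmi: %s\n' "$1" "$2"
}
```

For example, `ami_id="$(create_ami i-0123456789abcdef0 dcv-gl-offline)"` followed by `image_section rhel9 "$ami_id"`. Wait until the image is available (for instance with `aws ec2 wait image-available --image-ids "$ami_id"`) before creating a cluster from it.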

Testing

The following test passes when the AMI produced by steps 1-3 is used as the CustomAmi:

test-suites:
  networking:
    test_cluster_networking.py::test_cluster_in_no_internet_subnet:
      dimensions:
        - regions: ["us-east-1"]
          instances: ["g5.xlarge"]
          oss: ["rhel9"]
          schedulers: ["slurm"]

Note

This commit should only be merged into integ-tests-3.14.0. A long-term fix for other branches will be done in the future.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster requested review from a team as code owners on December 18, 2025 19:37