

Checkpointless training on Amazon SageMaker HyperPod

Checkpointless training on Amazon SageMaker HyperPod eliminates disruptive checkpoint-restart cycles, maintaining forward training momentum despite failures and reducing recovery time from hours to minutes.

Key Features

  • In-Process Recovery: Recover from node failures in minutes without losing training progress by using redundant model copies stored in GPU memory
  • Fast Initialization: Accelerate training restarts by bypassing expensive communication (NCCL/Gloo) setup processes
  • Smart Data Caching: Pre-load and cache training data batches to eliminate delays when resuming training after failures
  • Built-in Redundancy: Leverage distributed optimizer instances for checkpointless recovery
  • NeMo Integration: Integrate seamlessly with PyTorch Lightning and the NVIDIA NeMo toolkit for large language model training

Getting Started Examples

Model   | Method             | Size | Nodes | Instance    | Accelerator | Recipe | Script
GPT OSS | Full finetune example | 120b | 16 | p5.48xlarge | GPU H100    | link   | link
GPT OSS | LoRA example          | 120b | 2  | p5.48xlarge | GPU H100    | link   | link
Llama3  | Pretrain example      | 70b  | 16 | p5.48xlarge | GPU H100    | link   | link
Llama3  | LoRA example          | 70b  | 2  | p5.48xlarge | GPU H100    | link   | link

User Guide

For comprehensive documentation, including installation steps, environment setup, configuration options, and detailed usage examples, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.

Quick Start Guide

Launch Training

HyperPod Recipe Launcher

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating k8s.yaml and config.yaml, then running the launch script:

bash launcher_scripts/gpt_oss/run_checkpointless_nemo_gpt_oss_120b_fine_tuning.sh
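
Before running the script, point the recipe at your cluster and data. A minimal sketch of the kind of config.yaml overrides involved is shown below; the key names here are illustrative assumptions rather than the recipes' verified schema, so consult the recipe files in your checkout for the exact fields.

# Illustrative config.yaml overrides -- key names are assumptions,
# not the recipes' verified schema; check your checkout for the exact fields.
cluster_type: k8s                      # HyperPod EKS cluster
instance_type: p5.48xlarge             # matches the examples table above
base_results_dir: /fsx/results         # hypothetical shared FSx path for logs and outputs
env_vars:
  TRAIN_DIR: /fsx/data/gpt_oss         # hypothetical location of cached training data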

Launch Using kubectl

Alternatively, you can deploy the training job directly using kubectl:

kubectl apply -f <path_to_config>.yaml
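
The YAML you apply is the job manifest your launcher or recipe produced. As a rough sketch of its shape, assuming the Kubeflow training operator commonly used on HyperPod EKS clusters, a PyTorchJob manifest looks like the following; the name, image, and command are placeholders, and this project's generated manifests may differ.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: checkpointless-llama3-70b      # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 16                     # node count from the examples table
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <training-image>  # placeholder: your training container
              command: ["bash", "-c", "<launch command>"]  # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 8    # 8x H100 per p5.48xlarge node

Once applied, the training operator creates one pod per replica, which you can then monitor as described below.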

Monitor Job Status

kubectl get pods                 # list training pods and their status
kubectl logs <pod-name>          # view a pod's logs; add -f to stream them

For detailed installation steps, environment setup, and configuration options, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.

Recommended Requirements

Component      | Requirement
Python         | >=3.12
PyTorch        | >=2.6.0
NeMo Toolkit   | 2.6.0rc0
CUDA           | 12.5+
Infrastructure | AWS HyperPod Kubernetes cluster
Storage        | Shared storage (FSx/NFS)

Security

See CONTRIBUTING for more information. Note: This repository is temporarily not accepting pull requests.

License

This project is licensed under the Apache-2.0 License.
