Amazon SageMaker HyperPod Slurm clusters now support specifying minimum capacity requirements with continuous provisioning

Amazon SageMaker HyperPod now supports minimum capacity requirements (MinCount) for clusters using Slurm orchestration with continuous provisioning. This feature allows you to specify the minimum number of instances that must be successfully provisioned before an instance group transitions to InService status, providing greater control over when your cluster becomes available for job scheduling.
This is particularly useful for distributed training workloads using frameworks such as PyTorch FSDP, Megatron-LM, or NVIDIA NeMo, where training jobs are configured with a fixed number of participating nodes and may not start efficiently or correctly with partial cluster capacity.
What to do
- Specify MinInstanceCount in the CreateCluster or UpdateCluster API request to set a minimum capacity threshold for an instance group.
- Monitor the instance group status until it transitions to InService and nodes become available for Slurm job scheduling.
- Review the Minimum capacity requirements (MinCount) documentation for more details.
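The steps above can be sketched in Python. This is a minimal illustration, not AWS's reference implementation. The helper names (build_instance_group, instance_group_statuses, wait_until_in_service), the instance type, the role ARN, and the S3 URI are all hypothetical. The check that the minimum does not exceed the target count is a local sanity check assumed here, not a documented API rule. Whether your installed SDK accepts MinInstanceCount depends on its version, so the actual boto3 calls are shown only as comments.

```python
import time
from typing import Any, Dict


def build_instance_group(name: str, instance_type: str, target_count: int,
                         min_count: int, execution_role_arn: str,
                         lifecycle_s3_uri: str) -> Dict[str, Any]:
    """Build one instance group spec that carries a minimum capacity threshold."""
    # Assumption: a minimum above the target makes no sense, so reject it locally.
    if not 0 <= min_count <= target_count:
        raise ValueError("min_count must be between 0 and target_count")
    return {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": target_count,
        "MinInstanceCount": min_count,  # field named in the release note
        "ExecutionRole": execution_role_arn,
        "LifeCycleConfig": {
            "SourceS3Uri": lifecycle_s3_uri,
            "OnCreate": "on_create.sh",  # hypothetical lifecycle script name
        },
    }


def instance_group_statuses(describe_response: Dict[str, Any]) -> Dict[str, str]:
    """Map each instance group name to its Status in a DescribeCluster response."""
    return {
        group["InstanceGroupName"]: group["Status"]
        for group in describe_response.get("InstanceGroups", [])
    }


def wait_until_in_service(client: Any, cluster_name: str, group_name: str,
                          poll_seconds: float = 30.0,
                          timeout_seconds: float = 3600.0) -> bool:
    """Poll describe_cluster until the group reports InService or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.describe_cluster(ClusterName=cluster_name)
        if instance_group_statuses(response).get(group_name) == "InService":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


# Hypothetical usage against a real account (requires boto3 and credentials):
#   import boto3
#   sm = boto3.client("sagemaker")
#   group = build_instance_group("workers", "ml.p5.48xlarge", 16, 12,
#                                "arn:aws:iam::111122223333:role/ExampleRole",
#                                "s3://example-bucket/lifecycle/")
#   sm.create_cluster(ClusterName="example-cluster", InstanceGroups=[group],
#                     NodeProvisioningMode="Continuous")
#   wait_until_in_service(sm, "example-cluster", "workers")
```

Passing the client in as a parameter keeps the polling logic independent of the SDK, so it can be exercised with a stub before running against a live cluster.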
Source: AWS release notes
If you need further guidance on AWS, our experts are available at AWS@westloop.io. You may also reach us by submitting the Contact Us form.



