Amazon SageMaker HyperPod Slurm clusters now support specifying minimum capacity requirements with continuous provisioning

Published: May 27, 2026
https://aws.amazon.com/about-aws/whats-new/2026/05/amazon-sagemaker-hyperpod-mincount/

Amazon SageMaker HyperPod now supports minimum capacity requirements (MinCount) for clusters that use Slurm orchestration with continuous provisioning. With MinCount, you specify the minimum number of instances that must be provisioned successfully before an instance group transitions to InService status, which gives you control over when the cluster becomes available for job scheduling.

This is particularly useful for distributed training workloads using frameworks such as PyTorch FSDP, Megatron-LM, or NVIDIA NeMo, where training jobs are configured with a fixed number of participating nodes and may not start efficiently or correctly with partial cluster capacity.

What to do

  • Specify MinInstanceCount in the CreateCluster or UpdateCluster API request to set a minimum capacity threshold for an instance group.
  • Monitor the instance group status until it transitions to InService and nodes become available for Slurm job scheduling.
  • Review the Minimum capacity requirements (MinCount) documentation for more details.
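
The first two steps above can be sketched in Python. This is a minimal illustration, not the documented procedure. It assumes a boto3-style SageMaker client whose describe_cluster response lists instance groups with InstanceGroupName and Status fields, and it takes the MinInstanceCount field name from the announcement. The helper names, the on_create.sh script name, and the check that the minimum may not exceed the target count are hypothetical choices made for this example.

```python
import time
from typing import Any, Callable


def build_instance_group(
    name: str,
    instance_type: str,
    target: int,
    minimum: int,
    role_arn: str,
    lifecycle_s3_uri: str,
) -> dict[str, Any]:
    """Build one InstanceGroups entry that carries a minimum capacity threshold."""
    # Assumption for this sketch: the minimum may not exceed the target count.
    if not 0 <= minimum <= target:
        raise ValueError("minimum must be between 0 and target")
    return {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": target,
        "MinInstanceCount": minimum,  # field name as given in the announcement
        "ExecutionRole": role_arn,
        # Hypothetical lifecycle script name; use your own.
        "LifeCycleConfig": {"SourceS3Uri": lifecycle_s3_uri, "OnCreate": "on_create.sh"},
    }


def wait_for_in_service(
    client: Any,
    cluster_name: str,
    group_name: str,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll describe_cluster until the named instance group reports InService."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.describe_cluster(ClusterName=cluster_name)
        group = next(
            (g for g in response.get("InstanceGroups", [])
             if g.get("InstanceGroupName") == group_name),
            None,
        )
        status = group.get("Status") if group else None
        if status == "InService":
            return group
        if status == "Failed":
            raise RuntimeError(f"instance group {group_name!r} failed to provision")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"instance group {group_name!r} is still {status!r}")
        sleep(poll_seconds)
```

In practice, the dictionary returned by build_instance_group would be passed in the InstanceGroups list of a create_cluster or update_cluster call on boto3.client("sagemaker"), along with the other parameters your cluster requires, and wait_for_in_service would be called with the same client before submitting Slurm jobs.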

Source: AWS release notes



