Amazon SageMaker HyperPod now supports automatic Slurm topology management

Amazon SageMaker HyperPod now automatically selects and maintains the optimal network topology configuration for Slurm clusters based on GPU instance types. This improves distributed training performance by ensuring faster GPU-to-GPU communication and more efficient NCCL collective operations.
At cluster creation, HyperPod inspects the instance types in the cluster, identifies their networking characteristics, and selects the best-fit topology model. It supports tree topology for hierarchical interconnects and block topology for uniform high-bandwidth connectivity. For clusters with mixed instance types, HyperPod selects a topology that is compatible with all of them.
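To make the two topology models concrete, here is a minimal sketch of how each could be written in Slurm's `topology.conf` format (`SwitchName=` lines for the tree plugin, `BlockName=` and `BlockSizes=` lines for the block plugin). The helper names `render_tree` and `render_block`, the node names, and the equal-size-block restriction are assumptions made for illustration. This is not HyperPod's implementation, and the announcement does not say which files HyperPod writes.

```python
from typing import Dict, List


def render_tree(switches: Dict[str, List[str]], root: str = "root") -> str:
    """Render a two-level tree: leaf switches grouped under a single root switch.

    `switches` maps a hypothetical leaf switch name to the nodes attached to it.
    """
    lines = [
        f"SwitchName={name} Nodes={','.join(nodes)}"
        for name, nodes in switches.items()
    ]
    # The root switch connects the leaf switches rather than nodes directly.
    lines.append(f"SwitchName={root} Switches={','.join(switches)}")
    return "\n".join(lines)


def render_block(blocks: Dict[str, List[str]]) -> str:
    """Render a block layout: groups of nodes with uniform connectivity.

    This sketch only handles equal-size blocks (a simplifying assumption).
    """
    sizes = {len(nodes) for nodes in blocks.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch expects all blocks to have the same size")
    lines = [
        f"BlockName={name} Nodes={','.join(nodes)}"
        for name, nodes in blocks.items()
    ]
    lines.append(f"BlockSizes={sizes.pop()}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    print(render_tree({"s0": ["node1", "node2"], "s1": ["node3", "node4"]}))
    print(render_block({"b0": ["node1", "node2"], "b1": ["node3", "node4"]}))
```

With HyperPod managing the topology automatically, you would not normally write these files yourself; the sketch only shows what the two models express.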
As the cluster evolves through scaling operations and node replacements, HyperPod automatically updates the topology configuration without manual intervention.
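If you want to see the topology Slurm currently reports, for example after a scaling operation, one option is to run the standard `scontrol show topology` command on a cluster node and parse its `key=value` output. The sketch below assumes `scontrol` is on the `PATH`; the function names are hypothetical, and the exact fields returned depend on the Slurm version and the active topology plugin.

```python
import subprocess
from typing import Dict, List


def parse_topology(text: str) -> List[Dict[str, str]]:
    """Turn each non-empty line of key=value tokens into a dictionary."""
    records = []
    for line in text.splitlines():
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if fields:
            records.append(fields)
    return records


def fetch_topology() -> List[Dict[str, str]]:
    """Run `scontrol show topology` and parse its output.

    Must be run on a host where the Slurm client tools are installed.
    """
    result = subprocess.run(
        ["scontrol", "show", "topology"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_topology(result.stdout)


if __name__ == "__main__":
    for record in fetch_topology():
        print(record)
```

Comparing the parsed records before and after a node replacement is a simple way to confirm that the configuration was refreshed.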
What to do
- Create a SageMaker HyperPod Slurm cluster with supported GPU instance types.
- No further setup is needed: topology-aware scheduling is enabled by default.
This feature is available in all AWS Regions where Amazon SageMaker HyperPod is supported. To learn more, visit the Amazon SageMaker HyperPod documentation.
If you need further guidance on AWS, our experts are available at AWS@westloop.io. You may also reach us by submitting the Contact Us form.



