SageMaker HyperPod now supports gang scheduling for distributed training workloads

Amazon SageMaker HyperPod Gang Scheduling
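Amazon SageMaker HyperPod now supports gang scheduling for distributed training jobs. With gang scheduling, a job starts only when all of its required pods are ready, so training never begins with a partial set of workers. This helps prevent wasted compute from partially placed jobs that hold resources while idle, and avoids deadlocks in which jobs wait indefinitely for resources that other waiting jobs hold.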
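To make the all-or-nothing idea concrete, here is a minimal Python sketch that contrasts gang admission with pod-by-pod placement. It is a toy model, not HyperPod's implementation: the `Job` class, the functions `admit_gang` and `place_pod_by_pod`, and the notion of a "slot" (capacity for one pod) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    pods_required: int  # hypothetical: each pod needs exactly one free slot


def admit_gang(jobs: list[Job], free_slots: int) -> tuple[list[str], int]:
    """Admit a job only if ALL of its pods fit at once (all-or-nothing)."""
    admitted = []
    for job in jobs:
        if job.pods_required <= free_slots:
            free_slots -= job.pods_required
            admitted.append(job.name)
        # Otherwise the job holds no slots at all and simply keeps waiting.
    return admitted, free_slots


def place_pod_by_pod(jobs: list[Job], free_slots: int) -> dict[str, int]:
    """Contrast case: hand out slots one pod at a time, round-robin."""
    placed = {job.name: 0 for job in jobs}
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for job in jobs:
            if free_slots > 0 and placed[job.name] < job.pods_required:
                placed[job.name] += 1
                free_slots -= 1
                progress = True
    return placed


jobs = [Job("job-a", 4), Job("job-b", 4)]

# Pod-by-pod: each job ends up holding 2 of the 4 slots, so neither can start.
partial = place_pod_by_pod(jobs, free_slots=4)

# Gang: job-a gets all 4 slots and can start; job-b waits without holding any.
admitted, remaining = admit_gang(jobs, free_slots=4)
```

In the pod-by-pod case both jobs hold half of what they need and neither can begin training, which is the idle-compute and deadlock scenario described above. With gang admission, one job runs to completion while the other waits without occupying capacity.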
What to do
- Configure gang scheduling settings in the SageMaker HyperPod console.
- Adjust settings such as pod wait time, node failure handling, and workload admission.
Regions Available
This feature is available in the following AWS Regions:
- US East (N. Virginia)
- US East (Ohio)
- US West (N. California)
- US West (Oregon)
- Asia Pacific (Mumbai)
- Asia Pacific (Singapore)
- Asia Pacific (Sydney)
- Asia Pacific (Tokyo)
- Asia Pacific (Jakarta)
- Europe (Frankfurt)
- Europe (Ireland)
- Europe (London)
- Europe (Stockholm)
- Europe (Spain)
- South America (São Paulo)
Source: AWS release notes