Amazon SageMaker HyperPod now supports programmatic node reboot and replacement

Amazon SageMaker HyperPod API Updates
Amazon SageMaker HyperPod has introduced two new APIs, BatchRebootClusterNodes and BatchReplaceClusterNodes, for rebooting and replacing cluster nodes programmatically. Both provide a consistent approach to node recovery operations for ML workloads on Slurm-orchestrated and EKS-orchestrated clusters.
New Features
- Programmatic Node Reboot and Replacement: Reboot or replace unresponsive or degraded cluster nodes through an API call.
- Batch Operations: Each API accepts up to 25 instances per request, which helps when many nodes need recovery at once.
- Orchestrator Agnostic: Provides a consistent approach to node recovery operations, independent of the orchestrator used.
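Because each request is limited to 25 instances, a recovery job covering more nodes has to be split into batches first. The following is a minimal sketch of that splitting step only. The helper name `chunk_node_ids` and the sample instance IDs are hypothetical and are not part of the announcement.

```python
from typing import Iterator, List, Sequence

MAX_NODES_PER_CALL = 25  # per-request limit stated in the announcement


def chunk_node_ids(node_ids: Sequence[str],
                   size: int = MAX_NODES_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive slices of node_ids, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(node_ids), size):
        yield list(node_ids[start:start + size])


# Hypothetical instance IDs, for illustration only.
node_ids = [f"i-{n:017x}" for n in range(60)]
batches = list(chunk_node_ids(node_ids))
print([len(batch) for batch in batches])  # [25, 25, 10]
```

Each resulting batch can then be passed to one reboot or replace request.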
Regional Availability
The new APIs are currently available in the following regions:
- US East (Ohio)
- Asia Pacific (Mumbai)
- Asia Pacific (Tokyo)
What to do
- Use the new APIs to manage node recovery operations programmatically.
- Monitor the progress of reboot and replacement operations to ensure nodes return to operational status.
- Access the APIs through the AWS CLI, an AWS SDK, or direct API calls.
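The steps above might look roughly like the sketch below when using an SDK client for SageMaker, such as the one returned by `boto3.client("sagemaker")`. Several details are assumptions rather than facts from the announcement: that the SDK exposes the new API as `batch_reboot_cluster_nodes`, that it accepts `ClusterName` and `NodeIds`, and that a status of `Running` from `describe_cluster_node` is a reasonable signal that a node is operational again. The function name `reboot_and_wait` and its polling parameters are hypothetical. Check the current API reference before relying on any of these.

```python
import time
from typing import Any, Dict, List


def reboot_and_wait(client: Any, cluster_name: str, node_ids: List[str],
                    poll_seconds: float = 30.0,
                    max_polls: int = 40) -> Dict[str, str]:
    """Request a reboot for up to 25 nodes, then poll their reported status."""
    if len(node_ids) > 25:
        raise ValueError("at most 25 nodes per request")

    # Assumed request shape: a cluster name plus a list of node IDs.
    client.batch_reboot_cluster_nodes(ClusterName=cluster_name, NodeIds=node_ids)

    status: Dict[str, str] = {node_id: "Unknown" for node_id in node_ids}
    for _ in range(max_polls):
        for node_id in node_ids:
            details = client.describe_cluster_node(
                ClusterName=cluster_name, NodeId=node_id
            )
            status[node_id] = details["NodeDetails"]["InstanceStatus"]["Status"]
        # Assumption: "Running" means the node is back in service.
        if all(state == "Running" for state in status.values()):
            break
        time.sleep(poll_seconds)
    return status
```

The client is passed in as a parameter so the function can be exercised with a stub before it is pointed at a real cluster. A node may still report its previous status for a short time after the request is accepted, so a production version would likely wait for a status change before treating `Running` as recovered.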
Source: AWS release notes
If you need further guidance on AWS, our experts are available at AWS@westloop.io. You may also reach us by submitting the Contact Us form.



