Amazon SageMaker HyperPod now supports programmatic node reboot and replacement

Published: November 26, 2025
https://aws.amazon.com/about-aws/whats-new/2025/11/amazon-sagemaker-hyperpod-programmatic-node-reboot-replacement

Amazon SageMaker HyperPod API Updates

Amazon SageMaker HyperPod has introduced two new APIs, BatchRebootClusterNodes and BatchReplaceClusterNodes, for rebooting and replacing cluster nodes programmatically. They give ML teams a consistent way to run node recovery operations on both Slurm-orchestrated and EKS-orchestrated clusters.

New Features

  • Programmatic Node Reboot and Replacement: Reboot or replace unresponsive or degraded cluster nodes through an API call.
  • Batch Operations: Each API supports batch operations of up to 25 instances, facilitating efficient management of large-scale recovery scenarios.
  • Orchestrator Agnostic: Provides a consistent approach to node recovery operations, independent of the orchestrator used.
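Because each call accepts at most 25 instances, recovering a larger set of nodes means splitting the list into batches. The sketch below shows one way to do that. It is not taken from the announcement: the helper names (chunked, reboot_nodes) are hypothetical, and it assumes the SDK client exposes the new operation as batch_reboot_cluster_nodes with ClusterName and NodeIds parameters, which you should confirm against the API reference for your SDK version.

```python
from typing import Any, Dict, Iterator, List, Sequence

# Batch limit stated in the announcement: up to 25 instances per call.
MAX_NODES_PER_CALL = 25


def chunked(node_ids: Sequence[str], size: int = MAX_NODES_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive slices of node_ids, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(node_ids), size):
        yield list(node_ids[start:start + size])


def reboot_nodes(client: Any, cluster_name: str, node_ids: Sequence[str]) -> List[Dict[str, Any]]:
    """Issue one reboot request per batch and collect the raw responses.

    ASSUMPTION: `client` has a method batch_reboot_cluster_nodes that takes
    ClusterName and NodeIds keyword arguments. The method and parameter names
    are inferred from the API name in the announcement, not confirmed by it.
    """
    responses = []
    for batch in chunked(node_ids):
        responses.append(
            client.batch_reboot_cluster_nodes(ClusterName=cluster_name, NodeIds=batch)
        )
    return responses
```

With boto3, the client would typically be created with boto3.client("sagemaker", region_name="us-east-2"), provided the installed SDK version already includes the new operations. The same batching applies to replacement if the call is swapped for the BatchReplaceClusterNodes equivalent.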

Regional Availability

The new APIs are currently available in the following AWS Regions:

  • US East (Ohio)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Tokyo)

What to do

  • Use the new APIs to manage node recovery operations programmatically.
  • Monitor the progress of reboot and replacement operations to ensure nodes return to operational status.
  • Access the APIs through the AWS CLI, the AWS SDKs, or direct API calls.
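Monitoring can be as simple as polling each node until it reports a healthy status. The following is a minimal sketch under stated assumptions, not a method described in the announcement: wait_for_nodes is a hypothetical helper, it relies on the existing DescribeClusterNode operation (describe_cluster_node in boto3), and it assumes the status is found at NodeDetails.InstanceStatus.Status with "Running" meaning healthy. A replaced node may come back under a different node ID, in which case listing the cluster's nodes is likely the better check.

```python
import time
from typing import Any, Callable, Dict, Iterable


def wait_for_nodes(
    client: Any,
    cluster_name: str,
    node_ids: Iterable[str],
    target_status: str = "Running",
    poll_seconds: float = 30.0,
    max_polls: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll node status until every node reaches target_status or polls run out.

    Returns the last status observed for each node, so callers can see which
    nodes (if any) never reached the target.

    ASSUMPTION: the response shape is
    {"NodeDetails": {"InstanceStatus": {"Status": "..."}}}.
    """
    pending = set(node_ids)
    last_status: Dict[str, str] = {}
    for attempt in range(max_polls):
        # sorted() copies the set, so removing items inside the loop is safe.
        for node_id in sorted(pending):
            response = client.describe_cluster_node(
                ClusterName=cluster_name, NodeId=node_id
            )
            status = response["NodeDetails"]["InstanceStatus"]["Status"]
            last_status[node_id] = status
            if status == target_status:
                pending.discard(node_id)
        if not pending:
            break
        # Skip the wait after the final poll.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return last_status
```

The sleep function is passed in so the loop can be exercised without real waiting. In practice the polling interval and the maximum number of polls should be tuned to how long reboots and replacements take on your instance types.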

Source: AWS release notes



