Amazon SageMaker HyperPod now supports checkpointless training

Published
December 3, 2025
https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-sagemaker-hyperpod-checkpointless-training

Amazon SageMaker HyperPod now supports checkpointless training, a new capability that keeps training moving forward when failures occur and reduces recovery time from hours to minutes. The feature preserves the model training state across the distributed cluster, automatically swaps out faulty training nodes, and recovers from failures by transferring state peer-to-peer from healthy nodes.
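To make the peer-to-peer recovery idea concrete, here is a toy, in-memory sketch. It is not the HyperPod implementation or API: the `Worker` class and `recover_from_peer` function are hypothetical names, and the sketch assumes data-parallel replicas that all hold identical training state, so a replacement node can copy state from any healthy peer instead of reloading a saved checkpoint.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for one training node holding a full state replica."""
    rank: int
    step: int = 0
    weights: list = field(default_factory=lambda: [0.0, 0.0])
    healthy: bool = True

    def train_step(self) -> None:
        # Placeholder "update": every replica applies the same change.
        self.weights = [w + 0.1 for w in self.weights]
        self.step += 1


def recover_from_peer(failed_rank: int, workers: list) -> Worker:
    """Replace a failed worker with a fresh one seeded from a healthy peer."""
    peer = next(w for w in workers if w.healthy and w.rank != failed_rank)
    replacement = Worker(
        rank=failed_rank,
        step=peer.step,
        weights=copy.deepcopy(peer.weights),  # state comes from memory, not disk
    )
    workers[failed_rank] = replacement
    return replacement


# Four replicas train for three steps, then rank 2 fails and is swapped out.
workers = [Worker(rank=r) for r in range(4)]
for _ in range(3):
    for w in workers:
        w.train_step()

workers[2].healthy = False
new_worker = recover_from_peer(2, workers)
```

After recovery the replacement resumes at the same step as its peers, so the other workers lose no progress. A real system also has to detect the fault, provision the spare node, and re-form the communication group, none of which this sketch models.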

Key Benefits

  • Reduced Recovery Time: Recovery drops from hours to minutes.
  • Cost Savings: Less money spent on AI accelerators that sit idle.
  • High Training Goodput: Enables upwards of 95% training goodput on large clusters.
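To see how recovery time feeds into goodput, the sketch below uses a simplified definition: goodput is the fraction of wall-clock time spent making training progress. The function name `training_goodput` and all of the numbers (a 720-hour run with 20 failures) are hypothetical illustrations, not AWS figures, and the model ignores work lost between the last saved state and the failure.

```python
def training_goodput(total_hours: float, failures: int,
                     recovery_hours_per_failure: float) -> float:
    """Fraction of wall-clock time spent on productive training.

    Simplified model: the only lost time is the recovery time per failure.
    """
    lost_hours = failures * recovery_hours_per_failure
    if lost_hours > total_hours:
        raise ValueError("recovery time exceeds total run time")
    return (total_hours - lost_hours) / total_hours


# Hypothetical 30-day (720-hour) run with 20 failures.
slow = training_goodput(720, 20, 2.0)       # 2 hours to recover each time
fast = training_goodput(720, 20, 5 / 60)    # 5 minutes to recover each time

print(f"hours-scale recovery:   {slow:.1%}")  # 94.4%
print(f"minutes-scale recovery: {fast:.1%}")  # 99.8%
```

Under these assumed numbers, cutting recovery from hours to minutes is what moves goodput past the 95% mark. Failure counts tend to grow with cluster size, so the gap widens on larger clusters.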

Availability

Checkpointless training is available in all AWS Regions where Amazon SageMaker HyperPod is available.

Getting Started

  • Enable checkpointless training with zero code changes using HyperPod recipes for popular models.
  • Integrate checkpointless training components with minimal modifications for custom PyTorch-based workflows.

Source: AWS release notes
