Introducing elastic training on Amazon SageMaker HyperPod

Published
December 3, 2025
https://aws.amazon.com/about-aws/whats-new/2025/12/elastic-training-amazon-sagemaker-hyperpod/

Amazon SageMaker HyperPod Elastic Training

Amazon SageMaker HyperPod now supports elastic training, enabling organizations to accelerate foundation model training by automatically scaling training workloads based on resource availability and workload priorities. This represents a fundamental shift from training with a fixed set of resources, as it saves hours of engineering time spent reconfiguring training jobs based on compute availability.

Any change in compute availability previously required manually halting training, reconfiguring training parameters, and restarting jobs—a process that requires distributed training expertise and leaves expensive AI accelerators sitting idle during training job reconfiguration. Elastic training automatically expands training jobs to absorb idle AI accelerators and seamlessly contracting when higher-priority workloads need resources—all without halting training entirely.

By eliminating manual reconfiguration overhead and ensuring continuous utilization of available compute, elastic training can help save time previously spent on infrastructure management, reduce costs by maximizing cluster utilization, and accelerate time-to-market. Training can start immediately with minimal resources and grow opportunistically as capacity becomes available.

What to do

Source: AWS release notes




If you need further guidance on AWS, our experts are available at AWS@westloop.io. You may also reach us by submitting the Contact Us form.

Follow our blog

Get the latest insights and advice on AWS services from our experts.

By clicking Sign Up you're confirming that you agree with our Terms and Conditions.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.