SageMaker HyperPod now supports Managed Tiered KV Cache and Intelligent Routing

Published
November 26, 2025
https://aws.amazon.com/about-aws/whats-new/2025/11/sagemaker-hyperpod-managed-tiered-kv-cache/

Amazon SageMaker HyperPod Updates

Amazon SageMaker HyperPod now supports Managed Tiered KV Cache and Intelligent Routing for large language model (LLM) inference, optimizing performance for long-context prompts and multi-turn conversations.

Managed Tiered KV Cache intelligently caches and reuses computed values, while Intelligent Routing directs requests to optimal instances, delivering:

  • 40% latency reduction
  • 25% throughput improvement
  • 25% cost savings

Managed Tiered KV Cache uses a two-tier architecture that combines local CPU memory (L1) with disaggregated, cluster-wide storage (L2). Intelligent Routing maximizes cache utilization through the following strategies (illustrated by the sketch after this list):

  • Prefix-aware routing for common prompt patterns
  • KV-aware routing for maximum cache efficiency
  • Round-robin for stateless workloads
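
To make the two ideas concrete, here is a minimal, hypothetical Python sketch. It is not HyperPod's implementation; every class and function name is an assumption. It only mirrors the concepts above: an L1 cache in local CPU memory, an L2 stand-in for disaggregated cluster-wide storage, and a prefix-aware router that hashes a prompt prefix so requests sharing that prefix (for example, the same system prompt across turns of a conversation) land on the same worker and can reuse its cached KV blocks.

```python
"""Illustrative two-tier KV cache with prefix-aware routing (hypothetical)."""

import hashlib
from collections import OrderedDict


class TieredKVCache:
    """L1: bounded in-process LRU dict (stand-in for local CPU memory).
    L2: plain dict standing in for a disaggregated, cluster-wide store."""

    def __init__(self, l1_capacity: int = 1024):
        self.l1: OrderedDict[str, bytes] = OrderedDict()
        self.l1_capacity = l1_capacity
        self.l2: dict[str, bytes] = {}  # placeholder for the remote tier

    def get(self, key: str) -> bytes | None:
        if key in self.l1:                   # L1 hit: cheapest path
            self.l1.move_to_end(key)
            return self.l1[key]
        if key in self.l2:                   # L2 hit: promote to L1
            value = self.l2[key]
            self._put_l1(key, value)
            return value
        return None                          # miss: caller recomputes the KV blocks

    def put(self, key: str, value: bytes) -> None:
        self._put_l1(key, value)
        self.l2[key] = value                 # write-through to the shared tier

    def _put_l1(self, key: str, value: bytes) -> None:
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_capacity:  # evict the least recently used entry
            self.l1.popitem(last=False)


def prefix_aware_route(prompt: str, workers: list[str], prefix_chars: int = 128) -> str:
    """Hash a fixed-length prompt prefix so requests that share a prefix
    are routed to the same worker and reuse its cached KV state."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]


if __name__ == "__main__":
    workers = ["worker-0", "worker-1", "worker-2"]
    system_prompt = "You are a helpful assistant. " * 8
    # Two turns of the same conversation hash to the same worker,
    # so the second turn can reuse that worker's cached prefix.
    print(prefix_aware_route(system_prompt + "Hi", workers))
    print(prefix_aware_route(system_prompt + "Tell me more", workers))
```

KV-aware routing would go a step further and consult actual cache occupancy on each worker before choosing, while round-robin ignores cache state entirely, which is why it suits stateless workloads.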

What to do

  • Enable these features through InferenceEndpointConfig or SageMaker JumpStart
  • Deploy models via the HyperPod Inference Operator on EKS-orchestrated clusters (a hedged configuration sketch follows this list)
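
The announcement names the InferenceEndpointConfig resource but does not show its schema, so the sketch below is a rough illustration of applying such a custom resource with the Kubernetes Python client. The API group, version, plural, and every field under `spec` are placeholders, not the operator's real contract; consult the HyperPod Inference Operator documentation for the actual manifest.

```python
"""Hedged sketch: creating a hypothetical InferenceEndpointConfig custom
resource on an EKS cluster. All group/version/field names are placeholders."""

from kubernetes import client, config

# Placeholder manifest: the field names under `spec` are illustrative only.
inference_endpoint_config = {
    "apiVersion": "inference.example.aws/v1",      # placeholder API group/version
    "kind": "InferenceEndpointConfig",
    "metadata": {"name": "llm-endpoint", "namespace": "default"},
    "spec": {
        "modelSource": "jumpstart://example-llm",  # placeholder model reference
        "tieredKvCache": {"enabled": True},        # assumed toggle name
        "routingStrategy": "prefix-aware",         # assumed field; could also be
                                                   # "kv-aware" or "round-robin"
    },
}


def apply_config() -> None:
    """Create the custom resource on the cluster the kubeconfig points at."""
    config.load_kube_config()  # assumes kubectl access to the HyperPod EKS cluster
    api = client.CustomObjectsApi()
    api.create_namespaced_custom_object(
        group="inference.example.aws",             # placeholder: use the real group
        version="v1",
        namespace="default",
        plural="inferenceendpointconfigs",
        body=inference_endpoint_config,
    )


if __name__ == "__main__":
    apply_config()
```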

Source: AWS What's New announcement

If you need further guidance on AWS, our experts are available at AWS@westloop.io. You may also reach us by submitting the Contact Us form.
