You’ve built an ML pipeline on Kubernetes. Your GPUs cost $4 per hour. And they’re sitting idle 40% of the time waiting for storage to feed them data.
The problem isn’t your code. It’s the persistent volumes you’re using. Traditional Kubernetes storage was designed for databases and web apps – not ML workloads that need to load 500GB datasets before training starts or write multi-gigabyte checkpoints every 30 minutes. This guide is part of our comprehensive Kubernetes storage for AI workloads resource, where we explore solutions to infrastructure bottlenecks across cloud providers.
Azure Container Storage, GKE Managed Lustre, and AWS FSx for Lustre each offer high-performance options that can keep those GPUs fed. They’re not cheap. But neither is wasting half your GPU budget on idle time.
Here’s a head-to-head comparison covering performance benchmarks, pricing analysis, and workload-specific recommendations. Pick the right storage for your ML workload so you stop burning money on idle GPUs.
How does storage performance impact GPU utilisation in Kubernetes ML workloads?
Storage bottlenecks cause GPUs to idle during data loading. You’re wasting compute resources that cost $2-8 per hour per GPU. Slow checkpoint writes block training iterations. The result? GPU utilisation drops from 100% to 40-60% in poorly configured systems. CoreWeave benchmarks show sustained 100% GPU utilisation requires storage delivering 500+ GB/s aggregate throughput across parallel workers. For detailed performance benchmarks across different AI workload types, see our analysis of AI training and inference storage performance requirements.
ML training workloads follow a two-phase I/O pattern. First there’s initial data loading – intense read bursts with mixed sequential and random patterns. Then come periodic checkpoint operations that create write spikes. When storage is properly configured, production monitoring shows sustained 100% GPU core utilisation with only minor dips during initial data loading and checkpoint operations.
The numbers tell the story. A multi-billion parameter model training on 4096 H100 GPUs showed peak read rates of approximately 70 GiB/s and write spikes reaching 50 GiB/s. If your storage can’t deliver those rates, the GPUs wait. And waiting GPUs burn money.
Performance metrics that matter: throughput for sequential data loading, IOPS for random access, latency for keeping GPUs fed. Run a cost calculation. If your GPU idles for 30 minutes per training run waiting on storage, that’s $2 wasted per run on a $4/hour GPU. Multiply by your training frequency.
VM limits matter too. On Azure, for example, the Standard_DS2_v2 VM size provides 6,400 IOPS and 96 MBps of throughput, while Standard_B2ms delivers only 1,920 IOPS and 22.5 MBps. Pick the wrong one and you’ve throttled your entire pipeline.
What are the key differences between Azure Container Storage, GKE Managed Lustre, and AWS FSx for Lustre?
The three solutions differ in architecture, integration patterns, and how much upfront configuration you’ll need.
Azure Container Storage v2.0.0 focuses on local NVMe integration, delivering 7x higher IOPS and 4x lower latency than the previous version. It’s optimised for single-zone deployments. Azure rebuilt the architecture from the ground up. The result? Better performance while using fewer resources. They also eliminated the previous three-node minimum requirement – it now works with single-node deployments.
GKE Managed Lustre provides a fully managed parallel file system with tiered performance from 125 to 1000 MB/s per TiB and built-in zone-aware scheduling. It requires a one-time VPC peering setup with firewall rules allowing traffic on the Lustre network ports (TCP 988 or 6988).
AWS FSx for Lustre offers S3 integration for data lakes, sub-millisecond latency, and tight SageMaker integration. The S3 lazy-loading integration saves you building that pipeline yourself if your training data lives in S3 buckets.
All three support ReadWriteMany access mode for parallel training jobs, but differ in pricing models and configuration complexity.
Integration ecosystems differ. Azure Container Storage integrates with KAITO for automated AI model deployment using fast NVMe-backed storage. SageMaker HyperPod supports the Amazon EBS CSI driver for lifecycle management with customer-managed encryption keys.
Performance tier models work differently. Azure’s IOPS depend on VM size. GKE offers per-TiB throughput tiers. FSx provides scratch versus persistent modes. Pick based on your workload pattern.
Which cloud provider offers the best storage performance for ML training workloads?
The answer depends on your workload’s I/O pattern. Random-access-heavy workloads favour Azure’s IOPS. Sequential bulk transfers favour GKE’s throughput tiers. S3-backed pipelines favour AWS FSx.
Azure Container Storage v2.0.0 delivers the highest IOPS of the three – a 7x improvement over the previous version. That makes it ideal for workloads with many small file operations during data preprocessing. GKE Managed Lustre excels at sustained high throughput, up to 1000 MB/s per TiB in the premium tier – best for large sequential dataset loading and checkpoint writes. AWS FSx for Lustre provides balanced performance with sub-millisecond latency and hundreds of GB/s of throughput, optimised for S3-backed data lake workflows.
Real numbers from real workloads. CoreWeave benchmarks using VAST Data architecture achieved aggregate read throughput exceeding 500 GiB/s across 64 nodes, maintaining per-node performance at 7.94 GiB/s.
A 20 TiB GKE Managed Lustre instance provides between 2.5 GB/s and 20 GB/s of aggregate throughput depending on the selected performance tier. A4 virtual machines deliver approximately 2.5 GB/s per GPU from Managed Lustre instances.
Profile your workload I/O pattern first. If you’re IOPS-bound – lots of random small reads during data preprocessing – Azure’s 7x improvement matters. If you’re throughput-bound – sequential loading of large datasets – GKE’s tiered approach makes sense. If you’re pulling training data from S3 buckets, FSx’s lazy-loading integration saves you building that pipeline yourself.
How much does high-performance Kubernetes storage cost for ML workloads on each cloud provider?
GKE Managed Lustre pricing offers four tiers based on MB/s per TiB, with minimum 2.4 TiB capacity. Azure Container Storage versions 2.0.0 and beyond no longer charge a per-GB monthly fee for storage pools, making the service free – you only pay for underlying storage and VM costs. FSx for Lustre costs differ between persistent deployment and scratch file systems linked to S3 buckets.
The real cost calculation includes more than storage pricing. You need VM costs, data transfer costs, and the cost of getting it wrong. Overprovision premium storage for non-performance-critical phases and you waste budget. Underprovision and your GPUs idle. For comprehensive cost optimisation strategies and FinOps best practices, see our guide on FinOps and cost optimisation for AI storage in Kubernetes.
Choose performance tiers matching throughput and capacity requirements rather than defaulting to the highest tier. Use one Managed Lustre instance for both training and serving when spare IOPS exist. Export data to lower-cost Cloud Storage classes post-training for long-term retention.
Hidden costs add up – and so do hidden savings. Subsequent training jobs can reuse datasets already on FSx, avoiding repeated S3 request charges. Deploy Managed Lustre in the same zone as GPU clients to minimise cross-zone data transfer costs.
Azure Container Storage delivers better performance while using fewer resources than previous versions. That frees up CPU capacity for applications. It’s a direct cost saving beyond storage pricing.
Pick a tier based on current workload profiling, not guesswork.
When should I use ephemeral versus persistent storage for Kubernetes ML workloads?
Ephemeral NVMe storage suits temporary model artifacts, caching preprocessed data, and inference model loading where faster loading justifies data loss on pod termination. Persistent storage is required for checkpoints, training datasets, model registries, and any data that must survive pod failures or cluster maintenance. The hybrid approach is optimal – ephemeral for hot path (active training iteration cache), persistent for cold path (checkpoints, final models).
Data stored on ephemeral NVMe disks is temporary and will be lost if the VM is deallocated or redeployed. Persistent volumes can exist beyond the lifetime of individual pods.
Ephemeral NVMe data disks are suitable for high-speed caching layers such as datasets and checkpoints for AI training, or model files used for AI inference. Data that must survive pod failures should be stored on persistent volumes backed by Azure Disk, Azure Files, or other durable storage.
Use ephemeral for data-intensive analytics and processing pipelines that require fast temporary storage. Don’t use ephemeral for critical data.
GKE Managed Lustre supports both dynamic provisioning, where storage is tightly coupled to a specific workload, and static provisioning, where a long-lived instance is shared across multiple clusters. Dynamic provisioning is the default. Static provisioning treats the file system as a persistent shared resource when multiple jobs need access to the same data.
Plan for pod disruption budgets and ensure your application can quickly rebuild from durable storage when using ephemeral NVMe.
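As a concrete illustration of the hybrid approach, here’s a minimal sketch of a pod that mounts an emptyDir volume for the hot-path cache alongside a persistent volume claim for checkpoints. The names, image, and PVC are placeholders – adapt them to your own training job and storage class.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job                        # placeholder name
spec:
  containers:
  - name: trainer
    image: my-registry/trainer:latest    # placeholder image
    volumeMounts:
    - name: scratch-cache                # hot path: fast, lost on pod termination
      mountPath: /cache
    - name: checkpoints                  # cold path: survives pod failures
      mountPath: /checkpoints
  volumes:
  - name: scratch-cache
    emptyDir: {}                         # backed by node disk; fastest when the node
                                         # has local NVMe ephemeral storage
  - name: checkpoints
    persistentVolumeClaim:
      claimName: checkpoint-pvc          # assumes a PVC bound to durable storage
```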
How do Azure, GCP, and AWS differ in their approach to Kubernetes storage primitives for ML workloads?
All three providers use the Container Storage Interface (CSI) standard, but they differ in configuration complexity, zone awareness, and access modes.
A StorageClass defines storage tiers with cloud-specific CSI drivers, performance parameters, and volumeBindingMode settings. A PersistentVolumeClaim requests storage with a specific size and access mode – ReadWriteOnce for pods on a single node, ReadWriteMany for parallel jobs across nodes.
Access mode support varies. An Azure Disk mounted as ReadWriteOnce is only available to a single node. Azure Files lets you share data across multiple nodes and pods, supporting the ReadWriteMany access mode. GKE Managed Lustre and AWS FSx both support ReadWriteMany for parallel training jobs.
Dynamic provisioning lets Kubernetes automatically provision storage when a PVC is created. To reduce management overhead, use dynamic provisioning instead of statically creating and assigning persistent volumes. Define an appropriate reclaim policy in storage classes to minimise unneeded storage costs once pods are deleted.
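Putting those pieces together, here’s a minimal, provider-neutral sketch – a StorageClass with an explicit reclaim policy and binding mode, plus a PVC requesting shared access. The class name and the CSI driver are placeholders; substitute your cloud provider’s driver and its parameters.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ml-shared-fast                    # placeholder name
provisioner: example.csi.vendor.com       # placeholder: substitute your cloud's CSI driver
reclaimPolicy: Delete                     # release underlying storage when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer   # provision in the same zone as the consuming pod
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data                     # placeholder name
spec:
  accessModes:
  - ReadWriteMany                         # shared access for parallel training jobs
  storageClassName: ml-shared-fast
  resources:
    requests:
      storage: 2Ti
```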
Azure Container Storage automatically detects and orchestrates NVMe data disks with minimal configuration. GKE requires one-time VPC peering setup. FSx needs S3 bucket configuration for lazy-loading integration.
What makes configuration complexity different across Azure Container Storage, GKE Managed Lustre, and AWS FSx?
Azure Container Storage requires minimal upfront setup – just select storage-optimised VMs (L-series, ND-series, or Da-series) and the system automatically detects and orchestrates NVMe data disks. No minimum cluster size requirements. Built-in orchestration handles storage pools, persistent volume lifecycles, snapshots, and scaling.
GKE Managed Lustre needs one-time VPC peering configuration with firewall rules for Lustre network ports (TCP 988 or 6988). Once configured per VPC, it’s done. The system handles zone-aware scheduling automatically through WaitForFirstConsumer binding mode. Performance tier selection (125, 250, 500, or 1000 MB/s per TiB) requires upfront I/O profiling.
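For orientation, here is a sketch of what a dynamically provisioned Managed Lustre class might look like. The provisioner name and the parameter keys (perUnitStorageThroughput, network) are assumptions based on GKE CSI driver conventions – confirm them against the current GKE Managed Lustre documentation before use.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lustre-500                          # placeholder name
provisioner: lustre.csi.storage.gke.io      # assumed driver name for GKE Managed Lustre
parameters:
  perUnitStorageThroughput: "500"           # assumed key: tier of 125, 250, 500, or 1000 MB/s per TiB
  network: projects/my-project/global/networks/my-vpc   # assumed key: the peered VPC
volumeBindingMode: WaitForFirstConsumer     # zone-aware scheduling as described above
allowVolumeExpansion: true
```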
AWS FSx demands S3 bucket configuration for data lake integration. You choose between scratch file systems (linked to S3 buckets, lower cost) and persistent deployment (highly available, durable, higher cost). The CSI driver integrates with SageMaker HyperPod for production ML infrastructure, but requires more configuration steps for encryption keys and lifecycle management.
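For comparison, a sketch of a scratch-mode FSx for Lustre class linked to an S3 bucket via the FSx CSI driver (fsx.csi.aws.com). The subnet, security group, and bucket values are placeholders, and the parameter keys should be verified against the driver’s current documentation.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-lustre-scratch                 # placeholder name
provisioner: fsx.csi.aws.com               # AWS FSx for Lustre CSI driver
parameters:
  subnetId: subnet-0123456789abcdef0       # placeholder: subnet shared with your GPU nodes
  securityGroupIds: sg-0123456789abcdef0   # placeholder: must allow Lustre traffic
  deploymentType: SCRATCH_2                # lower-cost scratch mode
  s3ImportPath: s3://my-training-data      # placeholder bucket; enables lazy-loading from S3
mountOptions:
  - flock
```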
The tradeoff: Azure offers fastest time-to-value for teams new to high-performance Kubernetes storage. GKE requires more upfront VPC planning but provides predictable performance tiers. FSx offers tightest S3 integration but demands most configuration effort. For step-by-step implementation guidance with YAML examples, see our implementation guide for high-performance storage in Kubernetes.
When should I use AWS FSx for Lustre versus EBS for ML storage on EKS?
FSx for Lustre excels at shared access scenarios with ReadWriteMany support, S3 data lake integration, and workloads requiring hundreds of GB/s of throughput across many nodes. EBS suits single-node storage with ReadWriteOnce access, and it recently gained SageMaker HyperPod integration for customer-managed encryption and production ML infrastructure. Use FSx for distributed training with shared datasets; use EBS for single-instance training jobs, model serving, or databases requiring block storage.
FSx for Lustre stores data across multiple network file servers to maximise performance and reduce bottlenecks. EBS CSI driver supports both ephemeral and persistent volumes addressing the need for dynamic storage management in large-scale AI workloads.
The FSx persistent deployment option provides highly available, durable storage for workloads that run for extended periods and are sensitive to disruptions. If a file server becomes unavailable, it is replaced automatically within minutes. Data on persistent file systems is replicated across disks, and any failed disks are replaced automatically and transparently.
FSx for Lustre is fully managed and integrated with Amazon S3, enabling lazy-loading of data from S3 buckets on-demand. Using FSx for Lustre accelerates training jobs by enabling faster download of large datasets. Subsequent training jobs can use datasets already on FSx avoiding repeated S3 request costs.
HyperPod offers two flexible approaches for provisioning additional EBS volumes: InstanceStorageConfigs for cluster-level provisioning or EBS CSI driver for dynamic Pod-level management. Customer managed keys support allows HyperPod to encrypt EBS volumes with your own encryption keys for compliance.
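For the single-node EBS path, here’s a minimal sketch of an encrypted gp3 class using the EBS CSI driver. This shows generic EKS usage rather than HyperPod’s InstanceStorageConfigs path, and the KMS key ARN is a placeholder for your own customer-managed key.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-encrypted                  # placeholder name
provisioner: ebs.csi.aws.com               # AWS EBS CSI driver
parameters:
  type: gp3                                # general-purpose SSD
  encrypted: "true"
  kmsKeyId: arn:aws:kms:us-east-1:111122223333:key/EXAMPLE   # placeholder customer-managed key
volumeBindingMode: WaitForFirstConsumer    # bind in the pod's zone
allowVolumeExpansion: true
```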
FAQ Section
What is the Container Storage Interface (CSI) and why does it matter for ML workloads?
CSI is the Kubernetes standard API allowing storage vendors to develop drivers that work across different Kubernetes implementations without modifying core Kubernetes code. For ML workloads, CSI drivers enable portable storage configurations, dynamic provisioning, and consistent management across Azure, GCP, and AWS. This allows you to write StorageClass configurations once and migrate between cloud providers with minimal YAML changes, reducing vendor lock-in.
Can I use the same storage solution across multiple cloud providers?
Not directly with managed services. Azure Container Storage, GKE Managed Lustre, and AWS FSx are cloud-specific offerings. You can use vendor-neutral storage solutions like MinIO, Portworx, or self-hosted Lustre that run on any Kubernetes cluster. The tradeoff: managed services provide better performance, automatic scaling, and integrated billing, while self-hosted solutions offer portability but require operational overhead. For evaluation criteria and vendor comparisons across enterprise Kubernetes storage solutions, see our enterprise storage vendor ecosystem evaluation framework.
How do I troubleshoot slow data loading during GPU training jobs?
Start by monitoring storage performance metrics: check IOPS utilisation against VM limits, measure throughput during data loading phases, and verify latency stays below 5ms. Common causes include a storage tier under-provisioned for the workload, cross-zone storage access adding 3-10ms of latency, VM IOPS limits throttling storage (Standard_DS2_v2 is capped at 6,400 IOPS), and inefficient data loading code making many small reads instead of batched loads.
What is WaitForFirstConsumer and when should I use it?
WaitForFirstConsumer is a Kubernetes volumeBindingMode that delays PersistentVolume binding until a pod using the PVC is scheduled, ensuring storage provisions in the same availability zone as the pod. Use it for latency-sensitive ML workloads where cross-zone network hops would add 3-10ms latency. It’s the default mode for GKE Managed Lustre and recommended for any storage requiring sub-5ms latency.
How much storage capacity do I need for ML training jobs?
Calculate based on three components: training dataset size (say, 500GB for an image classification corpus), checkpoint storage (model size × checkpoint frequency, e.g. a 10GB model × 10 retained checkpoints = 100GB), and scratch space for preprocessing (typically 1-2x the dataset size). For example, training a 10GB model on a 500GB dataset requires a minimum of 500GB + 100GB + 500GB = 1.1TB. Add a 20% buffer for safety. For GKE Managed Lustre, round up to the 2.4 TiB minimum capacity requirement.
Does storage performance impact model accuracy or just training speed?
Storage performance affects training speed, not model accuracy. Slow storage causes GPUs to idle waiting for data, increasing wall-clock training time from hours to days, but the final model weights remain identical. However, performance impacts iteration velocity: faster storage enables more experiments per day, better hyperparameter tuning, and quicker convergence on optimal architectures.
Can I mix ephemeral and persistent storage in the same Kubernetes pod?
Yes, pods can mount multiple volumes of different types simultaneously. Common pattern: mount ephemeral emptyDir volume for temporary preprocessing cache (fast, deleted on pod termination) and persistent PVC for checkpoints (durable, survives failures). This hybrid approach optimises cost (ephemeral storage free) and performance (NVMe-backed ephemeral for hot path) while protecting data (persistent for checkpoints).
How do I migrate from one cloud provider’s storage to another?
Migration requires data transfer between providers (use rsync, gsutil, or AWS DataSync), StorageClass reconfiguration (update the CSI driver and provider-specific parameters), PVC recreation (delete old PVCs, create new ones referencing the new StorageClass), and pod redeployment (update pod specs to mount the new PVCs). Note that RWX support differs: GKE and AWS offer Lustre-backed ReadWriteMany, while Azure requires Azure Files. For large datasets (10+ TB), expect multi-hour transfer times.
How do I choose between GKE Managed Lustre performance tiers?
Match the performance tier to actual workload requirements through I/O profiling rather than defaulting to the highest tier. The 1000 and 500 MB/s per TiB tiers offer the highest throughput for foundation model training and large-scale simulations. The 250 MB/s per TiB tier provides balanced cost-effectiveness for general HPC workloads and AI inference serving. The 125 MB/s per TiB tier suits large-capacity use cases and migrating containerised on-premises applications. A 10 TiB volume on the 125 MB/s tier delivers 1.25 GB/s of aggregate throughput, while the 500 MB/s tier delivers 5 GB/s on the same capacity.
Do I need to configure backups for ML training storage?
Depends on data replaceability. Training datasets pulled from S3/GCS buckets don’t need storage-level backups (source of truth in object storage). Checkpoints should be backed up if they represent days of GPU compute investment. Final trained models require backup as production artifacts. For Azure Container Storage, use Azure Backup; for GKE Managed Lustre, use Cloud Storage snapshots; for FSx, enable automatic daily backups.
Can I use ReadWriteMany access mode with Azure Disks?
No – Azure Disks support only ReadWriteOnce (single-node access). For ReadWriteMany in Azure, use Azure Files or Azure Container Storage with an NFS provisioner. Azure Files trades some performance (lower IOPS) for shared access capability. If you need high-performance RWX on Azure, consider Azure NetApp Files (premium pricing), or use Azure Container Storage v2.0.0 with multiple ReadWriteOnce volumes for independent per-pod storage.
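As a sketch of the Azure Files route, assuming the Azure Files CSI driver and a premium SKU – the class and PVC names are placeholders:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-rwx-premium            # placeholder name
provisioner: file.csi.azure.com          # Azure Files CSI driver
parameters:
  skuName: Premium_LRS                   # premium file shares for better IOPS and latency
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-dataset                   # placeholder name
spec:
  accessModes:
  - ReadWriteMany                        # shared across nodes and pods
  storageClassName: azurefile-rwx-premium
  resources:
    requests:
      storage: 1Ti
```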
How does zone colocation improve storage performance for ML workloads?
Zone colocation ensures Kubernetes pods and their storage exist in the same availability zone, eliminating cross-zone network hops that add 3-10ms of latency. For ML workloads targeting sub-1ms storage latency (random data loading during training), cross-zone placement kills performance. GKE Managed Lustre achieves colocation through the WaitForFirstConsumer binding mode, which delays volume binding until the pod is scheduled, then provisions storage in the pod’s zone.
For a complete overview of all aspects of Kubernetes storage for AI workloads—including architecture patterns, performance benchmarks, implementation strategies, and cost optimisation—see our comprehensive Kubernetes storage for AI workloads resource.