You’ve got expensive GPUs sitting there, costing $2-3 per hour or more, and they’re just waiting. Not training models. Not processing data. Just waiting for storage to catch up.
This is the problem with the Container Storage Interface (CSI) in Kubernetes when you throw AI workloads at it. CSI was designed for a world where containers were ephemeral and stateless. AI training runs for days or weeks, needs terabytes of data fed continuously to GPUs, and requires checkpoint storage that can handle massive writes without bringing everything to a halt.
The architectural mismatch is straightforward. CSI assumes pods are replaceable and external state management handles persistence. AI assumes multi-day GPU runs with sustained high throughput are normal. When these two worlds collide, you get storage bottlenecks that translate directly to wasted GPU spend and frustrated engineering teams.
Understanding why standard CSI fails helps you evaluate alternatives and avoid costly misconfigurations. This article is part of our comprehensive guide to Kubernetes storage for AI workloads hitting infrastructure limits. Let’s get into it.
What is Container Storage Interface and why does it matter for Kubernetes?
Container Storage Interface is the standard connector that lets Kubernetes work with different storage vendors. Think of it as a storage integration contract for Kubernetes—it decouples the storage driver lifecycle from the platform.
This matters because storage vendors can now release drivers independently of the Kubernetes release cycle. Before CSI, every storage integration was an in-tree volume plugin tied to Kubernetes versions. CSI changed that with a standardised, vendor-neutral interface.
In practice, CSI is the foundation for PersistentVolumes and PersistentVolumeClaims in modern Kubernetes. It’s what lets organisations connect the platform to virtually any enterprise storage system.
Dynamic provisioning works through a StorageClass that tells the driver what type of resource to create. A PersistentVolume is a storage resource that can outlive any individual pod. On AKS clusters using CSI drivers, for example, the built-in managed-csi storage class creates a managed disk backed by Azure Standard SSD locally redundant storage.
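To make that concrete, here's a minimal sketch of dynamic provisioning on AKS. The managed-csi class ships with the Azure Disk CSI driver; the claim name and size below are illustrative.

```yaml
# A PersistentVolumeClaim against the built-in managed-csi class. The Azure
# Disk CSI driver provisions a Standard SSD LRS managed disk when the claim
# is bound. Names and sizes are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data            # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce              # single-node attachment, the default CSI pattern
  storageClassName: managed-csi
  resources:
    requests:
      storage: 1Ti
```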
How was CSI originally designed and what assumptions did it make?
CSI came out of Kubernetes’ initial stateless workload focus. The design assumed pods are ephemeral and replaceable, with external state management handling persistence where needed.
The spec prioritised portability and vendor neutrality over performance optimisation. Basic capabilities covered provisioning volumes, attaching to nodes, mounting to pods, and expanding capacity. That’s it.
What CSI didn’t anticipate? Multi-day GPU training runs requiring sustained high throughput. It lacked topology awareness, data locality optimisation, and application-aware snapshots. Volume operations were tightly coupled to the pod lifecycle, with limited support for Day 2 operations.
CSI is an extension of traditional storage architectures and wasn’t designed to support Kubernetes’ level of dynamic, distributed scheduling. Going beyond basic provisioning and snapshots requires vendor-specific driver customisation.
Containers are ephemeral by design—they start, stop, and move constantly—while the data those workloads rely on must remain persistent and always available. Traditional storage systems weren’t built for Kubernetes environments where volumes can stay locked to specific nodes, making data hard to access when containers move.
What CSI was missing: I/O prioritisation, checkpoint management, coordinated backups, and disaster recovery orchestration. For more on how these gaps affect business continuity, see our guide on business continuity and disaster recovery strategies for Kubernetes storage.
Why do AI training workloads break standard CSI assumptions?
CSI’s design assumptions don’t hold when you throw AI training at them.
AI training runs for days or weeks. Models training for that long need reliable checkpoint storage and recovery mechanisms. GPUs consume data continuously at high throughput. Any I/O wait translates to expensive GPU idle time.
Training datasets can be terabytes in size. Multi-terabyte datasets must be efficiently cached close to GPU nodes without exploding storage costs. Multi-GPU training requires parallel filesystem access patterns CSI wasn’t designed for.
Checkpoint writes must complete quickly to minimise training interruption during model state saves. CSI’s single-pod attachment model conflicts with distributed training accessing shared datasets. Standard CSI drivers lack I/O prioritisation needed to prevent checkpoint writes from starving read operations.
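The access-mode conflict is easy to see in a manifest. A claim like the sketch below asks for ReadWriteMany so every worker pod can read the shared dataset, but most block-backed CSI drivers only honour ReadWriteOnce, which pushes you onto a file- or parallel-filesystem-backed class (azurefile-csi is shown here as an assumed example; names and sizes are illustrative).

```yaml
# Shared-dataset claim for distributed training. RWX is the requirement;
# whether it is satisfied depends entirely on the driver behind the class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-training-dataset
spec:
  accessModes:
    - ReadWriteMany                 # many GPU pods reading one dataset
  storageClassName: azurefile-csi   # assumed file-backed class; block classes typically reject RWX
  resources:
    requests:
      storage: 10Ti
```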
Here’s a concrete example: a multi-day LLM training run on a GPU cluster writes a checkpoint every hour. While each checkpoint flushes, the data pipeline still has to keep every GPU fed, so the workload needs continuous streaming reads, rapid write bursts, and high sustained throughput all at once.
GPUs are fast, but they’re only as effective as the data feeding them. Slow or outdated storage solutions cause I/O bottlenecks, leaving expensive GPUs underutilised.
The performance mismatch is clear. CSI was designed for database transaction patterns with random I/O. AI needs streaming throughput. The cost impact? That idle time adds up fast when a GPU is sitting there 40-60% of the time waiting for storage instead of training. For specific performance metrics and benchmarks, see our deep-dive on AI training and inference storage performance requirements.
What are the performance requirements for AI model training storage?
AI model training storage needs high sustained throughput in the GB/s range to feed data pipelines keeping GPUs saturated. Low latency checkpoint writes—sub-second for model state saves—minimise training interruption. High IOPS handles loading thousands of small training samples from datasets.
Azure Container Storage v2.0.0, optimised specifically for local NVMe drives, delivers approximately 7x higher read/write IOPS and 4x lower latency than previous versions. The same benchmarks report 60% better PostgreSQL transaction throughput and 5x faster model file loading for Llama-3.1-8B-Instruct compared to an ephemeral OS disk.
NVMe disks deliver significantly higher IOPS and throughput compared to traditional HDD or SSD options. Sub-millisecond latency for accessing training data prevents GPU starvation.
Data locality avoids network transfer delays and egress costs when moving terabyte datasets. Parallel filesystem access allows multiple pods to read shared datasets simultaneously. Reliable storage prevents data loss during multi-day training runs costing thousands in GPU time.
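One pattern that addresses the caching requirement is a generic ephemeral volume carved out of node-local NVMe, so each training pod gets scratch space next to its GPU. The sketch below assumes an NVMe-backed storage class called local-nvme and a hypothetical trainer image; both are placeholders, not product names.

```yaml
# Training pod with a per-pod NVMe-backed cache volume. The ephemeral claim
# is created and deleted with the pod, keeping hot data on the GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # hypothetical training image
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: dataset-cache
      mountPath: /cache
  volumes:
  - name: dataset-cache
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-nvme           # assumed NVMe-backed class
          resources:
            requests:
              storage: 2Ti
```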
How do storage bottlenecks impact GPU utilisation and costs?
If GPUs sit idle waiting for data, you’re paying for expensive hardware that isn’t doing useful work. Costs rise when queues are long, jobs churn, or accelerators spend their runtime blocked on I/O.
A single NVIDIA H100 GPU costs upward of $40,000. When static allocation leaves these resources idle even 25 percent of the time, organisations are essentially missing out on $10,000 worth of value per GPU annually.
Studies suggest typical AI workflows spend between 30 percent to 50 percent of their runtime in CPU-only stages, meaning expensive GPUs contribute nothing during that period. Even 20-30% GPU utilisation loss from storage bottlenecks translates to thousands monthly in a small cluster.
Multi-GPU training amplifies the problem. All GPUs wait when any one starves for data. Checkpoint storage bottlenecks extend overall training time, increasing total compute costs.
Poor storage performance forces teams to over-provision GPUs to compensate. Engineers spend time troubleshooting “slow training” when the root cause is storage, not model architecture. The opportunity cost matters—delayed model deployment means delayed business value.
Bandwidth is often a hidden bottleneck. Distributed training depends on low-latency, high-bandwidth interconnects, but many environments fall short, causing GPUs to starve even when hardware is available. Unstructured data drives AI, but traditional storage systems can’t always feed accelerators fast enough.
What makes stateful workloads different from stateless workloads in Kubernetes?
Stateless applications treat pods as interchangeable. Storage is external or ephemeral. Stateful workloads like databases require stable network identity, persistent storage, and ordered deployment.
StatefulSets require careful pod management, persistent network identities, and ordered deployment/scaling. Database workloads demand low-latency, high-IOPS storage that’s vastly different from ephemeral container needs.
Volume resizing for StatefulSets requires a complex workaround. The StatefulSet controller currently has no native support for resizing volumes, even though almost all CSI implementations support volume expansion that the controller could hook into.
Failure scenarios differ too. Stateless pods restart anywhere. Stateful pods need data locality for recovery. Database failover complexity increases when CSI drivers lack topology awareness across availability zones.
The workaround for StatefulSet volume resizing requires using cascade orphan deletion to preserve running pods while removing the StatefulSet controller, manually editing volumes and claims, then recreating the controller. This process is so complex that vendors like Plural built dedicated controllers to automate it.
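Here's roughly what that workaround looks like in practice, sketched with an illustrative PostgreSQL StatefulSet. It assumes the CSI driver supports expansion and the StorageClass sets allowVolumeExpansion: true; names and sizes are placeholders.

```yaml
# 1. Remove the controller but keep its pods and PVCs:
#      kubectl delete statefulset postgres --cascade=orphan
# 2. Grow each claim the controller created:
#      kubectl patch pvc data-postgres-0 \
#        -p '{"spec":{"resources":{"requests":{"storage":"2Ti"}}}}'
# 3. Re-apply the StatefulSet with the larger template so future replicas match:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 2Ti           # the new, larger size
```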
What storage solutions work better than standard CSI for AI workloads?
Container-native storage deploys storage controllers as Kubernetes microservices rather than external drivers, treating storage as a first-class Kubernetes citizen.
NVMe-backed storage provides sub-millisecond latency and high IOPS required for GPU workloads. Azure Container Storage integrates seamlessly with AKS to enable provisioning of persistent volumes for production-scale, stateful workloads.
Portworx simplifies Kubernetes data management, aiming to automate, protect, and unify data across any distribution, storage system, or workload. It extends CSI with data management features: DR, mobility, I/O control, and application-aware backups.
It can also migrate data and manifests between clusters, enabling blue/green upgrades of clusters running stateful workloads, and it pairs I/O control for shaping traffic with storage automation that grows PVCs and storage pools automatically.
Purpose-built AI clouds like CoreWeave integrate optimised storage, networking, and observability. They give you GPU diversity with fast time-to-market for new hardware, high-bandwidth interconnects like NVIDIA InfiniBand, and AI-tuned storage stacks that prevent idle GPU time.
DataCore Puls8 is a container-native storage platform engineered specifically for Kubernetes environments. Puls8 transforms local disks into a fully managed, resilient storage pool, provisions volumes automatically when containers request them, and handles replication and failure recovery behind the scenes.
KAITO (Kubernetes AI Toolchain Operator) automates model deployment with fast storage integration. Mirantis k0rdent AI offers GPU-as-a-Service with storage optimised for hybrid/multi-cloud AI workloads.
For a comprehensive comparison of cloud provider solutions, see our guide on comparing cloud provider Kubernetes storage solutions for machine learning.
FAQ Section
Can CSI drivers be extended to handle AI workload requirements?
Vendor-extended CSI implementations like Portworx add capabilities beyond base specification. Going beyond basic provisioning and snapshots in standard CSI requires vendor-specific driver customisation, which introduces problems. Each vendor’s driver has its own learning curve, making it difficult to standardise workflows.
However, fundamental architectural constraints remain: single-pod attachment semantics, limited topology awareness, lack of application-aware operations. Extensions help but don’t fully resolve the mismatch between CSI’s stateless origins and AI’s persistent, high-performance needs.
How do I identify if storage is bottlenecking my GPU workloads?
If GPUs show less than 70% utilisation while I/O wait is high, storage is likely the bottleneck.
Look for training throughput degradation during data loading phases and extended checkpoint write times. GPU utilisation tells only part of the story. You need end-to-end visibility into GPU occupancy, memory throughput, interconnect latency, and storage I/O.
NVIDIA GPU metrics combined with storage IOPS/latency telemetry reveal the correlation between storage performance and GPU idle time.
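If you run the Prometheus Operator with dcgm-exporter and node-exporter already scraped, a rule along these lines is one way to surface that correlation. The metric names come from those exporters; the 70% and 20% thresholds are illustrative starting points, not recommendations.

```yaml
# Alert when GPUs are underutilised while nodes are stuck in I/O wait.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-storage-bottleneck
spec:
  groups:
  - name: gpu-storage
    rules:
    - alert: GPUStarvedByStorage
      expr: |
        avg(DCGM_FI_DEV_GPU_UTIL) < 70
        and
        avg(rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.2
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "GPU utilisation is low while nodes wait on I/O; check storage throughput"
```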
What’s the difference between block storage and NVMe for AI training?
Traditional block storage (AWS EBS, Azure Disk) uses network-attached volumes with higher latency and lower IOPS. NVMe is a direct-attached, flash-optimised protocol providing sub-millisecond latency and much higher throughput.
NVMe disks deliver significantly higher IOPS and throughput compared to traditional HDD or SSD options. For GPU training, NVMe’s performance characteristics better match data consumption rates. Benchmarks show 7x IOPS improvement switching to NVMe-backed storage for AI workloads.
Does StatefulSet volume resizing require downtime?
Standard Kubernetes lacks native StatefulSet volume resize support, requiring a workaround that preserves running pods while manually editing volumes and claims, then recreating the controller. This complex procedure can cause downtime if not executed carefully.
Operators from Plural and Zalando automate this process, but the fundamental CSI limitation remains.
Can I use CSI for inference workloads or only training?
Inference workloads have different storage profiles than training. Inference workloads favour responsiveness, concurrency, and predictable cost—delivering results to users with minimal delay while serving many requests. They prioritise latency over throughput, model loading speed over dataset streaming.
CSI limitations still apply. Slow model loading increases time-to-first-token for LLM serving. KAITO integrates with Azure Container Storage to accelerate model deployment, with benchmarks showing 5 times improvement in model file loading.
vLLM and other serving frameworks benefit from fast storage but can tolerate CSI better than multi-day training runs.
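A common serving-side pattern is to mount model weights from a shared read-only volume so replicas skip re-downloading multi-gigabyte files at startup. The sketch below assumes a pre-populated claim called llama-weights and uses the public vLLM serving image; treat both as illustrative rather than a prescribed setup.

```yaml
# Two serving replicas sharing one read-only copy of the model weights.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest    # assumed serving image
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-weights
          mountPath: /models
          readOnly: true
      volumes:
      - name: model-weights
        persistentVolumeClaim:
          claimName: llama-weights        # hypothetical pre-populated claim
          readOnly: true
```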
How does data locality affect AI training costs?
Data locality—co-locating compute and storage—minimises network transfer times and cloud egress costs. Training with terabyte datasets across availability zones or regions incurs significant latency and transfer fees.
Keep training efficient by reducing cross-region data movement. Co-locating storage and compute minimises latency, cuts egress fees, and ensures expensive GPUs stay busy.
For performance-sensitive applications, you want your compute resources and storage to be as close as possible, ideally within the same zone. Storage schedulers can delay volume provisioning until pod placement is determined, guaranteeing co-location to minimise network latency.
CSI drivers often lack topology awareness, leading schedulers to place pods away from data. Azure Container Storage and container-native solutions provide better locality management.
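The standard Kubernetes knob for this is the volume binding mode. A StorageClass like the sketch below (AKS provisioner shown as an example; substitute your driver's values) delays provisioning until the scheduler has placed the pod, so the disk is created in the same zone as the GPU node.

```yaml
# Topology-aware class: the volume is only provisioned once the pod lands.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gpu-zone-aligned
provisioner: disk.csi.azure.com    # example provisioner; swap for your driver
parameters:
  skuName: Premium_LRS
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```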
What is container-native storage and how does it differ from CSI?
Container-native storage deploys storage controllers as Kubernetes microservices rather than external drivers. It treats storage as a first-class Kubernetes citizen using native primitives for management, scaling, and resilience.
DataCore Puls8 is an example providing enterprise features (HA, replication) within Kubernetes. It offers better automation and orchestration than external CSI drivers while maintaining a Kubernetes-native operational model.
Do I need specialised storage for every AI workload?
Not all AI workloads require optimised storage. Inference with small models, batch processing with cached datasets, and development/testing can work with standard CSI.
However, production training (especially LLMs and multi-GPU distributed training), real-time inference with large models, and RAG systems with vector databases benefit significantly from high-performance storage. Evaluate based on GPU costs vs storage investment.
How do multi-cloud deployments complicate CSI management?
Each cloud provider has different CSI driver implementations with varying capabilities and performance characteristics. Managing upgrades, migrations, and disaster recovery across AWS EBS, Azure Disk, and Google Persistent Disk creates operational complexity.
Portworx and container-native solutions provide consistent abstraction across clouds, simplifying multi-cloud storage management while extending beyond basic CSI capabilities.
What role does KAITO play in AI model deployment?
Kubernetes AI Toolchain Operator is the first Kubernetes-native controller automating AI model deployment. It integrates with Azure Container Storage for fast model loading, reducing time from model selection to inference availability.
It simplifies the operational complexity of deploying models with appropriate GPU resources and storage configurations. KAITO addresses the gap between Kubernetes primitives and AI-specific deployment patterns.
Can CSI handle checkpoint storage reliably?
Basic CSI provides volume provisioning and snapshot capabilities suitable for periodic checkpoints. However, it lacks optimisation for large, frequent checkpoint writes during training.
Checkpoint performance depends on the underlying storage backend: NVMe-backed implementations handle it better than network block storage. Application-level checkpoint management (frequency, retention, verification) remains the developer’s responsibility as CSI has no training workflow awareness.
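For reference, here's what that basic snapshot capability looks like: a point-in-time VolumeSnapshot of the checkpoint volume, taken between epochs. The snapshot class and names are assumptions; scheduling, retention, and verification still have to live in your training pipeline, because CSI knows nothing about the workflow.

```yaml
# Point-in-time snapshot of a checkpoint PVC via the CSI snapshot API.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: llm-checkpoint-epoch-12            # illustrative name
spec:
  volumeSnapshotClassName: csi-snapclass   # assumed VolumeSnapshotClass
  source:
    persistentVolumeClaimName: training-checkpoints
```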
How does GPU partitioning (MIG) affect storage requirements?
Multi-Instance GPU partitions a single GPU into multiple isolated instances for concurrent inference workloads. Each instance needs coordinated storage access to prevent I/O contention.
The scheduler must consider both MIG topology and storage proximity when placing pods. This increases the complexity of storage provisioning and monitoring as more granular resource allocation requires more precise storage orchestration.
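A rough sketch of what that looks like at the pod level, assuming the NVIDIA device plugin is running with the mixed MIG strategy on A100-class nodes: each slice gets its own scratch volume so concurrent instances don't contend on a single mount. The resource name, image, and storage class are illustrative.

```yaml
# Inference pod bound to one MIG slice with its own NVMe-backed scratch volume.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference
spec:
  containers:
  - name: model-server
    image: registry.example.com/model-server:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1      # one isolated GPU slice
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-nvme   # assumed NVMe-backed class
          resources:
            requests:
              storage: 100Gi
```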
For the complete landscape of Kubernetes storage challenges and solutions for AI workloads, see our comprehensive overview of AI workload storage limits.