Business | SaaS | Technology
Jan 8, 2026

AI Training and Inference Storage Performance Requirements Benchmarked

AUTHOR

James A. Wondrasek

Storage bottlenecks waste GPU resources. Your H100 GPUs sit idle during checkpoint operations, burning budget that could fund more compute.

MLPerf benchmarks reveal the concrete numbers. IBM Storage Scale hit 656.7 GiB/s reads for 1T model training. Inference demands sub-millisecond latency with hundreds of thousands of IOPS.

Here’s the key distinction: training demands sequential throughput for checkpoint bandwidth. Inference needs IOPS and ultra-low latency. These are different workloads requiring different storage architectures.

This article is part of our comprehensive guide to Kubernetes storage for AI workloads, where we explore the complete landscape of storage challenges and solutions for machine learning infrastructure.

If you inherited legacy storage infrastructure, you’ve probably discovered traditional enterprise systems can’t meet AI workload demands. Storage optimised for databases performs well on OLTP workloads but collapses under TB-scale sequential checkpoint writes.

This analysis provides benchmarked performance targets across model types, GPU counts, and deployment scales. You’ll get practical sizing formulas and vendor comparison data for evidence-based infrastructure decisions.

NVIDIA DGX SuperPOD, IBM Storage Scale, and Google Cloud implementations provide the real-world performance data. Let’s start with training workloads and their specific storage requirements.

What Storage Performance Do AI Training Workloads Actually Need?

AI training storage requires sustained sequential write throughput measured in hundreds of GBps—not tens—for checkpoint operations.

NVIDIA’s DGX SuperPOD reference architecture specifies 40-125 GBps read bandwidth tiers based on GPU cluster size. The standard tier delivers 40 GBps reads and 20 GBps writes for a single Scalable Unit. Scale to 4 SUs and you need 160 GBps reads and 80 GBps writes.

The enhanced tier pushes higher: 125 GBps reads and 62 GBps writes for single SU, scaling to 500 GBps reads and 250 GBps writes for 4 SU clusters.

IBM Storage Scale achieved 656.7 GiB/s read bandwidth and 412.6 GiB/s write bandwidth in MLPerf benchmarks for 1T model checkpoints. For the Llama 3.1 1T model, checkpoint load took around 23 seconds and save took around 37 seconds.

Here’s the checkpoint bandwidth formula: required_bandwidth = (model_size + optimizer_state) / (checkpoint_interval × acceptable_overlap_fraction).

Example: A 1T parameter model checkpointing every 30 minutes with 10% overlap target requires under 845 GBps bandwidth. The total checkpoint size reaches 15.2TB when you include the optimizer state (13.2TB for Adam optimizer plus 2TB for model parameters).
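
To make the formula concrete, here is a minimal Python sketch under the article’s assumptions (2-byte bf16/fp16 weights plus 13.2 TB of Adam optimizer state). The helper names are illustrative, and the swept overlap targets simply show how sharply the bandwidth requirement moves with the write-window budget.

```python
# Minimal sizing sketch for synchronous checkpoint bandwidth.
# Assumptions: 2 bytes per parameter (bf16/fp16) and the article's 13.2 TB
# Adam optimizer state for a 1T-parameter model; adjust for your setup.

def checkpoint_size_tb(params_billions: float,
                       bytes_per_param: float = 2.0,
                       optimizer_state_tb: float = 0.0) -> float:
    """Total checkpoint size in TB: model weights plus optimizer state."""
    weights_tb = params_billions * 1e9 * bytes_per_param / 1e12
    return weights_tb + optimizer_state_tb

def required_bandwidth_gbs(checkpoint_tb: float,
                           interval_minutes: float,
                           overlap_fraction: float) -> float:
    """Sustained write bandwidth (GB/s) needed for the checkpoint to finish
    within overlap_fraction of the checkpoint interval."""
    write_window_s = interval_minutes * 60.0 * overlap_fraction
    return checkpoint_tb * 1000.0 / write_window_s

if __name__ == "__main__":
    ckpt = checkpoint_size_tb(1000, optimizer_state_tb=13.2)  # ~15.2 TB
    print(f"checkpoint size: {ckpt:.1f} TB")
    for overlap in (0.01, 0.05, 0.10):
        bw = required_bandwidth_gbs(ckpt, interval_minutes=30, overlap_fraction=overlap)
        print(f"  {overlap:.0%} overlap target -> {bw:,.0f} GB/s of sustained writes")
```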

Training workloads prioritise throughput over latency. Seconds of latency are acceptable as long as bandwidth keeps the GPUs fed. Storage bottlenecks show up as GPU idle time during checkpoint writes and data loading.

VAST Data analysed 85,000 checkpoints across production training runs and found global checkpoint bandwidth requirements are modest, typically well below 1 TB/s even for 1T models. Counterintuitively, checkpoint bandwidth per GPU decreases as model size grows. Larger models use more data-parallel training, spreading the checkpoint load across more GPUs. Each GPU handles a smaller checkpoint shard, reducing per-GPU bandwidth needs even as total checkpoint size increases.

Computer vision tasks with high-resolution imagery need roughly 4 GBps per GPU read performance for datasets exceeding 30 TB. LLM workloads require solid write performance for checkpoint operations—at least half of the read performance as write capability.

Shared filesystem RAM caching delivers roughly an order of magnitude better performance than going to remote storage. DGX B200 systems add local NVMe for staging data closer to the GPUs.

MLPerf Storage v2.0 received 200+ performance results from 26 submitting organisations across seven countries. Storage systems now support roughly twice the number of accelerators compared to MLPerf v1.0 benchmarks. When evaluating these benchmark results, the key metric differentiating storage performance is the maximum number of GPUs a system can support, essentially determined by maximum aggregate bandwidth.

What Storage IOPS Are Required for AI Inference Workloads?

AI inference storage demands high IOPS—tens to hundreds of thousands—with microseconds to low milliseconds response times. This is a different performance profile than training.

Concurrent model serving generates random access patterns across model parameters and embedding tables. Your storage must handle huge numbers of simultaneous small I/O operations through scale-out architectures.

Systems must deliver very low latency access through all-flash storage, NVMe drives, and in-memory or GPU-resident data. Technologies like NVMe-over-Fabrics and RDMA minimise network hops. NVMe-oF achieves 20-30 microsecond latency by avoiding SCSI emulation layers. RDMA achieves 10-20 microsecond latencies through CPU-bypass transfers.

NVMe-oF delivers up to 3-4× the IOPS per CPU core compared to iSCSI. The NVMe protocol supports thousands of queues mapped to CPU cores, handling millions of IOPS per host.

Example: A production inference cluster with 100 concurrent requests needs 100,000+ IOPS at sub-millisecond latency. Double the concurrent requests, double the IOPS requirements.
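
As a back-of-the-envelope sketch of that scaling: the per-request figure below is an assumption implied by the 100-request, 100,000-IOPS example above, not a measured constant, so replace it with profiling data from your own serving stack.

```python
# Rough inference storage sizing. The iops_per_request value is an assumption
# read off the 100-request / 100,000-IOPS example above; measure your own
# serving stack before committing to hardware.

def inference_iops(concurrent_requests: int,
                   iops_per_request: int = 1_000,
                   headroom: float = 1.3) -> int:
    """Random-read IOPS the storage tier must sustain, with burst headroom."""
    return int(concurrent_requests * iops_per_request * headroom)

for concurrency in (100, 200, 400):
    print(f"{concurrency:>4} concurrent requests -> "
          f"~{inference_iops(concurrency):,} IOPS at sub-millisecond latency")
```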

Inference now dominates AI workloads, representing 80-90% of AI compute usage. Storage latency directly impacts inference efficiency. Inference storage bottlenecks appear as increased request queuing time and degraded p99 latency metrics.

How Do Training and Inference Storage Requirements Differ?

Training optimises for sequential throughput for checkpoint writes and data loading. Inference optimises for random IOPS and latency. These are incompatible storage architectures that require different infrastructure approaches.

The scaling dimensions differ: checkpoint bandwidth needs scale linearly with model size while inference IOPS scale with concurrent request count. Training tolerates seconds of I/O latency while inference demands sub-millisecond response times for production SLAs. What’s fine for training breaks SLAs for inference.

Storage tiering strategy reflects these differences. Training workloads use shared parallel file systems optimised for large sequential writes. Inference deployments use NVMe-backed local storage optimised for random reads. You cannot use the same storage architecture for both workloads effectively.

Cost-performance trade-offs differ across workload types. Training maximises checkpoint bandwidth per dollar spent on storage infrastructure. Inference minimises latency percentiles at the cost of higher per-IOPS infrastructure investment. Optimising for one workload degrades the other.

Understanding these cost implications of high-performance storage helps balance performance requirements with budget constraints when selecting storage tiers.

The I/O patterns tell the story clearly. Training generates large sequential writes—TB-scale checkpoints—requiring high sustained throughput. Inference creates small random reads—KB-MB parameter fetches—requiring high IOPS with minimal latency.

Cloud object storage (S3, GCS) offers geo-redundancy and cost-effective capacity but latency can stretch multi-GB checkpoint operations into minutes. Local disks deliver lower latency but take checkpoints offline if the machine fails. VAST Data aggregates NVMe drives across an entire cluster into a single global namespace providing parallel I/O at tens of gigabytes per second.

For distributed training at scale, training data typically lives in object storage because jobs run in parallel across hundreds of compute nodes. High-performance StorageClasses backed by SSDs serve training workloads with heavy I/O demands. Data locality matters: remote storage access creates bottlenecks, and caching sidecars in pods improve read/write performance.

Infrastructure rightsizing requires understanding which workload type dominates actual cluster usage patterns. Over-provisioning training storage wastes budget that could buy more GPUs.

What Are MLPerf Storage Benchmarks and What Do They Measure?

MLPerf Storage benchmarks developed by MLCommons provide standardised AI workload performance measurement across vendors. No more relying on marketing claims.

The benchmark suite simulates real AI workload access to storage systems through multiple clients, replicating storage loads in large-scale distributed training clusters. JuiceFS describes how MLPerf Storage simulates realistic access patterns rather than synthetic benchmarks.

Version 2.0 introduced checkpoint benchmarks addressing the reality that at scales of 100,000+ accelerators, system failures occur frequently. At that scale with 50,000-hour mean time to failure per accelerator, a cluster running at full utilisation will likely cop a failure every half-hour.

Key metrics: sustained throughput (GBps), IOPS, latency percentiles (p50/p95/p99), checkpoint operation duration. IBM Storage Scale’s MLPerf results provide a real-world vendor performance baseline—656.7 GiB/s read for 1T models.

Benchmarks enable objective comparison across storage architectures—parallel file systems, cloud storage, NVMe arrays, all-flash systems. Results translate to sizing decisions: benchmark throughput × cluster scale factor = required infrastructure capacity.
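
One way to turn benchmark results into a first-pass sizing estimate is sketched below. Both inputs are placeholders: the GPUs-supported-per-system figure would come from a vendor’s actual MLPerf Storage submission, and the 4 GB/s per-GPU feed rate is the high-resolution vision figure quoted earlier.

```python
import math

# First-pass sizing from a published MLPerf Storage result.
# Placeholder inputs: take gpus_per_system from the vendor's submission and
# the per-GPU feed rate from your dominant workload profile.

def storage_systems_needed(target_gpus: int, gpus_per_system: int) -> int:
    """Benchmarked storage systems required to keep target_gpus above the
    utilisation threshold."""
    return math.ceil(target_gpus / gpus_per_system)

def required_read_bandwidth_gbs(target_gpus: int, gbs_per_gpu: float) -> float:
    """Aggregate read bandwidth implied by the per-GPU feed rate."""
    return target_gpus * gbs_per_gpu

target_gpus = 512          # planned cluster size (example)
gpus_per_system = 128      # hypothetical: max GPUs one system sustained in the benchmark
per_gpu_feed_gbs = 4.0     # GB/s per GPU, e.g. high-resolution vision datasets

print(storage_systems_needed(target_gpus, gpus_per_system), "storage systems")
print(required_read_bandwidth_gbs(target_gpus, per_gpu_feed_gbs), "GB/s aggregate reads")
```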

MLPerf Storage v2.0 attracted dramatically increased participation. Submissions included 6 local storage solutions, 13 software-defined solutions, 16 on-premises shared storage systems, 12 block systems, 2 in-storage accelerator solutions, and 2 object stores.

Our enterprise vendor MLPerf Storage benchmark comparison provides detailed analysis of JuiceFS bandwidth utilisation results and other vendor performance data to support objective evaluation.

MLPerf documentation requires submitted results to meet GPU utilisation thresholds: 90% for 3D U-Net and ResNet-50, and 70% for CosmoFlow. Storage system selection significantly impacts training efficiency.

Given GPU utilisation thresholds are met, the key metric differentiating performance is maximum number of GPUs the storage system can support—determined by system maximum aggregate bandwidth. Network bandwidth utilisation serves as a reference metric: higher utilisation indicates greater software efficiency.

While MLPerf enables objective comparisons of raw performance capabilities, direct cross-vendor comparisons should account for differences in hardware configurations, node scales, and application scenarios. Use MLPerf results to understand achievable performance levels for your workload type rather than as simple vendor rankings.

How Much Checkpoint Bandwidth Does LLM Training Need?

Global checkpoint bandwidth requirements are modest, typically well below 1 TB/s even for 1T models. This contradicts common assumptions about storage requirements.

VAST Data developed a sizing model relating GPU scale and reliability to required global checkpoint bandwidth. The checkpoint bandwidth formula: checkpoint_bandwidth = checkpoint_size × frequency / (acceptable_overlap × training_time).

In an 800B parameter training run, checkpoint interval was 40 minutes with median checkpoint duration of 3.6 minutes, resulting in roughly 9% checkpoint overlap. Production runs consistently kept median checkpoint overlap under 10% of total training time.

Checkpoint size calculation: model parameters (1T × 2 bytes = 2TB) + optimizer state (13.2TB for Adam) = 15.2TB total. For 1T models, the optimizer state dominates, making up 13.2TB of the 15.2TB checkpoint.

Example: 15.2TB checkpoint written every 30 minutes with 10% acceptable overlap = 845 GBps required bandwidth. Asynchronous checkpointing using node-local NVMe reduces shared storage bandwidth requirements by 3-10×.

Model trainers rely on asynchronous checkpointing, where frequent checkpoints write quickly to node-local storage and then drain to global storage at a lower, fixed frequency. Whole-node failures are rare, only about 5% of failures, so local checkpoints usually survive crashes. Global storage needs only enough bandwidth to absorb periodic drains, not the full write rate implied by GPU throughput.
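
A minimal sketch of the two-tier arithmetic follows. It assumes local NVMe absorbs every checkpoint and only every Nth checkpoint drains to shared storage; the 64-node count, the one-minute local write window, and the drain policies are illustrative assumptions rather than measurements.

```python
# Two-tier (asynchronous) checkpointing sketch: every checkpoint lands on
# node-local NVMe, and only every Nth checkpoint drains to shared storage in
# the background. Node count, write window, and drain policy are assumptions.

def per_node_nvme_gbs(checkpoint_tb: float, write_window_s: float, nodes: int) -> float:
    """Local write rate per node if the checkpoint is sharded evenly."""
    return checkpoint_tb * 1000.0 / write_window_s / nodes

def shared_drain_gbs(checkpoint_tb: float, interval_min: float, drain_every_n: int) -> float:
    """Average shared-storage bandwidth when the drain spreads over the full
    interval between drained checkpoints."""
    return checkpoint_tb * 1000.0 / (interval_min * 60.0 * drain_every_n)

ckpt_tb, interval_min, nodes = 15.2, 30.0, 64   # 1T example; 64 nodes is hypothetical
print(f"per-node NVMe: {per_node_nvme_gbs(ckpt_tb, 60.0, nodes):.1f} GB/s "
      f"to absorb a checkpoint in about a minute")
for n in (1, 3, 10):
    print(f"drain every {n} checkpoint(s): "
          f"{shared_drain_gbs(ckpt_tb, interval_min, n):.1f} GB/s to shared storage")
```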

File-per-shard versus aggregated file strategies impact I/O performance substantially. Research shows file system-aware aggregation and I/O coalescing achieve up to 3.9× higher write throughput compared to file-per-shard approaches. Production systems like DeepSpeed adopt file-per-shard layouts for implementation simplicity, but this creates fragmented I/O. IBM Blue Vela testing demonstrated that consolidating writes achieved roughly 34% throughput improvement.

Checkpoint overlap—not GB/s—is the most relevant performance metric for checkpointing. Lower overlap reduces likelihood of catastrophic failure before checkpoint synchronises to shared storage. Checkpoint overlap percentage directly translates to wasted GPU hours: a cluster running at 20% checkpoint overlap wastes 20% of compute capacity on I/O waits. For a 512-GPU cluster with $2/hour H100 GPUs, this represents $200+/hour wasted on storage I/O waits.
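
Priced out in code, using the example inputs above (512 GPUs at $2/hour); a tiny sketch, so substitute your own GPU rates and cluster size.

```python
# Convert checkpoint overlap into wasted spend, using the article's example
# inputs (512 GPUs at $2/hour). Swap in your own rates.

def wasted_dollars_per_hour(gpus: int, dollars_per_gpu_hour: float,
                            overlap_fraction: float) -> float:
    """GPU-hours lost to synchronous checkpoint I/O, priced per hour."""
    return gpus * dollars_per_gpu_hour * overlap_fraction

for overlap in (0.05, 0.10, 0.20):
    cost = wasted_dollars_per_hour(512, 2.0, overlap)
    print(f"{overlap:.0%} overlap on 512 GPUs at $2/hr -> ${cost:,.0f}/hour of idle GPU time")
```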

Even 1T models can train efficiently with well under 1 TB/s of checkpoint bandwidth. Overprovisioning I/O bandwidth consumes resources that could otherwise support more GPUs without providing improvement in training time performance.

Why Does Storage Become a Bottleneck for AI Workloads?

GPU compute capacity has scaled exponentially—A100 → H100 → B200—while storage I/O improvements lag behind. Your GPUs got faster but your storage didn’t keep pace.

Storage optimised for databases lacks the characteristics needed for AI workloads. Traditional enterprise storage designed for OLTP cannot sustain sequential TB-scale checkpoint writes.

Checkpoint stalls become a bottleneck when storage throughput falls below checkpoint_size / acceptable_write_window, forcing GPU idle time. Example bottleneck: 100 GBps storage serving a 512-GPU cluster with 15TB checkpoints means a 150-second synchronous write per checkpoint, which consumes most of a 10% overlap budget at a 30-minute checkpoint interval and blows through it at shorter intervals.

Cloud object storage (S3, GCS) offers cost-effective capacity but lacks throughput for training checkpoints. Storage performance that works well for databases often fails to meet AI training requirements.

Data must traverse multiple memory tiers from GPU HBM through host DRAM to local storage with orders of magnitude performance differences. Modern LLMs employ 3D parallelism across thousands of GPUs generating hundreds to thousands of distinct files per checkpoint, creating metadata contention on parallel file systems.

Traditional SAN protocols like iSCSI incur hundreds of microseconds of overhead per I/O operation. Most AI training servers ship with 4-8 NVMe SSDs instead of a single high-capacity device for performance. More parallel writes equal higher aggregate throughput.

What I/O Patterns Do Different AI Model Types Generate?

Large Language Models generate large sequential checkpoint writes—TB-scale—and high read bandwidth for training data. Modern LLMs employ 3D parallelism—tensor, pipeline, data—across thousands of GPUs generating hundreds to thousands of distinct files per checkpoint. Single-file aggregation reduced metadata pressure and demonstrated roughly 34% throughput improvement over fragmented file-per-shard approaches.

Computer Vision models have smaller checkpoints but very high IOPS for image augmentation pipelines—tens of thousands of small file reads. ResNet-50 workloads demand high IOPS for concurrent random I/O. Computer vision tasks with high-resolution imagery need roughly 4 GBps per GPU read performance for datasets exceeding 30 TB.

The CosmoFlow workload involves large-scale concurrent small-file access and is highly latency-sensitive; its bottleneck is latency stability rather than bandwidth.

Recommendation systems create embedding table I/O with mixed random/sequential patterns. Reinforcement Learning generates frequent small checkpoint writes—experience replay buffers—with burst I/O patterns. Multi-modal models mix LLM-style sequential writes with vision-style random reads.

FAQ

How do I calculate required checkpoint bandwidth for my GPU cluster?

Use the formula: (model_params × bytes_per_param + optimizer_state) × checkpoint_frequency × (1 / acceptable_overlap_percentage). Example: 175B model with Adam optimizer = (175B × 2 bytes + 700GB optimizer) × 2 checkpoints/hour × (1 / 0.10) = 190 GBps minimum bandwidth. Industry best practice targets less than 10% checkpoint overlap for cost-effective GPU utilisation.

Can I use cloud object storage for AI training checkpoints?

Cloud object storage offers geo-redundancy but latency can stretch checkpoint operations into minutes. Typically provides only 1-10 GBps throughput per bucket—inadequate for active training. Use it for archival and backup. Managed Lustre or parallel file systems are required for checkpoint writes. See our guide to cloud storage performance tiers for machine learning for detailed comparison of Azure Container Storage, GKE Managed Lustre, and AWS EBS.

What’s the difference between NVMe and SSD for AI workloads?

NVMe provides roughly 5-12× higher throughput (3-7 GB/s versus 550 MB/s) and 10-100× lower latency compared to SATA SSDs. NVMe-oF achieves 20-30 microsecond latency over fabric. NVMe rides the PCIe bus rather than the SATA interface that bottlenecks conventional SSDs. Most AI training servers ship with 4-8 NVMe SSDs for performance.

How do I identify if storage is bottlenecking my GPU cluster?

Monitor GPU utilisation during training. Sustained drops below 90% during checkpoint operations indicate storage bottleneck. MLPerf requires 90% GPU utilisation for 3D U-Net and ResNet-50 workloads, 70% for CosmoFlow. Measure checkpoint write duration—should be less than 10% of iteration time. Use nvidia-smi and I/O monitoring to correlate GPU idle time with storage I/O waits. For step-by-step validation procedures, see our implementation guide on benchmarking your storage implementation.
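
A rough watcher along those lines is sketched below; it is not a production monitor, just a starting point that assumes nvidia-smi is on the PATH and uses its standard CSV query output.

```python
# Hypothetical watcher (not from the article): samples GPU utilisation via
# nvidia-smi and flags sustained drops that often coincide with checkpoint
# writes or data-loading stalls. Assumes nvidia-smi is on the PATH.
import subprocess
import time

def gpu_utilisation() -> list[int]:
    """Per-GPU utilisation (%) from nvidia-smi's CSV query output."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line.strip()) for line in out.splitlines() if line.strip()]

def watch(threshold: int = 90, interval_s: float = 5.0) -> None:
    """Warn whenever mean utilisation dips below the MLPerf-style 90% target;
    correlate the timestamps with checkpoint and data-loading activity."""
    while True:
        util = gpu_utilisation()
        mean = sum(util) / len(util)
        if mean < threshold:
            print(f"{time.strftime('%H:%M:%S')} mean GPU utilisation {mean:.0f}% "
                  f"(below {threshold}%): check storage I/O wait")
        time.sleep(interval_s)

if __name__ == "__main__":
    watch()
```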

Does inference require the same storage as training?

No. Inference optimises for IOPS—tens of thousands—and sub-millisecond latency versus training’s sequential throughput measured in hundreds of GBps. Inference uses NVMe for low-latency random access. Training uses parallel file systems for checkpoint bandwidth. Completely different storage architectures. Inference now dominates AI workloads representing 80-90% of AI compute usage.

What storage performance does a 64-GPU training cluster need?

NVIDIA DGX SuperPOD reference architecture figures put a 64-GPU cluster in the 40-80 GBps read bandwidth range. For context, the standard tier delivers 40 GBps reads and 20 GBps writes per Scalable Unit, rising to 160 GBps reads and 80 GBps writes at 4 SUs; the enhanced tier reaches 500 GBps reads and 250 GBps writes at 4 SUs for billion+ parameter models. Scale requirements linearly with GPU count and model size.

How does asynchronous checkpointing reduce storage requirements?

Asynchronous checkpointing writes frequent checkpoints quickly to node-local storage then drains to global storage at lower fixed frequency. Whole-node failures are rare—only 5% of failures—so local checkpoints usually survive crashes. Global storage needs only enough bandwidth to absorb periodic drains, not full write rate implied by GPU throughput. Reduces checkpoint overlap from 20-30% (synchronous) to less than 5% (asynchronous), freeing GPU compute. Requires NVMe capacity equal to checkpoint_size per node.

What’s a realistic checkpoint overlap target for LLM training?

Industry best practice: less than 10% checkpoint overlap for cost-effective GPU utilisation. Production runs consistently keep median checkpoint overlap under 10% of total training time. 10% overlap = 10% of GPU-hours spent on I/O waits. Lower overlap reduces likelihood of catastrophic failure before checkpoint synchronises to shared storage. Calculate acceptable overlap: (GPU_cost × cluster_size × overlap%) versus storage infrastructure investment.

Can Kubernetes handle AI training storage performance needs?

Kubernetes persistent volumes can front-end high-performance storage—Lustre, parallel file systems, NVMe arrays—but K8s overhead adds latency. For small clusters (under 64 GPUs), CSI drivers backed by managed storage work well. Large-scale training often uses bare metal with direct storage access to eliminate orchestration overhead. Distributed training at scale typically pulls data from object storage because jobs run in parallel across hundreds of compute nodes, and high-performance StorageClasses backed by SSDs serve the heavy I/O of training workloads.

How much does adequate AI storage infrastructure cost compared to GPUs?

Storage typically runs 5-15% of total GPU cluster TCO for properly sized systems. Example: a $2M, 64-GPU H100 cluster needs roughly $100-300K of storage infrastructure. Overprovisioning I/O bandwidth consumes resources that could otherwise support more GPUs without improving training time. Under-provisioning wastes GPU capacity. Benchmark-based sizing optimises the cost-performance ratio. For detailed cost analysis frameworks, see our FinOps guide to balancing performance requirements with budget.

What storage metrics should I monitor for AI workloads?

Training: checkpoint write duration (seconds), throughput during checkpoints (GBps), GPU utilisation during I/O (%), checkpoint overlap (%). Refer to MLPerf utilisation thresholds for target performance. Network bandwidth utilisation serves as a reference metric with higher utilisation indicating greater software efficiency. Inference: IOPS during serving (K ops/sec), latency percentiles (p50/p95/p99 in ms), request queue depth. Set alerts for degradation thresholds.

How do I migrate from traditional storage to AI-optimised infrastructure?

Assess current bottlenecks through I/O profiling, calculate required performance using benchmark formulas, evaluate managed services (Cloud Lustre) versus on-premises (NVMe arrays/parallel file systems). Google Cloud Storage hierarchical namespace speeds up checkpoint writes by up to 20× compared to flat buckets and provides up to 8× higher QPS for bursty workloads. Pilot with subset of workloads, measure GPU utilisation improvement, validate ROI before full migration. Plan 3-6 month transition for production clusters. Our step-by-step storage class setup guide provides practical configuration examples for common migration scenarios.

For the complete landscape of Kubernetes storage challenges and solutions, including decision frameworks for choosing between cloud providers, enterprise vendors, and hybrid approaches, see our AI workload storage overview.
