How often do your AI training jobs run slower than they should? Meanwhile, the storage bills keep climbing. Traditional storage systems just weren’t designed for the parallel I/O patterns and massive checkpoint files your AI workloads throw at them.
You’re probably seeing GPU idle time during checkpoint saves. Model loading is slow. Backup windows are eating into production time. It all adds up – compute costs increase, time-to-market gets delayed, infrastructure efficiency suffers. As teams scale their Kubernetes storage for AI workloads, these performance gaps become critical infrastructure bottlenecks.
The solution is actually two technologies working together. Changed Block Tracking (CBT) reduces backup overhead by 80-95%. NVMe storage delivers 10-20x lower latency than traditional SSD. This guide gives you the complete implementation with YAML examples you can use today.
By the end of this guide, you’ll have CBT running on your Kubernetes 1.34+ cluster, NVMe volumes provisioned for AI workloads, an incremental backup strategy configured, performance validated with benchmarks, and troubleshooting procedures ready to go.
How Do I Enable Changed Block Tracking in Kubernetes?
Changed Block Tracking (CBT) is an alpha feature in Kubernetes 1.34 that lets storage systems identify and track modifications at the block level between snapshots. Instead of backing up entire volumes every time, CBT focuses only on changed data blocks. For AI workloads with massive model checkpoints, this makes a real difference.
Before you start, check these prerequisites: Kubernetes version 1.34 or later, a CSI driver that implements the SnapshotMetadata service, kubectl access, and cluster-admin permissions for feature gate changes.
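A quick way to confirm those prerequisites from the command line (a sketch; your CSI driver name will differ per environment):

kubectl version                                # server version must report 1.34 or later
kubectl get csidrivers                         # your CSI driver should be registered here
kubectl auth can-i '*' '*' --all-namespaces    # rough check for cluster-admin rights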
Step 1: Enable the Feature Gate
Add CSIVolumeSnapshotMetadata=true to the kube-apiserver feature gates. For managed Kubernetes services the mechanism varies: on AKS you modify the cluster configuration, on GKE you update cluster features. Check your provider’s documentation for the exact steps.
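On a self-managed control plane, a minimal sketch looks like this (assuming a kubeadm-style static pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml):

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --feature-gates=CSIVolumeSnapshotMetadata=true   # alpha gate for CBT snapshot metadata
    # ...existing flags unchanged...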
Verify it worked:
kubectl get --raw /metrics | grep feature_enabled
Step 2: Install the CRD
Apply the SnapshotMetadataService custom resource definition:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: snapshotmetadataservices.storage.k8s.io
spec:
  group: storage.k8s.io
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              driver:
                type: string
              endpoint:
                type: string
  scope: Cluster
  names:
    plural: snapshotmetadataservices
    singular: snapshotmetadataservice
    kind: SnapshotMetadataService
Step 3: Deploy External Snapshot Metadata Sidecar
Your CSI driver needs a sidecar that implements the SnapshotMetadata service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-snapshot-metadata
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snapshot-metadata-sidecar
  template:
    metadata:
      labels:
        app: snapshot-metadata-sidecar
    spec:
      serviceAccountName: snapshot-metadata-sa
      containers:
      - name: snapshot-metadata-sidecar
        image: k8s.gcr.io/sig-storage/snapshot-metadata-sidecar:v1.0.0
        args:
        - "--csi-address=/csi/csi.sock"
        - "--v=5"
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      volumes:
      - name: socket-dir
        hostPath:
          path: /var/lib/kubelet/plugins/your-csi-driver
          type: DirectoryOrCreate
Step 4: Create SnapshotMetadataService CR
Tell Kubernetes where to find the service:
apiVersion: storage.k8s.io/v1alpha1
kind: SnapshotMetadataService
metadata:
  name: cbt-service
spec:
  driver: localdisk.csi.acstor.io
  endpoint: unix:///csi/csi.sock
Verification
Check the service is ready:
kubectl get snapshotmetadataservices
You should see cbt-service with a STATUS of “Ready”. Create a test snapshot and verify GetMetadataAllocated RPC is accessible.
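A throwaway snapshot for that check might look like this (a sketch; cbt-test-pvc is a placeholder PVC on the CBT-capable driver, and the VolumeSnapshotClass is the one defined in the backup section later in this guide):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: cbt-verification-snapshot
spec:
  volumeSnapshotClassName: csi-snapshot-class-cbt   # defined later in this guide
  source:
    persistentVolumeClaimName: cbt-test-pvc          # any small test PVC on the CBT-capable driver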
Common errors you’ll see: “feature gate not recognised” means your Kubernetes version is below 1.34 – upgrade your cluster. “CRD conflicts” typically means you have existing snapshot CRDs with version incompatibilities.
Which Azure VM Sizes Provide the Best NVMe Performance for AI Workloads?
Choosing the right VM size comes down to three questions. What are your IOPS requirements? What’s your cost per IOPS-hour tolerance? Do you need GPU integration for training workloads?
| VM Size | vCPU | NVMe Config | Max IOPS | Throughput | Use Case |
|---------|------|-------------|----------|------------|----------|
| Standard_L8s_v3 | 8 | 1x 1.92TB | 400,000 | 2,000 MB/s | Medium model training |
| Standard_L16s_v3 | 16 | 2x 1.92TB | 800,000 | 4,000 MB/s | Balanced workloads |
| Standard_L80s_v3 | 80 | 10x 1.92TB | 3,800,000 | 20,000 MB/s | Dataset preprocessing |
| Standard_NC48ads_A100_v4 | 48 | 2x 894GB | 720,000 | 2,880 MB/s | GPU training |
| Standard_ND96isr_H100_v5 | 96 | 8x 3.5TB | 2,000,000 | 16,000 MB/s | Large model training |
The Lsv3 series scales from a single 1.92TB NVMe drive on Standard_L8s_v3 delivering around 400,000 IOPS and 2,000 MB/s, up to 10 NVMe drives on Standard_L80s_v3 delivering about 3.8 million IOPS and 20,000 MB/s.
For AI Training (Large Models), use Standard_NC48ads_A100_v4 or ND96isr_H100_v5. GPU-NVMe locality reduces checkpoint latency. These workloads have frequent large writes, typically 10-100GB checkpoints every 15-30 minutes.
For AI Training (Medium Models), Standard_L16s_v3 gives you balanced IOPS and cost without GPU overhead. These patterns involve moderate writes in the 1-10GB checkpoint range.
For Dataset Preprocessing, Standard_L80s_v3 delivers maximum aggregate IOPS for parallel processing with high read IOPS across many small files.
Capacity Planning: NVMe is ephemeral – data is lost when the VM is stopped or deallocated. Size for your working dataset plus checkpoints plus 20% overhead, and plan a backup strategy for persistence.
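As a worked example: a 1.5TB working dataset plus roughly 300GB of rolling checkpoints comes to 1.8TB; adding 20% overhead gives about 2.2TB, which no longer fits the single 1.92TB drive of a Standard_L8s_v3 but sits comfortably within the 3.84TB raw capacity of a Standard_L16s_v3.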
How Do I Provision NVMe Volumes with Azure Container Storage?
Azure Container Storage orchestrates local NVMe disks with automatic data striping across available NVMe devices. The CSI driver is localdisk.csi.acstor.io.
Step 1: Verify VM NVMe Availability
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT | grep nvme
You should see nvme0n1, nvme1n1, etc. depending on your VM size.
Step 2: Create StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: acstor-nvme-ai-workloads
provisioner: localdisk.csi.acstor.io
parameters:
  protocol: "nvme"
  headerDigest: "true"
  dataDigest: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
reclaimPolicy: Delete
The WaitForFirstConsumer setting enables topology-aware scheduling, so the volume is created on the node where the pod lands. Local NVMe volumes cannot be expanded after creation, hence allowVolumeExpansion: false.
Step 3: Create PersistentVolumeClaim
For ephemeral volumes in pods:
apiVersion: v1
kind: Pod
metadata:
  name: ai-training-job
spec:
  containers:
  - name: pytorch-training
    image: pytorch/pytorch:2.0.0-cuda11.8-cudnn8-runtime
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /checkpoints
  volumes:
  - name: checkpoint-storage
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: [ "ReadWriteOnce" ]
          storageClassName: acstor-nvme-ai-workloads
          resources:
            requests:
              storage: 500Gi
For StatefulSets:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: distributed-training
spec:
  serviceName: training
  replicas: 4
  selector:
    matchLabels:
      app: distributed-training
  template:
    metadata:
      labels:
        app: distributed-training
    spec:
      containers:
      - name: worker
        image: horovod/horovod:latest
        volumeMounts:
        - name: data
          mountPath: /datasets
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: acstor-nvme-ai-workloads
      resources:
        requests:
          storage: 800Gi
Step 4: Verify Provisioning
kubectl get pvc # STATUS should show "Bound"
kubectl get pv # Note the NODE affinity
kubectl exec -it ai-training-job -- df -h /checkpoints
Key Considerations: Because NVMe is ephemeral, data survives pod restarts, but stopping or deallocating the VM loses it. Take VolumeSnapshots to persistent storage before any planned VM operations.
How Do I Configure Incremental Backups Using CBT?
The backup architecture uses a base image approach. Your first snapshot is a full backup. Subsequent snapshots are incremental, using QCOW2 format. CBT tracks changed blocks between snapshots.
Step 1: Create VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapshot-class-cbt
driver: localdisk.csi.acstor.io
deletionPolicy: Retain
parameters:
  incrementalBackup: "true"
  snapshotFormat: "qcow2"
Step 2: Take Initial Full Snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ai-workload-base-snapshot
spec:
  volumeSnapshotClassName: csi-snapshot-class-cbt
  source:
    persistentVolumeClaimName: dataset-cache-pvc
Wait for completion:
kubectl wait --for=jsonpath='{.status.readyToUse}'=true \
volumesnapshot/ai-workload-base-snapshot --timeout=300s
Step 3: Take Incremental Snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ai-workload-increment-001
  annotations:
    snapshot.storage.kubernetes.io/base-snapshot: "ai-workload-base-snapshot"
spec:
  volumeSnapshotClassName: csi-snapshot-class-cbt
  source:
    persistentVolumeClaimName: dataset-cache-pvc
Step 4: Schedule Regular Incremental Backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: incremental-backup-job
spec:
  schedule: "0 */4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-service-account
          containers:
          - name: backup
            image: veeam/kasten-backup:latest
            env:
            - name: BASE_SNAPSHOT
              value: "ai-workload-base-snapshot"
            - name: PVC_NAME
              value: "dataset-cache-pvc"
            command:
            - /bin/bash
            - -c
            - |
              TIMESTAMP=$(date +%Y%m%d-%H%M%S)
              kubectl create -f - <<EOF
              apiVersion: snapshot.storage.k8s.io/v1
              kind: VolumeSnapshot
              metadata:
                name: ai-workload-increment-${TIMESTAMP}
                annotations:
                  snapshot.storage.kubernetes.io/base-snapshot: "${BASE_SNAPSHOT}"
              spec:
                volumeSnapshotClassName: csi-snapshot-class-cbt
                source:
                  persistentVolumeClaimName: ${PVC_NAME}
              EOF
          restartPolicy: OnFailure
Backup Efficiency: A full backup of a 1TB dataset takes 15-20 minutes and transfers 1TB of data. An incremental backup with 5% change takes 2-3 minutes and transfers 50GB. Storage savings run 80-95% for typical AI workload patterns.
How Do I Benchmark Storage to Validate AI Workload Performance?
Benchmarking establishes baselines for troubleshooting, validates NVMe provisioning succeeded, and confirms performance against requirements.
AI/ML workloads have three distinct patterns: large sequential writes (checkpoint saves), random small reads (dataset loading), and mixed read/write (active training). Understanding AI training and inference storage performance requirements helps you set appropriate baselines for your benchmarking.
Deploy fio Pod:
apiVersion: v1
kind: Pod
metadata:
  name: fio-benchmark
spec:
  containers:
  - name: fio
    image: ljishen/fio:latest
    command: [ "sleep", "infinity" ]   # keep the pod running so fio can be invoked via kubectl exec
    volumeMounts:
    - name: test-volume
      mountPath: /mnt/test
    resources:
      limits:
        memory: "4Gi"
        cpu: "4"
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: dataset-cache-pvc
Sequential Write Benchmark:
kubectl exec -it fio-benchmark -- fio \
--name=seq-write-checkpoint \
--directory=/mnt/test \
--size=10G \
--bs=128k \
--rw=write \
--ioengine=libaio \
--iodepth=32 \
--direct=1 \
--numjobs=4 \
--group_reporting \
--runtime=60 \
--time_based
Good results show more than 2000 MB/s throughput for L16s_v3. Acceptable is 1000-2000 MB/s. Investigate anything below 1000 MB/s.
Random Read Benchmark:
kubectl exec -it fio-benchmark -- fio \
--name=rand-read-dataset \
--directory=/mnt/test \
--size=50G \
--bs=4k \
--rw=randread \
--ioengine=libaio \
--iodepth=128 \
--direct=1 \
--numjobs=16 \
--group_reporting \
--runtime=60 \
--time_based
Good results show more than 350K IOPS approaching the VM limit. Acceptable is 200-350K IOPS.
Performance Requirements by Workload:
| Workload Type | Min IOPS | Min Throughput | Max Latency |
|---------------|----------|----------------|-------------|
| Large model training | 200K | 2000 MB/s | 10ms |
| Medium model training | 100K | 1000 MB/s | 20ms |
| Dataset preprocessing | 300K | 1500 MB/s | 5ms |
| Inference serving | 150K | 800 MB/s | 15ms |
Troubleshooting: If you’re seeing low IOPS, check CPU throttling with kubectl top nodes, competing pods, or incorrect StorageClass parameters. For high latency, check data striping configuration, node network congestion, or VM size. For inconsistent performance, check Azure throttling, background processes, or snapshot operations.
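A few commands that help narrow those causes down (a sketch; substitute your own node name):

kubectl top nodes                                                         # CPU/memory pressure on the node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>    # competing pods on the same node
kubectl get storageclass acstor-nvme-ai-workloads -o yaml                 # confirm StorageClass parameters
kubectl get volumesnapshot -A                                             # snapshot operations running in the background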
Should I Use NVMe/TCP, RoCE, or iSCSI for Kubernetes Storage?
The decision comes down to latency requirements.
If you need less than 20μs, you need RoCE/RDMA. If 100-200μs is acceptable, NVMe/TCP is optimal. If more than 500μs is acceptable, iSCSI is sufficient.
| Protocol | Latency | Throughput | Complexity | Cost | Best For |
|----------|---------|------------|------------|------|----------|
| RoCE (RDMA) | 10-20μs | Very High | High | High | Ultra-low latency AI |
| FC-NVMe | 50-100μs | High | Medium | High | SAN modernisation |
| NVMe/TCP | 100-200μs | High | Low | Low | Cloud AI/ML workloads |
| iSCSI | 100-500μs | Medium | Low | Low | Legacy compatibility |
For guidance on selecting protocols across different cloud providers, see our comparison of cloud provider Kubernetes storage solutions.
NVMe/TCP (Recommended for Most Cloud Workloads)
This works on commodity Ethernet. The latency of 100-200μs is suitable for most AI training. No specialised hardware is required. Cloud providers support it – Azure, GCP, AWS all work.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-tcp-storage
provisioner: localdisk.csi.acstor.io
parameters:
  protocol: "nvme"
  transport: "tcp"
  headerDigest: "true"
  dataDigest: "true"
RoCE/RDMA offers extreme low latency at 10-20μs but requires specialised NICs, lossless Ethernet with PFC and ECN configuration, and complex troubleshooting. Use for high-frequency trading ML models or real-time inference with less than 10ms SLA. When evaluating vendors for these advanced configurations, refer to our enterprise Kubernetes storage vendor ecosystem evaluation framework.
iSCSI has a mature ecosystem and works everywhere, but higher latency than NVMe protocols. Use when you have existing iSCSI SANs or need legacy application compatibility.
How Do I Troubleshoot Pods Stuck in ContainerCreating State?
When your pod remains in ContainerCreating state for more than 2 minutes while using NVMe PVCs:
Step 1: Check PVC Status
kubectl get pvc
If the PVC is pending, volume provisioning failed. Check your StorageClass and CSI driver logs.
Step 2: Check PVC Events
kubectl describe pvc <pvc-name>
Common errors:
| Error | Root Cause | Solution |
|-------|------------|----------|
| “no nodes available” | WaitForFirstConsumer but no node selected | Check pod node selector/affinity |
| “capacity exceeded” | Insufficient NVMe capacity | Use VM with larger/more NVMe disks |
| “CSI driver not found” | CSI driver not installed on node | Install CSI driver daemonset |
Step 3: Check Pod Events
kubectl describe pod <pod-name>
If you see “iscsiadm not found”, install iSCSI tools on the node:
# Ubuntu/Debian
sudo apt-get install -y open-iscsi
sudo systemctl start iscsid
sudo systemctl enable iscsid
If you see “nvme-tcp module not loaded”:
sudo modprobe nvme-tcp
echo nvme-tcp | sudo tee -a /etc/modules-load.d/nvme.conf
Prevention Checklist: Ensure CSI driver is installed on all nodes, verify required kernel modules are loaded (nvme-tcp), check sufficient NVMe capacity exists on nodes, confirm StorageClass has volumeBindingMode: WaitForFirstConsumer.
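A quick per-node sanity check covering that list (a sketch; assumes the ACStor CSI driver and StorageClass names used throughout this guide):

kubectl get csidriver localdisk.csi.acstor.io                # CSI driver registered?
kubectl get daemonset -n kube-system | grep -i csi           # node plugin pods present?
lsmod | grep nvme_tcp || echo "nvme-tcp module not loaded"   # run on the node itself
kubectl get storageclass acstor-nvme-ai-workloads \
  -o jsonpath='{.volumeBindingMode}'                         # should print WaitForFirstConsumer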
How Much Does NVMe Storage Cost Compared to Standard SSDs?
Azure Container Storage v2.0.0 is completely free to use for all storage pool sizes. NVMe storage is included in VM pricing with no separate storage cost for local NVMe disks.
For a 1-month training job with a 2TB working dataset and 400K IOPS requirement:
Option A: NVMe (L8s_v3)
- VM cost: $561/month
- Backup to Azure Blob (2TB): $40/month
- Total: $601/month
Option B: Premium SSD
- VM cost: $500/month (smaller VM, D8s_v5)
- Premium SSD P60 (8TB, 400K IOPS): $1,200/month
- Total: $1,700/month
Cost savings: 65% with NVMe.
Cost Optimisation Strategies:
Store immutable datasets on cheaper Azure Blob at $18/TB/month, copy to NVMe at training start. Right-size VM for actual IOPS need – don’t use L80s_v3 with 3.8M IOPS if your workload only needs 400K. Scale down to a smaller VM during off-hours for 66% cost reduction on 8 hours/day training.
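A minimal sketch of the Blob-to-NVMe copy at training start using azcopy (the account, container, and SAS token are placeholders):

# Run in an init container or at the start of the training job
azcopy copy "https://<account>.blob.core.windows.net/datasets/<container>?<sas-token>" \
  /datasets --recursive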
Hidden Costs: Data egress for copying large datasets from Azure Blob to NVMe costs around $0.09-$0.12 per GB. Snapshot storage accumulates over time as incremental backups pile up. Dev/test environments shouldn’t use expensive GPU plus NVMe VMs.
ROI Example: Traditional storage with Premium SSD costs $1,700/month times 3 months equals $5,100. NVMe storage with L8s_v3 costs $601/month times 3 months equals $1,803. Savings: $3,297 over a 3-month project.
Frequently Asked Questions
Does Changed Block Tracking work with all CSI drivers?
No, CBT requires specific CSI driver support implementing the SnapshotMetadata service. Supported drivers as of January 2025 include Azure Disk CSI Driver (v1.30+), Google Persistent Disk CSI Driver (v1.13+), and Blockbridge CSI Driver.
Check compatibility:
kubectl get csidrivers -o custom-columns=NAME:.metadata.name,SNAPSHOT-METADATA:.spec.snapshotMetadataSupported
Can I use NVMe storage for persistent data that survives VM restarts?
Yes, but with caveats. Azure Container Storage with ephemeral NVMe disks persists data through pod restarts and node reboots, but data is lost when the VM is stopped or deallocated. Use NVMe for hot data like active training checkpoints, back up to persistent storage with regular VolumeSnapshots to Azure Disk or Blob.
How do I migrate existing PVCs from Premium SSD to NVMe storage?
Create a snapshot of your existing PVC first. A minimal sketch, assuming the source Premium SSD PVC is named premium-ssd-pvc and you have a VolumeSnapshotClass for its CSI driver:
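apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: migration-snapshot
spec:
  volumeSnapshotClassName: azuredisk-snapshot-class   # placeholder; must match the source PVC's CSI driver
  source:
    persistentVolumeClaimName: premium-ssd-pvc        # hypothetical name of the existing Premium SSD PVC
Then create the new NVMe PVC from that snapshot: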
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: new-nvme-pvc
spec:
  dataSource:
    name: migration-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  storageClassName: acstor-nvme-ai-workloads
  resources:
    requests:
      storage: 1Ti
Update your application to reference the new PVC. Downtime is typically 5-15 minutes for TB-scale volumes.
How do I monitor NVMe performance in production?
Monitor storage latency (p99):
histogram_quantile(0.99,
rate(csi_sidecar_operation_duration_seconds_bucket[5m]))
Monitor IOPS utilisation:
rate(node_disk_io_time_seconds_total{device=~"nvme.*"}[5m])
Alert when P99 latency exceeds 50ms for 5 minutes, volume capacity exceeds 90%, or IOPS utilisation exceeds 80% of VM limit. Import Kubernetes Storage Dashboard (ID: 11454) into Grafana and add NVMe-specific panels.
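A minimal sketch of the latency alert as a Prometheus rule file, reusing the metric from the query above (the group name and severity label are arbitrary; tune thresholds to your VM limits):

groups:
- name: nvme-storage-alerts
  rules:
  - alert: NVMeStorageP99LatencyHigh
    expr: |
      histogram_quantile(0.99,
        sum(rate(csi_sidecar_operation_duration_seconds_bucket[5m])) by (le)) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "P99 CSI storage operation latency above 50ms for 5 minutes"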
Is NVMe storage suitable for database workloads like PostgreSQL or MySQL?
Yes, with considerations. The advantages are ultra-low latency improving transaction throughput, high IOPS supporting concurrent queries, and faster checkpoint writes reducing I/O stalls.
The risks are data durability where ephemeral NVMe loses data on VM stop. You need replication with standby replicas on persistent storage and frequent snapshots to persistent storage.
The recommended architecture has a primary database using NVMe PVC for WAL (write-ahead log) and data files with continuous WAL archiving to Azure Blob. A standby replica uses Azure Premium SSD for persistence and receives streaming replication from the primary. Backups run VolumeSnapshot every 6 hours to Azure Disk.
Example:
volumeClaimTemplates:
- metadata:
    name: pgdata
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: acstor-nvme-ai-workloads
    resources:
      requests:
        storage: 500Gi
Performance gain: 2-4x transaction throughput versus Premium SSD for write-heavy workloads. Always maintain replicas on persistent storage to mitigate risk.
Can I use CBT with cross-region disaster recovery?
Yes, CBT reduces disaster recovery replication bandwidth significantly. The architecture uses local incremental snapshots with CBT for hourly snapshots in the primary region with minimal overhead. Cross-region replication has the backup tool query GetMetadataDelta, transfer only changed blocks to the secondary region, and reconstruct the full volume in the DR region. For comprehensive DR planning, review our guide on business continuity and disaster recovery strategies for Kubernetes storage.
Bandwidth savings example: a full volume of 1TB with 5% daily change rate traditionally requires 1TB initial plus 50GB per day. With CBT DR, you need 1TB initial plus 2.5GB per day (only changed blocks compressed).
Expect an RPO (Recovery Point Objective) of about one hour with hourly CBT snapshots, and an RTO (Recovery Time Objective) of 10-30 minutes to restore from the incremental chain. The limitation is that both regions must support CBT-enabled CSI drivers.
What are the security implications of Changed Block Tracking?
CBT exposes block allocation maps, which can reveal data patterns – specifically, which blocks contain data. Mitigate this with RBAC controls on access to the SnapshotMetadata service. CBT still works with encrypted volumes, as changed block tracking happens below the encryption layer.
RBAC configuration:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: snapshot-metadata-reader
rules:
- apiGroups: ["storage.k8s.io"]           # group used by the SnapshotMetadataService CRD above (cluster-scoped, so a ClusterRole is required)
  resources: ["snapshotmetadataservices"]
  verbs: ["get", "list"]
- apiGroups: ["snapshot.storage.k8s.io"]  # VolumeSnapshots live in the snapshot API group
  resources: ["volumesnapshots"]
  verbs: ["get", "list", "create"]
Ensure backup applications use service accounts with minimal permissions. Best practices include enabling audit logging for snapshot operations, encrypting snapshots at rest in backup storage, using network policies to restrict CSI sidecar access, and regular security reviews of RBAC policies.
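A sketch of the network-policy idea, restricting ingress to the metadata sidecar deployed earlier to a labelled backup agent (the backup-agent label is an assumption for illustration):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-snapshot-metadata-access
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      app: snapshot-metadata-sidecar    # matches the sidecar Deployment above
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector: {}             # any namespace...
      podSelector:
        matchLabels:
          app: backup-agent             # ...but only pods carrying this (assumed) label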
Wrapping Up
You now have a high-performance storage and backup solution for your Kubernetes AI workloads.
You’ve enabled Changed Block Tracking for 80-95% backup efficiency improvements. You’ve provisioned NVMe volumes delivering 10-20x lower latency than traditional storage. You’ve configured an incremental backup strategy reducing backup windows from hours to minutes. You’ve established performance baselines with fio benchmarking. For broader context on addressing Kubernetes storage infrastructure limits for AI workloads, explore our complete series.
Next steps: monitor in production by setting up Prometheus alerts for storage performance. Optimise costs by reviewing VM sizing and implementing an ephemeral/persistent hybrid strategy. Test DR procedures by validating cross-region backup restore at least quarterly.
Storage performance is no longer your AI workload bottleneck.