How often do your AI training jobs run slower than they should? Meanwhile, the storage bills keep climbing. Traditional storage systems just weren’t designed for the parallel I/O patterns and massive checkpoint files your AI workloads throw at them.
You’re probably seeing GPU idle time during checkpoint saves. Model loading is slow. Backup windows are eating into production time. It all adds up – compute costs increase, time-to-market gets delayed, infrastructure efficiency suffers. As teams scale their Kubernetes storage for AI workloads, these performance gaps become critical infrastructure bottlenecks.
The solution is actually two technologies working together. Changed Block Tracking (CBT) reduces backup overhead by 80-95%. NVMe storage delivers 10-20x lower latency than traditional SSD. This guide gives you the complete implementation with YAML examples you can use today.
By the end of this guide, you’ll have CBT running on your Kubernetes 1.34+ cluster, NVMe volumes provisioned for AI workloads, an incremental backup strategy configured, performance validated with benchmarks, and troubleshooting procedures ready to go.
How Do I Enable Changed Block Tracking in Kubernetes?
Changed Block Tracking (CBT) is an alpha feature in Kubernetes 1.34 that lets storage systems identify and track modifications at the block level between snapshots. Instead of backing up entire volumes every time, CBT focuses only on changed data blocks. For AI workloads with massive model checkpoints, this makes a real difference.
Before you start, check these prerequisites: Kubernetes version 1.34 or later, a CSI driver that implements the SnapshotMetadata service, kubectl access, and cluster-admin permissions for feature gate changes.
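A quick way to confirm those prerequisites from the command line (a sketch; your CSI driver name will differ per environment):

kubectl version                                # server version must report 1.34 or later
kubectl get csidrivers                         # your CSI driver should be registered here
kubectl auth can-i '*' '*' --all-namespaces    # rough check for cluster-admin rights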
Step 1: Enable the Feature Gate
Add CSIVolumeSnapshotMetadata=true to the kube-apiserver feature gates. For managed Kubernetes services the mechanism varies: on AKS you modify the cluster configuration, on GKE you update cluster features. Check your provider’s documentation for the exact steps.
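On a self-managed control plane, a minimal sketch looks like this (assuming a kubeadm-style static pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml):

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --feature-gates=CSIVolumeSnapshotMetadata=true   # alpha gate for CBT snapshot metadata
    # ...existing flags unchanged...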
Verify it worked:
kubectl get --raw /metrics | grep feature_enabled
Step 2: Install the CRD
Apply the SnapshotMetadataService custom resource definition:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: snapshotmetadataservices.storage.k8s.io
spec:
  group: storage.k8s.io
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              driver:
                type: string
              endpoint:
                type: string
  scope: Cluster
  names:
    plural: snapshotmetadataservices
    singular: snapshotmetadataservice
    kind: SnapshotMetadataService
Step 3: Deploy External Snapshot Metadata Sidecar
Your CSI driver needs a sidecar that implements the SnapshotMetadata service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-snapshot-metadata
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snapshot-metadata-sidecar
  template:
    metadata:
      labels:
        app: snapshot-metadata-sidecar
    spec:
      serviceAccountName: snapshot-metadata-sa
      containers:
      - name: snapshot-metadata-sidecar
        image: k8s.gcr.io/sig-storage/snapshot-metadata-sidecar:v1.0.0
        args:
        - "--csi-address=/csi/csi.sock"
        - "--v=5"
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      volumes:
      - name: socket-dir
        hostPath:
          path: /var/lib/kubelet/plugins/your-csi-driver
          type: DirectoryOrCreate
Step 4: Create SnapshotMetadataService CR
Tell Kubernetes where to find the service:
apiVersion: storage.k8s.io/v1alpha1
kind: SnapshotMetadataService
metadata:
  name: cbt-service
spec:
  driver: localdisk.csi.acstor.io
  endpoint: unix:///csi/csi.sock
Verification
Check the service is ready:
kubectl get snapshotmetadataservices
You should see cbt-service with a STATUS of “Ready”. Create a test snapshot and verify GetMetadataAllocated RPC is accessible.
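A throwaway snapshot for that check might look like this (a sketch; cbt-test-pvc is a placeholder PVC on the CBT-capable driver, and the VolumeSnapshotClass is the one defined in the backup section later in this guide):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: cbt-verification-snapshot
spec:
  volumeSnapshotClassName: csi-snapshot-class-cbt   # defined later in this guide
  source:
    persistentVolumeClaimName: cbt-test-pvc          # any small test PVC on the CBT-capable driver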
Common errors you’ll see: “feature gate not recognised” means your Kubernetes version is below 1.34 – upgrade your cluster. “CRD conflicts” typically means you have existing snapshot CRDs with version incompatibilities.
Which Azure VM Sizes Provide the Best NVMe Performance for AI Workloads?
Choosing the right VM size comes down to three questions. What are your IOPS requirements? What’s your cost per IOPS-hour tolerance? Do you need GPU integration for training workloads?
| VM Size | vCPU | NVMe Config | Max IOPS | Throughput | Use Case |
|---------|------|-------------|----------|------------|----------|
| Standard_L8s_v3 | 8 | 1x 1.92TB | 400,000 | 2,000 MB/s | Medium model training |
| Standard_L16s_v3 | 16 | 2x 1.92TB | 800,000 | 4,000 MB/s | Balanced workloads |
| Standard_L80s_v3 | 80 | 10x 1.92TB | 3,800,000 | 20,000 MB/s | Dataset preprocessing |
| Standard_NC48ads_A100_v4 | 48 | 2x 894GB | 720,000 | 2,880 MB/s | GPU training |
| Standard_ND96isr_H100_v5 | 96 | 8x 3.5TB | 2,000,000 | 16,000 MB/s | Large model training |
The Lsv3 series scales from a single 1.92TB NVMe drive on Standard_L8s_v3 delivering around 400,000 IOPS and 2,000 MB/s, up to 10 NVMe drives on Standard_L80s_v3 delivering about 3.8 million IOPS and 20,000 MB/s.
For AI Training (Large Models), use Standard_NC48ads_A100_v4 or ND96isr_H100_v5. GPU-NVMe locality reduces checkpoint latency. These workloads have frequent large writes, typically 10-100GB checkpoints every 15-30 minutes.
For AI Training (Medium Models), Standard_L16s_v3 gives you balanced IOPS and cost without GPU overhead. These patterns involve moderate writes in the 1-10GB checkpoint range.
For Dataset Preprocessing, Standard_L80s_v3 delivers maximum aggregate IOPS for parallel processing with high read IOPS across many small files.
Capacity Planning: NVMe is ephemeral – data is lost when the VM is stopped or deallocated. Size for your working dataset plus checkpoints plus 20% overhead, and plan a backup strategy for persistence.
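As a worked example: a 1.5TB working dataset plus roughly 300GB of rolling checkpoints comes to 1.8TB; adding 20% overhead gives about 2.2TB, which no longer fits the single 1.92TB drive of a Standard_L8s_v3 but sits comfortably within the 3.84TB raw capacity of a Standard_L16s_v3.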
How Do I Provision NVMe Volumes with Azure Container Storage?
Azure Container Storage orchestrates local NVMe disks with automatic data striping across available NVMe devices. The CSI driver is localdisk.csi.acstor.io.
Step 1: Verify VM NVMe Availability
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT | grep nvme
You should see nvme0n1, nvme1n1, etc. depending on your VM size.
Step 2: Create StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: acstor-nvme-ai-workloads
provisioner: localdisk.csi.acstor.io
parameters:
  protocol: "nvme"
  headerDigest: "true"
  dataDigest: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
reclaimPolicy: Delete
The WaitForFirstConsumer setting enables topology-aware scheduling, so the volume is created on the node where the pod lands. Local NVMe volumes cannot be expanded after creation, hence allowVolumeExpansion: false.
Step 3: Create PersistentVolumeClaim
For ephemeral volumes in pods:
apiVersion: v1
kind: Pod
metadata:
  name: ai-training-job
spec:
  containers:
  - name: pytorch-training
    image: pytorch/pytorch:2.0.0-cuda11.8-cudnn8-runtime
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /checkpoints
  volumes:
  - name: checkpoint-storage
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: [ "ReadWriteOnce" ]
          storageClassName: acstor-nvme-ai-workloads
          resources:
            requests:
              storage: 500Gi
For StatefulSets:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: distributed-training
spec:
  serviceName: training
  replicas: 4
  selector:
    matchLabels:
      app: distributed-training
  template:
    metadata:
      labels:
        app: distributed-training
    spec:
      containers:
      - name: worker
        image: horovod/horovod:latest
        volumeMounts:
        - name: data
          mountPath: /datasets
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: acstor-nvme-ai-workloads
      resources:
        requests:
          storage: 800Gi
Step 4: Verify Provisioning
kubectl get pvc # STATUS should show "Bound"
kubectl get pv # Note the NODE affinity
kubectl exec -it ai-training-job -- df -h /checkpoints
Key Considerations: Because NVMe is ephemeral, data survives pod restarts, but stopping or deallocating the VM loses it. Take VolumeSnapshots to persistent storage before any planned VM operations.
How Do I Configure Incremental Backups Using CBT?
The backup architecture uses a base image approach. Your first snapshot is a full backup. Subsequent snapshots are incremental, using QCOW2 format. CBT tracks changed blocks between snapshots.
Step 1: Create VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapshot-class-cbt
driver: localdisk.csi.acstor.io
deletionPolicy: Retain
parameters:
  incrementalBackup: "true"
  snapshotFormat: "qcow2"
Step 2: Take Initial Full Snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ai-workload-base-snapshot
spec:
  volumeSnapshotClassName: csi-snapshot-class-cbt
  source:
    persistentVolumeClaimName: dataset-cache-pvc
Wait for completion:
kubectl wait --for=jsonpath='{.status.readyToUse}'=true \
volumesnapshot/ai-workload-base-snapshot --timeout=300s
Step 3: Take Incremental Snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ai-workload-increment-001
  annotations:
    snapshot.storage.kubernetes.io/base-snapshot: "ai-workload-base-snapshot"
spec:
  volumeSnapshotClassName: csi-snapshot-class-cbt
  source:
    persistentVolumeClaimName: dataset-cache-pvc
Step 4: Schedule Regular Incremental Backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: incremental-backup-job
spec:
  schedule: "0 */4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-service-account
          containers:
          - name: backup
            image: veeam/kasten-backup:latest
            env:
            - name: BASE_SNAPSHOT
              value: "ai-workload-base-snapshot"
            - name: PVC_NAME
              value: "dataset-cache-pvc"
            command:
            - /bin/bash
            - -c
            - |
              TIMESTAMP=$(date +%Y%m%d-%H%M%S)
              kubectl create -f - <<EOF
              apiVersion: snapshot.storage.k8s.io/v1
              kind: VolumeSnapshot
              metadata:
                name: ai-workload-increment-${TIMESTAMP}
                annotations:
                  snapshot.storage.kubernetes.io/base-snapshot: "${BASE_SNAPSHOT}"
              spec:
                volumeSnapshotClassName: csi-snapshot-class-cbt
                source:
                  persistentVolumeClaimName: ${PVC_NAME}
              EOF
          restartPolicy: OnFailure
Backup Efficiency: A full backup of a 1TB dataset takes 15-20 minutes and transfers 1TB of data. An incremental backup with 5% change takes 2-3 minutes and transfers 50GB. Storage savings run 80-95% for typical AI workload patterns.
How Do I Benchmark Storage to Validate AI Workload Performance?
Benchmarking establishes baselines for troubleshooting, validates NVMe provisioning succeeded, and confirms performance against requirements.
AI/ML workloads have three distinct patterns: large sequential writes (checkpoint saves), random small reads (dataset loading), and mixed read/write (active training). Understanding AI training and inference storage performance requirements helps you set appropriate baselines for your benchmarking.
Deploy fio Pod:
apiVersion: v1
kind: Pod
metadata:
  name: fio-benchmark
spec:
  containers:
  - name: fio
    image: ljishen/fio:latest
    command: [ "sleep", "infinity" ]   # keep the pod running so fio can be invoked via kubectl exec
    volumeMounts:
    - name: test-volume
      mountPath: /mnt/test
    resources:
      limits:
        memory: "4Gi"
        cpu: "4"
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: dataset-cache-pvc
Sequential Write Benchmark:
kubectl exec -it fio-benchmark -- fio \
--name=seq-write-checkpoint \
--directory=/mnt/test \
--size=10G \
--bs=128k \
--rw=write \
--ioengine=libaio \
--iodepth=32 \
--direct=1 \
--numjobs=4 \
--group_reporting \
--runtime=60 \
--time_based
Good results show more than 2000 MB/s throughput for L16s_v3. Acceptable is 1000-2000 MB/s. Investigate anything below 1000 MB/s.
Random Read Benchmark:
kubectl exec -it fio-benchmark -- fio \
--name=rand-read-dataset \
--directory=/mnt/test \
--size=50G \
--bs=4k \
--rw=randread \
--ioengine=libaio \
--iodepth=128 \
--direct=1 \
--numjobs=16 \
--group_reporting \
--runtime=60 \
--time_based
Good results show more than 350K IOPS approaching the VM limit. Acceptable is 200-350K IOPS.
Performance Requirements by Workload:
| Workload Type | Min IOPS | Min Throughput | Max Latency |
|---------------|----------|----------------|-------------|
| Large model training | 200K | 2000 MB/s | 10ms |
| Medium model training | 100K | 1000 MB/s | 20ms |
| Dataset preprocessing | 300K | 1500 MB/s | 5ms |
| Inference serving | 150K | 800 MB/s | 15ms |
Troubleshooting: If you’re seeing low IOPS, check CPU throttling with kubectl top nodes, competing pods, or incorrect StorageClass parameters. For high latency, check data striping configuration, node network congestion, or VM size. For inconsistent performance, check Azure throttling, background processes, or snapshot operations.
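A few commands that help narrow those causes down (a sketch; substitute your own node name):

kubectl top nodes                                                         # CPU/memory pressure on the node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>    # competing pods on the same node
kubectl get storageclass acstor-nvme-ai-workloads -o yaml                 # confirm StorageClass parameters
kubectl get volumesnapshot -A                                             # snapshot operations running in the background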
Should I Use NVMe/TCP, RoCE, or iSCSI for Kubernetes Storage?
The decision comes down to latency requirements.
If you need less than 20μs, you need RoCE/RDMA. If 100-200μs is acceptable, NVMe/TCP is optimal. If more than 500μs is acceptable, iSCSI is sufficient.
| Protocol | Latency | Throughput | Complexity | Cost | Best For |
|----------|---------|------------|------------|------|----------|
| RoCE (RDMA) | 10-20μs | Very High | High | High | Ultra-low latency AI |
| FC-NVMe | 50-100μs | High | Medium | High | SAN modernisation |
| NVMe/TCP | 100-200μs | High | Low | Low | Cloud AI/ML workloads |
| iSCSI | 100-500μs | Medium | Low | Low | Legacy compatibility |
For guidance on selecting protocols across different cloud providers, see our comparison of cloud provider Kubernetes storage solutions.
NVMe/TCP (Recommended for Most Cloud Workloads)
This works on commodity Ethernet. The latency of 100-200μs is suitable for most AI training. No specialised hardware is required. Cloud providers support it – Azure, GCP, AWS all work.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-tcp-storage
provisioner: localdisk.csi.acstor.io
parameters:
  protocol: "nvme"
  transport: "tcp"
  headerDigest: "true"
  dataDigest: "true"
RoCE/RDMA offers extreme low latency at 10-20μs but requires specialised NICs, lossless Ethernet with PFC and ECN configuration, and complex troubleshooting. Use for high-frequency trading ML models or real-time inference with less than 10ms SLA. When evaluating vendors for these advanced configurations, refer to our enterprise Kubernetes storage vendor ecosystem evaluation framework.
iSCSI has a mature ecosystem and works everywhere, but higher latency than NVMe protocols. Use when you have existing iSCSI SANs or need legacy application compatibility.
How Do I Troubleshoot Pods Stuck in ContainerCreating State?
When your pod remains in ContainerCreating state for more than 2 minutes while using NVMe PVCs:
Step 1: Check PVC Status
kubectl get pvc
If the PVC is pending, volume provisioning failed. Check your StorageClass and CSI driver logs.
Step 2: Check PVC Events
kubectl describe pvc <pvc-name>
Common errors:
| Error | Root Cause | Solution |
|-------|------------|----------|
| “no nodes available” | WaitForFirstConsumer but no node selected | Check pod node selector/affinity |
| “capacity exceeded” | Insufficient NVMe capacity | Use VM with larger/more NVMe disks |
| “CSI driver not found” | CSI driver not installed on node | Install CSI driver daemonset |
Step 3: Check Pod Events
kubectl describe pod <pod-name>
If you see “iscsiadm not found”, install iSCSI tools on the node:
# Ubuntu/Debian
sudo apt-get install -y open-iscsi
sudo systemctl start iscsid
sudo systemctl enable iscsid
If you see “nvme-tcp module not loaded”:
sudo modprobe nvme-tcp
echo nvme-tcp | sudo tee -a /etc/modules-load.d/nvme.conf
Prevention Checklist: Ensure CSI driver is installed on all nodes, verify required kernel modules are loaded (nvme-tcp), check sufficient NVMe capacity exists on nodes, confirm StorageClass has volumeBindingMode: WaitForFirstConsumer.
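A quick per-node sanity check covering that list (a sketch; assumes the ACStor CSI driver and StorageClass names used throughout this guide):

kubectl get csidriver localdisk.csi.acstor.io                # CSI driver registered?
kubectl get daemonset -n kube-system | grep -i csi           # node plugin pods present?
lsmod | grep nvme_tcp || echo "nvme-tcp module not loaded"   # run on the node itself
kubectl get storageclass acstor-nvme-ai-workloads \
  -o jsonpath='{.volumeBindingMode}'                         # should print WaitForFirstConsumer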
How Much Does NVMe Storage Cost Compared to Standard SSDs?
Azure Container Storage v2.0.0 is completely free to use for all storage pool sizes. NVMe storage is included in VM pricing with no separate storage cost for local NVMe disks.
For a 1-month training job with a 2TB working dataset and 400K IOPS requirement:
Option A: NVMe (L8s_v3)
- VM cost: $561/month
- Backup to Azure Blob (2TB): $40/month
- Total: $601/month
Option B: Premium SSD
- VM cost: $500/month (smaller VM, D8s_v5)
- Premium SSD P60 (8TB, 400K IOPS): $1,200/month
- Total: $1,700/month
Cost savings: 65% with NVMe.
Cost Optimisation Strategies:
Store immutable datasets on cheaper Azure Blob at $18/TB/month, copy to NVMe at training start. Right-size VM for actual IOPS need – don’t use L80s_v3 with 3.8M IOPS if your workload only needs 400K. Scale down to a smaller VM during off-hours for 66% cost reduction on 8 hours/day training.
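A minimal sketch of the Blob-to-NVMe copy at training start using azcopy (the account, container, and SAS token are placeholders):

# Run in an init container or at the start of the training job
azcopy copy "https://<account>.blob.core.windows.net/datasets/<container>?<sas-token>" \
  /datasets --recursive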
Hidden Costs: Data egress for copying large datasets from Azure Blob to NVMe costs around $0.09-$0.12 per GB. Snapshot storage accumulates over time as incremental backups pile up. Dev/test environments shouldn’t use expensive GPU plus NVMe VMs.
ROI Example: Traditional storage with Premium SSD costs $1,700/month times 3 months equals $5,100. NVMe storage with L8s_v3 costs $601/month times 3 months equals $1,803. Savings: $3,297 over a 3-month project.
Frequently Asked Questions
Does Changed Block Tracking work with all CSI drivers?
No, CBT requires specific CSI driver support implementing the SnapshotMetadata service. Supported drivers as of January 2025 include Azure Disk CSI Driver (v1.30+), Google Persistent Disk CSI Driver (v1.13+), and Blockbridge CSI Driver.
Check compatibility:
kubectl get csidrivers -o custom-columns=NAME:.metadata.name,SNAPSHOT-METADATA:.spec.snapshotMetadataSupported
Can I use NVMe storage for persistent data that survives VM restarts?
Yes, but with caveats. Azure Container Storage with ephemeral NVMe disks persists data through pod restarts and node reboots, but data is lost when the VM is stopped or deallocated. Use NVMe for hot data like active training checkpoints, back up to persistent storage with regular VolumeSnapshots to Azure Disk or Blob.
How do I migrate existing PVCs from Premium SSD to NVMe storage?
Create a snapshot of your existing PVC first. A minimal sketch, assuming the source Premium SSD PVC is named premium-ssd-pvc and you have a VolumeSnapshotClass for its CSI driver:
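apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: migration-snapshot
spec:
  volumeSnapshotClassName: azuredisk-snapshot-class   # placeholder; must match the source PVC's CSI driver
  source:
    persistentVolumeClaimName: premium-ssd-pvc        # hypothetical name of the existing Premium SSD PVC
Then create the new NVMe PVC from that snapshot: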
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: new-nvme-pvc
spec:
  dataSource:
    name: migration-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  storageClassName: acstor-nvme-ai-workloads
  resources:
    requests:
      storage: 1Ti
Update your application to reference the new PVC. Downtime is typically 5-15 minutes for TB-scale volumes.
How do I monitor NVMe performance in production?
Monitor storage latency (p99):
histogram_quantile(0.99,
rate(csi_sidecar_operation_duration_seconds_bucket[5m]))
Monitor IOPS utilisation:
rate(node_disk_io_time_seconds_total{device=~"nvme.*"}[5m])
Alert when P99 latency exceeds 50ms for 5 minutes, volume capacity exceeds 90%, or IOPS utilisation exceeds 80% of VM limit. Import Kubernetes Storage Dashboard (ID: 11454) into Grafana and add NVMe-specific panels.
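A minimal sketch of the latency alert as a Prometheus rule file, reusing the metric from the query above (the group name and severity label are arbitrary; tune thresholds to your VM limits):

groups:
- name: nvme-storage-alerts
  rules:
  - alert: NVMeStorageP99LatencyHigh
    expr: |
      histogram_quantile(0.99,
        sum(rate(csi_sidecar_operation_duration_seconds_bucket[5m])) by (le)) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "P99 CSI storage operation latency above 50ms for 5 minutes"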
Is NVMe storage suitable for database workloads like PostgreSQL or MySQL?
Yes, with considerations. The advantages are ultra-low latency improving transaction throughput, high IOPS supporting concurrent queries, and faster checkpoint writes reducing I/O stalls.
The risks are data durability where ephemeral NVMe loses data on VM stop. You need replication with standby replicas on persistent storage and frequent snapshots to persistent storage.
The recommended architecture has a primary database using NVMe PVC for WAL (write-ahead log) and data files with continuous WAL archiving to Azure Blob. A standby replica uses Azure Premium SSD for persistence and receives streaming replication from the primary. Backups run VolumeSnapshot every 6 hours to Azure Disk.
Example:
volumeClaimTemplates:
- metadata:
    name: pgdata
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: acstor-nvme-ai-workloads
    resources:
      requests:
        storage: 500Gi
Performance gain: 2-4x transaction throughput versus Premium SSD for write-heavy workloads. Always maintain replicas on persistent storage to mitigate risk.
Can I use CBT with cross-region disaster recovery?
Yes, CBT reduces disaster recovery replication bandwidth significantly. The architecture uses local incremental snapshots with CBT for hourly snapshots in the primary region with minimal overhead. Cross-region replication has the backup tool query GetMetadataDelta, transfer only changed blocks to the secondary region, and reconstruct the full volume in the DR region. For comprehensive DR planning, review our guide on business continuity and disaster recovery strategies for Kubernetes storage.
Bandwidth savings example: a full volume of 1TB with 5% daily change rate traditionally requires 1TB initial plus 50GB per day. With CBT DR, you need 1TB initial plus 2.5GB per day (only changed blocks compressed).
Expect an RPO (Recovery Point Objective) of about one hour with hourly CBT snapshots, and an RTO (Recovery Time Objective) of 10-30 minutes to restore from the incremental chain. The limitation is that both regions must support CBT-enabled CSI drivers.
What are the security implications of Changed Block Tracking?
CBT exposes block allocation maps, which can reveal data patterns – specifically, which blocks contain data. Mitigate this with RBAC controls on access to the SnapshotMetadata service. CBT still works with encrypted volumes, as changed block tracking happens below the encryption layer.
RBAC configuration:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: snapshot-metadata-reader
rules:
- apiGroups: ["storage.k8s.io"]           # group used by the SnapshotMetadataService CRD above (cluster-scoped, so a ClusterRole is required)
  resources: ["snapshotmetadataservices"]
  verbs: ["get", "list"]
- apiGroups: ["snapshot.storage.k8s.io"]  # VolumeSnapshots live in the snapshot API group
  resources: ["volumesnapshots"]
  verbs: ["get", "list", "create"]
Ensure backup applications use service accounts with minimal permissions. Best practices include enabling audit logging for snapshot operations, encrypting snapshots at rest in backup storage, using network policies to restrict CSI sidecar access, and regular security reviews of RBAC policies.
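A sketch of the network-policy idea, restricting ingress to the metadata sidecar deployed earlier to a labelled backup agent (the backup-agent label is an assumption for illustration):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-snapshot-metadata-access
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      app: snapshot-metadata-sidecar    # matches the sidecar Deployment above
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector: {}             # any namespace...
      podSelector:
        matchLabels:
          app: backup-agent             # ...but only pods carrying this (assumed) label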
Wrapping Up
You now have a high-performance storage and backup solution for your Kubernetes AI workloads.
You’ve enabled Changed Block Tracking for 80-95% backup efficiency improvements. You’ve provisioned NVMe volumes delivering 10-20x lower latency than traditional storage. You’ve configured an incremental backup strategy reducing backup windows from hours to minutes. You’ve established performance baselines with fio benchmarking. For broader context on addressing Kubernetes storage infrastructure limits for AI workloads, explore our complete series.
Next steps: monitor in production by setting up Prometheus alerts for storage performance. Optimise costs by reviewing VM sizing and implementing an ephemeral/persistent hybrid strategy. Test DR procedures by validating cross-region backup restore at least quarterly.
Storage performance is no longer your AI workload bottleneck.