Author: vmtechie

  • VCF 9.1 Makes VKS Harder to Ignore

    VKS on VCF 9.1: What Actually Changed & Why It Matters

    A Comic Book Story in Seven Chapters

    Issue #01 · May 2026 · The VCF 9.1 Saga

    ⚡ Cast of Characters ⚡

    Captain VKS
    vSphere Kubernetes Service 3.6
    The hero. Born from vSphere, forged in CNCF conformance. Now powered up with VCF 9.1 abilities.
    The Architect
    Platform Engineer
    Our protagonist. Runs multi-domain VCF estates. Needs Kubernetes at enterprise scale without the circus.
    Cluster Creep
    The Villain of Sprawl
    Feeds on operational toil, slow provisioning, and fragmented toolchains. Grows stronger with every manual step.
    The Oracle
    VCF Operations 9.1
    Sees all. Knows cost. Tracks every namespace. Speaks in metrics and FinOps.
    🥊 The Challengers 🥊
    The Cloud Twins
    The Hyperscaler Duo
    They move fast and always whisper: “Just move to our cloud.” They charge per hour and never let go.
    The Red Baron
    The Opinionated Platform
    Arrives in full armor. Brings his own runtime, registry, mesh, and opinions about everything. Enterprise prices included.
    The Wrangler
    The Multi-Cluster Cowboy
    Rides across any ranch — any cloud, any edge, any distro. Freedom is his creed. But who’s managing the cattle?
    Bare Knuckle
    The DIY Brawler
    No platform. No hand-holding. Bare metal, kubeadm, and grit. Cheap up front. Costs you in blood and 3 AM pages.
    Chapter 01: The 37-Minute Nightmare
    The data center. 6:42 AM. The Architect stares at a provisioning timer that refuses to move. Cluster Creep watches from the shadows, feeding on frustration.
    The Architect 37 minutes to spin up a dev cluster. Thirty. Seven. Minutes. The hyperscaler team next door gets theirs in ten. The CTO is asking questions.
    Cluster Creep Yesss… and that’s just the deployment. Wait until you see the upgrade windows. I’ve got 45 minutes of downtime planned for each cluster. You have 200 clusters. Do the math. 😈
    That’s 150 hours of maintenance windows per upgrade cycle… across the fleet…
    VCF 9.1 DROPS MAY 5, 2026
    May 5, 2026. Broadcom releases VCF 9.1. And everything changes.
    Captain VKS Miss me? I brought Fast Deploy. Let me show you the new numbers.
    Metric | VCF 9.0 | VCF 9.1
    Cluster Deploy Time | 37 min | 11 min (↓70%)
    Cluster Upgrade Time | 45 min | 15 min (↓67%)
    Max Clusters / Supervisor | ~100 | 500
    Node Pool Placement | Manual | DRS Intelligent
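The percentages in the table follow from the before and after timings, and Cluster Creep's "do the math" line can be checked the same way. This is a minimal sketch: the 200-cluster fleet size comes from the story, the function names are illustrative, and the fleet calculation assumes clusters are upgraded one after another.

```python
def percent_reduction(before_min: float, after_min: float) -> float:
    """Return the reduction from before_min to after_min as a percentage."""
    return (before_min - after_min) / before_min * 100


def fleet_window_hours(clusters: int, minutes_per_cluster: float) -> float:
    """Total maintenance time in hours if every cluster is upgraded once, serially."""
    return clusters * minutes_per_cluster / 60


if __name__ == "__main__":
    print(f"Deploy time reduction:  {percent_reduction(37, 11):.1f}%")
    print(f"Upgrade time reduction: {percent_reduction(45, 15):.1f}%")
    print(f"Fleet upgrade at 45 min each: {fleet_window_hours(200, 45):.0f} h")
    print(f"Fleet upgrade at 15 min each: {fleet_window_hours(200, 15):.0f} h")
```

For a 200-cluster fleet the serial upgrade window shrinks from 150 hours to 50 hours per cycle, before any parallelism is applied.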
    Chapter 02: The Challengers Step Forward
    Word of VCF 9.1 spreads. Four challengers emerge from the fog, each claiming the throne of enterprise Kubernetes. The Architect has heard their pitches before.
    The Cloud Twins Adorable upgrade, Captain. But we’ve been doing sub-10-minute clusters for years. Managed control plane. Global regions. Auto-scaling node groups. Why fight gravity? Just come to the cloud.
    Captain VKS Sure — and your managed control plane costs how much per cluster per month? Multiply that by 500 clusters. Now add the egress fees. Now add the data sovereignty audit your CISO just mandated. I run on hardware you already own.
    The Red Baron How charming. You finally got CNI choice? I ship with my own SDN, my own service mesh, my own registry, my own CI/CD pipelines, and a full developer portal. I am the platform. You’re still assembling one.
    Captain VKS You are the platform. That’s the problem. Your opinions become my constraints. Your lifecycle becomes my upgrade treadmill. Your per-core subscription becomes my CFO’s nightmare. I give choice. You give mandates.
    The Wrangler Y’all are so cute with your single-vendor stacks. I run on any infrastructure. True multi-cluster freedom. No lock-in. Ever.
    Captain VKS Freedom is great until your team is maintaining six different infrastructure backends. I give you 500 clusters on one Supervisor with one operational model. You give them options and a prayer.
    Bare Knuckle I don’t need a platform. kubeadm, a Makefile, and raw skill. Zero licensing. Zero overhead. Pure Kubernetes.
    Captain VKS I respect the craft. But who patches your nodes at 2 AM? Who handles etcd backups? Who runs certificate rotation? Your “zero cost” platform costs three full-time engineers.
    The Architect I’ve evaluated all of you. Here’s my problem: I already run VCF. My VMs, NSX networking, vSAN storage, and security policies are all here. I need Kubernetes that joins my platform — not one that replaces it or ignores it.
    The best Kubernetes platform is the one that doesn’t make me build a second operations team…
    Chapter 03: Fast Deploy - 11 Minutes or Bust
    Captain VKS explains what changed under the hood. Fast Deploy isn’t a marketing stunt — it’s an architectural rework of the provisioning pipeline.
    Captain VKS Here’s what actually happened. We parallelized the node bootstrapping sequence, pre-staged container images into a local content library, and eliminated redundant API round-trips during cluster init. 11 minutes, from API call to workload-ready.
    The Architect What about upgrades? That’s where we bleed. Every cluster upgrade is a maintenance window, and my team juggles 200+ clusters.
    Captain VKS 45 minutes down to 15. Pre-staged images, parallel node drain-and-replace, and Multiple Clusters per Zone means you keep workloads running on Zone A while upgrading Zone B.
    ⚡ IMPACT METER ⚡
    Provisioning Speed Gain: 70%
    Upgrade Speed Gain: 67%
    Scale Ceiling Increase: 5× (500 clusters)
    The Cloud Twins 11 minutes… fine, that’s competitive. But can you match our global availability zones?
    Captain VKS I don’t need 60 regions. My Architect’s data stays in his sovereign data center, on his hardware, under his compliance umbrella. Your 60 regions are 60 places his CISO has to audit.
    Chapter 04: DRS Strikes Back - Intelligent Node Pool Placement
    VCF 9.1 introduces Intelligent Node Pool Placement. This isn’t basic affinity rules — it’s DRS-level scheduling applied to Kubernetes node pools.
    Captain VKS GPU pods → GPU hosts. NVMe workloads → NVMe nodes. DRS algorithm decides placement — not your YAML-wrestling platform team.
    The Red Baron I have Topology Manager, NUMA-aware scheduling, and a full operator ecosystem. Infrastructure-aware placement is table stakes for me.
    Captain VKS You schedule within the cluster. I schedule the cluster itself. DRS sees the whole estate. Your scheduler sees one namespace.
    The Oracle With VKS Cost Showback in VCF Operations 9.1, I can tell you exactly what each namespace, each cluster, each team is costing you. FinOps FOCUS-compliant.
    The Oracle I also expose an API for your RAG pipelines and MCP frameworks — your AIOps engine can query cost data directly.
    Per-NS Cost Attribution
    FOCUS FinOps Compliant
    Real-Time Pricing Estimates
    Showback + Chargeback Capability
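To make per-namespace showback concrete, here is a minimal sketch of the aggregation step. The records are made-up sample data, not output from VCF Operations. The BilledCost field name follows the FOCUS specification, while the namespace and team keys and the showback function are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical cost records; in practice these would come from a cost export.
records = [
    {"namespace": "payments", "team": "fintech", "BilledCost": 120.50},
    {"namespace": "payments", "team": "fintech", "BilledCost": 30.25},
    {"namespace": "ml-train", "team": "data", "BilledCost": 410.00},
]


def showback(rows, key):
    """Sum BilledCost per distinct value of `key` (e.g. per namespace or team)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["BilledCost"]
    return dict(totals)


if __name__ == "__main__":
    print(showback(records, "namespace"))
    print(showback(records, "team"))
```

The same grouping works for any attribution dimension, which is why a single cost record format is useful for both showback and chargeback.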
    Chapter 05: Container-as-a-Service & The CNI Revolution
    VCF 9.1 introduces a simplified Container Service — deploy containers without deep Kubernetes expertise. Meanwhile, VKS 3.6 opens up CNI choice for the first time.
    Captain VKS First: Container-as-a-Service. Your app teams get a self-service surface. Click, deploy, done. No Supervisor clusters or ClusterClass YAML.
    Captain VKS Second: CNI freedom. VKS 3.6 deprecated ClusterBootstrap. Pick your CNI through the Addon Framework using AddonConfig CRDs. Antrea default, but the door is open.
    The Wrangler Oh, you’re just now letting people choose their CNI? Welcome to 2022, Captain.
    Captain VKS You let them choose. I let them choose with validated blueprints, lifecycle support, and a single vendor to call at 3 AM. Choice without support is just risk with extra steps.
    The Architect And the Ingress story? The popular open-source Ingress controller is being retired…
    Captain VKS Avi Load Balancer — natively integrated. Centralized control plane, distributed data plane, full observability. Plus vDefend gives you zero-trust lateral security for every pod.
    Chapter 06: The Arena - Where Platforms Are Measured
    The Architect pulls up the scoreboard. No hype. No marketing. Just the dimensions that matter when you’re running Kubernetes in a regulated enterprise with 500+ VMs already on VCF.
    ⚔️ HEAD TO HEAD ⚔️
    Dimension | 🛡️ Captain VKS | ☁️ Cloud Twins | 🎩 Red Baron | 🤠 Wrangler | 🥊 Bare Knuckle
    Data Sovereignty | Your DC | Their DC | Your DC | Depends | Your DC
    VM + K8s Unified Ops | Native | Separate | Separate | Separate | Separate
    Infra-Aware Scheduling | DRS-Level | Node Groups | Topology Mgr | Manual | DIY
    Cluster Scale Ceiling | 500 / Supervisor | Unlimited* | Per Infra | Per Infra | Per Team
    Integrated FinOps | FOCUS Native | Cost Explorer | 3rd Party | 3rd Party | Spreadsheet
    Network Security | vDefend + NSX | VPC / SG | Built-in SDN | BYO | BYO
    Licensing Model | Per-Core VCF | Per-Cluster/Hr | Per-Core Sub | Open Source | Free
    Day 2 Toil | Low | Low | Medium | Medium | High
    AI / GPU Conformance | CNCF AI Cert | GPU Pools | Operators | BYO | BYO
    The Cloud Twins We still win on global reach and elastic scale.
    The Red Baron And I still own the developer experience story. Integrated CI/CD, GitOps, developer portal, all out of the box.
    Captain VKS Fair. I’m not claiming I win everywhere. But for organizations already running VCF — I’m the only Kubernetes that doesn’t create a second operational island. VMs and containers. One platform. One team. One pane.
    The Architect That’s the point everyone misses. I don’t need the “best” Kubernetes in a vacuum. I need the best Kubernetes for my stack. And my stack is VCF.
    Chapter 07: The Numbers Don’t Lie
    💥 THE FINAL SHOWDOWN 💥
    Broadcom surveyed 44 VCF 9 customers in March 2026. Here’s what they found — and why the challengers are looking over their shoulders.
    51% Less Infra Mgmt Time
    46% Less Monitoring Time
    47% Less Capacity Needed
    39% Faster MTTR/MTTI
    Cluster Creep No… NO! My sprawl… my complexity… my beautiful 37-minute deploy times… NOOOOO!
    ⚡ DEFEATED ⚡
    The challengers watch from the sidelines. They’re not defeated — but they know the game just changed.
    The Cloud Twins We’ll be back. Hybrid is where we’re heading too. See you at the edge…
    The Red Baron Impressive numbers. But developer experience is the next battlefield. Don’t get comfortable.
    The Wrangler Not every ranch runs on one brand of fence. I’ll see you at the multi-cloud rodeo.
    Bare Knuckle Some of us still prefer the raw fight. But… 11 minutes is hard to argue with.
    The Architect VCF 9.1 gives me 11-minute deploys, 15-minute upgrades, 500 clusters per Supervisor, intelligent DRS-based node placement, native FinOps cost tracking, self-service CaaS, open CNI choice, native Avi ingress, and zero-trust pod security. All on the same VCF stack I’m already running.
    Captain VKS And I’m CNCF Kubernetes AI Conformant. The challengers are strong — I respect each of them. But none of them can do what I do: run Kubernetes as a native citizen of your existing VMware estate.
    VCF 9.1 doesn’t just iterate on VKS — it redefines the operational ceiling. Fast Deploy eliminates the provisioning tax. DRS-based placement removes manual scheduling toil. FinOps cost showback closes the last visibility gap. And with 500 clusters per Supervisor, VKS is the platform-scale Kubernetes runtime that VCF architects have been waiting for.

    The challengers each bring real strengths — managed simplicity, opinionated platforms, multi-cloud freedom, zero-cost entry. This isn’t a story where the hero has no flaws. But for the Architect running a VCF estate with VMs, containers, and AI workloads under one roof — the calculus is clear.

    The question is no longer “can VKS compete?” — it’s “what’s your excuse for not running it?”
  • VCF 9.1 – Top 20 Highlights for Cloud Service Providers

    Multi-tenancy, self-service, networking isolation, storage economics, fleet operations — curated for the CSP lens.

    VCF 9.1 is arguably the most CSP-significant VCF release in recent memory. The networking story alone — edge-free distributed connectivity, VPC isolation policies, EVPN/VXLAN peering — rewrites the playbook for multi-tenant service delivery. But the improvements span every layer: storage economics, Kubernetes density, fleet operations, and cyber recovery. Here are the 20 features that matter most if you’re running — or planning to run — a VMware-powered cloud service.

    Multi-Tenancy & Self-Service
    01 / 20

    VCD → VCF Automation Migration Tool

    This is the feature VCD-based CSPs have been waiting for. VCF 9.1 introduces a native migration path from VMware Cloud Director to VCF Automation. VMs are imported from OrgVDC resource pools directly into vSphere Namespaces. Supervisors, Clusters, Regions, Projects, and Namespaces are auto-created and mapped to existing VCD constructs. Network boundaries of OrgVDC are migrated to NSX VPC — preserving tenant isolation through the transition.

    Why CSPs Care

    Unblocks the single biggest migration concern for VCD-based providers. Automated construct mapping dramatically reduces migration effort, tenant downtime, and the professional services cost of transitioning to the VCF Automation operating model.

    02 / 20

    Self-Service Namespace Creation with Guardrails

    Organization admins can now delegate vSphere Namespace creation to Project Admins on a self-service basis. The governance layer is granular: admins define which Regions, Namespace Classes, Connectivity Profiles, Subnets, Infrastructure Policies, VPCs, and Service Engine Groups are available to each project. Tenants consume within those boundaries without filing tickets.

    Why CSPs Care

    Every namespace creation ticket that disappears from a CSP’s queue is margin improvement. Self-service with admin-defined guardrails is the operational model CSPs need — tenant autonomy without infrastructure risk.

    03 / 20

    Upfront Pricing Estimates & Tenant Notifications

    Tenants now see real-time pricing estimates before deploying catalog items, VMs, and VKS clusters. Consumption reports, infrastructure alerts, and critical operation notifications are surfaced directly in the VCF Automation UI. Providers configure which alerts and reports are visible to tenants.

    Why CSPs Care

    Transparent showback/chargeback is fundamental to CSP economics. When tenants see cost before they click “deploy,” billing disputes drop, resource waste decreases, and self-service confidence goes up.

    04 / 20

    Project-Scoped Content Libraries

    A new form of content library scoped to specific projects within an organization. Admins can restrict VM image availability so that only the users and resources of a given project can access particular images. Canonical Ubuntu images are now available as validated, subscribed content — provider-controlled.

    Why CSPs Care

    Image governance per tenant project. CSPs curate approved OS images without cross-tenant leakage — essential for regulated tenants and for CSPs offering tiered service catalogs.

    Networking & Tenant Isolation
    05 / 20

    VPC Connectivity Policies — Community, Promiscuous, Isolated

    VCF 9.1 introduces connectivity policies that control inter-VPC communication within a tenant project — without firewall rules. Community: VPCs in the same community talk to each other. Promiscuous: talks to any VPC. Isolated: only communicates with promiscuous VPCs. These can be mixed within a project for precise segmentation.

    Why CSPs Care

    Multi-tier tenant networking (dev/staging/prod isolation, shared-services patterns) handled by policy rather than per-rule firewalls. Reduces CSP networking configuration overhead per tenant from hours to minutes.
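The three policies can be read as a small reachability rule set. The sketch below encodes them as described above. The Vpc class and can_communicate function are hypothetical names, not an NSX API. Treating two isolated VPCs, or community VPCs in different communities, as unable to communicate is an inference from the description, not a documented guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vpc:
    name: str
    policy: str                      # "community", "promiscuous", or "isolated"
    community: Optional[str] = None  # only meaningful for community VPCs


def can_communicate(a: Vpc, b: Vpc) -> bool:
    """Apply the three connectivity policies to a pair of VPCs."""
    if "promiscuous" in (a.policy, b.policy):
        return True  # a promiscuous VPC talks to any VPC
    if a.policy == b.policy == "community":
        return a.community == b.community  # same community only
    return False  # isolated VPCs reach promiscuous VPCs only


# Example project layout: shared services, two non-prod VPCs, one locked-down prod VPC.
shared = Vpc("shared-services", "promiscuous")
dev = Vpc("dev", "community", "nonprod")
staging = Vpc("staging", "community", "nonprod")
prod = Vpc("prod", "isolated")
```

In this layout dev and staging reach each other and the shared services VPC, while prod reaches shared services only, all without a single firewall rule.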

    06 / 20

    Transit Gateway Advanced Connectivity

    The Centralized Transit Gateway (CTGW) is now decoupled from Tier-0. VCF 9.1 supports HA mode per CTGW, multiple CTGWs and Distributed Transit Gateways (DTGWs) per project, and multiple external connections per CTGW. For outbound traffic, tenants get full control over which Tier-0 is used, where SNAT is applied, and which External IP block is consumed.

    Why CSPs Care

    Per-project external connectivity with independent Tier-0 selection eliminates the shared gateway bottleneck. CSPs can model complex tenant topologies — multi-ISP, multi-region, dedicated uplinks — on shared infrastructure.

    07 / 20

    Distributed Transit Gateway with EVPN/VXLAN

    Peer directly with the physical fabric using industry-standard EVPN/VXLAN. This decouples the control and data plane for north-south traffic — VMs get direct N/S connectivity without traffic tromboning through Edge appliances. No edge lifecycle, no edge provisioning, no edge scaling headaches.

    Why CSPs Care

    Edge VM sprawl is one of the top operational pain points at CSP scale. DTGW with EVPN/VXLAN eliminates it entirely for N/S traffic — better latency, fewer failure domains, dramatically simpler operations.

    08 / 20

    Virtual Network Appliances (VNA) — Edge-Free Network Services

    A dedicated VNA Cluster now runs network services for Distributed External Connections: External IP (1:1 NAT), DHCP, NAT (SNAT/DNAT), VPC Outbound NAT (N:1 — new in 9.1), and NSX LB for Supervisor/VKS (new in 9.1) plus Avi VPC LB Plugin. Only NAT and LB traffic is redirected to VNAs — L2/L3 and External IP traffic remains fully distributed.

    Why CSPs Care

    Network services without deploying and managing Edge VMs per tenant. The distributed data-path keeps per-tenant traffic efficient while VNAs handle only the stateful services that need them.

    09 / 20

    TGW Span + Infoblox IPAM Integration

    Transit Gateway Span constrains a TGW and its subnets to selected vCenter clusters — controlling where subnets are available, where workloads can be placed, and aligning DTGW spans with external connection VLANs. Separately, Infoblox integration discovers and maps Network Containers to external IP blocks, provisions subnets/IPs using Infoblox CIDRs, and auto-registers workload IPs and FQDNs.

    Why CSPs Care

    TGW Span gives CSPs physical network alignment per tenant cluster — critical for VLAN-constrained environments. Infoblox integration provides the single DDI source-of-truth that large CSPs already depend on, now natively integrated with VCF networking.

    Storage & Data Efficiency
    10 / 20

    vSAN ESA Inline Compression (ZSTD) + Global Deduplication GA

    vSAN 9.1 introduces a ZSTD-based inline compression algorithm tuned specifically for vSAN — delivering significantly higher data reduction ratios while balancing CPU utilization. Compression is now always-on. In parallel, vSAN Global Deduplication reaches GA, supporting between 3 and 64 hosts with improved processing efficiency. Crucially, Global Dedup is fully compatible with Data-at-Rest encryption — no negative impact on reduction ratios.

    Why CSPs Care

    Direct $/TB improvement. Better compression + dedup = higher tenant density per physical disk. This is fundamental to CSP storage margin economics, especially for VDI, database, and backup workload profiles.
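The $/TB effect is simple arithmetic. A rough sketch follows, with hypothetical prices and ratios: real reduction ratios depend heavily on the workload, and the function names are illustrative.

```python
def effective_capacity_tb(usable_tb: float, reduction_ratio: float) -> float:
    """Logical capacity available to tenants at a given data reduction ratio."""
    return usable_tb * reduction_ratio


def cost_per_effective_tb(cost_per_usable_tb: float, reduction_ratio: float) -> float:
    """Cost of one tenant-visible TB after compression and deduplication."""
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return cost_per_usable_tb / reduction_ratio


if __name__ == "__main__":
    # Hypothetical: 500 TB usable at 100 currency units per TB, 2.0:1 reduction.
    print(effective_capacity_tb(500, 2.0))   # tenant-visible TB
    print(cost_per_effective_tb(100, 2.0))   # cost per tenant-visible TB
```

At a 2.0:1 ratio the same disks hold twice the tenant data, so the cost per sellable TB halves.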

    11 / 20

    Auto-RAID + Effective Capacity View

    Auto-RAID automatically manages optimal resilience settings per cluster using a single “vSAN ESA Auto RAID Policy” in vCenter — dynamically adjusting as cluster size changes (4-host, 6-host stretched, 2-node, single-host bootstrap). The new “effective capacity” view replaces raw capacity statistics with usable capacity and simplified space-efficiency summaries covering dedup ratio, compression ratio, thin provisioning savings, and snapshot savings.

    Why CSPs Care

    No more manual storage policy tuning across hundreds of tenant clusters. Effective capacity view aligns with how CSPs bill and report storage — usable TB, not raw TB with overhead footnotes.

    12 / 20

    Native S3 Object Storage on vSAN — Technology Preview

    Block, file, and S3-compatible object storage running on the same vSAN cluster. Multi-tenant object storage is provisioned and managed via VCF Automation or vSphere Supervisor. Scalable, resilient architecture courtesy of vSAN ESA. Available as Technology Preview in Patch 01 of VCF 9.1.

    Why CSPs Care

    A new service tier on existing hardware. CSPs can offer S3-compatible object storage to tenants without deploying separate storage infrastructure — opening up developer-oriented and AI/ML data-lake use cases.

    Kubernetes & Containers
    13 / 20

    VKS: 500 Clusters per Supervisor + Fast Deploy

    VKS now supports up to 500 Kubernetes clusters per Supervisor — a 2.6× scale increase over VCF 9.0. VKS 3.6 ships Kubernetes 1.35 (CNCF-certified, 24-month support). Fast Deploy leverages linked-clone (unencrypted VMs) and direct-mode (encrypted VMs) technologies to reduce cluster provisioning time by approximately 70% and upgrades by approximately 75%.

    Why CSPs Care

    Dramatically higher Kubernetes tenant density per control plane instance. Fast Deploy addresses burst scenarios common in VDI and retail — and reduces time-to-revenue for new K8s tenant onboarding from 37 minutes to 11 minutes.

    14 / 20

    Container Service — CaaS Without Kubernetes

    Deploy isolated, secure containers directly on vSphere Pods within vSphere Namespaces — no full Kubernetes cluster required. UI-driven provisioning and lifecycle control. Supports StatefulSets with persistent volumes and multi-container pods. Based on the proven vSphere Pods technology with VM-level isolation.

    Why CSPs Care

    CSPs can offer a lightweight container service tier below full VKS — lower cost, faster deploy, familiar vSphere management. This broadens the addressable tenant market to teams that want containers but don’t need (or want to manage) Kubernetes.

    Operations & Lifecycle
    15 / 20

    Unified Fleet IAM & Management

    VCF 9.1 delivers end-to-end IAM with VCF-level roles across all components — vCenter, NSX, Operations, Automation, Logs, Networks, HCX, and Orchestration — all brokered through VIDB (Identity Broker). Unified password policies with vault integration, bulk certificate management (generate CSRs, renew, import across the fleet), and OAuth/API token access for programmatic automation. Custom VCF roles can be provisioned across vCenter and VCF instances.

    Why CSPs Care

    Single identity plane for the entire VCF estate. CSPs managing multi-instance fleets get consistent RBAC, password governance, and certificate rotation at scale — replacing the fragmented per-instance identity management that doesn’t survive operational audits.

    16 / 20

    Centralized LCM — 4× Parallel Upgrades

    Lifecycle Management is now part of the VCF Services Platform with a unified software depot secured via OAuth token. Optimized precheck workflows and a 4× improvement in parallel cluster upgrade operations — centrally managed from VCF Operations. One place to download and manage binaries, and quickly assess health and upgrade readiness across the fleet.

    Why CSPs Care

    CSPs running hundreds of clusters can upgrade 4× faster in parallel. Single depot and centralized LCM eliminates the maintenance-window sprawl that plagues large CSP environments — turning a weekend-long upgrade cycle into an overnight operation.

    17 / 20

    Flexible Licensing — License Server + Aggregated Usage

    VCF components are automatically licensed via vCenter when configured in connected mode. A dedicated license server offloads license logic from VCF Operations. Multiple licenses can be applied directly to a vCenter and its connected components. Aggregated license usage for ESX 8.x and 9.x. On-prem license appliance available for air-gapped or sovereign environments.

    Why CSPs Care

    CSPs with mixed-version estates (VCF 5.x through 9.x) get aggregated license management across generations. Override licenses support unique CSP scenarios — trial tenants, PoC environments, and tiered service offerings with differentiated entitlements.

    Security & Cyber Recovery
    18 / 20

    On-Premises Cyber Recovery Clean Room

    Full ransomware protection and recovery on customer-owned infrastructure — no cloud dependency. The solution extends vSAN Protection and Recovery to provide on-prem clean room capabilities with push-button vDefend-based network isolation, EDR integration (Carbon Black included by default, CrowdStrike BYOL supported), guided restore point selection, VM analysis and validation in the isolated environment, and orchestrated failback workflows.

    Why CSPs Care

    CSPs can offer “Cyber Recovery as a Service” as a premium tier — fully on-prem, data-sovereign, with clean room isolation that satisfies regulated industries prohibiting cloud-based recovery. The EDR vendor choice (Carbon Black or CrowdStrike) aligns with whatever the tenant already runs.

    19 / 20

    Security Posture Management & Compliance Automation

    Fleet-wide compliance assessments using built-in benchmarks — enable benchmarks, assign to policies, clone and modify rules to suit requirements. Run assessments on-demand, view and filter results, export to PDF/CSV, and perform one-click remediation to infrastructure objects. Confidential Computing visibility through the SecOps dashboard (AMD SEV-SNP, Intel TDX). VCF-wide audit trails with standardized log architecture for security forensics.

    Why CSPs Care

    Automated compliance reporting for regulated tenants (FIPS 140-3, STIG, custom benchmarks). One-click remediation across the fleet reduces CSP audit preparation from weeks to hours. The audit trail becomes a sellable compliance artifact for tenants in financial services and government.

    Edge
    20 / 20

    VCF Edge — 5,000 Hosts, 256 Parallel Upgrades, ZTP + GitOps

    Fleet capacity doubled to 5,000 ESX hosts per instance. Parallel upgrade scale increased 4× from 64 to 256 clusters. Zero Touch Provisioning uses UEFI HTTPS Boot with TPM and Secure Boot support — hosts inherit desired-state image and configuration from the cluster, no TFTP required. Day-0 activation scripts configure vSphere clusters, Supervisor, and FLB. Argo CD-based GitOps provides pull-based workload delivery with drift detection and auto-correction. Flexible 1/2/3+ node topologies with full air-gap support.

    Why CSPs Care

    CSPs serving retail, telco, or industrial edge can scale to thousands of sites with lights-out ZTP and GitOps delivery. 256 parallel upgrades make fleet-wide patching operationally viable — a requirement for edge CSPs where site-by-site maintenance windows are physically impossible.
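The effect of wider parallelism on a patch cycle can be estimated by counting upgrade waves. This is a back-of-the-envelope sketch: the 1,000-cluster fleet and the 1.5 hours per wave are hypothetical inputs, and it assumes every wave takes the same time.

```python
import math


def upgrade_waves(clusters: int, parallelism: int) -> int:
    """Number of sequential batches needed to upgrade every cluster once."""
    return math.ceil(clusters / parallelism)


def fleet_upgrade_hours(clusters: int, parallelism: int, hours_per_wave: float) -> float:
    """Total elapsed time if waves run back to back."""
    return upgrade_waves(clusters, parallelism) * hours_per_wave


if __name__ == "__main__":
    for parallelism in (64, 256):
        waves = upgrade_waves(1000, parallelism)
        hours = fleet_upgrade_hours(1000, parallelism, 1.5)
        print(f"{parallelism} parallel: {waves} waves, {hours:.1f} h")
```

With these inputs the cycle drops from 16 waves to 4, which is the difference between a multi-day campaign and a single maintenance night.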

    The CSP Takeaway

    VCF 9.1 is a platform release, not just a feature release. The networking overhaul (DTGW, VNAs, VPC policies, EVPN/VXLAN) alone justifies the upgrade for any CSP running multi-tenant workloads. Layer on the VCD migration tool, self-service namespaces, storage economics improvements, and fleet-scale operations — and this is the release that brings VCF’s cloud operating model to parity with what CSPs have been building manually around VCD for years.

  • I Built a Tool to Stop YAML Hell During Cloud → VCF 9 VKS Migrations

    The Problem:

    The Solution:

    What It Does: 

    🚀 Try it below

    // VMTECHIE.BLOG

    EKS/AKS/OCP → VKS/VCF9

    Upload K8s manifests → analyze cloud deps → transform for VKS on VCF 9 → download migration bundle

    Steps: 1 Upload → 2 Analysis → 3 Bundle

    Upload & config
    • Export: kubectl get all,cm,secret,pvc,ingress,sa,pdb,hpa -n <ns> -o yaml
    • Source platform: EKS, AKS, or OCP
    • Input: drop or select one or more .yaml/.yml files, or paste YAML (separate documents with ---)

    Target VKS config
    • Harbor FQDN (on-prem Harbor registry)
    • Harbor project
    • vSAN StorageClass name
    • Velero bucket URL
    • Velero bucket name
    • Target namespace (optional; blank keeps the original)

    Analysis
    • Issues and changes are listed once the transformation completes.

    Migration bundle
    • When all files are ready, review the tabs, then download the ZIP.

    Step-by-Step Usage:

    # Connect to your source EKS/AKS/OpenShift cluster
    kubectl config use-context my-eks-cluster
    # Export all resources from the production namespace
    kubectl get all,configmaps,secrets,pvc,ingress,serviceaccounts,pdb,hpa \
      -n production -o yaml > production-export.yaml
    # Repeat for each namespace you're migrating
    kubectl get all,cm,secret,pvc,ingress,sa,pdb,hpa -n staging -o yaml > staging-export.yaml
    # On OpenShift, use oc instead and include routes, DeploymentConfigs, and ImageStreams:
    # oc get all,cm,secret,pvc,route,sa,pdb,hpa,deploymentconfigs,imagestreams \
    #   -n production -o yaml > production-export.yaml
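One transformation a migration like this typically needs is repointing container images from a cloud registry to the on-prem Harbor registry named in the target config. The sketch below is an assumption about how such a rewrite could work, not the tool's actual code. The rewrite_image function, the example FQDN, and the project name are all hypothetical.

```python
def rewrite_image(image: str, harbor_fqdn: str, project: str) -> str:
    """Repoint an image reference at a Harbor project, keeping repo path and tag."""
    first, sep, rest = image.partition("/")
    # Docker convention: the first path component is a registry host only if it
    # contains a dot or a port, or is exactly "localhost".
    has_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    path = rest if has_registry else image
    return f"{harbor_fqdn}/{project}/{path}"


if __name__ == "__main__":
    harbor, project = "harbor.example.internal", "migrated"
    for ref in (
        "123456789012.dkr.ecr.us-east-1.amazonaws.com/team/app:1.2",
        "nginx:1.25",
    ):
        print(ref, "->", rewrite_image(ref, harbor, project))
```

The images themselves still have to be copied into Harbor separately; this only shows the manifest-side rename.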

    Known Limitations:

    Disclaimer & Privacy:

  • The Integration Debt Nobody Budgets For — And How VCF Eliminates It…

    Optionality sounds powerful… until you have to operate it.

    This is not a debate about which hypervisor is fastest or which Kubernetes distribution has the most GitHub stars. It is a more fundamental question: what does it cost your organisation to assemble a platform versus deploying one? And as AI workloads enter the data centre, that question has never carried higher stakes.


    🔷 1. The Illusion of Flexibility

    Modern infrastructure platforms arrive with a compelling pitch:

    The Pitch

    • Choose your compute
    • Pick your storage
    • Define your networking
    • Add Kubernetes
    • Extend to AI later

    At first glance, this looks like control. It reads like architectural maturity. It feels like optionality. The reality is subtler.

    ⚠️

    Reality Check

    What appears as flexibility often becomes integration responsibility. You are no longer just consuming a platform — you are building and maintaining one. The components are yours to choose. So is the glue, the upgrade matrix, and the 2am incident call when two of them disagree.


    🔶 2. The Cost Nobody Invoices — Operational Fragmentation

    Most infrastructure cost conversations stop at licensing. That is the wrong place to stop.

    Organisations that assemble their stack from best-of-breed point products pay a tax that never appears on a single invoice. That tax is operational fragmentation — the compounding overhead of managing upgrade matrices, support escalations, skill silos, and integration glue between components that were never designed to coexist.

    Hidden Costs of an Assembled Stack

    • 🔁 Cross-component compatibility testing before every patch cycle
    • 🔄 Coordinated upgrades across independently-released product versions
    • 🧩 Integration gaps between tooling layers with no validated fix path
    • 🛠 Multi-vendor troubleshooting with no single accountable party
    • 📋 Separate training and certification paths per product silo
    • ⚙️ Custom automation scripts that break on every minor version update

    None of these costs appear on a rack-and-stack BOM. But they absolutely show up in headcount, MTTR, change failure rate, and the number of people needed on a change advisory board call to approve a routine patch.
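The compatibility-testing burden can be illustrated by counting version pairs. This is a deliberately simplified model with made-up numbers: it counts only pairwise combinations between components, and the component names and version counts are hypothetical.

```python
from itertools import combinations


def pairwise_combinations(versions_in_service: dict) -> int:
    """Count version pairs across every pair of components in the stack."""
    return sum(
        versions_in_service[a] * versions_in_service[b]
        for a, b in combinations(versions_in_service, 2)
    )


# Hypothetical assembled stack: number of versions of each product in service.
assembled = {"hypervisor": 2, "storage": 3, "network": 2, "kubernetes": 3, "monitoring": 2}

# The same five layers when each is pinned to one validated version.
pinned = {name: 1 for name in assembled}

if __name__ == "__main__":
    print("assembled:", pairwise_combinations(assembled))
    print("pinned:   ", pairwise_combinations(pinned))
```

Even with only two or three versions per layer, the assembled stack has 57 pairings to reason about against 10 for a pinned bill of materials, and someone has to own each of them.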

    📌

    Key Insight

    Complexity doesn’t disappear — it just moves. In optional models, it moves to the operator.


    🔷 3. The Shift in What Matters

    The success criteria for enterprise infrastructure have fundamentally changed.

    Old Question

    • “Do I have the best individual components?”

    New Question

    • “Can my platform run everything — consistently — at scale?”

    Including enterprise VMs, Kubernetes workloads, and AI/ML pipelines — on the same operational model, under the same lifecycle management, enforcing the same security policy.


    🔶 4. What “Integrated” Actually Means in a VCF Context

    Integration is one of the most overloaded words in enterprise IT. Vendors routinely describe a collection of separately licensed, separately patched, separately supported products as an “integrated platform” because they share an API or a common UI skin. That is not integration — that is aggregation with a coat of paint.

    True integration, as delivered by VMware Cloud Foundation, means something more fundamental. VCF is not a loose collection of components. It is an engineered system, built to operate as one.

    Compute: vSphere
    Storage: vSAN ESA
    Networking: NSX
    Kubernetes: vSphere Kubernetes Service
    Operations: VMware VCF Operations
    Lifecycle: VCF Ops in conjunction with SDDC Manager

    What integration actually delivers:

    • Single Bill of Materials: vSphere, vSAN ESA, NSX, VKS and VCF Ops are validated, tested, and shipped as a versioned unit. The interoperability matrix is solved by Broadcom — not by your operations team.
    • Unified Lifecycle Management: VCF Ops orchestrates Day-2 operations — patching, upgrades, cluster expansion — across all stack components in a single guided workflow.
    • Shared Policy Plane: NSX DFW, vSAN SPBM, and VKS Supervisor Namespaces consume the same identity and policy constructs. Security posture defined once propagates consistently across VM and container workloads.
    • Native AI & GPU Fabric: VCF 9’s NVIDIA AI Enterprise integration and VKS GPU scheduling work at the platform level — no bolt-on operator, no custom integration project.

    What This Enables

    A single operational model across VMs, containers, and AI workloads — with one lifecycle, one policy plane, one support contract.


    🔷 5. Optionality vs Integration — The Real Trade-Off

    The choice is not between good and bad — it is between two fundamentally different operational philosophies. Here is what that looks like in practice.

    Dimension | DIY / Assembled Stack | VMware Cloud Foundation
    Architecture | Assembled: maximum component choice | Pre-integrated: engineered as a system
    Upgrade Coordination | Manual: you own the BOM and compatibility matrix | Automated: VCF Ops in conjunction with SDDC Manager orchestrates end-to-end
    Security Policy Consistency | Fragmented: per-layer silos, no enforcement parity | Unified: NSX DFW spans VM + container workloads
    AI/GPU Scheduling | Custom: no native shared pool across VM + K8s | Native: VKS Supervisor + NVIDIA AIE integration
    Sovereign / Air-Gap | Possible, but requires significant custom work | Designed: built for sovereign deployment patterns
    Support Accountability | Multi-vendor: no single throat to choke | Single contract: one Broadcom support engagement
    Day-0 Deployment | Weeks to months: integration work starts on day one | Hours: Cloud Builder automation handles bring-up
    Operational Risk | Higher: integration gaps are your responsibility | Lower: Broadcom validates the full stack

    The assembled model earns its place when flexibility and component choice genuinely matter. VCF earns its place when operational outcomes — upgrade coherence, policy consistency, Day-2 simplicity — are the priority. Know which problem you are actually solving.


    🔶 6. Architect’s Take — LCM Is Where It Pays Off Most Visibly

    💡

    Scaling Principle

    You don’t scale by increasing choice. You scale by reducing variability.

    Lifecycle management is the unglamorous work that consumes a disproportionate share of infrastructure team capacity. Patching a fragmented 200-node environment with independent networking, storage, and compute upgrade cycles can absorb weeks of engineering time per quarter. That is time not spent on automation, capacity planning, or AI platform delivery.

    VCF’s Ops Manager LCM workflow reduces this to a structured, guided operation:

    • Broadcom pre-validates the combined patch bundle across vSphere, vSAN, NSX, and VKS before release
    • VCF Ops Manager performs pre-check validation of cluster health, DRS rules, and NSX edge availability before any host enters maintenance mode
    • Rolling vMotion-aware patching keeps workloads running — no scheduled downtime windows for routine patches
    • Async patch support in VCF 9 lets you apply critical security fixes to individual components outside the full bundle cadence
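The guided workflow above can be pictured as a gate-then-roll loop: nothing starts until the prechecks pass, and hosts are then patched one at a time after their workloads have moved to a peer. This is a conceptual sketch only, not the VCF Operations API; the `Host` class, the check logic, and `rolling_patch` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    healthy: bool = True
    version: str = "old"
    workloads: list[str] = field(default_factory=list)


def precheck(hosts: list[Host]) -> list[str]:
    """Return a list of blocking findings; an empty list means safe to start."""
    findings = [f"{h.name}: unhealthy" for h in hosts if not h.healthy]
    if len(hosts) < 2:
        findings.append("need at least two hosts to evacuate workloads")
    return findings


def rolling_patch(hosts: list[Host], target: str) -> list[str]:
    """Patch one host at a time, moving its workloads to a peer first."""
    findings = precheck(hosts)
    if findings:
        raise RuntimeError("precheck failed: " + "; ".join(findings))
    log = []
    for i, host in enumerate(hosts):
        peer = hosts[(i + 1) % len(hosts)]
        # Stand-in for live migration: workloads keep running on the peer.
        peer.workloads.extend(host.workloads)
        host.workloads.clear()
        # The host is out of service only for this single step.
        host.version = target
        log.append(f"{host.name} patched to {target}")
    return log


cluster = [Host("esx-01", workloads=["vm-a"]), Host("esx-02", workloads=["vm-b"])]
for line in rolling_patch(cluster, "new"):
    print(line)
```

The design point is the ordering: a failed precheck raises before any host is touched, so a bad starting state never turns into a half-patched cluster.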

    Root Causes of Operational Failure at Scale

    • Operational inconsistency across teams and workload types
    • Upgrade risk from unvalidated cross-stack dependencies
    • Cross-stack debugging with no authoritative owner

    An integrated platform directly addresses all three. For regulated industries — financial services, government, healthcare — the ability to demonstrate a coherent, auditable, single-vendor patch history across the entire stack is not an operational preference. It is a compliance requirement.


    🔷 7. Why This Matters Even More for AI + Kubernetes

    For years, the integration argument was primarily an operational efficiency argument. AI changes the calculus entirely.

    GPU-accelerated AI training and inference workloads have characteristics that stress every boundary in a fragmented stack:

    • NUMA-aware scheduling must be consistent from the hypervisor layer through the container orchestrator. A mismatch breaks CPU–GPU affinity, and you leave 20–30% of GPU performance on the floor.
    • High-bandwidth east-west traffic between GPU nodes demands network policy enforcement without the overhead of a separately managed overlay.
    • Shared GPU pools serving both VM-based inference endpoints and Kubernetes training jobs require a scheduler that understands both resource models — which is precisely what VKS on VCF Supervisor delivers.
    • Observability continuity from vSphere metrics through VCF Operations to the Kubernetes layer means you can correlate a GPU memory spike in a training pod with the underlying ESXi host’s thermal profile, without stitching logs from three separate products.

    Assembled Model

    • Each new capability — GPU workloads, multi-tenant K8s, high-perf storage — becomes a new integration point and a new failure domain

    Integrated Model (VCF)

    • Each new capability is part of the same system — inherited policy, lifecycle, and observability included on day one
    🚀

    Faster deployment. Lower risk. Consistent operations across VM, container, and AI workloads.


    🔶 8. The Platform Multiplier Effect

    Here is the compounding argument that does not get made enough: integration creates a multiplier effect on every new capability you deploy.

    When VKS lands in a VCF environment, it does not arrive as an isolated Kubernetes cluster. It inherits NSX micro-segmentation, vSAN SPBM storage policies, vSphere HA and DRS scheduling intelligence, and VCF Operations observability — on day one, without custom integration work. A standalone Kubernetes distribution requires weeks of effort to reach equivalent operational parity with the surrounding infrastructure.

    The same logic applies to NVIDIA AI Enterprise on VCF, to VCF Automation (VCFA) for self-service provisioning, and to every future capability Broadcom ships as part of the platform. Each addition is additive — not additive-plus-integration-project.

    Over a five-year horizon, this multiplier is where integrated platforms generate the most measurable TCO advantage.


    🔷 9. When Integration Is the Wrong Answer

    Intellectual honesty requires acknowledging this: integrated platforms are not universally the right answer.

    ⚖️

    Be Honest With Your Context

    VCF is optimised for organisations running mixed VM and container workloads at scale, in regulated or sovereign environments, where operational consistency and single-vendor accountability matter. If that profile does not match yours, acknowledge it.

    • If your organisation has a dominant public cloud strategy and on-premises infrastructure is genuinely residual, VCF’s operational depth may not be justified at small scale
    • If you have deep in-house expertise in specific open-source components and the engineering capacity to maintain integration glue, DIY can work — and can be cheaper at certain scales
    • If your primary requirement is developer-facing Kubernetes with no legacy VM estate, a lighter-weight distribution may be sufficient

    Your architecture should match your actual operational context — not a vendor’s reference diagram.


    🔶 10. Verdict

    The goal is not to build infrastructure. The goal is to run applications — reliably and at scale.

    🎯 Three Principles That Hold

    • Integration matters more than optionality
    • Consistency matters more than customization
    • Operational simplicity matters more than theoretical flexibility

    VMware Cloud Foundation represents this integrated approach — delivering a platform designed to run everything, not just host it. The components beneath — ESXi, vSAN ESA, NSX — are best-in-class. But the durable value is VCF Ops Manager, Supervisor Namespaces, and the unified policy plane that ties them together. That is the investment that compounds.

    🔥 Final Thought

    Enterprises don’t fail because they lack choice. They fail because they underestimate complexity. The right platform is the one that removes that complexity — not the one that distributes it. As infrastructure demands continue to grow — driven by AI workloads, sovereign mandates, and the accelerating pace of platform feature delivery — the organisations that have invested in integrated foundations will absorb that complexity without proportionally growing their operations teams. That is why the integration debt nobody budgets for is also the one that VCF was built to eliminate.

    Further reading on vmtechie.blog: VCF Fleet Sizer Tool · VCF Upgrade Path Planner

  • Why VCF with VKS is a Stronger Enterprise Choice Than KubeVirt


    KubeVirt is a capable open-source project and a legitimate choice in the right context. But when the workload is enterprise AI at scale — GPU clusters, production AI factories, regulated environments — the gap between VKS with VCF and KubeVirt is not a minor preference. It spans architecture, operations, governance, and enterprise transformation strategy.

    PREMISE Let’s Be Honest About KubeVirt First

    A technically credible argument never starts by dismissing the competition. KubeVirt is a real, production-used project with genuine strengths. Let’s acknowledge them honestly before making the VKS case.

    Where KubeVirt genuinely wins: Cloud-native purists wanting a single Kubernetes control plane for everything. Cost-sensitive environments where ESXi licensing is a barrier. Dev/test scenarios where VM-grade isolation isn’t critical. Upstream OSS communities wanting full control over the stack. Teams with deep Kubernetes operational maturity who want to manage VMs and containers through a unified API.

    If your organisation is already 100% Kubernetes-native with no enterprise VM workloads or compliance requirements, KubeVirt is a reasonable choice. That’s the honest truth. This is not a case of good vs bad — it is a case of enterprise integration vs architectural freedom.

    But here’s the equally honest truth: for enterprise AI infrastructure — GPU clusters, DGX/HGX environments, production AI factories, regulated tenancy — VKS with VCF tends to hold a stronger position across most architectural and operational dimensions that matter to enterprise teams. Here’s the case, dimension by dimension.

    00 The Core Difference: Integrated Platform vs Extension Model

    Before diving into technical specifics, it’s worth understanding the conceptual gap — because it explains every practical difference that follows.

    With VKS, Kubernetes is delivered as a built-in service on top of the VMware infrastructure stack. It is tightly integrated with vSphere, storage, networking, policy, and lifecycle management. It is designed as part of the platform — not added to it.

    With KubeVirt, virtualisation is added into Kubernetes as an extension. It is an innovative approach, but it still means you are effectively layering VM functionality into an environment originally built for containers. In practice, VKS gives enterprises a unified operating model. KubeVirt often introduces more integration points, more dependencies, and more operational responsibility.

    The directional difference: KubeVirt extends Kubernetes to run VMs. VKS extends a mature enterprise virtualisation platform to run Kubernetes properly. In production, that direction matters more than it appears on a whiteboard.

    01 Hypervisor Architecture — Purpose-Built vs Added On

    The most fundamental difference is architectural. KubeVirt layers VM capability onto a system designed for containers. VKS extends a hypervisor designed from day one to run workloads with hardware-level isolation.

    KubeVirt Stack
    Application / AI Workload
    QEMU/KVM Process
    Container (Pod)
    Kubernetes Node
    Linux Kernel
    Hardware
    VKS with VCF Stack
    Application / AI Workload
    Container / Kubernetes Pod
    VM (vSphere Supervisor)
    ESXi Microkernel (Type-1)
    Hardware

    ESXi is a Type-1 bare-metal hypervisor — it runs directly on hardware with a microkernel architecture under 150MB in size. It was designed to do one thing exceptionally well: run workloads with deterministic performance and hardware isolation. VMs and containers on VKS are both first-class constructs — not one emulating the other.

    The analogy: KubeVirt is running a city inside a shipping container. VKS is building a city on actual land. Each abstraction layer in KubeVirt compounds — adding latency, scheduling complexity, and failure domains that are less pronounced in a purpose-built hypervisor model.

    02 GPU & AI Workload Performance — The Widest Gap

    This is the dimension that matters most for anyone building NVIDIA AI infrastructure. The gap here is not marginal — it is architectural.

    KubeVirt GPU Reality

    GPU passthrough to VMs via KubeVirt requires VFIO/IOMMU — complex to configure, brittle in production, and requiring deep Linux kernel expertise. More critically:

    • No native MIG (Multi-Instance GPU) awareness — partitioning must be configured externally
    • GPU sharing across VMs and containers in the same cluster is operationally complex
    • No current equivalent of NVIDIA vGPU time-slicing with hardware-enforced QoS guarantees
    • The KubeVirt device plugin model does not yet integrate cleanly with MIG partition profiles

    VKS with VCF with NVIDIA AI Enterprise

    This is the explicitly certified, supported path for enterprise NVIDIA GPU deployments:

    • NVIDIA vGPU natively supported on ESXi — VMs get dedicated vGPU profiles (A100-40C, H100-80C) with hardware-enforced QoS [1]
    • MIG partitioning integrates cleanly — a single H100 can serve multiple Kubernetes pods and VMs simultaneously with hard partition isolation [2]
    • NVIDIA GPU Operator supports vSphere Supervisor as a validated deployment target
    • NVIDIA AI Enterprise is explicitly certified on vSphere — the recommended enterprise path for DGX/HGX production deployments [3]
    # VKS: GPU resource request (clean, native)
    resources:
      limits:
        nvidia.com/gpu: 1
        # vGPU profile enforced at hypervisor level
        # MIG partitioning transparent to workload
        # QoS guaranteed by ESXi scheduler
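For context, here is how that fragment sits inside a complete Pod manifest, built as a plain Python dictionary so it can be dumped to JSON (which `kubectl apply -f -` accepts). `nvidia.com/gpu` is the extended resource name exposed by the NVIDIA device plugin; the pod name and container image are placeholders, and nothing in this sketch is specific to VKS.

```python
import json


def gpu_pod(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests whole GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Extended resources are set under limits; Kubernetes
                    # treats the request as equal to the limit.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Placeholder name and image: substitute your own.
manifest = gpu_pod("train-job", "registry.example.com/trainer:latest")
print(json.dumps(manifest, indent=2))
```

Whether that single GPU is a full device, a vGPU profile, or a MIG slice is decided below the manifest, which is the point the snippet above is making.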

    03 Security & Isolation — 20 Years vs 5 Years

    Security is where enterprise architects lose sleep — and where VKS has the most compelling, battle-tested story.

    KubeVirt’s Security Model

    VM isolation in KubeVirt depends on the container runtime security boundary plus QEMU process isolation. A compromised container runtime (containerd, runc vulnerability) can potentially affect the QEMU process hosting the VM. Nested virtualisation increases the kernel attack surface. RBAC for VM operations is layered onto Kubernetes RBAC — not purpose-built for multi-tenant VM isolation.

    VKS + NSX Security Model

    ESXi’s VMX process isolation is 20+ years hardened. Each VM is fully isolated at the hypervisor level regardless of what happens in the container layer above. Beyond that:

    • NSX Distributed Firewall (DFW) applies microsegmentation at the vNIC level — every Kubernetes pod can have firewall policy enforced at the hypervisor, not just the overlay network [4]
    • vSphere Trust Authority and TPM integration provide cryptographic attestation of host state before VMs are allowed to run — KubeVirt currently has no comparable integrated mechanism
    • Regulatory compliance (PCI-DSS, HIPAA, SOC2) control mapping for vSphere is well-established and widely audited; equivalent mappings for KubeVirt environments are still maturing
    • ESXi security patches are coordinated and tested against the full vSphere stack — KubeVirt kernel updates require independent validation across the QEMU/KVM/container runtime chain

    04 Day-2 Operations — Where the Pain Is

    Every infrastructure architect knows that Day-1 deployment is 10% of the story. Day-2 operations (patching, upgrades, live migration, monitoring) are where you live for the next 3–5 years.

    VCF / VKS Capability | KubeVirt Equivalent
    vMotion: zero-downtime live migration | Basic VM migration (no storage vMotion)
    VCF Lifecycle Manager: full stack upgrade | Manual Kubernetes + KubeVirt operator coordination
    VCF Operations: unified VM + container observability | Separate toolchains (Prometheus + custom exporters)
    VKS K8s upgrades decoupled from vCenter lifecycle | K8s + KubeVirt operator + host OS must be co-validated
    vSphere Update Manager: coordinated patching | DIY patching across kernel, QEMU, CRI, CNI layers
    SPBM: storage QoS policy across VMs + PVCs | CSI only, no differentiated storage QoS

    VCF Lifecycle Manager manages the entire stack — ESXi, vCenter, NSX, vSAN, and Kubernetes cluster versions — in a single coordinated upgrade workflow. In KubeVirt environments, version skew between the Kubernetes release, KubeVirt operator version, QEMU version, and the host kernel is a recurring operational hazard that requires dedicated engineering effort to manage safely.
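The version-skew hazard can be made concrete with a small compatibility check. The matrix below is entirely hypothetical (invented version numbers, not the real KubeVirt support matrix); the point is only that every layer added to the tuple is one more thing that has to match a combination somebody has actually validated.

```python
from typing import NamedTuple


class Stack(NamedTuple):
    kubernetes: str
    kubevirt: str
    qemu: str
    kernel: str


# Hypothetical validated combinations. Real values must come from the
# projects' own compatibility documentation.
VALIDATED = {
    Stack("1.29", "1.2", "8.2", "6.1"),
    Stack("1.30", "1.3", "9.0", "6.6"),
}


def skew_report(proposed: Stack) -> list[str]:
    """List the fields that differ from the closest validated combination."""
    if proposed in VALIDATED:
        return []

    def mismatches(ref: Stack) -> list[str]:
        return [f for f in Stack._fields if getattr(ref, f) != getattr(proposed, f)]

    return min((mismatches(ref) for ref in VALIDATED), key=len)


# Bumping only Kubernetes produces a combination nobody has validated.
print(skew_report(Stack("1.30", "1.2", "8.2", "6.1")))
```

An empty report means the proposed combination is in the validated set; anything else names the layers that still need to move or be tested.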

    One of the most underappreciated advantages of VKS is that Kubernetes cluster upgrades are fully decoupled from vCenter upgrades. In practice, this means platform teams can roll out new Kubernetes versions — moving from 1.28 to 1.29 to 1.30 — independently, without waiting for a vCenter maintenance window or coordinating with the infrastructure team managing the underlying SDDC. Each Tanzu Kubernetes cluster has its own lifecycle, managed via the Supervisor and VCF LCM, with no hard dependency on the vCenter version for day-to-day Kubernetes updates. Compare this to KubeVirt, where the Kubernetes control plane, KubeVirt operator, and host OS are all tightly coupled — a Kubernetes minor version upgrade requires validating compatibility across all three layers simultaneously. For enterprises running multiple Kubernetes clusters across workload domains, VKS’s decoupled upgrade model is a significant operational advantage.
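The 1.28 to 1.29 to 1.30 sequence reflects a general Kubernetes rule: control-plane upgrades move one minor version at a time. A small helper (hypothetical, not part of any VKS tooling) that expands a current and target version into the required hops might look like this:

```python
def minor_upgrade_hops(current: str, target: str) -> list[str]:
    """Expand 'major.minor' versions into sequential one-minor-step hops."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("major version changes are out of scope for this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not supported")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


print(minor_upgrade_hops("1.28", "1.30"))
```

Each hop is a separate rollout per cluster, which is why being able to schedule those hops without a vCenter maintenance window matters at fleet scale.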

    05 Networking — NSX vs CNI Complexity

    Networking for AI workloads is not just about connectivity — it’s about bandwidth, latency, topology awareness, and security policy across a mixed VM and container estate.

    KubeVirt Networking Complexity

    VM network interfaces in KubeVirt are exposed as secondary interfaces via Multus — requiring careful co-ordination between multiple CNI plugins. SR-IOV for VM workloads requires manual IOMMU/VF configuration per node. There is no unified microsegmentation plane between VMs and pods — policy must be applied at multiple layers independently.

    VKS + NSX — Unified Fabric

    NSX provides a single network fabric for both VMs and Kubernetes pods. The same DFW policy engine applies to both. NSX Advanced Load Balancer (Avi) handles Kubernetes ingress and LoadBalancer services natively with full traffic visibility across both VM and container workloads. Critically for AI infrastructure, the Geneve overlay supports hardware offload to SmartNICs, including BlueField DPUs, which aligns directly with NVIDIA’s AI factory reference architecture.

    06 Enterprise Transformation Reality — The Mixed Workload Problem

    Most enterprise modernisation conversations get derailed by a false premise: that organisations are either “all VMs” or “all containers.” The reality, in virtually every large enterprise, is a persistent mix that will not resolve cleanly for years.

    A typical enterprise estate in 2026 includes: traditional VM-based business applications, modern microservices and cloud-native workloads, packaged enterprise software with no container-native path, data platforms and stateful databases, and security or compliance-sensitive workloads requiring strict isolation guarantees. VKS is designed for this hybrid reality. It does not force everything into a Kubernetes-first abstraction before the organisation is ready for it.

    The modernisation argument: VKS allows organisations to modernise without forcing them to abandon the operational model they already trust. Infrastructure teams keep using the VMware foundation they know — while platform teams gain access to Kubernetes in a way that feels native to the environment. That makes transformation more realistic, not just more aspirational.

    Operational Risk — The Questions That Matter

    When enterprises evaluate platforms, they often focus too much on feature checklists and not enough on operational risk. The real questions are not just “Can this run VMs and containers?” They are:

    • How hard is it to support at 2am when something breaks?
    • How predictable are upgrades across the full stack?
    • How many teams need to coordinate for a routine patch?
    • How many integration gaps need to be owned and maintained internally?
    • How fast can issues be isolated and root-caused in a mixed VM/container environment?

    VKS reduces this risk because the platform is more cohesive — fewer seams between layers, fewer teams needed, fewer custom integrations to maintain. KubeVirt can be very attractive architecturally, but it assumes a higher level of Kubernetes operational maturity and a stronger tolerance for platform engineering complexity that most enterprise IT organisations do not have the staffing to sustain.

    07 Governance & Private Cloud Readiness

    For regulated industries, sovereign cloud environments, and enterprise private clouds, governance matters just as much as technology capability. Organisations need consistent policy, security boundaries, visibility, and controlled operations. They need to know who owns what, how workloads are deployed, and how infrastructure changes are managed.

    This is where VMware’s enterprise DNA shows. VKS fits naturally into environments that require structure, compliance, and clear operational accountability:

    • Role-based access control unified across VMs, Kubernetes namespaces, and vSphere objects — one policy model, not two
    • Audit trails from vCenter and NSX cover both VM and container operations in a single log stream [5]
    • Change management integration — VCF’s API surface maps cleanly to ITSM platforms (ServiceNow, Jira Service Management)
    • Sovereign cloud readiness — vSphere’s tenancy model and encryption capabilities are mapped to GDPR, data residency, and sovereign cloud frameworks across APAC, EU, and regulated US sectors

    KubeVirt can absolutely be used in serious environments — but it is more often the right fit for organisations that want deeper open-source flexibility and are comfortable owning more of the platform decisions themselves. For most enterprise private clouds, that is not a trade-off they are willing to make.

    08 Head-to-Head Summary

    Dimension | VKS with VCF | KubeVirt
    Platform Model | ✅ Integrated: Kubernetes is native to the stack | ⚠️ Extension model: VMs added onto Kubernetes
    GPU / AI Workloads | ✅ vGPU, MIG, NVIDIA AI Enterprise certified | ⚠️ VFIO passthrough, limited MIG integration
    Security Isolation | ✅ 20+ yr hardened VMX, NSX microsegmentation | ⚠️ QEMU-in-container, larger attack surface
    Live Migration | ✅ vMotion: zero-downtime, storage + compute | ⚠️ Functional but no storage vMotion equivalent
    Lifecycle Management | ✅ VCF LCM unified + K8s upgrades decoupled from vCenter | ❌ K8s, KubeVirt operator & host OS must be co-validated
    Networking | ✅ NSX unified VM + container fabric + DPU offload | ⚠️ Multus + multi-CNI complexity
    Storage QoS | ✅ SPBM across VMs + PVCs, vSAN ESA | ⚠️ CSI only, no differentiated QoS
    Mixed Workload Support | ✅ Native: VMs and containers are co-equals | ⚠️ Container-first; VMs require abstraction overhead
    Governance & Compliance | ✅ Unified RBAC, audit, PCI/HIPAA/SOC2 controls | ⚠️ Immature compliance tooling, separate audit streams
    Operational Risk | ✅ Cohesive platform, fewer integration gaps | ❌ Higher ownership burden, more seams to maintain
    Observability | ✅ Unified VM + container via VCF Operations | ⚠️ Separate toolchains required
    NVIDIA Certification Path | ✅ Explicit NCP-AII / NVIDIA AI Enterprise support | ❌ Not part of NVIDIA enterprise certification stack
    Cost (Licensing) | ⚠️ VCF licensing required | ✅ Open source, no hypervisor licensing
    // The Directional Argument
    KubeVirt makes Kubernetes run VMs.
    VKS makes a production-hardened hypervisor run Kubernetes.

    When the workload is enterprise AI at scale, the foundation matters more than the interface. Choose your substrate based on the operational reality you’ll live with for the next five years.

    CLOSING The Right Tool for the Right Job

    KubeVirt will continue to evolve. The upstream community is active, and features like live migration and GPU support are maturing. For greenfield cloud-native organisations without legacy VM estates or strict compliance requirements, it deserves serious evaluation.

    Where KubeVirt is the better fit: If your organisation is already deeply Kubernetes-native, your team has strong platform engineering capability, you want to avoid hypervisor licensing costs, and you are comfortable owning more of the integration decisions — KubeVirt is a legitimate and architecturally coherent choice. Open-source flexibility and a Kubernetes-first operating model are real advantages in the right context.

    But for enterprise organisations running AI workloads on NVIDIA DGX/HGX infrastructure, managing regulated environments, and needing proven lifecycle tooling across a mixed VM and container estate, VKS with VCF offers a more mature, better-integrated, and lower-risk path. It is the architecture that has been most thoroughly validated for this use case in production enterprise environments.

    The question was never “containers vs VMs.” The question is: what platform will reduce operational complexity rather than relocate it?

    My view: VKS is the stronger enterprise choice. Not because KubeVirt lacks innovation. Not because Kubernetes is weak. But because VKS is aligned with enterprise operational reality — and in production, that alignment is what separates an exciting architecture from a platform you can actually sustain.

    KubeVirt moves complexity from the hypervisor layer into your Kubernetes operations team. VKS distributes it across a tested, integrated platform with decades of enterprise hardening. For most organisations, that trade-off has a clear answer.

    And in enterprise IT, that is often what separates an exciting architecture from a successful platform.

  • Planning a VMware Cloud Foundation 9.0 Upgrade? Start Here…

    vmtechie.blog · Infrastructure Tools

    I Built a VCF Upgrade Path Planner: Here’s Why

    Tool: VCF Upgrade Path Planner · Covers: 8 upgrade paths · Target: VCF 9.0 / 9.0.2


    Example — vSphere 7.0 → VCF 9.0 upgrade journey


    All 8 Upgrade Paths Covered
    §

    Why I Built This

    If you’ve ever had to plan a VMware Cloud Foundation upgrade from scratch, you know how scattered the information can be. KB articles here, TechDocs pages there, blog posts from different release cycles, and no single place that ties it all together into a clear, ordered sequence. That frustration is exactly what drove me to build the VCF Upgrade Path Planner. As someone who works with VCF environments day-to-day and runs vmtechie.blog to share practical infrastructure knowledge with the community, I wanted to create something that gives engineers a solid starting point before they walk into a maintenance window — a tool that reflects real-world upgrade sequencing, not just the high-level marketing overview.

    This planner covers eight upgrade paths spanning vSphere 7.0, 7.0 U2/U3, 8.0, and 8.0 U2/U3 converge routes to VCF 9.0, the VCF 5.0 and 5.1/5.2 in-place upgrade paths, the 9.0.0/9.0.1 to 9.0.2 maintenance path, and a current-state check for VCF 9.0.2 — all linked directly to official Broadcom Knowledge Base articles, TechDocs pages, and VMware blog posts so you can verify everything against authoritative source material. A significant amount of research, testing, iteration, and community review has gone into getting the sequencing, version gates, and critical warnings right. That said, VCF is a complex and fast-moving platform, and I’m one person — so if you spot a step that’s missing, a version gate that’s wrong, or guidance that doesn’t match your experience in the field, please reach out and let me know. Every piece of feedback makes this tool better for everyone in the community.

    🔗

    Everything is sourced

    Every step links directly to the relevant Broadcom KB, TechDocs page, or VMware blog post so you can verify each recommendation against authoritative source material before acting on it.

    ⚠️

    Critical gates are flagged

    Version gates, one-way doors, and ordering requirements — like the Aria Operations 8.18 gate, the NSX Edge OVF certificate expiry fix in 9.0.2, and the mandatory vLCM Baseline-to-Image transition — are surfaced prominently, not buried in footnotes.

    §

    How We Calculate Time, Risk & Effort

    The complexity numbers shown in each upgrade path — estimated duration, risk score, and effort score — are not pulled from a vendor SLA document. They are practical estimates built from field experience with VCF environments of varying sizes and community input from engineers who have executed these upgrades in production. Here is how each metric is derived.

    Example scorecard: Duration 4–8 weeks (estimated) · Risk Score 50 out of 100 · Effort Score 65 out of 100

    Duration

    Estimated based on the number of sequential phases in the path, the number of components that require ordered upgrades (SDDC Manager → NSX → vCenter → ESXi is always serial, never parallel), and the realistic time each component upgrade takes in a mid-sized environment. Converge paths from vSphere carry additional time for pre-converge remediation, vLCM Baseline-to-Image transitions, and the VCF Installer workflow itself. Paths starting from VCF 5.0 carry extra time for the mandatory VCF 5.2 intermediate hop. These are conservative estimates — your actual duration will vary based on node count, hardware speed, precheck findings, change management windows, and whether you are running a lab or a production fleet.

    💡

    What is RDU (Reduced Downtime Upgrade)?

    Starting with VCF 9.0, vCenter upgrades exclusively use Reduced Downtime Upgrade (RDU). Instead of upgrading in-place and taking the existing vCenter offline for the full duration, RDU deploys a brand-new temporary vCenter appliance alongside the existing one, migrates all configuration and inventory data across while the environment stays running, then decommissions the old appliance. The result is a much shorter management plane outage — typically just a few minutes for the final cutover rather than the extended downtime of a traditional in-place upgrade. In VCF 9.0.1+, the Installer automatically assigns a 169.254.x.x link-local IP address for the temporary appliance, so you no longer need to pre-stage a static IP on your management network in most environments. RDU is only required for major version jumps (e.g. 8.x → 9.x) — within-9.x maintenance updates use a regular in-place upgrade with no temporary appliance needed.

    Risk Score

    A relative measure from 0 to 100 that reflects how many irreversible transitions the path contains, how many components must be upgraded in strict sequence, and how much room there is to safely roll back if something goes wrong. A vSphere 7.0 converge path scores higher risk not because converge is inherently dangerous, but because it involves more one-way doors — once the VCF Installer runs and creates the management domain, you cannot unconverge back to standalone vSphere. Maintenance paths like 9.0.0 to 9.0.2 score low risk because they involve fewer components, shorter windows, and well-understood rollback via snapshot.

    Effort Score

    Reflects the total planning and execution workload — number of discrete steps, number of decisions that require engineer judgment rather than automation, number of separate maintenance windows required, and the degree of documentation and preparation needed before you can safely begin. A vSphere 7.0 to VCF 9.0 path scores high effort not because any single step is especially hard, but because the cumulative preparation — HCL checks, Baseline-to-Image transitions, ELM removal, VCF Installer staging, Aria Suite pre-work, workload domain imports — adds up to a substantial project even before the first upgrade window opens.

    ⏱️
    Duration Factors
    • Sequential component count
    • Intermediate hops required
    • Pre-converge remediation
    • Workload domain count
    • Aria Suite pre-work
    🎯
    Risk Factors
    • One-way door transitions
    • Rollback constraints
    • NSX version direction rules
    • vCenter RDU complexity
    • ELM removal requirements
    🏗️
    Effort Factors
    • Total discrete steps
    • Judgment calls required
    • Separate change windows
    • Documentation prep
    • Depot configuration work

    All three scores scale relative to each other across the eight paths, so they are most useful as a comparison tool. If you are deciding between targeting VCF 9.0.0 or 9.0.1, or choosing whether to converge from your current vSphere 8.0 update level versus patching to U3 first, the scores give you a quick read on the relative complexity trade-off. They are starting points for your own planning conversation, not guarantees. Always validate your specific environment against official Broadcom documentation and run the SDDC Manager upgrade prechecks before committing to a maintenance window.

    §

    A Community Tool

    VCF is a complex and fast-moving platform, and I’m one person. A significant amount of hard work has gone into building and refining this planner: cross-referencing every step against official Broadcom documentation, KB articles, and VMware engineering blog posts, running it through multiple review cycles, and iterating on the content based on community feedback. But if you spot a step that’s missing, a version gate that’s wrong, or guidance that doesn’t match your experience in the field, please reach out and let me know. Drop a comment below or contact me directly; every piece of feedback makes this tool better for everyone in the community.

    🚀

    Try the VCF Upgrade Path Planner

    Open the tool directly on vmtechie.blog and generate your tailored upgrade plan in seconds.

    Open the Planner →

  • How the VCF 9 Fleet Sizer Actually Works

    A complete walkthrough of every calculation behind the tool — from raw NVMe capacity to ESA protection factors, NVMe memory tiering, and VCF licence entitlement. No black boxes.


    Table of Contents

    1. What the tool sizes
    2. Host specification inputs
    3. Management VM stack
    4. Compute sizing formula
    5. vSAN ESA storage pipeline
    6. Protection policies & PF table
    7. Final host count & limiter
    8. NVMe memory tiering
    9. External storage mode
    10. VCF licence entitlement
    11. Principal storage options (KB 416270)
    12. Assumptions & caveats

    1. What the tool sizes

    The VCF 9 Fleet Sizer calculates the minimum number of ESXi hosts required across a VMware Cloud Foundation deployment — one Management Domain and any number of VI Workload Domains. For each domain it independently determines whether CPU, memory, or storage is the binding constraint, and returns the host count driven by the most demanding dimension.

    The sizer is built specifically for VCF 9 with vSAN ESA — the Express Storage Architecture that requires NVMe-only drives and operates as a single storage tier without a separate cache/capacity split. It also models external storage mode (Fibre Channel, NFS) where hosts are sized on compute and memory only, and a disaggregated NVMe memory tiering model unique to VCF 9.

    ⚠️ Planning aid only — not an official Broadcom tool. All outputs are estimates based on the inputs you provide. Validate every design against official Broadcom documentation, the VMware HCL, and field engineering guidance before procurement or deployment. Real-world DRR and vSAN overheads vary significantly by workload.


    2. Host specification inputs

    Every domain (management and each WLD) has an independent host specification. The tool does not assume all hosts are identical across domains — a management cluster might run 2×16c hosts while a production WLD uses 2×32c AI-optimised nodes.

    | Input | Default | Used in | Notes |
    | --- | --- | --- | --- |
    | CPU Qty | 2 | Core count, licensing | Sockets per host |
    | Cores per CPU | 16 | Core count, licensing | Physical cores; no hyperthreading multiplier applied |
    | RAM (GB) | 1,024 | Memory sizing | Total usable host RAM |
    | NVMe Qty | 6 | Storage sizing | NVMe drives per host (vSAN ESA only) |
    | NVMe Size (TB) | 7.68 | Storage sizing | TB decimal; converted to GB via ×1,000 |
    | CPU Oversubscription |  | Usable vCPU | vCPU:pCPU ratio; applies before reserve |
    | RAM Oversubscription |  | Usable RAM | 1× = no oversubscription. Rarely exceed 1× for RAM |
    | Compute Reserve % | 30% | Usable vCPU & RAM | Headroom withheld from placement (HA, overhead) |

    Raw capacity per host formulas:

    Host Cores = CPU Qty × Cores per CPU
    Raw GB per Host = NVMe Qty × NVMe Size (TB) × 1,000

    ⚠️ No hyperthreading multiplier. The sizer deliberately does not multiply physical cores by 2 for hyperthreading. Logical thread counts are workload-specific and highly variable. Instead, the CPU oversubscription ratio gives you explicit control. A 2× ratio on a 32-core host models the same headroom as a 64-thread count at 1× — but you’re aware you’re making that choice.


    3. Management VM stack

    The Management Domain hosts a fixed stack of VCF infrastructure VMs. These are not user workloads — they are the control plane. Their combined vCPU, RAM, and disk demand is the entire sizing input for the management cluster. The tool carries an accurate per-component VM stack based on current VCF 9 T-shirt sizes from Broadcom documentation.

    | Component | Sizes | vCPU range | RAM range | Disk range |
    | --- | --- | --- | --- | --- |
    | vCenter Server (Mgmt) | S / M / L / XL | 4 – 24 | 21 – 58 GB | 694 – 2,283 GB |
    | NSX Manager | M / L / XL | 6 – 24 | 24 – 96 GB | 300 – 400 GB |
    | NSX Edge | S / M / L / XL | 2 – 16 | 4 – 64 GB | 200 GB |
    | NSX Global Manager | S / M / L / XL | 4 – 24 | 16 – 96 GB | 300 – 400 GB |
    | Avi Load Balancer | S / M / L | 8 – 24 | 24 – 48 GB | 128 – 512 GB |
    | vCenter Server (WLD) | S / M / L / XL | 4 – 24 | 21 – 58 GB | 694 – 2,283 GB |
    | VCF Operations (SDDC Mgr) | S / M / L / XL | 4 – 24 | 16 – 128 GB | 274 GB |
    | VCF Operations Collector | S / M | 2 – 4 | 8 – 32 GB | 144 GB |
    | VCF Operations for Logs | S / M / L | 12 – 48 | 24 – 96 GB | 1,590 GB |
    | VCF Operations for Networks | L / XL / XXL | 12 – 48 | 24 – 96 GB | 1,590 GB |
    | VCF Net. Collector | M / L / XL / XXL | 4 – 16 | 12 – 48 GB | 200 – 300 GB |
    | Identity Manager | Embedded / HA | 0 – 32 | 0 – 64 GB | 0 – 400 GB |

    Management sizing is deterministic: configure your component sizes, and the tool sums the total vCPU, RAM, and disk demand — no workload VM estimates needed.
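
    As a rough illustration of that summation, the sketch below totals a small hand-picked stack. The per-VM figures are placeholders taken from the low end of each range in the table above (assuming, without confirmation, that the low end corresponds to the smallest size), and the component selection and instance counts are hypothetical rather than the tool's actual data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MgmtVM:
    name: str
    vcpu: int
    ram_gb: int
    disk_gb: int
    count: int = 1  # number of instances deployed (hypothetical)


def sum_stack(vms):
    """Return total (vCPU, RAM GB, disk GB) demand for the selected stack."""
    vcpu = sum(vm.vcpu * vm.count for vm in vms)
    ram = sum(vm.ram_gb * vm.count for vm in vms)
    disk = sum(vm.disk_gb * vm.count for vm in vms)
    return vcpu, ram, disk


# Placeholder selection: low-end figures from the table, made-up counts.
stack = [
    MgmtVM("vCenter Server (Mgmt)", 4, 21, 694),
    MgmtVM("NSX Manager", 6, 24, 300, count=3),
    MgmtVM("NSX Edge", 2, 4, 200, count=2),
]

total_vcpu, total_ram_gb, total_disk_gb = sum_stack(stack)
print(total_vcpu, total_ram_gb, total_disk_gb)
```

    The three totals are then sized against the management host specification in the same way as workload demand.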


    4. Compute sizing formula

    For Workload Domains, tenant demand is specified as VM count × per-VM averages for vCPU, RAM, and disk. Infrastructure VMs (NSX Edges, VKS Supervisor nodes) can optionally be included in the WLD demand totals. All demands are then sized against the host specification to determine the compute host floor.

    WLD demand totals:

    Demand vCPU = (VMs × vCPU/VM) + Infra vCPU
    Demand RAM = (VMs × RAM/VM) + Infra RAM
    Demand Disk = (VMs × Disk/VM) + Infra Disk

    Usable capacity per host:

    Usable vCPU/host = Host Cores × CPU Oversub × (1 − Reserve%)
    Usable RAM/host = Host RAM × RAM Oversub × (1 − Reserve%)

    Compute host floors (evaluated independently):

    CPU Hosts = ⌈ Demand vCPU / Usable vCPU per host ⌉
    RAM Hosts = ⌈ Demand RAM / Usable RAM per host ⌉

    Example: 200 VMs × 4 vCPU = 800 vCPU demand. Host: 2×16c = 32 physical cores × 2× oversub × 0.70 reserve factor = 44.8 usable vCPU/host. CPU Hosts = ⌈ 800 / 44.8 ⌉ = 18 hosts.
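
    The same arithmetic can be sketched in a few lines of Python. The function names are illustrative, not the tool's code, and the 16 GB per VM used for the RAM side is a made-up figure added only to show the second floor.

```python
import math


def usable_per_host(raw, oversub, reserve_pct):
    """Usable capacity = raw capacity x oversubscription x (1 - reserve)."""
    return raw * oversub * (1 - reserve_pct / 100)


def host_floor(demand, usable):
    """Smallest whole number of hosts whose usable capacity covers demand."""
    return math.ceil(demand / usable)


host_cores = 2 * 16  # CPU Qty x Cores per CPU
usable_vcpu = usable_per_host(host_cores, oversub=2.0, reserve_pct=30)
cpu_hosts = host_floor(200 * 4, usable_vcpu)  # 800 vCPU demand

# RAM side: 16 GB per VM is a hypothetical value for illustration.
usable_ram = usable_per_host(1024, oversub=1.0, reserve_pct=30)
ram_hosts = host_floor(200 * 16, usable_ram)

print(cpu_hosts, ram_hosts)
```

    With these inputs the CPU floor matches the 18 hosts worked out above, while the RAM floor is much lower, so CPU would be the binding compute dimension.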


    5. vSAN ESA storage pipeline

    vSAN ESA storage sizing is a sequential pipeline of capacity transformations. Each stage adds overhead for a specific reason. Starting from raw VM disk demand, the pipeline applies data reduction, swap space, protection overhead, free space reserve, and growth buffer — in that order — to arrive at the total raw capacity required and therefore the storage host floor.

    Pipeline stages:

    Step 1 — VM Capacity GB = Demand Disk GB ÷ DRR
    (DRR = Dedup Ratio × Compression Ratio)
    Step 2 — Swap GB = Demand RAM GB × VM Swap%
    (100% for mgmt, configurable for WLD)
    Step 3 — Interim GB = VM Capacity GB + Swap GB
    Step 4 — Protected GB = Interim GB × Protection Factor (PF)
    Step 5 — With Free GB = Protected GB × (1 + vSAN Free%)
    Step 6 — Total Required = With Free GB × (1 + Growth%)

    Storage host floor:

    Effective Hosts = Total Hosts − Failures to Tolerate
    Per-Host Requirement = Total Required GB ÷ Effective Hosts
    Storage Hosts = ⌈ Total Required GB / Raw GB per Host ⌉ + Failures to Tolerate
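
    A minimal sketch of the six stages and the storage host floor follows. It mirrors the formulas as written; the function name, parameter names, and the default values for free space (25%) and growth (20%) are assumptions for illustration, not the tool's defaults.

```python
import math


def esa_storage_hosts(demand_disk_gb, demand_ram_gb, raw_gb_per_host,
                      dedup=1.0, compression=1.0, swap_pct=100,
                      pf=1.5, free_pct=25, growth_pct=20, ftt=1):
    """Run the capacity pipeline and return (total required GB, storage hosts)."""
    drr = dedup * compression
    vm_capacity = demand_disk_gb / drr                     # Step 1
    swap = demand_ram_gb * swap_pct / 100                  # Step 2
    interim = vm_capacity + swap                           # Step 3
    protected = interim * pf                               # Step 4
    with_free = protected * (1 + free_pct / 100)           # Step 5
    total_required = with_free * (1 + growth_pct / 100)    # Step 6
    hosts = math.ceil(total_required / raw_gb_per_host) + ftt
    return total_required, hosts


raw_gb_per_host = 6 * 7.68 * 1000  # NVMe Qty x NVMe Size (TB) x 1,000
total_gb, storage_hosts = esa_storage_hosts(
    demand_disk_gb=100_000, demand_ram_gb=3_200,
    raw_gb_per_host=raw_gb_per_host)
print(round(total_gb), storage_hosts)
```

    Because every stage multiplies the previous one, a small change early in the pipeline (for example DRR) carries through to the final host count.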

    Data Reduction Ratio (DRR)

    The tool splits DRR into two separate inputs: Dedup Ratio and Compression Ratio. DRR = Dedup × Compression. Both default to 1.0 (no reduction) because real-world ratios depend entirely on data entropy — databases compress poorly, VDI golden images deduplicate extremely well. Using optimistic DRR values leads to undersized storage clusters.

    ⚠️ DRR above 2.0 is optimistic. Unless you have measured DRR from an equivalent workload in your environment, keep both ratios at 1.0. A DRR of 2.0 roughly halves your storage host count. If the real-world ratio comes in at 1.2, you’ll need significantly more hosts than planned.

    TiB conversion

    The tool uses binary TiB throughout. NVMe drives are marketed in TB decimal (1 TB = 1,000 GB). Conversion: 1 TB = 1,000 GB = 0.9095 TiB. A 6× 7.68 TB host = approximately 41.9 TiB raw per host after conversion.
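
    The conversion is easy to check by hand. The short sketch below derives the factor from the byte definitions (10^12 bytes per TB, 2^40 bytes per TiB) rather than hard-coding 0.9095; the function name is illustrative.

```python
TB_BYTES = 10**12   # decimal terabyte
TIB_BYTES = 2**40   # binary tebibyte


def tb_to_tib(tb):
    """Convert decimal terabytes to binary tebibytes."""
    return tb * TB_BYTES / TIB_BYTES


factor = tb_to_tib(1)
raw_tib_per_host = tb_to_tib(6 * 7.68)  # six 7.68 TB drives
print(round(factor, 4), round(raw_tib_per_host, 1))
```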


    6. Protection policies & PF table

    The Protection Factor (PF) is the storage overhead multiplier applied to usable data to account for redundancy. It is determined by your chosen RAID type, FTT (Failures to Tolerate), and for RAID-5, the stripe width. The tool enforces the minimum host count per policy.

    | Policy | PF | Min Hosts | FTT | Notes |
    | --- | --- | --- | --- | --- |
    | RAID-5 2+1 FTT=1 | 1.50x | 3 | 1 | Default; best balance of protection and efficiency |
    | RAID-5 4+1 FTT=1 | 1.25x | 6 | 1 | Lower overhead but needs 6+ hosts |
    | RAID-6 4+2 FTT=2 | 1.50x | 6 | 2 | Two simultaneous drive failures tolerated |
    | Mirror FTT=1 | 2.00x | 3 | 1 | Simple mirror; highest rebuild performance |
    | Mirror FTT=2 | 3.00x | 5 | 2 | Three copies of every object |
    | Mirror FTT=3 | 4.00x | 7 | 3 | Maximum redundancy; very high storage cost |
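
    One way to picture the policy table is as a lookup that supplies both the multiplier and the host minimum. The sketch below is an illustration only: the dictionary and function names are hypothetical, and the mirror factors are taken as FTT + 1 full copies (2.0, 3.0, 4.0).

```python
# policy name -> (protection factor, minimum hosts, FTT)
POLICIES = {
    "RAID-5 2+1 FTT=1": (1.50, 3, 1),
    "RAID-5 4+1 FTT=1": (1.25, 6, 1),
    "RAID-6 4+2 FTT=2": (1.50, 6, 2),
    "Mirror FTT=1": (2.00, 3, 1),
    "Mirror FTT=2": (3.00, 5, 2),
    "Mirror FTT=3": (4.00, 7, 3),
}


def protected_gb(interim_gb, policy):
    """Apply the policy's protection factor to the interim capacity."""
    pf, _min_hosts, _ftt = POLICIES[policy]
    return interim_gb * pf


def enforce_policy_minimum(hosts, policy):
    """Never return fewer hosts than the policy requires."""
    _pf, min_hosts, _ftt = POLICIES[policy]
    return max(hosts, min_hosts)


print(protected_gb(1000, "RAID-5 4+1 FTT=1"))
print(enforce_policy_minimum(4, "RAID-6 4+2 FTT=2"))
```

    Keeping the factor and the minimum together makes the trade-off visible: RAID-5 4+1 saves capacity over 2+1 but doubles the smallest cluster you can build.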

    7. Final host count & limiter

    The final host count is the maximum across four independent floors: CPU hosts, RAM hosts, storage hosts, and the policy minimum. The tool identifies which floor is binding and labels it the Limiter.

    Final Hosts = max( CPU Hosts, RAM Hosts, Storage Hosts, Policy Min )

    | Limiter | Meaning | Common cause |
    | --- | --- | --- |
    | Compute | CPU is the binding constraint | High vCPU density, low oversub ratio |
    | Memory | RAM is the binding constraint | Memory-intensive workloads, RAM oversub at 1× |
    | Storage | vSAN ESA capacity drives the count | Large disk demand, high PF, low DRR, insufficient NVMe |
    | Policy | Protection policy min host count | Small cluster; compute fine but policy enforces minimum N hosts |

    When storage is the limiter, your NVMe capacity per host is insufficient to hold the protected dataset within the compute-determined host count. Solutions: increase NVMe drive count or size, relax the vSAN free% reserve, or accept a higher host count.
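
    The max-of-floors rule and the Limiter label can be sketched as below. How the tool breaks ties is not described in this post, so the sketch simply assumes the first floor in the listed order wins; the function name is illustrative.

```python
def final_hosts(cpu_hosts, ram_hosts, storage_hosts, policy_min):
    """Return (final host count, limiter label) from the four floors."""
    floors = {
        "Compute": cpu_hosts,
        "Memory": ram_hosts,
        "Storage": storage_hosts,
        "Policy": policy_min,
    }
    # max() returns the first key with the largest value, so ties resolve
    # in the order above (an assumption, not documented tool behaviour).
    limiter = max(floors, key=floors.get)
    return floors[limiter], limiter


print(final_hosts(18, 5, 7, 3))
print(final_hosts(2, 2, 2, 3))
```

    For an external-storage domain the same function applies with the storage floor passed as 0.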


    8. NVMe memory tiering (VCF 9)

    VCF 9 introduces NVMe-backed memory tiering, where fast NVMe drives act as a memory extension. A partition of each NVMe drive is set aside as a memory tier — not storage — allowing effective RAM per host to exceed physical DRAM installed. This can reduce the host count when memory is the sizing constraint.

    Tiering formulas:

    Partition GB = min( Drive GB, DRAM × NVMe Ratio, 512 GB cap )
    NVMe Ratio Used = Partition GB ÷ Host DRAM GB
    Effective Host RAM = Host DRAM × (1 + NVMe Ratio Used)
    Tiered Demand RAM = ( Eligible Demand ÷ (1 + NVMe Ratio Used) ) + Ineligible Demand

    Key inputs: Eligibility % (what fraction of workload is not latency-sensitive), NVMe-to-DRAM ratio (GB of NVMe tier per GB of DRAM), and tier drive size (separate from vSAN data drives). The effective RAM and reduced demand figure feed back into the RAM host floor calculation.
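
    The tiering formulas translate directly into code. The sketch below follows them as written; the function name, the 1,920 GB tier drive, the 1:1 ratio, and the 50% eligibility are hypothetical inputs chosen to show the 512 GB cap taking effect.

```python
def nvme_tiering(host_dram_gb, drive_gb, nvme_ratio, demand_ram_gb,
                 eligibility_pct, cap_gb=512):
    """Return (partition GB, ratio used, effective host RAM, tiered demand)."""
    partition_gb = min(drive_gb, host_dram_gb * nvme_ratio, cap_gb)
    ratio_used = partition_gb / host_dram_gb
    effective_host_ram = host_dram_gb * (1 + ratio_used)
    eligible = demand_ram_gb * eligibility_pct / 100
    ineligible = demand_ram_gb - eligible
    tiered_demand = eligible / (1 + ratio_used) + ineligible
    return partition_gb, ratio_used, effective_host_ram, tiered_demand


partition, ratio, effective_ram, tiered = nvme_tiering(
    host_dram_gb=1024, drive_gb=1920, nvme_ratio=1.0,
    demand_ram_gb=3200, eligibility_pct=50)
print(partition, ratio, effective_ram, round(tiered, 1))
```

    Here the requested 1:1 ratio is cut to 0.5 by the cap, which is why the ratio actually used, not the ratio requested, appears in the later formulas.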

    ⚠️ Tiering caveats. NVMe tiering suits read-heavy workloads with temporal locality. It is not appropriate for latency-sensitive databases, real-time analytics, or anything where memory bandwidth consistency matters. The eligibility % input requires honest assessment of your workload mix.


    9. External storage mode

    Both the Management Domain and each WLD can be toggled to External Array mode — modelling Fibre Channel or NFS as principal storage. In this mode, the vSAN ESA storage pipeline is bypassed entirely. Host count is determined by compute only, and the user supplies an estimated array capacity for documentation.

    Final Hosts (ext) = max( CPU Hosts, RAM Hosts, Policy Min )
    — Storage floor is removed

    The Limiter can only be Compute, Memory, or Policy. No ESA capacity, PF, or per-host storage figures are calculated for external domains.

    Entitlement impact

    Every VCF core licence includes 1 TiB of vSAN raw storage entitlement. When a domain runs external storage, those cores are still licensed at the same cost but the bundled vSAN storage is unused.

    Forfeited TiB = Licensed Cores × 1 TiB/core

    For a 10-host domain with 2×32c hosts, that’s 640 TiB of vSAN entitlement forfeited — storage the customer is paying for but not using. The tool surfaces this inline, in the Fleet License Summary, and in the export report so the commercial impact is visible before procurement conversations begin.


    10. VCF licence entitlement calculation

    VCF 9 is licensed per core. The tool calculates total core count across the fleet and derives the vSAN storage entitlement bundled with those licences.

    Mgmt Cores = Mgmt Hosts × Host Cores
    WLD Cores = Σ( WLD Hosts × Host Cores )
    Entitlement (TiB) = ( Mgmt Cores + WLD Cores ) × 1 TiB/core
    Fleet vSAN Raw TiB = Σ( Hosts × NVMe Qty × NVMe TB × 0.9095 )
    Add-on Required = max( 0, Fleet Raw TiB − Entitlement TiB )

    If raw capacity exceeds entitlement, the difference is flagged as Add-on TiB Required — additional vSAN capacity licensing needed beyond what’s included in core licences. External storage domains exclude their array capacity from the fleet raw total.
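
    A small sketch of the entitlement arithmetic, including the forfeited figure for external-storage domains, is shown below. The `Domain` class and its field names are hypothetical, and the example fleet is invented; only the formulas and the 0.9095 factor come from the text above.

```python
from dataclasses import dataclass

TB_TO_TIB = 0.9095  # conversion factor used in the formulas above


@dataclass
class Domain:
    hosts: int
    host_cores: int
    nvme_qty: int = 0
    nvme_tb: float = 0.0
    external_storage: bool = False


def fleet_licensing(domains):
    """Return (cores, entitlement TiB, raw TiB, forfeited TiB, add-on TiB)."""
    cores = sum(d.hosts * d.host_cores for d in domains)
    entitlement_tib = cores * 1.0  # 1 TiB per licensed core
    raw_tib = sum(d.hosts * d.nvme_qty * d.nvme_tb * TB_TO_TIB
                  for d in domains if not d.external_storage)
    forfeited_tib = sum(d.hosts * d.host_cores * 1.0
                        for d in domains if d.external_storage)
    addon_tib = max(0.0, raw_tib - entitlement_tib)
    return cores, entitlement_tib, raw_tib, forfeited_tib, addon_tib


# Invented fleet: 4 management hosts and 18 workload hosts, all 2x16c, 6x7.68 TB.
fleet = [Domain(4, 32, 6, 7.68), Domain(18, 32, 6, 7.68)]
cores, entitlement, raw, forfeited, addon = fleet_licensing(fleet)
print(cores, entitlement, round(raw, 1), forfeited, round(addon, 1))

# A 10-host, 2x32c domain on external storage, as in the earlier example.
print(fleet_licensing([Domain(10, 64, external_storage=True)])[3])
```

    In the invented fleet the raw NVMe capacity exceeds the core-derived entitlement, so the difference shows up as add-on TiB.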


    11. Principal storage options in VCF 9 (KB 416270)

    VCF 9 supports a broader set of principal storage options than previous versions. Some are available via standard greenfield workflows; others require the Converge workflow. This distinction matters — it affects automation, LCM, and Day 2 operations.

    | Storage Model | Mgmt Default | Mgmt Additional | VI WLD | Method |
    | --- | --- | --- | --- | --- |
    | vSAN ESA | Principal | Principal | Principal | 🟢 Greenfield |
    | vSAN OSA | Principal | Principal | Principal | 🟢 Greenfield |
    | Storage Cluster (disagg. vSAN) |  | Principal | Principal | 🟢 Greenfield |
    | Compute-Only Cluster |  | Principal | Principal | 🟢 Greenfield |
    | Fibre Channel (FC) | Principal | Principal + Supp | Principal + Supp | 🟢 Greenfield |
    | NFS v3 | Principal | Principal + Supp | Principal + Supp | 🟢 Greenfield |
    | iSCSI | Principal* | Principal* | Principal* | 🔄 Converge |
    | NFS v4.1 | Principal* | Principal* | Principal* | 🔄 Converge |
    | FCoE | Principal* | Principal* | Principal* | 🔄 Converge |
    | NVMe/FC · NVMe/TCP · NVMe/RDMA | Principal* | Principal* | Principal* | 🔄 Converge |

    * Via Converge workflow: deploy ESXi 9 → configure target datastore → deploy vCenter 9 → import into VCF 9 using Converge (management) or Import vCenter (WLD).

    ⚠️ Day 2 operations constraint: For non-LCM Day 2 operations (host commissioning, adding/removing hosts or clusters), perform the operation in vCenter first, then run Sync Inventory in VCF Operations. If this step is skipped, lifecycle management in VCF Operations will be blocked for those hosts and clusters.

    Source: Broadcom KB Article 416270


    12. Assumptions & caveats

    | Assumption | Detail |
    | --- | --- |
    | Single cluster per domain | Each WLD is modelled as one cluster. Multi-cluster WLDs are not supported. |
    | Homogeneous hosts | All hosts within a domain use the same spec. Mixed-node clusters are not modelled. |
    | vSAN ESA only | The storage pipeline models ESA only. vSAN OSA has different overhead characteristics. |
    | Growth is a flat buffer | Growth % is applied once, not compounded year-over-year. Add headroom manually for multi-year plans. |
    | VM Swap fixed at 100% for mgmt | The management domain’s swap requirement is not user-configurable. |
    | No stretched cluster modelling | Stretched clusters double host count and require witness nodes; not currently modelled. |
    | Flat DRR across all data | A single DRR applies to the entire disk demand. Mixed workloads with varying compressibility are not modelled per-VM. |
    | No explicit vSAN CPU/RAM overhead | vSAN ESA consumes a small amount of host CPU and memory. Include this in your Compute Reserve % input. |

    🚫 Not an official Broadcom tool. This sizer is an independent planning aid built by vmtechie.blog. It is not endorsed by or affiliated with Broadcom. All figures are estimates. Validate every design against official Broadcom TechDocs, VMware HCL, and field engineering guidance before procurement or deployment.

  • VCF 9 Fleet Planning Sizer

    After several VCF design sessions—navigating management domains, ESA policies, and the new core-based licensing—one thing became clear: we have plenty of docs, but we need more interactive clarity. I built the VCF 9 Fleet Planning Sizer (ESA Only) to help architects model environments quickly.

    🔷 VCF 9 Fleet Planning Sizer (ESA Only)

    👉 Try it here: https://sizer.vmtechie.blog/

    This is an independent planning calculator designed to help architects model:

    • Infrastructure VM footprint (Supervisor, Edge, etc.)
    • Management Domain sizing
    • Multiple Workload Domains
    • ESA storage behavior
    • DRR (Dedup × Compression realism)
    • Failure domain modeling (0 / N+1 / N+2)
    • Core-based licensing visibility
    • vSAN entitlement vs raw consumption

    Why I Built This Tool

    Designing VCF 9 isn’t just about adding up VMs. It’s about navigating the “Triple Constraint”: Compute, ESA Storage, and Licensing. In real architecture discussions, we constantly ask:

    • What is actually limiting this cluster?
    • CPU, Memory, or Storage?
    • How many hosts do we really need?
    • What does FTT=2 + RAID-6 really do to capacity?
    • Are we oversizing?
    • Are we license constrained?
    • What happens if I add Supervisor HA?
    • What does N+2 failure tolerance mean in practice?

    Spreadsheets can answer parts of this, but they don’t show the dynamic interaction between policy, compute, and ESA. This tool tries to do that.

    Management Domain Sizing

    The calculator starts with:

    🔹 Hardware Profile

    • CPUs per host
    • Cores per CPU
    • RAM per host
    • NVMe quantity & size
    • Minimum host count

    🔹 Policy Inputs

    • CPU oversubscription
    • Memory oversubscription
    • Host reserve %
    • FTT & RAID policy
    • vSAN free space %
    • Dedup & compression
    • VM Swap Used %
    • Failure modeling

    How It Calculates Management Hosts

    1. Compute usable vCPU per host
    2. Compute usable RAM per host
    3. Apply reserve factor
    4. Compare demand from full Management VM stack
    5. Determine limiter (Compute / Memory / Storage)
    6. Calculate ESA protected storage requirement
    7. Apply failure domain logic
    8. Final host count = max(CPU, RAM, Storage, Minimum Hosts)

    You immediately see:

    • Demand vs Capacity
    • Protection Factor
    • ESA storage breakdown
    • Core licenses required
    • Raw TiB consumed

    Full Management VM Stack Modeling

    The tool includes:

    • SDDC Manager
    • vCenter
    • NSX Manager
    • NSX Edge
    • AVI
    • VCF Operations
    • Log Insight
    • Network Insight
    • Identity
    • Custom VMs

    Each with T-shirt sizing.

    ESA Storage Model

    ESA math is often misunderstood. The calculator models:

    VM Capacity = (VM disks + infra disks) / DRR
    Swap = Provisioned RAM × Swap %
    Interim Total = VM Capacity + Swap
    Protected = Interim × Protection Factor
    + Free Space Reserve
    + Growth %
    Storage Hosts = ceil(total / per-host raw capacity) + failures

    Protection Factor examples:

    | Policy | FTT | Protection Factor |
    | --- | --- | --- |
    | RAID-1 | 1 | 2.0 |
    | RAID-1 | 2 | 3.0 |
    | RAID-5 | 1 | 1.5 |
    | RAID-5 | 2 | 1.75 |
    | RAID-6 | 2 | 1.5 |

    Workload Domains (Where It Gets Interesting)

    You can add multiple WLDs.

    Each WLD has:

    🔹 Tenant Demand

    • VM count
    • vCPU per VM
    • RAM per VM
    • Disk per VM
    • Growth %

    🔹 Policy + Planning

    • CPU/Mem oversub
    • FTT + RAID
    • Reserve %
    • Free space %
    • Dedup × Compression
    • VM Swap Used %
    • Failure Domain (0 / N+1 / N+2)

    Limiter Visualization + Health Model

    Each WLD shows:

    • Compute limiter
    • Memory limiter
    • Storage limiter
    • Utilization %
    • Health badge:
      • 🟢 Healthy
      • 🟡 Tight
      • 🔵 Oversized

    This gives immediate architectural intuition.

    Licensing Visibility (Core-Based)

    The calculator also models:

    • Management core licenses
    • Workload core licenses
    • Total fleet cores
    • Entitlement (1 TiB per core)
    • Required add-on capacity

    What Makes This Different?

    This tool is:

    ✔ ESA-focused
    ✔ Policy-aware
    ✔ Failure-domain realistic
    ✔ Multi-domain capable
    ✔ Licensing visible
    ✔ Architecture-driven

    It’s not just math. It reflects real design conversations.

    ⚠️ Important Disclaimer

    This calculator is:

    • Independent
    • Not an official Broadcom / VMware tool
    • Not endorsed by my employer
    • Intended as a planning aid only

    Always validate against:

    • Official documentation
    • HCL
    • Field engineering guidance

    🧑‍💻 Who Is This For?

    • VCF Architects
    • Cloud Platform Leads
    • Infrastructure Engineers
    • Pre-sales Architects
    • Capacity planners
    • Anyone doing ESA-based VCF 9 designs

    🚀 Try It

    👉 Live here:

    https://sizer.vmtechie.blog

    If you test it, I’d love your feedback.

    Final Thoughts

    Architecture clarity reduces risk. This tool is my contribution to making VCF 9 planning:

    More transparent.
    More realistic.
    More engineer-friendly.

  • VCF 9 – Updating the Supervisor Service

    Supervisor and VKS clusters are built using a common Kubernetes distribution core, but their Kubernetes versions are delivered differently. Starting with VCF 9, Supervisor Kubernetes releases are delivered independently of vCenter. You can update the Supervisor version by deploying a release from the Supervisor Content Library. In this blog post, we will walk through the Supervisor update process step by step. Let’s get started!

    Create and Configure a Subscribed Content Library for Supervisor Images

    For vSphere Supervisor, VMware publishes Supervisor images through a content delivery network (CDN). To enable or upgrade vSphere Supervisor, you can create a Subscribed Content Library that synchronizes with the Supervisor release images.

    You can configure the content library in either Immediate or On-Demand synchronization mode. Note that immediate synchronization from the public CDN may require more time and consume additional disk space.

    • Log in to vCenter as a vSphere administrator.
    • From the Home menu, select Content Libraries.
    • Click Create.
    • Provide a name for the library (for example, supervisor update library) and click Next.
    • On the Configure Content Library page, select Subscribed Content Library.
    • In the Download content section, select the synchronization mode of the content library and click Next.
    • When prompted, accept the SSL certificate thumbprint. The thumbprint remains stored on your system until the subscribed content library is removed from the inventory.
    • On the Apply Security Policy page, click Next.
    • On the Add storage page, select a datastore as a storage location for the content library contents and click Next.
    • Review the details and click Finish.

    Assign the content library to the vSphere Supervisor platform

    • In vCenter, from the Home menu, select Supervisor Management.
    • Select Content Distribution.
    • On the Supervisor Images Library card, click Assign.
    • Select the content library that you created above and click Assign.
    • The new content library begins synchronizing, which may take some time. After synchronization is complete, the new Supervisor Kubernetes versions included in the images appear under the Updates tab.

    Apply Updates

    • Select the Available Version you want to update to. For example: v1.30.10+vmware.1-fips-vsc9.0.0.0100. ⚠️ Updates must be applied incrementally. You cannot skip versions (e.g., upgrading directly from 1.28 to 1.30). The correct sequence is 1.28 → 1.29 → 1.30.
    • Select a Supervisor to update and click Apply Updates
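
    The no-skipping rule can be sketched as a small planning helper that lists each minor version you would have to pass through. This is an illustration only: the function names are hypothetical, the starting release string `v1.28.3` is made up, and the sketch only reads the leading major.minor numbers rather than validating anything against the content library.

```python
import re


def minor_version(release):
    """Extract (major, minor) from a string such as 'v1.30.10+vmware.1-...'."""
    match = re.match(r"v?(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised release string: {release!r}")
    return int(match.group(1)), int(match.group(2))


def upgrade_sequence(current, target):
    """List every minor version to apply, in order, to get from current to target."""
    cur_major, cur_minor = minor_version(current)
    tgt_major, tgt_minor = minor_version(target)
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward moves within one major version are modelled")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


# 'v1.28.3' is a hypothetical starting release used for illustration.
hops = upgrade_sequence("v1.28.3", "v1.30.10+vmware.1-fips-vsc9.0.0.0100")
print(hops)
```

    Each entry in the list corresponds to one pass through Apply Updates, with its own pre-checks.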

    The system runs a series of pre-checks to verify the compatibility of the different components against the Supervisor Kubernetes version to which you want to update.

    Learn which pre-checks run before the Supervisor update, and how to troubleshoot errors from failed pre-checks, HERE.

    When the pre-checks are completed successfully, you can update the Supervisor.

    Upgrading the VMware vSphere Supervisor service is a critical step in maintaining a secure, stable, and feature-rich VMware Cloud Foundation environment. By following best practices—planning incremental updates, leveraging subscribed content libraries, and validating compatibility at every stage—administrators can ensure minimal downtime while keeping workloads and Kubernetes clusters up to date. Regular Supervisor upgrades not only enhance platform capabilities but also strengthen the foundation for running modern applications, containers, and cloud-native services efficiently and reliably.

  • VCF Automation – Tenant Management

    In today’s multi-tenant cloud environments, VMware Cloud Foundation Automation (VCFA) offers a robust layered architecture that seamlessly bridges enterprise-grade infrastructure management with developer-ready self-service capabilities.

    By clearly separating responsibilities—from VMware Cloud Service Providers who manage the physical and virtual infrastructure, to organization administrators who allocate resources, and finally to developers who consume them—VCFA enables efficient resource governance, operational consistency, and scalability. This structured approach not only supports multi-tenancy and workload isolation but also accelerates innovation by empowering end users to deploy applications and services quickly within well-defined boundaries.

    Why Tenant Management Matters

    Tenant management is more than just dividing resources—it’s about ensuring cost efficiency, security, scalability, and compliance in a shared infrastructure. In VCFA, these capabilities allow VMware Cloud Service Providers to maximize utilization without compromising performance or governance for individual tenants.

    Key concepts to understand from both the Provider and Tenant perspectives:

    Projects

    Projects control user access to namespaces and user ownership of provisioned resources. All organizations are created with a default project. The default project is empty and does not have any namespaces or users.

    Example: A VMware Cloud Service Provider might assign a dedicated project to each customer department for clearer billing and isolation.

    Regions

    The Regions page lists all the regions in which the organization has a quota. Organizations can have a quota in one or many regions. Your provider administrator assigns the regional quota to your organization. Quota in a region can come from one or many vSphere Zones within that region.

    Example: A global enterprise hosted by a VMware Cloud Service Provider might have quotas in Asia and Europe to ensure low-latency access for local teams.

    Namespace Class

    Namespace classes are templates for namespace provisioning. These templates can be used to standardize namespace attributes, like utilization limits, reservations, VM classes, storage classes, and content libraries. Organizations come preconfigured with three default namespace classes (small, medium, and large), which are meant to serve as example templates. The only attributes that differ among these built-in templates are the CPU and memory limits. Administrators can use these templates as-is or modify them to suit their needs.

    Namespace

    Projects are the central construct for organizing and allocating infrastructure resources to tenants or teams. As the organization administrator, you manage and distribute infrastructure by assigning namespaces to projects. When configuring a project, you must add at least one namespace so that users within the project can begin provisioning workloads such as virtual machines, vSphere Kubernetes Service (VKS) clusters, or other supported resources. Namespaces act as scoped resource pools, defining limits for CPU, memory, and storage to ensure fair allocation and consistent performance. Each namespace is tied to a Virtual Private Cloud (VPC) and a namespace class, which in turn is associated with at least one zone to determine placement and availability. This structure not only enforces resource governance but also lets automation workflows deploy consistently within predefined boundaries.

    Example: A tenant of a VMware Cloud Service Provider might create separate namespaces for development and production to avoid accidental resource conflicts.
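
    The relationships described above can be sketched as a small Python model. This is an illustration under stated assumptions, not how VCFA stores or exposes these objects: the Namespace and Project classes, their fields, and the names "dev", "prod", "vpc-dev", "vpc-prod", and "zone-a" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Hypothetical namespace: tied to one VPC, one namespace class, 1+ zones."""
    name: str
    vpc: str
    namespace_class: str
    zones: list

    def __post_init__(self):
        # The article states a namespace is associated with at least one zone.
        if not self.zones:
            raise ValueError(f"namespace {self.name!r} needs at least one zone")


@dataclass
class Project:
    """Hypothetical project: users can provision only once a namespace exists."""
    name: str
    namespaces: list = field(default_factory=list)

    def can_provision(self) -> bool:
        return len(self.namespaces) > 0


# The default project starts empty, so nothing can be provisioned yet.
default_project = Project("default")

# Separate namespaces for development and production, as in the example above.
team_project = Project("team-a")
team_project.namespaces.append(Namespace("dev", "vpc-dev", "small", ["zone-a"]))
team_project.namespaces.append(Namespace("prod", "vpc-prod", "large", ["zone-a"]))
```

    Here default_project.can_provision() is False until a namespace is added, while team_project.can_provision() is True.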

    Virtual Private Clouds (VPCs)

    A Virtual Private Cloud (VPC) in VMware Cloud Foundation Automation (VCFA) offers an isolated networking environment that can be associated with one or more namespaces. Organizations can create multiple VPCs and assign each to specific namespaces based on workload or isolation requirements.

    Each VPC is an independent network and supports three types of IP address spaces, each offering different levels of reachability:

    • Private CIDRs: These addresses are internal to the VPC and are not routable outside without NAT. They are managed by the VPC administrator and do not need to be globally unique, allowing reuse across multiple VPCs.
    • TGW Private IP Blocks: These IP blocks are scoped at the organization level and are advertised through the Transit Gateway (TGW) within the organization. Organization admins define these blocks, and project admins can allocate subnets from them for their VPCs. This enables direct communication between VPCs in the same organization using the TGW Private IP space.
    • External IP Blocks: Managed by the provider admin, these IPs enable outbound access through Source NAT. Organization admins can assign subnets from provider-defined external blocks, giving workloads external connectivity while still using internal addressing.

    You can choose to deploy a separate VPC per namespace for stricter isolation, or share a VPC across namespaces where network separation is not required.
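
    The difference between the address spaces is easier to see with real CIDR arithmetic. The sketch below uses only Python's standard ipaddress module. All of the address ranges and the carve_subnets helper are illustrative assumptions, not values or functions from VCFA. It shows that private CIDRs may repeat across VPCs, while subnets carved from a shared TGW private block must not overlap if VPCs are to reach each other directly.

```python
import ipaddress
from itertools import islice


def carve_subnets(block, new_prefix, count):
    """Take the first `count` subnets of size `new_prefix` from a block."""
    return list(islice(block.subnets(new_prefix=new_prefix), count))


# Private CIDRs are local to each VPC, so two VPCs may reuse the same range.
vpc_a_private = ipaddress.ip_network("10.0.0.0/24")
vpc_b_private = ipaddress.ip_network("10.0.0.0/24")
private_ranges_overlap = vpc_a_private.overlaps(vpc_b_private)

# A TGW private block is defined once per organization (example range).
tgw_block = ipaddress.ip_network("172.16.0.0/16")

# Project admins allocate distinct subnets from it for their VPCs.
vpc_a_tgw, vpc_b_tgw = carve_subnets(tgw_block, new_prefix=24, count=2)

# Distinct subnets of the same block can be advertised without conflict.
tgw_subnets_conflict = vpc_a_tgw.overlaps(vpc_b_tgw)
both_inside_block = vpc_a_tgw.subnet_of(tgw_block) and vpc_b_tgw.subnet_of(tgw_block)
```

    With these example ranges, the two VPCs receive 172.16.0.0/24 and 172.16.1.0/24 from the TGW block, while their private CIDRs are identical and still valid.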

    Transit Gateways

    Each organization has a transit gateway, which provides connectivity to the provider gateway within the organization. One or more VPCs are connected to the transit gateway, and each connection is defined by a VPC connectivity profile. Each VPC has connected workloads and a private subnet. SNAT rules translate addresses from this private subnet to a public address from an IP space block. This infrastructure enables the organization and its workloads to connect to external networks.

    You can view what transit gateways are available to your organization on the Manage & Govern > Networking > Transit Gateways page.

    IP Management

    Providers can use IP Spaces to manage their IP address allocation needs. IP Spaces provide a structured approach to allocating public IP addresses to different organizations, enabling connectivity to external networks.

    An IP space consists of a set of reserved CIDR blocks. These CIDRs must be dedicated to, and used by, organization administrators as they configure services. An IP space can only be IPv4.
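
    As a rough illustration of these constraints, the following Python sketch checks a candidate list of CIDR blocks before it would be handed to an organization. The validate_ip_space function is hypothetical and not part of any VMware tooling. The IPv4-only rule comes from the text above; rejecting overlapping blocks is an added assumption about what "dedicated" implies. The sample ranges are reserved documentation addresses.

```python
import ipaddress
from itertools import combinations


def validate_ip_space(cidrs):
    """Parse CIDR strings and enforce IPv4-only, non-overlapping blocks.

    The non-overlap check is an assumption made for this sketch.
    """
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]

    for net in networks:
        if net.version != 4:
            raise ValueError(f"{net} is not IPv4")

    for a, b in combinations(networks, 2):
        if a.overlaps(b):
            raise ValueError(f"{a} overlaps {b}")

    return networks


# Two distinct documentation ranges pass validation.
ip_space = validate_ip_space(["203.0.113.0/24", "198.51.100.0/24"])
total_addresses = sum(net.num_addresses for net in ip_space)
```

    Passing an IPv6 block such as 2001:db8::/32, or two blocks that overlap, raises a ValueError instead of returning a list.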

    Organization administrators can create and manage the private IP blocks within their organization. They can view the external IP address blocks that the provider has assigned to the organization, create and view private TGW IP address blocks for the entire organization to use, and view the private VPC IP address blocks that apply to specific VPCs.

    In essence, VMware Cloud Foundation Automation’s tenant management capabilities provide a structured, role-based framework for organizing projects, namespaces, VPCs, transit gateways, and IP resources. By aligning provider and tenant responsibilities, VMware Cloud Service Providers ensure secure isolation, consistent governance, and streamlined automation—empowering organizations to scale efficiently while maintaining full control over infrastructure and networking resources.