Author: vmtechie

VCF 9.1 Makes VKS Harder to Ignore

VKS on VCF 9.1: What Actually Changed & Why It Matters

A Comic Book Story in Seven Chapters

Issue #01 · May 2026 · The VCF 9.1 Saga

⚡ Cast of Characters ⚡

Captain VKS

vSphere Kubernetes Service 3.6

The hero. Born from vSphere, forged in CNCF conformance. Now powered up with VCF 9.1 abilities.

The Architect

Platform Engineer

Our protagonist. Runs multi-domain VCF estates. Needs Kubernetes at enterprise scale without the circus.

Cluster Creep

The Villain of Sprawl

Feeds on operational toil, slow provisioning, and fragmented toolchains. Grows stronger with every manual step.

The Oracle

VCF Operations 9.1

Sees all. Knows cost. Tracks every namespace. Speaks in metrics and FinOps.

🥊 The Challengers 🥊

The Cloud Twins

The Hyperscaler Duo

They move fast and always whisper: “Just move to our cloud.” They charge per hour and never let go.

The Red Baron

The Opinionated Platform

Arrives in full armor. Brings his own runtime, registry, mesh, and opinions about everything. Enterprise prices included.

The Wrangler

The Multi-Cluster Cowboy

Rides across any ranch — any cloud, any edge, any distro. Freedom is his creed. But who’s managing the cattle?

Bare Knuckle

The DIY Brawler

No platform. No hand-holding. Bare metal, kubeadm, and grit. Cheap up front. Costs you in blood and 3 AM pages.

Chapter 01: The 37-Minute Nightmare

The data center. 6:42 AM. The Architect stares at a provisioning timer that refuses to move. Cluster Creep watches from the shadows, feeding on frustration.

The Architect 37 minutes to spin up a dev cluster. Thirty. Seven. Minutes. The hyperscaler team next door gets theirs in ten. The CTO is asking questions.

Cluster Creep Yesss… and that’s just the deployment. Wait until you see the upgrade windows. I’ve got 45 minutes of downtime planned for each cluster. You have 200 clusters. Do the math. 😈

That’s 150 hours of maintenance windows per upgrade cycle… across the fleet…

May 5, 2026. Broadcom releases VCF 9.1. And everything changes.

Captain VKS Miss me? I brought Fast Deploy. Let me show you the new numbers.

Metric	VCF 9.0	VCF 9.1
Cluster Deploy Time	37 min	11 min (↓70%)
Cluster Upgrade Time	45 min	15 min (↓67%)
Max Clusters / Supervisor	~100	500
Node Pool Placement	Manual	DRS Intelligent
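
The arithmetic behind these numbers is easy to check. The sketch below recomputes the percentage reductions from the table and the fleet-wide upgrade window for the 200-cluster estate in the story. It assumes clusters are upgraded strictly one after another, which is a worst case and not necessarily how a real fleet is scheduled; the function names are illustrative only.

```python
def reduction_pct(before_min: float, after_min: float) -> float:
    """Percentage reduction going from before_min to after_min."""
    return (before_min - after_min) / before_min * 100


def fleet_hours(clusters: int, minutes_per_cluster: float) -> float:
    """Total window in hours if clusters are upgraded one after another."""
    return clusters * minutes_per_cluster / 60


if __name__ == "__main__":
    print(f"Deploy:  {reduction_pct(37, 11):.1f}% faster")
    print(f"Upgrade: {reduction_pct(45, 15):.1f}% faster")
    print(f"Fleet upgrade at 45 min/cluster: {fleet_hours(200, 45):.0f} h")
    print(f"Fleet upgrade at 15 min/cluster: {fleet_hours(200, 15):.0f} h")
```

Under that serial assumption, the same 200-cluster cycle shrinks from 150 hours to 50.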

Chapter 02: The Challengers Step Forward

Word of VCF 9.1 spreads. Four challengers emerge from the fog, each claiming the throne of enterprise Kubernetes. The Architect has heard their pitches before.

The Cloud Twins Adorable upgrade, Captain. But we’ve been doing sub-10-minute clusters for years. Managed control plane. Global regions. Auto-scaling node groups. Why fight gravity? Just come to the cloud.

Captain VKS Sure — and your managed control plane costs how much per cluster per month? Multiply that by 500 clusters. Now add the egress fees. Now add the data sovereignty audit your CISO just mandated. I run on hardware you already own.

The Red Baron How charming. You finally got CNI choice? I ship with my own SDN, my own service mesh, my own registry, my own CI/CD pipelines, and a full developer portal. I am the platform. You’re still assembling one.

Captain VKS You are the platform. That’s the problem. Your opinions become my constraints. Your lifecycle becomes my upgrade treadmill. Your per-core subscription becomes my CFO’s nightmare. I give choice. You give mandates.

The Wrangler Y’all are so cute with your single-vendor stacks. I run on any infrastructure. True multi-cluster freedom. No lock-in. Ever.

Captain VKS Freedom is great until your team is maintaining six different infrastructure backends. I give you 500 clusters on one Supervisor with one operational model. You give them options and a prayer.

Bare Knuckle I don’t need a platform. kubeadm, a Makefile, and raw skill. Zero licensing. Zero overhead. Pure Kubernetes.

Captain VKS I respect the craft. But who patches your nodes at 2 AM? Who handles etcd backups? Who runs certificate rotation? Your “zero cost” platform costs three full-time engineers.

The Architect I’ve evaluated all of you. Here’s my problem: I already run VCF. My VMs, NSX networking, vSAN storage, and security policies are all here. I need Kubernetes that joins my platform — not one that replaces it or ignores it.

The best Kubernetes platform is the one that doesn’t make me build a second operations team…

Chapter 03: Fast Deploy — 11 Minutes or Bust

Captain VKS explains what changed under the hood. Fast Deploy isn’t a marketing stunt — it’s an architectural rework of the provisioning pipeline.

Captain VKS Here’s what actually happened. We parallelized the node bootstrapping sequence, pre-staged container images into a local content library, and eliminated redundant API round-trips during cluster init. 11 minutes, from API call to workload-ready.

The Architect What about upgrades? That’s where we bleed. Every cluster upgrade is a maintenance window, and my team juggles 200+ clusters.

Captain VKS 45 minutes down to 15. Pre-staged images, parallel node drain-and-replace, and Multiple Clusters per Zone means you keep workloads running on Zone A while upgrading Zone B.

⚡ IMPACT METER ⚡

Provisioning Speed Gain: 70%
Upgrade Speed Gain: 67%
Scale Ceiling Increase: 5× (500 clusters)

The Cloud Twins 11 minutes… fine, that’s competitive. But can you match our global availability zones?

Captain VKS I don’t need 60 regions. My Architect’s data stays in his sovereign data center, on his hardware, under his compliance umbrella. Your 60 regions are 60 places his CISO has to audit.

Chapter 04: DRS Strikes Back — Intelligent Node Pool Placement

VCF 9.1 introduces Intelligent Node Pool Placement. This isn’t basic affinity rules — it’s DRS-level scheduling applied to Kubernetes node pools.

Captain VKS GPU pods → GPU hosts. NVMe workloads → NVMe nodes. DRS algorithm decides placement — not your YAML-wrestling platform team.

The Red Baron I have Topology Manager, NUMA-aware scheduling, and a full operator ecosystem. Infrastructure-aware placement is table stakes for me.

Captain VKS You schedule within the cluster. I schedule the cluster itself. DRS sees the whole estate. Your scheduler sees one cluster.

The Oracle With VKS Cost Showback in VCF Operations 9.1, I can tell you exactly what each namespace, each cluster, each team is costing you. FinOps FOCUS-compliant.

The Oracle I also expose an API for your RAG pipelines and MCP frameworks — your AIOps engine can query cost data directly.

Per-NS Cost Attribution
FOCUS FinOps Compliant
Real-Time Pricing Estimates
Showback + Chargeback Capability
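
As a rough illustration of what per-namespace showback means in practice, the sketch below rolls up cost rows by namespace or by cluster. The row layout is an assumption made for this example: `BilledCost` is borrowed from FinOps FOCUS column naming, while the `namespace` and `cluster` keys and all figures are hypothetical and do not represent the VCF Operations export format or API.

```python
from collections import defaultdict

# Hypothetical cost rows; not an actual VCF Operations export.
rows = [
    {"cluster": "dev-01", "namespace": "payments", "BilledCost": 412.50},
    {"cluster": "dev-01", "namespace": "search", "BilledCost": 120.00},
    {"cluster": "prod-01", "namespace": "payments", "BilledCost": 980.25},
]


def showback(cost_rows, key):
    """Sum BilledCost per value of `key` (e.g. 'namespace' or 'cluster')."""
    totals = defaultdict(float)
    for row in cost_rows:
        totals[row[key]] += row["BilledCost"]
    return dict(totals)


by_namespace = showback(rows, "namespace")
by_cluster = showback(rows, "cluster")
```

The same roll-up keyed on a team label would give the per-team view The Oracle mentions.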

Chapter 05: Container-as-a-Service & The CNI Revolution

VCF 9.1 introduces a simplified Container Service — deploy containers without deep Kubernetes expertise. Meanwhile, VKS 3.6 opens up CNI choice for the first time.

Captain VKS First: Container-as-a-Service. Your app teams get a self-service surface. Click, deploy, done. No Supervisor clusters or ClusterClass YAML.

Captain VKS Second: CNI freedom. VKS 3.6 deprecated ClusterBootstrap. Pick your CNI through the Addon Framework using AddonConfig CRDs. Antrea default, but the door is open.

The Wrangler Oh, you’re just now letting people choose their CNI? Welcome to 2022, Captain.

Captain VKS You let them choose. I let them choose with validated blueprints, lifecycle support, and a single vendor to call at 3 AM. Choice without support is just risk with extra steps.

The Architect And the Ingress story? The popular open-source Ingress controller is being retired…

Captain VKS Avi Load Balancer — natively integrated. Centralized control plane, distributed data plane, full observability. Plus vDefend gives you zero-trust lateral security for every pod.

Chapter 06: The Arena — Where Platforms Are Measured

The Architect pulls up the scoreboard. No hype. No marketing. Just the dimensions that matter when you’re running Kubernetes in a regulated enterprise with 500+ VMs already on VCF.

⚔️ HEAD TO HEAD ⚔️

Dimension	🛡️ Captain VKS	☁️ Cloud Twins	🎩 Red Baron	🤠 Wrangler	🥊 Bare Knuckle
Data Sovereignty	Your DC	Their DC	Your DC	Depends	Your DC
VM + K8s Unified Ops	Native	Separate	Separate	Separate	Separate
Infra-Aware Scheduling	DRS-Level	Node Groups	Topology Mgr	Manual	DIY
Cluster Scale Ceiling	500 / Supervisor	Unlimited*	Per Infra	Per Infra	Per Team
Integrated FinOps	FOCUS Native	Cost Explorer	3rd Party	3rd Party	Spreadsheet
Network Security	vDefend + NSX	VPC / SG	Built-in SDN	BYO	BYO
Licensing Model	Per-Core VCF	Per-Cluster/Hr	Per-Core Sub	Open Source	Free
Day 2 Toil	Low	Low	Medium	Medium	High
AI / GPU Conformance	CNCF AI Cert	GPU Pools	Operators	BYO	BYO

The Cloud Twins We still win on global reach and elastic scale.

The Red Baron And I still own the developer experience story. Integrated CI/CD, GitOps, developer portal — out of the box.

Captain VKS Fair. I’m not claiming I win everywhere. But for organizations already running VCF — I’m the only Kubernetes that doesn’t create a second operational island. VMs and containers. One platform. One team. One pane.

The Architect That’s the point everyone misses. I don’t need the “best” Kubernetes in a vacuum. I need the best Kubernetes for my stack. And my stack is VCF.

Chapter 07: The Numbers Don’t Lie

💥 THE FINAL SHOWDOWN 💥

Broadcom surveyed 44 VCF 9 customers in March 2026. Here’s what they found — and why the challengers are looking over their shoulders.

51% Less Infra Mgmt Time
46% Less Monitoring Time
47% Less Capacity Needed
39% Faster MTTR/MTTI

Cluster Creep No… NO! My sprawl… my complexity… my beautiful 37-minute deploy times… NOOOOO!

⚡ DEFEATED ⚡

The challengers watch from the sidelines. They’re not defeated — but they know the game just changed.

The Cloud Twins We’ll be back. Hybrid is where we’re heading too. See you at the edge…

The Red Baron Impressive numbers. But developer experience is the next battlefield. Don’t get comfortable.

The Wrangler Not every ranch runs on one brand of fence. I’ll see you at the multi-cloud rodeo.

Bare Knuckle Some of us still prefer the raw fight. But… 11 minutes is hard to argue with.

The Architect VCF 9.1 gives me 11-minute deploys, 15-minute upgrades, 500 clusters per Supervisor, intelligent DRS-based node placement, native FinOps cost tracking, self-service CaaS, open CNI choice, native Avi ingress, and zero-trust pod security. All on the same VCF stack I’m already running.

Captain VKS And I’m CNCF Kubernetes AI Conformant. The challengers are strong — I respect each of them. But none of them can do what I do: run Kubernetes as a native citizen of your existing VMware estate.

VCF 9.1 doesn’t just iterate on VKS — it redefines the operational ceiling. Fast Deploy eliminates the provisioning tax. DRS-based placement removes manual scheduling toil. FinOps cost showback closes the last visibility gap. And with 500 clusters per Supervisor, VKS is the platform-scale Kubernetes runtime that VCF architects have been waiting for.

The challengers each bring real strengths — managed simplicity, opinionated platforms, multi-cloud freedom, zero-cost entry. This isn’t a story where the hero has no flaws. But for the Architect running a VCF estate with VMs, containers, and AI workloads under one roof — the calculus is clear.

The question is no longer “can VKS compete?” — it’s “what’s your excuse for not running it?”

📚 Sources & References

Deploy Modern Apps Faster with VKS in VCF 9.1 — VKS scale, Fast Deploy, CNI, node pool placement details
Announcing VCF 9.1: Modern Private Cloud Built for Efficiency and Resilience — Product announcement, AI conformance, Container Service
Scale, Simplify, and Secure Your Private Cloud Operations with VCF 9.1 — VCF Operations, VKS cost showback, FinOps FOCUS, customer survey data (n=44)
Accelerate, Streamline, and Control Your Self-Service Private Cloud with VCF 9.1 — VCF Automation, Container Service runtime, Fast Deploy (37 min → 11 min)
Simplify Workload Connectivity with VCF 9.1 — Multi-NIC VKS, Istio Service Mesh, Distributed Transit Gateway, IPFIX for pods
Broadcom Press Release: VMware Cloud Foundation 9.1 — Official announcement, AI positioning, Private Cloud Outlook 2026 survey
VKS 3.6 Release Notes — ClusterBootstrap deprecation, AddonConfig CRDs, ImageBaker, CNI flexibility
VCF 9.1 Hands-on Labs — Now Live — Try VKS 3.6 scale, Fast Deploy, and multi-network in the lab

May 13, 2026

VCF 9.1 – Top 20 Highlights for Cloud Service Providers
Multi-tenancy, self-service, networking isolation, storage economics, fleet operations — curated for the CSP lens.

VCF 9.1 is arguably the most CSP-significant VCF release in recent memory. The networking story alone — edge-free distributed connectivity, VPC isolation policies, EVPN/VXLAN peering — rewrites the playbook for multi-tenant service delivery. But the improvements span every layer: storage economics, Kubernetes density, fleet operations, and cyber recovery. Here are the 20 features that matter most if you’re running — or planning to run — a VMware-powered cloud service.

Multi-Tenancy & Self-Service

01 / 20

VCD → VCF Automation Migration Tool

This is the feature VCD-based CSPs have been waiting for. VCF 9.1 introduces a native migration path from VMware Cloud Director to VCF Automation. VMs are imported from OrgVDC resource pools directly into vSphere Namespaces. Supervisors, Clusters, Regions, Projects, and Namespaces are auto-created and mapped to existing VCD constructs. Network boundaries of OrgVDC are migrated to NSX VPC — preserving tenant isolation through the transition.

Why CSPs Care
Unblocks the single biggest migration concern for VCD-based providers. Automated construct mapping dramatically reduces migration effort, tenant downtime, and the professional services cost of transitioning to the VCF Automation operating model.

02 / 20

Self-Service Namespace Creation with Guardrails

Organization admins can now delegate vSphere Namespace creation to Project Admins on a self-service basis. The governance layer is granular: admins define which Regions, Namespace Classes, Connectivity Profiles, Subnets, Infrastructure Policies, VPCs, and Service Engine Groups are available to each project. Tenants consume within those boundaries without filing tickets.

Why CSPs Care
Every namespace creation ticket that disappears from a CSP’s queue is margin improvement. Self-service with admin-defined guardrails is the operational model CSPs need — tenant autonomy without infrastructure risk.
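
To make the guardrail idea concrete, here is a minimal sketch of checking a self-service namespace request against what an admin has allowed for a project. The data model, field names, and values are all hypothetical and cover only three of the guardrail types listed above; this is not the VCF Automation API.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectGuardrails:
    """What an org admin allows a project to consume (hypothetical model)."""
    regions: set = field(default_factory=set)
    namespace_classes: set = field(default_factory=set)
    vpcs: set = field(default_factory=set)


def check_request(guardrails, region, namespace_class, vpc):
    """Return a list of violations; an empty list means the request is allowed."""
    violations = []
    if region not in guardrails.regions:
        violations.append(f"region '{region}' not allowed")
    if namespace_class not in guardrails.namespace_classes:
        violations.append(f"namespace class '{namespace_class}' not allowed")
    if vpc not in guardrails.vpcs:
        violations.append(f"VPC '{vpc}' not allowed")
    return violations


# Example guardrails for one tenant project (made-up values).
team_a = ProjectGuardrails(
    regions={"eu-west"},
    namespace_classes={"small", "medium"},
    vpcs={"team-a-vpc"},
)
```

A request such as `check_request(team_a, "eu-west", "small", "team-a-vpc")` passes with no violations, while anything outside the admin-defined sets is rejected without a ticket ever being filed.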

03 / 20

Upfront Pricing Estimates & Tenant Notifications

Tenants now see real-time pricing estimates before deploying catalog items, VMs, and VKS clusters. Consumption reports, infrastructure alerts, and critical operation notifications are surfaced directly in the VCF Automation UI. Providers configure which alerts and reports are visible to tenants.

Why CSPs Care
Transparent showback/chargeback is fundamental to CSP economics. When tenants see cost before they click “deploy,” billing disputes drop, resource waste decreases, and self-service confidence goes up.

04 / 20

Project-Scoped Content Libraries

A new form of content library scoped to specific projects within an organization. Admins can restrict VM image availability so that only the users and resources of a given project can access particular images. Canonical Ubuntu images are now available as validated, subscribed content — provider-controlled.

Why CSPs Care
Image governance per tenant project. CSPs curate approved OS images without cross-tenant leakage — essential for regulated tenants and for CSPs offering tiered service catalogs.

Networking & Tenant Isolation

05 / 20

VPC Connectivity Policies — Community, Promiscuous, Isolated

VCF 9.1 introduces connectivity policies that control inter-VPC communication within a tenant project — without firewall rules. Community: VPCs in the same community talk to each other. Promiscuous: talks to any VPC. Isolated: only communicates with promiscuous VPCs. These can be mixed within a project for precise segmentation.

Why CSPs Care
Multi-tier tenant networking (dev/staging/prod isolation, shared-services patterns) handled by policy rather than per-rule firewalls. Reduces CSP networking configuration overhead per tenant from hours to minutes.
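
The three policies can be expressed as a small reachability check. The sketch below is based only on the rules as stated above (promiscuous talks to any VPC, community VPCs talk within their community, isolated VPCs talk only to promiscuous ones); the `Vpc` class and the example names are hypothetical, and real product behaviour may include details not captured here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vpc:
    name: str
    policy: str                      # "community", "promiscuous" or "isolated"
    community: Optional[str] = None  # only meaningful for community VPCs


def can_communicate(a: Vpc, b: Vpc) -> bool:
    """Apply the three policy rules to a pair of VPCs in the same project."""
    # A promiscuous VPC talks to any VPC.
    if a.policy == "promiscuous" or b.policy == "promiscuous":
        return True
    # Community VPCs talk to each other only within the same community.
    if a.policy == "community" and b.policy == "community":
        return a.community == b.community
    # Everything else (isolated-isolated, isolated-community) is blocked.
    return False


shared = Vpc("shared-services", "promiscuous")
dev_a = Vpc("dev-a", "community", community="dev")
dev_b = Vpc("dev-b", "community", community="dev")
staging = Vpc("staging", "community", community="staging")
prod = Vpc("prod", "isolated")
```

In this example `dev_a` and `dev_b` reach each other, `prod` reaches only `shared`, and `dev_a` cannot reach `staging`, which is the dev/staging/prod plus shared-services pattern without a single firewall rule.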

06 / 20

Transit Gateway Advanced Connectivity

CTGW is now decoupled from Tier-0. VCF 9.1 supports HA mode per CTGW, multiple CTGWs and DTGWs per project, and multiple external connections per CTGW. For outbound traffic, tenants get full control over which Tier-0 is used, where SNAT is applied, and which External IP block is consumed.

Why CSPs Care
Per-project external connectivity with independent Tier-0 selection eliminates the shared gateway bottleneck. CSPs can model complex tenant topologies — multi-ISP, multi-region, dedicated uplinks — on shared infrastructure.

07 / 20

Distributed Transit Gateway with EVPN/VXLAN

Peer directly with the physical fabric using industry-standard EVPN/VXLAN. This decouples the control and data plane for north-south traffic — VMs get direct N/S connectivity without traffic tromboning through Edge appliances. No edge lifecycle, no edge provisioning, no edge scaling headaches.

Why CSPs Care
Edge VM sprawl is one of the top operational pain points at CSP scale. DTGW with EVPN/VXLAN eliminates it entirely for N/S traffic — better latency, fewer failure domains, dramatically simpler operations.

08 / 20

Virtual Network Appliances (VNA) — Edge-Free Network Services

A dedicated VNA Cluster now runs network services for Distributed External Connections: External IP (1:1 NAT), DHCP, NAT (SNAT/DNAT), VPC Outbound NAT (N:1 — new in 9.1), and NSX LB for Supervisor/VKS (new in 9.1) plus Avi VPC LB Plugin. Only NAT and LB traffic is redirected to VNAs — L2/L3 and External IP traffic remains fully distributed.

Why CSPs Care
Network services without deploying and managing Edge VMs per tenant. The distributed data-path keeps per-tenant traffic efficient while VNAs handle only the stateful services that need them.

09 / 20

TGW Span + Infoblox IPAM Integration

Transit Gateway Span constrains a TGW and its subnets to selected vCenter clusters — controlling where subnets are available, where workloads can be placed, and aligning DTGW spans with external connection VLANs. Separately, Infoblox integration discovers and maps Network Containers to external IP blocks, provisions subnets/IPs using Infoblox CIDRs, and auto-registers workload IPs and FQDNs.

Why CSPs Care
TGW Span gives CSPs physical network alignment per tenant cluster — critical for VLAN-constrained environments. Infoblox integration provides the single DDI source-of-truth that large CSPs already depend on, now natively integrated with VCF networking.

Storage & Data Efficiency

10 / 20

vSAN ESA Inline Compression (ZSTD) + Global Deduplication GA

vSAN 9.1 introduces a ZSTD-based inline compression algorithm tuned specifically for vSAN — delivering significantly higher data reduction ratios while balancing CPU utilization. Compression is now always-on. In parallel, vSAN Global Deduplication reaches GA, supporting between 3 and 64 hosts with improved processing efficiency. Crucially, Global Dedup is fully compatible with Data-at-Rest encryption — no negative impact on reduction ratios.

Why CSPs Care
Direct $/TB improvement. Better compression + dedup = higher tenant density per physical disk. This is fundamental to CSP storage margin economics, especially for VDI, database, and backup workload profiles.

11 / 20

Auto-RAID + Effective Capacity View

Auto-RAID automatically manages optimal resilience settings per cluster using a single “vSAN ESA Auto RAID Policy” in vCenter — dynamically adjusting as cluster size changes (4-host, 6-host stretched, 2-node, single-host bootstrap). The new “effective capacity” view replaces raw capacity statistics with usable capacity and simplified space-efficiency summaries covering dedup ratio, compression ratio, thin provisioning savings, and snapshot savings.

Why CSPs Care
No more manual storage policy tuning across hundreds of tenant clusters. Effective capacity view aligns with how CSPs bill and report storage — usable TB, not raw TB with overhead footnotes.

12 / 20

Native S3 Object Storage on vSAN — Technology Preview

Block, file, and S3-compatible object storage running on the same vSAN cluster. Multi-tenant object storage is provisioned and managed via VCF Automation or vSphere Supervisor. Scalable, resilient architecture courtesy of vSAN ESA. Available as Technology Preview in Patch 01 of VCF 9.1.

Why CSPs Care
A new service tier on existing hardware. CSPs can offer S3-compatible object storage to tenants without deploying separate storage infrastructure — opening up developer-oriented and AI/ML data-lake use cases.

Kubernetes & Containers

13 / 20

VKS: 500 Clusters per Supervisor + Fast Deploy

VKS now supports up to 500 Kubernetes clusters per Supervisor — a 2.6× scale increase over VCF 9.0. VKS 3.6 ships Kubernetes 1.35 (CNCF-certified, 24-month support). Fast Deploy leverages linked-clone (unencrypted VMs) and direct-mode (encrypted VMs) technologies to reduce cluster provisioning time by approximately 70% and upgrades by approximately 75%.

Why CSPs Care
Dramatically higher Kubernetes tenant density per control plane instance. Fast Deploy addresses burst scenarios common in VDI and retail — and reduces time-to-revenue for new K8s tenant onboarding from 37 minutes to 11 minutes.

14 / 20

Container Service — CaaS Without Kubernetes

Deploy isolated, secure containers directly on vSphere Pods within vSphere Namespaces — no full Kubernetes cluster required. UI-driven provisioning and lifecycle control. Supports StatefulSets with persistent volumes and multi-container pods. Based on the proven vSphere Pods technology with VM-level isolation.

Why CSPs Care
CSPs can offer a lightweight container service tier below full VKS — lower cost, faster deploy, familiar vSphere management. This broadens the addressable tenant market to teams that want containers but don’t need (or want to manage) Kubernetes.

Operations & Lifecycle

15 / 20

Unified Fleet IAM & Management

VCF 9.1 delivers end-to-end IAM with VCF-level roles across all components — vCenter, NSX, Operations, Automation, Logs, Networks, HCX, and Orchestration — all brokered through VIDB (Identity Broker). Unified password policies with vault integration, bulk certificate management (generate CSRs, renew, import across the fleet), and OAuth/API token access for programmatic automation. Custom VCF roles can be provisioned across vCenter and VCF instances.

Why CSPs Care
Single identity plane for the entire VCF estate. CSPs managing multi-instance fleets get consistent RBAC, password governance, and certificate rotation at scale — replacing the fragmented per-instance identity management that doesn’t survive operational audits.

16 / 20

Centralized LCM — 4× Parallel Upgrades

Lifecycle Management is now part of the VCF Services Platform with a unified software depot secured via OAuth token. Optimized precheck workflows and a 4× improvement in parallel cluster upgrade operations — centrally managed from VCF Operations. One place to download and manage binaries, and quickly assess health and upgrade readiness across the fleet.

Why CSPs Care
CSPs running hundreds of clusters can upgrade 4× faster in parallel. Single depot and centralized LCM eliminates the maintenance-window sprawl that plagues large CSP environments — turning a weekend-long upgrade cycle into an overnight operation.

17 / 20

Flexible Licensing — License Server + Aggregated Usage

VCF components are automatically licensed via vCenter when configured in connected mode. A dedicated license server offloads license logic from VCF Operations. Multiple licenses can be applied directly to a vCenter and its connected components. Aggregated license usage for ESX 8.x and 9.x. On-prem license appliance available for air-gapped or sovereign environments.

Why CSPs Care
CSPs with mixed-version estates (VCF 5.x through 9.x) get aggregated license management across generations. Override licenses support unique CSP scenarios — trial tenants, PoC environments, and tiered service offerings with differentiated entitlements.

Security & Cyber Recovery

18 / 20

On-Premises Cyber Recovery Clean Room

Full ransomware protection and recovery on customer-owned infrastructure — no cloud dependency. The solution extends vSAN Protection and Recovery to provide on-prem clean room capabilities with push-button vDefend-based network isolation, EDR integration (Carbon Black included by default, CrowdStrike BYOL supported), guided restore point selection, VM analysis and validation in the isolated environment, and orchestrated failback workflows.

Why CSPs Care
CSPs can offer “Cyber Recovery as a Service” as a premium tier — fully on-prem, data-sovereign, with clean room isolation that satisfies regulated industries prohibiting cloud-based recovery. The EDR vendor choice (Carbon Black or CrowdStrike) aligns with whatever the tenant already runs.

19 / 20

Security Posture Management & Compliance Automation

Fleet-wide compliance assessments using built-in benchmarks — enable benchmarks, assign to policies, clone and modify rules to suit requirements. Run assessments on-demand, view and filter results, export to PDF/CSV, and perform one-click remediation to infrastructure objects. Confidential Computing visibility through the SecOps dashboard (AMD SEV-SNP, Intel TDX). VCF-wide audit trails with standardized log architecture for security forensics.

Why CSPs Care
Automated compliance reporting for regulated tenants (FIPS 140-3, STIG, custom benchmarks). One-click remediation across the fleet reduces CSP audit preparation from weeks to hours. The audit trail becomes a sellable compliance artifact for tenants in financial services and government.

Edge

20 / 20

VCF Edge — 5,000 Hosts, 256 Parallel Upgrades, ZTP + GitOps

Fleet capacity doubled to 5,000 ESX hosts per instance. Parallel upgrade scale increased 4× from 64 to 256 clusters. Zero Touch Provisioning uses UEFI HTTPS Boot with TPM and Secure Boot support — hosts inherit desired-state image and configuration from the cluster, no TFTP required. Day-0 activation scripts configure vSphere clusters, Supervisor, and FLB. Argo CD-based GitOps provides pull-based workload delivery with drift detection and auto-correction. Flexible 1/2/3+ node topologies with full air-gap support.

Why CSPs Care
CSPs serving retail, telco, or industrial edge can scale to thousands of sites with lights-out ZTP and GitOps delivery. 256 parallel upgrades make fleet-wide patching operationally viable — a requirement for edge CSPs where site-by-site maintenance windows are physically impossible.
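
A quick way to see what the parallel-upgrade limit means for a fleet is to count upgrade waves. The sketch below assumes a hypothetical fleet of 2,000 edge clusters and that every wave runs at the full parallel limit; both are illustrative assumptions, not figures from the release.

```python
import math


def upgrade_waves(clusters: int, parallel_limit: int) -> int:
    """Number of sequential waves needed to upgrade every cluster."""
    return math.ceil(clusters / parallel_limit)


fleet = 2000  # hypothetical number of edge clusters
waves_at_64 = upgrade_waves(fleet, 64)
waves_at_256 = upgrade_waves(fleet, 256)
```

For this fleet the count drops from 32 waves at a limit of 64 to 8 waves at a limit of 256.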

The CSP Takeaway

VCF 9.1 is a platform release, not just a feature release. The networking overhaul (DTGW, VNAs, VPC policies, EVPN/VXLAN) alone justifies the upgrade for any CSP running multi-tenant workloads. Layer on the VCD migration tool, self-service namespaces, storage economics improvements, and fleet-scale operations — and this is the release that brings VCF’s cloud operating model to parity with what CSPs have been building manually around VCD for years.
May 8, 2026
I Built a Tool to Stop YAML Hell During Cloud → VCF 9 VKS Migrations
Last quarter I was knee-deep in a customer migration project — moving production workloads from EKS to VMware vSphere Kubernetes Service (VKS) on VCF 9. The architecture made sense: consolidate onto VCF, get NSX VPC networking, vSAN storage, and ditch the monthly AWS bill. Standard stuff.

The Problem:

Migrating Kubernetes workloads between platforms means transforming hundreds of YAML files: changing image registries (ECR, i.e. <account>.dkr.ecr.<region>.amazonaws.com → Harbor), remapping storage classes (gp3 → vSAN), stripping cloud annotations (IRSA, Azure Workload Identity), and fixing deprecated APIs. I spent three days manually editing manifests with sed scripts and still missed edge cases.

The Solution:

I built this browser-based tool to automate the grunt work. It runs entirely in your browser — no backend, no data upload, works offline.

What It Does:

Upload your Kubernetes manifests, configure your VCF target environment (Harbor registry, vSAN StorageClass, Velero bucket), hit Analyze. The tool:
- Auto-fixes: Deprecated APIs, cloud annotations, image URLs, StorageClass refs, CSI drivers, node selectors
- Flags for review: Secrets (re-provision via Vault), LoadBalancers (NSX ALB sizing), HPA metrics, RWX PVCs
- Generates bundle: Transformed YAML, Velero commands, image mirror script, VolumeSnapshotClass, SC remap ConfigMap, step-by-step checklist
Built with AI assistance: I provided the migration domain knowledge, and the AI wrote the transformation engine. Tested on real customer workloads migrating to VCF 9 with NSX VPC.
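
To give a feel for the auto-fix category, here is a minimal Python sketch of three of the transformations: stripping cloud identity annotations, remapping a PVC's StorageClass, and pointing ECR images at Harbor. It is not the tool's actual engine (which runs as JavaScript in the browser and is not shown here). It works on already-parsed manifests (plain dicts), handles only Deployments and PVCs, and the function names are made up for illustration.

```python
import copy
import re

# Annotations tied to cloud identity features (IRSA, Azure Workload Identity).
CLOUD_ANNOTATIONS = (
    "eks.amazonaws.com/role-arn",
    "azure.workload.identity/client-id",
)
# Matches <account>.dkr.ecr.<region>.amazonaws.com/<repo path>
ECR_IMAGE = re.compile(r"^\d+\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com/(.+)$")


def rewrite_image(image, harbor_fqdn, harbor_project):
    """Point an ECR image at Harbor; leave other images untouched."""
    match = ECR_IMAGE.match(image)
    if not match:
        return image
    return f"{harbor_fqdn}/{harbor_project}/{match.group(1)}"


def transform(manifest, harbor_fqdn, harbor_project, storage_class):
    """Return a transformed copy of one parsed manifest."""
    doc = copy.deepcopy(manifest)
    annotations = doc.get("metadata", {}).get("annotations", {})
    for key in CLOUD_ANNOTATIONS:
        annotations.pop(key, None)
    if doc.get("kind") == "PersistentVolumeClaim":
        if "storageClassName" in doc.get("spec", {}):
            doc["spec"]["storageClassName"] = storage_class
    if doc.get("kind") == "Deployment":
        for container in doc["spec"]["template"]["spec"].get("containers", []):
            container["image"] = rewrite_image(
                container["image"], harbor_fqdn, harbor_project
            )
    return doc
```

Working on a deep copy keeps the original export untouched, so the transformed output can be diffed against the source before anything is applied.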

🚀 Try it below

// VMTECHIE.BLOG

EKS/AKS/OCP → VKS/VCF9

Upload K8s manifests → analyze cloud deps → transform for VKS on VCF 9 → download migration bundle


Step-by-Step Usage:

Step 0 (Prerequisite): Export from Source Cluster
# Connect to your source EKS/AKS/OpenShift cluster
kubectl config use-context my-eks-cluster

# Export all resources from production namespace
kubectl get all,configmaps,secrets,pvc,ingress,serviceaccounts,pdb,hpa \
  -n production -o yaml > production-export.yaml

# Repeat for each namespace you're migrating
kubectl get all,cm,secret,pvc,ingress,sa,pdb,hpa -n staging -o yaml > staging-export.yaml
OpenShift users run this instead:
oc get all,cm,secret,pvc,route,sa,pdb,hpa,deploymentconfigs,imagestreams \
  -n production -o yaml > production-export.yaml
Step 1: Upload to Tool — Open the tool, then drag the production-export.yaml and staging-export.yaml files into the drop zone (or paste YAML content directly)

Step 2: Configure Target — Select source platform (EKS/AKS/OpenShift), fill in target VKS details:
- Harbor registry: harbor.corp.local
- Harbor project: migration
- vSAN StorageClass: vsan-default-storage-policy
- Velero bucket URL: https://minio.corp.local
- Velero bucket name: velero-backups
Step 3: Analyze — Click “Run Analysis” — tool processes YAML, shows results: X resources analyzed, Y errors, Z warnings, N auto-fixed

Step 4: Download Bundle — Click “Download Migration Bundle” → get vks-migration-bundle.zip with 6 pre-generated files ready to use
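
The analysis pass in Step 3 can be pictured as a scan over the parsed manifests. The sketch below flags resources that still use apiVersions removed from upstream Kubernetes; it is an illustration of the idea rather than the tool's own rule set, and the `analyze` function and its output format are assumptions made for this example.

```python
# apiVersions removed from upstream Kubernetes and their replacements.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): "networking.k8s.io/v1",
    ("networking.k8s.io/v1beta1", "Ingress"): "networking.k8s.io/v1",
    ("batch/v1beta1", "CronJob"): "batch/v1",
    ("policy/v1beta1", "PodDisruptionBudget"): "policy/v1",
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): "autoscaling/v2",
}


def analyze(manifests):
    """Return (resources analyzed, list of findings) for parsed manifests."""
    findings = []
    for doc in manifests:
        key = (doc.get("apiVersion"), doc.get("kind"))
        if key in REMOVED_APIS:
            name = doc.get("metadata", {}).get("name", "<unnamed>")
            findings.append(f"{key[1]}/{name}: {key[0]} -> {REMOVED_APIS[key]}")
    return len(manifests), findings


sample = [
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget",
     "metadata": {"name": "web-pdb"}},
]
analyzed, findings = analyze(sample)
```

Note that bumping the apiVersion is not always sufficient on its own: an Ingress moving to `networking.k8s.io/v1`, for example, also needs its backend fields and `pathType` updated, which is why the transformed output still deserves a dry-run.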

Known Limitations:
- Workload validation: Doesn’t check resource limits, affinity rules, or custom CRD schemas — manual review required
- Application code: Can’t detect SDK calls to AWS/Azure APIs in your app code — only infrastructure YAML
- Velero scripts: Templates, not production automation — test your backup/restore workflow thoroughly
- OpenShift Routes: Basic conversion only — complex regex paths or custom annotations need manual work
- Custom operators: Heavily customized or non-standard patterns may need review
- Edge cases exist: Assumes standard Kubernetes conventions — always dry-run before applying
Bottom line: This tool automates the tedious 90% of the work. The 10% that needs your domain expertise still needs your domain expertise.

Disclaimer & Privacy:

This tool was built during my own migration work and is shared as-is for the community. It is not endorsed or supported by, or affiliated with, Broadcom, VMware, or my employer. All YAML processing happens entirely in your browser using JavaScript: no data is uploaded to any server, no backend exists, and it works completely offline. Always review transformed manifests carefully and run kubectl apply --dry-run=server against your VKS cluster before production deployment. This tool automates repetitive transformations but cannot replace human review of your specific workload requirements.
April 28, 2026

The Integration Debt Nobody Budgets For — And How VCF Eliminates It…

Optionality sounds powerful… until you have to operate it.

This is not a debate about which hypervisor is fastest or which Kubernetes distribution has the most GitHub stars. It is a more fundamental question: what does it cost your organisation to assemble a platform versus deploying one? And as AI workloads enter the data centre, that question has never carried higher stakes.

🔷 1. The Illusion of Flexibility

Modern infrastructure platforms arrive with a compelling pitch:

The Pitch

Choose your compute
Pick your storage
Define your networking
Add Kubernetes
Extend to AI later

At first glance, this looks like control. It reads like architectural maturity. It feels like optionality. The reality is subtler.

⚠️

Reality Check

What appears as flexibility often becomes integration responsibility. You are no longer just consuming a platform — you are building and maintaining one. The components are yours to choose. So is the glue, the upgrade matrix, and the 2am incident call when two of them disagree.

🔶 2. The Cost Nobody Invoices — Operational Fragmentation

Most infrastructure cost conversations stop at licensing. That is the wrong place to stop.

Organisations that assemble their stack from best-of-breed point products pay a tax that never appears on a single invoice. That tax is operational fragmentation — the compounding overhead of managing upgrade matrices, support escalations, skill silos, and integration glue between components that were never designed to coexist.

Hidden Costs of an Assembled Stack

🔁 Cross-component compatibility testing before every patch cycle
🔄 Coordinated upgrades across independently-released product versions
🧩 Integration gaps between tooling layers with no validated fix path
🛠 Multi-vendor troubleshooting with no single accountable party
📋 Separate training and certification paths per product silo
⚙️ Custom automation scripts that break on every minor version update

None of these costs appear on a rack-and-stack BOM. But they absolutely show up in headcount, MTTR, change failure rate, and the number of people needed on a change advisory board call to approve a routine patch.
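
A back-of-the-envelope way to see why the compatibility matrix hurts: count the version pairs that could, in principle, need checking. The counting model below is my own simplification (every pair of components, every combination of supported versions). Real matrices are sparser because vendors publish only the supported pairs, so treat the numbers as an upper bound.

```python
from math import comb


def pairwise_checks(components: int, versions_each: int) -> int:
    """Upper bound on version-pair combinations across every pair of components."""
    return comb(components, 2) * versions_each ** 2


# Five independently released components, three supported versions of each.
assembled = pairwise_checks(components=5, versions_each=3)

# The same five components shipped as one validated bundle: one version each.
bundled = pairwise_checks(components=5, versions_each=1)
```

Under these assumptions the assembled stack has 90 pairings to reason about, against 10 fixed pairings that the vendor has already validated in the bundled case.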

📌

Key Insight

Complexity doesn’t disappear — it just moves. In optional models, it moves to the operator.

🔷 3. The Shift in What Matters

The success criteria for enterprise infrastructure have fundamentally changed.

Old Question

❌ “Do I have the best individual components?”

New Question

✅ “Can my platform run everything — consistently — at scale?”

Including enterprise VMs, Kubernetes workloads, and AI/ML pipelines — on the same operational model, under the same lifecycle management, enforcing the same security policy.

🔶 4. What “Integrated” Actually Means in a VCF Context

Integration is one of the most overloaded words in enterprise IT. Vendors routinely describe a collection of separately licensed, separately patched, separately supported products as an “integrated platform” because they share an API or a common UI skin. That is not integration — that is aggregation with a coat of paint.

True integration, as delivered by VMware Cloud Foundation, means something more fundamental. VCF is not a loose collection of components. It is an engineered system, built to operate as one.

Compute	vSphere
Storage	vSAN ESA
Networking	NSX
Kubernetes	vSphere Kubernetes Service
Operations	VMware VCF Operations
Lifecycle	VCF Ops in conjunction with SDDC Manager

What integration actually delivers:

Single Bill of Materials: vSphere, vSAN ESA, NSX, VKS and VCF Ops are validated, tested, and shipped as a versioned unit. The interoperability matrix is solved by Broadcom — not by your operations team.
Unified Lifecycle Management: VCF Ops orchestrates Day-2 operations — patching, upgrades, cluster expansion — across all stack components in a single guided workflow.
Shared Policy Plane: NSX DFW, vSAN SPBM, and VKS Supervisor Namespaces consume the same identity and policy constructs. Security posture defined once propagates consistently across VM and container workloads.
Native AI & GPU Fabric: VCF 9’s NVIDIA AI Enterprise integration and VKS GPU scheduling work at the platform level — no bolt-on operator, no custom integration project.

✅

What This Enables

A single operational model across VMs, containers, and AI workloads — with one lifecycle, one policy plane, one support contract.

🔷 5. Optionality vs Integration — The Real Trade-Off

The choice is not between good and bad — it is between two fundamentally different operational philosophies. Here is what that looks like in practice.

Dimension	DIY / Assembled Stack	VMware Cloud Foundation
Architecture	Assembled — maximum component choice	Pre-integrated — engineered as a system
Upgrade Coordination	Manual — you own the BOM and compatibility matrix	Automated — VCF Ops in conjunction with SDDC Manager orchestrates end-to-end
Security Policy Consistency	Fragmented — per-layer silos, no enforcement parity	Unified — NSX DFW spans VM + container workloads
AI/GPU Scheduling	Custom — no native shared pool across VM + K8s	Native — VKS Supervisor + NVIDIA AIE integration
Sovereign / Air-Gap	Possible — but requires significant custom work	Designed — built for sovereign deployment patterns
Support Accountability	Multi-vendor — no single throat to choke	Single contract — one Broadcom support engagement
Day-0 Deployment	Weeks to months — integration work starts on day one	Hours — Cloud Builder automation handles bring-up
Operational Risk	Higher — integration gaps are your responsibility	Lower — Broadcom validates the full stack

The assembled model earns its place when flexibility and component choice genuinely matter. VCF earns its place when operational outcomes — upgrade coherence, policy consistency, Day-2 simplicity — are the priority. Know which problem you are actually solving.

🔶 6. Architect’s Take — LCM Is Where It Pays Off Most Visibly

💡

Scaling Principle

You don’t scale by increasing choice. You scale by reducing variability.

Lifecycle management is the unglamorous work that consumes a disproportionate share of infrastructure team capacity. Patching a fragmented 200-node environment with independent networking, storage, and compute upgrade cycles can absorb weeks of engineering time per quarter. That is time not spent on automation, capacity planning, or AI platform delivery.

VCF’s Ops Manager LCM workflow reduces this to a structured, guided operation:

Broadcom pre-validates the combined patch bundle across vSphere, vSAN, NSX, and VKS before release
VCF Ops Manager performs pre-check validation of cluster health, DRS rules, and NSX edge availability before any host enters maintenance mode
Rolling vMotion-aware patching keeps workloads running — no scheduled downtime windows for routine patches
Async patch support in VCF 9 lets you apply critical security fixes to individual components outside the full bundle cadence
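
The gating logic in the list above (check everything first, then patch one host at a time) can be sketched in a few lines. This is a conceptual model only, not how VCF implements it: the `Host` class, the single `healthy` flag, and the function names are all hypothetical stand-ins for the real health, DRS, and NSX edge checks.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    healthy: bool = True
    version: str = "old"


def precheck(hosts: list[Host]) -> list[str]:
    """Return the names of hosts that fail the health gate (empty list means go)."""
    return [h.name for h in hosts if not h.healthy]


def rolling_patch(hosts: list[Host], target: str) -> list[str]:
    """Patch hosts one at a time; refuse to start if any precheck fails."""
    failures = precheck(hosts)
    if failures:
        raise RuntimeError(f"precheck failed for: {', '.join(failures)}")
    order = []
    for host in hosts:
        # In a real cluster, workloads would be live-migrated off the host
        # before it enters maintenance mode; this sketch only records the order.
        host.version = target
        order.append(host.name)
    return order
```

The point of the design is that a failed precheck stops the run before the first host is touched, which is what keeps a bad cluster state from turning into a half-patched one.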

Root Causes of Operational Failure at Scale

Operational inconsistency across teams and workload types
Upgrade risk from unvalidated cross-stack dependencies
Cross-stack debugging with no authoritative owner

An integrated platform directly addresses all three. For regulated industries — financial services, government, healthcare — the ability to demonstrate a coherent, auditable, single-vendor patch history across the entire stack is not an operational preference. It is a compliance requirement.

🔷 7. Why This Matters Even More for AI + Kubernetes

For years, the integration argument was primarily an operational efficiency argument. AI changes the calculus entirely.

GPU-accelerated AI training and inference workloads have characteristics that stress every boundary in a fragmented stack:

NUMA-aware scheduling must be consistent from the hypervisor layer through the container orchestrator. A mismatch breaks CPU–GPU affinity, and you leave 20–30% of GPU performance on the floor.
High-bandwidth east-west traffic between GPU nodes demands network policy enforcement without the overhead of a separately managed overlay.
Shared GPU pools serving both VM-based inference endpoints and Kubernetes training jobs require a scheduler that understands both resource models — which is precisely what VKS on VCF Supervisor delivers.
Observability continuity from vSphere metrics through VCF Operations to the Kubernetes layer means you can correlate a GPU memory spike in a training pod with the underlying ESXi host’s thermal profile — without stitching logs from three separate products.
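
The NUMA point is easier to picture with a toy check. The sketch below flags pods whose pinned CPUs sit on a different NUMA node than their assigned GPU. The topology dictionaries and names are hypothetical; in practice this placement is decided by the hypervisor and the scheduler, not by a script.

```python
def numa_mismatches(
    gpu_numa: dict[str, int],
    cpu_numa: dict[str, int],
    assignment: dict[str, str],
) -> list[str]:
    """Return pods whose CPUs are on a different NUMA node than their assigned GPU."""
    return sorted(
        pod for pod, gpu in assignment.items() if cpu_numa[pod] != gpu_numa[gpu]
    )


# Hypothetical two-socket host: one GPU attached to each NUMA node.
gpu_numa = {"gpu0": 0, "gpu1": 1}
cpu_numa = {"train-a": 0, "train-b": 0}            # both pods pinned to node 0
assignment = {"train-a": "gpu0", "train-b": "gpu1"}

misplaced = numa_mismatches(gpu_numa, cpu_numa, assignment)
```

Here `train-b` would cross the socket interconnect on every CPU-to-GPU transfer, which is the affinity mismatch the first bullet describes.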

Assembled Model

Each new capability — GPU workloads, multi-tenant K8s, high-perf storage — becomes a new integration point and a new failure domain

Integrated Model (VCF)

Each new capability is part of the same system — inherited policy, lifecycle, and observability included on day one

🚀

Faster deployment. Lower risk. Consistent operations across VM, container, and AI workloads.

🔶 8. The Platform Multiplier Effect

Here is the compounding argument that does not get made enough: integration creates a multiplier effect on every new capability you deploy.

When VKS lands in a VCF environment, it does not arrive as an isolated Kubernetes cluster. It inherits NSX micro-segmentation, vSAN SPBM storage policies, vSphere HA and DRS scheduling intelligence, and VCF Operations observability — on day one, without custom integration work. A standalone Kubernetes distribution requires weeks of effort to reach equivalent operational parity with the surrounding infrastructure.

The same logic applies to NVIDIA AI Enterprise on VCF, to VCF Automation (VCFA) for self-service provisioning, and to every future capability Broadcom ships as part of the platform. Each addition is additive — not additive-plus-integration-project.

Over a five-year horizon, this multiplier is where integrated platforms generate the most measurable TCO advantage.

🔷 9. When Integration Is the Wrong Answer

Intellectual honesty requires acknowledging this: integrated platforms are not universally the right answer.

⚖️

Be Honest With Your Context

VCF is optimised for organisations running mixed VM and container workloads at scale, in regulated or sovereign environments, where operational consistency and single-vendor accountability matter. If that profile does not match yours, acknowledge it.

If your organisation has a dominant public cloud strategy and on-premises infrastructure is genuinely residual, VCF’s operational depth may not be justified at small scale
If you have deep in-house expertise in specific open-source components and the engineering capacity to maintain integration glue, DIY can work — and can be cheaper at certain scales
If your primary requirement is developer-facing Kubernetes with no legacy VM estate, a lighter-weight distribution may be sufficient

Your architecture should match your actual operational context — not a vendor’s reference diagram.

🔶 10. Verdict

The goal is not to build infrastructure. The goal is to run applications — reliably and at scale.

🎯 Three Principles That Hold

✔ Integration matters more than optionality
✔ Consistency matters more than customization
✔ Operational simplicity matters more than theoretical flexibility

VMware Cloud Foundation represents this integrated approach — delivering a platform designed to run everything, not just host it. The components beneath — ESXi, vSAN ESA, NSX — are best-in-class. But the durable value is VCF Ops Manager, Supervisor Namespaces, and the unified policy plane that ties them together. That is the investment that compounds.

🔥 Final Thought

Enterprises don’t fail because they lack choice. They fail because they underestimate complexity. The right platform is the one that removes that complexity — not the one that distributes it. As infrastructure demands continue to grow — driven by AI workloads, sovereign mandates, and the accelerating pace of platform feature delivery — the organisations that have invested in integrated foundations will absorb that complexity without proportionally growing their operations teams. That is why the integration debt nobody budgets for is also the one that VCF was built to eliminate.

Further reading on vmtechie.blog: · VCF Fleet Sizer Tool · VCF Upgrade Path Planner

April 1, 2026

Why VCF with VKS is a Stronger Enterprise Choice Than KubeVirt


KubeVirt is a capable open-source project and a legitimate choice in the right context. But when the workload is enterprise AI at scale — GPU clusters, production AI factories, regulated environments — the gap between VKS with VCF and KubeVirt is not a minor preference. It spans architecture, operations, governance, and enterprise transformation strategy.

PREMISE Let’s Be Honest About KubeVirt First

A technically credible argument never starts by dismissing the competition. KubeVirt is a real, production-used project with genuine strengths. Let’s acknowledge them honestly before making the VKS case.

Where KubeVirt genuinely wins: Cloud-native purists wanting a single Kubernetes control plane for everything. Cost-sensitive environments where ESXi licensing is a barrier. Dev/test scenarios where VM-grade isolation isn’t critical. Upstream OSS communities wanting full control over the stack. Teams with deep Kubernetes operational maturity who want to manage VMs and containers through a unified API.

If your organisation is already 100% Kubernetes-native with no enterprise VM workloads or compliance requirements, KubeVirt is a reasonable choice. That’s the honest truth. This is not a case of good vs bad — it is a case of enterprise integration vs architectural freedom.

But here’s the equally honest truth: for enterprise AI infrastructure — GPU clusters, DGX/HGX environments, production AI factories, regulated tenancy — VKS with VCF tends to hold a stronger position across most architectural and operational dimensions that matter to enterprise teams. Here’s the case, dimension by dimension.

00 The Core Difference: Integrated Platform vs Extension Model

Before diving into technical specifics, it’s worth understanding the conceptual gap — because it explains every practical difference that follows.

With VKS, Kubernetes is delivered as a built-in service on top of the VMware infrastructure stack. It is tightly integrated with vSphere, storage, networking, policy, and lifecycle management. It is designed as part of the platform — not added to it.

With KubeVirt, virtualisation is added into Kubernetes as an extension. It is an innovative approach, but it still means you are effectively layering VM functionality into an environment originally built for containers. In practice, VKS gives enterprises a unified operating model. KubeVirt often introduces more integration points, more dependencies, and more operational responsibility.

The directional difference: KubeVirt extends Kubernetes to run VMs. VKS extends a mature enterprise virtualisation platform to run Kubernetes properly. In production, that direction matters more than it appears on a whiteboard.

01 Hypervisor Architecture — Purpose-Built vs Added On

The most fundamental difference is architectural. KubeVirt layers VM capability onto a system designed for containers. VKS extends a hypervisor designed from day one to run workloads with hardware-level isolation.

KubeVirt Stack

Application / AI Workload

↓

QEMU/KVM Process

↓

Container (Pod)

↓

Kubernetes Node

↓

Linux Kernel

↓

Hardware

VKS with VCF Stack

Application / AI Workload

↓

Container / Kubernetes Pod

↓

VM (vSphere Supervisor)

↓

ESXi Microkernel (Type-1)

↓

Hardware

ESXi is a Type-1 bare-metal hypervisor — it runs directly on hardware with a microkernel architecture under 150MB in size. It was designed to do one thing exceptionally well: run workloads with deterministic performance and hardware isolation. VMs and containers on VKS are both first-class constructs — not one emulating the other.

The analogy: KubeVirt is running a city inside a shipping container. VKS is building a city on actual land. Each abstraction layer in KubeVirt compounds — adding latency, scheduling complexity, and failure domains that are less pronounced in a purpose-built hypervisor model.

02 GPU & AI Workload Performance — The Widest Gap

This is the dimension that matters most for anyone building NVIDIA AI infrastructure. The gap here is not marginal — it is architectural.

KubeVirt GPU Reality

GPU passthrough to VMs via KubeVirt requires VFIO/IOMMU — complex to configure, brittle in production, and requiring deep Linux kernel expertise. More critically:

No native MIG (Multi-Instance GPU) awareness — partitioning must be configured externally
GPU sharing across VMs and containers in the same cluster is operationally complex
No current equivalent of NVIDIA vGPU time-slicing with hardware-enforced QoS guarantees
The KubeVirt device plugin model does not yet integrate cleanly with MIG partition profiles

VKS with VCF with NVIDIA AI Enterprise

This is the explicitly certified, supported path for enterprise NVIDIA GPU deployments:

NVIDIA vGPU natively supported on ESXi — VMs get dedicated vGPU profiles (A100-40C, H100-80C) with hardware-enforced QoS [1]
MIG partitioning integrates cleanly — a single H100 can serve multiple Kubernetes pods and VMs simultaneously with hard partition isolation [2]
NVIDIA GPU Operator supports vSphere Supervisor as a validated deployment target
NVIDIA AI Enterprise is explicitly certified on vSphere — the recommended enterprise path for DGX/HGX production deployments [3]

# VKS: GPU resource request (clean, native)
resources:
  limits:
    nvidia.com/gpu: 1
    # vGPU profile enforced at hypervisor level
    # MIG partitioning transparent to workload
    # QoS guaranteed by ESXi scheduler
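
The fragment above only shows the resources stanza. For context, here is a minimal sketch that builds a complete Pod manifest around it in Python and prints it as JSON, which kubectl accepts alongside YAML. The pod name and image path are placeholders I made up, and the sketch assumes the `nvidia.com/gpu` resource is already advertised on the cluster's nodes.

```python
import json


def gpu_pod(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests whole GPUs via nvidia.com/gpu."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; substitute your own.
    pod = gpu_pod("cuda-smoke-test", "harbor.corp.local/migration/cuda-test:latest")
    print(json.dumps(pod, indent=2))
```

Redirect the output to a file and validate it with a server-side dry run before applying it for real.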

03 Security & Isolation — 20 Years vs 5 Years

Security is where enterprise architects lose sleep — and where VKS has the most compelling, battle-tested story.

KubeVirt’s Security Model

VM isolation in KubeVirt depends on the container runtime security boundary plus QEMU process isolation. A compromised container runtime (containerd, runc vulnerability) can potentially affect the QEMU process hosting the VM. Nested virtualisation increases the kernel attack surface. RBAC for VM operations is layered onto Kubernetes RBAC — not purpose-built for multi-tenant VM isolation.

VKS + NSX Security Model

ESXi’s VMX process isolation is 20+ years hardened. Each VM is fully isolated at the hypervisor level regardless of what happens in the container layer above. Beyond that:

NSX Distributed Firewall (DFW) applies microsegmentation at the vNIC level — every Kubernetes pod can have firewall policy enforced at the hypervisor, not just the overlay network [4]
vSphere Trust Authority and TPM integration provide cryptographic attestation of host state before VMs are allowed to run — KubeVirt currently has no comparable integrated mechanism
Regulatory compliance (PCI-DSS, HIPAA, SOC2) control mapping for vSphere is well-established and widely audited; equivalent mappings for KubeVirt environments are still maturing
ESXi security patches are coordinated and tested against the full vSphere stack — KubeVirt kernel updates require independent validation across the QEMU/KVM/container runtime chain

04 Day-2 Operations — Where the Pain Is

Every infrastructure architect knows that Day-1 deployment is 10% of the story. Day-2 operations — patching, upgrades, live migration, monitoring — is where you live for the next 3-5 years.

VCF / VKS Capability	KubeVirt Equivalent
vMotion — zero-downtime live migration	Basic VM migration (no storage vMotion)
VCF Lifecycle Manager — full stack upgrade	Manual Kubernetes + KubeVirt operator coordination
VCF Operations — unified VM + container observability	Separate toolchains (Prometheus + custom exporters)
VKS K8s upgrades decoupled from vCenter lifecycle	K8s + KubeVirt operator + host OS must be co-validated
vSphere Lifecycle Manager (vLCM) — coordinated patching	DIY patching across kernel, QEMU, CRI, CNI layers
SPBM — storage QoS policy across VMs + PVCs	CSI only, no differentiated storage QoS

VCF Lifecycle Manager manages the entire stack — ESXi, vCenter, NSX, vSAN, and Kubernetes cluster versions — in a single coordinated upgrade workflow. In KubeVirt environments, version skew between the Kubernetes release, KubeVirt operator version, QEMU version, and the host kernel is a recurring operational hazard that requires dedicated engineering effort to manage safely.

One of the most underappreciated advantages of VKS is that Kubernetes cluster upgrades are fully decoupled from vCenter upgrades. In practice, this means platform teams can roll out new Kubernetes versions (moving from 1.28 to 1.29 to 1.30) independently, without waiting for a vCenter maintenance window or coordinating with the infrastructure team managing the underlying SDDC. Each VKS cluster has its own lifecycle, managed via the Supervisor and VCF LCM, with no hard dependency on the vCenter version for day-to-day Kubernetes updates.

Compare this to KubeVirt, where the Kubernetes control plane, KubeVirt operator, and host OS are all tightly coupled: a Kubernetes minor version upgrade requires validating compatibility across all three layers simultaneously. For enterprises running multiple Kubernetes clusters across workload domains, VKS’s decoupled upgrade model is a significant operational advantage.
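
One practical detail behind the 1.28 to 1.29 to 1.30 example: upstream Kubernetes does not support skipping minor versions on a control plane upgrade, so a multi-version jump is really a sequence of hops. A minimal sketch of planning those hops follows. The function name is hypothetical, and it assumes your VKS release offers every intermediate version, which you should confirm against the release notes.

```python
def minor_upgrade_path(current: str, target: str) -> list[str]:
    """List each minor version to step through, one hop at a time."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are modelled")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


hops = minor_upgrade_path("1.28", "1.30")
```

Because each cluster carries its own lifecycle, a plan like this can be run per cluster on the platform team's schedule.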

05 Networking — NSX vs CNI Complexity

Networking for AI workloads is not just about connectivity — it’s about bandwidth, latency, topology awareness, and security policy across a mixed VM and container estate.

KubeVirt Networking Complexity

VM network interfaces in KubeVirt are exposed as secondary interfaces via Multus — requiring careful co-ordination between multiple CNI plugins. SR-IOV for VM workloads requires manual IOMMU/VF configuration per node. There is no unified microsegmentation plane between VMs and pods — policy must be applied at multiple layers independently.

VKS + NSX — Unified Fabric

NSX provides a single network fabric for both VMs and Kubernetes pods. The same DFW policy engine applies to both. NSX Advanced Load Balancer (Avi) handles Kubernetes ingress and LoadBalancer services natively with full traffic visibility across both VM and container workloads. Critically for AI infrastructure, NSX supports Geneve overlay with hardware offload to SmartNICs, including BlueField DPUs, which aligns directly with NVIDIA’s AI factory reference architecture.

06 Enterprise Transformation Reality — The Mixed Workload Problem

Most enterprise modernisation conversations get derailed by a false premise: that organisations are either “all VMs” or “all containers.” The reality, in virtually every large enterprise, is a persistent mix that will not resolve cleanly for years.

A typical enterprise estate in 2026 includes: traditional VM-based business applications, modern microservices and cloud-native workloads, packaged enterprise software with no container-native path, data platforms and stateful databases, and security or compliance-sensitive workloads requiring strict isolation guarantees. VKS is designed for this hybrid reality. It does not force everything into a Kubernetes-first abstraction before the organisation is ready for it.

The modernisation argument: VKS allows organisations to modernise without forcing them to abandon the operational model they already trust. Infrastructure teams keep using the VMware foundation they know — while platform teams gain access to Kubernetes in a way that feels native to the environment. That makes transformation more realistic, not just more aspirational.

Operational Risk — The Questions That Matter

When enterprises evaluate platforms, they often focus too much on feature checklists and not enough on operational risk. The real questions are not just “Can this run VMs and containers?” They are:

How hard is it to support at 2am when something breaks?
How predictable are upgrades across the full stack?
How many teams need to coordinate for a routine patch?
How many integration gaps need to be owned and maintained internally?
How fast can issues be isolated and root-caused in a mixed VM/container environment?

VKS reduces this risk because the platform is more cohesive — fewer seams between layers, fewer teams needed, fewer custom integrations to maintain. KubeVirt can be very attractive architecturally, but it assumes a higher level of Kubernetes operational maturity and a stronger tolerance for platform engineering complexity that most enterprise IT organisations do not have the staffing to sustain.

07 Governance & Private Cloud Readiness

For regulated industries, sovereign cloud environments, and enterprise private clouds, governance matters just as much as technology capability. Organisations need consistent policy, security boundaries, visibility, and controlled operations. They need to know who owns what, how workloads are deployed, and how infrastructure changes are managed.

This is where VMware’s enterprise DNA shows. VKS fits naturally into environments that require structure, compliance, and clear operational accountability:

Role-based access control unified across VMs, Kubernetes namespaces, and vSphere objects — one policy model, not two
Audit trails from vCenter and NSX cover both VM and container operations in a single log stream [5]
Change management integration — VCF’s API surface maps cleanly to ITSM platforms (ServiceNow, Jira Service Management)
Sovereign cloud readiness — vSphere’s tenancy model and encryption capabilities are mapped to GDPR, data residency, and sovereign cloud frameworks across APAC, EU, and regulated US sectors

KubeVirt can absolutely be used in serious environments — but it is more often the right fit for organisations that want deeper open-source flexibility and are comfortable owning more of the platform decisions themselves. For most enterprise private clouds, that is not a trade-off they are willing to make.

08 Head-to-Head Summary

Dimension	VKS with VCF	KubeVirt
Platform Model	✅ Integrated — Kubernetes is native to the stack	⚠️ Extension model — VMs added onto Kubernetes
GPU / AI Workloads	✅ vGPU, MIG, NVIDIA AI Enterprise certified	⚠️ VFIO passthrough, limited MIG integration
Security Isolation	✅ 20+ yr hardened VMX, NSX microsegmentation	⚠️ QEMU-in-container, larger attack surface
Live Migration	✅ vMotion — zero-downtime, storage + compute	⚠️ Functional but no storage vMotion equivalent
Lifecycle Management	✅ VCF LCM unified + K8s upgrades decoupled from vCenter	❌ K8s, KubeVirt operator & host OS must be co-validated
Networking	✅ NSX unified VM + container fabric + DPU offload	⚠️ Multus + multi-CNI complexity
Storage QoS	✅ SPBM across VMs + PVCs, vSAN ESA	⚠️ CSI only, no differentiated QoS
Mixed Workload Support	✅ Native — VMs and containers are co-equals	⚠️ Container-first; VMs require abstraction overhead
Governance & Compliance	✅ Unified RBAC, audit, PCI/HIPAA/SOC2 controls	⚠️ Immature compliance tooling, separate audit streams
Operational Risk	✅ Cohesive platform, fewer integration gaps	❌ Higher ownership burden, more seams to maintain
Observability	✅ Unified VM + container via VCF Operations	⚠️ Separate toolchains required
NVIDIA Certification Path	✅ Explicit NCP-AII / NVIDIA AI Enterprise support	❌ Not part of NVIDIA enterprise certification stack
Cost (Licensing)	⚠️ VCF licensing required	✅ Open source, no hypervisor licensing

// The Directional Argument

KubeVirt makes Kubernetes run VMs.
VKS makes a production-hardened hypervisor run Kubernetes.

When the workload is enterprise AI at scale, the foundation matters more than the interface. Choose your substrate based on the operational reality you’ll live with for the next five years.

CLOSING The Right Tool for the Right Job

KubeVirt will continue to evolve. The upstream community is active, and features like live migration and GPU support are maturing. For greenfield cloud-native organisations without legacy VM estates or strict compliance requirements, it deserves serious evaluation.

Where KubeVirt is the better fit: If your organisation is already deeply Kubernetes-native, your team has strong platform engineering capability, you want to avoid hypervisor licensing costs, and you are comfortable owning more of the integration decisions — KubeVirt is a legitimate and architecturally coherent choice. Open-source flexibility and a Kubernetes-first operating model are real advantages in the right context.

But for enterprise organisations running AI workloads on NVIDIA DGX/HGX infrastructure, managing regulated environments, and needing proven lifecycle tooling across a mixed VM and container estate — VKS with VCF offers a more mature, better-integrated, and lower-risk path. It is the architecture that has been most thoroughly validated for this use case in production enterprise environments.

The question was never “containers vs VMs.” The question is: what platform will reduce operational complexity rather than relocate it?

My view: VKS is the stronger enterprise choice. Not because KubeVirt lacks innovation. Not because Kubernetes is weak. But because VKS is aligned with enterprise operational reality — and in production, that alignment is what separates an exciting architecture from a platform you can actually sustain.

KubeVirt moves complexity from the hypervisor layer into your Kubernetes operations team. VKS distributes it across a tested, integrated platform with decades of enterprise hardening. For most organisations, that trade-off has a clear answer.

And in enterprise IT, that is often what separates an exciting architecture from a successful platform.

March 24, 2026

Planning a VMware Cloud Foundation 9.0 Upgrade? Start Here…
vmtechie.blog · Infrastructure Tools

I Built a VCF Upgrade Path Planner — Here’s Why

Tool: VCF Upgrade Path Planner · Covers: 8 upgrade paths · Target: VCF 9.0 / 9.0.2

If you’ve ever had to plan a VMware Cloud Foundation upgrade from scratch, you know how scattered the information can be — KB articles here, TechDocs pages there, blog posts from different release cycles, and no single place that ties it all together into a clear, ordered sequence.

That frustration is exactly what drove me to build the VCF Upgrade Path Planner. As someone who works with VCF environments day-to-day and runs vmtechie.blog to share practical infrastructure knowledge with the community, I wanted to create something that gives engineers a solid starting point before they walk into a maintenance window — a tool that reflects real-world upgrade sequencing, not just the high-level marketing overview.

Example — vSphere 7.0 → VCF 9.0 upgrade journey

This planner covers eight upgrade paths — spanning vSphere 7.0, 7.0 U2/U3, 8.0, and 8.0 U2/U3 converge routes to VCF 9.0, the VCF 5.0 and 5.1/5.2 in-place upgrade paths, the 9.0.0/9.0.1 to 9.0.2 maintenance path, and a current-state check for VCF 9.0.2 — all linked directly to official Broadcom Knowledge Base articles, TechDocs pages, and VMware blog posts so you can verify every recommendation against authoritative source material.

All 8 Upgrade Paths Covered

§

Why I Built This


This planner covers eight upgrade paths spanning vSphere 7.0, 7.0 U2/U3, 8.0, and 8.0 U2/U3 converge routes to VCF 9.0, the VCF 5.0 and 5.1/5.2 in-place upgrade paths, the 9.0.0/9.0.1 to 9.0.2 maintenance path, and a current-state check for VCF 9.0.2 — all linked directly to official Broadcom Knowledge Base articles, TechDocs pages, and VMware blog posts so you can verify everything against authoritative source material. A significant amount of research, testing, iteration, and community review has gone into getting the sequencing, version gates, and critical warnings right. That said, VCF is a complex and fast-moving platform, and I’m one person — so if you spot a step that’s missing, a version gate that’s wrong, or guidance that doesn’t match your experience in the field, please reach out and let me know. Every piece of feedback makes this tool better for everyone in the community.

🔗

Everything is sourced
Every step links directly to the relevant Broadcom KB, TechDocs page, or VMware blog post so you can verify each recommendation against authoritative source material before acting on it.

⚠️

Critical gates are flagged
Version gates, one-way doors, and ordering requirements — like the Aria Operations 8.18 gate, the NSX Edge OVF certificate expiry fix in 9.0.2, and the mandatory vLCM Baseline-to-Image transition — are surfaced prominently, not buried in footnotes.

§

How We Calculate Time, Risk & Effort

The complexity numbers shown in each upgrade path — estimated duration, risk score, and effort score — are not pulled from a vendor SLA document. They are practical estimates built from field experience with VCF environments of varying sizes and community input from engineers who have executed these upgrades in production. Here is how each metric is derived.

Example — VCF 5.0 → VCF 9.0 path

Duration: 4–8 weeks (estimated)
Risk Score: 50 out of 100
Effort Score: 65 out of 100

Duration

Estimated based on the number of sequential phases in the path, the number of components that require ordered upgrades (SDDC Manager → NSX → vCenter → ESXi is always serial, never parallel), and the realistic time each component upgrade takes in a mid-sized environment. Converge paths from vSphere carry additional time for pre-converge remediation, vLCM Baseline-to-Image transitions, and the VCF Installer workflow itself. Paths starting from VCF 5.0 carry extra time for the mandatory VCF 5.2 intermediate hop. These are conservative estimates — your actual duration will vary based on node count, hardware speed, precheck findings, change management windows, and whether you are running a lab or a production fleet.

💡

What is RDU (Reduced Downtime Upgrade)?
Starting with VCF 9.0, vCenter upgrades exclusively use Reduced Downtime Upgrade (RDU). Instead of upgrading in-place and taking the existing vCenter offline for the full duration, RDU deploys a brand-new temporary vCenter appliance alongside the existing one, migrates all configuration and inventory data across while the environment stays running, then decommissions the old appliance. The result is a much shorter management plane outage — typically just a few minutes for the final cutover rather than the extended downtime of a traditional in-place upgrade. In VCF 9.0.1+, the Installer automatically assigns a 169.254.x.x link-local IP address for the temporary appliance, so you no longer need to pre-stage a static IP on your management network in most environments. RDU is only required for major version jumps (e.g. 8.x → 9.x) — within-9.x maintenance updates use a regular in-place upgrade with no temporary appliance needed.

Risk Score

A relative measure from 0 to 100 that reflects how many irreversible transitions the path contains, how many components must be upgraded in strict sequence, and how much room there is to safely roll back if something goes wrong. A vSphere 7.0 converge path scores higher risk not because converge is inherently dangerous, but because it involves more one-way doors — once the VCF Installer runs and creates the management domain, you cannot unconverge back to standalone vSphere. Maintenance paths like 9.0.0 to 9.0.2 score low risk because they involve fewer components, shorter windows, and well-understood rollback via snapshot.

Effort Score

Reflects the total planning and execution workload — number of discrete steps, number of decisions that require engineer judgment rather than automation, number of separate maintenance windows required, and the degree of documentation and preparation needed before you can safely begin. A vSphere 7.0 to VCF 9.0 path scores high effort not because any single step is especially hard, but because the cumulative preparation — HCL checks, Baseline-to-Image transitions, ELM removal, VCF Installer staging, Aria Suite pre-work, workload domain imports — adds up to a substantial project even before the first upgrade window opens.
⏱️

Duration Factors

Sequential component count
Intermediate hops required
Pre-converge remediation
Workload domain count
Aria Suite pre-work

🎯

Risk Factors

One-way door transitions
Rollback constraints
NSX version direction rules
vCenter RDU complexity
ELM removal requirements

🏗️

Effort Factors

Total discrete steps
Judgment calls required
Separate change windows
Documentation prep
Depot configuration work

All three scores scale relative to each other across the eight paths, so they are most useful as a comparison tool — if you are deciding between targeting VCF 9.0.0 or 9.0.1, or choosing whether to converge from vSphere 8.0 U3 versus waiting to patch to U3 first, the scores give you a quick read on the relative complexity trade-off. They are starting points for your own planning conversation, not guarantees — always validate your specific environment against official Broadcom documentation and run the SDDC Manager upgrade prechecks before committing to a maintenance window.

§

A Community Tool

VCF is a complex and fast-moving platform, and I’m one person. A significant amount of hard work has gone into building and refining this planner: cross-referencing every step against official Broadcom documentation, KB articles, and VMware engineering blog posts, running it through multiple review cycles, and iterating on the content based on community feedback. But if you spot a step that’s missing, a version gate that’s wrong, or guidance that doesn’t match your experience in the field, please reach out and let me know. Drop a comment below or contact me directly; every piece of feedback makes this tool better for everyone in the community.

Spotted something missing or incorrect?

Drop a comment below or reach out directly. Your field experience makes this tool better for the whole community.
Leave Feedback ↓

🚀

Try the VCF Upgrade Path Planner

Open the tool directly on vmtechie.blog and generate your tailored upgrade plan in seconds.

Open the Planner →
March 17, 2026

How the VCF 9 Fleet Sizer Actually Works

A complete walkthrough of every calculation behind the tool — from raw NVMe capacity to ESA protection factors, NVMe memory tiering, and VCF licence entitlement. No black boxes.

What the tool sizes
Host specification inputs
Management VM stack
Compute sizing formula
vSAN ESA storage pipeline
Protection policies & PF table
Final host count & limiter
NVMe memory tiering
External storage mode
VCF licence entitlement
Principal storage options (KB 416270)
Assumptions & caveats

1. What the tool sizes

The VCF 9 Fleet Sizer calculates the minimum number of ESXi hosts required across a VMware Cloud Foundation deployment — one Management Domain and any number of VI Workload Domains. For each domain it independently determines whether CPU, memory, or storage is the binding constraint, and returns the host count driven by the most demanding dimension.

The sizer is built specifically for VCF 9 with vSAN ESA — the Express Storage Architecture that requires NVMe-only drives and operates as a single storage tier without a separate cache/capacity split. It also models external storage mode (Fibre Channel, NFS) where hosts are sized on compute and memory only, and a disaggregated NVMe memory tiering model unique to VCF 9.

⚠️ Planning aid only — not an official Broadcom tool. All outputs are estimates based on the inputs you provide. Validate every design against official Broadcom documentation, the VMware HCL, and field engineering guidance before procurement or deployment. Real-world DRR and vSAN overheads vary significantly by workload.

2. Host specification inputs

Every domain (management and each WLD) has an independent host specification. The tool does not assume all hosts are identical across domains — a management cluster might run 2×16c hosts while a production WLD uses 2×32c AI-optimised nodes.

Input	Default	Used in	Notes
CPU Qty	2	Core count, licensing	Sockets per host
Cores per CPU	16	Core count, licensing	Physical cores — no hyperthreading multiplier applied
RAM (GB)	1,024	Memory sizing	Total usable host RAM
NVMe Qty	6	Storage sizing	NVMe drives per host (vSAN ESA only)
NVMe Size (TB)	7.68	Storage sizing	TB decimal — converted to GB via ×1,000
CPU Oversubscription	2×	Usable vCPU	vCPU:pCPU ratio — applies before reserve
RAM Oversubscription	1×	Usable RAM	1× = no oversubscription. Rarely exceed 1× for RAM
Compute Reserve %	30%	Usable vCPU & RAM	Headroom withheld from placement (HA, overhead)

Raw capacity per host formulas:

			
Host Cores     = CPU Qty × Cores per CPU
Raw GB per Host = NVMe Qty × NVMe Size (TB) × 1,000

⚠️ No hyperthreading multiplier. The sizer deliberately does not multiply physical cores by 2 for hyperthreading. Logical thread counts are workload-specific and highly variable. Instead, the CPU oversubscription ratio gives you explicit control. A 2× ratio on a 32-core host models the same headroom as a 64-thread count at 1× — but you’re aware you’re making that choice.

3. Management VM stack

The Management Domain hosts a fixed stack of VCF infrastructure VMs. These are not user workloads — they are the control plane. Their combined vCPU, RAM, and disk demand is the entire sizing input for the management cluster. The tool carries an accurate per-component VM stack based on current VCF 9 T-shirt sizes from Broadcom documentation.

Component	Sizes	vCPU range	RAM range	Disk range
vCenter Server (Mgmt)	S / M / L / XL	4 – 24	21 – 58 GB	694 – 2,283 GB
NSX Manager	M / L / XL	6 – 24	24 – 96 GB	300 – 400 GB
NSX Edge	S / M / L / XL	2 – 16	4 – 64 GB	200 GB
NSX Global Manager	S / M / L / XL	4 – 24	16 – 96 GB	300 – 400 GB
Avi Load Balancer	S / M / L	8 – 24	24 – 48 GB	128 – 512 GB
vCenter Server (WLD)	S / M / L / XL	4 – 24	21 – 58 GB	694 – 2,283 GB
VCF Operations (SDDC Mgr)	S / M / L / XL	4 – 24	16 – 128 GB	274 GB
VCF Operations Collector	S / M	2 – 4	8 – 32 GB	144 GB
VCF Operations for Logs	S / M / L	12 – 48	24 – 96 GB	1,590 GB
VCF Operations for Networks	L / XL / XXL	12 – 48	24 – 96 GB	1,590 GB
VCF Net. Collector	M / L / XL / XXL	4 – 16	12 – 48 GB	200 – 300 GB
Identity Manager	Embedded / HA	0 – 32	0 – 64 GB	0 – 400 GB

Management sizing is deterministic: configure your component sizes, and the tool sums the total vCPU, RAM, and disk demand — no workload VM estimates needed.

4. Compute sizing formula

For Workload Domains, tenant demand is specified as VM count × per-VM averages for vCPU, RAM, and disk. Infrastructure VMs (NSX Edges, VKS Supervisor nodes) can optionally be included in the WLD demand totals. All demands are then sized against the host specification to determine the compute host floor.

WLD demand totals:

			
Demand vCPU = (VMs × vCPU/VM) + Infra vCPU
Demand RAM  = (VMs × RAM/VM)  + Infra RAM
Demand Disk = (VMs × Disk/VM) + Infra Disk

Usable capacity per host:

			
Usable vCPU/host = Host Cores × CPU Oversub × (1 − Reserve%)
Usable RAM/host  = Host RAM   × RAM Oversub  × (1 − Reserve%)

Compute host floors (evaluated independently):

			
CPU Hosts = ⌈ Demand vCPU / Usable vCPU per host ⌉
RAM Hosts = ⌈ Demand RAM  / Usable RAM  per host ⌉

Example: 200 VMs × 4 vCPU = 800 vCPU demand. Host: 2×16c = 32 physical cores × 2× oversub × 0.70 reserve factor = 44.8 usable vCPU/host. CPU Hosts = ⌈ 800 / 44.8 ⌉ = 18 hosts.
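The two compute floors can be sketched in a few lines of Python. This is an illustrative sketch of the formulas above, not the sizer's actual source: the function names are mine, and the 16 GB per VM figure in the RAM example is an assumed value chosen only to show the second floor.

```python
import math

def usable_per_host(raw: float, oversub: float, reserve_pct: float) -> float:
    """Usable capacity per host = raw capacity x oversubscription x (1 - reserve)."""
    return raw * oversub * (1 - reserve_pct / 100)

def host_floor(demand: float, usable: float) -> int:
    """Smallest whole number of hosts whose usable capacity covers the demand."""
    return math.ceil(demand / usable)

# CPU floor: the worked example from the text (200 VMs x 4 vCPU, 2x16c hosts).
usable_vcpu = usable_per_host(raw=2 * 16, oversub=2.0, reserve_pct=30)
cpu_hosts = host_floor(demand=200 * 4, usable=usable_vcpu)

# RAM floor: 1,024 GB hosts at 1x oversubscription; 16 GB per VM is an assumption.
usable_ram = usable_per_host(raw=1024, oversub=1.0, reserve_pct=30)
ram_hosts = host_floor(demand=200 * 16, usable=usable_ram)

print(f"usable vCPU/host = {usable_vcpu:.1f}, CPU hosts = {cpu_hosts}")
print(f"usable RAM/host  = {usable_ram:.1f} GB, RAM hosts = {ram_hosts}")
```

Both floors are evaluated independently; whichever is larger carries forward into the final host count.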

5. vSAN ESA storage pipeline

vSAN ESA storage sizing is a sequential pipeline of capacity transformations. Each stage adds overhead for a specific reason. Starting from raw VM disk demand, the pipeline applies data reduction, swap space, protection overhead, free space reserve, and growth buffer — in that order — to arrive at the total raw capacity required and therefore the storage host floor.

Pipeline stages:

			
Step 1 — VM Capacity GB  = Demand Disk GB ÷ DRR
                           (DRR = Dedup Ratio × Compression Ratio)
Step 2 — Swap GB         = Demand RAM GB × VM Swap%
                           (100% for mgmt, configurable for WLD)
Step 3 — Interim GB      = VM Capacity GB + Swap GB
Step 4 — Protected GB    = Interim GB × Protection Factor (PF)
Step 5 — With Free GB    = Protected GB × (1 + vSAN Free%)
Step 6 — Total Required  = With Free GB  × (1 + Growth%)

		

Storage host floor:

			
Effective Hosts      = Total Hosts − Failures to Tolerate
Per-Host Requirement = Total Required GB ÷ Effective Hosts
Storage Hosts        = ⌈ Total Required GB / Raw GB per Host ⌉ + Failures
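As a rough sketch, the six pipeline steps and the storage host floor translate to Python as follows. The function and parameter names are mine, and every input value in the example (disk demand, RAM demand, free space %, growth %) is hypothetical; none of them are tool defaults. The sketch stays in decimal GB, exactly as the formulas above are written.

```python
import math

def esa_total_required_gb(*, demand_disk_gb, demand_ram_gb, dedup, compression,
                          swap_pct, protection_factor, free_pct, growth_pct):
    """Walk Steps 1-6 of the pipeline and return Total Required GB."""
    drr = dedup * compression
    vm_capacity = demand_disk_gb / drr              # Step 1
    swap = demand_ram_gb * swap_pct / 100           # Step 2
    interim = vm_capacity + swap                    # Step 3
    protected = interim * protection_factor         # Step 4
    with_free = protected * (1 + free_pct / 100)    # Step 5
    return with_free * (1 + growth_pct / 100)       # Step 6

def storage_host_floor(total_required_gb, nvme_qty, nvme_tb, failures):
    """Storage Hosts = ceil(Total Required GB / Raw GB per Host) + Failures."""
    raw_gb_per_host = nvme_qty * nvme_tb * 1000
    return math.ceil(total_required_gb / raw_gb_per_host) + failures

# Hypothetical inputs: 100 TB of VM disks, 3,200 GB of RAM, no data reduction.
total_gb = esa_total_required_gb(
    demand_disk_gb=100_000, demand_ram_gb=3_200,
    dedup=1.0, compression=1.0, swap_pct=100,
    protection_factor=1.5, free_pct=25, growth_pct=10,
)
hosts = storage_host_floor(total_gb, nvme_qty=6, nvme_tb=7.68, failures=1)
print(f"Total required: {total_gb:,.0f} GB, storage hosts: {hosts}")
```

Keeping each step on its own line makes it easy to print the intermediate values and see which stage contributes the most overhead.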

Data Reduction Ratio (DRR)

The tool splits DRR into two separate inputs: Dedup Ratio and Compression Ratio. DRR = Dedup × Compression. Both default to 1.0 (no reduction) because real-world ratios depend entirely on data entropy — databases compress poorly, VDI golden images deduplicate extremely well. Using optimistic DRR values leads to undersized storage clusters.

⚠️ DRR above 2.0 is optimistic. Unless you have measured DRR from an equivalent workload in your environment, keep both ratios at 1.0. A DRR of 2.0 halves your storage host count. If the real-world ratio comes in at 1.2, you’ll need significantly more hosts than planned.

TiB conversion

The tool uses binary TiB throughout. NVMe drives are marketed in TB decimal (1 TB = 1,000 GB). Conversion: 1 TB = 1,000 GB = 0.9095 TiB. A 6× 7.68 TB host = approximately 41.9 TiB raw per host after conversion.
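The conversion is easy to check by hand. A minimal sketch (the helper name is mine):

```python
# 1 TB (decimal) = 10**12 bytes; 1 TiB (binary) = 2**40 bytes.
TB_TO_TIB = 10**12 / 2**40

def raw_tib_per_host(nvme_qty: int, nvme_tb: float) -> float:
    """Raw capacity of one host in TiB, from drive count and marketed TB size."""
    return nvme_qty * nvme_tb * TB_TO_TIB

print(f"1 TB = {TB_TO_TIB:.4f} TiB")
print(f"6 x 7.68 TB = {raw_tib_per_host(6, 7.68):.1f} TiB raw per host")
```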

6. Protection policies & PF table

The Protection Factor (PF) is the storage overhead multiplier applied to usable data to account for redundancy. It is determined by your chosen RAID type, FTT (Failures to Tolerate), and for RAID-5, the stripe width. The tool enforces the minimum host count per policy.

Policy	PF	Min Hosts	FTT	Notes
RAID-5 2+1 FTT=1	1.50×	3	1	Default: best balance of protection and efficiency
RAID-5 4+1 FTT=1	1.25×	6	1	Lower overhead but needs 6+ hosts
RAID-6 4+2 FTT=2	1.50×	6	2	Two simultaneous drive failures tolerated
Mirror FTT=1	2.00×	3	1	Simple mirror: highest rebuild performance
Mirror FTT=2	3.00×	5	2	Three copies of every object
Mirror FTT=3	4.00×	7	3	Maximum redundancy: very high storage cost

7. Final host count & limiter

The final host count is the maximum across four independent floors: CPU hosts, RAM hosts, storage hosts, and the policy minimum. The tool identifies which floor is binding and labels it the Limiter.

Final Hosts = max( CPU Hosts, RAM Hosts, Storage Hosts, Policy Min )

Limiter	Meaning	Common cause
Compute	CPU is the binding constraint	High vCPU density, low oversub ratio
Memory	RAM is the binding constraint	Memory-intensive workloads, RAM oversub at 1×
Storage	vSAN ESA capacity drives the count	Large disk demand, high PF, low DRR, insufficient NVMe
Policy	Protection policy min host count	Small cluster — compute fine but policy enforces minimum N hosts

When storage is the limiter, your NVMe capacity per host is insufficient to hold the protected dataset within the compute-determined host count. Solutions: increase NVMe drive count or size, relax the vSAN free% reserve, or accept a higher host count.
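Picking the limiter is a max over the four floors. A minimal sketch, assuming ties resolve in the order Compute, Memory, Storage, Policy (that ordering is my assumption, not documented tool behaviour); the function name and the example floor values other than the 18 CPU hosts from section 4 are hypothetical:

```python
def final_host_count(cpu_hosts: int, ram_hosts: int,
                     storage_hosts: int, policy_min: int) -> tuple[int, str]:
    """Return (final host count, limiter label) across the four floors."""
    floors = {
        "Compute": cpu_hosts,
        "Memory": ram_hosts,
        "Storage": storage_hosts,
        "Policy": policy_min,
    }
    # max() returns the first key with the highest value, so ties follow dict order.
    limiter = max(floors, key=floors.get)
    return floors[limiter], limiter

# A compute-bound domain and a small cluster held up by the policy minimum.
print(final_host_count(cpu_hosts=18, ram_hosts=5, storage_hosts=6, policy_min=3))
print(final_host_count(cpu_hosts=2, ram_hosts=1, storage_hosts=2, policy_min=3))
```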

8. NVMe memory tiering (VCF 9)

VCF 9 introduces NVMe-backed memory tiering, where fast NVMe drives act as a memory extension. A partition of each NVMe drive is set aside as a memory tier — not storage — allowing effective RAM per host to exceed physical DRAM installed. This can reduce the host count when memory is the sizing constraint.

Tiering formulas:

			
Partition GB       = min( Drive GB,  DRAM × NVMe Ratio,  512 GB cap )
NVMe Ratio Used    = Partition GB ÷ Host DRAM GB
Effective Host RAM = Host DRAM × (1 + NVMe Ratio Used)
Tiered Demand RAM  = ( Eligible Demand ÷ (1 + NVMe Ratio Used) )
                     + Ineligible Demand

		

Key inputs: Eligibility % (what fraction of workload is not latency-sensitive), NVMe-to-DRAM ratio (GB of NVMe tier per GB of DRAM), and tier drive size (separate from vSAN data drives). The effective RAM and reduced demand figure feed back into the RAM host floor calculation.
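Here is a small sketch of the tiering formulas. The names are mine, the input values are hypothetical, and I am assuming that Eligible Demand is simply total RAM demand multiplied by the Eligibility %, with the remainder treated as ineligible.

```python
from typing import NamedTuple

TIER_PARTITION_CAP_GB = 512  # per-drive partition cap from the formula above

class TieringResult(NamedTuple):
    partition_gb: float
    ratio_used: float
    effective_host_ram_gb: float
    tiered_demand_gb: float

def memory_tiering(host_dram_gb: float, tier_drive_gb: float, nvme_ratio: float,
                   demand_ram_gb: float, eligibility_pct: float) -> TieringResult:
    """Apply the four tiering formulas in order."""
    partition = min(tier_drive_gb, host_dram_gb * nvme_ratio, TIER_PARTITION_CAP_GB)
    ratio_used = partition / host_dram_gb
    effective_ram = host_dram_gb * (1 + ratio_used)
    eligible = demand_ram_gb * eligibility_pct / 100   # assumed split of demand
    ineligible = demand_ram_gb - eligible
    tiered_demand = eligible / (1 + ratio_used) + ineligible
    return TieringResult(partition, ratio_used, effective_ram, tiered_demand)

# Hypothetical host: 1,024 GB DRAM, 1.6 TB tier drive, 1:1 requested ratio.
result = memory_tiering(host_dram_gb=1024, tier_drive_gb=1600, nvme_ratio=1.0,
                        demand_ram_gb=3200, eligibility_pct=50)
print(result)
```

In this example the 512 GB cap binds, so the ratio actually used is 0.5 even though 1.0 was requested. That is worth checking on large-memory hosts before counting on the full requested ratio.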

⚠️ Tiering caveats. NVMe tiering suits read-heavy workloads with temporal locality. It is not appropriate for latency-sensitive databases, real-time analytics, or anything where memory bandwidth consistency matters. The eligibility % input requires honest assessment of your workload mix.

9. External storage mode

Both the Management Domain and each WLD can be toggled to External Array mode — modelling Fibre Channel or NFS as principal storage. In this mode, the vSAN ESA storage pipeline is bypassed entirely. Host count is determined by compute only, and the user supplies an estimated array capacity for documentation.

			
Final Hosts (ext) = max( CPU Hosts, RAM Hosts, Policy Min )
                    — Storage floor is removed

The Limiter can only be Compute, Memory, or Policy. No ESA capacity, PF, or per-host storage figures are calculated for external domains.

Entitlement impact

Every VCF core licence includes 1 TiB of vSAN raw storage entitlement. When a domain runs external storage, those cores are still licensed at the same cost but the bundled vSAN storage is unused.

Forfeited TiB = Licensed Cores × 1 TiB/core

For a 10-host domain with 2×32c hosts, that’s 640 TiB of vSAN entitlement forfeited — storage the customer is paying for but not using. The tool surfaces this inline, in the Fleet License Summary, and in the export report so the commercial impact is visible before procurement conversations begin.

10. VCF licence entitlement calculation

VCF 9 is licensed per core. The tool calculates total core count across the fleet and derives the vSAN storage entitlement bundled with those licences.

			
Mgmt Cores         = Mgmt Hosts  × Host Cores
WLD Cores          = Σ( WLD Hosts × Host Cores )
Entitlement (TiB)  = ( Mgmt Cores + WLD Cores ) × 1 TiB/core
Fleet vSAN Raw TiB = Σ( Hosts × NVMe Qty × NVMe TB × 0.9095 )
Add-on Required    = max( 0,  Fleet Raw TiB − Entitlement TiB )

		

If raw capacity exceeds entitlement, the difference is flagged as Add-on TiB Required — additional vSAN capacity licensing needed beyond what’s included in core licences. External storage domains exclude their array capacity from the fleet raw total.
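The entitlement maths, including the forfeited figure from the external storage section, can be sketched like this. The `Domain` class and `fleet_summary` function are hypothetical names of mine, and the fleet in the example is made up, apart from the 10-host 2×32c external domain taken from the text.

```python
from dataclasses import dataclass

TB_TO_TIB = 0.9095     # conversion factor used in the formulas above
TIB_PER_CORE = 1.0     # vSAN raw entitlement bundled with each licensed core

@dataclass
class Domain:
    name: str
    hosts: int
    cpu_qty: int
    cores_per_cpu: int
    nvme_qty: int = 0
    nvme_tb: float = 0.0
    external_storage: bool = False

    @property
    def cores(self) -> int:
        return self.hosts * self.cpu_qty * self.cores_per_cpu

    @property
    def vsan_raw_tib(self) -> float:
        # External-storage domains contribute nothing to the fleet vSAN raw total.
        if self.external_storage:
            return 0.0
        return self.hosts * self.nvme_qty * self.nvme_tb * TB_TO_TIB

def fleet_summary(domains: list[Domain]) -> dict:
    cores = sum(d.cores for d in domains)
    entitlement = cores * TIB_PER_CORE
    raw = sum(d.vsan_raw_tib for d in domains)
    forfeited = sum(d.cores * TIB_PER_CORE for d in domains if d.external_storage)
    return {
        "cores": cores,
        "entitlement_tib": entitlement,
        "fleet_raw_tib": raw,
        "addon_tib": max(0.0, raw - entitlement),
        "forfeited_tib": forfeited,
    }

mgmt = Domain("mgmt", hosts=4, cpu_qty=2, cores_per_cpu=16, nvme_qty=6, nvme_tb=7.68)
wld_ext = Domain("wld-ext", hosts=10, cpu_qty=2, cores_per_cpu=32, external_storage=True)
summary = fleet_summary([mgmt, wld_ext])
print(summary)
```

Run on the management domain alone, the same function reports an add-on requirement, because 4 hosts of 6 × 7.68 TB carry more raw TiB than their 128 cores entitle.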

11. Principal storage options in VCF 9 (KB 416270)

VCF 9 supports a broader set of principal storage options than previous versions. Some are available via standard greenfield workflows; others require the Converge workflow. This distinction matters — it affects automation, LCM, and Day 2 operations.

Storage Model	Mgmt Default	Mgmt Additional	VI WLD	Method
vSAN ESA	Principal	Principal	Principal	🟢 Greenfield
vSAN OSA	Principal	Principal	Principal	🟢 Greenfield
Storage Cluster (disagg. vSAN)	—	Principal	Principal	🟢 Greenfield
Compute-Only Cluster	—	Principal	Principal	🟢 Greenfield
Fibre Channel (FC)	Principal	Principal + Supp	Principal + Supp	🟢 Greenfield
NFS v3	Principal	Principal + Supp	Principal + Supp	🟢 Greenfield
iSCSI	Principal*	Principal*	Principal*	🔄 Converge
NFS v4.1	Principal*	Principal*	Principal*	🔄 Converge
FCoE	Principal*	Principal*	Principal*	🔄 Converge
NVMe/FC · NVMe/TCP · NVMe/RDMA	Principal*	Principal*	Principal*	🔄 Converge

* Via Converge workflow: deploy ESXi 9 → configure target datastore → deploy vCenter 9 → import into VCF 9 using Converge (management) or Import vCenter (WLD).

⚠️ Day 2 operations constraint: For non-LCM Day 2 operations (host commissioning, adding/removing hosts or clusters), perform the operation in vCenter first, then run Sync Inventory in VCF Operations. If this step is skipped, lifecycle management in VCF Operations will be blocked for those hosts and clusters.

Source: Broadcom KB Article 416270

12. Assumptions & caveats

Assumption	Detail
Single cluster per domain	Each WLD is modelled as one cluster. Multi-cluster WLDs are not supported.
Homogeneous hosts	All hosts within a domain use the same spec. Mixed-node clusters are not modelled.
vSAN ESA only	The storage pipeline models ESA only. vSAN OSA has different overhead characteristics.
Growth is a flat buffer	Growth % is applied once, not compounded year-over-year. Add headroom manually for multi-year plans.
VM Swap fixed at 100% for mgmt	The management domain’s swap requirement is not user-configurable.
No stretched cluster modelling	Stretched clusters double host count and require witness nodes — not currently modelled.
Flat DRR across all data	A single DRR applies to the entire disk demand. Mixed workloads with varying compressibility are not modelled per-VM.
No explicit vSAN CPU/RAM overhead	vSAN ESA consumes a small amount of host CPU and memory. Include this in your Compute Reserve % input.

🚫 Not an official Broadcom tool. This sizer is an independent planning aid built by vmtechie.blog. It is not endorsed by or affiliated with Broadcom. All figures are estimates. Validate every design against official Broadcom TechDocs, VMware HCL, and field engineering guidance before procurement or deployment.

March 1, 2026

VCF 9 Fleet Planning Sizer
After several VCF design sessions—navigating management domains, ESA policies, and the new core-based licensing—one thing became clear: we have plenty of docs, but we need more interactive clarity. I built the VCF 9 Fleet Planning Sizer (ESA Only) to help architects model environments quickly.

🔷 VCF 9 Fleet Planning Sizer (ESA Only)

👉 Try it here: https://sizer.vmtechie.blog/

This is an independent planning calculator designed to help architects model:
- Infrastructure VM footprint (Supervisor, Edge, etc.)
- Management Domain sizing
- Multiple Workload Domains
- ESA storage behavior
- DRR (Dedup × Compression realism)
- Failure domain modeling (0 / N+1 / N+2)
- Core-based licensing visibility
- vSAN entitlement vs raw consumption
Why I Built This Tool

Designing VCF 9 isn’t just about adding up VMs. It’s about navigating the “Triple Constraint”: Compute, ESA Storage, and Licensing. In real architecture discussions, we constantly ask:
- What is actually limiting this cluster?
- CPU, Memory, or Storage?
- How many hosts do we really need?
- What does FTT=2 + RAID-6 really do to capacity?
- Are we oversizing?
- Are we license constrained?
- What happens if I add Supervisor HA?
- What does N-2 failure tolerance mean in practice?
Spreadsheets can answer parts of this, but they don’t show the dynamic interaction between policy, compute, and ESA. This tool tries to do that.

Management Domain Sizing

The calculator starts with:

🔹 Hardware Profile
- CPUs per host
- Cores per CPU
- RAM per host
- NVMe quantity & size
- Minimum host count
🔹 Policy Inputs
- CPU oversubscription
- Memory oversubscription
- Host reserve %
- FTT & RAID policy
- vSAN free space %
- Dedup & compression
- VM Swap Used %
- Failure modeling
How It Calculates Management Hosts
1. Compute usable vCPU per host
2. Compute usable RAM per host
3. Apply reserve factor
4. Compare demand from full Management VM stack
5. Determine limiter (Compute / Memory / Storage)
6. Calculate ESA protected storage requirement
7. Apply failure domain logic
8. Final host count = max(CPU, RAM, Storage, Minimum Hosts)
You immediately see:
- Demand vs Capacity
- Protection Factor
- ESA storage breakdown
- Core licenses required
- Raw TiB consumed
Full Management VM Stack Modeling

The tool includes:
- SDDC Manager
- vCenter
- NSX Manager
- NSX Edge
- AVI
- VCF Operations
- Log Insight
- Network Insight
- Identity
- Custom VMs
Each with T-shirt sizing.

ESA Storage Model

ESA math is often misunderstood. The calculator models:

VM Capacity = (VM disks + infra disks) / DRR
Swap = Provisioned RAM × Swap %
Interim Total = VM Capacity + Swap
Protected = Interim × Protection Factor
+ Free Space Reserve
+ Growth %
Storage Hosts = ceil(total / per-host raw capacity) + failures

Protection Factor examples:

Policy	FTT	Protection Factor
RAID-1	1	2.0
RAID-1	2	3.0
RAID-5	1	1.5
RAID-5	2	1.75
RAID-6	2	1.5

Workload Domains (Where It Gets Interesting)

You can add multiple WLDs.

Each WLD has:

🔹 Tenant Demand
- VM count
- vCPU per VM
- RAM per VM
- Disk per VM
- Growth %
🔹 Policy + Planning
- CPU/Mem oversub
- FTT + RAID
- Reserve %
- Free space %
- Dedup × Compression
- VM Swap Used %
- Failure Domain (0 / N+1 / N+2)
Limiter Visualization + Health Model

Each WLD shows:
- Compute limiter
- Memory limiter
- Storage limiter
- Utilization %
- Health badge:
  - 🟢 Healthy
  - 🟡 Tight
  - 🔵 Oversized
This gives immediate architectural intuition.

Licensing Visibility (Core-Based)

The calculator also models:
- Management core licenses
- Workload core licenses
- Total fleet cores
- Entitlement (1 TiB per core)
- Required add-on capacity
What Makes This Different?

This tool is:

✔ ESA-focused
✔ Policy-aware
✔ Failure-domain realistic
✔ Multi-domain capable
✔ Licensing visible
✔ Architecture-driven

It’s not just math. It reflects real design conversations.

⚠️ Important Disclaimer

This calculator is:
- Independent
- Not an official Broadcom / VMware tool
- Not endorsed by my employer
- Intended as a planning aid only
Always validate against:
- Official documentation
- HCL
- Field engineering guidance
🧑‍💻 Who Is This For?
- VCF Architects
- Cloud Platform Leads
- Infrastructure Engineers
- Pre-sales Architects
- Capacity planners
- Anyone doing ESA-based VCF 9 designs
🚀 Try It

👉 Live here:

https://sizer.vmtechie.blog

If you test it, I’d love feedback.

Final Thoughts

Architecture clarity reduces risk. This tool is my contribution to making VCF 9 planning:

More transparent.
More realistic.
More engineer-friendly.
February 24, 2026
VCF 9 – Updating the Supervisor Service
Supervisor and VKS clusters are built using a common Kubernetes distribution core, but their Kubernetes versions are delivered differently. Starting with VCF 9, Supervisor Kubernetes releases are delivered independently of vCenter. You can update the Supervisor version by deploying a release from the Supervisor Content Library. In this blog post, we will walk through the Supervisor update process step by step. Let’s get started!

Create and Configure a Subscribed Content Library for Supervisor Images

For vSphere Supervisor, VMware publishes Supervisor images through a content delivery network (CDN). To enable or upgrade vSphere Supervisor, you can create a Subscribed Content Library that synchronizes with the Supervisor release images.

You can configure the content library in either Immediate or On-Demand synchronization mode. Note that immediate synchronization from the public CDN may require more time and consume additional disk space.
- Log in to vCenter as a vSphere administrator.
- From the Home menu, select Content Libraries.
- Click Create.
- Provide a name for the library (for example, supervisor update library) and click Next.
- On the Configure Content Library page, select Subscribed Content Library.
- Enter the Subscription URL: https://wp-content.vmware.com/supervisor/v1/latest/lib.json (the URL is published in the Broadcom TechDocs: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/vsphere-supervisor-installation-and-configuration/updating-vsphere-supervisor/updating-the-vsphere-with-tanzu-environment/create-a-supervisor-asynchrounious-releases-content-library.html).
- In the Download content section, select the synchronization mode of the content library and click Next.
- When prompted, accept the SSL certificate thumbprint. The thumbprint remains stored on your system until the subscribed content library is removed from the inventory.
- On the Apply Security Policy page, click Next.
- On the Add storage page, select a datastore as the storage location for the content library contents and click Next.
- Review the details and click Finish.
Assign the content library to the vSphere Supervisor platform
- In vCenter, from the Home menu, select Supervisor Management.
- Select Content Distribution.
- On the Supervisor Images Library card, click Assign.
- Select the content library that you created above and click Assign.
- The new content library begins synchronizing, which may take some time. After synchronization is complete, the new Supervisor Kubernetes versions included in the images appear under the Updates tab.
Apply Updates
- Select the Available Version you want to update to. For example: v1.30.10+vmware.1-fips-vsc9.0.0.0100. ⚠️ Updates must be applied incrementally. You cannot skip versions (e.g., upgrading directly from 1.28 to 1.30). The correct sequence is 1.28 → 1.29 → 1.30.
- Select a Supervisor to update and click Apply Updates
The system runs a series of pre-checks to verify the compatibility of the different components against the Supervisor Kubernetes version to which you want to update.

To learn which pre-checks run before a Supervisor update, and how to troubleshoot errors from failed pre-checks, see HERE.

When the pre-checks are completed successfully, you can update the Supervisor.
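Because updates must be applied one minor version at a time, it helps to list the hops before booking a change window. The helper below is a planning sketch of my own, not part of any VMware tooling; it assumes version strings begin with an optional "v" followed by major.minor, as in the example above.

```python
def minor_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from e.g. 'v1.30.10+vmware.1-fips-vsc9.0.0.0100'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)

def supervisor_update_path(current: str, target: str) -> list[str]:
    """List every minor release to apply, in order, to get from current to target."""
    cur_major, cur_minor = minor_version(current)
    tgt_major, tgt_minor = minor_version(target)
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError(f"cannot plan an update from {current} to {target}")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]

# Going from 1.28 to 1.30 takes two updates: first 1.29, then 1.30.
print(supervisor_update_path("1.28", "v1.30.10+vmware.1-fips-vsc9.0.0.0100"))
```

Each hop in the returned list is a separate Apply Updates run with its own pre-checks.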

Upgrading the VMware vSphere Supervisor service is a critical step in maintaining a secure, stable, and feature-rich VMware Cloud Foundation environment. By following best practices—planning incremental updates, leveraging subscribed content libraries, and validating compatibility at every stage—administrators can ensure minimal downtime while keeping workloads and Kubernetes clusters up to date. Regular Supervisor upgrades not only enhance platform capabilities but also strengthen the foundation for running modern applications, containers, and cloud-native services efficiently and reliably.
September 25, 2025
VCF Automation – Tenant Management
In today’s multi-tenant cloud environments, VMware Cloud Foundation Automation (VCFA) offers a robust layered architecture that seamlessly bridges enterprise-grade infrastructure management with developer-ready self-service capabilities.

By clearly separating responsibilities—from VMware Cloud Service Providers who manage the physical and virtual infrastructure, to organization administrators who allocate resources, and finally to developers who consume them—VCFA enables efficient resource governance, operational consistency, and scalability. This structured approach not only supports multi-tenancy and workload isolation but also accelerates innovation by empowering end users to deploy applications and services quickly within well-defined boundaries.

Why Tenant Management Matters?

Tenant management is more than just dividing resources—it’s about ensuring cost efficiency, security, scalability, and compliance in a shared infrastructure. In VCFA, these capabilities allow VMware Cloud Service Providers to maximize utilization without compromising performance or governance for individual tenants.

Key concepts to understand from both the Provider and Tenant perspectives:

Projects

Projects control user access to namespaces and user ownership of provisioned resources. All organizations are created with a default project. The default project is empty and does not have any namespaces or users.

Example: A VMware Cloud Service Provider might assign a dedicated project to each customer department for clearer billing and isolation.

Regions

The Regions page lists all the regions in which the organization has a quota. An organization can have a quota in one or many regions. Your provider administrator assigns the regional quota to your organization. Quota in a region can come from one or many vSphere Zones within that region.

Example: A global enterprise hosted by a VMware Cloud Service Provider might have quotas in Asia and Europe to ensure low-latency access for local teams.

Namespace Class

Namespace classes are templates for namespace provisioning. These templates can be used to standardize namespace attributes such as utilization limits, reservations, VM classes, storage classes, and content libraries. Organizations come preconfigured with three default namespace classes (small, medium, and large), which are meant to serve as example templates. The only attributes that differ among these built-in templates are the CPU and memory limits. Administrators can use these templates as-is or modify them to suit their needs.
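To make the template idea concrete, here is a minimal Python sketch of namespace classes as plain data. It is an illustration only, not the VCFA API: the `NamespaceClass` type, its field names, and every number and class name in it are assumptions chosen for the example.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class NamespaceClass:
    """Hypothetical model of a namespace class template (not a VCFA object)."""
    name: str
    cpu_limit_ghz: float
    memory_limit_gib: int
    vm_classes: tuple = ("example-vm-class",)
    storage_classes: tuple = ("example-storage-class",)
    content_libraries: tuple = ("example-content-library",)


# Illustrative limits only; the real defaults are not stated in this post.
DEFAULTS = {
    "small": NamespaceClass("small", 10.0, 32),
    "medium": NamespaceClass("medium", 20.0, 64),
    "large": NamespaceClass("large", 40.0, 128),
}


def differing_fields(a: NamespaceClass, b: NamespaceClass) -> list:
    """Return the attribute names (other than the name itself) that differ."""
    return [
        f.name
        for f in fields(a)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


# The built-in templates differ only in their CPU and memory limits.
print(differing_fields(DEFAULTS["small"], DEFAULTS["large"]))

# An administrator can start from a template and modify it.
custom = replace(DEFAULTS["medium"], name="medium-custom", memory_limit_gib=96)
print(custom)
```

Treating the template as immutable data and deriving variants with `replace` mirrors the "use as-is or modify" choice described above.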

Namespace

Projects are the central construct for organizing and allocating infrastructure resources to tenants or teams. As the organization administrator, you manage and distribute infrastructure by assigning namespaces to projects. When configuring a project, you must add at least one namespace so that users within the project can begin provisioning workloads such as virtual machines, vSphere Kubernetes Service (VKS) clusters, or other supported resources.

Namespaces act as scoped resource pools, defining limits for CPU, memory, and storage to ensure fair allocation and performance consistency. Each namespace is tied to a Virtual Private Cloud (VPC) and a namespace class, and is associated with at least one zone to determine placement and availability. This structure not only enforces resource governance but also enables automation workflows to deploy consistently within predefined boundaries.

Example: A tenant of a VMware Cloud Service Provider might create separate namespaces for development and production to avoid accidental resource conflicts.
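The relationships described above (a project needs at least one namespace, and each namespace needs a VPC, a namespace class, and at least one zone) can be sketched as a small validation routine. This is a hypothetical model for illustration; the `Project` and `Namespace` types and all names are assumptions, not VCFA objects.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Hypothetical namespace record."""
    name: str
    vpc: str
    namespace_class: str
    zones: list


@dataclass
class Project:
    """Hypothetical project record; new projects start with no namespaces."""
    name: str
    namespaces: list = field(default_factory=list)


def validation_errors(project: Project) -> list:
    """Collect reasons why users in this project could not provision workloads."""
    errors = []
    if not project.namespaces:
        errors.append(f"project {project.name!r} has no namespaces")
    for ns in project.namespaces:
        if not ns.vpc:
            errors.append(f"namespace {ns.name!r} is not tied to a VPC")
        if not ns.namespace_class:
            errors.append(f"namespace {ns.name!r} has no namespace class")
        if not ns.zones:
            errors.append(f"namespace {ns.name!r} has no zone for placement")
    return errors


default_project = Project("default")
print(validation_errors(default_project))

team_project = Project(
    "team-a",
    [
        Namespace("team-a-dev", "vpc-dev", "small", ["zone-1"]),
        Namespace("team-a-prod", "vpc-prod", "large", ["zone-1", "zone-2"]),
    ],
)
print(validation_errors(team_project))
```

The empty default project fails the check, while the project with separate development and production namespaces passes.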

Virtual Private Clouds (VPCs)

A Virtual Private Cloud (VPC) in VMware Cloud Foundation Automation (VCFA) offers an isolated networking environment that can be associated with one or more namespaces. Organizations can create multiple VPCs and assign each to specific namespaces based on workload or isolation requirements.

Each VPC is an independent network and supports three types of IP address spaces, each offering different levels of reachability:
- Private CIDRs: These addresses are internal to the VPC and are not routable outside without NAT. They are managed by the VPC administrator and do not need to be globally unique, allowing reuse across multiple VPCs.
- TGW Private IP Blocks: These IP blocks are scoped at the organization level and are advertised through the Transit Gateway (TGW) within the organization. Organization admins define these blocks, and project admins can allocate subnets from them for their VPCs. This enables direct communication between VPCs in the same organization using the TGW Private IP space.
- External IP Blocks: Managed by the provider admin, these IPs enable outbound access through Source NAT. Organization admins can assign subnets from provider-defined external blocks, giving workloads external connectivity while still using internal addressing.

You can choose to deploy a separate VPC per namespace for stricter isolation, or share a VPC across namespaces where network separation is not required.
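The difference between reusable private CIDRs and organization-scoped TGW private blocks can be illustrated with Python's standard `ipaddress` module. This is a sketch under stated assumptions: the VPC names and address ranges are invented, and the "next free /24" allocation is a simplification, not how VCFA assigns subnets.

```python
import ipaddress

# Private CIDRs are local to each VPC, so two VPCs may reuse the same range.
# (Hypothetical VPC names and ranges.)
vpc_private = {"vpc-dev": "10.0.0.0/24", "vpc-prod": "10.0.0.0/24"}

dev_net = ipaddress.ip_network(vpc_private["vpc-dev"])
prod_net = ipaddress.ip_network(vpc_private["vpc-prod"])
print("private CIDRs overlap:", dev_net.overlaps(prod_net))

# A TGW private IP block is defined once per organization. Subnets carved
# from it must not overlap, because VPCs reach each other directly over
# this address space.
tgw_block = ipaddress.ip_network("172.16.0.0/16")  # assumed org-level block
subnet_iter = tgw_block.subnets(new_prefix=24)
tgw_subnets = {vpc: next(subnet_iter) for vpc in vpc_private}

for vpc, subnet in tgw_subnets.items():
    print(vpc, "->", subnet)

print(
    "TGW subnets overlap:",
    tgw_subnets["vpc-dev"].overlaps(tgw_subnets["vpc-prod"]),
)
```

Overlap is harmless for the private CIDRs because they never leave the VPC without NAT, whereas the TGW subnets have to be distinct to be routable between VPCs.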

Transit Gateways

Each organization has a transit gateway which provides connectivity to the provider gateway within the organization. One or more VPCs are connected to the transit gateway, and that connection is defined by a VPC connectivity profile. Each VPC has connected workloads and a private subnet. SNAT rules translate addresses from this private subnet to a public address in the IP spaces block. This infrastructure enables the organization and its workloads to connect to external networks.
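As a rough illustration of what the SNAT rule does, the sketch below rewrites the source address of outbound traffic when it originates from the VPC's private subnet. The function name and the addresses (taken from documentation ranges) are assumptions for the example; real SNAT is applied by the networking layer, not by application code.

```python
import ipaddress


def snat_source(source_ip: str, private_subnet: str, public_ip: str) -> str:
    """Return the source address a packet would carry after the SNAT rule.

    Traffic from inside the private subnet is translated to the public
    address; anything else is left unchanged.
    """
    if ipaddress.ip_address(source_ip) in ipaddress.ip_network(private_subnet):
        return public_ip
    return source_ip


# Hypothetical VPC private subnet and public address from an IP space block.
PRIVATE_SUBNET = "10.0.0.0/24"
PUBLIC_IP = "203.0.113.10"

print(snat_source("10.0.0.15", PRIVATE_SUBNET, PUBLIC_IP))
print(snat_source("192.168.1.5", PRIVATE_SUBNET, PUBLIC_IP))
```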

You can view what transit gateways are available to your organization on the Manage & Govern > Networking > Transit Gateways page.

IP Management

Providers can use IP Spaces to manage their IP address allocation needs. IP Spaces provide a structured approach to allocating public IP addresses to different organizations, enabling connectivity to external networks.

An IP space consists of a set of reserved CIDR blocks. These CIDRs must be dedicated to, and used by, organization administrators as they configure services. An IP space can only be IPv4.
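The IPv4-only rule is easy to check before submitting a set of CIDR blocks. The following is a minimal sketch using Python's standard `ipaddress` module; `validate_ip_space` is a hypothetical helper, not part of any VCFA tooling, and the ranges are documentation addresses.

```python
import ipaddress


def validate_ip_space(cidrs: list) -> list:
    """Parse CIDR blocks and reject anything that is not IPv4."""
    networks = []
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)  # raises ValueError on malformed input
        if net.version != 4:
            raise ValueError(f"{cidr}: an IP space can only be IPv4")
        networks.append(net)
    return networks


def total_addresses(networks: list) -> int:
    """Count the addresses covered by the blocks in the IP space."""
    return sum(net.num_addresses for net in networks)


space = validate_ip_space(["203.0.113.0/24", "198.51.100.0/24"])
print(total_addresses(space))

try:
    validate_ip_space(["2001:db8::/32"])
except ValueError as err:
    print("rejected:", err)
```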

Organization administrators can create and manage the private IP blocks within their organization. As a tenant, you can view the external IP address blocks that a provider has assigned to your organization. You can also create and view private TGW IP address blocks for the entire organization to use. Finally, you can view private VPC IP address blocks that apply to specific VPCs.

In essence, VMware Cloud Foundation Automation’s tenant management capabilities provide a structured, role-based framework for organizing projects, namespaces, VPCs, transit gateways, and IP resources. By aligning provider and tenant responsibilities, VMware Cloud Service Providers ensure secure isolation, consistent governance, and streamlined automation—empowering organizations to scale efficiently while maintaining full control over infrastructure and networking resources.
August 12, 2025
