Explore 2025 Session Recap – INVB1158LV
Are you looking to maximize AI/ML performance in your virtualized environment? At VMware Explore 2025, I attended a compelling session — INVB1158LV: Accelerating AI Workloads: Mastering vGPU Management in VMware Environments — that unpacked how to effectively configure and scale GPUs for AI workloads in vSphere.

This blog post shares key takeaways from the session and outlines how to use vGPU, MIG, and Passthrough to achieve optimal performance for AI inference and training on VMware Cloud Foundation 9.0.
vGPU Configuration Options in VMware vSphere
🔹 1. DirectPath I/O (Passthrough)
- A dedicated GPU is assigned to a single VM or containerized workload.
- Ideal for maximum performance and full GPU access (e.g., LLM training).
- No sharing or resource fragmentation.
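On the ESXi host, passthrough can be configured from the CLI as well as the vSphere UI. A minimal sketch (the PCI address `0000:3b:00.0` is a hypothetical placeholder — look up your own device first; flags shown are from ESXi 7.x and may differ by release):

```shell
# Find the GPU's PCI address on the ESXi host
esxcli hardware pci list | grep -i nvidia

# Enable DirectPath I/O passthrough for that device
esxcli hardware pci pcipassthru set -d 0000:3b:00.0 -e true
```

After enabling passthrough, the device can be added to a VM as a PCI device in the vSphere UI.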
🔹 2. NVIDIA vGPU – Time Slicing Mode
- Shares one physical GPU across multiple VMs.
- Each VM gets 100% of GPU cores for a slice of time, while memory is statically partitioned.
- Supported across NVIDIA's vGPU-certified GPUs, with no MIG-capable hardware required.
- Useful for efficient GPU sharing, especially for model inference and dev/test setups.
✅ Example profiles: grid_a100-8c, grid_a100-20c
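With the NVIDIA vGPU host driver installed on the ESXi host, the profiles a given GPU actually supports can be listed from the host shell — a quick sketch, assuming shell access to the host:

```shell
# List the vGPU types (time-sliced and MIG-backed) this GPU supports
nvidia-smi vgpu --supported

# Show currently running vGPU instances and the VMs using them
nvidia-smi vgpu
```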
🔹 3. Multi-Instance GPU (MIG)
- Available on NVIDIA Ampere & Hopper (e.g., A100, H100).
- Splits GPU into isolated hardware slices (compute + memory).
- Offers deterministic performance and better isolation.
- Best for multi-tenant AI inference, production-grade deployments.
✅ Example profiles: MIG 1g.5gb, MIG 2g.10gb, MIG 3g.20gb
✅ Assignable via vSphere UI with profiles like grid_a100-3-20c
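Under the hood, the MIG-backed vGPU profiles map onto GPU instances carved out by the host driver. A sketch of the `nvidia-smi` workflow on an A100 host (assumes host shell access; enabling MIG requires a GPU reset):

```shell
# Enable MIG mode on GPU 0 (Ampere/Hopper only)
nvidia-smi -i 0 -mig 1

# List the MIG GPU instance profiles this GPU supports
nvidia-smi mig -lgip

# Create two 3g.20gb GPU instances, each with a default compute instance (-C)
nvidia-smi mig -cgi 3g.20gb,3g.20gb -C

# Verify the resulting GPU instances
nvidia-smi mig -lgi
</imports>
```

Each 3g.20gb instance then backs a `grid_a100-3-20c` vGPU assignable to a VM.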
Time Slicing vs. MIG – When to Use What?
| Mode | Best For | Sharing Type |
|---|---|---|
| Time Slicing | Model inference, dev/test environments | Time-shared |
| MIG | Production inference, multitenancy | Spatial (hardware) |
| Passthrough | Maximum performance for single workload | Not shared |
Smarter vMotion for AI Workloads in VCF 9.0
One of the standout improvements presented during session INVB1158LV was the vMotion optimization for VMs using vGPUs. With vSphere 8.0 U3 and VMware Cloud Foundation 9.0, the way vMotion handles GPU memory has been completely reengineered to minimize downtime (stun time) during live migration.
Instead of transferring all GPU memory while the VM is stunned, roughly 70% of the vGPU memory (the cold data) is now copied during the pre-copy stage, and only the remaining ~30% is checkpointed during the stun phase. This greatly accelerates live migration, even for massive LLM workloads running on multi-GPU systems.
📊 Example results with Llama 3.1 models:
- Migrating a VM using 2×H100 GPUs (144 GB vGPU memory) saw stun time drop from 24.5s to just 6.3s.
- Migrating a large model on 8×H100 (576 GB) now completes in 21s, compared to 325s for a power-off-and-reload approach — that’s a 15× improvement.
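The quoted numbers are easy to sanity-check with quick arithmetic (figures taken from the session results above):

```shell
# Stun-time reduction for the 2xH100 / 144 GB case: 24.5 s -> 6.3 s
awk 'BEGIN { printf "stun speedup: %.1fx\n", 24.5 / 6.3 }'

# vMotion (21 s) vs power-off-and-reload (325 s) on 8xH100 / 576 GB
awk 'BEGIN { printf "migration speedup: %.1fx\n", 325 / 21 }'

# With the 70/30 split, data checkpointed during stun for 144 GB of vGPU memory
awk 'BEGIN { printf "stun-phase data: %.1f GB\n", 0.30 * 144 }'
```

The 325 s / 21 s ratio works out to roughly 15.5×, matching the "15× improvement" claim.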
These enhancements make zero-downtime AI infrastructure upgrades and scaling possible, even for large language model deployments.





