Transform a single H100 into a fleet of isolated GPUs. Stop wasting cycles on idle hardware and start guaranteeing Quality of Service (QoS) through hardware-level isolation.
Traditional GPU multi-tenancy relies on time-slicing: the scheduler context-switches entire workloads on and off the GPU, so one noisy neighbor injects latency jitter into every other tenant. MIG instead partitions the hardware itself, giving each instance its own compute cores, L2 cache, and memory bandwidth.
Follow this step-by-step walkthrough based on real terminal outputs. We will go from a raw GPU state to fully partitioned instances.
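The core sequence looks like the following sketch. The device index and the profile IDs are assumptions for an H100 80GB; always confirm the IDs your driver reports with `nvidia-smi mig -lgip` before creating instances.

```shell
# Enable MIG mode on GPU 0 (requires the GPU to be idle and triggers a reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this driver/GPU combination supports
nvidia-smi mig -lgip

# Create two 3g.40gb GPU instances (profile ID 9 here is illustrative)
# and their default compute instances in one step with -C
sudo nvidia-smi mig -cgi 9,9 -C

# Verify: each MIG device gets its own UUID, usable in CUDA_VISIBLE_DEVICES
nvidia-smi -L
```

Each UUID from the final command behaves like an independent GPU to CUDA applications.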
An H100 (80GB) exposes 7 compute slices and 8 memory slices. Each MIG profile consumes a fixed number of each, so you can mix profile sizes to pack the GPU.
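The slice accounting can be sketched as a simple budget check. The profile table below assumes the standard H100 80GB profiles (verify against `nvidia-smi mig -lgip` on your hardware), and this is a capacity check only: real placement also depends on where free slices sit on the die.

```python
# Assumed (compute_slices, memory_slices) cost per H100 80GB MIG profile.
PROFILES = {
    "1g.10gb": (1, 1),
    "1g.20gb": (1, 2),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}

def fits(mix, compute_budget=7, memory_budget=8):
    """Return True if the requested profile mix fits within one GPU's slices."""
    compute = sum(PROFILES[p][0] for p in mix)
    memory = sum(PROFILES[p][1] for p in mix)
    return compute <= compute_budget and memory <= memory_budget

# 3g + 2g + 2g consumes exactly 7 compute and 8 memory slices: it fits.
print(fits(["3g.40gb", "2g.20gb", "2g.20gb"]))  # True
# Two 4g instances would need 8 compute slices: it does not.
print(fits(["4g.40gb", "4g.40gb"]))             # False
```

This is why a "7 compute / 8 memory" GPU cannot host two 4g.40gb instances even though the memory would fit.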
When to choose MIG over other virtualization methods.
Comparison of p99 Latency under load (Lower is better).
Total system throughput running 7 simultaneous jobs.
MIG works best for workloads that are small enough not to need a full H100, or that are bound by a single slice's memory bandwidth (e.g., BERT-Large inference, Jupyter notebooks). If your model needs NVLink to scale across GPUs, disable MIG: MIG instances cannot participate in multi-GPU NVLink communication.
Changing MIG profiles requires the GPU to be idle (all processes on it stopped). In production, use the Kubernetes device plugin's mixed MIG strategy to advertise different slice sizes as distinct node labels and resources, rather than re-slicing on the fly.
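Under the mixed strategy, each slice size is exposed as its own extended resource, so a pod can pin itself to a specific profile. A minimal sketch, assuming the NVIDIA device plugin's `nvidia.com/mig-<profile>` resource naming (the image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bert-inference
spec:
  containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # illustrative tag
    resources:
      limits:
        # Request one 3g.40gb MIG slice; the exact resource name depends
        # on your device-plugin configuration
        nvidia.com/mig-3g.40gb: 1
```

The scheduler then places the pod only on nodes advertising a free slice of that exact size.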
Standard `nvidia-smi` gives limited visibility inside a slice. Use NVIDIA DCGM (Data Center GPU Manager) to profile metrics like `gr_engine_active` per MIG instance to ensure you aren't under-utilizing your slices.
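With DCGM installed, per-instance utilization can be sampled from the command line. A minimal sketch, assuming the `dcgmi` CLI is available (field 1001 is `DCGM_FI_PROF_GR_ENGINE_ACTIVE`; check your entity IDs first, as they vary by system):

```shell
# List GPUs and their MIG instance/compute-instance entities
dcgmi discovery -l

# Sample graphics-engine activity every second (1000 ms)
dcgmi dmon -e 1001 -d 1000
```

A slice that consistently reports low `gr_engine_active` is a candidate for consolidation into a smaller profile.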