Transform a single H100 into a fleet of isolated GPUs. Stop wasting cycles on idle hardware and start guaranteeing Quality of Service (QoS) through hardware-level isolation.
Traditional GPU multi-tenancy relies on time-slicing: the scheduler context-switches entire workloads on and off the GPU, so one noisy neighbor injects latency jitter into every other tenant. MIG instead partitions the hardware itself, giving each instance its own compute cores, L2 cache, and memory bandwidth.
Follow this step-by-step walkthrough based on real terminal outputs. We will go from a raw GPU state to fully partitioned instances.
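The core sequence looks like the following sketch. The device index and the profile IDs are assumptions for an H100 80GB; always confirm the IDs your driver reports with `nvidia-smi mig -lgip` before creating instances.

```shell
# Enable MIG mode on GPU 0 (requires the GPU to be idle and triggers a reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this driver/GPU combination supports
nvidia-smi mig -lgip

# Create two 3g.40gb GPU instances (profile ID 9 here is illustrative)
# and their default compute instances in one step with -C
sudo nvidia-smi mig -cgi 9,9 -C

# Verify: each MIG device gets its own UUID, usable in CUDA_VISIBLE_DEVICES
nvidia-smi -L
```

Each UUID from the final command behaves like an independent GPU to CUDA applications.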
An H100 (80GB) exposes 7 compute slices and 8 memory slices. Each MIG profile consumes a fixed number of each, so you can mix profile sizes to pack the GPU.
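The slice accounting can be sketched as a simple budget check. The profile table below assumes the standard H100 80GB profiles (verify against `nvidia-smi mig -lgip` on your hardware), and this is a capacity check only: real placement also depends on where free slices sit on the die.

```python
# Assumed (compute_slices, memory_slices) cost per H100 80GB MIG profile.
PROFILES = {
    "1g.10gb": (1, 1),
    "1g.20gb": (1, 2),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}

def fits(mix, compute_budget=7, memory_budget=8):
    """Return True if the requested profile mix fits within one GPU's slices."""
    compute = sum(PROFILES[p][0] for p in mix)
    memory = sum(PROFILES[p][1] for p in mix)
    return compute <= compute_budget and memory <= memory_budget

# 3g + 2g + 2g consumes exactly 7 compute and 8 memory slices: it fits.
print(fits(["3g.40gb", "2g.20gb", "2g.20gb"]))  # True
# Two 4g instances would need 8 compute slices: it does not.
print(fits(["4g.40gb", "4g.40gb"]))             # False
```

This is why a "7 compute / 8 memory" GPU cannot host two 4g.40gb instances even though the memory would fit.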
When to choose MIG over other virtualization methods.
Comparison of p99 Latency under load (Lower is better).
Total system throughput running 7 simultaneous jobs.
MIG works best for workloads that are small enough not to need a full H100, or that are bound by a single slice's memory bandwidth (e.g., BERT-Large inference, Jupyter notebooks). If your model needs NVLink to scale across GPUs, disable MIG: MIG instances cannot participate in multi-GPU NVLink communication.
Changing MIG profiles requires the GPU to be idle (all processes on it stopped). In production, use the Kubernetes device plugin's mixed MIG strategy to advertise different slice sizes as distinct node labels and resources, rather than re-slicing on the fly.
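Under the mixed strategy, each slice size is exposed as its own extended resource, so a pod can pin itself to a specific profile. A minimal sketch, assuming the NVIDIA device plugin's `nvidia.com/mig-<profile>` resource naming (the image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bert-inference
spec:
  containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # illustrative tag
    resources:
      limits:
        # Request one 3g.40gb MIG slice; the exact resource name depends
        # on your device-plugin configuration
        nvidia.com/mig-3g.40gb: 1
```

The scheduler then places the pod only on nodes advertising a free slice of that exact size.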
Standard `nvidia-smi` gives limited visibility inside a slice. Use NVIDIA DCGM (Data Center GPU Manager) to profile metrics like `gr_engine_active` per MIG instance to ensure you aren't under-utilizing your slices.
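With DCGM installed, per-instance utilization can be sampled from the command line. A minimal sketch, assuming the `dcgmi` CLI is available (field 1001 is `DCGM_FI_PROF_GR_ENGINE_ACTIVE`; check your entity IDs first, as they vary by system):

```shell
# List GPUs and their MIG instance/compute-instance entities
dcgmi discovery -l

# Sample graphics-engine activity every second (1000 ms)
dcgmi dmon -e 1001 -d 1000
```

A slice that consistently reports low `gr_engine_active` is a candidate for consolidation into a smaller profile.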