// the find
XuehaiPan/nvitop
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
nvitop is a terminal UI for monitoring NVIDIA GPUs, built on direct NVML bindings rather than parsing nvidia-smi output. It ships as both an interactive htop-style monitor and a Python library you can embed in training scripts, with extras like a CUDA device selector (nvisel) and a Prometheus exporter. The target audience is ML researchers and anyone managing multi-GPU Linux or Windows machines.
Querying NVML directly instead of screen-scraping nvidia-smi avoids spawning a subprocess and parsing its text on every refresh, so values come straight from the driver and polling overhead stays low. The nvisel tool fills a real gap: scripting CUDA_VISIBLE_DEVICES selection based on free memory and utilization thresholds is something most teams hack together themselves. The Python API is well-designed: Device.all(), device.processes(), and take_snapshots() are clean entry points that work inside callbacks for PyTorch Lightning or Keras. The Prometheus exporter plus the included Grafana dashboard JSON means you can wire this into existing infrastructure without building your own metrics pipeline.
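For a sense of what threshold-based selection involves, here is a minimal sketch of the kind of script teams tend to hand-roll. It is not how nvisel is implemented. The names pick_devices, read_snapshots, and SAMPLE are hypothetical, and the nvitop calls in read_snapshots (Device.all(), device.index, device.memory_free(), device.gpu_utilization()) reflect my understanding of the library's API and should be checked against its docs; they may return a placeholder value on GPUs that do not report a metric, which this sketch does not handle.

```python
from typing import Iterable, List, Mapping

# Hypothetical sample data standing in for four GPUs.
SAMPLE = [
    {"index": 0, "free_mib": 2000, "util_pct": 95},
    {"index": 1, "free_mib": 12000, "util_pct": 10},
    {"index": 2, "free_mib": 20000, "util_pct": 5},
    {"index": 3, "free_mib": 30000, "util_pct": 80},
]


def pick_devices(
    snapshots: Iterable[Mapping[str, float]],
    min_free_mib: float,
    max_util_pct: float,
) -> List[int]:
    """Return indices of GPUs passing both thresholds, most free memory first."""
    eligible = [
        s for s in snapshots
        if s["free_mib"] >= min_free_mib and s["util_pct"] <= max_util_pct
    ]
    eligible.sort(key=lambda s: s["free_mib"], reverse=True)
    return [int(s["index"]) for s in eligible]


def read_snapshots() -> List[dict]:
    """Collect per-GPU stats via nvitop (assumed API; needs an NVIDIA driver)."""
    from nvitop import Device  # lazy import so pick_devices runs without a GPU

    return [
        {
            "index": device.index,
            "free_mib": device.memory_free() / (1 << 20),  # assumed to be bytes
            "util_pct": device.gpu_utilization(),
        }
        for device in Device.all()
    ]
```

With the sample data, pick_devices(SAMPLE, min_free_mib=8000, max_util_pct=20) keeps GPUs 2 and 1. On a real machine you would pass read_snapshots() instead and join the result into CUDA_VISIBLE_DEVICES before importing your framework.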
Windows support exists, but curses on Windows (via PDCurses/windows-curses) is notoriously inconsistent, and the docs acknowledge that mouse support may not work, so the 'portable' claim has an asterisk. The library API has no async support: collect_in_background uses threads, which is fine until you're in an asyncio-based training loop and would rather await a coroutine than manage a background thread. The nvitop-exporter is a separate sub-package with its own pyproject.toml and setup.py, so the two versions can drift apart if you're pinning both. MIG (Multi-Instance GPU) support is mentioned, but the output falls back to UUID strings for identification, which breaks scripts that expect integer device indices.
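One possible workaround for the missing async support is to push each blocking query onto a worker thread with the standard library's asyncio.to_thread (Python 3.9+). The sketch below uses only the standard library; poll_in_thread and fake_sample are hypothetical names, and fake_sample stands in for whatever blocking GPU query you would actually call.

```python
import asyncio
from typing import Any, Callable, List


async def poll_in_thread(
    sample: Callable[[], Any],
    count: int,
    interval: float,
) -> List[Any]:
    """Call a blocking sampler `count` times without blocking the event loop."""
    results = []
    for _ in range(count):
        # Run the blocking call in a worker thread and await its result.
        results.append(await asyncio.to_thread(sample))
        # Yield to other tasks between samples.
        await asyncio.sleep(interval)
    return results


def fake_sample() -> dict:
    """Hypothetical stand-in for a blocking GPU query."""
    return {"util_pct": 42}
```

Inside an existing event loop you would await poll_in_thread(...) directly or schedule it with asyncio.create_task. This still uses threads under the hood, so it only hides the threading behind an awaitable and does not remove it.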