// the find

NVIDIA-NeMo/Curator

★ 1,613 · Python · Apache-2.0 · updated Jun 2026

Scalable data preprocessing and curation toolkit for LLMs

NeMo Curator is NVIDIA's GPU-accelerated data pipeline toolkit for building LLM training datasets — text, image, video, and audio. It's the same tooling used internally for Nemotron, so the scale claims are credible. Aimed at ML engineers running pre-training or fine-tuning pipelines who need something between 'ad-hoc scripts' and 'build your own distributed system'.

GPU-accelerated fuzzy deduplication via RAPIDS cuDF/cuML is the real differentiator — the 16x speedup over CPU alternatives on RedPajama v2 is independently verifiable and addresses a genuine bottleneck in LLM data prep. The pipeline abstraction (stages declare resource requirements, executor auto-scales replicas) handles CPU/GPU work overlap cleanly and avoids the usual hand-tuned parallelism mess. Multi-modality in one framework means you're not stitching together five different tools for a multimodal training run. The Nemotron-CC recipe is a full reproducible end-to-end pipeline from Common Crawl to dataset, which is worth more than most tutorials.
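To make the "stages declare resources, executor scales replicas" idea concrete, here is a toy sketch in plain Python. It is not Curator's API: `Resources`, `Stage`, `plan_replicas`, and `run` are hypothetical names invented for illustration, and the round-robin allocator is a deliberately naive stand-in for whatever a real executor does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Resources:
    """Per-replica resource request (hypothetical, not a Curator class)."""
    cpus: float = 1.0
    gpus: float = 0.0


@dataclass(frozen=True)
class Stage:
    """A named processing step that declares what one replica needs."""
    name: str
    resources: Resources
    fn: Callable[[str], str]


def plan_replicas(stages: List[Stage], total_cpus: float, total_gpus: float) -> Dict[str, int]:
    """Hand out replicas round-robin until no stage fits in the remaining pool."""
    for s in stages:
        if s.resources.cpus <= 0 and s.resources.gpus <= 0:
            raise ValueError(f"stage {s.name!r} must request some resource")
    free_cpus, free_gpus = total_cpus, total_gpus
    plan = {s.name: 0 for s in stages}
    progressed = True
    while progressed:
        progressed = False
        for s in stages:
            if s.resources.cpus <= free_cpus and s.resources.gpus <= free_gpus:
                free_cpus -= s.resources.cpus
                free_gpus -= s.resources.gpus
                plan[s.name] += 1
                progressed = True
    return plan


def run(stages: List[Stage], items: Iterable[str]) -> List[str]:
    """Apply every stage in order to each item (serially; no real scheduling)."""
    out = []
    for item in items:
        for s in stages:
            item = s.fn(item)
        out.append(item)
    return out


# Illustrative stages: two CPU-only steps and one that also needs a GPU.
stages = [
    Stage("download", Resources(cpus=1), str.strip),
    Stage("filter", Resources(cpus=2), str.lower),
    Stage("embed", Resources(cpus=1, gpus=1), lambda text: f"vec({text})"),
]

plan = plan_replicas(stages, total_cpus=10, total_gpus=2)
results = run(stages, ["  Hello ", "WORLD"])
```

With 10 CPUs and 2 GPUs, the GPU stage is capped at two replicas while the cheap CPU stage absorbs the leftover cores. The point is only that the stage author writes down requirements and never picks a replica count by hand.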

Hard NVIDIA GPU dependency for the interesting parts — CPU mode exists but you're essentially just using Dask with extra steps. The XennaExecutor (Cosmos-Xenna) is the 'production default' but feels proprietary and under-documented compared to the Ray backends, which creates lock-in risk. The architecture pivot from Ray-first to Xenna-first happened in early 2026 and the migration docs acknowledge breaking changes; adopters from a year ago are looking at non-trivial rewrites. Video and audio pipelines officially require the NGC Docker container because of codec dependencies, which means your CI and local dev environment need Docker just to run a subset of the stages.
