// the find

datajuicer/data-juicer

★ 6,526 · Python · Apache-2.0 · updated Jun 2026

Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Data-Juicer is a Python data processing framework for ML dataset preparation, covering the full spectrum from raw text cleaning to multimodal (image/video/audio) curation for pre-training, fine-tuning, and synthetic data generation. It's backed by Alibaba, has real production use, and scales via Ray from a laptop to thousand-node clusters. The target audience is ML engineers building or cleaning datasets for foundation model training.

The operator library is genuinely impressive — 200+ composable ops covering text, image, audio, video, and multimodal pipelines, each independently usable or chainable via YAML recipes that are version-controllable and reproducible. The Ray integration is first-class, not bolted on: there are Ray-native deduplicators, partitioned executors with fault tolerance, and Ray+vLLM inference pipelines, so the same pipeline code runs locally or distributed without rewriting. OP fusion (2-10x speedup by merging compatible operators into single passes) is a concrete performance win that most similar frameworks ignore. The NeurIPS '25 Spotlight paper and active release cadence (v1.5.2 in May 2026) signal this is maintained with actual research rigor behind it.
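To make the OP fusion idea concrete, here is a minimal, library-free sketch of what merging compatible operators into a single pass can buy. It is not Data-Juicer's implementation: the filter names (`min_words`, `max_repeat_ratio`), the whitespace tokenizer, and the sample data are all hypothetical, chosen only to show that two filters sharing an intermediate result (the split words) can run in one pass with less repeated work.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, str]
# Hypothetical filter signature: the sample plus its already-split words.
WordFilter = Callable[[Sample, List[str]], bool]


def min_words(n: int) -> WordFilter:
    """Keep samples with at least n whitespace-separated words."""
    return lambda sample, words: len(words) >= n


def max_repeat_ratio(limit: float) -> WordFilter:
    """Keep samples whose share of repeated words is at most `limit`."""
    def keep(sample: Sample, words: List[str]) -> bool:
        if not words:
            return False
        repeats = len(words) - len(set(words))
        return repeats / len(words) <= limit
    return keep


def run_unfused(samples: List[Sample], filters: List[WordFilter]) -> Tuple[List[Sample], int]:
    """One pass over the data per filter; every pass re-tokenizes."""
    tokenizations = 0
    current = list(samples)
    for keep in filters:
        kept = []
        for sample in current:
            words = sample["text"].split()
            tokenizations += 1
            if keep(sample, words):
                kept.append(sample)
        current = kept
    return current, tokenizations


def run_fused(samples: List[Sample], filters: List[WordFilter]) -> Tuple[List[Sample], int]:
    """A single pass: tokenize once per sample, then apply every filter."""
    tokenizations = 0
    kept = []
    for sample in samples:
        words = sample["text"].split()
        tokenizations += 1
        if all(keep(sample, words) for keep in filters):
            kept.append(sample)
    return kept, tokenizations


SAMPLES = [
    {"text": "the quick brown fox jumps over the lazy dog"},
    {"text": "spam spam spam spam spam spam"},
    {"text": "too short"},
]
FILTERS = [min_words(3), max_repeat_ratio(0.5)]

if __name__ == "__main__":
    unfused, n_unfused = run_unfused(SAMPLES, FILTERS)
    fused, n_fused = run_fused(SAMPLES, FILTERS)
    print(unfused == fused, n_unfused, n_fused)
```

Both paths keep the same samples; the fused path simply does the shared tokenization once per sample instead of once per filter. With a toy tokenizer the saving is small, but it suggests why fusion would matter more when the shared step is expensive.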

The dependency footprint is massive and historically painful: even after v1.5.2 moved Ray, spaCy, and av to optional extras, a full install pulls in heavy CV/ML deps (YOLO, SAM2, MMPose, DWPose) that are hard to pin and frequently break on non-Linux platforms, so Windows users will run into trouble quickly. The YAML config system is powerful but underdocumented for complex cases; config_all.yaml is effectively the reference, which means you end up reading source to understand behavior. The LLM operators (`llm_*`) were renamed in v1.5.2, so community recipes and Stack Overflow answers written against the older operator names will reference stale APIs without any obvious warning. There's no built-in lineage tracking across pipeline runs: you can trace changed samples within a run, but there's no way to audit which version of a recipe produced a given dataset artifact without rolling your own.
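For the lineage gap, "rolling your own" can be fairly small. The sketch below uses only the Python standard library and no Data-Juicer APIs: it hashes the recipe file and the exported dataset and writes a sidecar manifest next to the artifact. The function names, the manifest fields, and the `.lineage.json` suffix are assumptions made for illustration, not a convention the project defines.

```python
import hashlib
import json
import time
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Stream a file through SHA-256 so large datasets fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_lineage_manifest(recipe_path: Path, dataset_path: Path, tool_version: str) -> Path:
    """Record which recipe bytes produced which dataset bytes.

    `tool_version` is whatever version string the caller chooses to record
    (hypothetical field; it is not read from any library here).
    """
    manifest = {
        "recipe_file": recipe_path.name,
        "recipe_sha256": file_sha256(recipe_path),
        "dataset_file": dataset_path.name,
        "dataset_sha256": file_sha256(dataset_path),
        "tool_version": tool_version,
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    # Hypothetical naming scheme: sidecar file next to the dataset artifact.
    manifest_path = dataset_path.with_name(dataset_path.name + ".lineage.json")
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest_path


def produced_by(recipe_path: Path, manifest_path: Path) -> bool:
    """Check whether this exact recipe file matches the recorded hash."""
    manifest = json.loads(manifest_path.read_text())
    return manifest["recipe_sha256"] == file_sha256(recipe_path)
```

Calling `write_lineage_manifest` right after an export and committing the manifest alongside the recipe would make "which recipe version produced this file?" a hash comparison rather than guesswork. It does not capture dependency versions or input data, which a fuller lineage scheme would also need.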
