// the find
huggingface/trl
Train transformer language models with reinforcement learning.
TRL is Hugging Face's library for post-training LLMs, covering supervised fine-tuning (SFT) alongside preference-optimization and RL methods such as DPO, GRPO, PPO, and KTO, plus a growing list of others. It's the de facto standard for preference training or RL-based fine-tuning on top of Hugging Face models, used by everyone from researchers replicating DeepSeek-R1-style training to practitioners doing basic instruction tuning.
- Algorithm breadth is genuinely impressive — DPO, GRPO, RLOO, PPO, KTO, CPO, ORPO, online DPO, reward modeling, PRM, and more, all with consistent Trainer-style APIs that share the same distributed training plumbing.
- vLLM integration for generation during GRPO/online methods is a meaningful engineering win — offloading generation to a separate vLLM server dramatically speeds up the RL training loop without custom infrastructure.
- PEFT/LoRA/QLoRA support is first-class throughout, not bolted on, which means you can run most algorithms on consumer hardware with quantized models.
- The test suite is substantial, including invariant tests that check training numerics against reference outputs, which is unusual for ML libraries and catches silent regressions in loss calculations.
- The experimental namespace is a dumping ground: A2PO, GSPO, DPPO, SSD, SDFT, SDPO, GolD, and MiniLLM trainers all live there with no stability guarantees, making it hard to tell which methods are production-ready and which are research sketches that might disappear.
- GRPO memory requirements and stability with large models are still painful in practice; the library provides the knobs (DeepSpeed ZeRO3, gradient checkpointing) but offers minimal guidance on which combinations actually work for 70B+ scale, and users regularly hit OOM or divergence with default settings.
- The reward function API for GRPO is simple but limited: a custom reward is a plain Python callable that returns one score per completion, so rewards that need their own batched inference (e.g., scoring with a separate reward model) leave you to manage batching and device placement yourself inside that callable.
- The sheer number of trainers (30+) means maintenance is thin for anything outside the core SFT/DPO/GRPO trio — older trainers like PPO and KTO have seen minimal updates and carry compatibility debt with newer transformers versions.
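
To make the reward-function complaint concrete, here is a minimal sketch of the batching you end up writing yourself. It assumes the trainer hands the callable the full list of completions (as plain strings) and expects one float back per completion; that calling convention is an assumption here and should be checked against the current TRL docs. `make_batched_reward` and `toy_score_batch` are hypothetical names, and the toy scorer merely stands in for a real reward model.

```python
from typing import Callable, Sequence


def make_batched_reward(
    score_batch: Callable[[Sequence[str]], Sequence[float]],
    batch_size: int = 8,
) -> Callable[..., list[float]]:
    """Wrap a batch scorer as a callable that returns one float per completion.

    Assumption: the trainer passes `completions` as a list of strings and
    ignores nothing else we need; extra keyword arguments are accepted and unused.
    """

    def reward_func(completions: Sequence[str], **kwargs) -> list[float]:
        rewards: list[float] = []
        # Walk the completions in fixed-size chunks so the scorer never
        # sees more than `batch_size` texts at once.
        for start in range(0, len(completions), batch_size):
            chunk = completions[start:start + batch_size]
            scores = score_batch(chunk)
            if len(scores) != len(chunk):
                raise ValueError("score_batch must return one score per completion")
            rewards.extend(float(s) for s in scores)
        return rewards

    return reward_func


def toy_score_batch(texts: Sequence[str]) -> list[float]:
    # Hypothetical stand-in for a reward model: prefers ~20-character outputs.
    return [-abs(len(t) - 20) for t in texts]


length_reward = make_batched_reward(toy_score_batch, batch_size=2)
```

In a real setup the returned callable would be what you hand to the trainer as a reward function, with `score_batch` wrapping a tokenizer and model forward pass. The chunk size, device placement, and any caching remain your responsibility, which is the friction the bullet above describes.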