// the find

bigscience-workshop/petals

★ 10,193 · Python · MIT · updated Sep 2024

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading

Petals lets you run and fine-tune massive LLMs (Llama 3.1 405B, Mixtral, Falcon) by splitting their layers across volunteer machines in a peer-to-peer swarm, BitTorrent-style. You host a slice of the model's transformer blocks on your GPU, other peers host the rest, and the whole thing surfaces as a standard HuggingFace model. It's aimed at researchers and hobbyists who want to work with models that won't fit on any hardware they can afford.
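For a sense of what "surfaces as a standard HuggingFace model" means on the client side, here is a minimal sketch modeled on the usage pattern in the project's README. The wrapper function `generate_with_petals` and its parameters are hypothetical names for this example; `model_name` has to be a model the swarm is actually serving when you call it, and the call needs `petals` installed plus network access.

```python
def generate_with_petals(model_name: str, prompt: str, max_new_tokens: int = 5) -> str:
    """Generate a short continuation through a Petals swarm (illustrative wrapper)."""
    # Imported lazily so this file can be loaded on a machine without petals.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Connects to the swarm; the transformer blocks themselves run on remote peers.
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])
```

Apart from the class name, nothing here differs from ordinary local `transformers` code, which is the appeal.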

The API is genuinely clever: you get a `model.generate()` that looks like a local HuggingFace model but is routing activations across the internet in real time. Prompt tuning works end-to-end over the distributed layer graph, which is not a trivial engineering problem. The routing layer (`sequence_manager.py`) handles dynamic peer availability, so the swarm can lose nodes mid-inference and recover. There's a real academic paper (ACL 2023) with benchmarks, not just marketing claims.
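To make the "lose nodes mid-inference and recover" idea concrete, here is a toy, single-process sketch of failover routing. It is not Petals' code and does not reflect how `sequence_manager.py` is written: `Peer`, `run_blocks`, and the fake "block" arithmetic are all invented for illustration, and it ignores everything that makes the real problem hard (network latency, per-session state, load balancing).

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """A hypothetical server hosting blocks [start, end) of the model."""
    name: str
    start: int
    end: int
    online: bool = True

    def forward(self, x: float, lo: int, hi: int) -> float:
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        # Stand-in for real transformer blocks: block i adds (i + 1) to x.
        return x + sum(range(lo + 1, hi + 1))


def run_blocks(peers, n_blocks, x):
    """Push x through blocks 0..n_blocks-1, rerouting when a peer fails."""
    banned = set()  # peers that failed during this run
    route = []      # (peer name, first block, end block) for each hop taken
    block = 0
    while block < n_blocks:
        candidates = [
            p for p in peers
            if p.name not in banned and p.start <= block < p.end
        ]
        if not candidates:
            raise RuntimeError(f"no reachable peer hosts block {block}")
        # Greedy choice: the peer that covers the most blocks from here on.
        peer = max(candidates, key=lambda p: p.end)
        hi = min(peer.end, n_blocks)
        try:
            x = peer.forward(x, block, hi)
        except ConnectionError:
            banned.add(peer.name)  # drop the peer and retry the same block
            continue
        route.append((peer.name, block, hi))
        block = hi
    return x, route


peers = [
    Peer("a", 0, 4),
    Peer("b", 4, 8, online=False),  # looks ideal on paper, fails when called
    Peer("c", 2, 6),
    Peer("d", 6, 8),
]
output, route = run_blocks(peers, n_blocks=8, x=0.0)
print(output, route)
```

In this run the client first tries peer `b` for blocks 4 to 8, hits the connection error, and finishes the pass through `c` and `d` instead. The result is the same as if every block had run on one machine.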

The last commit was in September 2024, and the health of the public swarm depends entirely on volunteers: if nobody is hosting your model's layers right now, you're stuck. Latency over the internet means 4-6 tokens/sec best case, so a 500-token reply takes roughly 80 to 125 seconds, which is fine for async tasks but frustrating interactively. Privacy is a real problem they acknowledge: intermediate activations pass through strangers' GPUs, and the mitigation advice is basically 'use a private swarm', which defeats the point. The project also predates most of the current open-weight model landscape (Qwen, Gemma, DeepSeek) and has no support for those models.
