// the find

databrickslabs/dolly

★ 10,793 · Python · Apache-2.0 · updated Jun 2023

Databricks’ Dolly, a large language model trained on the Databricks Machine Learning Platform

Dolly is Databricks' 2023 attempt to produce a commercially licensable instruction-following LLM by fine-tuning EleutherAI's Pythia-12b on 15k human-written instruction pairs. It was historically significant as one of the first openly licensed models you could actually ship in a product. It is now a museum piece: the last commit was in June 2023, and every meaningful open-weights model released since then beats it.

The databricks-dolly-15k dataset (CC-BY-SA) was genuinely useful and was widely used by the community to fine-tune other models. The training code is clean and readable, and the per-GPU-family DeepSpeed configs are a practical touch. Commercial licensing was a real differentiator at the time. The README is honest about limitations, which is rare.

Dead project: no commits in 3 years, and the LLM landscape has moved so far past it that adopting it makes no sense. The README's own list of things it can't do (math, dates, enumeration, factual recall) covers most of what instruction-following is actually for. Training the full 12B model as documented takes a node with 8 A100s, which is out of reach for most individuals and small teams. The fine-tuning corpus came entirely from Databricks employees, so the model's priors are narrow and demographically skewed.
