// the find

spotify/klio

★ 870 · Python · Apache-2.0 · updated Jan 2024

Smarter data pipelines for audio.

Klio is the framework Spotify built, and later open-sourced, for running audio and binary file processing pipelines at scale on top of Apache Beam. It wraps Beam with opinionated conventions around GCP (Pub/Sub, GCS, Dataflow), adds audio-specific transforms via librosa/soundfile, and provides a CLI for job scaffolding and deployment. If you're processing millions of audio files on GCP and don't want to wire up Beam yourself, this is Spotify's answer to that problem.

The GCS skip-and-resume pattern is the most useful thing here: if a file has already been processed and the output exists in GCS, Klio skips it automatically, which means re-runs after partial failures don't reprocess everything. The protobuf-based message format (KlioMessage wrapping an entity_id) gives you a consistent contract across jobs in a pipeline graph, so chaining jobs via Pub/Sub actually works cleanly. The CLI scaffolding (`klio job create`) generates a working Dockerfile, klio-job.yaml, and transform stubs that are immediately runnable on Dataflow — the boilerplate you'd otherwise spend a day writing. Built-in profiling support (`klio job profile`) for memory and CPU per-transform is genuinely useful when you're debugging a slow step in a long pipeline.
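To make the skip-and-resume idea concrete, here is a minimal sketch in plain Python. It is not Klio's API: `Message`, `output_path`, and `run_job` are hypothetical names, and an in-memory set stands in for the listing of objects already present in a GCS bucket. It only illustrates the general pattern of checking for existing output before doing the work.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for a message that carries only an entity ID."""
    entity_id: str


def output_path(entity_id: str) -> str:
    # Hypothetical naming convention: one output object per entity ID.
    return f"output/{entity_id}.npy"


def run_job(
    messages: Iterable[Message],
    existing_outputs: Set[str],
    process: Callable[[str], None],
) -> List[str]:
    """Process each message unless its output already exists.

    `existing_outputs` is an assumption for illustration: a set of paths
    playing the role of the output bucket. Returns the entity IDs that
    were actually processed on this run.
    """
    processed = []
    for msg in messages:
        path = output_path(msg.entity_id)
        if path in existing_outputs:
            # Output is already there, so a re-run skips the work.
            continue
        process(msg.entity_id)
        existing_outputs.add(path)  # simulate writing the output
        processed.append(msg.entity_id)
    return processed
```

Running `run_job` twice over the same messages with the same store does the work once; the second call returns an empty list. That is the property that makes re-runs after a partial failure cheap.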

Last commit was January 2024 and the pace before that was already slowing — this is effectively in maintenance mode, which matters a lot if you hit a bug in the Beam version pinning or need Python 3.12+ support. The GCP lock-in is total: Pub/Sub for events, GCS for data, Dataflow for runners — running this on anything else (Flink, local Spark, AWS) requires rewriting the I/O layer yourself, which defeats the point. The monorepo with five separately-versioned packages (klio, klio-cli, klio-exec, klio-audio, klio-core) creates real dependency hell during upgrades; the changelog shows versions drifting out of sync in practice. Documentation is thorough but assumes you're already comfortable with Apache Beam's programming model — if you're not, you'll hit the Beam learning curve first and Klio's conventions second.
