// the find

ankurchavda/streamify

★ 880 · Python · updated Apr 2022

A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

A portfolio/learning project that wires together Kafka, Spark Streaming, Airflow, dbt, and BigQuery to build a simulated music-streaming analytics pipeline on GCP. It exists primarily as a capstone project for the DataTalksClub data engineering course, not as production software. Good for people who want a reference architecture to study or a starting point to adapt.

The end-to-end architecture is coherent: events flow Kafka → Spark → GCS → BigQuery → dbt → dashboard, and each handoff is documented. Terraform provisions the GCP infra cleanly, so you can spin it up and tear it down without manual console clicks. The dbt models are properly structured with dims and a fact table rather than a single denormalized dump. The YouTube walkthrough plus per-component setup docs in /setup/ make it easier to follow than most similar projects.
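To make the dims-plus-fact point concrete, here is a minimal Python sketch of splitting flat listen events into two dimension tables and one fact table. It is not taken from the repo: the field names (`user_id`, `user_name`, `song_id`, `song_title`, `ts`) and the function `split_star_schema` are hypothetical, and the real project does this work in dbt SQL models.

```python
from typing import Dict, List, Tuple

# Hypothetical flat events; field names are illustrative, not the repo's schema.
EVENTS: List[dict] = [
    {"user_id": 1, "user_name": "ana", "song_id": 10, "song_title": "Intro", "ts": 1000},
    {"user_id": 2, "user_name": "bo", "song_id": 10, "song_title": "Intro", "ts": 1005},
    {"user_id": 1, "user_name": "ana", "song_id": 11, "song_title": "Outro", "ts": 1010},
]


def split_star_schema(
    events: List[dict],
) -> Tuple[Dict[int, str], Dict[int, str], List[Tuple[int, int, int]]]:
    """Split denormalized events into two dimensions and one fact table."""
    dim_users: Dict[int, str] = {}
    dim_songs: Dict[int, str] = {}
    fact_listens: List[Tuple[int, int, int]] = []
    for e in events:
        # Descriptive attributes are stored once per key in a dimension.
        dim_users[e["user_id"]] = e["user_name"]
        dim_songs[e["song_id"]] = e["song_title"]
        # The fact row keeps only the keys and the event timestamp.
        fact_listens.append((e["user_id"], e["song_id"], e["ts"]))
    return dim_users, dim_songs, fact_listens


dim_users, dim_songs, fact_listens = split_star_schema(EVENTS)
```

The fact table keeps one narrow row per event, while names and titles live once in the dimensions. That is the property a single denormalized dump gives up.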

The last push was in April 2022. Spark Streaming, Airflow, and dbt have all had significant releases since then, so the pinned versions are stale. The dbt tests/ directory is empty, and the README's own TODO list calls this out, so data quality is entirely untested. All dbt models do a full refresh, so each run reprocesses the whole dataset and run time grows linearly with data volume. This is fine for a toy dataset but will not hold up at real scale. The Data Studio (now Looker Studio) dashboards are not included in the repo, so the 'final result' screenshot is as far as you can get without rebuilding them yourself.
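The full-refresh cost can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not code from the repo: `simulate` is a hypothetical helper, data arrives in equal daily batches, and "cost" simply counts rows written per run. In dbt itself, the usual alternative is a model configured with `materialized='incremental'` and a filter guarded by `is_incremental()`.

```python
from typing import List, Tuple


def simulate(days: int, rows_per_day: int) -> Tuple[int, int]:
    """Return cumulative rows written by full-refresh vs incremental runs.

    Toy model: one pipeline run per day, equal batch sizes (an assumption).
    """
    source: List[Tuple[int, int]] = []  # (day, row_number) pairs
    target: List[Tuple[int, int]] = []  # table maintained incrementally
    full_cost = 0
    incr_cost = 0
    watermark = -1  # latest day already loaded into target

    for day in range(days):
        source.extend((day, i) for i in range(rows_per_day))

        # Full refresh: every run rewrites the entire history.
        full_cost += len(source)

        # Incremental: only rows newer than the watermark are written.
        new_rows = [r for r in source if r[0] > watermark]
        target.extend(new_rows)
        incr_cost += len(new_rows)
        watermark = day

    # Both strategies must end with the same table contents.
    assert target == source
    return full_cost, incr_cost


full_cost, incr_cost = simulate(days=30, rows_per_day=1000)
```

Each full-refresh run is linear in the total history, so the cumulative work over many runs grows quadratically, while the incremental path stays proportional to the new data.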
