// the find
ssp-data/practical-data-engineering
Practical Data Engineering: A Hands-On Real-Estate Project Guide
A learning project that walks through a complete data engineering pipeline using real estate listings as the domain. Covers scraping, lakehouse storage with Delta Lake, orchestration via Dagster, and visualization with Superset. Aimed at people who want a concrete, end-to-end example rather than isolated tutorials.
The stack choices have aged well: Dagster, Delta Lake, and MinIO are all still production-relevant in 2024, which is rare for a project started in 2020. The author replaced PySpark with delta-rs after admitting Spark was painful to set up locally; that kind of honest revision is useful to see. The CDC and UPSERT patterns via Delta Lake are the most instructive part, since these are the concepts that trip up most data engineering beginners. The project structure mirrors what a real pipeline looks like: separate scraping, processing, and visualization layers rather than one giant notebook.
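For readers new to the UPSERT idea mentioned above, here is a minimal sketch of the semantics in plain Python. It is not the repo's code and does not use the Delta Lake or delta-rs API. The `listing_id` key, the `price` field, and the `fingerprint` and `upsert` helpers are all hypothetical names chosen for illustration. The in-memory dict stands in for a table, and hashing the non-key fields is one common way to detect changed rows, assumed here rather than taken from the project.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Hash the non-key fields so a changed listing can be detected cheaply."""
    payload = {k: v for k, v in record.items() if k != "listing_id"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def upsert(table: dict, batch: list) -> dict:
    """Merge a scraped batch into `table`, keyed by the hypothetical listing_id.

    New keys are inserted, keys whose fingerprint changed are overwritten,
    and identical rows are skipped. Returns a count of each outcome.
    """
    stats = {"inserted": 0, "updated": 0, "unchanged": 0}
    for record in batch:
        key = record["listing_id"]
        new_fp = fingerprint(record)
        existing = table.get(key)
        if existing is None:
            stats["inserted"] += 1
        elif existing["fingerprint"] != new_fp:
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1
            continue  # nothing to write for an identical row
        table[key] = {**record, "fingerprint": new_fp}
    return stats


# Two scrape runs: the second changes one price and adds one listing.
table = {}
first = upsert(table, [
    {"listing_id": "a1", "price": 500000},
    {"listing_id": "b2", "price": 750000},
])
second = upsert(table, [
    {"listing_id": "a1", "price": 480000},
    {"listing_id": "b2", "price": 750000},
    {"listing_id": "c3", "price": 620000},
])
print(first)
print(second)
```

In a lakehouse the same match-then-update-or-insert decision is expressed as a merge against the table rather than a Python loop, but the three outcomes per row are the part worth internalizing first.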
The Kubernetes deployment angle is mostly aspirational: the quick-start drops you into local MinIO and Dagster dev mode, and there are no actual K8s manifests in this repo (they're in a separate devops repo with a link). The scraper is tightly coupled to one German real estate site; anyone outside that region gets a learning exercise with no usable data. Test coverage is thin: the test files exist, but they're more exploratory scripts than actual assertions. Apache Druid is still in the architecture diagram, but the migration notes are sparse on what replacing it means for someone following along.